Term extraction with Europarl as an example
The EUROPARL corpus was used to test the bilingual term extraction.
Procedure
The EUROPARL corpus was downloaded from the Europarl Corpus Website.
For Europarl see: Philipp Koehn, "Europarl: A Multilingual Corpus for Evaluation of Machine Translation", draft, unpublished.
Next, the German-English corpus was converted into TMX files from the sentence-aligned documents using Araya's QuickAlign tool, which essentially just reads the corresponding DE and EN files and creates tu elements from the segments. To test the stability of the extraction, 10 EN-DE TMX files of similar size (about 66562 entries each) were created. Four of these files were chosen and a term extraction was run on each one individually. The four resulting extraction files were then merged into a single file, giving 468 extracted term pairs overall. A batch file was written to run the extractions.
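The conversion step described above can be sketched in a few lines of Python. This is not QuickAlign's code, only a minimal illustration that assumes the aligned DE and EN files hold one segment per line. The function names `build_tmx` and `split_evenly` and the header attribute values are made up for this example.

```python
import xml.etree.ElementTree as ET

# Attribute name that ElementTree serialises as xml:lang
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(de_segments, en_segments, srclang="de"):
    """Pair sentence-aligned segments one by one into TMX <tu> elements."""
    if len(de_segments) != len(en_segments):
        raise ValueError("aligned inputs must have the same number of segments")
    tmx = ET.Element("tmx", version="1.4")
    # Header values are placeholders chosen for this sketch
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "plain", "adminlang": "en",
        "srclang": srclang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for de, en in zip(de_segments, en_segments):
        de, en = de.strip(), en.strip()
        if not de or not en:
            continue  # skip pairs where one side is empty
        tu = ET.SubElement(body, "tu")
        for lang, text in (("de", de), ("en", en)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.ElementTree(tmx)


def split_evenly(items, parts):
    """Split a list into `parts` consecutive chunks of similar size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

A tree built this way can be saved with `tree.write(path, encoding="utf-8", xml_declaration=True)`. Calling `split_evenly` on the aligned segment pairs with `parts=10` would give ten chunks of similar size, comparable to the ten TMX files mentioned above.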
The following parameters were used:
- only one word terms
- maximum of two translations per term
- minimum frequency 70
- maximum frequency 500
- English terms were lowercased, German terms were left unchanged
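To make the meaning of these parameters concrete, the sketch below expresses them as a small settings object with a filter function. It does not show how the Araya extractor works internally. In particular, it is only an assumption that the frequency limits are applied to a single term's frequency. The names `Settings`, `keep_candidate` and `normalise` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the parameter list above
    min_words: int = 1
    max_words: int = 1
    max_translations: int = 2
    min_freq: int = 70
    max_freq: int = 500
    lower_source: bool = False  # German terms left unchanged
    lower_target: bool = True   # English terms lowercased


def keep_candidate(term, freq, settings):
    """Return True if a term passes the length and frequency limits."""
    n_words = len(term.split())
    return (settings.min_words <= n_words <= settings.max_words
            and settings.min_freq <= freq <= settings.max_freq)


def normalise(term, lower):
    """Lowercase a term only if the corresponding setting asks for it."""
    return term.lower() if lower else term
```

With the default settings, a one-word term with frequency 86 is kept, while a two-word term or a term with frequency above 500 is dropped.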
The full extraction file can be downloaded from:
Europarl Extraction
Further extractions with different language combinations
Note: Because of the way Europarl was translated, French terms may in some cases appear as both source and target terms.
First analysis
Further extractions, always with English as the target language, are underway. The settings used for those extractions are:
- only one and two word terms
- maximum of two translations per term
- minimum frequency 20
- maximum frequency 500
- all terms lowercased
Second analysis
- only one, two and three word terms
- maximum of three translations per term
- minimum frequency 3
- maximum frequency 500
- all terms lowercased
Language Pair | Number of Terms | Download 2007/2009 |
German (lowercased!) - English | 45761 / 101110 | DE-EN (2007) / DE-EN (2009) |
Danish - English | 102727 | DA-EN |
Swedish - English | 73768 | SV-EN |
Italian - English | 87777 | IT-EN |
French - English | 42618 | FR-EN |
Finnish - English | 35548 | FI-EN |
Dutch - English | 47545 | NL-EN |
Greek - English | 72485 | EL-EN |
Spanish - English | 74100 | ES-EN |
Next steps
Next, we will run several tests to check the stability of the extracted terms using different combinations of the TMX files.
It will be interesting to see how those terms can be used in phrase-translation-based tools or in statistical machine translation.
In addition, we will run the same tests with other languages.
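One possible way to quantify the stability mentioned above is to compare the sets of term pairs produced by different runs. The sketch below uses Jaccard overlap and a simple "seen in at least n runs" count. Both measures are suggestions rather than the method used in this project, and the toy runs in the usage lines are invented for illustration.

```python
from collections import Counter


def jaccard(run_a, run_b):
    """Share of term pairs common to both runs, relative to all pairs seen."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0  # two empty runs are trivially identical
    return len(a & b) / len(a | b)


def stable_pairs(runs, min_runs):
    """Return the term pairs that occur in at least `min_runs` of the runs."""
    counts = Counter(pair for run in runs for pair in set(run))
    return {pair for pair, n in counts.items() if n >= min_runs}


# Toy data: each run is a collection of (source term, target term) pairs
run_a = [("OSZE", "osce"), ("Nizza", "nice"), ("Montag", "monday")]
run_b = [("OSZE", "osce"), ("Nizza", "nice")]
run_c = [("OSZE", "osce")]

overlap = jaccard(run_a, run_b)
stable = stable_pairs([run_a, run_b, run_c], min_runs=2)
```

Pairs that survive across many file combinations are likely to be more reliable than pairs that only show up in a single run.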
Some examples of the extracted terms
The first 51 extracted terms (entries 0 to 50) are shown below. The full extraction file can be downloaded from:
Europarl Extraction
nr;score;status;term1.LangCode;term1.wordGroup;term1.wordGroupLen;term1.wFreq;term2.LangCode;term2.wordGroup;term2.wordGroupLen;term2.wFreq;sentLinked
0;0.99404764;unapproved;de;OSZE;1;86;en;osce;1;88;83
1;0.9935065;unapproved;de;CNS;1;78;en;cns;1;79;76
2;0.99025977;unapproved;de;Barcelona;1;157;en;barcelona;1;160;151
3;0.99014777;unapproved;de;Seattle;1;206;en;seattle;1;207;201
4;0.98876405;unapproved;de;ECHO;1;93;en;echo;1;94;88
5;0.988333;unapproved;de;OLAF;1;321;en;olaf;1;326;181
6;0.9860111;unapproved;de;Helsinki;1;529;en;helsinki;1;526;317
7;0.985;unapproved;de;LEADER;1;123;en;leader;1;119;97
8;0.9845339;unapproved;de;Bourlanges;1;207;en;bourlanges;1;207;92
9;0.982683;unapproved;de;Kyoto;1;116;en;kyoto;1;120;113
10;0.9812194;unapproved;de;Byrne;1;106;en;byrne;1;108;104
11;0.98017544;unapproved;de;Rambouillet;1;76;en;rambouillet;1;77;74
12;0.9788914;unapproved;de;Nizza;1;397;en;nice;1;398;236
13;0.9757222;unapproved;de;Feira;1;230;en;feira;1;228;221
14;0.9725971;unapproved;de;Tampere;1;336;en;tampere;1;335;319
15;0.9712733;unapproved;de;MEDA;1;143;en;meda;1;146;135
16;0.9693323;unapproved;de;INTERREG;1;120;en;interreg;1;118;110
17;0.96875;unapproved;de;Michelin;1;78;en;michelin;1;83;75
18;0.9651604;unapproved;de;Kosovo;1;537;en;kosovo;1;524;311
19;0.9625;unapproved;de;Roth;1;75;en;roth;1;81;74
20;0.96031624;unapproved;de;Straßburg;1;415;en;strasbourg;1;446;136
21;0.9596327;unapproved;de;Montag;1;160;en;monday;1;170;85
22;0.9557449;unapproved;de;Kinnock;1;274;en;kinnock;1;273;109
23;0.95454544;unapproved;de;URBAN;1;89;en;urban;1;90;84
24;0.95303506;unapproved;de;Patten;1;350;en;patten;1;334;125
25;0.9518263;unapproved;de;Prodi;1;901;en;prodi;1;940;408
26;0.9511972;unapproved;de;Saddam;1;81;en;saddam;1;85;77
27;0.95040244;unapproved;de;Barnier;1;72;en;barnier;1;71;67
28;0.94457006;unapproved;de;April;1;446;en;april;1;465;76
29;0.94290334;unapproved;de;Lissabon;1;378;en;lisbon;1;423;267
30;0.9420494;unapproved;de;Betrifft;1;533;en;subject;1;539;96
31;0.9417781;unapproved;de;H;1;580;en;h;1;574;100
32;0.93754536;unapproved;de;Schweiz;1;136;en;switzerland;1;123;110
33;0.93558306;unapproved;de;Türkei;1;1025;en;turkey;1;1041;269
34;0.9336505;unapproved;de;Marokko;1;243;en;morocco;1;261;145
35;0.933623;unapproved;de;Solana;1;171;en;solana;1;182;77
36;0.9334416;unapproved;de;Kroatien;1;81;en;croatia;1;87;75
37;0.93;unapproved;de;Mexiko;1;79;en;mexico;1;86;72
38;0.929566;unapproved;de;SOKRATES;1;70;en;socrates;1;79;69
39;0.92853904;unapproved;de;Juli;1;334;en;july;1;331;77
40;0.9265501;unapproved;de;TACIS;1;78;en;tacis;1;75;69
41;0.9264815;unapproved;de;Tschetschenien;1;336;en;chechnya;1;353;216
42;0.92510605;unapproved;de;Renault;1;144;en;renault;1;160;136
43;0.9224414;unapproved;de;Bosnien;1;190;en;bosnia;1;200;94
44;0.9221227;unapproved;de;Februar;1;433;en;february;1;447;88
45;0.9218446;unapproved;de;September;1;380;en;september;1;380;58
46;0.92183656;unapproved;de;Belgien;1;162;en;belgium;1;169;79
47;0.9208842;unapproved;de;Dänemark;1;264;en;denmark;1;283;69
48;0.91880286;unapproved;de;Mosambik;1;87;en;mozambique;1;101;81
49;0.91612965;unapproved;de;Serbien;1;192;en;serbia;1;221;75
50;0.9138126;unapproved;de;März;1;694;en;march;1;703;191
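The extraction file is a semicolon-separated text file with a header row, so it can be read with standard CSV tooling. The sketch below filters term pairs by score, using three of the rows listed above as sample input. The function name `read_term_pairs` is hypothetical. It is also an assumption that the file is UTF-8 encoded and that terms never contain a semicolon.

```python
import csv
import io

SAMPLE = "\n".join([
    "nr;score;status;term1.LangCode;term1.wordGroup;term1.wordGroupLen;"
    "term1.wFreq;term2.LangCode;term2.wordGroup;term2.wordGroupLen;"
    "term2.wFreq;sentLinked",
    "0;0.99404764;unapproved;de;OSZE;1;86;en;osce;1;88;83",
    "12;0.9788914;unapproved;de;Nizza;1;397;en;nice;1;398;236",
    "50;0.9138126;unapproved;de;März;1;694;en;march;1;703;191",
])


def read_term_pairs(handle, min_score=0.0):
    """Return (source term, target term, score) for rows at or above min_score."""
    reader = csv.DictReader(handle, delimiter=";")
    pairs = []
    for row in reader:
        score = float(row["score"])
        if score >= min_score:
            pairs.append((row["term1.wordGroup"], row["term2.wordGroup"], score))
    return pairs


pairs = read_term_pairs(io.StringIO(SAMPLE), min_score=0.95)
```

For a file on disk, the same function can be given a handle from `open(path, encoding="utf-8", newline="")`.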
Batch file for extractions
REM 1 directory
REM 2 source language
REM 3 target language
REM 4 lower phrase length limit
REM 5 upper phrase length limit
REM 6 minimum frequency
REM 7 maximum frequency
REM 8 maximum translations
REM 9 source term to lower (true/false)
REM 10 target term to lower (true/false)
REM Araya installation directory
set ARAYAPATH=c:\araya
SET DIRECTORY=%1
REM After SHIFT, %1..%9 refer to parameters 2..10 listed above
SHIFT
SET JARS=%ARAYAPATH%\lib\Win32\swt.jar;%ARAYAPATH%\lib\arayaserver.jar;%ARAYAPATH%\lib\external.jar
SET CALLING=java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -batch
set RESTARGS=%DIRECTORY% %1 %2 %3 %4 %5 %6 %7 %8 %9 ""
REM Run the extraction in batch mode
call %CALLING% %RESTARGS%
REM Merge the results into europarl-<source>-<target>.csv
call java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -merge "europarl-%1-%2.csv" "%DIRECTORY%"
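The ten positional parameters of the batch file are easy to mix up, so the sketch below maps named settings onto the order given in the REM lines. The function name `extraction_args` and the directory name `tmx` are hypothetical. It is also an assumption that the language codes are passed in lowercase, as they appear in the extraction file.

```python
def extraction_args(directory, source, target, min_words, max_words,
                    min_freq, max_freq, max_translations,
                    lower_source, lower_target):
    """Return the batch file arguments in the order of the REM lines."""
    def flag(value):
        # The batch file expects the literal strings true/false
        return "true" if value else "false"

    return [
        directory, source, target,
        str(min_words), str(max_words),
        str(min_freq), str(max_freq),
        str(max_translations),
        flag(lower_source), flag(lower_target),
    ]


# Settings of the first DE-EN extraction described at the top of this page
args = extraction_args("tmx", "de", "en", 1, 1, 70, 500, 2, False, True)
```

The resulting list can be appended to the name of the batch file when starting it, for example through `subprocess.run` on Windows.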