Heartsome Europe GmbH
 

Term extraction with Europarl as an example

The EUROPARL corpus was used to test the bilingual term extraction.

Procedure

The EUROPARL corpus was downloaded from the Europarl corpus website.

For Europarl, see: Philipp Koehn, "Europarl: A Multilingual Corpus for Evaluation of Machine Translation", draft, unpublished.

Next, the German - English corpus was converted into TMX files from the sentence-aligned documents using Araya's QuickAlign tool, which essentially just reads the corresponding DE and EN files and creates tu elements from the segments. To test the stability of the extraction, 10 EN-DE TMX files of similar size (about 66562 entries each) were created. Four of these files were chosen, and a term extraction was run on each one individually. This produced 4 separate extraction files, which were then merged into one file containing 468 extracted term pairs overall. A batch file was written to run the extractions.
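The conversion step described above can be illustrated with a minimal Python sketch. It is not the QuickAlign implementation; it only assumes that the DE and EN sides are available as two lists of already aligned sentences, and the function names build_tmx and split_evenly as well as the header attribute values are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute used on <tuv> elements.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(src_segments, tgt_segments, src_lang="de", tgt_lang="en"):
    """Pair up aligned segments and wrap each pair in a TMX <tu> element."""
    if len(src_segments) != len(tgt_segments):
        raise ValueError("aligned inputs must have the same number of segments")
    tmx = ET.Element("tmx", version="1.4")
    # Header values are placeholders for this sketch, not the tool's output.
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "plain", "adminlang": "en",
        "srclang": src_lang, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in zip(src_segments, tgt_segments):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # skip pairs where one side is empty
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx


def split_evenly(items, parts):
    """Split a list into `parts` consecutive chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


tmx = build_tmx(["Guten Morgen", "Danke"], ["Good morning", "Thank you"])
print(ET.tostring(tmx, encoding="unicode"))
```

Splitting the aligned sentence pairs with split_evenly before calling build_tmx on each chunk would give several TMX files of similar size, comparable to the 10 files used here.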

The following parameters have been used:

The full extraction file can be downloaded from: Europarl Extraction

Further extractions with different language combinations

Note: Because of the way Europarl was translated, French terms may in some cases appear as both source and target terms.

First analysis

Further extractions - English always as target - are underway. The settings used for those extractions are:

Language Pair        Term Number   Download 2007   Download 2009
German - English     3384          DE-EN           DE-EN
Danish - English     5109          DA-EN           DA-EN
Swedish - English    3310          SV-EN           SV-EN
Italian - English    5010          IT-EN           IT-EN
French - English     4359          FR-EN           FR-EN
Finnish - English    1315          FI-EN           FI-EN
Dutch - English      4886          NL-EN           NL-EN
Greek - English      4962          EL-EN           EL-EN
Spanish - English    6152          ES-EN           ES-EN

Second analysis

Language Pair                     Term Number      Download 2007/2009
German (lower cased!) - English   45761 / 101110   DE-EN (2007) / DE-EN (2009)
Danish - English                  102727           DA-EN
Swedish - English                 73768            SV-EN
Italian - English                 87777            IT-EN
French - English                  42618            FR-EN
Finnish - English                 35548            FI-EN
Dutch - English                   47545            NL-EN
Greek - English                   72485            EL-EN
Spanish - English                 74100            ES-EN

Next steps

Next, we will run several tests to check the stability of the extracted terms using different combinations of the TMX files.
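One possible way to quantify such stability is to compare the sets of term pairs produced by different runs. The sketch below uses the Jaccard overlap, which is an assumption of this example and not a measure taken from the extraction tool; the two runs are made-up inputs that reuse term pairs from the sample list, and the function names are hypothetical.

```python
from collections import Counter


def normalise(pairs):
    """Lower-case both sides so that case differences do not count as changes."""
    return {(src.lower(), tgt.lower()) for src, tgt in pairs}


def jaccard(run_a, run_b):
    """Share of term pairs common to both runs, relative to all pairs seen."""
    a, b = normalise(run_a), normalise(run_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairs_in_at_least(runs, min_runs):
    """Term pairs that were extracted in at least `min_runs` of the runs."""
    counts = Counter()
    for run in runs:
        counts.update(normalise(run))
    return {pair for pair, n in counts.items() if n >= min_runs}


# Illustrative runs only; real input would come from the extraction files.
run_1 = [("OSZE", "osce"), ("Kyoto", "kyoto"), ("Nizza", "nice")]
run_2 = [("OSZE", "osce"), ("Nizza", "nice"), ("Schweiz", "switzerland")]

print(jaccard(run_1, run_2))  # 2 shared pairs out of 4 distinct pairs: 0.5
print(sorted(pairs_in_at_least([run_1, run_2], 2)))
```

A value close to 1.0 across many file combinations would suggest that the extracted terms depend little on which part of the corpus was used.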

It will be interesting to see how those terms can be used for phrase-based translation tools or statistical machine translation.

In addition we will run the same tests with other languages.

Some examples of the extracted terms

The first 51 extracted terms (numbered 0 to 50) are shown below. The full extraction file can be downloaded from: Europarl Extraction

nr;score;status;term1.LangCode;term1.wordGroup;term1.wordGroupLen;term1.wFreq;term2.LangCode;term2.wordGroup;term2.wordGroupLen;term2.wFreq;sentLinked
0;0.99404764;unapproved;de;OSZE;1;86;en;osce;1;88;83
1;0.9935065;unapproved;de;CNS;1;78;en;cns;1;79;76
2;0.99025977;unapproved;de;Barcelona;1;157;en;barcelona;1;160;151
3;0.99014777;unapproved;de;Seattle;1;206;en;seattle;1;207;201
4;0.98876405;unapproved;de;ECHO;1;93;en;echo;1;94;88
5;0.988333;unapproved;de;OLAF;1;321;en;olaf;1;326;181
6;0.9860111;unapproved;de;Helsinki;1;529;en;helsinki;1;526;317
7;0.985;unapproved;de;LEADER;1;123;en;leader;1;119;97
8;0.9845339;unapproved;de;Bourlanges;1;207;en;bourlanges;1;207;92
9;0.982683;unapproved;de;Kyoto;1;116;en;kyoto;1;120;113
10;0.9812194;unapproved;de;Byrne;1;106;en;byrne;1;108;104
11;0.98017544;unapproved;de;Rambouillet;1;76;en;rambouillet;1;77;74
12;0.9788914;unapproved;de;Nizza;1;397;en;nice;1;398;236
13;0.9757222;unapproved;de;Feira;1;230;en;feira;1;228;221
14;0.9725971;unapproved;de;Tampere;1;336;en;tampere;1;335;319
15;0.9712733;unapproved;de;MEDA;1;143;en;meda;1;146;135
16;0.9693323;unapproved;de;INTERREG;1;120;en;interreg;1;118;110
17;0.96875;unapproved;de;Michelin;1;78;en;michelin;1;83;75
18;0.9651604;unapproved;de;Kosovo;1;537;en;kosovo;1;524;311
19;0.9625;unapproved;de;Roth;1;75;en;roth;1;81;74
20;0.96031624;unapproved;de;Straßburg;1;415;en;strasbourg;1;446;136
21;0.9596327;unapproved;de;Montag;1;160;en;monday;1;170;85
22;0.9557449;unapproved;de;Kinnock;1;274;en;kinnock;1;273;109
23;0.95454544;unapproved;de;URBAN;1;89;en;urban;1;90;84
24;0.95303506;unapproved;de;Patten;1;350;en;patten;1;334;125
25;0.9518263;unapproved;de;Prodi;1;901;en;prodi;1;940;408
26;0.9511972;unapproved;de;Saddam;1;81;en;saddam;1;85;77
27;0.95040244;unapproved;de;Barnier;1;72;en;barnier;1;71;67
28;0.94457006;unapproved;de;April;1;446;en;april;1;465;76
29;0.94290334;unapproved;de;Lissabon;1;378;en;lisbon;1;423;267
30;0.9420494;unapproved;de;Betrifft;1;533;en;subject;1;539;96
31;0.9417781;unapproved;de;H;1;580;en;h;1;574;100
32;0.93754536;unapproved;de;Schweiz;1;136;en;switzerland;1;123;110
33;0.93558306;unapproved;de;Türkei;1;1025;en;turkey;1;1041;269
34;0.9336505;unapproved;de;Marokko;1;243;en;morocco;1;261;145
35;0.933623;unapproved;de;Solana;1;171;en;solana;1;182;77
36;0.9334416;unapproved;de;Kroatien;1;81;en;croatia;1;87;75
37;0.93;unapproved;de;Mexiko;1;79;en;mexico;1;86;72
38;0.929566;unapproved;de;SOKRATES;1;70;en;socrates;1;79;69
39;0.92853904;unapproved;de;Juli;1;334;en;july;1;331;77
40;0.9265501;unapproved;de;TACIS;1;78;en;tacis;1;75;69
41;0.9264815;unapproved;de;Tschetschenien;1;336;en;chechnya;1;353;216
42;0.92510605;unapproved;de;Renault;1;144;en;renault;1;160;136
43;0.9224414;unapproved;de;Bosnien;1;190;en;bosnia;1;200;94
44;0.9221227;unapproved;de;Februar;1;433;en;february;1;447;88
45;0.9218446;unapproved;de;September;1;380;en;september;1;380;58
46;0.92183656;unapproved;de;Belgien;1;162;en;belgium;1;169;79
47;0.9208842;unapproved;de;Dänemark;1;264;en;denmark;1;283;69
48;0.91880286;unapproved;de;Mosambik;1;87;en;mozambique;1;101;81
49;0.91612965;unapproved;de;Serbien;1;192;en;serbia;1;221;75
50;0.9138126;unapproved;de;März;1;694;en;march;1;703;191
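The listing above is a semicolon-separated file with a header row, so it can be read with a standard CSV parser. The following Python sketch loads a few of the rows shown above and keeps only pairs above a score threshold. The meaning of the columns is inferred from their names, and the function names and the threshold of 0.95 are assumptions made for this example.

```python
import csv
import io

# Three rows copied from the listing above, including the header line.
SAMPLE = """nr;score;status;term1.LangCode;term1.wordGroup;term1.wordGroupLen;term1.wFreq;term2.LangCode;term2.wordGroup;term2.wordGroupLen;term2.wFreq;sentLinked
0;0.99404764;unapproved;de;OSZE;1;86;en;osce;1;88;83
12;0.9788914;unapproved;de;Nizza;1;397;en;nice;1;398;236
45;0.9218446;unapproved;de;September;1;380;en;september;1;380;58
"""


def read_term_pairs(handle):
    """Read an extraction file and return one dict per term pair."""
    reader = csv.DictReader(handle, delimiter=";")
    pairs = []
    for row in reader:
        pairs.append({
            "source": row["term1.wordGroup"],
            "target": row["term2.wordGroup"],
            "score": float(row["score"]),
            "source_freq": int(row["term1.wFreq"]),
            "target_freq": int(row["term2.wFreq"]),
            "linked": int(row["sentLinked"]),
        })
    return pairs


def filter_by_score(pairs, min_score):
    """Keep only the pairs whose score reaches the given threshold."""
    return [p for p in pairs if p["score"] >= min_score]


pairs = read_term_pairs(io.StringIO(SAMPLE))
for p in filter_by_score(pairs, 0.95):
    print(p["source"], "->", p["target"], p["score"])
```

For a downloaded extraction file, the same function could be called with an opened file object instead of the in-memory sample.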

Batch file for extractions

REM 1 directory
REM 2 source language
REM 3 target language
REM 4 lower phrase length limit
REM 5 upper phrase length limit
REM 6 minimum frequency
REM 7 maximum frequency
REM 8 maximum translations
REM 9 source term to lower (true/false)
REM 10 target term to lower (true/false)
set ARAYAPATH=c:/araya
SET DIRECTORY=%1
SHIFT
SET JARS=%ARAYAPATH%\lib\Win32\swt.jar;%ARAYAPATH%\lib\arayaserver.jar;%ARAYAPATH%\lib\external.jar
SET CALLING=java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -batch
set RESTARGS=%DIRECTORY% %1 %2 %3 %4 %5 %6 %7 %8 %9 ""
call %CALLING% %RESTARGS%
call java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -merge "europarl-%1-%2.csv" "%DIRECTORY%" 
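To make the ten positional parameters of the batch file easier to follow, the sketch below assembles the same extraction command from named arguments in Python. It mirrors only what the batch file itself shows (class name, -batch switch, argument order, trailing empty argument); the function name and the example values passed to it are hypothetical placeholders, not the settings used for the extractions reported here.

```python
from pathlib import PureWindowsPath


def build_extract_command(araya_path, directory, source_lang, target_lang,
                          min_len, max_len, min_freq, max_freq,
                          max_translations, source_lower, target_lower):
    """Return the java call of the batch file as an argument list."""
    lib = PureWindowsPath(araya_path) / "lib"
    jars = [lib / "Win32" / "swt.jar",
            lib / "arayaserver.jar",
            lib / "external.jar"]
    # Windows class path entries are separated by semicolons.
    classpath = ";".join(["."] + [str(j) for j in jars])
    return [
        "java", "-Xmx1024m", "-cp", classpath,
        f"-Djava.library.path={lib / 'Win32'}",
        "com.araya.BiExtractor.Editor", "-batch",
        directory, source_lang, target_lang,
        str(min_len), str(max_len),
        str(min_freq), str(max_freq),
        str(max_translations),
        "true" if source_lower else "false",
        "true" if target_lower else "false",
        "",  # the batch file passes an empty final argument
    ]


# Placeholder values for illustration only.
command = build_extract_command("c:/araya", "europarl", "de", "en",
                                1, 3, 50, 100000, 1, False, True)
print(command)
```

The resulting list could be handed to subprocess.run on a machine where Araya is installed; this has not been verified here.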

© Heartsome Europe GmbH, last updated: 26.07.2009, info@heartsome.de
Skype: Heartsome Europe / Friedrichstr. 17 - 90574 Roßtal / Germany / +49 9127 579001