Term extraction with the Multilingual Translation Memory of the Acquis Communautaire: DGT-TM as an example

The Multilingual Translation Memory of the Acquis Communautaire (DGT-TM) corpus was used to test bilingual term extraction.

Procedure

The DGT-TM corpus was downloaded from the website of the Multilingual Translation Memory of the Acquis Communautaire (DGT-TM). The TMX files are distributed in 12 zip files. Each file contains TMX entries in various languages.
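To see which languages a given TMX file inside one of the zip archives actually covers, the archive can be scanned before any extraction is run. The following is only a minimal sketch and not part of the Araya tooling: the function names are hypothetical, and it assumes the language code sits on each <tuv> element either as xml:lang or as a plain lang attribute.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree reports the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_in_tmx(tmx_bytes):
    """Return the set of language codes found on <tuv> elements of one TMX file."""
    langs = set()
    for _, elem in ET.iterparse(io.BytesIO(tmx_bytes), events=("end",)):
        if elem.tag == "tuv":
            # Assumption: the code is stored as xml:lang or as plain lang.
            code = elem.get(XML_LANG) or elem.get("lang")
            if code:
                langs.add(code)
        elif elem.tag == "tu":
            # Free the finished translation unit to keep memory use low.
            elem.clear()
    return langs


def languages_in_zip(zip_source):
    """Map each .tmx member of a zip archive to the language codes it contains."""
    result = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".tmx"):
                result[name] = languages_in_tmx(archive.read(name))
    return result
```

Calling languages_in_zip with the path of a downloaded archive returns a dictionary keyed by TMX file name, which makes it easy to pick a file that contains the desired source language.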

For a first test, the file 32002R2342.tmx (about 26.4 MB) was used. Term extraction was run with English (en) as the source language against all the other languages in the file. In a first step, the language codes were normalized to two-character codes (e.g. en, de). As a result, 26 language pair files were produced. They can be downloaded using the links below. The file names follow the pattern: 32002R2342.tmx.<source-language>.<target-language>.csv

32002R2342.tmx.en.bg.csv
32002R2342.tmx.en.cs.csv
32002R2342.tmx.en.da.csv
32002R2342.tmx.en.de.csv
32002R2342.tmx.en.el.csv
32002R2342.tmx.en.es.csv
32002R2342.tmx.en.et.csv
32002R2342.tmx.en.fi.csv
32002R2342.tmx.en.fr.csv
32002R2342.tmx.en.hu.csv
32002R2342.tmx.en.it.csv
32002R2342.tmx.en.lt.csv
32002R2342.tmx.en.lv.csv
32002R2342.tmx.en.mt.csv
32002R2342.tmx.en.nl.csv
32002R2342.tmx.en.pl.csv
32002R2342.tmx.en.pt.csv
32002R2342.tmx.en.ro.csv
32002R2342.tmx.en.sk.csv
32002R2342.tmx.en.sl.csv
32002R2342.tmx.en.sv.csv
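The normalization step and the naming pattern of the files listed above can be illustrated as follows. This is a sketch under assumptions: the exact rule the extractor uses to shorten language codes is not documented here, so the example simply keeps the lowercase primary subtag (for instance EN-GB becomes en). The function names are hypothetical.

```python
def normalize_lang(code):
    """Reduce a code such as 'EN-GB' or 'de_DE' to a lowercase two-letter code."""
    # Assumption: the two-character code is the primary subtag of the original code.
    primary = code.strip().replace("_", "-").split("-")[0].lower()
    if len(primary) != 2 or not primary.isalpha():
        raise ValueError(f"cannot reduce {code!r} to a two-letter code")
    return primary


def pair_filename(tmx_name, source, target):
    """Build a name of the form <tmx file>.<source-language>.<target-language>.csv."""
    return f"{tmx_name}.{normalize_lang(source)}.{normalize_lang(target)}.csv"


def pair_filenames(tmx_name, source, all_codes):
    """List the result file names for one source language against all other languages."""
    src = normalize_lang(source)
    targets = sorted({normalize_lang(c) for c in all_codes} - {src})
    return [pair_filename(tmx_name, src, t) for t in targets]
```

For example, pair_filename("32002R2342.tmx", "EN-GB", "DE-DE") yields 32002R2342.tmx.en.de.csv, which matches the naming of the files in the list.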

Batch file for extractions

call extractall.bat "." en -all 1 2 2 500 2 true true > allout
where extractall.bat looks like this:
REM 1 directory
REM 2 source language
REM 3 target language
REM 4 lower phrase length limit
REM 5 upper phrase length limit
REM 6 minimum frequency
REM 7 maximum frequency
REM 8 maximum translations
REM 9 source term to lower (true/false)
REM 10 target term to lower (true/false)
SET ARAYAPATH=c:\araya
SET DIRECTORY=%1
SHIFT
SET JARS=%ARAYAPATH%\lib\Win32\swt.jar;%ARAYAPATH%\lib\arayaserver.jar;%ARAYAPATH%\lib\external.jar
SET CALLING=java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -batch
set RESTARGS=%DIRECTORY% %1 %2 %3 %4 %5 %6 %7 %8 %9 ""
call %CALLING% %RESTARGS%
REM call java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -merge "europarl-%1-%2.csv" "%DIRECTORY%"
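The positional arguments of the sample call are easier to read when paired with the parameter names from the REM lines. The sketch below does only that. It is illustrative, not part of the batch file, and it uses shlex for tokenizing, which only approximates how cmd.exe splits a command line. The function name describe_call is hypothetical.

```python
import shlex

# Parameter names in the order given by the REM comments in extractall.bat.
PARAMETER_NAMES = [
    "directory", "source language", "target language",
    "lower phrase length limit", "upper phrase length limit",
    "minimum frequency", "maximum frequency", "maximum translations",
    "source term to lower", "target term to lower",
]


def describe_call(command_line):
    """Pair the arguments of an extractall.bat call with their parameter names."""
    tokens = shlex.split(command_line)
    # Ignore output redirection such as "> allout".
    if ">" in tokens:
        tokens = tokens[:tokens.index(">")]
    # Ignore a leading "call" keyword.
    if tokens and tokens[0].lower() == "call":
        tokens = tokens[1:]
    args = tokens[1:]  # skip the script name itself
    if len(args) != len(PARAMETER_NAMES):
        raise ValueError(f"expected {len(PARAMETER_NAMES)} arguments, got {len(args)}")
    return dict(zip(PARAMETER_NAMES, args))


if __name__ == "__main__":
    call = 'call extractall.bat "." en -all 1 2 2 500 2 true true > allout'
    for name, value in describe_call(call).items():
        print(f"{name}: {value}")
```

Read this way, the sample call works on the current directory with en as the source language, phrase lengths from 1 to 2, frequencies from 2 to 500, at most 2 translations, and lowercasing of both source and target terms. The target value -all presumably selects every other language in the file.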

© Heartsome Europe GmbH, last change: 22.07.2009