For a first test, the file 32002R2342.tmx (about 26.4 MB) was used. Terms were extracted with English (en) as the source language, paired with every other language in the file. As a first step, the language codes were normalized to 2-character codes (e.g. en, de). As a result, 26 language-pair files were produced. They can be downloaded using the links below. The file names follow the pattern: 32002R2342.tmx.<source-language>.<target-language>.csv
32002R2342.tmx.en.bg.csv
32002R2342.tmx.en.cs.csv
32002R2342.tmx.en.da.csv
32002R2342.tmx.en.de.csv
32002R2342.tmx.en.el.csv
32002R2342.tmx.en.es.csv
32002R2342.tmx.en.et.csv
32002R2342.tmx.en.fi.csv
32002R2342.tmx.en.fr.csv
32002R2342.tmx.en.hu.csv
32002R2342.tmx.en.it.csv
32002R2342.tmx.en.lt.csv
32002R2342.tmx.en.lv.csv
32002R2342.tmx.en.mt.csv
32002R2342.tmx.en.nl.csv
32002R2342.tmx.en.pl.csv
32002R2342.tmx.en.pt.csv
32002R2342.tmx.en.ro.csv
32002R2342.tmx.en.sk.csv
32002R2342.tmx.en.sl.csv
32002R2342.tmx.en.sv.csv
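The normalization and naming steps described above can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the function names normalize_lang and pair_filenames are hypothetical, and it assumes that the language codes in the TMX file look like "EN-GB" or "de_DE", so that keeping the lower-cased primary subtag yields the 2-character code.

```python
def normalize_lang(code: str) -> str:
    """Reduce a code such as 'EN-GB' or 'de_DE' to its primary subtag ('en', 'de').

    Assumption: the primary subtag is already a 2-letter code; 3-letter
    codes would pass through unchanged apart from lower-casing.
    """
    primary = code.strip().replace("_", "-").split("-")[0]
    return primary.lower()


def pair_filenames(tmx_name: str, source: str, langs: list[str]) -> list[str]:
    """Build one CSV file name per (source, target) pair, skipping the source itself."""
    src = normalize_lang(source)
    # Deduplicate after normalization, then sort for a stable listing.
    targets = sorted({normalize_lang(lang) for lang in langs} - {src})
    return [f"{tmx_name}.{src}.{tgt}.csv" for tgt in targets]


if __name__ == "__main__":
    for name in pair_filenames("32002R2342.tmx", "EN-GB", ["EN-GB", "DE-DE", "bg-BG"]):
        print(name)
```

With English as the source, every other normalized language code found in the file yields exactly one file name of the form shown in the list above.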
The extraction was started with the following call:

call extractall.bat "." en -all 1 2 2 500 2 true true > allout

where extractall.bat looks like this:

REM 1 directory
REM 2 source language
REM 3 target language
REM 4 lower phrase length limit
REM 5 upper phrase length limit
REM 6 minimum frequency
REM 7 maximum frequency
REM 8 maximum translations
REM 9 source term to lower (true/false)
REM 10 target term to lower (true/false)
set ARAYAPATH=c:/araya
SET DIRECTORY=%1
REM after SHIFT, %1 is the source language, %2 the target language, and so on
SHIFT
SET JARS=%ARAYAPATH%\lib\Win32\swt.jar;%ARAYAPATH%\lib\arayaserver.jar;%ARAYAPATH%\lib\external.jar
SET CALLING=java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -batch
set RESTARGS=%DIRECTORY% %1 %2 %3 %4 %5 %6 %7 %8 %9 ""
call %CALLING% %RESTARGS%
REM call java -Xmx1024m -cp .;%JARS%; -Djava.library.path=%ARAYAPATH%\lib\Win32 com.araya.BiExtractor.Editor -merge "europarl-%1-%2.csv" "%DIRECTORY%"
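To make the ten positional parameters easier to read, the sketch below maps the arguments of the example call onto the names given in the REM comments. It is not part of the extractor: ExtractArgs and parse_args are hypothetical names used only for illustration. The value -all for the target language is kept as a plain string; judging from the test description it selects all other languages in the file, but that reading is an assumption.

```python
from typing import NamedTuple


class ExtractArgs(NamedTuple):
    """Positional parameters of extractall.bat, named after its REM comments."""
    directory: str
    source_language: str
    target_language: str
    min_phrase_length: int
    max_phrase_length: int
    min_frequency: int
    max_frequency: int
    max_translations: int
    source_to_lower: bool
    target_to_lower: bool


def to_bool(value: str) -> bool:
    """Accept only the literal strings 'true' and 'false' (case-insensitive)."""
    lowered = value.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"expected true/false, got {value!r}")
    return lowered == "true"


def parse_args(argv: list[str]) -> ExtractArgs:
    """Turn the ten positional arguments into a named, typed record."""
    if len(argv) != 10:
        raise ValueError(f"expected 10 arguments, got {len(argv)}")
    directory, source, target = argv[0:3]
    numbers = [int(a) for a in argv[3:8]]
    return ExtractArgs(directory, source, target, *numbers,
                       to_bool(argv[8]), to_bool(argv[9]))


if __name__ == "__main__":
    # The arguments of the example call shown above.
    args = parse_args([".", "en", "-all", "1", "2", "2", "500", "2", "true", "true"])
    for name, value in args._asdict().items():
        print(f"{name}: {value}")
```

Read this way, the example call extracts phrases of 1 to 2 words that occur between 2 and 500 times, keeps at most 2 translations per term, and lower-cases both source and target terms.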