XLIFF to TMX

One of the recent scripts published here allowed OmegaT users who wanted their project to be worked on in a different CAT tool, to export the whole OmegaT project to an XLIFF file. To get the completed work back to OmegaT, one had to run Okapi Rainbow to convert XLIFF to TMX, possibly using the Rainbow settings file created by the script.

In this post I’ll share how to convert those OmegaT-created XLIFF files finished (or partly finished) in Trados/MemoQ/Deja Vu/WhatNotCAT back to TMX that can be used in OmegaT (all tags preserved, of course, that was the whole point), right from within OmegaT, without running Rainbow manually. Continue reading

Convert OmegaT project to XLIFF for other CAT tools

I’m back with another little script that might be pretty handy for those who need to work on the same material in different CAT tools, or for translation agencies who use OmegaT as their main CAT application but farm out the work to translators using their CAT tools of choice. As a matter of fact, the script was requested by translation agency Velior for this very reason.
When the script is invoked, it writes out a file named PROJECTNAME.xlf (PROJECTNAME is the actual name of the project, not this loudly yelled word, of course), and the file is located in script_output subfolder of the current project. It exports both translated (they get “final” state in the resultant XLF file) and untranslated segments, and for untranslated segments the source is copied to the target, and such segments get “needs-translation” state. OmegaT segmentation and tags are preserved. Tags get enveloped in <ph id=”x”> and </ph>, so that they are treated as tags in other CAT tools. Continue reading

Export relavant TU’s from legacy TMX files in OmegaT

Situation

You have a new project and legacy translation memory files that usually go to /tm folder of OmegaT project. You need to give this project to someone else, but you don’t want to give away all of your previous translation. Somehow you need to extract from your TMX files only those TU’s that have matches in the current project.

Problem

The problem is evident — you need OmegaT to get matches for each segment, and if they are any good, store them somewhere handy, in a separate TMX file.
The problem has been (or still is) discussed on OmegaT Yahoo Group.

Solution

Continue reading

Export TMX for new translations

Here’s a script that lets you export translation units that are newer than a specified date. It might be useful if you get a project with a legacy TMX containing some perfect matches (that memory should be put into /tm/auto), and you want to be able to run QA tests only on new TU’s that have been translated since you’ve started to work on the project, or you want to keep your own translations for later and don’t really care for what has been translated before.

If the script is invoked without changing or specifying anything, it looks for TU’s that are one day old and works globally on the whole project. If you need to filter TU’s older or newer than that, you’ll have to specify the date on line 21.
The date should be in “yyyy-MM-dd HH:mm” (4_digit_year-2_digit_month-2_digit_day space 2_digit_hour_in_military_notation:2_digit_minute) format. If not specified properly, it will fall back to the default one-day-old value.
Beside that, the user can specify whether the script should work globally on the entire project, or only on selected files. To enable file selection, the line 27 should read:
select_files = 'yes'
Anything other than ‘yes’ means the script will work globally. Continue reading

Export TMX for selected files

OmegaT exports TMX for current files in the project every time translated documents are created. It writes three TMX files in the root of the project. But what if you need a translation memory file that contains translation units of only one or several files, not all that are currently present in the project. One solution is to temporarily move unneeded files out of source folder, reload the project and then create translated documents. But it is somewhat awkward and time consuming.
Here’s a groovy script that lets you select one or several files located in the same subfolder of the current project’s /source. Once they are selected, the script writes selected_files.[date_time].tmx in the project root. This TMX-file contains TU’s only for the selected files.

  • write_sel_files2TMX.groovy
    /*
     * Purpose:	Export source and translation segments of user selected 
     *	files into TMX-file
     * #Files:	Writes 'selected_files_<date_time>.tmx' in the current project's root
     * #File format:	TMX v.1.4
     * #Details:	http:/ /wp.me / p3fHEs-6g
     *
     * @author  Kos Ivantsov
     * @date    2013-08-12
     * @version 0.3
     */
    
    import javax.swing.JFileChooser
    import org.omegat.util.StaticUtils
    import org.omegat.util.TMXReader
    import static javax.swing.JOptionPane.*
    import static org.omegat.util.Platform.*
    
    def prop = project.projectProperties
    if (!prop) {
    	final def title = 'Export TMX from selected files'
    	final def msg   = 'Please try again after you open a project.'
    	showMessageDialog null, msg, title, INFORMATION_MESSAGE
    	return
    }
    
    def curtime = new Date().format("MMM-dd-yyyy_HH.mm")
    srcroot = new File(prop.getSourceRoot())
    def fileloc = prop.projectRoot+'selected_files_'+curtime+'.tmx'
    exportfile = new File(fileloc)
    def sourceroot = prop.getSourceRoot().toString() as String
    
    JFileChooser fc = new JFileChooser(
    	currentDirectory: srcroot,
    	dialogTitle: "Choose files to export",
    	fileSelectionMode: JFileChooser.FILES_ONLY, 
    	//the file filter must show also directories, in order to be able to look into them
    	multiSelectionEnabled: true)
    
    if(fc.showOpenDialog() != JFileChooser.APPROVE_OPTION) {
    console.println "Canceled"
    return
    }
    
    if (!(fc.selectedFiles =~ sourceroot.replaceAll(/\\+/, '\\\\\\\\'))) {
    		console.println "Selection outside of ${prop.getSourceRoot()} folder"
    		final def title = 'Wrong file(s) selected'
    		final def msg   = "Files must be in ${prop.getSourceRoot()} folder."
    		showMessageDialog null, msg, title, INFORMATION_MESSAGE
    		return
    	}
    
    if (prop.isSentenceSegmentingEnabled())
    	segmenting = TMXReader.SEG_SENTENCE
    	else
    	segmenting = TMXReader.SEG_PARAGRAPH
    
    def sourceLocale = prop.getSourceLanguage().toString()
    def targetLocale = prop.getTargetLanguage().toString()
    
    exportfile.write("", 'UTF-8')
    exportfile.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n", 'UTF-8')
    exportfile.append("<!DOCTYPE tmx SYSTEM \"tmx11.dtd\">\n", 'UTF-8')
    exportfile.append("<tmx version=\"1.4\">\n", 'UTF-8')
    exportfile.append(" <header\n", 'UTF-8')
    exportfile.append("  creationtool=\"OmegaTScripting\"\n", 'UTF-8')
    exportfile.append("  segtype=\"" + segmenting + "\"\n", 'UTF-8')
    exportfile.append("  o-tmf=\"OmegaT TMX\"\n", 'UTF-8')
    exportfile.append("  adminlang=\"EN-US\"\n", 'UTF-8')
    exportfile.append("  srclang=\"" + sourceLocale + "\"\n", 'UTF-8')
    exportfile.append("  datatype=\"plaintext\"\n", 'UTF-8')
    exportfile.append(" >\n", 'UTF-8')
    fc.selectedFiles.each{
    	fl = "${it.toString()}" - "$sourceroot"
    	exportfile.append("  <prop type=\"Filename\">" + fl + "</prop>\n", 'UTF-8')
    }
    exportfile.append(" </header>\n", 'UTF-8')
    exportfile.append("  <body>\n", 'UTF-8')
    
    def count = 0
    fc.selectedFiles.each{
    	fl = "${it.toString()}" - "$sourceroot" 
    	files = project.projectFiles
    	files.each{
    		if ( "${it.filePath}" != "$fl" ) {
    		println "Skipping to the next file"
    		}else{
    	it.entries.each {
    	def info = project.getTranslationInfo(it)
    	def changeId = info.changer
    	def changeDate = info.changeDate
    	def creationId = info.creator
    	def creationDate = info.creationDate
    	def alt = 'unknown'
    	if (info.isTranslated()) {
    		source = StaticUtils.makeValidXML(it.srcText)
    		target = StaticUtils.makeValidXML(info.translation)
    		exportfile.append("    <tu>\n", 'UTF-8')
    		exportfile.append("      <tuv xml:lang=\"" + sourceLocale + "\">\n", 'UTF-8')
    		exportfile.append("        <seg>" + "$source" + "</seg>\n", 'UTF-8')
    		exportfile.append("      </tuv>\n", 'UTF-8')
    		exportfile.append("      <tuv xml:lang=\"" + targetLocale + "\"", 'UTF-8')
    		exportfile.append(" changeid=\"${changeId ?: alt }\"", 'UTF-8')
    		exportfile.append(" changedate=\"${ changeDate > 0 ? new Date(changeDate).format("yyyyMMdd'T'HHmmss'Z'") : alt }\"", 'UTF-8')
    		exportfile.append(" creationid=\"${creationId ?: alt }\"", 'UTF-8')
    		exportfile.append(" creationdate=\"${ creationDate > 0 ? new Date(creationDate).format("yyyyMMdd'T'HHmmss'Z'") : alt }\"", 'UTF-8')
    		exportfile.append(">\n", 'UTF-8')
    		exportfile.append("        <seg>" + "$target" + "</seg>\n", 'UTF-8')
    		exportfile.append("      </tuv>\n", 'UTF-8')
    		exportfile.append("    </tu>\n", 'UTF-8')
    		count++;
    				}
    			}
    		}
    	}
    }
    exportfile.append("  </body>\n", 'UTF-8')
    exportfile.append("</tmx>", 'UTF-8')
    
    final def title = 'TMX file written'
    final def msg   = "$count TU's written to " + exportfile.toString()
    console.println msg
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
    

    The TMX file is rewritten each time the script in invoked.

Big thank you goes to Roman Mironov and Velior Translation Agency for the idea and comprehensive support.
Suggestions and comments are always welcome.
But as of now,


wordpress visitor

Good luck!

OmegaT match insert/replace without tags

Situation

After having translated a complete user manual that you converted from PDF to ODT to be able to work on it in OmegaT, you receive another manual from the same client, but this time it’s a DOCX file. Great! You can start right away, without converting anything. That should be a peace of cake — half of the manual looks almost the same as the one you have just done.

Problem

After starting to work with it you find out that getting a lot of 95-97% would be really awesome, if it wasn’t for all those nasty tags that are very different in the source and in the match. And there is no “Insert match without tags” menu item in OmegaT (yet).

Continue reading

Авто-пам’ять та пам’ять зі штрафом

Ситуація

На руках маєш файли пам’яті TMX з попередніх проектів, у них є 100% відповідності, однак точність перекладу залишає бажати кращого. Така ситуація може бути, наприклад, при наявності файлів пам’яті після машинного перекладу, або при отриманні проекту, в якому значна частина матеріалів перекладалась колись кимось іншим, а відтак була змінена тощо.
Крім того, маєш інші файли пам’яті, у яких теж присутні 100% відповідності, проте в цих файлах якість перекладу тебе задовольняє і ти хотів би при 100% відповідностях автоматично отримувати переклад сегментів без зайвих рухів.
Звісно, щоб OmegaT могла ці файли TMX використовувати, маєш їх помістити у теку /tm всередині проекту (чи у іншу теку, але у властивостях проекту вказати, куди саме OmegaT повинна дивитися по пам’ять).

Проблема

У OmegaT є можливість автоматично вставляти відповідності, що мають певний відсоток подібності. В залежності від проекту і якості додаткової пам’яті цей відсоток може бути різним (тобі вирішувати, який саме), але він не повинен бути низьким (70% — це дуже низький). Крім того, при автоматичній вставці OmegaT може додавати ще й префікс до вставлених таким чином сегментів, щоб потім їх потім можна було швидко знайти і підправити. Активується це через меню «Options» → «Editing Behaviour», прапорець на «Insert the best fuzzy match».
Коли така можливість активована, з’являється проблема — «якісні» відповідності вставляються із «зайвим» префіксом, вимагаючи додаткового редагування. Якщо ж префікс забрати, то «неякісні» відповідності вставляються, нічим свою низьку якість тобі не показуючи. Можна, звісно, відключити цю опцію зовсім, однак тоді треба буде дивитись за кожною відповідністю та по кожній приймати рішення. Втім, OmegaT дозволяє цю проблему елегантно вирішити.

Continue reading