Up till now, OmegaT users who translated .docx documents infested with nasty tags had to turn to a third-party solution to clean the files. There are a few of them out there, some very good and some not so good:
- CodeZapper: This is a set of macros for MS Word. Linux users can use it in MS Word running under Wine. CodeZapper is proprietary software; a license costs €20.
- Document cleaner (part of TransTools): This is also a set of tools and macros for MS Word; it runs well in Wine. Proprietary software, freeware.
- wipe.pl script: This is a Perl script developed a few years ago. A Windows wrapper was developed for it, and later a Linux Tcl/Tk frontend. The script, at least when executed via the Linux frontend, could damage the documents being cleaned. The original script and the wrappers are open source.
- OmegaT-DGT Tagwipe utility: This is very similar to the previous one, but way more comprehensive and dependable. It’s based on a Perl script and comes coupled with either a Windows or a Linux wrapper. The Linux wrapper isn’t usable right away when installed the recommended way; it needs a few lines fixed. If the path to the file to be cleaned contains spaces, the wrapper won’t be able to process it. This is solvable, but not addressed in the package provided for download. Once fixed, it becomes a nice and easy-to-use tool for cleaning documents. The utility is open source.
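As an aside on the spaces-in-path problem: the wrapper’s actual code isn’t shown here, but failures like this usually come from a command line being split on whitespace before the path reaches the script. Below is a minimal Python sketch of the difference. The script name `tagwipe.pl` and both helper functions are hypothetical and only illustrate the idea; they are not taken from the utility.

```python
import shlex


def naive_command(script, path):
    # Mimics an unquoted variable: the whole command line is split on
    # whitespace, so a path containing spaces falls apart into pieces.
    return f"perl {script} {path}".split()


def safe_command(script, path):
    # Passing the path as its own list element keeps it in one piece.
    return ["perl", script, path]


path = "/home/user/My Documents/report.docx"  # hypothetical path

print(naive_command("tagwipe.pl", path))
print(safe_command("tagwipe.pl", path))

# If a single command string is unavoidable, quote the path first.
quoted = f"perl tagwipe.pl {shlex.quote(path)}"
print(shlex.split(quoted))
```

The first call yields four arguments because the path is broken at the space; the other two keep it as a single argument, which is what a fixed wrapper would need to do.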
There are other ways to get rid of those tags, for instance, saving the .docx file as .doc or .odt and then resaving it as .docx, but this may lead to some formatting loss.
Another option is to disable tags altogether in OmegaT, but if there’s any inline formatting, it will need to be recreated manually in the target document.
But here’s some great news!
Recently the Perl script that was used in OmegaT-DGT Tagwipe utility was rewritten in Groovy by Briac Pilpré. The script is made specifically to be run in OmegaT, and has a GUI that lets the user select different options with which it should be run.
- There are 9 levels of cleaning which the user can select. The levels are explained here.
- The script can clean either the current document, or all .docx documents in the open OmegaT project.
- The script optionally backs up the processed documents into the <project_folder>/tagwipe folder, recreating the folder structure of <project_folder>/source. This option is enabled by default.
- It also lets you “beautify” the output, meaning that the underlying document.xml inside the .docx file will be a bit more readable to human eyes, but it won’t make any difference to the way MS Word or any other word processor sees the file. This option is pretty useful if there are things the user wants to change manually in the XML.
- The project will be reloaded if there were files cleaned, so the result is presented immediately. If the project doesn’t contain .docx files, or the current document isn’t .docx, it will put a line in the status bar saying that nothing was changed, and quit.
- User-selected options are remembered between runs.
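To make the backup and “beautify” options a bit more concrete, here is a rough Python sketch of what they amount to. This is not the Groovy script’s code: the function names are made up for illustration, and the sketch only assumes what is described above, namely that backups mirror the source folder layout under tagwipe and that beautifying re-indents word/document.xml inside the .docx archive.

```python
import io
import zipfile
from pathlib import Path
from xml.dom import minidom


def backup_target(project_dir, source_file):
    """Map <project>/source/a/b.docx to <project>/tagwipe/a/b.docx."""
    relative = Path(source_file).relative_to(Path(project_dir) / "source")
    return Path(project_dir) / "tagwipe" / relative


def beautify_docx(data):
    """Return a copy of a .docx archive (as bytes) with document.xml indented."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            content = src.read(item.filename)
            if item.filename == "word/document.xml":
                # Re-serialize with indentation; elements that hold only
                # text stay on one line, so the text itself is unchanged.
                dom = minidom.parseString(content)
                content = dom.toprettyxml(indent="  ", encoding="UTF-8")
            # All other parts of the archive are copied as they are.
            dst.writestr(item, content)
    return out.getvalue()
```

For example, `backup_target("/projects/demo", "/projects/demo/source/manuals/intro.docx")` points at `/projects/demo/tagwipe/manuals/intro.docx`, so a backed-up file is easy to find again by its original location.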
The script has been added to /trunk, so it will be bundled with a new OmegaT release. It can also be found in the latest nightly build.
Besides, it can be downloaded separately from the SF.net repository.
Update: As of OmegaT 4.1.5 the script is included in the installation bundle and doesn’t need to be downloaded separately.