Update: Most of the post remains true, but make sure you download these scripts from the SF.net repository.
In this post I want to share three scripts that do an extended search and replace in an OmegaT project. The search and replace patterns for each script are specified in external plain text files kept in the project folder, so the same scripts can be used unmodified in different projects with different sets of patterns: the user only needs to edit those plain text files. Besides modifying text, the scripts can do simple math on what they find, which gives the user a per-project unit converter.
Each script should be accompanied by its own external file located in a subfolder named .ini in the project’s root (details under each script further on). The format of these files is the same for all three:
- Only one empty line in the file — the very last one
- Each line consists of three blocks:
  - Search pattern (regex aware)
  - Tab
  - Replace pattern
So, if you need to replace “Владимир Владимирович” (taking into account the different cases of Russian nouns) with “the President of the Russian Federation”, here’s what you need to put in the substitution file:
Владимир\p{L}*\sВладимирович\p{L}*	the President of the Russian Federation
If you need to convert miles into kilometers, here’s what you specify:
(\d+)(\s?)mile(s?)	${(it[1] as int) * 1.6}$2km
Likewise, if you want to convert Fahrenheit into Centigrade, here’s the line:
(\d+)°F	${((it[1] as int) - 32 ) * 5 / 9 }°C
To make sure the abbreviation “EU” is always spelled in uppercase, you can either put:
\b([Ee][Uu])\b	${(it[1] as String).toUpperCase()}
or
\b([Ee][Uu])\b	EU
There’s a TAB character in each of the above examples. It separates the search pattern from the replace pattern.
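To make the file format concrete, here is a minimal Groovy sketch of how such lines can be split into their two parts. It follows the tokenize('\t') call used in the listings further down; the parseRules name and the sample text are made up for the illustration and are not part of the scripts.

```groovy
// Hypothetical helper: split every line of a substitution file into
// a search part and a replace part on the TAB character.
def parseRules(String text) {
    text.readLines().collect { line ->
        def parts = line.tokenize('\t')
        [search: parts[0], replace: parts[1]]
    }
}

// Two sample lines; \t is the TAB that separates the two patterns.
def sample = '\\b([Ee][Uu])\\b\tEU\n' +
             '(\\d+)(\\s?)mile(s?)\t${(it[1] as int) * 1.6}$2km\n'

def rules = parseRules(sample)
assert rules.size() == 2
assert rules[0].search == '\\b([Ee][Uu])\\b'
assert rules[0].replace == 'EU'
assert rules[1].replace == '${(it[1] as int) * 1.6}$2km'
```

Spaces inside a replace pattern are harmless because only the TAB is used as a delimiter.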
So, now to the scripts themselves. As usual, each heading is a link to pastebin.com where you can download the scripts, and under each heading there is a listing of the script.
- search_replace_batch.groovy
/*
 * #Purpose:     Batch search and replace in the whole project
 * #Files:       Requires 'search_replace.ini' in the current project's root
 * #File format: Plain text, where *each* line is:
 *               [Search Pattern] [Tab] [Replace Pattern];
 *               only the last line *must* be empty.
 * #Details:     http://wp.me/p3fHEs-5e
 *
 * @author   Kos Ivantsov
 * @based on Didier Briel's "search and replace script"
 * @date     2013-07-22
 * @version  0.1
 */

import static javax.swing.JOptionPane.*
import static org.omegat.util.Platform.*
import groovy.swing.SwingBuilder
import java.awt.Component
import javax.swing.JButton
import javax.swing.JTable
import javax.swing.table.*
import javax.swing.event.*
import java.awt.event.*
import java.awt.BorderLayout as BL

/*
 * set 'GUI' to anything but 'true' (with quotes) in the next line
 * if you don't need a GUI listing of all changed segments
 */
def GUI = 'true'

def prop = project.projectProperties
if (!prop) {
    final def title = 'Batch search and replace'
    final def msg = 'Please try again after you open a project.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}

//def folder = prop.projectRoot
//def fileloc = folder+'/subst_template.txt'
subst_file = new File(prop.projectRoot+'/search_replace.ini')
if (! subst_file.exists()) {
    final def title = 'No file'
    final def msg = 'Template file for batch search and replace'+'\n' + subst_file +'\n'+'doesn\'t exist.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}

length = subst_file.readLines().size()
search_array = []
replace_array = []
data = []
def count = 0
// NOTE: the published listing is cut off between 'while ( count' and
// 'source = ste.getSrcText()'. The lines down to 'project.allEntries.each'
// are restored from the third listing and from how the rest of this script
// uses them; check them against the repository copy.
while ( count < length ) {
    ln = subst_file.readLines().get(count).tokenize('\t')
    sr = ln[0]
    rp = ln[1]
    search_array.add(sr)
    replace_array.add(rp)
    count++
}
def range = 0..<length
def segment_count = 0

project.allEntries.each { ste ->
    source = ste.getSrcText();
    target = project.getTranslationInfo(ste) ? project.getTranslationInfo(ste).translation : null;
    initial_target = target
    // Skip untranslated segments
    if (target == null) return
    for ( i in range) {
        ser = search_array[i]
        // turn $1, $2, ... into ${(it[1] as String) }, ... so they can be evaluated
        rep = (replace_array[i] =~ /\$(\d+)/ ).replaceAll( '\\${(it[$1] as String) }' )
        shell = new GroovyShell()
        eval = {statement, arg -> shell.setVariable 'it', arg; shell.evaluate '"' + statement + '"' }
        // protect a literal "null" in the text, then drop the "null"
        // produced by groups that did not take part in the match
        target = target.replaceAll(/null/, 'repl0')
        target = target.replaceAll(ser) { eval rep, it }
        target = target.replaceAll(/null/, '')
        target = target.replaceAll(/repl0/, 'null')
    }
    if (initial_target != target) {
        data.add([ seg: ste.entryNum(), source: source, in_target: initial_target, target: target ])
        segment_count++
        editor.gotoEntry(ste.entryNum())
        console.println(ste.entryNum() + "\t" + ste.srcText + "\t" + target )
        editor.replaceEditText(target)
    }
}

swing = new SwingBuilder()
frame = swing.frame(title:'Changed segments', preferredSize: [1024, 500]) {
    scrollPane {
        table() {
            tableModel(list:data) {
                propertyColumn(editable: true, header:'Segment', propertyName:'seg',
                    minWidth: 80, maxWidth: 80, preferredWidth: 80,
                    cellEditor: new TableCellEditor() {
                        public void cancelCellEditing() {}
                        public boolean stopCellEditing() { return false; }
                        public Object getCellEditorValue() { return value; }
                        public boolean isCellEditable(EventObject anEvent) { return true; }
                        public boolean shouldSelectCell(EventObject anEvent) {return true; }
                        public void addCellEditorListener(CellEditorListener l) {}
                        public void removeCellEditorListener(CellEditorListener l) {}
                        public Component getTableCellEditorComponent(JTable table, Object value, boolean isSelected, int row, int column) {
                            println("value: " + value);
                            org.omegat.core.Core.getEditor().gotoEntry(value);
                        }
                    },
                    cellRenderer: new TableCellRenderer() {
                        public Component getTableCellRendererComponent(JTable table, Object value, boolean isSelected, boolean hasFocus, int row, int column) {
                            def btn = new JButton()
                            btn.setText(value.toString())
                            return btn
                        }
                    }
                )
                propertyColumn(editable: false, header:'Source',propertyName:'source', minWidth: 150, preferredWidth: 350)
                propertyColumn(editable: false, header:'Initial Target',propertyName:'in_target', minWidth: 150, preferredWidth: 350)
                propertyColumn(editable: false, header:'Target',propertyName:'target', minWidth: 150, preferredWidth: 350)
            }
        }
    }
    panel(constraints: BL.SOUTH){
        button('Quit', actionPerformed:{ frame.visible = false })
    }
}

console.println "\n${'-'*10}"+"\n"+"Segments modified: " + segment_count
if (segment_count == 0){
    final def title = 'Search and Replace'
    final def msg = 'No segments can be changed.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}
if (GUI == 'true') {
    if (segment_count != 0){
        frame.pack()
        frame.show()
    }
}
This script performs a global search and replace in the whole project. The file with the search and replace patterns should be named search_replace.ini; note that the listing above (version 0.1) looks for it directly in the project’s root. The patterns are applied in the order in which they are listed, so the sequence matters. Back up your project before using the script, as there’s no way to revert the changes.
- search_replace_pretranslate.groovy
/*
 * #Purpose:     Pretranslate certain untranslated segments
 * #Files:       Requires 'pretranslate.ini' in '.ini' subfolder
 *               of the current project's root
 * #File format: Plain text, where *each* line is:
 *               [Search Pattern] [Tab] [Replace Pattern];
 *               only the last line *must* be empty.
 * #Note:        Only those segments where a [Search Pattern] is found
 *               are pretranslated and prefixed with {PRETRAN}
 * #Details:     http://wp.me/p3fHEs-5e
 *
 * @author   Kos Ivantsov
 * @author   Yu Tang
 * @based on Didier Briel's "search and replace script"
 * @date     2013-12-28
 * @version  0.2
 */

import static javax.swing.JOptionPane.*
import static org.omegat.util.Platform.*
import groovy.swing.SwingBuilder
import java.awt.Component
import javax.swing.JButton
import javax.swing.JTable
import javax.swing.table.*
import javax.swing.event.*
import java.awt.event.*
import java.awt.BorderLayout as BL

/*
 * set 'GUI' to anything but 'true' (with quotes) in the next line
 * if you don't need a GUI listing of all changed segments
 */
def GUI = 'true'

/*
 * set true to process current document only,
 * otherwise process whole project.
 */
def CURRENT_FILE_ONLY = false

def prop = project.projectProperties
if (!prop) {
    final def title = 'Pretranslate'
    final def msg = 'Please try again after you open a project.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}

subst_file = new File(prop.projectRoot, '.ini/pretranslate.ini')
if (! subst_file.exists()) {
    final def title = 'No file'
    final def msg = 'Template file for pretranslation'+'\n' + subst_file +'\n'+'doesn\'t exist.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}

length = subst_file.readLines().size()
search_array = []
replace_array = []
data = []
def count = 0
// NOTE: the published listing is cut off after 'while ( count'. The loop and
// the two definitions below are restored from the third listing and from how
// the rest of this script uses them.
while ( count < length ) {
    ln = subst_file.readLines().get(count).tokenize('\t')
    sr = ln[0]
    rp = ln[1]
    search_array.add(sr)
    replace_array.add(rp)
    count++
}
def range = 0..<length
def segment_count = 0

// [Still missing from the published listing: the lines that choose the
//  entries to process (the current file only or the whole project, depending
//  on CURRENT_FILE_ONLY) and open the closure that receives each entry as 'ste'.]

    // Skip translated segments
    if (project.getTranslationInfo(ste).translation) {
        return
    }
    initial_target = target = source = ste.srcText
    for ( i in range) {
        ser = search_array[i]
        // turn $1, $2, ... into ${(it[1] as String) }, ... so they can be evaluated
        rep = (replace_array[i] =~ /\$(\d+)/ ).replaceAll( '\\${(it[$1] as String) }' )
        shell = new GroovyShell()
        eval = {statement, arg -> shell.setVariable 'it', arg; shell.evaluate '"' + statement + '"' }
        target = target.replaceAll(/null/, 'repl0')
        target = target.replaceAll(ser) { eval rep, it }
        target = target.replaceAll(/null/, '')
        target = target.replaceAll(/repl0/, 'null')
    }
    if (initial_target != target) {
        data.add([ seg: ste.entryNum(), source: source, in_target: initial_target, target: target ])
        segment_count++
        editor.gotoEntry(ste.entryNum())
        console.println(ste.entryNum() + "\t" + ste.srcText + "\t" + target )
        // mark the segment as pretranslated
        target = target.replaceAll(/(^)/, '\\{PRETRAN\\}$1')
        editor.replaceEditText(target)
    }
}

swing = new SwingBuilder()
frame = swing.frame(title:'Changed segments', preferredSize: [1024, 500]) {
    scrollPane {
        table() {
            tableModel(list:data) {
                propertyColumn(editable: true, header:'Segment', propertyName:'seg',
                    minWidth: 80, maxWidth: 80, preferredWidth: 80,
                    cellEditor: new TableCellEditor() {
                        public void cancelCellEditing() {}
                        public boolean stopCellEditing() { return false; }
                        public Object getCellEditorValue() { return value; }
                        public boolean isCellEditable(EventObject anEvent) { return true; }
                        public boolean shouldSelectCell(EventObject anEvent) {return true; }
                        public void addCellEditorListener(CellEditorListener l) {}
                        public void removeCellEditorListener(CellEditorListener l) {}
                        public Component getTableCellEditorComponent(JTable table, Object value, boolean isSelected, int row, int column) {
                            println("value: " + value);
                            org.omegat.core.Core.editor.gotoEntry value
                        }
                    },
                    cellRenderer: new TableCellRenderer() {
                        public Component getTableCellRendererComponent(JTable table, Object value, boolean isSelected, boolean hasFocus, int row, int column) {
                            def btn = new JButton()
                            btn.setText(value.toString())
                            return btn
                        }
                    }
                )
                propertyColumn(editable: false, header:'Source',propertyName:'source', minWidth: 150, preferredWidth: 350)
                propertyColumn(editable: false, header:'Initial Target',propertyName:'in_target', minWidth: 150, preferredWidth: 350)
                propertyColumn(editable: false, header:'Target',propertyName:'target', minWidth: 150, preferredWidth: 350)
            }
        }
    }
    panel(constraints: BL.SOUTH){
        button('Quit', actionPerformed:{ frame.dispose() })
    }
}

console.println "\n${'-'*10}"+"\n"+"Segments modified: " + segment_count
if (segment_count == 0){
    final def title = 'Pretranslate'
    final def msg = 'No segments can be pretranslated.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}
if (GUI == 'true' && segment_count) {
    frame.pack()
    frame.show()
}
The external file for this script should be named pretranslate.ini. The script works only on segments that don’t have a translation yet. If the source segment contains a match for any search pattern, the target is populated with the source text with all possible substitutions applied and prefixed with {PRETRAN}. Segments where nothing was found are left intact.
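The behaviour described above can be sketched as a small Groovy function. This is only an illustration: the pretranslate name and the sample rules are invented, and plain String.replaceAll stands in for the subscript-aware replacement that the real script performs.

```groovy
// Hypothetical sketch: returns the pretranslated target, or null when no
// search pattern matches and the segment should be left untouched.
def pretranslate(String source, Map<String, String> rules) {
    def target = source
    rules.each { search, replace ->
        target = target.replaceAll(search, replace)
    }
    // Only segments that actually changed get the {PRETRAN} prefix.
    target != source ? '{PRETRAN}' + target : null
}

def rules = ['\\bcat\\b': 'chat', '\\bdog\\b': 'chien']

assert pretranslate('a cat and a dog', rules) == '{PRETRAN}a chat and a chien'
assert pretranslate('a bird', rules) == null
```

The prefix makes it easy to find the pretranslated segments later and review them.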
- replace_with_template.groovy
/* :name=Autotext :description=Text manipulations in the current segment
 *
 * Purpose:  Modify current target sequentially replacing [Search patterns]
 *           with [Replace patterns], specified in external file
 * #Files:   Requires 'segment_substitution.ini' in '.ini' subfolder
 *           of the current project's root
 * #Format:  Plain text, where *each* line is:
 *           [Search Pattern] [Tab] [Replace Pattern];
 *           only the last line *must* be empty.
 * #Details: http://wp.me/p3fHEs-5e
 *
 * @author  Kos Ivantsov (with major contributions by Yu Tang)
 * @date    2014-05-30
 * @version 0.4
 */

import static javax.swing.JOptionPane.*
import static org.omegat.util.Platform.*

def prop = project.projectProperties
if (!prop) {
    final def title = 'Replace using substitution file'
    final def msg = 'Please try again after you open a project.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}

def folder = prop.projectRoot
def fileloc = folder+'.ini/segment_substitution.ini'
subst_file = new File(fileloc)
if (! subst_file.exists()) {
    final def title = 'No file'
    final def msg = 'Substitution file ' + subst_file + ' doesn\'t exist.'
    showMessageDialog null, msg, title, INFORMATION_MESSAGE
    return
}

length = subst_file.readLines().size()
search_array = []
replace_array = []
def count = 0
while ( count < length ) {
    ln = subst_file.readLines().get(count).tokenize('\t')
    sr = ln[0]
    rp = ln[1]
    search_array.add(sr)
    replace_array.add(rp)
    count++
}
def range = 0..<length

// [The published listing is cut off after 'def range = 0..'. Missing here are
//  the lines that read the text to work on into 'target' (the selection, or
//  the current target when nothing is selected). The head of the loop below,
//  up to 'eval = {statement, arg ->', is restored from the other two listings.]

for ( i in range) {
    ser = search_array[i]
    rep = (replace_array[i] =~ /\$(\d+)/ ).replaceAll( '\\${(it[$1] as String) }' )
    shell = new GroovyShell()
    eval = {statement, arg -> shell.setVariable 'it', arg; shell.evaluate '"' + statement + '"' }
    target = target.replaceAll(/null/, 'repl0')
    target = target.replaceAll(ser) { eval rep, it }
    target = target.replaceAll(/null/, '')
    target = target.replaceAll(/repl0/, 'null')
}

if (editor.selectedText){
    editor.insertText(target)
}else
    editor.replaceEditText(target)
This one needs a plain text file named segment_substitution.ini. I published it before, but here I include it with some additions and enhancements. This script works only on the current segment. One of the things it can be used for is inserting characters not present in your keyboard layout (even though I’m not a big supporter of such use). So, if you need to insert a copyright symbol, you may specify the following line:
\*co\*	©
which will substitute “*co*” with “©”.
You can also use it as a kind of autotext function: you specify your own abbreviations and the way they should expand. Being able to use regex lets you do nifty things for languages where word cases matter. Here’s an example:
I want to be able to insert a compound term using my own abbreviation. Each part of the term should agree in number and case. The term is “Провідний регіональний слюсар-сантехнік” (a just-made-up Ukrainian term that means “Lead regional fitter-plumber”).
The line can look like this:
\b([Пп])н(\p{L}+)-р(\p{L}+)?-к(\p{L}+)?	$1ровідн$2 регіональн$2 слюсар$3-сантехнік$4
When I type “пний-р-к” and invoke the script, it is expanded into “провідний регіональний слюсар-сантехнік”. If I start the abbreviation with an uppercase letter, the expanded text starts with uppercase too. And I can use cases relatively easily: “Пними-рями-ками” expands into “Провідними регіональними слюсарями-сантехніками”. It’s rather geeky, I realize that, but it shows that the script can be used for a number of things.
UPDATE: This script was updated on October 15, 2013 (the listings here and on pastebin.com are up to date; you may need to update your local copy).
The update includes a new location for segment_substitution.ini and new behavior when the script is invoked while some text is selected: the selection is inserted into the target at the cursor position after all possible transformations. If no transformations are possible, the text is inserted as is, which might be a nifty way to insert things without changing your clipboard content.
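Optional groups such as (\p{L}+)? in the abbreviation rule are the reason the listings shuffle the string “null” around each replacement: a group that takes no part in the match comes back as null, and Groovy prints that as the text null inside a string. A minimal sketch of the trick, with an invented ASCII rule (file(s)? to document plus the captured ending) in place of the Ukrainian one:

```groovy
// Hypothetical illustration of the null handling used in the listings.
def expand = { String text ->
    text = text.replaceAll(/null/, 'repl0')                 // protect a literal "null"
    text = text.replaceAll(/file(s)?/) { m -> "document${m[1]}" }
    text = text.replaceAll(/null/, '')                      // drop "null" left by unmatched groups
    text.replaceAll(/repl0/, 'null')                        // restore the protected text
}

assert expand('one file')   == 'one document'
assert expand('two files')  == 'two documents'
assert expand('null files') == 'null documents'
```

A side effect of this approach is that a segment that happens to contain the text repl0 would come out with null in its place.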
Big thanks go to Yu Tang, who helped me figure out how to make math possible in these scripts.
But as of now,
Good luck
Here’s a short script to turn a tab-separated glossary into a pretranslate.ini file, which will only search for whole words/phrases. It’s sorted from the longest to the shortest entries in the source language.
#!/bin/sh
awk -F\\t '{ print length($1), $0 | "sort -rn" }' YOUR_GLOSSARY > pretranslate.tmp
sed '
s/^[0-9]* // # strip the length prefix (and the space after it) that awk added for sorting
/^$/d # delete any blank lines
s/^/\\b/g
s/\t/\\b\t/g
' pretranslate.tmp > pretranslate.ini
rm pretranslate.tmp
Is this the place for feature requests? I’d be most grateful for two things:
– The ability to run this only on the working file, not the whole project, and…
– A “cancel” button.
And, unfortunately, I’m not a good enough programmer to be of much help. Sorry.
Thanks for good suggestions, Steve. I think I’ll manage to make it possible to limit the script’s functionality to current/selected file, but not being a programmer and trying to avoid scripting as much as I can, I can tell you right away there won’t be a “cancel” button any time soon. Somehow GUI programming with Swing isn’t the easiest thing for my feeble mind. I’ll email you when I have updated the script.
A question: when you are running replace_with_template and more than one replacement string applies to the same selected text… how does the script manage that? Does it apply only the first replacement it finds in segment_substitution? Does it let you select which replacement to use? I am sorry to ask, but my ability to construe groovy statements is limited 🙂
The script reads the search and replace templates (or expressions) defined in segment_substitution.ini one by one, in the order they’re listed, and applies each of them as soon as a match is found. A template is applied to all of its matches within the segment at once. This means that if the same search expression is listed more than once, most likely only its first occurrence in the list will have any effect, unless the replacements made before the later occurrences produce a new match. The same goes for substrings, so you have to be rather careful when tailoring segment_substitution.ini, and it’s usually a good idea to list longer search expressions first.
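A small Groovy sketch of why the order matters. The applyInOrder helper and the sample rules are invented for the illustration, and plain String.replaceAll stands in for the script’s own replacement step.

```groovy
// Hypothetical helper: apply [search, replace] pairs one after another,
// each rule working on the output of the previous one.
def applyInOrder(String text, List<List<String>> rules) {
    rules.inject(text) { acc, rule -> acc.replaceAll(rule[0], rule[1]) }
}

def longerFirst  = [['New York City', 'NYC'], ['New York', 'NY']]
def shorterFirst = [['New York', 'NY'], ['New York City', 'NYC']]

// With the longer expression first, both rules get their chance.
assert applyInOrder('New York City, New York', longerFirst) == 'NYC, NY'

// With the shorter one first, the longer rule never finds a match.
assert applyInOrder('New York City, New York', shorterFirst) == 'NY City, NY'

// A later rule also sees whatever an earlier rule produced.
assert applyInOrder('colour', [['colour', 'color'], ['color', 'hue']]) == 'hue'
```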
I hope it answers your question, but if not, I’d be happy to try to clarify more.
Thank you, it does answer my question. I am glad the script is smart enough to apply changes recursively. Kudos to you and Yu for that! I will take note of your suggestion of listing longer search expressions first. Cheers!
Below is my segment_substitution that performs date conversion on selected text.
It converts strings such as:
January 1, 2014
Into:
1 de enero de 2014
That is, it converts English date format into Spanish date format.
([a-zA-Z]+)\s(\d+),\s(\d+)	$2 de $1 de $3
January	enero
February	febrero
March	marzo
April	abril
May	mayo
June	junio
July	julio
August	agosto
September	septiembre
October	octubre
November	noviembre
December	diciembre
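How these lines work together can be traced in a few lines of Groovy: the first rule reorders the date, and a later rule translates the month name it left behind. This is a sketch only; the convert closure is invented, the rule list is shortened to two months, and Java’s own $1 group references stand in for the script’s replacement step.

```groovy
// Shortened rule list: the reordering rule first, then month names.
def rules = [
    [/([a-zA-Z]+)\s(\d+),\s(\d+)/, '$2 de $1 de $3'],
    ['January',  'enero'],
    ['February', 'febrero'],
]

// Hypothetical helper: run the rules top to bottom on the text.
def convert = { String text ->
    rules.inject(text) { acc, r -> acc.replaceAll(r[0], r[1]) }
}

// Step 1 gives "1 de January de 2014", step 2 swaps in the Spanish month.
assert convert('January 1, 2014')   == '1 de enero de 2014'
assert convert('February 14, 2015') == '14 de febrero de 2015'
```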
Thanks again Kos and Yu for the script!
You’re welcome, and thanks for your comments.
Hi Kos, after a while not using the replace_with_template script and a few OmegaT updates in the meantime, I tried to use the script again and ran into the following error:
java.lang.NullPointerException
I can’t figure out what is causing the error. My segment_substitution is the same posted above.
Maybe it throws that error if none of the declared search strings is present in the current segment? I don’t think so.
Hi Hector, usually this error happens when there’s an extra empty line in segment_substitution. There shouldn’t be any empty lines except the last one. If you’re sure you don’t have any empty lines, and still get this error, you can send me or share somewhere your segment_substitution.ini and source+target text, I’ll take a look.
Thanks, Kos. Curiously, my segment_substitution had no empty lines. I found that the error arose after I added an empty line at the end. Once removed, the script worked again. I’m on a Win 7 machine.
Maybe different OSes treat the end of the file differently. Or maybe it’s something that happened with the updated Groovy scripting engine in OmegaT. Not being a programmer, I don’t even know how to investigate it, nor do I feel inclined to do so.
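One plausible reading of this exchange, sketched in Groovy: a file that merely ends with a line break has no extra line as far as readLines() is concerned, while a truly empty line yields an entry whose search pattern is null. The sample strings are invented, and where exactly the scripts then fail is an assumption rather than something verified against OmegaT.

```groovy
def ok    = 'a\tb\n'     // the file ends with a single line break
def extra = 'a\tb\n\n'   // one more, really empty, line at the end

assert ok.readLines().size() == 1
assert extra.readLines().size() == 2

// The empty line has nothing to split, so both patterns come out as null.
def parts = extra.readLines()[1].tokenize('\t')
assert parts == []
assert parts[0] == null
assert parts[1] == null
```

So “only the last line must be empty” is probably best read as “the file ends with a line break and nothing after it”.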
Hi Kos, maybe this question is meant for Yu, but I don’t know how to ask him, so maybe you would do the honors?
I tried the replacement you provided for converting miles into kilometers, but something is going wrong. Whenever I tell the script to convert 1.5 miles into km, it gives me this:
1.8.0 km
Not being a programmer nor a mathematician, I can’t figure out why… any ideas?
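A likely explanation, sketched in Groovy: (\d+) only matches whole numbers, so in “1.5 miles” the rule picks up “5 miles”, turns it into “8.0 km” and leaves the “1.” in front untouched. The closures below imitate what the rule computes; the decimal-aware pattern is a suggestion that has not been tried in the script itself.

```groovy
// The rule from the post: whole numbers only.
def rule = /(\d+)(\s?)mile(s?)/
def toKm = { m -> "${(m[1] as int) * 1.6}${m[2]}km" }
assert '1.5 miles'.replaceAll(rule, toKm) == '1.8.0 km'

// Assumed fix: also capture an optional decimal part and
// convert the capture to BigDecimal instead of int.
def decimalRule = /(\d+(?:\.\d+)?)(\s?)mile(s?)/
def toKmDecimal = { m -> "${(m[1] as BigDecimal) * 1.6}${m[2]}km" }
assert '1.5 miles'.replaceAll(decimalRule, toKmDecimal) == '2.40 km'
assert '3 miles'.replaceAll(decimalRule, toKmDecimal) == '4.8 km'
```

In the substitution file the corresponding line would be (\d+(?:\.\d+)?)(\s?)mile(s?), then a TAB, then ${(it[1] as BigDecimal) * 1.6}$2km.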
Hi Kos, Thank you for sharing the script, it’s a real life saver!
Now I can remove leading and trailing spaces, double spaces, and capitalize all words at the start of a sentence – all at once. This will save me a lot of time.
Hi Kos, thank you for sharing the scripts.
There is only one prob here:
The third script works well except that the replacement always goes to the beginning of the current segment no matter where the cursor is.
There’s very little I can do about it. Of course, the script could try to figure out where the cursor is and put it back at that position once the target has been replaced with the text that went through all the substitutions, but after all the changes the script makes, the retrieved position probably won’t correspond to the place where the cursor was when the script was invoked.
I use this script all the time to clean up translated segments (all the capitalization and everything), and it saves hours. There is one “bug” I can’t figure out how to fix: it doesn’t seem to work with dollar signs ($).
For example, I want to batch replace anything like $ 100 (which the Google Translate API keeps putting in) with $100. I tried this:
\$\s ——> $
This seems to make it terminate. Do you think there is an easy way to fix this? I looked at the code but couldn’t figure it out. Thx.
Hi Kirk,
It looks like you need to escape the dollar sign in the replacement section too:
\$\s → \$
This makes it work in my tests.
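Why the escape is needed can be seen from how the listings evaluate a replace pattern: it is wrapped in double quotes and run as a tiny Groovy script, and in a double-quoted Groovy string a bare $ must be followed by a variable name or a ${} expression. A sketch, with evalReplacement as an invented stand-in for the eval step of the scripts:

```groovy
def shell = new GroovyShell()

// Hypothetical stand-in: wrap the replace pattern in double quotes
// and let Groovy evaluate it as a string expression.
def evalReplacement = { String replacement ->
    shell.evaluate('"' + replacement + '"').toString()
}

// A bare dollar sign gives the script text "$", which does not compile.
def failed = false
try {
    evalReplacement('$')
} catch (Exception e) {
    failed = true
}
assert failed

// With a backslash the script text is "\$", a literal dollar sign.
assert evalReplacement('\\$') == '$'
assert '$ 100'.replaceAll('\\$\\s') { evalReplacement('\\$') } == '$100'
```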
Oh wow, that worked. Thanks! A while back I was having a similar conundrum and tried an escape on the replace side. That time it printed out the slash too, so I never thought to try it again.
I have a related question. I use this script to do all the sentence capitalization. And, please don’t laugh at my noobness, but this is my statement:
\.\sa → . A
\.\sb → . B
\.\sc → . C
So there are 26 lines, one for each letter of the alphabet. I have a sneaking suspicion that they can all be combined into one statement. I tried some different ways, but having such limited knowledge I couldn’t figure it out. Am I doing this wrong?
It’s not wrong per se, as it works and gets the thing done. But you are right, it can be combined into one neat line:
\.\s(\p{Ll}) → . ${((it[1] as String).toUpperCase())}
Works with any alphabet where there’s a concept of lower and upper cases.
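What that single line does can be checked in plain Groovy. The closure below imitates the rule and is not the script’s own code.

```groovy
// Imitation of the rule: a period, one whitespace character and a
// lowercase letter become a period, a space and the uppercase letter.
def capitalize = { String text ->
    text.replaceAll('\\.\\s(\\p{Ll})') { m -> '. ' + m[1].toUpperCase() }
}

assert capitalize('first. second. third') == 'first. Second. Third'
assert capitalize('already. Fine') == 'already. Fine'
```

Note that the whitespace after the period is always written back as a plain space, whatever it was before.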
Hi Kos,
No wonder I couldn’t figure it out – look at that thing! I will give that a shot, and look at it more closely so it can also be a learning opportunity for me. Thanks so much for your advice and help! I think your blog is great and always look forward to seeing new recipes – please keep it up. Best regards.
Yeah, it looks a bit complicated because the script is written in such a way that it can execute subscripts defined on each line, as in the one above.
By using () in the search we can capture and group things, and then refer to them in the replacement. The first group is referred to by $1, the second by $2, and so on. So we can search for
[uU](ncle) [sS](am)
and replace with
U$1 S$2
to make sure “Uncle Sam” is always capitalized.
To run subscripts, we have to use ${} (that’s where the subscript is going to be run) and “it” (in this case it’s the result of the search). So, it[1] is basically $1 for the subscript, but in the subscript it can be declared as any type, not only String, which allows you to do unit conversions or almost anything else Groovy is capable of.
(\d+)°F → ${((it[1] as int) - 32 ) * 5 / 9 }°C
(°F to °C)
(\d+)(\s?)mile(s?) → ${(it[1] as int) * 1.6}$2km
(miles to kilometers)
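The whole mechanism fits in a few lines of Groovy. The sketch below follows the replacement step of the listings in the post, but the applyRule name is invented and the code runs outside OmegaT, so treat it as an illustration rather than the scripts’ exact behaviour.

```groovy
// Hypothetical helper that mirrors the replacement step of the listings.
def applyRule(String text, String search, String replace) {
    // Rewrite $1, $2, ... as ${(it[1] as String) }, ... so that every
    // group reference becomes a subscript as well.
    def rep = (replace =~ /\$(\d+)/).replaceAll('\\${(it[$1] as String) }')
    def shell = new GroovyShell()
    text.replaceAll(search) { groups ->
        // 'it' inside the subscript is the list [whole match, group 1, group 2, ...]
        shell.setVariable('it', groups)
        shell.evaluate('"' + rep + '"').toString()
    }
}

assert applyRule('uncle sam', /[uU](ncle) [sS](am)/, 'U$1 S$2') == 'Uncle Sam'
assert applyRule('10 miles', /(\d+)(\s?)mile(s?)/,
                 '${(it[1] as int) * 1.6}$2km') == '16.0 km'
```

Because the replace pattern is evaluated as a Groovy string, anything that is valid inside ${} can be used there, which is what makes the unit conversions possible.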
My favorite recipe stopped working today when I upgraded to OmegaT 4.0.0. I get an ever-spinning cursor, no response, and have to use the task manager to kill the program. If you update to 4.0 and solve this mystery, please post the solution. Thanks, as always, for the great recipes.
Can anyone kindly suggest how to make this search & replace script work on Mac OS? Have been trying all options but nothing seems to be working…
Thanks