This article explains how to merge and split segments in OmegaT using a convenient Groovy script.
It starts with some basics of segmentation in OmegaT. You may also jump straight to the script.
Basic principles of segmentation
OmegaT breaks text into segments according to the segmentation rules. These rules are gathered into a set or collection, and only one such collection can be applied at a time: per-project, a.k.a. local (i.e. applying only to the current project), or global (i.e. applying to every project which doesn’t have a local set of segmentation rules).
To enable per-project segmentation rules:
- Click Project → Properties…
- In the Project Properties dialog window, click Local Segmentation Rules…
- In the Local Segmentation Rules… dialog, tick Use local segmentation rules
If local segmentation rules are enabled, edit them through Project → Properties… → Local Segmentation Rules…; otherwise, use Options → Global Segmentation Rules…
Rules in this local or global collection are organized into groups by language. Remember that it is the source text that gets segmented, so the language code of a group has to match the project’s source language. The groups are processed in sequence: each group’s language code is checked, and if it matches, every rule in that group is applied in order. If the language of the group doesn’t match, the whole group is skipped.
It’s possible to add your own group, give it any name and language code. These groups (and the rules within them) can be moved up and down to change the order.
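To make the group matching concrete, here is a minimal sketch of the selection logic described above. It is not OmegaT's real code: the data model (a list of maps with name, languagePattern and rules) and the rulesFor helper are hypothetical, and it assumes that a group's language code is a regular expression tested against the source language code.

```groovy
// Hypothetical model of a rule collection; these are NOT OmegaT's real classes.
def groups = [
    [name: 'MergeSplit', languagePattern: 'EN.*', rules: ['merge-split rule 1']],
    [name: 'Japanese',   languagePattern: 'JA.*', rules: ['ja rule 1', 'ja rule 2']],
    [name: 'Default',    languagePattern: '.*',   rules: ['default rule 1']],
]

// Walk the groups top to bottom and collect, in order, the rules of every
// group whose language pattern matches the project's source language.
// Assumption: the language code of a group is treated as a regular expression.
def rulesFor(List groups, String sourceLanguage) {
    def matching = groups.findAll { sourceLanguage ==~ it.languagePattern }
    return matching.collectMany { it.rules }
}

println rulesFor(groups, 'EN-US')   // [merge-split rule 1, default rule 1]
```

The Japanese group is skipped entirely for an English source, and moving a group up or down in the list changes the order in which its rules are seen.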
Rules in groups can be added, removed, edited, and rearranged. There are two types of rules: rules that split a segment into two parts, and rules that merge two separate parts into one segment. Both types have a Pattern before and a Pattern after part. Together they describe the place in the text where the split or the merge is supposed to happen. A ticked Break/exception box means the text is split right where the text described in Pattern before ends and the text described in Pattern after begins; no tick means those two chunks will be kept together.
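The idea of a position "where Pattern before ends and Pattern after begins" can be illustrated with plain Java regular expressions. The sketch below is a simplification for illustration, not how OmegaT itself applies rules: rulePositions is a hypothetical helper, and subtracting the exception positions from the break positions is just the simplest way to show the effect of an unticked rule.

```groovy
import java.util.regex.Pattern

// Hypothetical helper: return every offset at which a match of `before`
// ends and a match of `after` starts. Simplified: only the non-overlapping
// matches of `before` found by Matcher.find() are considered.
List<Integer> rulePositions(String text, String before, String after) {
    def beforeMatcher = Pattern.compile(before).matcher(text)
    def afterPattern = Pattern.compile(after)
    def positions = []
    while (beforeMatcher.find()) {
        int pos = beforeMatcher.end()
        // lookingAt() succeeds only if `after` matches right at `pos`.
        def afterMatcher = afterPattern.matcher(text).region(pos, text.length())
        if (afterMatcher.lookingAt()) {
            positions << pos
        }
    }
    return positions
}

def text = 'Dr. Smith arrived. He sat down.'

// Break rule (box ticked): a period followed by whitespace.
def breaks = rulePositions(text, '\\.', '\\s')
// Exception rule (box not ticked): "Dr." followed by whitespace.
def exceptions = rulePositions(text, 'Dr\\.', '\\s')

def splitPoints = breaks - exceptions
println splitPoints   // [18]
```

Only the offset after "arrived." survives, so the text is split there, while "Dr." stays glued to "Smith".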
Both types of rules use regular expressions. In many cases typing or pasting the exact/verbatim part of the text you want to split or merge does the trick, but some characters have special meanings in RegEx, so you may not get the results you expect.
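As a small illustration of the pitfall, the sketch below compares a verbatim string used directly as a pattern with the same string escaped through java.util.regex.Pattern.quote. The sample text is made up, and the sketch assumes that the pattern fields accept standard Java regular expression syntax, including the \Q…\E quoting that Pattern.quote produces.

```groovy
import java.util.regex.Pattern

def verbatim = 'Fig. 2 (a)'

// Used as-is, "." matches any character and "(a)" is a group that matches just "a".
def naive = Pattern.compile(verbatim)
// Pattern.quote wraps the text in \Q...\E so every character is taken literally.
def quoted = Pattern.quote(verbatim)
def literal = Pattern.compile(quoted)

println quoted                                   // \QFig. 2 (a)\E
assert !naive.matcher('Fig. 2 (a)').find()       // the naive pattern misses the very text it was copied from
assert naive.matcher('Figs 2 a').find()          // ...and matches text it was never meant to match
assert literal.matcher('see Fig. 2 (a)').find()  // the quoted pattern matches only the literal text
```

Escaping individual characters with a backslash (for example \. and \( ) has the same effect as quoting the whole string.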
There are a few limitations that no rule can work around. One of them is merging two segments from different paragraphs. It simply won’t work: a paragraph break in the source file always results in a new segment in OmegaT, no matter what.
Very rarely, rules may not work as expected because something that should be split according to a rule you added is merged by another rule somewhere down the chain, or vice versa. If you’re careful with your segmentation rules, it shouldn’t happen to you.
Merging and Splitting Segments without Editing the Rules
History
In OmegaT, there’s no easy way to merge or split segments right from the editor. A workaround was proposed back in 2016, through a script. The script simply copied the parts of the actual text in OmegaT where the split or the merge was requested and added rules based on that text. It was much better than nothing, but there was a lot of room for improvement: it wouldn’t work if the project didn’t have local segmentation rules; if the rules had no group for the project’s source language, it wouldn’t add one; to do a split, you had to carefully select a part of the source text exactly from its end to the split point; it would try to merge segments over paragraph breaks or even segments from different files (which would never work, rule or no rule)…
That script was given a thorough rework, and now most of those issues have been addressed.
Download
First of all, the script is available on GitHub and SF.net.
Script Features
- By default, the script enables per-project segmentation rules and creates a new group (called MergeSplit) for your project’s source language. Either behavior can be disabled by changing true to false in the following lines (enforceProjectSRX is about enabling per-project rules, separateMappingRule is about creating a separate group):
enforceProjectSRX = true
separateMappingRule = true
- When the script is run in a project with no per-project rules, and the default behavior of the script wasn’t changed, it will reload the project after the first run because newly enabled per-project rules need to be activated. It will happen only once (unless you disable per-project segmentation and later run the script again), so in order to do that first split or merge, you might need to run the script twice.
- To perform a split, simply place the cursor inside the current segment’s source sentence where you want the split to occur. No need to select anything.
- To perform a merge, just run the script with the text cursor anywhere but in the current segment’s source text. It will try to merge the current segment with the next.
- Before performing merges or splits, it will show the preview and ask for confirmation.
- Even though you need to reload the project for each new rule to be applied, you may choose not to do so if you’re performing several splits or merges in a row. Reload only after the last one in the series is done — it will save you a few seconds.
- If the merge is going to be over a paragraph break or between the last and the first segments in two files, the script will inform you about the impossibility of such a merge and won’t add the rule.
- A few more features were added later.
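The way the script chooses between a split, a merge, and a refusal can be summed up in a few lines. This is only a sketch of the behavior described in the list above, not the script's actual code: planAction and its parameters are hypothetical, the caret offset is assumed to be relative to the start of the source text (null when the caret is somewhere else), and treating a caret at the very start or end of the source as "not inside" is an assumption too.

```groovy
// Hypothetical decision helper; the real script reads the caret position and
// the neighbouring segment from OmegaT's editor, which is not shown here.
Map planAction(String source, Integer caretInSource, boolean nextIsMergeable) {
    // Assumption: a split needs text on both sides of the caret.
    boolean caretInsideSource = caretInSource != null &&
            caretInSource > 0 && caretInSource < source.length()
    if (caretInsideSource) {
        return [action: 'split',
                left  : source.substring(0, caretInSource),
                right : source.substring(caretInSource)]
    }
    // A merge over a paragraph break or across two files can never work,
    // so no rule is added in that case.
    if (!nextIsMergeable) {
        return [action: 'refuse']
    }
    return [action: 'merge']
}

println planAction('First part, second part.', 11, true)
println planAction('First part, second part.', null, true)
println planAction('Last segment of the file.', null, false)
```

The first call plans a split after "First part,", the second a merge with the next segment, and the third is refused because the next segment cannot be merged.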
Script’s GUI
If you want the script’s messages to be shown in the same language as OmegaT runs in, download merge_split.properties (included in the archive on SF.net) and translate the file. Unlike the earlier version, the script also works without this file, but its messages will then be in English only.
Misc
If you need more info about installing and using OmegaT scripts, see this quick guide.
If you find this script useful, leave a comment.
There’s also a very easy way to say thank you.
Your support will make more scripts like this possible.
Happy merging and splitting!
This is *extremely* useful. I translate patents and related material, which frequently contain extremely lengthy segments, so I often find it easier to break up one large segment into several smaller ones to make it easier to parse the sentence as a whole. OmegaT is far and away my favorite CAT tool, but I’ve always envied how Trados, clunky as it is otherwise, allows for splitting and merging of segments on the fly for this reason.
This script is definitely going to make my life easier. Thank you!
Thank you, Jonathan. It’s a true joy to read happy comments from satisfied users. I use the script almost daily myself and sometimes marvel how well it’s done 🙂