Merge and split segments in OmegaT

This article explains how to merge and split segments in OmegaT using a convenient Groovy script.

It starts with some basics of segmentation in OmegaT. You can also jump directly to the script.

Basics of segmentation

OmegaT breaks text into segments according to segmentation rules. These rules are grouped into sets, which in turn are grouped into a collection. Only one such collection can be applied at a time: either the per-project, a.k.a. local, collection (which applies only to the current project), or the global one (which applies to every project that doesn’t have local segmentation rules).

To enable per-project segmentation rules:

  • Click Project → Properties…
  • In the Project Properties dialog box, click Local Segmentation Rules…
  • In the Local Segmentation Rules window, select Use local segmentation rules

If local segmentation rules are enabled, the rules are edited via Project → Properties… → Local Segmentation Rules…, otherwise via Options → Global Segmentation Rules…

The sets of rules in a local or global collection are organized by language. Remember that it is the source text that gets segmented, so the language code of a set must match the source language of the project. The sets are processed in sequence: each set’s language code is checked, and if it matches, each rule in that set is applied in order. If the language of the set doesn’t match, the whole set is skipped.
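
The selection of sets described above can be pictured with a small Groovy sketch. It is only an illustration, not OmegaT's actual code: the map layout, the rule names, and the idea that a set's language code is a regular-expression pattern such as EN.* are assumptions made for the example.

```groovy
// Hypothetical model of a collection: ordered sets, each with a language pattern and its rules
def collection = [
    [name: 'MergeSplit', languagePattern: 'EN.*', rules: ['rule A', 'rule B']],
    [name: 'German',     languagePattern: 'DE.*', rules: ['rule C']],
    [name: 'Default',    languagePattern: '.*',   rules: ['rule D']]
]

// Collects, in order, the rules of every set whose pattern matches the source language;
// sets that do not match are skipped as a whole
def applicableRules = { String sourceLang, List sets ->
    def matching = sets.findAll { sourceLang ==~ it.languagePattern }
    matching.collectMany { it.rules }
}

assert applicableRules('EN-US', collection) == ['rule A', 'rule B', 'rule D']
```

Because the sets are walked top to bottom, moving a set up in the list puts its rules ahead of the rules of every set below it.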

It’s possible to add your own set, giving it any name and language code. Sets of rules (and the rules within them) can be moved up and down to change the order.

Rules in sets can be added, removed, edited, and rearranged. There are two types of rules: rules that split a segment into two parts, and rules that merge two separate parts into one segment. Both types have a Pattern before part and a Pattern after part, which together describe the place in the text where the split or merge should occur. A checked Break/exception box means the split will occur exactly where the text described in Pattern before ends and the text described in Pattern after begins; an unchecked box means these two chunks will be kept together in one segment.
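
To make the interplay of the two rule types more concrete, here is a minimal Groovy sketch. The Rule class and the splitPoints method are hypothetical names invented for this example, and the sketch assumes that at each position the first matching rule decides; OmegaT's real segmenter may differ in details.

```groovy
import java.util.regex.Pattern

// Hypothetical rule model: 'breaks' mirrors the Break/exception checkbox
class Rule {
    boolean breaks
    String before
    String after
}

// Returns the offsets at which the text would be split.
// Assumption: at each position, the first rule whose two patterns match decides.
List<Integer> splitPoints(String text, List<Rule> rules) {
    def points = []
    for (int i = 1; i < text.length(); i++) {
        def head = text.substring(0, i)
        def tail = text.substring(i)
        def hit = rules.find { r ->
            // Pattern before must end exactly at the position, Pattern after must start there
            Pattern.compile('(?:' + r.before + ')\\z').matcher(head).find() &&
                Pattern.compile(r.after).matcher(tail).lookingAt()
        }
        if (hit?.breaks) {
            points << i
        }
    }
    return points
}

def rules = [
    new Rule(breaks: false, before: 'fig\\.', after: '\\s'),  // exception: keep "fig. 3" together
    new Rule(breaks: true,  before: '\\.',    after: '\\s')   // break after a full stop
]

assert splitPoints('See fig. 3 for details. Then continue.', rules) == [23]
```

With the exception rule removed, the same text would also be split after "fig.", which is why the order of rules in a set matters.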

Both rule types use regular expressions. In many cases, typing or pasting the exact/verbatim part of the text you want to split or merge will do the trick, but some characters have special meanings in RegEx, so you may not get the results you expect.
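
The following Groovy snippet shows why pasted text can misbehave, and one way to neutralize special characters. It relies on java.util.regex, the regular expression engine of the Java platform that OmegaT runs on; the sample text is made up.

```groovy
import java.util.regex.Pattern

// Hypothetical piece of source text that ends with an abbreviation
def literal = 'approx.'

// Pasted verbatim, the dot is a wildcard, so the pattern also matches "approx!"
assert 'approx!' ==~ literal

// Pattern.quote() wraps the text in \Q...\E so that every character is taken literally
def quoted = Pattern.quote(literal)
assert quoted == '\\Qapprox.\\E'
assert !('approx!' ==~ quoted)
assert 'approx.' ==~ quoted
```

The same effect can be had by hand: either put \Q before and \E after the pasted text, or escape individual special characters with a backslash (for example \. for a literal dot).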

There are a few limitations where no rules can help. One of them is merging two segments from different paragraphs. It simply won’t work: a paragraph break in the source file always results in a new segment in OmegaT, no matter what.

Very rarely, rules may not work as expected because something that should be split according to one rule you added is merged by another rule somewhere in the chain, or vice versa. If you are careful with your segmentation rules, this shouldn’t happen to you.

Merging and splitting segments without editing rules

History

In OmegaT, there’s no easy way to merge or split segments directly from the editor. A workaround was proposed back in 2016 through a script. The script simply copied the parts of the actual text in OmegaT where the split or the merge was requested and added the rules based on that text. It was much better than nothing, but there was a lot of room for improvement: the script wouldn’t work if the project didn’t have local segmentation rules; if there was no set of rules for the project’s source language, the script wouldn’t add one; to do the split, you had to carefully select a part of the source text exactly from its end up to the split point; the script would try to merge segments over paragraph breaks or even segments from different files (which would never work, rule or no rule)…

That script has now been thoroughly reworked, and most of the above issues have been addressed.

Download

The script is available on GitHub and SF.net.

Script Features

  • By default, the script enables the per-project collection of segmentation rules and creates a new set (called MergeSplit) for your project’s source language. Either behavior can be disabled by changing true to false in the lines below (enforceProjectSRX controls enabling per-project rules, separateMappingRule controls creating a separate set):
enforceProjectSRX = true
separateMappingRule = true
  • If the script is run in a project with no per-project rules, and the script’s default behavior hasn’t been changed, it will reload the project after the first run because newly enabled per-project rules need to be activated. This happens only once (unless you disable per-project segmentation and run the script again later), so you may need to run the script twice to perform this first split or merge.
  • To perform a split, simply place the cursor inside the current segment’s source sentence where you want the split to occur. You don’t need to select anything.
  • To perform a merge, simply run the script with the text cursor anywhere but in the current segment’s source text. It will attempt to merge the current segment with the next one.
  • Before performing merging or splitting, it will show the preview and ask for confirmation.
  • Although you must reload the project for each new rule to be applied, you can choose not to reload if you’re performing several merges or splits in a row. Reload after the last one in the series is complete — it will save you a few seconds.
  • If the merge is over a paragraph break or between the last and the first segment in two files, the script will inform you that such a merge is impossible and won’t add the rule.
  • A few more features were added later.
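
For the curious, the idea of turning a cursor position into a split rule can be sketched in a few lines of Groovy. This is not taken from the script itself: splitRuleFor is a hypothetical helper, and the sketch only assumes that the rule is built from the literal source text on either side of the cursor.

```groovy
import java.util.regex.Pattern

// Hypothetical helper: builds the two patterns of a split rule from the cursor offset.
// The text on each side of the cursor is quoted so that it is matched literally.
def splitRuleFor = { String source, int cursor ->
    [breaks: true,
     before: Pattern.quote(source.substring(0, cursor)),
     after : Pattern.quote(source.substring(cursor))]
}

// Cursor placed right after the semicolon (offset 13)
def rule = splitRuleFor('First clause; second clause.', 13)
assert rule.before == '\\QFirst clause;\\E'
assert rule.after == '\\Q second clause.\\E'
```

A merge rule could be pictured the same way, with breaks set to false and the two patterns taken from the end of the current segment and the start of the next one.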

Script’s GUI

If you want the script’s messages to be displayed in the same language as the OmegaT interface, download the merge_split.properties file (included in the archive on SF.net) and translate it. Unlike the earlier version, the script works without this file, but its messages will then be in English only.

Misc

If you need more info on how to install and use OmegaT scripts, see this quick guide.

If you find this script useful, please leave a comment.

There’s also a very easy way to say thank you.
Your support makes more scripts like this possible.

Happy merging and splitting!

4 thoughts on “Merge and split segments in OmegaT”

  1. Pingback: Merge and split segments in OmegaT (update) | True Translation
  2. This is *extremely* useful. I translate patents and related material, which frequently contain extremely lengthy segments, so I often find it easier to break up one large segment into several smaller ones to make it easier to parse the sentence as a whole. OmegaT is far and away my favorite CAT tool, but I’ve always envied how Trados, clunky as it is otherwise, allows for splitting and merging of segments on the fly for this reason.

    This script is definitely going to make my life easier. Thank you!
