
Handle combining characters correctly in a codec #261

Open · wants to merge 1 commit into master

Conversation

nickjwhite

Using combining characters in a codec would previously result in the combining codepoint(s) being separated from their base character. Now combining characters are included with their base character, so the whole grapheme is treated as one unit.

This means that complex characters, which are made up of multiple Unicode codepoints, can now be correctly used and represented with Ocropus.

Note that I don't think this covers all possible Unicode "grapheme clusters" as defined in http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries; it only works with combining characters. But that still satisfies a lot of use cases.
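For illustration, here is a minimal sketch of the grouping idea, assuming only simple combining marks (this is not the patch itself; `group_graphemes` is a made-up name):

```python
import unicodedata

def group_graphemes(text):
    """Split text into units, attaching combining marks to their base character.

    Simplification: only combining characters (unicodedata.combining() != 0)
    are handled, not full UAX #29 grapheme clusters.
    """
    units = []
    for ch in text:
        if unicodedata.combining(ch) and units:
            units[-1] += ch  # attach the mark to the preceding base character
        else:
            units.append(ch)
    return units

# "p" + U+0365 COMBINING LATIN SMALL LETTER I stays together as one unit:
print(group_graphemes("p\u0365a"))  # ['p\u0365', 'a']
```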

@amitdo (Contributor) commented Nov 20, 2017

Tesseract 4 does something like that, but more complex, and there is special treatment for some scripts.

@kba (Collaborator) commented Dec 8, 2017

Can you explain how this relates to #257? Do you have data to test this?

@nickjwhite (Author)

@kba this PR is independent of #257. It concerns how Unicode characters followed by combining character codepoints are handled.

I could generate some test data for this. I haven't done much software unit testing before (to my shame). Do you think a sensible approach would be to create a line of ground truth with combining characters and a corresponding image, run rtrain, and check that the debug output of the sorted codec is correct? Prior to this pull request the combining characters would be moved around independently of their "parent" characters, so I could create ground truth where sorting them separately would make them move; does that sound reasonable? (See the sketch below.) One issue is that this depends on the debug output, so if that changed, the test would have to change too. The only alternative I can think of is to train a small model and check its output, which sounds rather too time-consuming for a test.

Thanks for the feedback!
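For what it's worth, a rough sketch of such a test, assuming a grapheme-grouping helper like the one sketched above rather than ocropy's actual codec API (all names here are illustrative):

```python
import unicodedata

def group_graphemes(text):
    # Same simplified grouping as in the sketch above: attach each
    # combining mark to the preceding base character.
    units = []
    for ch in text:
        if unicodedata.combining(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units

def test_combining_marks_stay_attached():
    # "pͥ" is p (U+0070) followed by U+0365 (combining small i); sorting
    # the codec's symbol set must not detach the mark from its base.
    gt = "zap" + "p\u0365"
    symbols = sorted(set(group_graphemes(gt)))
    assert "p\u0365" in symbols     # the cluster survives as one symbol
    assert "\u0365" not in symbols  # the bare mark never appears alone
```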

@mittagessen

Grapheme clusters are to some extent language-specific, as they represent a cultural notion of "characterness", so even implementing the Unicode algorithm won't solve all issues. Anyway, there's a Python implementation [link] that should resolve all common cases.

I thought about doing the same in kraken and decided that it isn't worth the trouble, especially as characters are seldom placed at their exact location anyway and are only useful in larger aggregations (words/segments). Something that doesn't break anything parsing the output and requires subjective, ever-expanding lists of exceptions might not be worth merging?

@nickjwhite (Author)

@mittagessen Your statement that the current state of affairs is "something that doesn't break anything parsing the output" is untrue. As I mentioned above, a patch which takes combining characters into account is the only way I can get Ocropus to recognise various historical graphemes which don't have single Unicode codepoints, of which there are a lot. There are plenty of characters like "pͥ" (p with combining i over it) that require combining character support to use, and which I need to recognise in a project I'm working on.

@mittagessen

@nickjwhite Argh, sorry, I'm only now seeing what you're trying to achieve, but my point still stands: you can recognize a grapheme cluster on the page as two separate labels and code points without any trouble (how do you think Uwe Springmann does all his incunabula work?). The only reason I could think of to have this functionality would be more accurate bounding boxes. From my experience with polytonic Greek, the recognition accuracy will actually be somewhat higher because the codec is so much smaller. Of course it only works up to a certain extent, e.g. expanding ﷺ won't work reliably, but typical agglomerations of combining characters (<5) can be trained that way.

I'm not able to check this right now, but have you confirmed that this change doesn't break existing models? Objects are serialized by reference, and changing the data type of the only thing that actually gets serialized might break stuff.

@wrznr commented Feb 13, 2018

+1 for this one. @zuphilip, it would be great if this PR could be merged.

@mittagessen

@wrznr do you have actual data where the current behavior causes degradation in recognition accuracy? As already mentioned, in my experience recognizing a single grapheme cluster as multiple labels is no issue at all. In addition, digging into the source code, Tesseract 4's behavior is in fact the exact reverse of this patch: grapheme clusters from large scripts (Chinese etc.) are split into a sequence of 4-5 labels and later merged by the codec, the reasoning being that a large number of possible output labels will impact accuracy more than teaching the network to map grapheme clusters to a sequence of labels.

@wrznr commented Feb 13, 2018

@mittagessen Actually, I have! I'll extract a sample from my Hebrew-with-nikkud experiments and paste it here later this week.

@mittagessen

@wrznr Thanks. BTW, with your nikkud experiments in mind, I wrote a many-to-many codec a while ago that allows arbitrary mapping between label and code point sequences. Unfortunately, it isn't directly applicable to ocropy, as the mapping dictionary is structured differently, but if you aren't concerned about model compatibility it should be fairly straightforward to plug in.
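As a sketch of the general idea (the actual kraken codec is structured differently; this toy class is purely illustrative), such a codec can be a table from code point sequences to label sequences, with greedy longest-match lookup in both directions:

```python
class ManyToManyCodec:
    """Toy many-to-many codec mapping strings to label sequences and back.

    Illustrative sketch only; kraken's real mapping structure differs.
    """

    def __init__(self, mapping):
        self.c2l = mapping                             # str -> tuple of labels
        self.l2c = {v: k for k, v in mapping.items()}  # tuple of labels -> str

    def encode(self, text):
        labels, i = [], 0
        while i < len(text):
            # Greedy longest match against the known code point sequences.
            for j in range(len(text), i, -1):
                if text[i:j] in self.c2l:
                    labels.extend(self.c2l[text[i:j]])
                    i = j
                    break
            else:
                raise KeyError("no mapping for %r" % text[i])
        return labels

    def decode(self, labels):
        labels, out, i = tuple(labels), [], 0
        while i < len(labels):
            for j in range(len(labels), i, -1):
                if labels[i:j] in self.l2c:
                    out.append(self.l2c[labels[i:j]])
                    i = j
                    break
            else:
                raise KeyError("no mapping for label %r" % (labels[i],))
        return "".join(out)

# One label each for "a" and "p", and a two-label sequence for "pͥ" (p + U+0365):
codec = ManyToManyCodec({"a": (1,), "p": (2,), "p\u0365": (2, 3)})
assert codec.encode("p\u0365a") == [2, 3, 1]
assert codec.decode([2, 3, 1]) == "p\u0365a"
```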
