Switch to NFC normalisation by default #257

Open
nickjwhite wants to merge 1 commit into master

Conversation

nickjwhite

Switch from NFKC normalisation to NFC normalisation by default. NFC
normalisation is more appropriate for OCR, as different characters
which may be semantically similar are nevertheless often useful to
capture and output in their original form.

@kba
Collaborator

kba commented Dec 8, 2017

Can you elaborate on why NFC is better suited for OCR than NFKC? Can you give example data where NFC is superior, and ideally test data for CI? How will this influence recognition with the widely used en-default and fraktur models?

I'm reluctant to merge this until I fully understand the consequences.

@nickjwhite
Author

nickjwhite commented Dec 12, 2017

Thanks for commenting @kba.

My use case for NFC is to recognise and differentiate long s (ſ) from short s (s) in old Latin documents. NFKC treats them as semantically interchangeable, and so changes the long s into a short s, meaning that I can't differentiate them in the OCR output. I realised this when I found that the long s in my codec input text wasn't included in the codec debug output, but it also meant that the long s characters in my ground truth were silently changed to short s characters.

It makes sense to me to only normalise down to glyphs that are identical, and not those that are merely deemed by Unicode to be "equivalent", so that other such characters can still be differentiated. To use the example of long and short s: with this patch one could still just transcribe all long s characters in ground truth with a short s if the differentiation wasn't important; the difference is that, if it is relevant, the different glyphs can be preserved and correctly represented with an appropriate model.
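For illustration, here is what the difference looks like with Python's standard unicodedata module (just a quick demonstration, not part of the patch):

```python
import unicodedata

text = "Wachsthum deſſelben"  # contains long s (U+017F)

# NFC only composes canonically equivalent sequences; the long s survives.
print(unicodedata.normalize("NFC", text))   # Wachsthum deſſelben

# NFKC also applies compatibility mappings and silently folds ſ to s.
print(unicodedata.normalize("NFKC", text))  # Wachsthum desselben
```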

I can't think of any cases where this could cause a regression in training quality elsewhere, unless the ground truth files depended on different Unicode characters being normalised to the same character during training. I am no expert in non-Latin or Greek scripts, so perhaps that could be an issue there, but it would surprise me.

I hope this all makes sense. Do ping me for clarification if I have been too verbose and unclear!

(edited as I got NFKC and NFC the wrong way around in this comment initially - sorry!)

@nickjwhite
Author

nickjwhite commented Dec 12, 2017

I just tested all the ground truth that was linked to from the wiki, comparing differences between NFC and NFKC: https://github.com/tmbdev/ocropy/wiki/Models

https://github.com/ChillarAnand/likitham (Telugu): no difference
https://github.com/jze/ocropus-model_fraktur (Fraktur): no difference
https://github.com/jze/ocropus-model_oesterreich-ungarn (German): no difference

https://github.com/zuphilip/ocropy-french-models (French): Only difference is several instances of this:
NFC: …
NFKC: ...

https://github.com/jze/ocropus-model_cyrillic (Cyrillic): 1 instance of a difference
NFC: ¾
NFKC: 3⁄4

https://github.com/isaomatsunami/clstm-Japanese (Japanese): many differences, but all are small variants, of which these seem representative:
NFC: 膀紛惟賢筋紬鉱確板藪垢ト木撰詠腹蒋〉筋静ぞ⑬端宥罷橿懇培上鋼
NFKC: 膀紛惟賢筋紬鉱確板藪垢ト木撰詠腹蒋〉筋静ぞ13端宥罷橿懇培上鋼

NFC: 盗紺r貧双員急媚牙鳥忠傑勃或架届循諭繁敢想疫別緯容く茶b游髙
NFKC: 盗紺r貧双員急媚牙鳥忠傑勃或架届循諭繁敢想疫別緯容く茶b游髙

NFC: 染し璧ク偏郁承枚莉析蒋呪拡霊繊覚産⓯+剝匿節¥司借垂彌備貴菊
NFKC: 染し璧ク偏郁承枚莉析蒋呪拡霊繊覚産⓯+剝匿節¥司借垂彌備貴菊

NFC: 仰佛ル祖留器碕膚蛛納准嚥廠㍑抑抜提騙ん禍駐尽陵繭弩未汰責脳商
NFKC: 仰佛ル祖留器碕膚蛛納准嚥廠リットル抑抜提騙ん禍駐尽陵繭弩未汰責脳商

I also checked the text in ocropy's tests/ directory, and there was no difference between NFC and NFKC.
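For reference, the comparison amounts to something like the sketch below (the file layout is an assumption here: one transcription per *.gt.txt file, as ocropy ground truth is usually stored):

```python
import glob
import unicodedata

# Walk all ground truth files and report any line that NFC and NFKC disagree on.
for path in sorted(glob.glob("**/*.gt.txt", recursive=True)):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    if nfc != nfkc:
        print(path)
        print("NFC: ", nfc.strip())
        print("NFKC:", nfkc.strip())
```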

From looking at all of these, it still seems to me that NFC is the best option, as it follows the principle of least surprise: it will ensure that whatever glyph is encoded in the ground truth will be used for the model. I suspect that the people creating these models didn't expect the OCR to alter their characters from the ground truth the way NFKC does.

I couldn't see the ground truth for the English and Fraktur models you mention, but I'd be happy to compare them too if they're available.

@amitdo
Contributor

amitdo commented Dec 13, 2017

@Beckenb
Contributor

Beckenb commented Dec 13, 2017

About 'long s':
https://www.mail-archive.com/[email protected]/msg01569.html

> I would recommend just recognizing it with the default Fraktur model and choosing long/short s based on context; there are very few cases where the choice can't be made programmatically, and you should be able to find those with a simple script.

Tom is proposing to change "short s" to "long s" after the OCR; while this may work relatively easily on more recent Fraktur texts (e.g. 19th century), the older the texts get the more difficult it becomes. Incunabula (1450-1500), for example, have no uniform grammar or spelling to follow, so it is crucial that the OCR reflects the print as closely as possible.
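For context, the kind of context-based post-correction Tom has in mind would be a rough rule of thumb along these lines (purely illustrative; as noted above, it breaks down on older prints):

```python
import re

def restore_long_s(text):
    # Rough rule of thumb for German Fraktur: round "s" only at the end of a
    # word, long "ſ" everywhere else. Real texts need more careful rules
    # (syllable and compound boundaries), and incunabula often follow no
    # consistent rule at all.
    return re.sub(r"s(?=[a-zäöüß])", "ſ", text)

print(restore_long_s("dass das wasser steigt"))  # daſs das waſſer ſteigt
```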

@nickjwhite
Author

@Beckenb is correct; moreover, places where long s is used in a way that is not "correct" can themselves be useful information to capture in some cases.

There will also be cases of other characters for which it is important to recognise the particular glyph, even if Unicode considers it "equivalent" to a different one. Long s is just a well-known, obvious example.

@amitdo
Contributor

amitdo commented Dec 13, 2017

@nickjwhite,

Maybe you want to explore what Tesseract 4.00 is doing.
https://github.com/tesseract-ocr/tesseract/search?q=UnicodeNormMode

@mittagessen

I want to throw in NFD as a candidate for the default mode as well, and strongly agree that the compatibility normalizations (NFKC/NFKD) are too lossy for OCR on historical prints. NFD has the benefit over NFC that codecs are smaller, resulting in a slight boost in recognition accuracy (<1%) and speed (probably unnoticeable). For heavily diacritized scripts such as polytonic Greek, the codec size will be roughly cut in half.

The only drawback is that polytonic Greek output will look worse on some displays, as many fonts only provide glyphs for the precomposed code points.
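To make that concrete, a small check with unicodedata (illustrative only) shows how NFD splits precomposed polytonic Greek into base letters plus combining marks, which is why the codec shrinks:

```python
import unicodedata

word = "ᾄσμα"  # polytonic Greek, first letter precomposed (U+1F84)

for form in ("NFC", "NFD"):
    norm = unicodedata.normalize(form, word)
    print(form, len(norm), [f"U+{ord(c):04X}" for c in norm])

# NFC keeps the single precomposed code point; NFD yields the base alpha plus
# three combining marks, so a model's codec only needs entries for the base
# letters and the (reusable) combining characters.
```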
