Extended hocr #283

Open: JKamlah wants to merge 11 commits into master

@JKamlah JKamlah commented Dec 23, 2017

ocropy.json and extended hocr

These changes do not break any existing functionality; they add two new features: 1) json output for each line, 2) hocr output with word boxes and probabilities.
Although the added functionality could (!) replace some of the older code, it does not, so some calculations are done twice.

What will the addition do?

The new code produces a *.ocropy.json file for each line,
which contains:

  • fpath
  • id
  • scale
  • padding
  • bboxes (line, word, char)
  • prob (word, char)
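
For illustration, a per-line record with these fields might look roughly like the sketch below. Only the six top-level field names come from the list above; the nesting, the coordinate order, the example path and all values are assumptions made for the example, not the format the PR actually writes.

```python
import json

# Hypothetical per-line record. Only the six top-level keys are taken from
# the description; nesting, coordinate order and all values are assumed.
line_record = {
    "fpath": "book/0001/010001.bin.png",
    "id": "010001",
    "scale": 1.0,
    "padding": 16,
    "bboxes": {
        "line": [100, 200, 900, 240],  # assumed order: x0, y0, x1, y1
        "word": [[100, 200, 180, 240], [195, 200, 300, 240]],
        "char": [[100, 200, 120, 240]],
    },
    "prob": {
        "word": [0.97, 0.88],
        "char": [0.99],
    },
}


def dump_record(record):
    """Serialize a record to a JSON string with a stable key order."""
    return json.dumps(record, indent=2, sort_keys=True)


print(dump_record(line_record))
```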

This information is used to produce an extended hocr file with:

  • word/char probabilities
  • word bboxes, e.g.
    testpage_with_wordboxes.PNG
    (new hocr file of testpage visualized with hocrjs)
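
As a rough sketch of what a word entry with a box and a probability can look like in hocr: the hOCR format has an `ocrx_word` class and the `bbox` and `x_wconf` properties, with `x_wconf` on a 0-100 scale. Whether this PR emits exactly these names, and the helper name `hocr_word`, are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def hocr_word(text, bbox, prob):
    """Return an hOCR ocrx_word span for one word.

    bbox is (x0, y0, x1, y1) in page pixels; prob is in [0, 1] and is
    written as x_wconf on the 0-100 scale.
    """
    x0, y0, x1, y1 = bbox
    title = "bbox %d %d %d %d; x_wconf %d" % (x0, y0, x1, y1, round(prob * 100))
    # Escape the text so characters like '<' or '&' keep the markup valid.
    return "<span class='ocrx_word' title='%s'>%s</span>" % (title, escape(text))


print(hocr_word("Fraktur", (100, 200, 180, 240), 0.5))
```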

How can it be started?

The scripts take new command-line arguments:

./ocropus-gpageseg 'book/????.bin.png' -j or --json 

If gpageseg is started with -j/--json, it produces the first part
of the *.ocropy.json.

The following steps (ocropus-rpred, ocropus-hocr) detect that a *.ocropy.json file exists and automatically work with it. However, each of these steps can also be suppressed individually with an additional argument:

./ocropus-rpred -Q 4 -m models/fraktur.pyrnn.gz 'book/????/??????.bin.png' --nojson 

This stops rpred from adding further information to the json file.
Note that if this step is skipped, the extended hocr file can't be created. And

./ocropus-hocr 'book/????.bin.png' -o ersch.html -n or --normal

will still create the hocr file the old way (without probabilities or word boxes).

Finally, there is another parameter, -c/--charconfs, in ocropus-hocr to output the confidence of every char. Since this massively increases the amount of data, it is off by default. To use this feature:

./ocropus-hocr 'book/????.bin.png' -o ersch.html -c or --charconfs
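
To show why per-character confidences are opt-in, here is a minimal sketch of a title builder that appends them only on request. `x_wconf` and `x_confs` are hOCR property names; the function `word_title` and its parameters are hypothetical, and the PR's own output may be structured differently.

```python
def word_title(bbox, word_prob, char_probs=None):
    """Build an hOCR title string; add x_confs only if char_probs is given."""
    parts = [
        "bbox %d %d %d %d" % tuple(bbox),
        "x_wconf %d" % round(word_prob * 100),
    ]
    if char_probs is not None:
        # One extra number per character: this is what inflates the file.
        confs = " ".join("%d" % round(p * 100) for p in char_probs)
        parts.append("x_confs " + confs)
    return "; ".join(parts)


print(word_title((1, 2, 3, 4), 0.5))
print(word_title((1, 2, 3, 4), 0.5, char_probs=[1.0, 0.25]))
```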

Have fun and a Merry Christmas 🎄 :)

JKamlah commented Dec 23, 2017

Big thanks @zuphilip for all the support and @mittagessen (https://github.com/mittagessen/kraken) for the inspiring work.

…tions to create an extended hocr file. For more informations see PR: 'Extended hocr'
@mittagessen

Just a small note: You might want to split words at Unicode whitespace characters with something like regex.split('\s+') as there's more than ASCII space out there.

Otherwise it looks fine as the translate_back function has the needed adjustments to not output empty classes, so it shouldn't produce any weirdly offset bounding boxes.
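
A minimal sketch of the whitespace suggestion, using the standard-library `re` module rather than the `regex` package named in the comment; the function name `split_words` is made up for the example. With the `re.UNICODE` flag, `\s` also matches characters such as EM SPACE and NO-BREAK SPACE, which a split on the ASCII space alone would miss.

```python
import re

# \s with re.UNICODE matches all Unicode whitespace, not only ASCII space.
_WS = re.compile(r"\s+", re.UNICODE)


def split_words(line):
    """Split recognized text into words at any run of Unicode whitespace."""
    return [w for w in _WS.split(line) if w]


text = u"ein\u2003Wort\u00a0hier"
print(split_words(text))   # splits at EM SPACE and NO-BREAK SPACE
print(text.split(u" "))    # ASCII-space split leaves the line in one piece
```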
