add option to output <span class=ocr_word> elements to hocr #314

Open
wants to merge 6 commits into
base: master

Conversation

brobertson

@brobertson brobertson commented Oct 25, 2018

This adds a '-w' switch to ocropus-hocr, which will cause it to generate <span class='ocr_word'> elements containing each word's text and validly nested within the appropriate <span class='ocr_line'> element. It depends on the .llocs files generated by ocropus-rpred. If these are not available, or the switch is not turned on, it uses the old behaviour.

It should be noted that text output from ocropus-hocr with and without the -w might differ. In particular, initial and final spaces are stripped from lines when the -w switch is on because this tends to generate poor bounding boxes.

@amitdo
Contributor

amitdo commented Oct 25, 2018

It depends on the .llocs files generated by ocropus-gpageseg

You mean ocropus-rpred.

@brobertson
Author

Thanks, I've edited the comments. I'll look into how to change this in the source code so that the fix is reflected in the pull request.

@kba
Collaborator

kba commented Oct 25, 2018

Looks good, thanks. Will there be whitespace between the word spans? To do html2txt, for screenreaders etc.?

ocropus-hocr Outdated
previous_char_x = char_x
PN("</span>")
except:
E("Data for ocr_word elements is not available. Did you select --llocs in ocropus-gpageseg?")
Collaborator

s/gpageseg/rpred/

PN("</span>")
except:
E("Data for ocr_word elements is not available. Did you select --llocs in ocropus-gpageseg?")
PN(" class='ocr_line' title='%s'>"%info,text,"</span>")
Collaborator

Indentation looks off on GitHub.

@brobertson
Author

Yes, there is whitespace between the elements. There is no whitespace between the final word element and the </span> that closes the ocr_line.
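
The whitespace behaviour described here can be illustrated with a minimal sketch. It is not the code in this pull request: the function name `hocr_line`, its arguments, and the `bbox x0 y0 x1 y1` title (the usual hOCR convention) are assumptions made for illustration only.

```python
from html import escape


def hocr_line(line_bbox, words):
    """Serialize one text line as nested hOCR-style spans.

    line_bbox: (x0, y0, x1, y1) of the whole line (hypothetical input).
    words: list of (text, (x0, y0, x1, y1)) pairs, one per word.
    """
    word_spans = []
    for text, bbox in words:
        word_spans.append(
            "<span class='ocr_word' title='bbox %d %d %d %d'>%s</span>"
            % (tuple(bbox) + (escape(text),))
        )
    # A single space separates word spans; nothing follows the last one,
    # so the line's closing tag comes directly after the final word span.
    return "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>" % (
        tuple(line_bbox) + (" ".join(word_spans),)
    )


if __name__ == "__main__":
    print(hocr_line((0, 0, 100, 20),
                    [("ab", (0, 0, 30, 20)), ("c", (30, 0, 100, 20))]))
```

Joining with a single space keeps the words separable for tools that strip the markup (html2txt, screen readers) without adding trailing whitespace inside the line element.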

Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?

@amitdo
Contributor

amitdo commented Oct 25, 2018

I think kraken also has something like this feature.

@kba
Collaborator

kba commented Oct 25, 2018

Yes, there is whitespace between the elements. There is no whitespace between the final word element and the </span> that closes the ocr_line.

👍

Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?

PEP8. We discussed beautifying the whole code base but decided against it at the time, because it would change every second line and make blaming harder.

@amitdo
Contributor

amitdo commented Oct 25, 2018

Do you take into account the fact that each 'loc' is just one spot that can be at the start / middle / end of a glyph?

@brobertson
Author

Amit --
Excellent question.

Because of this issue, the assigned break between words is the midpoint of the space between them. (This is noted in the code comments.) This ensures, or tries to, that no part of a glyph's connected component (cc) is cut off. It does mean that the word bounding box has some extra space on either side and that each word bbox is adjacent to the next. I feel this is a good compromise, given the data available, since it can be used for retraining, cropping images of words and so forth.
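
As a rough sketch of the midpoint rule (not the code in this pull request), assume each character of a line comes with a single x position, as a list of `(char, x)` pairs. The function name `split_words` and the `line_x0`/`line_x1` arguments are hypothetical. Here the break is taken as the midpoint between the last character of one word and the first character of the next.

```python
def split_words(chars, line_x0, line_x1):
    """Return (word, x_left, x_right) triples for one text line.

    chars: list of (character, x) pairs, one x position per character.
    line_x0, line_x1: left and right edge of the line's bounding box.
    """
    # Group non-space characters into words, remembering the x position
    # of each word's first and last character. Leading and trailing
    # spaces never start a word, so they are effectively stripped.
    groups = []  # each entry: [text, first_x, last_x]
    in_word = False
    for ch, x in chars:
        if ch.isspace():
            in_word = False
            continue
        if in_word:
            groups[-1][0] += ch
            groups[-1][2] = x
        else:
            groups.append([ch, x, x])
            in_word = True

    spans = []
    left = line_x0
    for i, (text, first_x, last_x) in enumerate(groups):
        if i + 1 < len(groups):
            # Break halfway between this word's last character and the
            # next word's first character.
            right = (last_x + groups[i + 1][1]) / 2.0
        else:
            right = line_x1
        spans.append((text, left, right))
        left = right  # adjacent boxes: the next word starts at the break
    return spans
```

With this scheme every word box touches its neighbours, which matches the trade-off described above: some extra space around each word, but no glyph is split at a boundary.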

I'll provide a visualization later today.

@brobertson
Author

[image: out]
This is a visualization of an example of the word breaking behaviour.

The breaks that occur inside words, such as on the first line, are OCR errors: that is, ocropus-rpred finds a space there, so this code dutifully starts a new word. Similarly, the marginal numbers, 655 and 665, are incorrect because of upstream errors. (I find that these marginal numbers sometimes get lost or clipped by gpageseg unless I really jam the column parameters.)

I'm processing a few thousand pages in the next couple of days, and I'll pass them through this process to ensure it doesn't throw errors and check the visualizations for good word breaks.

@brobertson
Author

This is the corresponding plaintext output, verifying my analysis of the errors above:
aΑ ἄΕ
φύει τ' ἄδηλα καὶ φανέντα κρύπτεται·
κοὐκ ἔστ' ἄελπτον οὐδὲν, ἀλλ' ἀλίσκεται
χώ δεινὸς ὅρκος χαἰ περισκελεῖς φρένες.
Κἀγώ γὰρ, ὃς τὰ δείν' ἐκαρτέρουν τότε, 650
βαφῇ σιδηρος ῶς ἐθηλύνθην στόμα
πρὸς τῆσδε τῆς γυναικός· οἰκτίρω δέ νιν
χήραν παρ' ἐχθροῖς παῖδά τ' ὀρφανὸν λιπεῖν.
Ἀλλ' εἴμι πρός τε λουτρὰ καὶ παρακτίους
λειμῶνας, ὡς ἂν λύμαθ' ἀγνίσας ἐμὰ s
μῆνιν βαρεῖαν ἐξαλύξωμαι θεᾶς·
μολών τε χῶρον ἔνθ' ἀν ἀστιβῆ κίχω,
κρύψω τόδ' ἔγχος τοὐμὸν, ἔχθισ.ον βελῶν,
γαίας ὀρύξας ἔνθα μή τις ὅψεται·
ἀλλ' αὐτὸ νὺξ Ἀιδης τε σῳζόντων κάτω. sn0
Σγὼ γὰρ ἐξ οὖ χειρὶ τοῦτ' ἐδεξάμην
παρ' κτορος δώρημα δυσμενεστάτου,
οὔπω τι κεδνὸν ἔσχον Aργείων πάρα·
ἀλλ' ἔοτ' ἀληθὴς ἡ βροτῶν παροιμία·
ἐχθρῶν ἄδωρα δῶρα κοὐκ δνήσιμα.
Τοιγὰρ τὸ λοιπὸν εἰσόμεσθα μὲν θεοῖς
εκειν, μαθησόμεσθα δ' Ἀτρείδας σέβειν.
Ἀρχοντές εἰσιν, βσθ' ὑπεικτέον· τί μή ;
Καὶ γὰρ τὰ δεινὰ καὶ τὰ καρτερώτατα
τιμαῖς ὑπείκει· τοῦτο μὲν νιφοστιβεῖς 57ο
χειμῶνες ἐκχωροῦσιν εὐκάρπῳ θέρει·
ἐξίσταται δὲ νυκτὸς αἰανὴς κύκλος
τῇ λευκοπώλῳ φέγγος ἡμέρᾳ φλέγειν·
δεινῶν τ' ἄημα πνευμάτων ἐκοίμισε
34) 'ει LA, ποιεῖ Stoὸaeus ‖ 343 κοὐx LA, οx Stobaeus, Suidas
‖ 64 χαἰ Br., καὶ LA, Stcbaeus, Suidas ‖ 350 ἐκαρτέρουν τότε libri, γρ.

@brobertson
Author

For what it's worth, it's clear we could improve on this code to generate the 'true' bbox of the word by finding the smallest rectangle around all the ccs within the bbox provided by the routine offered in this pull request. If someone could recommend a library, preferably already imported by Ocropus, that does this or that would be best to modify to this purpose, I'd be happy to work on it for a future pull request.
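
A simplified sketch of that idea, in plain Python and without any library: shrink a word box to the smallest rectangle containing the ink pixels inside it. The function name `tight_bbox` and the image representation (a list of rows, truthy values meaning ink) are assumptions for illustration. Unlike a true connected-component approach, this version ignores whether a component extends beyond the given box.

```python
def tight_bbox(image, bbox):
    """Shrink bbox to the ink pixels it contains.

    image: binary image as a list of rows; truthy values are ink.
    bbox: (x0, y0, x1, y1), with x1 and y1 exclusive.
    Returns a tight (x0, y0, x1, y1), or None if the box holds no ink.
    """
    x0, y0, x1, y1 = bbox
    xs, ys = [], []
    for y in range(y0, y1):
        row = image[y]
        for x in range(x0, x1):
            if row[x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    # +1 keeps the right and bottom edges exclusive, like the input box.
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```

On NumPy arrays the same result could presumably be obtained with connected-component labelling, for example `scipy.ndimage.label` together with `scipy.ndimage.find_objects`, which would also allow handling components that cross the box boundary.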

@amitdo
Contributor

amitdo commented Oct 26, 2018

@kba kba requested a review from zuphilip October 29, 2018 10:12
@zuphilip
Collaborator

Sorry, I am late to look at this PR... Actually, there is another PR #283 by @JKamlah to extend the hocr output which will include word boxes but also probabilities.
