add option to output <span class=ocr_word> elements to hocr #314

brobertson · 2018-10-25T12:21:52Z

This adds a '-w' switch to ocropus-hocr, which causes it to generate <span class=ocr_word> elements, each containing a word's text and validly nested within the enclosing ocr_line element. It depends on the .llocs files generated by ocropus-rpred. If these are not available, or the switch is not turned on, it falls back to the old behaviour.

Note that the text output from ocropus-hocr may differ with and without -w. In particular, leading and trailing spaces are stripped from lines when the -w switch is on, because they tend to produce poor bounding boxes.

amitdo · 2018-10-25T12:43:22Z

> It depends on the .llocs files generated by ocropus-gpageseg

You mean ocropus-rpred.

brobertson · 2018-10-25T12:54:47Z

Thanks, I've edited the comments. I'll look into how to change this in the source code so that the correction carries over to the pull request.

kba · 2018-10-25T12:58:19Z

Looks good, thanks. Will there be whitespace between the word spans, e.g. for html2txt conversion, screen readers, etc.?

kba · 2018-10-25T12:56:03Z

ocropus-hocr

+                            previous_char_x = char_x
+			PN("</span>")
+		    except:
+			E("Data for ocr_word elements is not available. Did you select --llocs in ocropus-gpageseg?")


s/gpageseg/rpred/

kba · 2018-10-25T12:57:02Z

ocropus-hocr

+			PN("</span>")
+		    except:
+			E("Data for ocr_word elements is not available. Did you select --llocs in ocropus-gpageseg?")
+			PN(" class='ocr_line' title='%s'>"%info,text,"</span>")


Indentation looks off on GitHub.

brobertson · 2018-10-25T13:05:06Z

Yes, there is whitespace between the elements. There is no whitespace between the final word's closing </span> and the </span> that closes the ocr_line.
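The whitespace behaviour described above can be illustrated with a minimal Python 3 sketch. It is not the code from this pull request: `render_line` and its arguments are hypothetical names, and the only thing taken from the hOCR convention is the `bbox x0 y0 x1 y1` form of the `title` attribute.

```python
from html import escape


def render_line(line_bbox, words):
    """Render one hOCR line (hypothetical helper, not part of ocropus-hocr).

    line_bbox: (x0, y0, x1, y1) of the line.
    words: list of (text, (x0, y0, x1, y1)) pairs in reading order.
    """
    spans = [
        "<span class='ocr_word' title='bbox %d %d %d %d'>%s</span>"
        % (tuple(bbox) + (escape(text),))
        for text, bbox in words
    ]
    # Joining with a single space puts whitespace between word spans only,
    # so nothing separates the last word span from the closing line span.
    return "<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>" % (
        tuple(line_bbox) + (" ".join(spans),)
    )


line = render_line((0, 0, 100, 20),
                   [("foo", (0, 0, 40, 20)), ("bar", (45, 0, 100, 20))])
print(line)
```

Because the separator sits between the word spans, a text extractor that strips tags still sees "foo bar" as two words.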

Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?

amitdo · 2018-10-25T13:13:07Z

I think kraken also has something like this feature.

kba · 2018-10-25T13:16:06Z

> Yes, there is whitespace between the elements. There is no whitespace between the final word's closing </span> and the </span> that closes the ocr_line.

👍

> Apropos formatting, is there a beautifier command that I can run the code through to conform to this project?

PEP8. We discussed beautifying the whole code base but decided against it at the time, because it would change every second line and make blaming harder.

amitdo · 2018-10-25T13:52:39Z

Do you take into account the fact that each 'loc' is just one spot that can be at the start, middle, or end of a glyph?

brobertson · 2018-10-25T14:18:27Z

Amit --
Excellent question.

Because of this issue, the assigned break between words is the midpoint of the space between them. (This is noted in the code comments.) This ensures, or tries to ensure, that no part of a glyph's connected component (cc) is cut off. It does mean that each word bounding box has some extra space on either side and is adjacent to the next. I feel this is a good compromise, given the data available, since the boxes can be used for retraining, cropping images of words, and so forth.

I'll provide a visualization later today.
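As a rough illustration of the midpoint rule described above, here is a minimal Python sketch. It is not the code from this pull request. It assumes each character of a line comes with a single x position (as read from a .llocs file, whose exact layout is not shown here), and it interprets "the midpoint of the space" as the midpoint between the last loc of one word and the first loc of the next. The function name `word_boxes` and its arguments are hypothetical.

```python
def word_boxes(chars, line_x0, line_x1):
    """Split one line into words with adjacent horizontal extents.

    chars: list of (character, x) pairs, left to right (assumed input form).
    line_x0, line_x1: left and right edge of the line's bounding box.
    Returns a list of (word, x_start, x_end) tuples.
    """
    # Group consecutive non-space characters, remembering first and last x.
    # Leading and trailing spaces start no word, so they are dropped.
    groups = []  # each entry: [text, first_x, last_x]
    in_word = False
    for ch, x in chars:
        if ch.isspace():
            in_word = False
        elif in_word:
            groups[-1][0] += ch
            groups[-1][2] = x
        else:
            groups.append([ch, x, x])
            in_word = True

    boxes = []
    left = line_x0
    for i, (text, first_x, last_x) in enumerate(groups):
        if i + 1 < len(groups):
            # Break halfway between this word's last loc and the next word's first.
            right = (last_x + groups[i + 1][1]) / 2.0
        else:
            right = line_x1
        boxes.append((text, left, right))
        left = right  # the next word starts exactly where this one ends
    return boxes


print(word_boxes([(" ", 0), ("a", 10), ("b", 20), (" ", 30), ("c", 40)], 0, 60))
```

Each word's right edge is the next word's left edge, which is why the resulting boxes are adjacent and carry some extra space on either side.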

brobertson · 2018-10-26T11:43:18Z

This is a visualization of an example of the word breaking behaviour.

The breaks that occur inside words, such as on the first line, are OCR errors: that is, ocropus-rpred finds a space there, so this code dutifully starts a new word. Similarly, the marginal numbers, 655 and 665, are incorrect because of upstream errors. (I find that these marginal numbers sometimes get lost or clipped by gpageseg unless I really jam the column parameters.)

I'm processing a few thousand pages in the next couple of days, and I'll pass them through this process to ensure it doesn't throw errors and check the visualizations for good word breaks.

brobertson · 2018-10-26T11:46:23Z

This is the corresponding plaintext output, verifying my analysis of the errors above:
aΑ ἄΕ
φύει τ' ἄδηλα καὶ φανέντα κρύπτεται·
κοὐκ ἔστ' ἄελπτον οὐδὲν, ἀλλ' ἀλίσκεται
χώ δεινὸς ὅρκος χαἰ περισκελεῖς φρένες.
Κἀγώ γὰρ, ὃς τὰ δείν' ἐκαρτέρουν τότε, 650
βαφῇ σιδηρος ῶς ἐθηλύνθην στόμα
πρὸς τῆσδε τῆς γυναικός· οἰκτίρω δέ νιν
χήραν παρ' ἐχθροῖς παῖδά τ' ὀρφανὸν λιπεῖν.
Ἀλλ' εἴμι πρός τε λουτρὰ καὶ παρακτίους
λειμῶνας, ὡς ἂν λύμαθ' ἀγνίσας ἐμὰ s
μῆνιν βαρεῖαν ἐξαλύξωμαι θεᾶς·
μολών τε χῶρον ἔνθ' ἀν ἀστιβῆ κίχω,
κρύψω τόδ' ἔγχος τοὐμὸν, ἔχθισ.ον βελῶν,
γαίας ὀρύξας ἔνθα μή τις ὅψεται·
ἀλλ' αὐτὸ νὺξ Ἀιδης τε σῳζόντων κάτω. sn0
Σγὼ γὰρ ἐξ οὖ χειρὶ τοῦτ' ἐδεξάμην
παρ' κτορος δώρημα δυσμενεστάτου,
οὔπω τι κεδνὸν ἔσχον Aργείων πάρα·
ἀλλ' ἔοτ' ἀληθὴς ἡ βροτῶν παροιμία·
ἐχθρῶν ἄδωρα δῶρα κοὐκ δνήσιμα.
Τοιγὰρ τὸ λοιπὸν εἰσόμεσθα μὲν θεοῖς
εκειν, μαθησόμεσθα δ' Ἀτρείδας σέβειν.
Ἀρχοντές εἰσιν, βσθ' ὑπεικτέον· τί μή ;
Καὶ γὰρ τὰ δεινὰ καὶ τὰ καρτερώτατα
τιμαῖς ὑπείκει· τοῦτο μὲν νιφοστιβεῖς 57ο
χειμῶνες ἐκχωροῦσιν εὐκάρπῳ θέρει·
ἐξίσταται δὲ νυκτὸς αἰανὴς κύκλος
τῇ λευκοπώλῳ φέγγος ἡμέρᾳ φλέγειν·
δεινῶν τ' ἄημα πνευμάτων ἐκοίμισε
34) 'ει LA, ποιεῖ Stoὸaeus ‖ 343 κοὐx LA, οx Stobaeus, Suidas
‖ 64 χαἰ Br., καὶ LA, Stcbaeus, Suidas ‖ 350 ἐκαρτέρουν τότε libri, γρ.

brobertson · 2018-10-26T15:37:33Z

For what it's worth, it's clear we could improve on this code to generate the 'true' bbox of the word by finding the smallest rectangle around all the ccs within the bbox provided by the routine offered in this pull request. If someone could recommend a library, preferably already imported by Ocropus, that does this or that would be best to modify to this purpose, I'd be happy to work on it for a future pull request.
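The refinement proposed above can be sketched in plain Python as follows. This is a simplified illustration, not a proposed implementation: it shrinks a loose box to the smallest rectangle around the ink pixels inside it, and it does not follow connected components that extend past the loose box (a labelling routine such as `scipy.ndimage.label` would be needed for that). The function name `tighten_bbox` and the list-of-rows image representation are assumptions made for the example.

```python
def tighten_bbox(image, bbox):
    """Shrink bbox to the smallest rectangle around ink pixels inside it.

    image: list of rows, each a list of 0/1 values (1 = ink); assumed form.
    bbox: (x0, y0, x1, y1) with x1 and y1 exclusive.
    Returns the tightened (x0, y0, x1, y1), or None if the box holds no ink.
    """
    x0, y0, x1, y1 = bbox
    xs, ys = [], []
    for y in range(y0, y1):
        for x in range(x0, x1):
            if image[y][x]:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    # +1 keeps the right and bottom edges exclusive, matching the input.
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(tighten_bbox(page, (0, 0, 6, 4)))
```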

amitdo · 2018-10-26T15:47:31Z

https://docs.scipy.org/doc/scipy/reference/ndimage.html

https://github.com/tmbdev/ocropy/blob/d3e5cc60b64d/ocrolib/morph.py

zuphilip · 2018-10-29T18:07:38Z

Sorry, I am late to look at this PR... Actually, there is another PR #283 by @JKamlah to extend the hocr output which will include word boxes but also probabilities.

add option to output <span class=ocr_word> elements to hocr

84d8f6c

kba approved these changes Oct 25, 2018

correct the origin of llocs file

9a0d07b

amitdo mentioned this pull request Oct 26, 2018

Noise characters recognized with bbox as the entire page tesseract-ocr/tesseract#1192

Open

kba requested a review from zuphilip October 29, 2018 10:12

deal with corner case which sometimes lops off final character in word

885fb17

zuphilip added the ✨ enhancement label Jan 13, 2019

brobertson added 3 commits July 23, 2019 11:04

use the edge of the space as the beginning of word. (Classifiers vary as to if they use leading or trailing edge so this is the best we can do.)

403c07c

add more char substitutions

1f5e8af

use decomposed unicode always

2619e62
