Added the option for character accumulated glyph confidences. #1851

Merged (1 commit) on Aug 20, 2018

noahmetzger
Contributor

The parameter glyph_confidences is changed from bool to int.
An execution with value 1 outputs the hOCR file enriched with glyph confidences
for every timestep like before. An execution with value 2 outputs the timesteps
accumulated over the recognized characters.

Signed-off-by: Noah Metzger [email protected]
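
To illustrate the difference between the two values, here is a minimal, self-contained sketch of what "accumulating timesteps over a recognized character" could look like. It is not the code from this PR: the type aliases `Choice` and `Timestep`, the function `AccumulateTimesteps`, and the merge rule (keep the highest confidence seen per symbol) are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: one candidate symbol with its confidence,
// and the list of candidates emitted at a single LSTM timestep.
using Choice = std::pair<std::string, float>;
using Timestep = std::vector<Choice>;

// Merge the timesteps in [begin, end) into a single candidate list.
// Assumed merge rule: keep the highest confidence seen for each symbol.
// The result is sorted by descending confidence.
Timestep AccumulateTimesteps(const std::vector<Timestep>& steps,
                             std::size_t begin, std::size_t end) {
  Timestep merged;
  for (std::size_t t = begin; t < end && t < steps.size(); ++t) {
    for (const Choice& c : steps[t]) {
      auto it = std::find_if(
          merged.begin(), merged.end(),
          [&c](const Choice& m) { return m.first == c.first; });
      if (it == merged.end()) {
        merged.push_back(c);  // first time this symbol appears
      } else {
        it->second = std::max(it->second, c.second);
      }
    }
  }
  std::stable_sort(
      merged.begin(), merged.end(),
      [](const Choice& a, const Choice& b) { return a.second > b.second; });
  return merged;
}
```

With value 1 a caller would see every entry of `steps` separately; with value 2 it would see one merged list per recognized character, where `begin` and `end` mark the timesteps that belong to that character.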

@egorpugin egorpugin merged commit 621a8cd into tesseract-ocr:master Aug 20, 2018
@@ -508,7 +508,7 @@ Tesseract::Tesseract()
 STRING_MEMBER(page_separator, "\f",
 "Page separator (default is form feed control character)",
 this->params()),
-BOOL_MEMBER(glyph_confidences, false,
+INT_MEMBER(glyph_confidences, 0,
 "Allows to include glyph confidences in the hOCR output",
Contributor

Noah, could you please add help information here on the valid values for glyph_confidences?

@noahmetzger
Contributor Author

I heard from @bertsky that some people would like to give this feature a more appropriate name. Since the 4.0.0 release is coming up soon, we should decide on this quickly. Any suggestions? @kba

@zdenop
Contributor

zdenop commented Oct 16, 2018

Can you please do it today or tomorrow?
If the parameter is valid for hOCR only, it would be great if the "hocr" label were part of the (new) parameter name.

@bertsky
Contributor

bertsky commented Oct 16, 2018

Considerations for (re)naming TessBaseAPI::GetGlyphConfidences() and Tesseract::glyph_confidences:

  • the API does not use "glyph" but "symbol". Since this subsumes whitespace characters as well, "glyph" seems even more non-intuitive.
  • the API already gives confidences by other means (TessBaseAPI::AllWordConfidences(), LTRResultIterator::Confidence() ...), but this is really about alternatives, or "choices" in API terminology
  • this affects behaviour with LSTM globally, so the variable should have the prefix lstm (which is already present for lstm_use_matrix)
  • the low-level (pixel-based) timesteps of LSTMs exposed via glyph_confidences=1 should probably be covered via a new RIL_TIMESTEPS instead; this would require GetGlyphConfidences() to take a RIL argument and glyph_confidences to become a mere boolean again. The new RIL would have to be included in LTRResultIterator as well, and its GetChoiceIterator() would then become available on 2 levels.

@bertsky
Contributor

bertsky commented Oct 16, 2018

What is more: a function to retrieve the full matrix of LSTM predictions could be very helpful for applications like keyword spotting or post-correction. In contrast to GetGlyphConfidences(), which as of now is a timesteps vector of an n-best output dimensions vector (sorted by probability), this would require a fixed (two-dimensional) array of the full Unicharset.
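
The relationship between the two representations can be sketched as follows. This is only an illustration under stated assumptions: `PredictionMatrix` and `NBestPerTimestep` are hypothetical names, not part of the Tesseract API, and the charset here is a plain list of strings standing in for the Unicharset. The dense matrix has one row per timestep and one column per charset entry; the n-best form keeps only the top `n` entries of each row, sorted by probability.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical candidate type: symbol plus probability.
using Choice = std::pair<std::string, float>;

// Hypothetical dense LSTM output: probs[t][c] is the probability of
// charset[c] at timestep t. Every row covers the full charset.
struct PredictionMatrix {
  std::vector<std::string> charset;
  std::vector<std::vector<float>> probs;
};

// Reduce each row of the dense matrix to its n most probable symbols,
// sorted by descending probability (the "n-best per timestep" form).
std::vector<std::vector<Choice>> NBestPerTimestep(const PredictionMatrix& m,
                                                  std::size_t n) {
  std::vector<std::vector<Choice>> result;
  result.reserve(m.probs.size());
  for (const std::vector<float>& row : m.probs) {
    std::vector<Choice> choices;
    const std::size_t width = std::min(row.size(), m.charset.size());
    for (std::size_t c = 0; c < width; ++c) {
      choices.emplace_back(m.charset[c], row[c]);
    }
    const std::size_t keep = std::min(n, choices.size());
    std::partial_sort(
        choices.begin(),
        choices.begin() + static_cast<std::ptrdiff_t>(keep), choices.end(),
        [](const Choice& a, const Choice& b) { return a.second > b.second; });
    choices.resize(keep);  // drop everything below the top n
    result.push_back(std::move(choices));
  }
  return result;
}
```

The reduction is lossy: the n-best form can be derived from the dense matrix, but not the other way round, which is why applications such as keyword spotting would need the full matrix.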

@amitdo
Collaborator

amitdo commented Oct 16, 2018

What is more: a function to retrieve the full matrix of LSTM predictions could be very helpful...

+1

ocropus-archive/DUP-ocropy#279

@zdenop
Contributor

zdenop commented Oct 16, 2018

Thanks for the analysis. If you want to change the name, please send a PR ASAP.
After 4.0.0 is released, changing the name would be a problem...

@bertsky
Contributor

bertsky commented Oct 16, 2018

Ok then. I am not familiar with the release schedule, but I reckon it might take too long to get the additional RIL right. So how about splitting this up into two separate stages/PRs (one merely naming, the other more involved and with fewer chances of making it on time)?

@zdenop
Contributor

zdenop commented Oct 17, 2018

I am fine with it. Please send a PR or post a patch here for the renaming ASAP. There are 1-2 open topics for 4.0.0, i.e. we would like to release it this week or next.

@noahmetzger
Contributor Author

Would it be fine to replace glyph_confidences by lstm_symbol_alternatives?
I did not add hocr because it is also valid for the API which is used by the Python wrapper for example.

@zdenop
Contributor

zdenop commented Oct 17, 2018

yes.

@bertsky
Contributor

bertsky commented Oct 17, 2018

How about TessBaseAPI::GetChoices() and Tesseract::lstm_choice_mode?

@noahmetzger
Contributor Author

I am not sure whether lstm_choice_mode is too ambiguous as a description of what it does. Any other opinions on this? @kba

@bertsky
Contributor

bertsky commented Oct 17, 2018

Well, I am certainly not an expert on the Tesseract API, but (as stated above) "choices" is the term used for this so far, not "alternatives". And currently it can be either "symbols" or timesteps. Lastly, "mode" appears in various places.

@noahmetzger
Contributor Author

All right then, lstm_choice_mode.
