Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly encode output as UTF-8 #237

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nickjwhite
Copy link

I think I've covered all the cases we need to, but it's possible there could be more.

This should address issue #197 and #10.

@kba
Copy link
Collaborator

kba commented Sep 29, 2017

These will only change the debug output. Python3 should already do this but won't break with it. LGTM. @zuphilip ?

@QuLogic
Copy link
Contributor

QuLogic commented Sep 29, 2017

str + bytes does not work; this is broken on Python 3. print also takes str as input; if you encode it, it's going to call str() on it, meaning you get b'...'. I don't see why this is needed.

@nickjwhite nickjwhite force-pushed the safeprint branch 2 times, most recently from 4eddbff to 47d86c4 Compare October 24, 2017 15:49
@nickjwhite
Copy link
Author

Thanks @QuLogic, I don't write a lot of Python, so hadn't spotted the mixing of str + bytes.

I just updated the pull request, with a version that ensures nothing but str is in the print() arguments. And to force UTF-8, I switched to using codecs.getwriter('utf8') for stdout and stderr for all cases where they could output it. I also improved the commit message, so it's clearer what the problem is and why it's worth fixing.

I think I've covered all the cases we need to, but there could be more.

This is needed as if a locale isn't set to UTF-8, an error will result
of this form:

  UnicodeEncodeError: 'ascii' codec can't encode character u'\u0113' in position 60: ordinal not in range(128)

While the ideal solution is for the user to set their locale to UTF-8,
it is better that we print debug output which may not be displayed
correctly than that we output a fatal (and non-obvious) error,
potentially some time into processing.

This also fixes some cases of implicitly combining str and *obj together
when printing debug output, which fails with some Python versions, by
explicitly using str.join(obj).
@nickjwhite
Copy link
Author

Just as an additional justification / use-case for this being necessary, I'm currently using Ocropus on a system (that I don't manage) which uses slurm, and no matter what I try I cannot persuade the slurm job manager on this system to not strip locale information. And again, to reiterate, UTF-8 is always the correct thing to output here, regardless of locale, and no fatal errors should be emitted just because of a weird locale.

@QuLogic
Copy link
Contributor

QuLogic commented Oct 31, 2017

Does setting PYTHONIOENCODING not help?

@nickjwhite
Copy link
Author

@QuLogic, yes, that should do the job as well, I had not found that environment variable before, thanks for the suggestion.

However, I think ocropy should automatically and always output UTF-8, as that's what it's dealing with for input and output, and to do otherwise risks unnecessary runtime crashes. Given the diverse environments Ocropus can run in I think it's best to just enforce this in the program - after all, we have got bug reports from users (#10, #197), and frankly it took me ages to figure out where the issue was too, with the weird HFS system I need to use.

I accidentally missed these from the original commit (c4ae4b).
@kba
Copy link
Collaborator

kba commented Dec 8, 2017

As laid out in #197 there are a few options to achieve this. Revisiting the UTF8/debug output issues, I'm not keen on explicitly adding boilerplate to every file with print statements. Changing stdout/stderr encoding in a single place (like common.py) probably achieves the same goal but is error-prone.

I would prefer to replace print/print_info/print_error with a dedicated logging mechanism, such as python's bundled logging or something smaller in the codebase. That would make it easier to override behavior (such as output stream, encoding, logging level).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants