Handle non-breaking spaces and other special unicode characters #6
Comments
Not sure if this is the same issue, but I'm getting:
Apparently this is a new strictness introduced by Python 3. Also see:
Thanks for the report, @codinguncut! For now you can work around this issue by parsing the document yourself and passing
The issue is that Scrapy used the Content-Type header to get the encoding ('utf-7'), while the site in fact seems to return utf-8. Scrapy then decodes the body using 'errors=replace' (w3lib_replace, to be precise; see https://github.com/scrapy/w3lib/blob/34435d085c6adb14c94cd0188c23f6dc7d4da0f7/w3lib/encoding.py#L174), and this produces output which can't be encoded back to utf-8 for some reason. I think the right place to fix it is probably w3lib. html-text can provide extra robustness by using surrogateescape, but it would be better to get a proper unicode body before passing it to html_text.
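To make the surrogateescape idea concrete, here is a minimal standard-library sketch. It is not code from html-text, w3lib, or Scrapy, and the helper names `decode_body` and `encodes_to_utf8` are hypothetical. It only shows one known way a decoded string can fail to encode back to utf-8 (a lone surrogate code point); whether that is what happens for this particular site is not confirmed in this thread.

```python
def decode_body(body: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes, mapping undecodable bytes to
    lone surrogates (U+DC80..U+DCFF) instead of raising."""
    return body.decode(encoding, errors="surrogateescape")


def encodes_to_utf8(text: str) -> bool:
    """Hypothetical helper: True if text can be strictly encoded as utf-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A well-formed utf-8 body with a non-breaking space round-trips cleanly.
good = "price:\u00a010".encode("utf-8")
good_text = decode_body(good)

# A body with an invalid byte decodes to a string holding a lone surrogate,
# which strict utf-8 encoding rejects.
bad = b"abc\xff"
bad_text = decode_body(bad)

# Encoding with the same error handler restores the original bytes.
restored = bad_text.encode("utf-8", errors="surrogateescape")
```

Here `encodes_to_utf8(bad_text)` is false while `restored == bad` holds, so surrogateescape keeps the data lossless but leaves a string that strict encoders still refuse. That is why getting a correctly decoded body upstream is the cleaner fix.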
FTR, response.css / response.xpath also don't work for this website.
See discussion in #2 (comment)