Attach document image to tweet #29

rodolfolottin · 2018-04-08T18:39:53Z

What is the purpose of this Pull Request?
The purpose is address #6 .

What was done to achieve this purpose?
I edited the Dockerfile in order to be able to use opencv and the Wand python library. Later on, with a help from this post I've implemented a method to retrieve the reimbursement's pdf from the camara website, cast it to a png format and then call the function that crops the white areas from an image.

Why is it still in work in progress?
I need some help to check if I've correctly implemented the tests, maybe I should have better exploited the fixtures and mocks functionalities.

…function

rodolfolottin · 2018-04-08T18:53:34Z

Oh, it seems that the urlopen mock isn't correctly working. Is there a way to access the CI environment?

cuducos

Hi @rodolfolottin – many thanks for this PR!

I made minro comments on style – use them as you feel useful, dump them as you feel I'm just being annoying.

The main points I would like to feedback are:

1

Regarding my comment on the Dockerfile, any ideas about how to improve maintainability of this file?

2

The crop.py would be considerably enhanced with comments and docstrings making the reasoning transparent for humans : )

3

Regarding your comment:

Oh, it seems that the urlopen mock isn't correctly working.

Reading the CI logs I think that is not the case. The error is ModuleNotFoundError: No module named 'cv2', which seems not to be related to the mock. I'm afraid we need to install opencv in the CI using .travis.yml. What do you think?

cuducos · 2018-04-09T11:47:33Z

Dockerfile

+  make VERBOSE=1 && \
+  make && \
+  make install && \
+  rm -rf /opt/opencv-${OPENCV_VERSION}


I think this a lot of code just to install OpenCV: I see two alternatives here.

The first one is to avoid adding too many layers (too many separated RUNcommands). This makes the image grows big and makes it difficult to maintain (after a couple of weeks/months people might wonder which lines to remove if we don't need openCV anymore). This solution might be as simple as:

… RUN apk add --no-cache --virtual build-base \ && apk add --no-cache --virtual libxml2-dev \ && apk add --no-cache --virtual libxslt-dev \ && apk add --no-cache --virtual imagemagick \ && apk add --no-cache --virtual imagemagick-dev \ && apk add --no-cache --virtual ghostscript \ && mkdir -p /usr/include/libxml \ && ln -s /usr/include/libxml2/libxml/xmlexports.h /usr/include/libxml/xmlexports.h \ && ln -s /usr/include/libxml2/libxml/xmlversion.h /usr/include/libxml/xmlversion.h ENV CC /usr/bin/clang ENV CXX /usr/bin/clang++ ENV OPENCV_VERSION=3.1.0 RUN <one single run command with all required commands chained> …

Another alternative is to use an image that already has Python 3.6, OpenCV and the data science kit (NumPy, Pandas, scikit-learn) instead of the simple python:3.6.3-alpine. Not sure if it exists though.

https://hub.docker.com/r/continuumio/anaconda3/ may be a good alternative.

cuducos · 2018-04-09T11:49:29Z

tests/targets/test_twitter.py

        message = (
            '🚨Gasto suspeito de Dep. @DepEduardoCunha (RJ). '
            'Você pode me ajudar a verificar? '
            'https://jarbas.serenata.ai/layers/#/documentId/10 '
            '#SerenataDeAmor na @CamaraDeputados'
        )
-        self.assertEqual(message, self.subject.text())
+        reimbursement_image = None


As far as I could see you defined reimbursement_image but never uses it. Is that correct?

It's a mistake, I'll remove it.

I had some doubts about how to test the methods that call the tweet_image method. What do you think? Should I test for the success case (a properly file-like object is returned) in this method?

I'd mock the tweet_image and assert that it was called:

@patch.object(Post, 'tweet_image') def test_something_else(self, tweet_image_mock): # do magic tweet_image_mock.assert_called_once_with()

cuducos · 2018-04-09T11:50:08Z

tests/targets/test_twitter.py

@@ -1,5 +1,6 @@
 import datetime
 from unittest import TestCase, mock
+from io import BytesIO, BufferedReader


As a matter of style, would you mind putting sorting this? I mean, from io… coming before from unittest….

cuducos · 2018-04-09T11:53:32Z

tests/targets/test_twitter.py

+            mock_response_read = BytesIO(pdf_fixture.read())
+        urlopen_mock.return_value = mock_response_read
+        self.assertIsInstance(
+            self.subject.tweet_image(), BufferedReader)


I have the impression that this block could be simpler:

with open('tests/fixtures/10.pdf', 'rb') as pdf_fixture: mock_response = pdf_fixture mock_response_read = BytesIO(pdf_fixture.read()) urlopen_mock.return_value = mock_response_read self.assertIsInstance(self.subject.tweet_image(), BufferedReader)

You create pdf_fixture, something that already has .read() method, to rename it and add it to a BytesIO to read it. What about this, does it looks like it's gonna work?

with open('tests/fixtures/10.pdf', 'rb') as mock_response: urlopen_mock.return_value = mock_response self.assertIsInstance(self.subject.tweet_image(), BufferedReader)

I got this error when trying to just assign this variable to the urlopen_mock.return_value:

...Exception ignored in: <bound method Resource.__del__ of <wand.image.Image: (empty)>> Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/wand/resource.py", line 232, in __del__ self.destroy() File "/usr/local/lib/python3.6/site-packages/wand/image.py", line 2767, in destroy for i in range(0, len(self.sequence)): TypeError: object of type 'NoneType' has no len() F.......... ====================================================================== FAIL: test_tweet_image (targets.test_twitter.TestPost) ---------------------------------------------------------------------- Traceback (most recent call last): File "/usr/local/lib/python3.6/unittest/mock.py", line 1179, in patched return func(*args, **keywargs) File "/usr/src/app/tests/targets/test_twitter.py", line 163, in test_tweet_image self.assertIsInstance(self.subject.tweet_image(), BufferedReader) AssertionError: None is not an instance of <class '_io.BufferedReader'> ----------------------------------------------------------------------

When debugging this code I came to the conclusion that in this way the urlopen isn't correctly mocked.

I could simplify the test writing it like down below, but I would still have to read the content of the opened file file and cast it to a BytesIO instance. What do you think about it?

@mock.patch('whistleblower.targets.twitter.urllib.request.urlopen') def test_tweet_image(self, urlopen_mock): with open('tests/fixtures/10.pdf', 'rb') as mock_response: urlopen_mock.return_value = BytesIO(mock_response.read()) self.assertIsInstance(self.subject.tweet_image(), BufferedReader)

Looks great : ) Way better!

cuducos · 2018-04-09T11:54:39Z

tests/targets/test_twitter.py

+            self.subject.tweet_image(), BufferedReader)
+
+        urlopen_mock.side_effect = Exception()
+        self.assertIsNone(self.subject.tweet_image())


This could be a new test. The first assert would be test_tweet_image_success and this one test_tweet_image_error, for example.

Sure! I did in this way because the other tests have both the success and error cases in the same method.

cuducos · 2018-04-09T11:55:07Z

whistleblower/helpers/crop.py

+KERNEL_HEIGHT = 15
+
+
+def remove_borders(image, threshold, max_width, max_height):


Would you mind adding doctrings to explain a bit more about these arguments?

This code is mine. I'll improve them.

Many thanks, @begnini : )

cuducos · 2018-04-09T12:00:20Z

whistleblower/helpers/crop.py

+
+        total = image[:, width - i - 1].sum() / 255
+        if total > threshold:
+            image[:, i - 1] = numpy.ones(height) * 255


We could extract this logic into a helper function _remove_border(img, size, max_size, threshold) in the DRY philosophy. So remove_borders could be simpler and we could document further what happens in _remove_border:

def remove_borders(image, threshold, max_width, max_height): height. width, *_ = image.shape image = _remove_border(image, width, max_width, threshold) image = _remove_border(image, height, max_height, threshold) return image

And in _remove_border you could document the rational underneath it ; )

The indexes is different. The first loop look at columns, the other looks at lines. To break in one function, the code should rotate the image and this will be a lot weird. I can refactor to create a remove_column_borders and remove_line_borders ou something like that.

True story! I missed that, my bad. Just ignore my comment then : )

cuducos · 2018-04-09T12:01:00Z

whistleblower/helpers/crop.py

+DEFAULT_HEIGHT = 1100
+
+KERNEL_WIDTH  = 25
+KERNEL_HEIGHT = 15


What are the meaning of these constants?

In theory, these are the distance (in pixels) between letters in the document to be considered the same line. They are used in the mathematical operations which glue the letters of the same line together. I'll add these comments.

cuducos · 2018-04-09T12:04:09Z

whistleblower/helpers/crop.py

+        w = max(x0 + w0, x1 + w1) - x
+        h = max(y0 + h0, y1 + h1) - y
+
+        expanded = [x, y, w, h]


Roughly from line 50 and on you started to always use a blank line between instructions. I think a better use of blank likes could enhance the readability of this code making it more meaningful ; )

They are grouped by similarity . For example, computing width and height, are always grouped, x0 and y0 too (is the same computing in different variables). I think if we add more blank lines will loose a little of reading.

I didn't mean adding more spaces, but removing them : )

For example, grouping by similarity isn't clear in this snnipet:

rectangles = sorted(rectangles, key=lambda x:x[4], reverse=True) rectangles = rectangles[1:]

They are defining the same variable, so I expected these lines to be grouped, not separated.

Anyway: those are one of the minor comments. Feel free to ignore them anyway and altogether : )

cuducos · 2018-04-09T12:08:41Z

whistleblower/targets/twitter.py

+
+                with open(temp.name, 'rb') as cropped_file:
+                    cropped_image = cropped_file
+        except:


Bare except: is not recommended because it blurs different erros that might occur. Long blocks of code in the try: is also not recommended for the same reason.

Can you refactor these lines (188-202) using shorter blocks of code (usually a single line, but not strictly necessary) in the try blocks and specifying the errors expected in the opening of except block?

rodolfolottin · 2018-04-15T23:51:28Z

Thanks for the feedback, @cuducos.

Currently I'm working on the Dockerfile improvement. I'll keep you updated.

… image

rodolfolottin · 2018-04-23T00:29:25Z

Hi!

I updated the code with some of the requested changes.

The Dockerfile now is using the anaconda image, the one that @Irio mentioned.
About the try blocks: I removed some code from the block. As most of this code is just a cast from a type to another, I could'nt figure out which kind of exceptions they can raise, once that the urlopen correctly retrieves the content from the url.
There is still a error in the tests and I think that the urlopen isn't correctly being mocked. Can I have your help with it? @cuducos

I manually tested the new Docker image and it worked fine. I could be able to retrieve the pdf file, cast it to a png format and crop it. I also opened the image to check if the crop worked and it was fine.

cuducos · 2018-04-23T00:48:33Z

tests/targets/test_twitter.py

+    def test_tweet_image_success(self, urlopen_mock):
+        with open('tests/fixtures/10.pdf', 'rb') as mock_response:
+            urlopen_mock.return_value = BytesIO(mock_response.read())
+        self.assertIsInstance(self.subject.tweet_image(), BufferedReader)


Regarding this test (lines 157-161): I haven't ran this code, but I might help with two comments – let me know if they are helpful ; )

The mock

Simplifying the source code, it reads:

response = urllib.request.urlopen(self.camara_image_url()) image_bin = Image(file=response).make_blob('png')

So basically what you want to is whatever response means in the Image call. You want it to be a file object I guess. Thus you might not need to write the contents to BytesIO and directly use the result of the open context manager:

with open('tests/fixtures/10.pdf', 'rb') as mock_response: urlopen_mock.return_value = mock_response self.assertIsInstance(self.subject.tweet_image(), BufferedReader)

Note that the assertion happens within the with block.

The error

The traceback tells me that there's a higher chance that the error is not related to the urlopen mock per se, but to the object it returns. Let's dig in the last few lines of the error:

File "/home/travis/build/okfn-brasil/whistleblower/whistleblower/targets/twitter.py", line 194, in tweet_image image_bin = Image(file=response).make_blob('png') File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/wand/image.py", line 2740, in __init__ self.read(file=file, resolution=resolution) File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/wand/image.py", line 2822, in read self.raise_exception() File "/home/travis/virtualenv/python3.6.3/lib/python3.6/site-packages/wand/resource.py", line 222, in raise_exception raise e wand.exceptions.DelegateError: Postscript delegate failed `/tmp/magick-QbjrFeu8': No such file or directory @ error/pdf.c/ReadPDFImage/677

This tells us that when you call, in the test Image(file=response).make_blob('png'), internally wand tries to read this file with self.read(file=file, resolution=resolution). This reading attempt fails raising an exception that tells us there's no file /tmp/magick-QbjrFeu8.

Not sure what this file is, but as you're mocking the urlopen response with BytesIO (which is something in memory, not in the file system) this might be the real problem. So I'd try to do as I suggested earlier because the fixture is in the file system. Alternatively you might need to create a file for that.

cuducos · 2018-04-23T01:03:04Z

whistleblower/targets/twitter.py

+            with open(temp.name, 'rb') as cropped_file:
+                cropped_image = cropped_file
+
+        return cropped_image


Does this code work? I mean, when I exit a context manager created with NamedTemporaryFile I expect the file do be deleted. It might work because the cropped_image is still in memory, but I think this design is not neat. Does the `crop function takes a BytesIO instead? This method could return an image as bytes, for example.

Alternatively you could write your own context manager to have more control and to be more explicit about the temporary file deletion:

… from contextlib import contextmanager … class Tweet: … @contextmanager def receipt_pdf_as_png(self, pdf_response): image_bin = Image(file= pdf_response).make_blob('png') numpy_array = np.frombuffer(image_bin, np.uint8) # write PNG to a temp file temp = NamedTemporaryFile(delete=False, suffix='.png') with open(temp.name, 'wb') as fobj: crop(numpy_array, temp.name) # return PNG as a file object with open(temp.name, 'rb') as fobj: yield fobj # delete temporary file os.remove(temp.name) def tweet_image(self): try: response = urllib.request.urlopen(self.camara_image_url()) except urllib.error.HTTPError: return None with self.receipt_pdf_as_png(response) as image: return image

rodolfolottin · 2018-04-28T17:34:43Z

I updated the code with the changes related to your last review. Thanks for it @cuducos, they make totally sense.

About the actual error: it seems that the imagemagick package isn't able to open the image. First I thought that there were some missing packages in anaconda, then I updated the code with some of them but the errors still remains. Since I opened the PR I didn't get any of these errors locally, except the one from the last changes (#29 (comment)) and because of it I am unable to test it and make sure that it works before updating the pull request. Sorry about the failing integrations.

rodolfolottin · 2018-05-05T15:59:51Z

Yey! It seems that it's all ok! I got some help from @silviodc, who came up with a solution that he developed to rosie.

rodolfolottin · 2018-07-12T15:02:04Z

Hi guys! @begnini do you have any follow up about this?

rodolfolottin added 13 commits February 3, 2018 14:03

Update Dockerfile with imagemagick packages

d8c8e46

Rename text() method and create a method to get all the tweet data

d082972

Implements method to build the camara image url

5ad5a34

Add wand in requirements.txt

1b054ae

Update Dockerfile with opencv and ghostscript

7822531

Crop a given png file in order to remove its white areas

768c1bc

Add image pdf and png to fixtures

653275f

Change camara url

3bccd45

Add tweet_image call in publish method

eb0d8cd

Implement method to retrieve the reimbursement pdf and call the crop …

e5a53dc

…function

Add the image to tweet_data

f0b16cb

Resolve merge conflicts

1f5fd5f

Add opencv to requirements.txt file

38c2b5d

cuducos suggested changes Apr 9, 2018

View reviewed changes

Irio added the work in progress label Apr 12, 2018

rodolfolottin added 3 commits April 15, 2018 14:23

Sort imports descending

32be9f4

Remove unused var

5f0cae7

Specify HTTPError except when trying to retrieve the reimbursement image

b0dc505

rodolfolottin added 4 commits April 22, 2018 20:15

Start using anaconda docker image instead of the simple alpine python…

c841e32

… image

Add apt-get command to install libmagickwand-dev

e88e8f2

Change adduser parameter to add a system user

fa7a805

Remove code from try block

622268a

cuducos reviewed Apr 23, 2018

View reviewed changes

rodolfolottin added 4 commits April 28, 2018 13:25

Remove unused fixtures

60d87be

Refactor tweet_image and tests

23f997b

Update Dockerfile with ghostscript related packages

2be5a44

Update Dockerfile with imagemagick package

0c065d8

rodolfolottin added 3 commits May 4, 2018 12:45

Add apt-get update to Dockerfile script

3c4d8e0

Update Dockerfile and travis.yml with Wand required packages

1a5d446

Update .travis.yml dist and scripts

0fccf2e

rodolfolottin changed the title ~~WIP: Attach document image to tweet~~ Attach document image to tweet May 5, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attach document image to tweet #29

Attach document image to tweet #29

rodolfolottin commented Apr 8, 2018

rodolfolottin commented Apr 8, 2018

cuducos left a comment

cuducos Apr 9, 2018

Irio Apr 12, 2018

cuducos Apr 9, 2018

rodolfolottin Apr 15, 2018

cuducos Apr 16, 2018

cuducos Apr 9, 2018

cuducos Apr 9, 2018

rodolfolottin Apr 15, 2018 •

edited

Loading

cuducos Apr 16, 2018

cuducos Apr 9, 2018

rodolfolottin Apr 15, 2018

cuducos Apr 9, 2018

begnini Apr 9, 2018

cuducos Apr 10, 2018

cuducos Apr 9, 2018

begnini Apr 9, 2018

cuducos Apr 10, 2018

cuducos Apr 9, 2018

begnini Apr 10, 2018

cuducos Apr 9, 2018

begnini Apr 9, 2018

cuducos Apr 10, 2018

cuducos Apr 9, 2018

rodolfolottin commented Apr 15, 2018

rodolfolottin commented Apr 23, 2018 •

edited

Loading

cuducos Apr 23, 2018

cuducos Apr 23, 2018

rodolfolottin commented Apr 28, 2018 •

edited

Loading

rodolfolottin commented May 5, 2018

rodolfolottin commented Jul 12, 2018

		KERNEL_HEIGHT = 15


		def remove_borders(image, threshold, max_width, max_height):

Attach document image to tweet #29

Are you sure you want to change the base?

Attach document image to tweet #29

Conversation

rodolfolottin commented Apr 8, 2018

rodolfolottin commented Apr 8, 2018

cuducos left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rodolfolottin Apr 15, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rodolfolottin commented Apr 15, 2018

rodolfolottin commented Apr 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rodolfolottin commented Apr 28, 2018 • edited Loading

rodolfolottin commented May 5, 2018

rodolfolottin commented Jul 12, 2018

rodolfolottin Apr 15, 2018 •

edited

Loading

rodolfolottin commented Apr 23, 2018 •

edited

Loading

rodolfolottin commented Apr 28, 2018 •

edited

Loading