This repository has been archived by the owner on Sep 18, 2018. It is now read-only.

Attach document image to tweets #6

Open
Irio opened this issue Jul 3, 2017 · 19 comments

Comments

@Irio
Collaborator

Irio commented Jul 3, 2017

Following @talespaiva's suggestion in a tweet and recent replies to
@RosieDaSerenata.

Docs for Twitter API: https://python-twitter.readthedocs.io/en/latest/_modules/twitter/api.html?highlight=%22def%20PostUpdate%22

@cuducos
Collaborator

cuducos commented Jul 3, 2017

A possible roadmap:

  1. convert the receipt from PDF to PNG
  2. crop the white paper areas (sometimes a small receipt is scanned on a full A4 page)
  3. upload the PNG to someplace like Imgur
  4. add the PNG URL to the tweet

@paulocezar

I'll give this one a try.

@silviodc

silviodc commented Oct 7, 2017

Hi @paulocezar

To convert the PDF to an image, I suggest you take a look at the notebooks from okfn-brasil/serenata-de-amor#238

That way you can speed up your work by focusing on cropping the image and uploading it. The necessary libraries (OpenCV, Wand...) are at the end of the Dockerfile. Try looking into segmentation techniques to crop the receipt :)
Looking forward to seeing your PR ;)

@murilobsd

Hi,
How is this issue progressing? Maybe the trim function http://docs.wand-py.org/en/0.3-maintenance/wand/image.html#wand.image.Image.trim might help!

@silviodc

silviodc commented Oct 19, 2017

Thanks for sharing it with us, @murilobsd ;)

In addition, I suggest that whoever implements this take a look at these libraries for uploading the images :)

https://pypi.python.org/pypi/python-tumblpy/1.1.4
https://github.com/Imgur/imgurpython

@CauanCabral

Why not upload the image to Twitter? That would keep Twitter as the only external service dependency.

https://developer.twitter.com/en/docs/media/upload-media/api-reference/post-media-upload

@cuducos
Collaborator

cuducos commented Nov 9, 2017

Why not upload image to twitter?

I think this is better/easier/simpler than using a third-party service for hosting images (unless @silviodc has other uses for the storage in mind). What is needed, IMHO, is an implementation such as:

  1. Get the reimbursement data needed to build the receipt URL (applicant_id, year and document_id) and concatenate it into the receipt URL
  2. Try to fetch the PDF
  3. If that succeeds, convert it to PNG
  4. Crop it
  5. Add to the tweet with the API @CauanCabral linked
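
Step 1 above could be sketched roughly as follows. This is only a minimal illustration: the URL template and the function name `receipt_url` are assumptions for the sake of the example, not something confirmed in this thread, so the template is passed as a parameter and should be checked against the real receipt address.

```python
# Assumed URL pattern for receipts; verify it before relying on it.
RECEIPT_URL_TEMPLATE = (
    "http://www.camara.gov.br/cota-parlamentar/documentos/publ/"
    "{applicant_id}/{year}/{document_id}.pdf"
)


def receipt_url(applicant_id, year, document_id, template=RECEIPT_URL_TEMPLATE):
    """Build the receipt URL from the three reimbursement fields (hypothetical helper)."""
    return template.format(
        applicant_id=applicant_id, year=year, document_id=document_id
    )


# Example with made-up identifiers.
print(receipt_url(1789, 2015, 5631380))
```

Keeping the template as an argument means the string concatenation can be tested without touching the network.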

@rodolfolottin

Hi! Does anyone need some help with this issue?

@cuducos
Collaborator

cuducos commented Jan 24, 2018

AFAIK there's no one working on that, @rodolfolottin – make yourself at home : )

@rodolfolottin

Ok @cuducos . I'll give it a try.

@rodolfolottin

So, I have some doubts about how to test this functionality. The first one is: how can I get some data, given that the tests use mocks? I know that I can get the PDF directly from the Câmara's website, but I also want to see the data from each reimbursement tuple.

Another one is: what should my tests cover? I get that I should test the tweet content that is going to be posted, but what about the fetched PDF? And the blank area that I have to crop, how can/should I test that? Should I use an example PDF as a fixture?

Many thanks!

@cuducos
Collaborator

cuducos commented Jan 29, 2018

Hi @rodolfolottin, let me recap the roadmap drafted above:

  1. Get the reimbursement data needed to build the receipt URL (applicant_id, year and document_id) and concatenate it into the receipt URL
  2. Try to fetch the PDF
  3. If that succeeds, convert it to PNG
  4. Crop it
  5. Add to the tweet with the API @CauanCabral linked

Given these steps, this is my 2c:

In steps 1 and 5 we're responsible for generating the right calls to external services, but not for managing the calls themselves. What I mean is that:

  • In step 1 we must assert we're generating the proper URL to download the PDF and passing it to the download function (for example, urllib.request.urlretrieve)
  • In step 5 we must assert we're properly calling the Twitter API with the image attached

That said, I'd say that in step 1 we can mock the download method and:

  1. Assert it's called with the proper URL
  2. Use a fixture as its response, so we have a real PDF file to test steps 2, 3 and 4

Then we must mock the Twitter API call and assert that we're calling it with the image as an attachment.
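
A rough sketch of that testing strategy with `unittest.mock` might look like the code below. Everything named here is hypothetical: `tweet_with_receipt`, the `example.org` URL pattern, the fixture paths and the reimbursement dict are placeholders, and the `PostUpdate(status, media=...)` call only assumes the python-twitter method linked at the top of this issue. For brevity the PDF-to-PNG conversion is mocked too, whereas the idea above is to run it on a real fixture PDF.

```python
from unittest import mock


def receipt_url(applicant_id, year, document_id):
    # Placeholder URL pattern, for illustration only.
    return "https://example.org/receipts/{}/{}/{}.pdf".format(
        applicant_id, year, document_id
    )


def tweet_with_receipt(reimbursement, api, download, to_png):
    """Hypothetical pipeline: build URL, download PDF, convert it, tweet it."""
    url = receipt_url(
        reimbursement["applicant_id"],
        reimbursement["year"],
        reimbursement["document_id"],
    )
    pdf_path, _headers = download(url)  # same return shape as urlretrieve
    png_path = to_png(pdf_path)
    api.PostUpdate(reimbursement["text"], media=png_path)
    return png_path


def check_tweet_with_receipt():
    # The download mock answers with a fixture path instead of hitting the network.
    download = mock.Mock(return_value=("fixtures/receipt.pdf", None))
    to_png = mock.Mock(return_value="fixtures/receipt.png")
    api = mock.Mock()
    reimbursement = {
        "applicant_id": 1789,
        "year": 2015,
        "document_id": 5631380,
        "text": "Suspicious expense",
    }

    tweet_with_receipt(reimbursement, api, download, to_png)

    # Step 1: the download function received the proper URL.
    download.assert_called_once_with(
        "https://example.org/receipts/1789/2015/5631380.pdf"
    )
    # Steps 2 to 4 would run on the fixture PDF in a real test.
    to_png.assert_called_once_with("fixtures/receipt.pdf")
    # Step 5: the Twitter API was called with the image attached.
    api.PostUpdate.assert_called_once_with(
        "Suspicious expense", media="fixtures/receipt.png"
    )
    return True


check_tweet_with_receipt()
```

Passing `download` and `api` in as arguments is just one way to make them easy to replace; patching them with `mock.patch` would work as well.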

Does that make sense to you all?

@rodolfolottin

rodolfolottin commented Feb 4, 2018

Sorry for the late answer.

Yeah, @cuducos . Thanks again!

Now I'm working on cropping the image using Wand, but it hasn't been easy. My first approach was to crop the image based on its background color. As the image is a scan, the whole background has the same white color, including the receipt itself. I'm looking at related problems, but the ones I've found generally involve two different colors, which makes it easier to tell the regions apart.

Edit: just thinking out loud, but maybe, IDK, I could parse the rows and columns of the image and crop the pixels based on the presence of a color other than white.
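
The row and column idea could look something like this toy sketch. It assumes the page is already available as a list of rows of grayscale values (0 to 255); `content_bbox`, `crop`, `white_threshold` and `min_dark` are made-up names, and real scans would probably need something more robust.

```python
def content_bbox(pixels, white_threshold=225, min_dark=1):
    """Return (left, top, right, bottom) around rows/columns that hold
    at least `min_dark` non-white pixels, or None for a blank page.
    `right` and `bottom` are exclusive."""
    dark_per_row = [sum(1 for v in row if v < white_threshold) for row in pixels]
    dark_per_col = [sum(1 for v in col if v < white_threshold) for col in zip(*pixels)]
    rows = [i for i, count in enumerate(dark_per_row) if count >= min_dark]
    cols = [i for i, count in enumerate(dark_per_col) if count >= min_dark]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


def crop(pixels, bbox):
    """Cut the bounding box out of the pixel grid."""
    left, top, right, bottom = bbox
    return [row[left:right] for row in pixels[top:bottom]]


# A 6x5 white page with a 2x2 dark block and one stray dark pixel.
page = [[255] * 6 for _ in range(5)]
for y, x in [(1, 2), (1, 3), (2, 2), (2, 3)]:
    page[y][x] = 0
page[4][5] = 0  # scanning noise

print(content_bbox(page))              # the stray pixel stretches the box
print(content_bbox(page, min_dark=2))  # ignoring sparse rows/columns
```

Requiring more than one dark pixel per row or column is a crude way to ignore isolated specks; it is an assumption of this sketch, not something tested on real receipts.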

@cuducos
Collaborator

cuducos commented Feb 5, 2018

Hi @rodolfolottin,

I see two non-exclusive possibilities here:

  1. Ask people in the Telegram group if they have any experience in automatically cropping scanned images (because scanning always leaves some pixels here and there, and I think a simple approach based on color won't work)
  2. Baby steps: we put the image in production without cropping and add cropping as a feature later ; )

@silviodc

silviodc commented Feb 5, 2018

Hi guys,

One question about the crop of images.

Doesn't the function mentioned by @murilobsd (trim) work?

trim(color=None, fuzz=0)
Parameters:
  • color (Color) – the border color to remove. If it's omitted, the top left pixel is used by default
  • fuzz (numbers.Integral) – defines how much tolerance is acceptable to consider two colors as the same

PS: In your case you would call it without color and choose the fuzz value empirically.
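
To make the role of the tolerance concrete, here is a toy pure-Python illustration of border trimming with a fuzz value. It is not Wand's implementation: it assumes a grayscale image given as a list of rows with values from 0 to 255, and `trim_box` is a made-up name. As in the docs above, the top left pixel is taken as the border color.

```python
def trim_box(pixels, fuzz=0):
    """Return (left, top, right, bottom) around every pixel whose value
    differs from the top left pixel by more than `fuzz`, or None if
    nothing differs. `right` and `bottom` are exclusive."""
    border = pixels[0][0]
    coords = [
        (x, y)
        for y, row in enumerate(pixels)
        for x, value in enumerate(row)
        if abs(value - border) > fuzz
    ]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# A noisy "white" scan with a small dark receipt in the middle.
scan = [
    [250, 252, 249, 251],
    [251, 40, 60, 250],
    [250, 245, 50, 252],
    [249, 251, 250, 250],
]

print(trim_box(scan, fuzz=0))    # noise counts as content, nothing is trimmed
print(trim_box(scan, fuzz=10))   # noise is tolerated, only the receipt remains
print(trim_box(scan, fuzz=255))  # everything counts as border
```

With no tolerance the scanning noise prevents any trimming, while a tolerance that is too high treats the receipt itself as border, which is why the value has to be tuned on real images.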

@rodolfolottin

rodolfolottin commented Feb 5, 2018

Hi @silviodc and @cuducos . Thanks for your help.

@cuducos, as cropping the image was the hard part for me, I decided to go for it and run some tests. Because of that, I don't have the other part done yet, but I can work on finishing it.

@silviodc, in my tests with this function I used white as the color and didn't get what I expected. For this image, the more I increase the fuzz value, the more of the image (the invoice) gets cropped away. I got the same results with and without passing white as the color. And since the invoice is sometimes not centered in the scanned image, I didn't feel safe going on with this approach. As an example, I am attaching an image cropped with a fuzz value of 50%.

Here it is.

I'm taking @cuducos's advice about baby steps and will worry about the cropping function later.

@silviodc

silviodc commented Feb 5, 2018

Hi @rodolfolottin
Thanks for the feedback. Maybe this weekend I will try to combine some edge detection and crop... I will let you know if it works.

@CauanCabral

Hey, today I asked a friend who works with image processing for help, and he suggested using OpenCV for this.

The response: https://twitter.com/begnini/status/960547129264615425
StackOverflow related link: https://pt.stackoverflow.com/a/265916

Both are in Portuguese.

@begnini

begnini commented Feb 6, 2018

Hi,

My friend @CauanCabral pointed me to this issue and I played a little with the documents. I work with digitized documents and I know some are hard to manipulate, so what I made is good but not pixel perfect.

As a proof of concept, I downloaded 100 PDFs from the jarbas.serenata.ai home page and extracted all the images from them with pdfimages. After that, I processed these images.

You can see the result here: https://github.com/begnini/document_crop/blob/master/crop.md. The code is in the same repository (https://github.com/begnini/document_crop/blob/master/crop.py).

I'll improve the documentation later, but if you have any questions, feel free to ask me.
