Make the crawler concurrent #91

Open · rajivharlalka opened this issue Sep 19, 2024 · 14 comments

@rajivharlalka (Member)

Is your feature request related to a problem? Please describe.
Currently the crawler sequentially fetches each paper's details, parses them, and downloads the paper. This could be made much faster using goroutines.

@shikharish (Member)

This was implemented and then removed because it led to the library website dropping requests.

@rajivharlalka (Member, Author)

Did the implementation have an upper bound on the number of parallel requests being made? AFAIR, no. IMO, using waitgroups to cap the number of concurrent workers at 2-3 should improve performance significantly.
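A minimal sketch of that bounded-worker idea, using a buffered channel as a semaphore alongside the waitgroup; `fetchAndDownload` and `paperURLs` are hypothetical stand-ins for the crawler's actual fetch/parse/download logic:

```go
package main

import "sync"

// crawlAll fetches all papers with at most 2 requests in flight.
func crawlAll(paperURLs []string, fetchAndDownload func(url string)) {
	var wg sync.WaitGroup
	sem := make(chan struct{}, 2) // buffered channel caps concurrency at 2

	for _, url := range paperURLs {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks while 2 workers are busy
		go func(u string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			fetchAndDownload(u)
		}(url)
	}
	wg.Wait() // wait for all downloads to finish
}
```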

@shikharish (Member) commented Sep 19, 2024

I don't remember exactly.
BTW, we won't need to implement goroutines ourselves: colly has an option to enable async requests and also limit them. Can test that.
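Roughly like this, going by colly's documented `Async` option and `LimitRule` (a sketch assuming colly v2, not tested against the library site; the URLs are placeholders):

```go
package main

import "github.com/gocolly/colly/v2"

func main() {
	// Async(true) makes Visit non-blocking; requests run concurrently.
	c := colly.NewCollector(colly.Async(true))

	// Cap in-flight requests for any matching domain at 2.
	c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

	// Placeholder URLs; the real crawler would queue the paper links here.
	for _, url := range []string{"https://example.com/a", "https://example.com/b"} {
		c.Visit(url)
	}
	c.Wait() // block until all queued async requests complete
}
```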

@proffapt (Member)

@rajivharlalka or @harshkhandeparkar, please update the state of this issue so it is reflected on the kanban.

@harshkhandeparkar (Member)

@shikharish what should be the status of this?

@shikharish (Member)

It is not needed as of now. We only need to run the crawler once or twice a semester, so it's very low priority.

@harshkhandeparkar (Member)

Is it hard to do?

@shikharish (Member)

Not at all

@harshkhandeparkar (Member)

Then just finish it off maybe?

@harshkhandeparkar (Member)

No point in keeping hanging issues if they can be solved in a few minutes.

@shikharish self-assigned this Sep 27, 2024

@proffapt (Member)

@shikharish updates?

@shikharish (Member)

Did some testing, and it turns out even using 2 goroutines leads to 1-2 dropped requests. Increasing to 6 goroutines raises that to 3-4 dropped requests.

Should we skip this one for now?

@proffapt (Member)

Try implementing a retry function. Also, how many requests are you able to make concurrently? Even if it is more than one, that's a win.
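One possible shape for that retry, building on the colly setup sketched earlier: re-issue failed requests from the `OnError` hook via `Request.Retry()`. The attempts map and the cap of 3 retries are assumptions, not existing crawler code.

```go
package main

import (
	"sync"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(colly.Async(true))
	c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

	// Track attempts per URL; async callbacks can fire concurrently,
	// so the map is guarded by a mutex.
	retries := make(map[string]int)
	var mu sync.Mutex

	c.OnError(func(r *colly.Response, err error) {
		mu.Lock()
		defer mu.Unlock()
		url := r.Request.URL.String()
		if retries[url] < 3 { // retry each dropped request up to 3 times
			retries[url]++
			r.Request.Retry() // re-issue the same request
		}
	})

	c.Visit("https://example.com/papers") // placeholder URL
	c.Wait()
}
```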

@proffapt (Member)

Halting this until we have time to look at it comfortably.
