
Ability to connect a web driver to also fetch and render dynamic page content (JavaScript) #1

Open · enhancement
Emulator000 opened this issue on Jul 25, 2024 · 0 comments
Summary

Currently, crawly is designed to efficiently crawl and scrape static web pages while adhering to robots.txt rules.
However, many modern websites generate their content dynamically with JavaScript, which the current crawler cannot capture.

Enhancement proposal

This feature request aims to integrate support for a web driver (such as Selenium) or a headless browser automation tool (such as Puppeteer or Playwright) to enable crawling and rendering of content generated dynamically with JavaScript.

Goals

  1. Dynamic content rendering: use a web driver to fully render pages before scraping, so dynamically generated content is captured;
  2. Integration with existing architecture: seamlessly integrate web driver capabilities into the current Crawler and CrawlerBuilder setup (see the trait sketch after this list);
  3. Respect existing configurations: ensure that the rendering process honors existing settings such as robots.txt rules, rate limits, and depth limits.
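
One way to picture goal 2 is a pluggable rendering abstraction behind the existing fetch path. A minimal sketch, using hypothetical names that are not part of crawly's current API:

use anyhow::Result;

// Hypothetical abstraction (not crawly's actual code): anything that can
// turn a URL into fully rendered HTML.
trait Renderer {
    async fn render(&self, url: &str) -> Result<String>;
}

// The existing static fetch could be one implementation...
struct StaticRenderer;

impl Renderer for StaticRenderer {
    async fn render(&self, url: &str) -> Result<String> {
        // Plain HTTP fetch; no JavaScript is executed.
        Ok(reqwest::get(url).await?.text().await?)
    }
}

// ...and a web-driver-backed renderer would be a second implementation,
// selected when dynamic rendering is enabled in the builder.

With this shape, the robots.txt checks, rate limiting, and depth limits (goal 3) stay in the crawl loop, unaffected by which renderer is active.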

Implementation suggestions

Option 1: Selenium WebDriver
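
As one hedged illustration of this route (none of this is crawly code), the fantoccini crate speaks the WebDriver protocol and can connect to a running Selenium server or chromedriver; the endpoint and target URL below are assumptions:

use anyhow::Result;
use fantoccini::ClientBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // Assumes a WebDriver server (Selenium, chromedriver, geckodriver, ...)
    // is already listening locally; the port is an assumption.
    let client = ClientBuilder::native()
        .connect("http://localhost:4444")
        .await?;

    client.goto("https://example-dynamic.com").await?;

    // source() returns the DOM after JavaScript has run, which is what a
    // dynamic-rendering crawler would hand to its scraping stage.
    let rendered_html = client.source().await?;
    println!("{rendered_html}");

    client.close().await?;
    Ok(())
}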

Option 2: Headless browsers

  • Use a headless browser, driven by a tool like Puppeteer or Playwright, for better performance when rendering and scraping dynamic content;
  • This might involve creating Rust bindings or using existing Rust crates such as headless_chrome or chromiumoxide; a sketch follows below.
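
A minimal sketch of this route using the headless_chrome crate (again, not crawly code; the URL is a placeholder):

use anyhow::Result;
use headless_chrome::Browser;

fn main() -> Result<()> {
    // Launches a headless Chrome/Chromium binary found on the system.
    let browser = Browser::default()?;
    let tab = browser.new_tab()?;

    tab.navigate_to("https://example-dynamic.com")?;
    tab.wait_until_navigated()?;

    // get_content() returns the serialized DOM after scripts have executed.
    let rendered_html = tab.get_content()?;
    println!("{rendered_html}");

    Ok(())
}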

Proposed API changes

Introduce a new setting in CrawlerBuilder to enable dynamic content rendering:

let crawler = CrawlerBuilder::new()
    .with_max_depth(10)
    .with_max_pages(100)
    .with_max_concurrent_requests(50)
    .with_rate_limit_wait_seconds(2)
    .with_robots(true)
    .with_dynamic_rendering(true) // New configuration
    .build()?;
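
Under the hood, the new setting might simply live as a flag on the builder; a minimal sketch, with field and method shapes that are assumptions rather than the crate's actual internals:

// Hypothetical builder change; names are assumptions, not crawly's internals.
pub struct CrawlerBuilder {
    dynamic_rendering: bool,
    // ...existing fields (max_depth, max_pages, robots, ...) elided
}

impl CrawlerBuilder {
    /// Opt in to rendering pages with a web driver before scraping.
    pub fn with_dynamic_rendering(mut self, enabled: bool) -> Self {
        self.dynamic_rendering = enabled;
        self
    }
}

Defaulting the flag to false would keep existing behavior unchanged for current users.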

Example usage

Demonstrate how users would take advantage of the new feature in their projects:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_dynamic_rendering(true)
        .build()?;

    let results = crawler.start("https://example-dynamic.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Expected benefits

  • Expanded reach: ability to scrape modern sites that heavily rely on client-side JavaScript;
  • Flexibility: users can choose to enable or disable dynamic rendering based on their needs, keeping crawly lightweight for static sites.

Additional context

  • This feature would require additional dependencies and could affect crawling speed and resource usage; one mitigation, sketched after this list, is gating the heavy dependency behind an optional Cargo feature;
  • Integrating third-party web drivers/browsers adds complexity to error handling and debugging that must be managed carefully.
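
If the extra dependency is a concern, the web driver integration could be compiled only on demand; a hypothetical Cargo.toml sketch, where the feature name and crate choice are assumptions:

# Hypothetical feature gating; names are assumptions, not crawly's manifest.
[features]
dynamic-rendering = ["dep:fantoccini"]

[dependencies]
fantoccini = { version = "0.19", optional = true }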

Tracking activity

  • Research and choose web driver solution (Selenium, Puppeteer, Playwright, etc.);
  • Prototype the integration with a basic dynamic page;
  • Implement API changes and configuration options;
  • Develop tests and documentation for the new feature;
  • Solicit feedback from the community and refine the implementation.
Emulator000 added the enhancement label and self-assigned this issue on Jul 25, 2024