
Ability to connect a web driver to also fetch and render dynamic page content (JavaScript) #1

Open · enhancement
Emulator000 opened this issue on Jul 25, 2024 · 0 comments
Summary

Currently, crawly is designed to efficiently crawl and scrape static web pages while adhering to robots.txt rules.
However, many modern websites generate their content dynamically with JavaScript, which the current crawler cannot capture.

Enhancement proposal

This feature request aims to integrate support for a web driver (such as Selenium) or a headless browser automation tool (such as Puppeteer or Playwright) to enable crawling and rendering of content generated dynamically with JavaScript.

Goals

  1. Dynamic content rendering: use a web driver to fully render pages before scraping, so dynamically generated content is captured;
  2. Integration with existing architecture: seamlessly integrate web driver capabilities into the current Crawler and CrawlerBuilder setup (see the trait sketch after this list);
  3. Respect existing configurations: ensure that the rendering process honors existing settings such as robots.txt rules, rate limits, and depth limits.
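
One way to picture goal 2 is a pluggable rendering abstraction behind the existing fetch path. A minimal sketch, using hypothetical names that are not part of crawly's current API:

use anyhow::Result;

// Hypothetical abstraction (not crawly's actual code): anything that can
// turn a URL into fully rendered HTML.
trait Renderer {
    async fn render(&self, url: &str) -> Result<String>;
}

// The existing static fetch could be one implementation...
struct StaticRenderer;

impl Renderer for StaticRenderer {
    async fn render(&self, url: &str) -> Result<String> {
        // Plain HTTP fetch; no JavaScript is executed.
        Ok(reqwest::get(url).await?.text().await?)
    }
}

// ...and a web-driver-backed renderer would be a second implementation,
// selected when dynamic rendering is enabled in the builder.

With this shape, the robots.txt checks, rate limiting, and depth limits (goal 3) stay in the crawl loop, unaffected by which renderer is active.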

Implementation suggestions

Option 1: Selenium WebDriver
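
As one hedged illustration of this route (none of this is crawly code), the fantoccini crate speaks the WebDriver protocol and can connect to a running Selenium server or chromedriver; the endpoint and target URL below are assumptions:

use anyhow::Result;
use fantoccini::ClientBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // Assumes a WebDriver server (Selenium, chromedriver, geckodriver, ...)
    // is already listening locally; the port is an assumption.
    let client = ClientBuilder::native()
        .connect("http://localhost:4444")
        .await?;

    client.goto("https://example-dynamic.com").await?;

    // source() returns the DOM after JavaScript has run, which is what a
    // dynamic-rendering crawler would hand to its scraping stage.
    let rendered_html = client.source().await?;
    println!("{rendered_html}");

    client.close().await?;
    Ok(())
}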

Option 2: Headless browsers

  • Use a headless browser, driven by a tool like Puppeteer or Playwright, for better performance when rendering and scraping dynamic content;
  • This might involve creating Rust bindings or using existing Rust crates such as headless_chrome or chromiumoxide; a sketch follows below.
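
A minimal sketch of this route using the headless_chrome crate (again, not crawly code; the URL is a placeholder):

use anyhow::Result;
use headless_chrome::Browser;

fn main() -> Result<()> {
    // Launches a headless Chrome/Chromium binary found on the system.
    let browser = Browser::default()?;
    let tab = browser.new_tab()?;

    tab.navigate_to("https://example-dynamic.com")?;
    tab.wait_until_navigated()?;

    // get_content() returns the serialized DOM after scripts have executed.
    let rendered_html = tab.get_content()?;
    println!("{rendered_html}");

    Ok(())
}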

Proposed API changes

Introduce a new setting in CrawlerBuilder to enable dynamic content rendering:

let crawler = CrawlerBuilder::new()
    .with_max_depth(10)
    .with_max_pages(100)
    .with_max_concurrent_requests(50)
    .with_rate_limit_wait_seconds(2)
    .with_robots(true)
    .with_dynamic_rendering(true) // New configuration
    .build()?;
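
Under the hood, the new setting might simply live as a flag on the builder; a minimal sketch, with field and method shapes that are assumptions rather than the crate's actual internals:

// Hypothetical builder change; names are assumptions, not crawly's internals.
pub struct CrawlerBuilder {
    dynamic_rendering: bool,
    // ...existing fields (max_depth, max_pages, robots, ...) elided
}

impl CrawlerBuilder {
    /// Opt in to rendering pages with a web driver before scraping.
    pub fn with_dynamic_rendering(mut self, enabled: bool) -> Self {
        self.dynamic_rendering = enabled;
        self
    }
}

Defaulting the flag to false would keep existing behavior unchanged for current users.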

Example usage

Demonstrate how users would take advantage of the new feature in their projects:

use anyhow::Result;
use crawly::CrawlerBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let crawler = CrawlerBuilder::new()
        .with_max_depth(10)
        .with_dynamic_rendering(true)
        .build()?;

    let results = crawler.start("https://example-dynamic.com").await?;

    for (url, content) in &results {
        println!("URL: {}\nContent: {}", url, content);
    }

    Ok(())
}

Expected benefits

  • Expanded reach: ability to scrape modern sites that heavily rely on client-side JavaScript;
  • Flexibility: users can choose to enable or disable dynamic rendering based on their needs, keeping crawly lightweight for static sites.

Additional context

  • This feature would require additional dependencies and could affect crawling speed and resource usage; one mitigation, sketched after this list, is gating the heavy dependency behind an optional Cargo feature;
  • Integrating third-party web drivers/browsers adds complexity to error handling and debugging that must be managed carefully.
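
If the extra dependency is a concern, the web driver integration could be compiled only on demand; a hypothetical Cargo.toml sketch, where the feature name and crate choice are assumptions:

# Hypothetical feature gating; names are assumptions, not crawly's manifest.
[features]
dynamic-rendering = ["dep:fantoccini"]

[dependencies]
fantoccini = { version = "0.19", optional = true }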

Tracking activity

  • Research and choose web driver solution (Selenium, Puppeteer, Playwright, etc.);
  • Prototype the integration with a basic dynamic page;
  • Implement API changes and configuration options;
  • Develop tests and documentation for the new feature;
  • Solicit feedback from the community and refine the implementation.
Emulator000 added the enhancement label and self-assigned this issue on Jul 25, 2024