
go-scrapy

A Scrapy implementation in Go. (Work in progress)

Overview

go-scrapy is a web crawling framework for Go, used to crawl websites and extract structured data from the parsed pages.

Requirements

  • Go 1.x - 1.9.x
  • Works on Linux, Windows, macOS, BSD

Installation

Install:

go get github.com/kabelsea/go-scrapy

Import:

import scrapy "github.com/kabelsea/go-scrapy/scrapy"

Quickstart

package main

import (
  "log"

  scrapy "github.com/kabelsea/go-scrapy/scrapy"
)

func main() {
  // Init spider configuration
  config := &scrapy.SpiderConfig{
    Name:               "HabraBot",
    MaxDepth:           5,
    ConcurrentRequests: 20,
    StartUrls: []string{
      "https://habrahabr.ru/",
    },
    Rules: []scrapy.Rule{
      {
        LinkExtractor: &scrapy.LinkExtractor{
          Allow:        []string{`^/post/\d+/$`},
          AllowDomains: []string{`^habrahabr\.ru`},
        },
        Follow: true,
      },
      {
        LinkExtractor: &scrapy.LinkExtractor{
          Allow:        []string{`^/users/[^/]+/$`},
          AllowDomains: []string{`^habrahabr\.ru`},
        },
        Handler: ProcessItem,
      },
    },
  }

  // Create new spider
  spider, err := scrapy.NewSpider(config)
  if err != nil {
    panic(err)
  }

  // Run spider and wait
  spider.Wait()
}

// Process crawled page
func ProcessItem(resp *scrapy.Response) {
  log.Println("Process item:", resp.Url, resp.StatusCode)
}
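
The handler is where extraction logic goes. Below is a minimal sketch of a richer handler (imports: log, regexp). It is hedged: the resp.Body field used here is an assumption, not a confirmed part of the go-scrapy Response API, so check the package source for the actual accessor before relying on it.

// Sketch only: resp.Body ([]byte) is an assumed field, not confirmed
// by the go-scrapy API; resp.Url and resp.StatusCode are real.
var titleRe = regexp.MustCompile(`(?is)<title>(.*?)</title>`)

func ProcessUser(resp *scrapy.Response) {
  // Pull the page <title> out with a regexp (crude, but dependency-free).
  if m := titleRe.FindSubmatch(resp.Body); m != nil {
    log.Println("User page title:", string(m[1]), "from", resp.Url)
  } else {
    log.Println("No title found on", resp.Url, "status", resp.StatusCode)
  }
}

To wire it in, set Handler: ProcessUser on the matching rule in SpiderConfig, just as ProcessItem is set in the Quickstart above.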

Howto

Please go through the examples to get an idea of how to use this package.

Roadmap

  • Middlewares
  • More examples
  • Tests
