Scrapy-Based Random Link Crawler


Code introduction


This function uses the Scrapy library to crawl random links from a given starting URL. It defines a CrawlSpider subclass, sets its allowed domains and start URLs, and declares a crawling Rule whose LinkExtractor extracts and follows links (sketched in isolation below), yielding the URL of each page it reaches. The function then runs the spider with a CrawlerProcess.


Technology Stack : Scrapy

Code Type : Scrapy Crawler

Code Difficulty : Intermediate
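
Before the full function, a small standalone sketch of what the LinkExtractor inside the Rule does may help: given an HTML response, its extract_links() method returns Link objects whose url attributes are the hyperlinks the spider would follow. The HTML snippet and URLs below are made up purely for illustration.

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# Hypothetical page containing a single link, used only to show the API.
html = b'<html><body><a href="https://example.com/page1">Page 1</a></body></html>'
response = HtmlResponse(url='https://example.com/', body=html, encoding='utf-8')

for link in LinkExtractor().extract_links(response):
    print(link.url)  # prints https://example.com/page1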


                
                    
def crawl_random_links(start_url):
    from urllib.parse import urlparse

    from scrapy.crawler import CrawlerProcess
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RandomLinksSpider(CrawlSpider):
        name = 'random_links_spider'
        # Restrict the crawl to the domain of the starting URL rather than a
        # hard-coded value, so the spider works for any start_url.
        allowed_domains = [urlparse(start_url).netloc]
        start_urls = [start_url]

        # Follow every link found on each page and hand the response to parse_item.
        rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Emit the URL of each page reached by the crawl.
            yield {'url': response.url}

    # Run the spider in-process with a browser-like User-Agent string.
    process = CrawlerProcess(settings={
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    })

    process.crawl(RandomLinksSpider)
    process.start()  # blocks until the crawl finishes
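
A minimal usage sketch is shown below. The starting URL is a placeholder; substitute a page on a site you are permitted to crawl. Because CrawlerProcess starts Twisted's reactor, which cannot be restarted, the function can only be called once per Python process.

if __name__ == '__main__':
    # Placeholder URL for illustration; the spider stays on this domain.
    crawl_random_links('https://example.com/')

To persist the yielded items, a FEEDS entry such as 'FEEDS': {'links.json': {'format': 'json'}} could be added to the settings dictionary passed to CrawlerProcess (supported in Scrapy 2.1 and later).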
              