Spider Web Scraping



Scrapy is a fast and powerful web crawling and scraping framework: an open-source, collaborative framework for extracting the data you need from websites in a simple yet extensible way, maintained by Zyte (formerly Scrapinghub) and many other contributors.

Web scraping often goes hand in hand with web crawling. A crawler (also known as a robot, spider, bot, etc.) is a program that indexes websites by following links. Starting from a set of initial URLs, it requests each one and takes note of every hyperlink contained in the response, then follows those hyperlinks to repeat the process.

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from its pages (i.e. scrape items). In other words, spiders are where you define the custom behaviour for crawling and parsing pages for a particular site.

  • Introduction
  • Installation and Start
  • Starting Scraping
  • Creating a New Project
  • Running Script
  • Project Management
  • Example
  • Further
  • References

Introduction

Before reading it, please read the warnings in my blog Learning Python: Web Scraping.

Different from Beautiful Soup or Scrapy, pyspider is a powerful spider (web crawler) system in Python:

  • Write script in Python
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
  • RabbitMQ, Beanstalk, Redis and Kombu as message queue
  • Task priority, retry, periodical, recrawl by age, etc…
  • Distributed architecture, Crawl Javascript pages, Python 2&3, etc…

Installation and Start

Use pip to install it:
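For example (the package is published on PyPI under the name pyspider):

```bash
pip install pyspider
```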

Start it with the command below, or run run.py in the module directory:
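For example, to run all components (scheduler, fetcher, processor and webui) in one process:

```bash
pyspider all
```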

Perhaps you might encounter the error below when the webui starts:

There are two solutions to this error (see Error to start webui service).

Change line 209 in the file pyspider/webui/webdav.py (near the end of the file):
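A rough sketch of the change, following the fix commonly cited for wsgidav 3.x (the exact key name is not verified here, so compare with the linked post):

```python
# pyspider/webui/webdav.py, near the end of the file.
# Replace the old wsgidav 2.x style entry:
#     'domaincontroller': NeedAuthController(app),
# with the 3.x style entry (commonly suggested fix; verify against the linked post):
'http_authenticator': {
    'HTTPAuthenticator': NeedAuthController(app),
},
```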

Or you can pin wsgidav to an older version (I do not recommend this option, since a new 3.x version has already been published):
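For example, assuming the incompatibility was introduced in the 3.x series, pinning below 3.0 should be enough:

```bash
pip install "wsgidav<3.0"
```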

After that, you can visit http://localhost:5000/ to use the system. In the directory where you start pyspider, a data directory is auto-generated; it stores the databases of projects, tasks and results.

Get more help information with:
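For example:

```bash
pyspider --help
pyspider all --help
```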

Starting Scraping

Creating a New Project

At first there are no projects on the page. You need to create a new one by clicking the “Create” button. Enter the project name and the URL you want to scrape:

Click the “Create” button to enter the script editing page:

The right panel shows an auto-generated sample script:
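It should look close to the template shown in the pyspider quickstart; the start URL is the one you entered when creating the project (http://example.com/ below is just a placeholder):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        # entry point: queue the start URL you entered when creating the project
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # follow every absolute link found on the index page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # return a dict; it is stored in resultdb by default
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```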

  • def on_start(self) is the entry point of the script. It is called when you click the run button on the dashboard.
  • self.crawl(url, callback=self.index_page) is the most important API here. It adds a new task to be crawled. Most of the options are specified via self.crawl arguments.
  • def index_page(self, response) gets a Response object. response.doc is a pyquery object, which has a jQuery-like API for selecting the elements to be extracted.
  • def detail_page(self, response) returns a dict object as the result. The result is captured into resultdb by default. You can override the on_result(self, result) method to manage the result yourself.

Other configuration:

  • @every(minutes=24*60, seconds=0) is a helper that tells the scheduler that the on_start method should be called every day.
  • @config(age=10 * 24 * 60 * 60) specifies the default age parameter of self.crawl for pages of type index_page (when callback=self.index_page). The age parameter can also be given via self.crawl(url, age=10*24*60*60) (highest priority) or crawl_config (lowest priority).
  • age=10 * 24 * 60 * 60 tells the scheduler to discard the request if the page has been crawled within the last 10 days. By default pyspider will not crawl the same URL twice (it is discarded forever), even if you have modified the code. It is very common for beginners to run the project once, modify the script, and run it a second time: it will not crawl again (read about itag for the solution; see the sketch after this list).
  • @config(priority=2) marks that the detail pages should be crawled first.
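As referenced in the list above, here is a minimal sketch of using itag to force re-crawling after you modify the script (itag is a self.crawl parameter that can also be set project-wide via crawl_config; changing its value invalidates previously crawled tasks):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
        # any string works; change it (e.g. 'v1' -> 'v2') whenever you want
        # URLs that pyspider has already seen to be fetched and parsed again
        'itag': 'v2',
    }
```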

Running Script

If you have modified the script, then click the “save” button.

Click the green “run” button on the left panel. After that, you will find a red 1 above “follows”:

Click the “follows” button to switch to the follows panel. It lists an index page.

Click the green play button on the right of the URL (this will invoke the index_page method). It will list all the URLs in the panel:

We can choose any one of the detail pages and click the green play button on its right (this will invoke the detail_page method). It will show the final result. In this example, we get the title and URL in JSON format.

Project Management

Back on the dashboard, you will find the newly created project. Change the status of the project from “TODO” to “RUNNING”:

Click the “run” button. Then the project will start to run:

The output log in the background:

Click the “Results” button and check all the scraping results:

Click one of the results and the new page will show the result in detail.

Example

The UEFA European Cup Coefficients Database lists links for matches, country rankings and club rankings since season 1955/1956. The sample program below extracts the match data from season 2004/2005 to season 2017/2018.
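The original script is not reproduced here; below is a minimal sketch of the same idea, assuming a hypothetical URL pattern and table layout (the real coefficient-database URLs and CSS selectors must be taken from the site itself):

```python
from pyspider.libs.base_handler import *

# Hypothetical URL pattern: the real coefficient-database URLs differ
# and must be taken from the site itself.
MATCH_URL = 'http://example.com/uefa/matches/{year}.html'


class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        # queue one match-list page per season, 2004/2005 .. 2017/2018
        for year in range(2004, 2018):
            self.crawl(MATCH_URL.format(year=year),
                       callback=self.index_page,
                       save={'season': year})

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # the CSS selector and column layout below are placeholders
        season = response.save['season']
        matches = []
        for row in response.doc('table tr').items():
            cells = [td.text() for td in row('td').items()]
            if len(cells) >= 4:
                matches.append({
                    'home': cells[0],
                    'away': cells[1],
                    'score': cells[2],
                    'round': cells[3],
                })
        return {'season': '%d/%d' % (season, season + 1), 'matches': matches}
```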

Output looks like:

Further

Some web content is becoming more complicated through technologies such as AJAX. The page source then looks different from what you see in the browser: the information you want to extract is not in the HTML of the page. In this case, you will need the browser developer tools (such as the Web Developer Tools in Firefox or Chrome) to find the underlying request and its parameters yourself.
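Once the developer tools reveal the underlying request, you can crawl that endpoint directly. A minimal sketch, assuming a hypothetical JSON endpoint (response.json is the parsed JSON body in pyspider):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # placeholder URL: whatever AJAX endpoint the developer tools revealed
        self.crawl('http://example.com/api/items?page=1',
                   callback=self.json_page)

    def json_page(self, response):
        data = response.json  # pyspider parses the JSON body for you
        return {'titles': [item.get('title') for item in data.get('items', [])]}
```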

Sometimes the web page is too complex to find the underlying API request. For such cases pyspider provides an option to use PhantomJS. To use PhantomJS, you should have PhantomJS installed. If you are running pyspider in all mode, PhantomJS is enabled automatically when its executable is found in the PATH.
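A minimal sketch of enabling it for a single request (fetch_type='js' is the self.crawl parameter that switches the fetch to PhantomJS; the URL is a placeholder):

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # render the page with PhantomJS before the callback sees it
        self.crawl('http://example.com/', callback=self.index_page,
                   fetch_type='js')

    def index_page(self, response):
        # response.doc now reflects the JavaScript-rendered DOM
        return {'title': response.doc('title').text()}
```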

More detailed information about pyspider can be found in the pyspider Official Documentation or on its GitHub page.

References

Only the lazy do not talk about Big Data these days, yet hardly anyone understands what it actually is and how it works. Let's start with the simplest thing: terminology. Big data is a variety of tools, approaches and methods for processing both structured and unstructured data in order to use it for specific tasks and purposes.

The most valuable commodity in the world after time is information.

The term “big data” was introduced by Nature editor Clifford Lynch back in 2008, in a special issue dedicated to the explosive growth of global information volumes. Of course, big data itself existed before that. According to experts, most data streams of over 100 GB per day fall into the category of big data.

Today, this simple term hides just two things: data storage and data processing.

In the modern world, Big Data is a socio-economic phenomenon, tied to the emergence of new technological capabilities for analyzing huge amounts of data.

A typical example of big data is the information coming from various physical experimental installations, for example the Large Hadron Collider, which produces a huge amount of data and does so all the time. The installation continuously produces large volumes of data, and with their help scientists solve many problems in parallel.

Big data appeared in the public space because it started to affect almost everyone, not just the scientific community, where such problems had been tackled for a long time. The technology entered the public sphere when the conversation turned to a very specific number: the roughly 7 billion inhabitants of the planet who gather on social networks and other projects that aggregate people. On YouTube and Facebook the number of users is measured in billions, and the number of operations they perform at any one time is enormous. The data flow in this case is user actions, for example the data that the same YouTube streams over the network in both directions. Processing here means not only interpretation, but also the ability to handle each of these actions correctly, that is, to store it in the right place and make sure the data is available to every user quickly, because social-network users do not tolerate waiting.

With so much information around, the question is how to find the data you need and make sense of it. The task seems impracticable, but with web crawling and web scraping tools it can be done quite easily.

Big data analytics, machine learning, search engine indexing and many other areas of modern data operations require web crawling and web scraping. There is a tendency to use the terms web crawling and web scraping interchangeably, and although they are closely related, there are differences between the two processes.

Spider

A web crawler, sometimes called a “spider,” is a standalone bot that systematically scans the Internet, indexing and searching for content by following internal links on web pages. In general, the term “crawler” refers to a program's ability to navigate web pages on its own, possibly even without a clearly defined end goal, endlessly exploring what a site or network has to offer. Web crawlers are actively used by search engines such as Google, Bing and others to extract the content of a URL, check that page for other links, fetch the URLs of those links, and so on.

Web scraping, on the other hand, is the process of extracting specific data. Unlike a web crawler, a web scraper searches for specific information on specific websites or pages.

Basically, web crawling creates a copy of what is there, and web scraping extracts specific data for analysis or to create something new. However, in order to do web scraping you would first have to do some sort of web crawling to find the information you need. Web crawling also involves a certain degree of scraping, such as saving the keywords, images and URLs of each page.

Web crawling is generally what Google, Yahoo, Bing, etc. do: searching for any kind of information. Web scraping is essentially targeted at specific websites for specific data, e.g. stock market data, business leads or supplier product catalogues.