fcs.crawler.crawler

This module contains the Crawler Unit implementation.

KEEP_STATS_SECONDS

Time in seconds after which old statistics are removed.

class CrawlerState

Stores attributes defining Crawler Unit states.

UNDEFINED

Crawler Unit’s state is undefined.

WORKING

Crawler Unit is working.

WAITING

Crawler Unit is idle.

CLOSING

Crawler Unit is being closed.

class Crawler(web_server, event, port, manager_address, max_content_length=1024 * 1024, handle_robots=False)

Crawler Unit implementation.

Parameters:
  • web_server (object) – Object that represents the Crawler Unit, used for communication with the Task Server.
  • event (threading.Event) – Instance of the threading.Event class used for synchronization at the end of the Crawler Unit’s work.
  • port (int) – Port of this Crawler Unit.
  • manager_address (string) – Address of Manager module.
  • max_content_length (int) – Maximum size of content.
  • handle_robots (bool) – Flag indicating whether the Crawler Unit should handle robots.txt.
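
The event parameter follows the usual threading.Event hand-off: the unit sets the event when its work ends, and the owner blocks on it. A minimal sketch of that pattern is shown here; crawler_worker and run_and_wait are hypothetical stand-ins, not part of fcs.

```python
import threading


def crawler_worker(exit_event, results):
    """Hypothetical stand-in for the Crawler Unit's work."""
    # The real unit would crawl links here; this sketch only records a marker.
    results.append("done")
    # Signal whoever owns the event that the unit has finished.
    exit_event.set()


def run_and_wait(timeout=5.0):
    """Start the worker and block until it signals the end of its work."""
    exit_event = threading.Event()
    results = []
    worker = threading.Thread(target=crawler_worker, args=(exit_event, results))
    worker.start()
    # wait() returns True if the event was set before the timeout expired.
    finished = exit_event.wait(timeout)
    worker.join()
    return finished, results
```

Waiting with a timeout keeps the owner from blocking forever if the unit never reaches its shutdown path.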

Queue of packages of links to crawl. Each package contains: package ID, links to crawl, Task Server’s (i.e. package sender) address, MIME type of data to crawl.

browser

Object for visiting the web pages (an instance of mechanize.Browser).

uuid

Crawler Unit’s UUID.

stats_reset_time

Object used to compute the time period over which the crawler efficiency statistics are collected.

List of tuples with statistics on processed links (the number of processed links and the time it took to crawl them).

Puts links package into queue.

Parameters:
  • link_package (list) – Package of links to put into queue. Each package is a list containing the following information: package ID, links to crawl, Task Server’s (i.e. package sender) address, MIME type of data to crawl.

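
As an illustration of the package layout described above, the following sketch builds a four-element list in the stated field order and passes it through a standard queue.Queue. The helper names, the addresses, and the use of queue.Queue are assumptions made for this example; the actual queue type used by Crawler is not stated here.

```python
import queue


def make_link_package(package_id, links, server_address, mime_type):
    """Build a package in the documented order: ID, links, sender address, MIME type."""
    return [package_id, list(links), server_address, mime_type]


def put_package(link_queue, link_package):
    """Hypothetical helper: validate the package shape, then enqueue it."""
    if len(link_package) != 4:
        raise ValueError("expected 4 fields: ID, links, sender address, MIME type")
    link_queue.put(link_package)


link_queue = queue.Queue()
package = make_link_package(
    1,                            # package ID
    ["http://example.com/"],      # links to crawl
    "http://127.0.0.1:8800",      # Task Server address (made-up value)
    "text/html",                  # MIME type of data to crawl
)
put_package(link_queue, package)

# The consumer unpacks the fields in the same order.
package_id, links, server_address, mime_type = link_queue.get()
```
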
get_stats(seconds)

Returns statistics from the given time period.

Parameters:
  • seconds (int) – Length of the time period in seconds; statistics collected since (now - seconds) are returned.
Returns: Statistics from the given time period (the number of crawled links and the total time it took to crawl them).
Return type: dict

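
One way to picture the time-window computation is the sketch below. It assumes each statistics record is a tuple of (finish timestamp, number of links, seconds spent); the timestamp field, the dictionary keys, and the value chosen for KEEP_STATS_SECONDS are assumptions made for this example, not taken from the module.

```python
import time

KEEP_STATS_SECONDS = 900  # assumed value; the real constant is not given here


def prune_old_stats(records, now, keep_seconds=KEEP_STATS_SECONDS):
    """Drop records that finished more than keep_seconds before now."""
    cutoff = now - keep_seconds
    return [record for record in records if record[0] >= cutoff]


def get_stats(records, seconds, now=None):
    """Sum the link counts and crawl times recorded since (now - seconds)."""
    if now is None:
        now = time.time()
    since = now - seconds
    recent = [record for record in records if record[0] >= since]
    return {
        "links_num": sum(record[1] for record in recent),      # hypothetical key
        "seconds_taken": sum(record[2] for record in recent),  # hypothetical key
    }
```

Passing now explicitly makes the window easy to check with fixed timestamps, while the default falls back to the current time.
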
get_address()

Returns this Crawler Unit’s full address (including the port number).

Returns: Crawler Unit’s address
Return type: string

get_state()

Returns the Crawler Unit’s state.

Returns: Crawler Unit’s state
Return type: CrawlerState

stop()

Stops the Crawler Unit.

kill()

Kills the Crawler Unit.

run()

Main Crawler Unit loop.
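
The interplay of the queue, the states, and the exit event could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual fcs loop: MiniCrawler, the numeric state values, the polling timeout, and the points at which the state changes are all hypothetical.

```python
import queue
import threading


class CrawlerState:
    # Numeric values are assumed for this sketch.
    UNDEFINED, WORKING, WAITING, CLOSING = range(4)


class MiniCrawler:
    """Hypothetical, heavily simplified stand-in for Crawler."""

    def __init__(self, exit_event):
        self.link_queue = queue.Queue()
        self.exit_event = exit_event
        self.state = CrawlerState.UNDEFINED
        self.processed = []
        self._stop_requested = False

    def stop(self):
        # Ask the loop to finish after the current iteration.
        self._stop_requested = True

    def run(self):
        while not self._stop_requested:
            try:
                package = self.link_queue.get(timeout=0.05)
            except queue.Empty:
                # Nothing to crawl: the unit is idle.
                self.state = CrawlerState.WAITING
                continue
            self.state = CrawlerState.WORKING
            # Placeholder for crawling the links in the package.
            self.processed.append(package)
        self.state = CrawlerState.CLOSING
        # Let the owner know the unit has finished its work.
        self.exit_event.set()
```

Polling the queue with a short timeout lets the loop notice a stop request even when no packages arrive.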