fcs.crawler.crawler

This module contains Crawler Unit implementation.

KEEP_STATS_SECONDS: Time in seconds after which old statistics are removed.

class CrawlerState

Stores attributes defining Crawler Unit states.

class Crawler(web_server, event, port, manager_address, max_content_length=1024 * 1024, handle_robots=False)

Crawler Unit implementation.

Parameters:

web_server (object) – Object that represents the Crawler Unit; used for communication with the Task Server.
event (threading.Event) – Instance of the threading.Event class, used for synchronization at the end of the Crawler Unit’s work.
port (int) – Port of this Crawler Unit.
manager_address (string) – Address of the Manager module.
max_content_length (int) – Maximum size of content (defaults to 1024 * 1024).
handle_robots (bool) – Flag indicating whether the Crawler Unit should handle robots.txt.
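
The documentation does not say how the Crawler Unit handles robots.txt when handle_robots is set. As a rough illustration of what such handling usually involves, the sketch below filters a list of links against robots.txt rules using Python's standard urllib.robotparser module. The helper allowed_links, the sample rules, and the example URLs are assumptions made for this example and are not part of fcs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_links(links, robots_txt, user_agent="*"):
    """Return only the links that robots_txt permits user_agent to fetch.

    Illustrative helper, not part of the fcs API.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [link for link in links if parser.can_fetch(user_agent, link)]


links = [
    "http://example.com/index.html",
    "http://example.com/private/data.html",
]
filtered = allowed_links(links, ROBOTS_TXT)
```

Here filtered keeps the index page and drops the link under /private/.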

link_package_queue: Queue of link packages to crawl. Each package contains the package ID, the links to crawl, the address of the Task Server (i.e. the package sender), and the MIME type of the data to crawl.

browser: Object for visiting web pages (an instance of mechanize.Browser).

stats_reset_time: Object used for computing the time period over which the crawler efficiency statistics are collected.

crawled_links: List of tuples with statistics on processed links (the number of processed links and the time it took to crawl them).

put_into_link_queue(link_package)

Puts a link package into the queue.

Parameters:	link_package (list) – Package of links to put into the queue. Each package is a list containing the package ID, the links to crawl, the address of the Task Server (i.e. the package sender), and the MIME type of the data to crawl.
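
To make the package layout concrete, here is a minimal sketch that builds a list in the documented order and passes it through a standard queue.Queue. The helper make_link_package, the sample address, and the use of queue.Queue itself are assumptions for illustration; the documentation does not state which queue type the Crawler Unit uses internally.

```python
import queue


def make_link_package(package_id, links, sender_address, mime_type):
    """Build a package in the documented order:
    package ID, links to crawl, Task Server address, MIME type.

    Illustrative helper, not part of the fcs API.
    """
    return [package_id, list(links), sender_address, mime_type]


# Stand-in for Crawler.link_package_queue (assumed to behave like queue.Queue).
link_package_queue = queue.Queue()

package = make_link_package(
    1,
    ["http://example.com/a", "http://example.com/b"],
    "http://127.0.0.1:8800",  # hypothetical Task Server address
    "text/html",
)
link_package_queue.put(package)

# A consumer can unpack the four documented fields positionally.
package_id, links_to_crawl, task_server, mime_type = link_package_queue.get()
```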

get_stats(seconds)

Returns statistics from the given time period.

Parameters:	seconds (int) – Length of the time period for which statistics should be returned; the method covers the interval from (now - seconds) until now.
Returns:	Statistics from the given time period (the number of crawled links and the total time it took to crawl them).
Return type:	dict
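
The windowing described for get_stats can be sketched as follows. This is not the module's implementation: the function stats_since, the timestamp stored as the first element of each tuple, and the dictionary keys links_num and total_time are all assumptions chosen for this example.

```python
import time


def stats_since(crawled_links, seconds, now=None):
    """Sum link counts and crawl times for entries newer than now - seconds.

    crawled_links is assumed to hold (timestamp, links_count, crawl_time)
    tuples; the real attribute may be laid out differently.
    """
    now = time.time() if now is None else now
    cutoff = now - seconds
    recent = [(count, took) for stamp, count, took in crawled_links
              if stamp >= cutoff]
    return {
        "links_num": sum(count for count, _ in recent),   # hypothetical key
        "total_time": sum(took for _, took in recent),    # hypothetical key
    }


history = [
    (1000.0, 5, 2.0),  # older than the 60-second window below
    (1050.0, 3, 1.5),
    (1090.0, 4, 1.0),
]
result = stats_since(history, seconds=60, now=1100.0)
```

With now fixed at 1100.0 and a 60-second window, only the last two entries are counted, so result reports 7 links and 2.5 seconds in total.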

get_address()

Returns this Crawler Unit’s full address (with port number).

Returns:	Crawler Unit’s address
Return type:	string

get_state()

Returns the Crawler Unit’s state.

Returns:	Crawler Unit’s state
Return type:	`CrawlerState`