fcs.crawler.crawler
This module contains Crawler Unit implementation.
- KEEP_STATS_SECONDS
Time in seconds after which old statistics are removed.
- class CrawlerState
Stores attributes defining Crawler Unit states.
- UNDEFINED
Crawler Unit’s state is undefined.
- WORKING
Crawler Unit is working.
- WAITING
Crawler Unit is idle.
- CLOSING
Crawler Unit is being closed.
- class Crawler(web_server, event, port, manager_address, max_content_length=1024 * 1024, handle_robots=False)
Crawler Unit implementation.
Parameters: - web_server (string) – Object that represents Crawler Unit, used for communication with Task Server.
- event (threading.Event) – Instance of the threading.Event class used for synchronization at the end of the Crawler Unit’s work.
- port (int) – Port of this Crawler Unit.
- manager_address (string) – Address of Manager module.
- max_content_length (int) – Maximum allowed size of crawled content (default: 1024 * 1024).
- handle_robots (bool) – Flag that informs if Crawler Unit should handle robots.txt.
- link_package_queue
Queue of packages of links to crawl. Each package contains: package ID, links to crawl, Task Server’s (i.e. package sender’s) address, and the MIME type of data to crawl.
- browser
Object for visiting web pages (an instance of mechanize.Browser).
- uuid
Crawler Unit’s UUID.
- stats_reset_time
Object used for computing the time period over which crawler efficiency statistics are collected.
- crawled_links
List of tuples with statistics on processed links (the number of processed links and the time it took to crawl them).
- put_into_link_queue(link_package)
Puts a package of links into the queue.
Parameters: link_package (list) – Package of links to put into queue. Each package is a list containing the following information: package ID, links to crawl, Task Server’s (i.e. package sender) address, MIME type of data to crawl.
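The package layout described above can be sketched as follows. This is only an illustration: a standard-library queue.Queue stands in for link_package_queue, and the helper name make_link_package, the addresses, and the field values are all hypothetical and not part of fcs.

```python
import queue


def make_link_package(package_id, links, server_address, mime_type):
    """Hypothetical helper: build a package in the documented field order."""
    return [package_id, list(links), server_address, mime_type]


# Assumption: a plain FIFO queue stands in for Crawler.link_package_queue.
link_package_queue = queue.Queue()

package = make_link_package(
    1,                                        # package ID
    ["http://example.com/", "http://example.com/about"],  # links to crawl
    "http://127.0.0.1:8800",                  # Task Server (sender) address
    "text/html",                              # MIME type of data to crawl
)
link_package_queue.put(package)

# A consumer unpacks the four fields in the same order.
package_id, links, server_address, mime_type = link_package_queue.get()
```

Keeping the sender's address inside each package lets the consumer know where to report results without any extra lookup.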
- get_stats(seconds)
Returns statistics for the given time period.
Parameters: seconds (int) – Length of the time period in seconds; statistics recorded since (now - seconds) are returned.
Returns: Statistics for the given time period (number of crawled links and the total time it took to crawl them). Return type: dict
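A minimal sketch of how such a time-window query could be computed, not the module's actual implementation. It assumes each statistics entry also carries a timestamp; the value of KEEP_STATS_SECONDS, the dictionary keys, and the helper drop_old_stats are hypothetical.

```python
import time

# Assumed value for illustration; the real constant's value is not documented here.
KEEP_STATS_SECONDS = 900


def get_stats(crawled_links, seconds, now=None):
    """Sum entries recorded since (now - seconds).

    Each entry is assumed to be (timestamp, links_count, crawl_seconds);
    the timestamp field and the result keys are assumptions.
    """
    now = time.time() if now is None else now
    cutoff = now - seconds
    recent = [entry for entry in crawled_links if entry[0] >= cutoff]
    return {
        "links_num": sum(entry[1] for entry in recent),
        "total_time": sum(entry[2] for entry in recent),
    }


def drop_old_stats(crawled_links, now=None):
    """Hypothetical helper: keep only entries newer than KEEP_STATS_SECONDS."""
    now = time.time() if now is None else now
    return [entry for entry in crawled_links
            if now - entry[0] <= KEEP_STATS_SECONDS]
```

Passing now explicitly is optional; it only makes the functions easy to exercise with fixed timestamps.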
- get_address()
Returns this Crawler Unit’s full address (including the port number).
Returns: Crawler Unit’s address Return type: string
- get_state()
Returns the Crawler Unit’s state.
Returns: Crawler Unit’s state Return type: CrawlerState
- stop()
Stops the Crawler Unit.
- kill()
Kills the Crawler Unit.
- run()
Main Crawler Unit loop.
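The interplay between the states, the link package queue, and the threading.Event passed to the constructor might look roughly like the sketch below. It is a simplified stand-in, not the fcs Crawler: the class SketchCrawler, the numeric state values, the choice of threading.Thread as a base class, and the stop_requested flag are all assumptions, and the actual page fetching is replaced by recording the package ID.

```python
import queue
import threading


class CrawlerState:
    # Only the names come from the documentation; the values are assumed.
    UNDEFINED, WORKING, WAITING, CLOSING = range(4)


class SketchCrawler(threading.Thread):
    """Hypothetical stand-in illustrating the loop, not the fcs Crawler."""

    def __init__(self, event):
        super().__init__()
        self.link_package_queue = queue.Queue()
        self.event = event                  # set when the main loop has ended
        self.state = CrawlerState.UNDEFINED
        self.processed = []                 # IDs of handled packages
        self.stop_requested = threading.Event()

    def get_state(self):
        return self.state

    def stop(self):
        # Ask the loop to finish after the current iteration.
        self.stop_requested.set()

    def run(self):
        while not self.stop_requested.is_set():
            try:
                package = self.link_package_queue.get(timeout=0.05)
            except queue.Empty:
                self.state = CrawlerState.WAITING   # idle, nothing to crawl
                continue
            self.state = CrawlerState.WORKING
            self.processed.append(package[0])       # real crawling would go here
        self.state = CrawlerState.CLOSING
        self.event.set()                            # notify the owner
```

The owner would typically create a threading.Event, pass it in, call start(), and later call stop() followed by event.wait() to know that the loop has finished.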