fcs.server.task_server¶
This module contains the implementation of the Task Server.
- URL_PACKAGE_TIMEOUT¶
Time in seconds after which a package is considered lost.
- DATE_FORMAT¶
Datetime format used throughout the system.
- WAIT_FOR_DOWNLOAD_TIME¶
Time in seconds that the Task Server waits for crawling results to be downloaded before it quits.
- KEEP_STATS_SECONDS¶
Time in seconds after which old statistics are removed.
- CHECK_EFFICIENCY_PERIOD¶
Time in seconds over which efficiency statistics are gathered.
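As an illustration of how KEEP_STATS_SECONDS might be applied, here is a minimal sketch that drops statistics entries older than the retention window. The constant's value, the `prune_old_stats` helper, and the `(link, start_time, end_time)` tuple layout are assumptions made for this example, not part of the module.

```python
import time

# Assumed value for illustration only; the real constant lives in
# fcs.server.task_server and may differ.
KEEP_STATS_SECONDS = 900


def prune_old_stats(entries, now=None, keep_seconds=KEEP_STATS_SECONDS):
    """Return the entries whose end time lies within the last keep_seconds.

    Each entry is a hypothetical (link, start_time, end_time) tuple with
    times given as UNIX timestamps.
    """
    if now is None:
        now = time.time()
    return [entry for entry in entries if now - entry[2] <= keep_seconds]


stats = [
    ("http://example.com/a", 0.0, 10.0),      # finished 990 s before `now`
    ("http://example.com/b", 950.0, 960.0),   # finished 40 s before `now`
]
recent = prune_old_stats(stats, now=1000.0)
```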
- class Status¶
Stores attributes defining Task Server state.
- INIT¶
Task Server is being initialized.
- STARTING¶
Task Server is starting.
- RUNNING¶
Task Server is running.
- PAUSED¶
Task Server is paused.
- STOPPING¶
Task Server is being stopped.
- KILLED¶
Task Server is killed.
- class TaskServer(web_server, task_id, manager_address, max_url_depth=1)¶
Main class of Task Server, containing its logic.
Parameters:
- link_db¶
Database for extracted links. Current implementation: GraphAndBTreeDB.
- content_db¶
Database for content extracted from processed pages. Current implementation: BerkeleyContentDB.
- crawlers¶
Dict of the following format: key - Crawling Unit’s address, value - links to be processed by this Crawling Unit.
- max_links¶
Maximum number of unique links that may be crawled during the current task.
- expire_date¶
Expiration date of the given task.
- mime_type¶
List of MIME types of data to be crawled.
- uuid¶
Task Server’s UUID.
- whitelist¶
Regular expression describing the allowed URL form.
- blacklist¶
Regular expression describing the forbidden URL form.
- urls_per_min¶
Expected efficiency in URLs per minute. For more details about this speed, see assign_crawlers().
- package_cache¶
Dict of the following format: key - package_id, value - information about packages with links that have been sent to Crawling Unit (time of sending, list of links, Crawling Unit’s address, timeout flag).
- package_id¶
ID of package with links.
- processing_crawlers¶
List of working Crawling Units assigned to this Task Server.
- crawled_links¶
List used for statistics: processed links together with their crawling start and end times.
- stats_reset_time¶
Object used to compute the time period over which the server efficiency statistics are collected.
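The package_cache entry described above, combined with URL_PACKAGE_TIMEOUT, can be sketched as follows. The dictionary keys (`time`, `links`, `crawler`, `timed_out`), the helper names, and the timeout value are assumptions chosen for this example; the actual layout used by TaskServer is not shown in this documentation.

```python
import time

# Assumed value for illustration only.
URL_PACKAGE_TIMEOUT = 30


def make_cache_entry(links, crawler_address, now=None):
    """Build a hypothetical package_cache value for one sent package."""
    if now is None:
        now = time.time()
    return {
        "time": now,                 # time of sending
        "links": list(links),        # links contained in the package
        "crawler": crawler_address,  # Crawling Unit's address
        "timed_out": False,          # timeout flag
    }


def mark_timed_out(package_cache, now=None, timeout=URL_PACKAGE_TIMEOUT):
    """Flag packages older than `timeout` seconds and return their IDs."""
    if now is None:
        now = time.time()
    expired = []
    for package_id, entry in package_cache.items():
        if not entry["timed_out"] and now - entry["time"] > timeout:
            entry["timed_out"] = True
            expired.append(package_id)
    return expired


cache = {
    1: make_cache_entry(["http://example.com/a"], "http://crawler-1", now=100.0),
    2: make_cache_entry(["http://example.com/b"], "http://crawler-2", now=125.0),
}
expired_ids = mark_timed_out(cache, now=140.0)
```

In this sketch a timed-out package stays in the cache with its flag set, which would let a late response be recognised and discarded rather than treated as unknown.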
- assign_crawlers(assignment)¶
Sets the current crawler assignment. The Task Server can send crawling requests only to these crawlers, and the package size must be specified in the assignment dict for each crawler. This makes it possible to control the crawling efficiency of all Task Servers.
Parameters: assignment (dict) – Dict of the following format: key - Crawling Unit’s address, value - links to be processed by the given Crawling Unit.
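A minimal sketch of what an assignment dict could look like and how it might be used to size packages. It assumes that each value is the number of links per package for that Crawling Unit, which is one reading of the description above; the `build_packages` helper and the addresses are hypothetical.

```python
def build_packages(assignment, pending_links):
    """Split pending_links into one package per crawler.

    `assignment` maps a Crawling Unit's address to an assumed package
    size (number of links). Crawlers for which no links remain get no
    package at all.
    """
    packages = {}
    position = 0
    for address, size in assignment.items():
        package = pending_links[position:position + size]
        if package:
            packages[address] = package
        position += size
    return packages


assignment = {
    "http://crawler-1:8080": 2,
    "http://crawler-2:8080": 3,
}
links = ["u1", "u2", "u3", "u4"]
packages = build_packages(assignment, links)
```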
- assign_speed(speed)¶
Sets the Task Server’s crawling speed. After each speed change, statistics are reset.
Parameters: speed (int) – Crawling speed computed as follows: speed = urls_per_min * task.priority / priority_sum, where urls_per_min is defined on the basis of the user’s quota, task.priority is the priority of the given task, and priority_sum is the sum of the priorities of all of the user’s tasks.
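The speed formula can be worked through with a short example. The `compute_speed` function is a hypothetical helper, and integer division is an assumption made here because the parameter is documented as an int.

```python
def compute_speed(urls_per_min, task_priority, priority_sum):
    """Share of the user's quota given to one task, in URLs per minute.

    Implements speed = urls_per_min * task.priority / priority_sum,
    using integer division (an assumption, since speed is an int).
    """
    if priority_sum <= 0:
        raise ValueError("priority_sum must be positive")
    return urls_per_min * task_priority // priority_sum


# A user with a quota of 600 URLs per minute and three tasks.
priorities = [1, 2, 3]
priority_sum = sum(priorities)
speeds = [compute_speed(600, p, priority_sum) for p in priorities]
```

With these numbers the three tasks receive 100, 200, and 300 URLs per minute, which together add up to the full quota of 600.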
- update(data)¶
Updates crawling parameters and status. It is usually called when task data is changed through the GUI or API.
Parameters: data (dict) – Task description (parameters of the task).
- pause()¶
Pauses the Task Server if it was running.
- resume()¶
Resumes the Task Server if it was paused.
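The guarded transitions of pause() and resume() can be sketched as below. The numeric values of the Status attributes and the `MiniServer` class are assumptions for illustration; only the attribute names come from this documentation.

```python
class Status(object):
    """Stand-in for the documented Status class; values are assumed."""
    INIT, STARTING, RUNNING, PAUSED, STOPPING, KILLED = range(6)


class MiniServer(object):
    """Hypothetical reduced model of the pause/resume behaviour."""

    def __init__(self):
        self.status = Status.RUNNING

    def pause(self):
        # Only a running server can be paused.
        if self.status == Status.RUNNING:
            self.status = Status.PAUSED

    def resume(self):
        # Only a paused server can be resumed.
        if self.status == Status.PAUSED:
            self.status = Status.RUNNING


server = MiniServer()
server.pause()
status_after_pause = server.status
server.resume()
status_after_resume = server.status

# In any other state, pause() leaves the status unchanged.
server.status = Status.STOPPING
server.pause()
status_after_ignored_pause = server.status
```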
- stop()¶
Stops the Task Server. A stopped Task Server no longer sends crawling requests. It waits WAIT_FOR_DOWNLOAD_TIME seconds for the user to download the gathered data.
- kill()¶
Kills the Task Server. A Task Server that is to be killed is stopped as soon as possible.
- run()¶
Main Task Server loop.
- get_idle_crawlers()¶
Returns a list of crawlers that are not processing any requests.
Returns: List of idle Crawler Units. Return type: list
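One way such a lookup could work is sketched below as a standalone function. It assumes that processing_crawlers holds the addresses of crawlers that are currently busy and that idle crawlers are the assigned ones not in that list; the real method may compute this differently.

```python
def get_idle_crawlers(crawlers, processing_crawlers):
    """Return assigned crawler addresses that are not currently busy.

    `crawlers` is the assignment dict (address -> links), and
    `processing_crawlers` is assumed to list the busy addresses.
    """
    busy = set(processing_crawlers)
    return [address for address in crawlers if address not in busy]


crawlers = {
    "http://crawler-1:8080": 5,
    "http://crawler-2:8080": 5,
    "http://crawler-3:8080": 5,
}
idle = get_idle_crawlers(crawlers, ["http://crawler-2:8080"])
```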
- feedback(link, rating)¶
Increases the priority of the specified link and its children.
Parameters:
- add_links(links, priority, depth=0, source_url="")¶
Adds links to be processed.
Parameters:
Raises Exception: in case of a database error.
- put_data(package_id, data)¶
Handles a crawled data package received from a crawler and puts it into the content database. If the received package is not in the package cache or the crawling request has timed out, no data is stored in the database. It also marks the crawler that was assigned to this crawling request as ‘idle’, so the next request can be sent to this crawler.
Parameters:
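The handling described for put_data() can be sketched as a standalone function. Everything here is an assumption made for illustration: the cache entry keys, the use of a plain list in place of the content database, the boolean return value, and the choice to mark the crawler idle even when the package has timed out.

```python
def put_data(package_cache, processing_crawlers, content_db, package_id, data):
    """Store crawled data unless the package is unknown or timed out.

    Returns True if the data was stored, False otherwise.
    """
    entry = package_cache.pop(package_id, None)
    if entry is None:
        # Unknown package: nothing is stored.
        return False

    # Mark the assigned crawler as idle so it can receive the next request.
    if entry["crawler"] in processing_crawlers:
        processing_crawlers.remove(entry["crawler"])

    if entry["timed_out"]:
        # Late response: discard the data.
        return False

    content_db.append(data)
    return True


package_cache = {
    7: {"crawler": "http://crawler-1", "timed_out": False},
    8: {"crawler": "http://crawler-2", "timed_out": True},
}
processing_crawlers = ["http://crawler-1", "http://crawler-2"]
content_db = []  # stand-in for the real content database

stored_ok = put_data(package_cache, processing_crawlers, content_db, 7, "page A")
stored_late = put_data(package_cache, processing_crawlers, content_db, 8, "page B")
stored_unknown = put_data(package_cache, processing_crawlers, content_db, 99, "page C")
```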