fcs.server.task_server¶

This module contains the implementation of the Task Server.

URL_PACKAGE_TIMEOUT¶: Time in seconds after which a package is considered lost.

DATE_FORMAT¶: Datetime format used in system.

WAIT_FOR_DOWNLOAD_TIME¶: Time in seconds that the Task Server waits for crawling results to be downloaded before it quits.

KEEP_STATS_SECONDS¶: Time in seconds after which old statistics are removed.

CHECK_EFFICIENCY_PERIOD¶: Time in seconds over which efficiency statistics are gathered.

class Status¶

Stores attributes defining Task Server state.

INIT¶: Task Server is being initialized.

STARTING¶: Task Server is starting.

RUNNING¶: Task Server is running.

PAUSED¶: Task Server is paused.

STOPPING¶: Task Server is being stopped.

KILLED¶: Task Server is killed.

class TaskServer(web_server, task_id, manager_address, max_url_depth=1)¶

Main class of Task Server, containing its logic.

Parameters:	web_server (WebServer) – Wrapper of TaskServer’s REST API (see `WebServer`). task_id (int) – ID of task for which Task Server was created. manager_address (string) – FCS manager module address (see Management module (fcs.manager)). max_url_depth (int) – Maximal allowed crawling tree depth.

link_db¶: Database for extracted links. Current implementation: GraphAndBTreeDB.

content_db¶: Database for content extracted from processed pages. Current implementation: BerkeleyContentDB.

crawlers¶: Dict of the following format: key - Crawling Unit’s address, value - links to be processed by this Crawling Unit.

max_links¶: Maximal amount of unique links that may be crawled during the current task.

expire_date¶: Expiration date of the given task.

mime_type¶: List of MIME types of data to be crawled.

uuid¶: Task Server’s UUID.

whitelist¶: Regexp with allowed URL form.

blacklist¶: Regexp with forbidden URL form.
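A minimal sketch of how a whitelist and a blacklist regexp could be combined to filter URLs. The helper name `is_url_allowed` is hypothetical, and the use of `re.match` (anchored at the start of the URL) is an assumption; the module may match URLs differently.

```python
import re


def is_url_allowed(url, whitelist=None, blacklist=None):
    """Return True if url passes both filters (hypothetical helper).

    Assumption: an empty or missing pattern means "no restriction".
    """
    # Reject URLs that do not match the allowed form.
    if whitelist and not re.match(whitelist, url):
        return False
    # Reject URLs that match the forbidden form.
    if blacklist and re.match(blacklist, url):
        return False
    return True
```

With a whitelist of `http://example\.com/.*` and a blacklist of `.*\.pdf$`, this sketch accepts `http://example.com/a` and rejects `http://example.com/a.pdf`.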

urls_per_min¶: Expected efficiency in URLs per minute. For more details about this speed, see assign_crawlers().

package_cache¶: Dict of the following format: key - package_id, value - information about packages with links that have been sent to Crawling Unit (time of sending, list of links, Crawling Unit’s address, timeout flag).
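The package cache layout described above can be sketched as a plain dict. The entry keys (`sent_at`, `links`, `crawler`, `timed_out`), the helper names, and the timeout value are all assumptions made for illustration; only the four pieces of information stored per package come from the description.

```python
import time

URL_PACKAGE_TIMEOUT = 30  # hypothetical value, in seconds


def make_entry(links, crawler_address, now=None):
    """Build one package_cache value (hypothetical field names)."""
    return {
        "sent_at": time.time() if now is None else now,
        "links": list(links),
        "crawler": crawler_address,
        "timed_out": False,
    }


def mark_timed_out(package_cache, now=None, timeout=URL_PACKAGE_TIMEOUT):
    """Flag packages older than the timeout and return their IDs."""
    now = time.time() if now is None else now
    expired = []
    for package_id, entry in package_cache.items():
        if not entry["timed_out"] and now - entry["sent_at"] > timeout:
            entry["timed_out"] = True
            expired.append(package_id)
    return expired
```

Passing `now` explicitly keeps the sketch deterministic; a real loop would presumably rely on the current time.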

package_id¶: ID of package with links.

processing_crawlers¶: List of working Crawling Units assigned to this Task Server.

status¶: Task Server state, described by Status.

crawled_links¶: List for statistics - processed links, crawling beginning and end times.

stats_reset_time¶: Object used for computing time period from which the server efficiency statistics are collected.

assign_crawlers(assignment)¶

Sets the current crawler assignment. The Task Server can send crawling requests only to these crawlers, and the assignment dict must specify the package size for each crawler. This makes it possible to control the crawling efficiency of all Task Servers.

Parameters:	assignment (dict) – Dict of the following format: key - Crawling Unit’s address, value - links to be processed by the given Crawling Unit.

assign_speed(speed)¶

Sets the Task Server’s crawling speed. Statistics are reset after each speed change.

Parameters:	speed (int) – Crawling speed computed as follows: speed = urls_per_min * task.priority / priority_sum, where urls_per_min is defined on the basis of the user’s quota, task.priority is the priority of the given task, and priority_sum is the sum of the priorities of all of the user’s tasks.
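As a worked illustration, reading the formula as `urls_per_min * task.priority / priority_sum`, the speed passed to `assign_speed()` could be computed as below. The function name `compute_speed` is hypothetical, and rounding down to an integer is an assumption based on `speed` being documented as an int.

```python
def compute_speed(urls_per_min, task_priority, priority_sum):
    """Share of the user's URL quota given to one task (hypothetical helper)."""
    if priority_sum <= 0:
        raise ValueError("priority_sum must be positive")
    # Assumption: the result is truncated to an integer.
    return urls_per_min * task_priority // priority_sum
```

For a quota of 600 URLs per minute and three tasks with priorities 1, 2 and 3 (sum 6), the task with priority 2 would get `compute_speed(600, 2, 6)`, that is 200 URLs per minute.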

update(data)¶

Updates crawling parameters and status. It is usually called when some changes in task data are made using GUI or API.

Parameters:	data (dict) – Task description (parameters of the task).

pause()¶: Pauses the Task Server if it was running.

resume()¶: Resumes the Task Server if it was paused.

stop()¶: Stops the Task Server. A stopped Task Server no longer sends crawling requests. It waits WAIT_FOR_DOWNLOAD_TIME seconds for the user to download the gathered data.

kill()¶: Kills the Task Server. A Task Server that is to be killed will be stopped as soon as possible.
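The guarded transitions of pause() and resume() can be sketched as a small state holder. The numeric values of the `Status` attributes and the `ServerState` class are assumptions for illustration; only the attribute names and the "if it was running" / "if it was paused" conditions come from the descriptions above.

```python
class Status:
    # Hypothetical values; only the names are documented.
    INIT, STARTING, RUNNING, PAUSED, STOPPING, KILLED = range(6)


class ServerState:
    """Hypothetical stand-in for the status handling of a Task Server."""

    def __init__(self):
        self.status = Status.RUNNING

    def pause(self):
        # Only a running server can be paused.
        if self.status == Status.RUNNING:
            self.status = Status.PAUSED

    def resume(self):
        # Only a paused server can be resumed.
        if self.status == Status.PAUSED:
            self.status = Status.RUNNING

    def stop(self):
        self.status = Status.STOPPING

    def kill(self):
        self.status = Status.KILLED
```

In this sketch, calling pause() on a server that is already stopping leaves its status unchanged.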

run()¶: Main Task Server loop.

get_idle_crawlers()¶

Returns list of crawlers which are not processing any requests.

Returns:	List of idle Crawler Units.
Return type:	list
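One way to picture this lookup, assuming `crawlers` is keyed by Crawling Unit address and `processing_crawlers` holds the addresses of units that are currently busy, is the standalone function below. It is a sketch under those assumptions, not the method's actual body.

```python
def get_idle_crawlers(crawlers, processing_crawlers):
    """Return addresses of assigned crawlers that are not busy (sketch)."""
    busy = set(processing_crawlers)
    # Iterating the dict keeps the order in which crawlers were assigned.
    return [address for address in crawlers if address not in busy]
```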

feedback(link, rating)¶

Increases priority of specified link and its children.

Parameters:	link (string) – Link. rating (string) – Link’s new rating, a number from 1 to 5 cast to a string.
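Because the rating arrives as a string, a caller may want to validate it before calling feedback(). The helper `parse_rating` below is hypothetical and only illustrates the documented 1 to 5 range; how the Task Server itself treats invalid ratings is not specified here.

```python
def parse_rating(rating):
    """Convert a rating string to an int in the range 1-5 (hypothetical helper)."""
    value = int(rating)  # raises ValueError for non-numeric input
    if not 1 <= value <= 5:
        raise ValueError("rating must be between 1 and 5")
    return value
```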

add_links(links, priority, depth=0, source_url="")¶

Adds links to process.

Parameters:	links (list) – List of links (links are of string type). priority (int) – Links’ priority, can be a number 0-999 (0 is the lowest priority). depth (int) – Depth of crawling for a page from which links have been retrieved. source_url (string) – URL of page from which links have been retrieved.
Raises Exception:	in case of a database error.

put_data(package_id, data)¶

Handles a crawled data package received from a crawler and puts it into the content database. If the received package is not in the package cache, or the crawling request has timed out, no data is stored. It also marks the crawler assigned to this crawling request as ‘idle’, so that the next request can be sent to it.

Parameters:	package_id (int) – ID of crawled data package (identical to the package ID from crawling request). data (string) – Crawled data package.
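The handling described above can be sketched as a standalone function. The cache entry keys (`crawler`, `timed_out`), the list standing in for the content database, the boolean return value, and the choice to drop the cache entry and to free the crawler even after a timeout are all assumptions made for illustration.

```python
def put_data(package_cache, processing_crawlers, content_store, package_id, data):
    """Store a crawled package if its request is known and still valid (sketch).

    Returns True when the data was stored, False otherwise.
    """
    # Assumption: a handled package is removed from the cache.
    entry = package_cache.pop(package_id, None)
    if entry is None:
        return False  # unknown package: nothing is stored
    # Mark the assigned crawler as idle so it can receive the next request.
    if entry["crawler"] in processing_crawlers:
        processing_crawlers.remove(entry["crawler"])
    if entry["timed_out"]:
        return False  # request timed out: data is discarded
    content_store.append(data)
    return True
```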

get_data(size)¶

Returns path to file with crawling results.

Parameters:	size (int) – Size of package with demanded crawling results.
Returns:	Path to file with crawling results.
Return type:	string