fcs.manager.models¶

This module contains the model layer, which implements the system’s units as object-relational mapping classes:

class UserManager¶

Provides methods for creating a regular user (together with a Quota) or a superuser.

create_user(username, email, password)¶

Creates a regular FCS user with a default Quota.

Parameters:	username (string) – New user’s name.
	email (string) – New user’s email address.
	password (string) – New user’s password.
Returns:	New user
Return type:	`User`

create_superuser(username, email, password)¶

Creates an FCS superuser who can use the admin panel.

Parameters:	username (string) – New user’s name.
	email (string) – New user’s email address.
	password (string) – New user’s password.
Returns:	New superuser
Return type:	`User`

class User¶: FCS user class. Extends django.contrib.auth.models.AbstractUser.

Note

Username, password and email are required. Other fields are optional.

class Quota¶

Represents the limits on task creation. Each User object is linked to its own personal quota.

max_priority¶: Maximal allowed sum of tasks’ priorities.

max_tasks¶: Maximal allowed number of running tasks.

link_pool¶: Maximal allowed total number of links processed across all tasks.

max_links¶: Maximal allowed number of processed links per task.

urls_per_min¶: Expected total crawling speed. Used by the efficiency estimation module and by autoscaling.

user¶: Quota’s owner.

class QuotaException¶: Raised when a user exceeds the limits defined by the Quota object.
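How such limits might be checked before a task is started can be sketched in plain Python. This is an illustrative stand-in, not the FCS implementation: the `QuotaSketch` class and the `check_new_task` helper are hypothetical, and only the names `max_priority`, `max_tasks` and `QuotaException` come from the descriptions above.

```python
from dataclasses import dataclass
from typing import List


class QuotaException(Exception):
    """Stand-in for the exception described above (not the FCS class)."""


@dataclass
class QuotaSketch:
    """Hypothetical, simplified quota holding two of the documented limits."""
    max_priority: int  # maximal allowed sum of tasks' priorities
    max_tasks: int     # maximal allowed number of running tasks


def check_new_task(quota: QuotaSketch,
                   running_priorities: List[int],
                   new_priority: int) -> None:
    """Raise QuotaException if one more running task would exceed the quota."""
    if len(running_priorities) + 1 > quota.max_tasks:
        raise QuotaException("too many running tasks")
    if sum(running_priorities) + new_priority > quota.max_priority:
        raise QuotaException("sum of task priorities exceeded")
```

With `QuotaSketch(max_priority=10, max_tasks=2)` and one running task of priority 3, a new task of priority 5 passes the check, while a new task of priority 8 raises `QuotaException`.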

class TaskManager¶

Manages creation of Task.

static create_task(user, name, priority, expire, start_links, whitelist='*', blacklist='', max_links=1000, mime_type='text/html')¶

Returns new task.

Parameters:	user (string) – User’s name.
	name (string) – New task’s name.
	priority (int) – Task priority.
	expire (datetime) – Task expiration date.
	start_links (string) – List of pages where the crawler starts its work.
	whitelist (string) – Allowed URLs as regexp list.
	blacklist (string) – Disallowed URLs as regexp list.
	max_links (string) – Maximal allowed number of processed pages.
	mime_type (string) – List of allowed MIME types.
Returns:	New task
Return type:	`Task`
Raises:	QuotaException – If the user’s quota is exceeded.

class Crawler¶

Represents Crawling Unit.

address¶: Crawling unit’s address.

uuid¶: Crawling unit’s UUID.

is_alive()¶

Checks whether the crawler responds to requests.

Returns:	Whether the crawler is alive
Return type:	bool

stop()¶: Sends stop request to crawler.

Note

If the crawler doesn’t respond, this object will be deleted.

kill()¶: Sends kill request to crawler.

Note

If the crawler doesn’t respond, this object will be deleted.

send(path, method='get', data=None)¶

Sends a request to the crawler.

Parameters:	path (string) – Request name, may be one of the following: ‘/put_links’, ‘/kill’, ‘/stop’, ‘/alive’, ‘/stats’.
	method (string) – Method of request, acceptable values are ‘get’ or ‘post’.
	data (dict) – Dict with parameters (in JSON). Details of particular request’s parameters are described in fcs.crawler.web_interface documentation.
Returns:	Response or None if connection cannot be established
Return type:	requests.Response or None
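The "response or None" contract of `send()`, and a liveness check built on top of it, can be sketched as follows. This is a minimal sketch under stated assumptions, not the FCS code: it uses the standard library's `urllib` instead of `requests`, takes the unit's address as an explicit argument, adds a hypothetical `timeout` parameter, and treats any HTTP answer as a sign that the unit is alive.

```python
import json
import urllib.error
import urllib.request


def send(address, path, method="get", data=None, timeout=5.0):
    """Send a request to address + path.

    Returns a response object, or None if the connection cannot be
    established. The stdlib is used here as a stand-in for `requests`.
    """
    body = None
    headers = {}
    if data is not None:
        # Parameters are sent as a JSON document (an assumption of this sketch).
        body = json.dumps(data).encode("utf-8")
        headers["Content-Type"] = "application/json"
    request = urllib.request.Request(
        address + path, data=body, headers=headers, method=method.upper()
    )
    try:
        return urllib.request.urlopen(request, timeout=timeout)
    except urllib.error.HTTPError as error:
        # The unit answered, although with an error status; HTTPError is
        # itself a response-like object, so it is returned as the response.
        return error
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar problems.
        return None


def is_alive(address, timeout=5.0):
    """Return True if the unit answers the '/alive' request at all."""
    response = send(address, "/alive", timeout=timeout)
    if response is None:
        return False
    response.close()
    return True
```

Returning `None` instead of raising lets callers such as `stop()` and `kill()` decide what to do with an unreachable unit, for example deleting its record as the notes above describe.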

class TaskServer¶

Represents server which executes crawling tasks.

address¶: Task Server’s address.

urls_per_min¶: Task Server’s speed.

uuid¶: Task Server’s UUID.

is_alive()¶

Checks whether the Task Server responds to requests.

Returns:	Whether the Task Server is alive
Return type:	bool

kill()¶: Sends kill request to Task Server.

Note

If the server doesn’t respond, this object will be deleted.

send(path, method='get', data=None)¶

Sends a request to the Task Server.

Parameters:	path (string) – Request name, may be one of the following: ‘/put_links’, ‘/kill’, ‘/stop’, ‘/alive’, ‘/stats’.
	method (string) – Method of request, acceptable values are ‘get’ or ‘post’.
	data (dict) – Dict with parameters (in JSON). Details of particular request’s parameters are described in fcs.server.web_interface documentation.
Returns:	Response or None if connection cannot be established
Return type:	requests.Response or None

delete()¶: Deletes this Task Server.

class Task¶

Represents crawling task defined by user.

user¶: User that owns this task.

name¶: Task’s name.

priority¶: Task’s priority.

start_links¶: Starting point of crawling.

whitelist¶: URLs which should be crawled (in regex format).

blacklist¶: URLs which should not be crawled (in regex format).

max_links¶: Maximal amount of links that may be visited while crawling.

expire_date¶: Datetime of task expiration.

mime_type¶: MIME types which are to be crawled.

active¶: Boolean value. If true, the task is running; otherwise it is paused.

finished¶: Boolean value. If true, the task is finished; otherwise it is running or paused.

created¶: Datetime of task creation.

last_data_download¶: Time of last crawled data collection.

server¶: Task Server that handles this task.

last_server_spawn¶: Time when a server was last spawned to handle this task.

autoscale_change¶: Boolean value that indicates whether any of the task’s parameters has been modified. If the value is true, the Task Server has to be informed of this change.

clean()¶

Cleans the task’s data. Validates the new task’s fields before the save operation.

Raises:	ValidationError – If the task’s parameters cannot be validated.
	QuotaException – If the user’s quota has been exceeded.

save(*args, **kwargs)¶: Saves the task in the database and sends information about modifications to its Task Server.

get_parsed_whitelist()¶

Returns the whitelist converted from user-friendly regex to Python regex.

Returns:	Whitelist in Python regex format
Return type:	list

get_parsed_blacklist()¶

Returns the blacklist converted from user-friendly regex to Python regex.

Returns:	Blacklist in Python regex format
Return type:	list
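The documentation does not define the user-friendly regex syntax, so the conversion can only be illustrated under assumptions. The sketch below assumes that patterns are separated by commas and that `*` is the only wildcard, standing for any sequence of characters; the `parse_patterns` helper is hypothetical and not part of FCS.

```python
import re


def parse_patterns(raw):
    """Convert a wildcard pattern list into a list of Python regex strings.

    Assumptions of this sketch (not confirmed by the FCS documentation):
    patterns are comma-separated, and '*' matches any run of characters.
    """
    patterns = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries, e.g. an empty blacklist
        # Escape regex metacharacters, then turn the escaped '*' into '.*'.
        patterns.append(re.escape(item).replace(r"\*", ".*"))
    return patterns


def is_allowed(url, patterns):
    """Return True if the whole URL matches at least one converted pattern."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)
```

Under these assumptions, the default whitelist `'*'` becomes a single pattern that matches every URL, and the default blacklist `''` becomes an empty list that matches none.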

change_priority(priority)¶

Sets task priority.

Note

A task with a higher priority crawls more links at the same time than tasks with a lower priority.

Parameters:	priority (int) – Task’s new priority.
Raises:	QuotaException – If the task’s priority exceeds the quota of the user who owns this task.

pause()¶: Pauses task.

Note

A paused task does not crawl any links until it is resumed. Pausing temporarily releases the resources used by the task (such as priority).

resume()¶

Resumes the task: it becomes active again and can crawl links.

Raises:	QuotaException – If the user does not have enough free priority resources to run this task. In that case, the user should decrease the priority of this task or of another active task.
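The priority accounting behind `pause()` and `resume()` can be sketched in a few lines. This is a simplified model under stated assumptions, not the FCS implementation: `TaskSketch`, `used_priority` and the free functions `pause` and `resume` are hypothetical, and the quota is reduced to a single `max_priority` number.

```python
class QuotaException(Exception):
    """Stand-in for the exception described in this module."""


class TaskSketch:
    """Hypothetical task holding only the fields needed for this sketch."""

    def __init__(self, priority, active=True):
        self.priority = priority
        self.active = active


def used_priority(tasks):
    """Sum the priorities of active tasks; paused tasks release theirs."""
    return sum(task.priority for task in tasks if task.active)


def pause(task):
    """Pause the task, which frees its priority for other tasks."""
    task.active = False


def resume(task, all_tasks, max_priority):
    """Resume the task if the user's free priority allows it."""
    if task.active:
        return
    if used_priority(all_tasks) + task.priority > max_priority:
        raise QuotaException("not enough free priority to resume this task")
    task.active = True
```

With `max_priority=10`, an active task of priority 6 and a paused task of priority 5, resuming the paused task raises `QuotaException`; pausing the first task, or lowering either priority, lets the resume succeed.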

stop()¶: Marks task as finished.

Note

Finished tasks cannot be resumed and do not count toward the user’s max_tasks quota. After some time, the task’s Task Server will be closed and the crawling results will be lost.

is_waiting_for_server()¶

Checks whether a running task has no Task Server assigned. This includes the case where the task is waiting for a new Task Server to start.

Returns:	Whether this task has no Task Server assigned
Return type:	bool
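Read together with the `active`, `finished` and `server` attributes, this check might reduce to a simple predicate. The sketch below is an assumption about the logic, not the actual method: `TaskState` is a hypothetical stand-in for `Task`, and "running" is taken to mean active and not finished.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskState:
    """Hypothetical stand-in exposing three of the documented Task fields."""
    active: bool
    finished: bool
    server: Optional[str] = None  # None means no Task Server is assigned


def is_waiting_for_server(task: TaskState) -> bool:
    """Return True for a running task that has no Task Server assigned."""
    running = task.active and not task.finished
    return running and task.server is None
```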

feedback(link, rating)¶

Processes feedback from the client in order to adjust the crawling process to the client’s expectations.

Parameters:	link (string) – Rated link.
	rating (string) – Rating as a number in the range 1 - 5.

send_update_to_task_server()¶: Sends information about task update to its Task Server.

create_api_keys(sender, **kwargs)¶

Creates the Application object required for working with the REST API.

Parameters:	sender (string) – Signal sender. This parameter is irrelevant here; more details about the signal mechanism can be found in the Django documentation.

class MailSent¶

Represents a mail sent to a user as a reminder to collect crawling data that is waiting for them.

tasks¶: List of tasks related to uncollected data.

date¶: Date when the mail was sent.