fcs.manager.models
This module contains the model layer: the implementation of the system’s units as object-relational mapping classes.
- class UserManager
Provides methods for creating a regular user together with their Quota, or a superuser.
- create_user(username, email, password)
Creates a regular FCS user with a default Quota.
Returns: New user
- class User
FCS user class. Extends django.contrib.auth.models.AbstractUser.
Note
Username, password and email are required. Other fields are optional.
- class Quota
Represents the limits on creating tasks. Each User object is linked to its own personal quota.
- max_priority
Maximum allowed sum of the priorities of the user’s tasks.
- max_tasks
Maximum allowed number of running tasks.
- link_pool
Maximum allowed total number of links processed across all of the user’s tasks.
- max_links
Maximum allowed number of processed links per task.
- urls_per_min
Expected total crawling speed (URLs per minute). Used by the efficiency estimation module and by autoscaling.
- user
The quota’s owner.
- class QuotaException
Raised when a user exceeds the limits defined by their Quota object.
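The Quota fields can be read as a set of limits that are checked before a task is accepted. The sketch below is an illustration only: QuotaSketch, TaskSketch and check_quota are hypothetical stand-ins built from plain dataclasses rather than the Django models of this module, and the exact rules applied by FCS are an assumption.

```python
from dataclasses import dataclass
from typing import List


class QuotaException(Exception):
    """Stand-in for the exception raised when a quota limit is exceeded."""


@dataclass
class QuotaSketch:
    # Field names mirror the documented Quota attributes.
    max_priority: int
    max_tasks: int
    link_pool: int
    max_links: int


@dataclass
class TaskSketch:
    priority: int
    max_links: int


def check_quota(quota: QuotaSketch, running: List[TaskSketch], new: TaskSketch) -> None:
    """Raise QuotaException if adding `new` to `running` breaks a limit (assumed rules)."""
    tasks = running + [new]
    if len(tasks) > quota.max_tasks:
        raise QuotaException("too many running tasks")
    if new.max_links > quota.max_links:
        raise QuotaException("task exceeds the per-task link limit")
    if sum(t.priority for t in tasks) > quota.max_priority:
        raise QuotaException("sum of priorities exceeds max_priority")
    if sum(t.max_links for t in tasks) > quota.link_pool:
        raise QuotaException("sum of max_links exceeds link_pool")
```

Keeping all checks in one function makes it easy to see that two limits are per-user totals (max_priority, link_pool), one is a count (max_tasks) and one applies to a single task (max_links).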
- class TaskManager
Manages the creation of Task objects.
- static create_task(user, name, priority, expire, start_links, whitelist='*', blacklist='', max_links=1000, mime_type='text/html')
Creates and returns a new task.
Parameters:
- user (string) – User’s name.
- name (string) – New task’s name.
- priority (int) – Task priority.
- expire (datetime) – Task expiration date.
- start_links (string) – List of pages where the crawler starts its work.
- whitelist (string) – Allowed URLs as a list of regular expressions.
- blacklist (string) – Disallowed URLs as a list of regular expressions.
- max_links (int) – Maximum allowed number of processed pages.
- mime_type (string) – List of allowed MIME types.
Returns: New task
Raises QuotaException: if the user’s quota is exceeded.
- class Crawler
Represents a Crawling Unit.
- address
Crawling Unit’s address.
- uuid
Crawling Unit’s UUID.
- is_alive()
Checks whether the crawler responds to requests.
Returns: Whether the crawler is alive
Return type: bool
- stop()
Sends a stop request to the crawler.
Note
If the crawler doesn’t respond, this object will be deleted.
- kill()
Sends a kill request to the crawler.
Note
If the crawler doesn’t respond, this object will be deleted.
- send(path, method='get', data=None)
Sends a request to the crawler.
Parameters:
- path (string) – Request name; one of ‘/put_links’, ‘/kill’, ‘/stop’, ‘/alive’, ‘/stats’.
- method (string) – Request method; acceptable values are ‘get’ and ‘post’.
- data (dict) – Dict with the request parameters (sent as JSON). The parameters of each request are described in the fcs.crawler.web_interface documentation.
Returns: Response, or None if the connection cannot be established
Return type: requests.Response or None
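The "response or None" contract of send() can be sketched as follows. This is a minimal illustration, not the module's code: the documented method returns a requests.Response, while this sketch swaps in the standard library's urllib so that it runs without third-party packages. The free functions send and is_alive, the timeout parameter and the JSON body encoding are assumptions.

```python
import json
import urllib.error
import urllib.request

# Talk to the unit directly, ignoring any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def send(address, path, method="get", data=None, timeout=5):
    """Send a request to address + path; return the response, or None on connection failure."""
    url = address.rstrip("/") + path
    body = None
    headers = {}
    if method.lower() == "post":
        # Parameters travel as a JSON document in the request body (assumed encoding).
        body = json.dumps(data or {}).encode("utf-8")
        headers["Content-Type"] = "application/json"
    request = urllib.request.Request(url, data=body, headers=headers, method=method.upper())
    try:
        return _opener.open(request, timeout=timeout)
    except urllib.error.HTTPError as error:
        # The unit answered, only with an error status; that still counts as a response.
        return error
    except (urllib.error.URLError, OSError):
        # No connection could be established.
        return None


def is_alive(address):
    """Return True if the unit answers the '/alive' request."""
    response = send(address, "/alive")
    if response is None:
        return False
    response.close()
    return True
```

Returning None instead of raising lets callers such as is_alive(), stop() and kill() treat "no answer" as an ordinary outcome, which matches the notes about deleting objects whose unit does not respond.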
- class TaskServer
Represents a server which executes crawling tasks.
- address
Task Server’s address.
- urls_per_min
Task Server’s speed.
- uuid
Task Server’s UUID.
- is_alive()
Checks whether the Task Server responds to requests.
Returns: Whether the Task Server is alive
Return type: bool
- kill()
Sends a kill request to the Task Server.
Note
If the server doesn’t respond, this object will be deleted.
- send(path, method='get', data=None)
Sends a request to the Task Server.
Parameters:
- path (string) – Request name; one of ‘/put_links’, ‘/kill’, ‘/stop’, ‘/alive’, ‘/stats’.
- method (string) – Request method; acceptable values are ‘get’ and ‘post’.
- data (dict) – Dict with the request parameters (sent as JSON). The parameters of each request are described in the fcs.server.web_interface documentation.
Returns: Response, or None if the connection cannot be established
Return type: requests.Response or None
- delete()
Deletes this Task Server.
- class Task
Represents a crawling task defined by a user.
- user
User that owns this task.
- name
Task’s name.
- priority
Task’s priority.
- start_links
Starting point of crawling.
- whitelist
URLs which should be crawled (in regex format).
- blacklist
URLs which should not be crawled (in regex format).
- max_links
Maximum number of links that may be visited while crawling.
- expire_date
Datetime of task expiration.
- mime_type
MIME types to be crawled.
- active
Boolean value. If true, the task is running; otherwise it is paused.
- finished
Boolean value. If true, the task is finished; otherwise it is running or paused.
- created
Datetime of task creation.
- last_data_download
Time when crawled data was last collected.
- server
Task Server that handles this task.
- last_server_spawn
Time when a server was last spawned to handle this task.
- autoscale_change
Boolean value indicating whether any of the task’s parameters has been modified. If the value is true, the Task Server has to be informed of the change.
- clean()
Cleans the task’s data and validates the new task’s fields before the save operation.
Raises:
- ValidationError – If the task’s parameters cannot be validated
- QuotaException – If the user’s quota has been exceeded
- save(*args, **kwargs)
Saves the task in the database and sends information about modifications to its Task Server.
- get_parsed_whitelist()
Returns the whitelist converted from user-friendly regex to Python regex.
Returns: Whitelist in Python regex format
Return type: list
- get_parsed_blacklist()
Returns the blacklist converted from user-friendly regex to Python regex.
Returns: Blacklist in Python regex format
Return type: list
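The documentation does not define the "user-friendly regex" syntax. As a hedged illustration of the kind of conversion involved, the sketch below assumes comma-separated shell-style wildcard patterns (consistent with the default whitelist of '*') and translates them with the standard library's fnmatch module. The functions parse_url_patterns and is_allowed are hypothetical, and the real format and the real return values may differ.

```python
import fnmatch
import re


def parse_url_patterns(patterns):
    """Convert a comma-separated string of wildcard patterns into compiled Python regexes."""
    compiled = []
    for raw in patterns.split(","):
        pattern = raw.strip()
        if pattern:
            # fnmatch.translate turns shell-style wildcards ('*', '?') into a regex string.
            compiled.append(re.compile(fnmatch.translate(pattern)))
    return compiled


def is_allowed(url, whitelist, blacklist):
    """A URL is crawled if it matches some whitelist entry and no blacklist entry."""
    allowed = any(regex.match(url) for regex in parse_url_patterns(whitelist))
    denied = any(regex.match(url) for regex in parse_url_patterns(blacklist))
    return allowed and not denied
```

With these assumptions, is_allowed("http://example.com/a", "*", "") accepts the URL, while a blacklist of "*/private/*" rejects any URL containing a /private/ path segment.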
- change_priority(priority)
Sets the task’s priority.
Note
A task with higher priority crawls more links at the same time than tasks with lower priority.
Parameters: priority (int) – Task’s new priority.
Raises QuotaException: if the task’s priority exceeds the quota of the user who owns this task.
- pause()
Pauses the task.
Note
A paused task does not crawl any links until it is resumed. Pausing temporarily releases the resources used by the task (such as priority).
- resume()
Resumes the task: it becomes active again and can crawl links.
Raises QuotaException: if the user does not have enough free priority resources to run this task. In that case, the user should decrease the priority of this or another active task.
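The interplay of change_priority(), pause() and resume() with the priority quota can be sketched in memory. This is an assumption-based illustration, not the Django model: PriorityTaskSketch is a hypothetical class, and only the bookkeeping described in the notes is modelled (paused tasks release their priority, and resuming fails when the free priority is insufficient).

```python
class QuotaException(Exception):
    """Stand-in for the exception raised when a quota limit is exceeded."""


class PriorityTaskSketch:
    """Hypothetical in-memory task; only the priority bookkeeping is modelled."""

    def __init__(self, owner_tasks, max_priority, priority):
        self.owner_tasks = owner_tasks    # all tasks of the same user
        self.max_priority = max_priority  # the owner's Quota.max_priority
        self.priority = priority
        self.active = False
        owner_tasks.append(self)

    def _used_by_others(self):
        # Only active tasks hold priority; paused ones have released theirs.
        return sum(t.priority for t in self.owner_tasks if t.active and t is not self)

    def change_priority(self, priority):
        used = self._used_by_others() if self.active else 0
        if used + priority > self.max_priority:
            raise QuotaException("priority exceeds the owner's quota")
        self.priority = priority

    def pause(self):
        self.active = False

    def resume(self):
        if self._used_by_others() + self.priority > self.max_priority:
            raise QuotaException("not enough free priority to resume")
        self.active = True
```

In this sketch, a user with max_priority 10 and two tasks of priority 6 can run only one of them at a time; lowering one task to priority 4 allows both to be active, which is the remedy suggested for resume().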
- stop()
Marks the task as finished.
Note
Finished tasks cannot be resumed and do not count towards the user’s max_tasks quota. After some time, the task’s Task Server will be closed and the crawling results will be lost.
- is_waiting_for_server()
Checks whether a running task has no Task Server assigned. This includes the time spent waiting for a new Task Server to start.
Returns: Whether this task has no Task Server assigned
Return type: bool
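The active, finished and server attributes together determine the state a task is in. The sketch below shows one plausible reading of that combination. TaskState, describe and the free function is_waiting_for_server are hypothetical, and treating "running" as "active and not finished" is an assumption based on the attribute descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskState:
    active: bool = True
    finished: bool = False
    server: Optional[str] = None  # address of the assigned Task Server, if any


def is_waiting_for_server(task):
    """A running (active, unfinished) task with no Task Server assigned is waiting."""
    return task.active and not task.finished and task.server is None


def describe(task):
    """Map the flag combination to a human-readable state."""
    if task.finished:
        return "finished"
    if not task.active:
        return "paused"
    if is_waiting_for_server(task):
        return "waiting for server"
    return "running"
```

Checking finished first reflects the note on stop(): a finished task can never return to the paused or running states.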
- feedback(link, rating)
Processes feedback from the client in order to adjust the crawling process to the client’s expectations.
- send_update_to_task_server()
Sends information about the task update to its Task Server.
- create_api_keys(sender, **kwargs)
Creates the Application object required for working with the REST API.
Parameters: sender (string) – Signal sender. This parameter is irrelevant here; more details about the signal mechanism can be found in the Django documentation.
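The (sender, **kwargs) signature is the shape of a signal receiver: the dispatcher passes the sender plus a keyword payload, and the receiver reads only what it needs. The sketch below imitates this with a tiny stand-in dispatcher so that it runs without Django. SignalSketch, the api_keys dictionary and the instance/created payload are assumptions (they mirror the keyword arguments of Django's post_save signal; the source does not say which signal the handler is connected to), and the real function creates an Application object rather than a dictionary entry.

```python
import secrets


class SignalSketch:
    """Tiny stand-in for a Django-style signal; not Django's implementation."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def send(self, sender, **named):
        # Every receiver gets the sender plus whatever keyword payload was sent.
        return [receiver(sender=sender, **named) for receiver in self._receivers]


api_keys = {}  # hypothetical store: username -> generated key


def create_api_keys(sender, **kwargs):
    """Receiver: ignores `sender` and reads its input from the keyword payload."""
    if kwargs.get("created"):
        api_keys[kwargs["instance"]] = secrets.token_hex(16)
    return api_keys.get(kwargs["instance"])


user_saved = SignalSketch()
user_saved.connect(create_api_keys)
user_saved.send(sender="User", instance="alice", created=True)
```

Accepting **kwargs keeps the receiver compatible with any extra keyword arguments the dispatcher may pass, which is why the sender argument can be ignored safely.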