fcs.crawler.content_parser¶
This module contains classes responsible for parsing content acquired by Crawling Units.
- class ParserProvider¶
Provides concrete parser instance.
- parsers¶
Dict in the following format: {content_type: parser_instance}. Maps each content type to the concrete parser instance that handles it.
- static get_parser(content_type)¶
Returns the parser instance registered for the given content type.
Parameters: content_type (string) – Type of content to parse (MIME type).
Returns: Instance of a parser that is able to parse content of the given type.
Return type: Parser
Raises: Exception – if an unknown parser type has been requested.
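The lookup performed by get_parser can be sketched as follows. This is a minimal illustration rather than the module's actual source: the stub parser classes, the error message, and the decision to register only "text/html" are assumptions made for the example.

```python
class Parser(object):
    """Stand-in for the Parser superclass (assumed shape)."""

    def parse(self, content, url=""):
        raise NotImplementedError


class TextHtmlParser(Parser):
    """Stub used only in this sketch; the real class parses HTML."""

    def parse(self, content, url=""):
        return [content, []]


class ParserProvider(object):
    # Maps a MIME type to the parser instance that handles it.
    # Registering only "text/html" is an assumption of this sketch.
    parsers = {"text/html": TextHtmlParser()}

    @staticmethod
    def get_parser(content_type):
        try:
            return ParserProvider.parsers[content_type]
        except KeyError:
            # The documentation only says that an Exception is raised;
            # the message text here is illustrative.
            raise Exception("Unknown parser type: %s" % content_type)


parser = ParserProvider.get_parser("text/html")
print(type(parser).__name__)
```

Because the instances live in a class-level dict, every caller asking for the same content type receives the same parser object.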
- class Parser¶
Superclass for concrete parser implementations.
- parse(content, url="")¶
This method should contain parsing logic.
Parameters:
Note
The Parser class’s parse method is not implemented and should be overridden by subclasses.
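A subclass overriding parse might look like the sketch below. PlainTextParser is a hypothetical class that does not exist in the module, and raising NotImplementedError in the base class is an assumption; the documentation only states that the method is not implemented.

```python
class Parser(object):
    """Sketch of the superclass; raising NotImplementedError is an assumption."""

    def parse(self, content, url=""):
        raise NotImplementedError("Subclasses must override parse()")


class PlainTextParser(Parser):
    """Hypothetical subclass for illustration, not part of the module."""

    def parse(self, content, url=""):
        # Plain text carries no markup, so no links are extracted.
        # The [content, links] shape mirrors TextHtmlParser's return value.
        return [content, []]


result = PlainTextParser().parse("hello", url="http://example.com/")
print(result)
```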
- class TextHtmlParser¶
Parses HTML type content and retrieves links (recognized by <a href> and <link href> tags).
- parse(content, url="")¶
Parses HTML page content.
Parameters:
Returns: List with 2 elements: the first is the page’s HTML encoded in Base64, the second contains the links retrieved from that page.
Return type: list
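The return value described above can be illustrated with a small standard-library sketch. It is not the module's implementation: the _LinkCollector helper is hypothetical, resolving relative links against url is an assumption, and the class skips the Parser superclass only to keep the example self-contained.

```python
import base64
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> and <link> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolving against the page URL is an assumption of this sketch.
                self.links.append(urljoin(self.base_url, value))


class TextHtmlParser(object):
    def parse(self, content, url=""):
        collector = _LinkCollector(url)
        collector.feed(content)
        # Base64 operates on bytes, so the page is encoded as UTF-8 first.
        encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
        return [encoded, collector.links]


page = (
    '<html><head><link href="/style.css"></head>'
    '<body><a href="about.html">About</a></body></html>'
)
encoded, links = TextHtmlParser().parse(page, url="http://example.com/")
print(links)
```

Decoding the first element with base64.b64decode recovers the original page bytes, which is convenient when the result has to travel through a text-only channel.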