The basic API includes the application framework and the routing system (provided by werkzeug.routing) of Brownant.
The app object, which manages the whole crawler system.
Add a URL rule to the app instance.
The URL rule syntax is the same as in Flask apps and other Werkzeug apps.
Dispatch the URL string to the target endpoint function.
Parameters: | url_string – the origin URL string. |
---|---|
Returns: | the return value of calling the dispatched function. |
Mount a supported site to this app instance.
Parameters: | site – the site instance to be mounted. |
---|
Parse the URL string with the url map of this app instance.
Parameters: | url_string – the origin URL string. |
---|---|
Returns: | a tuple of (url, url_adapter, query_args): url is parsed by the standard library urlparse, url_adapter comes from the Werkzeug bound URL map, and query_args is a Werkzeug multidict. |
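The following is a minimal sketch of the parsing step, using only the standard library. It covers the url and query_args parts of the returned tuple; the url_adapter part needs a Werkzeug URL map and is left out. The function name parse_url_sketch is hypothetical, and parse_qs returns a plain dict of lists rather than a Werkzeug multidict.

```python
from urllib.parse import urlparse, parse_qs


def parse_url_sketch(url_string):
    """Split a URL string into a ParseResult and its query arguments.

    Hypothetical helper: the real method also binds the app's URL map
    and returns a url_adapter, which is omitted here.
    """
    url = urlparse(url_string)
    # parse_qs maps each query key to a list of values.
    query_args = parse_qs(url.query)
    return url, query_args


url, query_args = parse_url_sketch("http://www.example.com/item/42?page=2")
print(url.netloc, url.path, query_args)
```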
Validate the ParseResult object.
This method makes sure that parse_url() works as expected even when it meets an unexpected URL string.
Parameters: | url (ParseResult) – the parsed url. |
---|
Raise the RequestRedirect exception to make the app dispatch the current request to another URL.
Parameters: | url – the target URL. |
---|
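Redirecting by raising an exception can be pictured with the sketch below. It is not Brownant's implementation: RedirectSketch stands in for Werkzeug's RequestRedirect, and dispatch_sketch, the handlers mapping, and the max_redirects limit are all assumptions made for illustration.

```python
class RedirectSketch(Exception):
    """Hypothetical stand-in for a redirect exception carrying a target URL."""

    def __init__(self, url):
        super().__init__(url)
        self.url = url


def dispatch_sketch(url, handlers, max_redirects=5):
    """Call the handler for url, following redirects raised by handlers."""
    for _ in range(max_redirects + 1):
        try:
            return handlers[url](url)
        except RedirectSketch as redirect:
            # Dispatch again with the URL carried by the exception.
            url = redirect.url
    raise RuntimeError("too many redirects")


def old_item(url):
    raise RedirectSketch("http://www.example.com/item/42")


def new_item(url):
    return "item page for " + url


handlers = {
    "http://www.example.com/old/42": old_item,
    "http://www.example.com/item/42": new_item,
}
result = dispatch_sketch("http://www.example.com/old/42", handlers)
print(result)
```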
The request object.
The supported-site object, which can be mounted to an app instance.
Parameters: | name – the name of the supported site. |
---|
Play the recorded actions on the target object.
Parameters: | target (Brownant) – the target which receives all recorded actions; normally a brown ant app instance. |
---|
Record a method-calling action.
The actions are expected to be played on a target object.
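Recording and replaying can be sketched as below. The classes SiteSketch and FakeApp are hypothetical, and the add_url_rule arguments are illustrative only, since the real signatures are not shown in this reference.

```python
class SiteSketch:
    """Hypothetical site object that records method calls for later replay."""

    def __init__(self, name):
        self.name = name
        self.actions = []

    def record_action(self, method_name, *args, **kwargs):
        # Store the call instead of performing it now.
        self.actions.append((method_name, args, kwargs))

    def play_actions(self, target):
        # Replay every recorded call, in order, on the target object.
        for method_name, args, kwargs in self.actions:
            getattr(target, method_name)(*args, **kwargs)


class FakeApp:
    """Hypothetical target that only collects the rules it is given."""

    def __init__(self):
        self.rules = []

    def add_url_rule(self, host, rule_string, endpoint):
        self.rules.append((host, rule_string, endpoint))


site = SiteSketch("example")
site.record_action("add_url_rule", "www.example.com",
                   "/item/<int:item_id>", endpoint="spam")
app = FakeApp()
site.play_actions(app)
print(app.rules)
```

Deferring the calls this way lets a site be defined at import time and attached to any number of app instances later.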
The decorator to register the wrapped function with the brown ant app.
All optional parameters of this method are compatible with add_url_rule().
Registered functions or classes must be importable by their qualified names. This differs from Flask and works like a lazy-loading mode: registered objects are loaded only before their first use.
The right way:
@site.route("www.example.com", "/item/<int:item_id>")
def spam(request, item_id):
    pass
The wrong way:
def egg():
    # the function could not be imported by its qualified name
    @site.route("www.example.com", "/item/<int:item_id>")
    def spam(request, item_id):
        pass

egg()
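Why a nested function cannot be registered may be easier to see with a sketch of name-based loading. The helper load_by_qualified_name and the "module:attribute" string format are assumptions for illustration, not Brownant's documented API.

```python
import importlib
import json


def load_by_qualified_name(qualified_name):
    """Import and return a module-level object named "module:attribute".

    Hypothetical helper: the object is imported only when this is called,
    which is the lazy-loading idea described above.
    """
    module_name, _, attr_name = qualified_name.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def egg():
    def spam(request, item_id):
        pass
    return spam


# A module-level function can be found again from its name alone.
dumps = load_by_qualified_name("json:dumps")

# A nested function lives in the local scope of egg(), so no module
# attribute lookup can reach it.
nested = egg()
print(nested.__qualname__)
```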
The base exception of the Brownant framework.
Bases: brownant.exceptions.BrownantException
The given URL or other identity is from a platform which is not supported.
This exception means that no URL rule of the app matching the URL could be found.
Convert the input value into bytes type.
If the input value is a string and can be encoded as UTF-8 bytes, the encoded value is returned. Otherwise, if the encoding fails, the original value is returned unchanged.
Return type: | bytes |
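The behavior described above can be sketched as follows. This is an illustration of the documented contract written for Python 3, not the library's source; the function name to_bytes_sketch is hypothetical.

```python
def to_bytes_sketch(value):
    """Encode text as UTF-8 bytes, returning the input unchanged on failure."""
    if isinstance(value, str):
        try:
            return value.encode("utf-8")
        except UnicodeEncodeError:
            # Encoding failed, so hand back the original value.
            return value
    # Non-text input (for example bytes) passes through untouched.
    return value


encoded = to_bytes_sketch("\u6d4b\u8bd5")
untouched = to_bytes_sketch(b"raw")
# A lone surrogate cannot be encoded as UTF-8.
failed = to_bytes_sketch("\ud800")
print(encoded, untouched)
```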
The declarative API is built around the “dinergate” and the “pipeline property”.
The simple class-based crawler.
To work with unnamed properties such as instances of PipelineProperty, the metaclass DinergateType scans subclasses of this class and names all unnamed members which are instances of cached_property.
The URL template string for generating the crawl target. self can be referenced in the template (e.g. “http://www.example.com/items/{self.item_id}?page={self.page}”).
The fetching target URL.
The default behavior of this property is to build the URL string from URL_TEMPLATE.
Subclasses can override URL_TEMPLATE or use a different implementation.
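A template that references self can be filled in with str.format, as in this minimal sketch. The class ItemPageSketch and its attributes are hypothetical; only the template syntax is taken from the example above.

```python
class ItemPageSketch:
    """Hypothetical crawler-like class that builds its URL from a template."""

    URL_TEMPLATE = ("http://www.example.com/items/"
                    "{self.item_id}?page={self.page}")

    def __init__(self, item_id, page):
        self.item_id = item_id
        self.page = page

    @property
    def url(self):
        # Passing self as a keyword lets the template read its attributes.
        return self.URL_TEMPLATE.format(self=self)


page = ItemPageSketch(item_id=42, page=2)
print(page.url)
```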
Bases: type
The metaclass of Dinergate and its subclasses.
This metaclass gives default names to all members that are instances of cached_property. This is necessary because many pipeline properties are subclasses of cached_property but are not created by decorating functions. They have no built-in __name__, so they may fail to cache values as expected.
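The naming problem and its fix can be sketched without Werkzeug. Everything below is hypothetical: UnnamedPropertySketch imitates a cached property that was not created from a function, and NamingTypeSketch imitates the metaclass by copying the attribute name onto each unnamed member.

```python
class UnnamedPropertySketch:
    """Hypothetical cached property that caches under its own __name__."""

    def __init__(self):
        self.__name__ = None  # no decorated function to take a name from

    def provide_value(self, obj):
        raise NotImplementedError

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Cache in the instance dict; later lookups skip this descriptor.
        if self.__name__ not in obj.__dict__:
            obj.__dict__[self.__name__] = self.provide_value(obj)
        return obj.__dict__[self.__name__]


class NamingTypeSketch(type):
    """Hypothetical metaclass that names unnamed property members."""

    def __new__(mcs, name, bases, namespace):
        for attr_name, member in namespace.items():
            if (isinstance(member, UnnamedPropertySketch)
                    and member.__name__ is None):
                member.__name__ = attr_name
        return super().__new__(mcs, name, bases, namespace)


class CallCounter(UnnamedPropertySketch):
    calls = 0

    def provide_value(self, obj):
        CallCounter.calls += 1
        return CallCounter.calls


class PageSketch(metaclass=NamingTypeSketch):
    visits = CallCounter()


page = PageSketch()
first, second = page.visits, page.visits
print(first, second, CallCounter.calls)
```

Without the metaclass, every such property would cache under the same missing name, so values from different properties could overwrite each other.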
Bases: werkzeug.utils.cached_property
The base class of pipeline properties.
There are three kinds of initial parameters.
A working subclass of PipelineProperty should implement the abstract method provide_value(), which accepts one argument, the instance of Dinergate.
Overriding prepare() is optional in subclasses.
Parameters: | kwargs – the parameters with the three kinds. |
---|
The abstract method which should be implemented by subclasses. It provides the expected value from the subject instance.
Parameters: | obj (Dinergate) – the subject instance. |
---|
the definition of attr_names
Get the attribute of the target object whose name is configured in the attr_names of this instance.
the definition of options
This method is called after the instance is initialized. Subclasses can override the implementation.
In general, the implementation of this method should give default values to options and the members of attr_names.
Example:
def prepare(self):
    self.attr_names.setdefault("text_attr", "text")
    self.options.setdefault("use_proxy", False)
the names of required attributes.
The query argument property. The usage is simple:
class MySite(Dinergate):
    item_id = URLQueryProperty(name="item_id", type=int)
It is equivalent to this:
class MySite(Dinergate):
    @cached_property
    def item_id(self):
        value = self.request.args.get("item_id", type=int)
        if not value:
            raise NotSupported
        return value
A failed conversion to the given type (a ValueError being raised) makes the value fall back to None. This is the same as the behavior of MultiDict.
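The fallback rule can be sketched with a plain dict. The helper get_typed is hypothetical and only imitates the get(key, type=...) call used in the example above.

```python
def get_typed(args, key, type=None, default=None):
    """Look up key and convert it, falling back to default on ValueError."""
    try:
        value = args[key]
    except KeyError:
        return default
    if type is not None:
        try:
            value = type(value)
        except ValueError:
            # A failed conversion is treated like a missing value.
            return default
    return value


good = get_typed({"item_id": "42"}, "item_id", type=int)
bad = get_typed({"item_id": "abc"}, "item_id", type=int)
missing = get_typed({}, "item_id", type=int)
print(good, bad, missing)
```

In the equivalent class shown earlier, a None result then fails the `if not value` check, so an unconvertible query argument ends in NotSupported.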
The text response returned by fetching a network resource.
Getting this property is a network I/O operation the first time. The HTTP request implementation is provided by requests.
The usage example:
class MySite(Dinergate):
    foo_http = requests.Session()
    foo_url = "http://example.com"
    foo_text = TextResponseProperty(url_attr="foo_url",
                                    http_client="foo_http",
                                    proxies=PROXIES)
The element tree built from a text response property. Here is a usage example:
class MySite(Dinergate):
    text_response = "<html></html>"
    div_response = "<div></div>"
    xml_response = (u"<?xml version='1.0' encoding='UTF-8'?>"
                    u"<result>\u6d4b\u8bd5</result>")
    etree = ElementTreeProperty()
    div_etree = ElementTreeProperty(text_response_attr="div_response")
    xml_etree = ElementTreeProperty(text_response_attr="xml_response",
                                    encoding="utf-8")

site = MySite(request)
print(site.etree)  # output: <Element html at 0x1f59350>
print(site.div_etree)  # output: <Element div at 0x1f594d0>
print(site.xml_etree)  # output: <Element result at 0x25b14b0>
New in version 0.1.4: The encoding optional parameter.
The text extracted from an element tree property by XPath. Here is a usage example:
class MySite(Dinergate):
    # omit page_etree
    title = XPathTextProperty(xpath=".//h1[@id='title']/text()",
                              etree_attr="page_etree",
                              strip_spaces=True,
                              pick_mode="first")
    links = XPathTextProperty(xpath=".//*[@id='links']/a/@href",
                              etree_attr="page_etree",
                              strip_spaces=True,
                              pick_mode="join",
                              joiner="|")
New in version 0.1.4: The new option value “keep” of the pick_mode parameter.
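How the three pick_mode values might treat a list of XPath results is sketched below. The semantics are inferred from the option names and the example above, not confirmed by this reference; the function pick_sketch, its defaults, and the empty-string result for "first" on an empty list are assumptions.

```python
def pick_sketch(values, pick_mode="join", joiner=" ", strip_spaces=False):
    """Reduce a list of extracted strings according to pick_mode."""
    if strip_spaces:
        values = [value.strip() for value in values]
    if pick_mode == "first":
        # Assumption: an empty result yields an empty string.
        return values[0] if values else ""
    if pick_mode == "join":
        return joiner.join(values)
    if pick_mode == "keep":
        # Assumption: "keep" returns the list as it is.
        return values
    raise ValueError("unknown pick_mode: %r" % pick_mode)


title = pick_sketch(["  Title  ", "ignored"], "first", strip_spaces=True)
links = pick_sketch(["/a ", " /b"], "join", joiner="|", strip_spaces=True)
kept = pick_sketch(["/a", "/b"], "keep")
print(title, links, kept)
```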