Pythonist: 2012

A pretty nice crawler. Hehe, below is the example from the tutorial. First, dmz_spider.py under spiders:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):

    # Spider name, as passed to "scrapy crawl"
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # Wrap the response in an XPath selector and pick every list entry
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            # Skip entries that have no title or no link
            if not title or not link:
                continue
            item = DmozItem()
            item['title'] = title[0]
            item['link'] = link[0]
            # desc may be empty; store an empty string rather than a list
            item['desc'] = desc[0].strip() if desc else ''
            items.append(item)
        return items
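
To see what the three XPath expressions in parse pick out of each list entry, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of Scrapy. The markup, the example.com URL, and the helper name extract_entries are made up for illustration; only the skip-if-missing logic mirrors the spider.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a dmoz.org listing page.
SAMPLE = """
<ul>
  <li><a href="http://example.com/book">Example Book</a> - A short description.</li>
  <li>No link here, so this entry is skipped.</li>
</ul>
"""


def extract_entries(markup):
    """Roughly mimic //ul/li, a/text(), a/@href and text() with ElementTree."""
    root = ET.fromstring(markup.strip())
    entries = []
    for li in root.iter('li'):
        anchor = li.find('a')  # first direct <a> child of this <li>
        if anchor is None or not anchor.text or not anchor.get('href'):
            continue  # like the spider: no title or no link, skip the entry
        # text() selects the text nodes directly under <li>: the text before
        # the first child plus the tail text that follows each child.
        texts = [li.text] + [child.tail for child in li]
        texts = [t for t in texts if t]
        entries.append({
            'title': anchor.text,
            'link': anchor.get('href'),
            'desc': texts[0].strip() if texts else '',
        })
    return entries


entries = extract_entries(SAMPLE)
```

Only the first entry survives here, because the second one has no link. Real pages are rarely well-formed XML, which is one reason to let Scrapy's selectors do this work.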

Next is items.py:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

Finally, the modified settings.py:

BOT_NAME = 'tutorial'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
DEFAULT_ITEM_CLASS = 'tutorial.items.DmozItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

This runs fine under Scrapy 0.14, but breaks under 0.13... Use scrapy crawl dmoz -o items.json -t json to get the JSON output file you need.
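
Once the crawl finishes, the exported file can be read back with the standard json module. This is a minimal sketch, assuming the feed is a single JSON array of objects carrying the title, link, and desc fields from DmozItem. The helper name load_items and the sample record are hypothetical, and the temporary file only stands in for a real items.json.

```python
import json
import os
import tempfile


def load_items(path):
    """Load the exported feed, assumed to be one JSON array of item dicts."""
    with open(path) as f:
        return json.load(f)


# Hypothetical record written to a temp file in place of real crawl output.
sample = [{'title': 'Example Book',
           'link': 'http://example.com/book',
           'desc': '- A short description.'}]
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'items.json')
with open(path, 'w') as f:
    json.dump(sample, f)

items = load_items(path)
titles = [item['title'] for item in items]
```

With a real crawl, pointing load_items at the items.json written by the command should be enough, provided the output has the assumed shape.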

Sunday, March 11, 2012

scrapy