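Zope is an application server that offers "functionality" similar to Django's, but it predates Django by many years.
The interesting part is that it maps module paths to web URL paths, and reuses different objects that way. However, Zope 2's RDBMS support is not very good, and many of its implementations are not particularly "Pythonic", which led to the later rise of Django. Zope 3 aims to fix the problems of Zope 2; see this document.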
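The path-to-object mapping can be illustrated with a small sketch. This is not Zope's actual API: Folder, Page, and traverse are hypothetical names made up for the illustration, and it only assumes that each URL segment is looked up as a child of the previous object.

```python
# Hypothetical illustration of URL-to-object traversal; not Zope's real API.
class Folder(object):
    """A container whose children are addressed by URL segment."""

    def __init__(self, **children):
        self.children = children

    def __getitem__(self, name):
        return self.children[name]


class Page(object):
    """A leaf object that can render itself."""

    def __init__(self, body):
        self.body = body

    def render(self):
        return self.body


def traverse(root, url_path):
    """Walk the object tree one URL segment at a time."""
    obj = root
    for segment in url_path.strip('/').split('/'):
        if segment:  # skip empty segments, so '/' resolves to the root
            obj = obj[segment]
    return obj


# The same Page object is reused under two different URL paths.
shared = Page('hello')
root = Folder(docs=Folder(index=shared), blog=Folder(latest=shared))
```

Here traverse(root, '/docs/index') and traverse(root, '/blog/latest') both resolve to the same shared object, which is roughly the kind of reuse described above.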
I have no particular need for it at the moment, and Django works well enough for me, so I'll just leave this note here.
Friday, April 6, 2012
Sunday, March 11, 2012
scrapy
A very nice crawler. Below is the example from the tutorial. First, dmz_spider.py under the spiders directory:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider (BaseSpider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse (self, response):
        hxs = HtmlXPathSelector (response)
        # every <li> directly under a <ul> is a candidate entry
        sites = hxs.select ('//ul/li')
        items = list ()
        for site in sites:
            title = site.select ('a/text()').extract ()
            link = site.select ('a/@href').extract ()
            desc = site.select ('text()').extract ()
            # skip entries that have no link text or no href
            if len (title) > 0:
                title = title [0]
            else:
                continue
            if len (link) > 0:
                link = link [0]
            else:
                continue
            # keep desc a string even when the <li> has no text of its own
            if len (desc) > 0:
                desc = desc[0].strip()
            else:
                desc = ''
            item = DmozItem ()
            item ['title'] = title
            item ['link'] = link
            item ['desc'] = desc
            items.append (item)
        return items
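To see what the three XPath expressions pick out without running a crawl, here is a rough standard-library approximation. It uses xml.etree.ElementTree instead of Scrapy's selectors, the SAMPLE markup is invented for the illustration, and taking the text that follows the link as the description is a simplification of what 'text()' returns in the spider.

```python
import xml.etree.ElementTree as ET

# Invented sample markup; the real dmoz pages are not reproduced here.
SAMPLE = """<html><body><ul>
<li><a href="http://example.com/book">A Book</a> - a short description</li>
<li>no link here</li>
</ul></body></html>"""


def extract(markup):
    """Approximate the spider's parse() logic with ElementTree."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall('.//ul/li'):      # like '//ul/li'
        a = li.find('a')
        if a is None or not a.text or not a.get('href'):
            continue                         # same skip rule as the spider
        desc = (a.tail or '').strip()        # text following the link
        rows.append({'title': a.text, 'link': a.get('href'), 'desc': desc})
    return rows
```

Calling extract(SAMPLE) keeps only the first list entry, since the second one has no link.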
Next, items.py:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field ()
    link = Field ()
    desc = Field ()
Finally, the modified settings.py:
BOT_NAME = 'tutorial'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
DEFAULT_ITEM_CLASS = 'tutorial.items.DmozItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
This runs fine under 0.14, but breaks under 0.13... You can use scrapy crawl dmoz -o items.json -t json to get the JSON output file you need.
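Once items.json exists, it can be read back with the standard json module. The sketch below assumes the export is a single JSON array of objects carrying the title, link, and desc keys from DmozItem; load_items and summarize are hypothetical helper names, not part of Scrapy.

```python
import json


def load_items(path):
    """Read the exported file, assumed to hold one JSON array of items."""
    with open(path) as f:
        return json.load(f)


def summarize(items):
    """Return one 'title <link>' line per scraped item."""
    return ['%s <%s>' % (item['title'], item['link']) for item in items]
```

For example, summarize(load_items('items.json')) gives a quick list to eyeball after a crawl.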