A super-efficient concurrent spider crawls data from PTT under asynchronous coroutines fast and furiously.
- Python 3.6+
- At least
Python 3.4
foraiohttp
andasyncio.coroutine
features - At least
Python 3.5
forasync/await
syntax - At least
Python 3.6
forFormatted string literals
- At least
aiohttp
lxml
- You need create an asyncio event loop, and a helper function
coroutine_runner()
to execute the coroutine task.
loop = asyncio.get_event_loop()
def coroutine_runner(coroutine):
return loop.run_until_complete(coroutine)
- Build a spider in context-manager like to ensure the spider exits normally.
num_page = 5 # collect how many page
with PTTSpider(board='movie', loop=loop) as spider:
''' get total page of board
return: (number) `total_page` of current board
'''
total_pages = coroutine_runner(spider.get_total_page_num())
''' get pages of posts meta
return: (list) pages of post `metas`
'''
pages = range(total_pages, total_pages - num_page, -1)
coros = asyncio.gather(
*(spider.get_meta(page) for page in pages))
metas = coroutine_runner(coros)
metas = list(itertools.chain.from_iterable(metas))
''' get posts content
return: (list) lots of `posts` dict()
'''
coros = asyncio.gather(
*(spider.get_post(meta['link']) for meta in metas))
posts = coroutine_runner(coros)