Introduction to Web Crawler - インターファーム開発部ブログ

Web Crawler

Good morning, this is Roy.

Today I'm going to introduce something interesting: web crawlers.

A web crawler gives us an easier way to gather information from websites (Google is an enormous crawler), and we can easily make our own.

Let's start with the simplest case:
We want to gather information from a website. By hand, we would open a webpage, choose what we want, then copy and paste it. A web crawler does the same thing. Let's see a simple example:

I want to save some kawaii emoticons from http://www.mengma.moe/

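# coding=utf-8
# Python 2 script: save every emoticon on the page to emotions.txt

import urllib2
import re
from bs4 import BeautifulSoup

# Output file, one emoticon per line
kawaii = open('emotions.txt', 'w+')

url = "http://mengma.moe/"

# Open the page and read the raw HTML
html = urllib2.urlopen(url).read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# Each emoticon is the value of an <input> tag carrying this title
# (the title means "hover the mouse over me to copy directly with Ctrl+C")
for emotion in soup.find_all('input', title='鼠标浮动在我上面可以直接Ctrl+C复制噢'):
    moe = emotion.get('value').encode('utf-8')
    kawaii.write(moe + '\n')

kawaii.close()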

urllib2, re: part of the Python standard library
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

The code uses "html = urllib2.urlopen(url).read()" to open the page, then uses "moe = emotion.get('value').encode('utf-8')" to pick out what we want, and saves it to a file.
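
For anyone on Python 3, where urllib2 became urllib.request, the same open, pick, save steps could be sketched roughly like this. It is only a sketch under a few assumptions: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra installs, it assumes the page is UTF-8, and the names ValueCollector, extract_values and save_values are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ValueCollector(HTMLParser):
    """Collects the value attribute of every <input> whose title matches."""

    def __init__(self, title):
        super().__init__()
        self.title = title
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        attributes = dict(attrs)
        if tag == "input" and attributes.get("title") == self.title:
            value = attributes.get("value")
            if value is not None:
                self.values.append(value)


def extract_values(html, title):
    """Pick out the wanted values from an HTML string."""
    collector = ValueCollector(title)
    collector.feed(html)
    collector.close()
    return collector.values


def save_values(url, title, path):
    """Open the page, pick out the values, and save them one per line."""
    html = urlopen(url).read().decode("utf-8")  # assumes a UTF-8 page
    with open(path, "w", encoding="utf-8") as out:
        for value in extract_values(html, title):
            out.write(value + "\n")
```

Calling save_values with the URL, the title text and 'emotions.txt' should then do roughly what the script above does. Keeping extract_values separate from the download means the picking step can be tried on any HTML string without touching the network.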

So with a crawler, you can gather information from millions of pages using only code.
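
To go from one page to many, a crawler usually follows the links it finds and remembers which pages it has already seen. Here is a minimal Python 3 sketch of that idea using only the standard library. The names LinkCollector, fetch and crawl are hypothetical, the page limit is an arbitrary choice, and a real crawler would need more care (error handling, delays between requests, and so on).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and return its text (assumes UTF-8)."""
    return urlopen(url).read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl; returns {url: html} for the pages visited."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to load
        pages[url] = html
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The fetch_page parameter is there so the loop can be tried against a fake site (a dict of URL to HTML, for instance) before pointing it at a real one.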

If you are interested in web crawling, Scrapy would be a powerful friend. Scrapy is a fast, high-level web crawling and screen scraping framework, used to crawl websites and extract structured data from their pages.

Have a good weekend, and next time let's talk more about web crawlers.