Lukas Z's Blog

How to Scrape Wikiquote

Wikiquote logo

I like to use the command-line tool fortune, which displays a random quote from a collection of quote files.

I wanted to add quotes from Wikiquote. But it turns out that Wikiquote is difficult to parse and the API is no help. (The API gives you a structured document but the actual content is just one large wall of text that has to be parsed just like the HTML page.)

So, after giving up on scraping the rendered web page, I had the idea to use Wikiquote’s edit page instead, which shows the page’s raw wikitext.

Here’s how it works: Let’s say you want the quotes from the Richard Feynman page. Open it in your browser, click Edit, and you get the edit page for Richard Feynman.
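To make that step concrete, here is a minimal sketch of fetching the edit page and pulling the text out of its textarea. It is not the code from my scraper. The helper names (edit_url, extract_wikitext, fetch_wikitext) are made up for this example, and the URL pattern is an assumption based on how MediaWiki edit links usually look. A real script would probably also want a descriptive User-Agent header and some error handling.

```ruby
require "cgi"
require "net/http"
require "uri"

# Build the edit-page URL for a page title.
# Assumption: the usual MediaWiki pattern index.php?title=...&action=edit.
def edit_url(title)
  query = URI.encode_www_form(title: title.tr(" ", "_"), action: "edit")
  "https://en.wikiquote.org/w/index.php?#{query}"
end

# Return the contents of the first <textarea> in the HTML, with HTML
# entities decoded. Returns an empty string if there is no textarea.
def extract_wikitext(html)
  match = html.match(%r{<textarea[^>]*>(.*?)</textarea>}m)
  return "" unless match

  CGI.unescapeHTML(match[1])
end

# Download the edit page and return its wikitext.
# This performs a network request, so it is only defined here, not called.
def fetch_wikitext(title)
  extract_wikitext(Net::HTTP.get(URI(edit_url(title))))
end
```

Calling fetch_wikitext("Richard Feynman") should then hand back the same text you see in the edit box.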

Voilà. It’s much easier to parse! Simply take all lines from the <textarea> that start with one or two asterisks (depending on whether you also want the source), then filter out a few things like formatting and lines that only contain quotes. A much better scraping experience.
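The filtering described above can be sketched roughly like this. Again, this is an illustration rather than the actual scraper: extract_quotes and clean_markup are hypothetical names, and the cleanup rules (wiki links, bold/italic apostrophes, HTML tags) are my guess at what "a few things like formatting" covers. Real pages will need more cases.

```ruby
# Strip some common wiki markup from a single line.
# Assumption: these three rules are illustrative, not exhaustive.
def clean_markup(text)
  text
    .gsub(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/, '\1') # [[Target|label]] -> label
    .gsub(/'{2,}/, "")                             # ''italic'' and '''bold''' markers
    .gsub(/<[^>]+>/, "")                           # inline HTML tags such as <br>
    .strip
end

# Return the cleaned quote lines from a block of wikitext.
# Lines starting with exactly one asterisk are always kept; lines starting
# with exactly two asterisks (the sources) are kept only on request.
# Lines with three or more asterisks, and all other lines, are ignored.
def extract_quotes(wikitext, include_sources: false)
  wikitext.each_line.filter_map do |line|
    match = line.match(/\A(\*{1,2})(?!\*)\s*(.*)/)
    next unless match
    next if match[1] == "**" && !include_sources

    text = clean_markup(match[2])
    text unless text.empty? # drop lines that are empty after cleanup
  end
end
```

Passing include_sources: true returns the source lines interleaved with the quotes, in page order.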

So yeah, that’s my idea/innovation. I wrote a scraper in Ruby that you can find on GitHub which generates the files needed by fortune. Check out the README there.
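In case you are wondering what "the files needed by fortune" look like: a fortune source file is plain text with a line containing only a percent sign after each entry. Here is a small sketch of producing that format. The fortune_file helper is a made-up name for this example, and the scraper on GitHub may do it differently.

```ruby
# Join quotes into the plain-text format fortune reads:
# each entry is followed by a line containing only "%".
def fortune_file(quotes)
  quotes.map { |quote| "#{quote.strip}\n%\n" }.join
end
```

Write the result to a file (say, feynman) and run strfile on it to build the matching .dat index that fortune uses to pick a random entry.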

(Wikiquote image taken from here.)

P.S.: You can follow me on Twitter.