Ripping Up Wikipedia, Subjugating It
There’s an article on Shane Vitarana’s blog about taking apart Wikipedia pages with Hpricot. His commenters also point out the Wikipedia API, which is a fantastic way of getting at the raw data—in YAML even! Still, it gives you the raw Wiki text, untransformed. Hence Hpricot.
Anyway, I just want to make a couple suggestions for his script. These aren’t a big deal, just some shortcuts in Hpricot, which aren’t documented yet. Let’s start getting them known, you know?
First, a simple one: the Element#attributes hash can be used through Element#[].
    # change /wiki/ links to point to full wikipedia path
    (content/:a).each do |link|
      unless link['href'].nil?
        if link['href'] =~ %r!^/wiki/!
          link['href'].sub!('/wiki/', 'http://en.wikipedia.org/wiki/')
        end
      end
    end
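The rewrite itself is ordinary String work; here is the same logic run on a bare string, outside Hpricot entirely, just to show what the regex and the sub are doing (the article title is a made-up example):

```ruby
# The same /wiki/ rewrite applied to a plain string, no Hpricot
# involved, to show what the regex and sub accomplish.
href = '/wiki/Ruby_(programming_language)'
if href =~ %r!^/wiki/!
  href = href.sub('/wiki/', 'http://en.wikipedia.org/wiki/')
end
puts href  # => http://en.wikipedia.org/wiki/Ruby_(programming_language)
```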
In the section where Shane loops through a bunch of CSS selector and removes everything that matches, I think it’s quicker to join the separate CSS selectors with commas, to form a large selector which Hpricot can find with a single pass.
    # remove unnecessary content and edit links
    (content/items_to_remove.join(',')).remove
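For anyone following along, the comma-join is plain Array#join; the selectors below are placeholders, not necessarily the ones from Shane’s script. Joining them builds one big CSS selector, so Hpricot only has to walk the tree once instead of once per selector:

```ruby
# Hypothetical selector list; the real items_to_remove comes
# from Shane's script. Joining with commas yields one combined
# CSS selector string for a single-pass search.
items_to_remove = ['#toc', 'div.editsection', 'table.metadata']
selector = items_to_remove.join(',')
puts selector  # => #toc,div.editsection,table.metadata
```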
And lastly, I’ve checked in a new swap method which replaces an element with an HTML fragment.
    # replace links to create new entries with plain text
    (content/"a.new").each do |link|
      link.swap(link['title'])
    end
So, yeah. Little things. Hey, are there any other methods you’re itching for?
DerGuteMoritz

Why swap and not replace? :)

Ola Bini

Maybe because it’s a whopping 3 bytes larger! =)

Manfred

Because replace replaces parts of the contents of a thing and swap replaces the complete content of something.

shane

Thanks for swap, why! I really didn’t like the end.remove syntax. I was going to hack Elements::before but this is much cleaner.

FlashHater

W00t, _why is back in the blogging mood, I am happy. I subjugate the Second Life status page and make it control my MPD server during downtimes.

why

I guess I felt like replace is used to replace an Element obj with an Element obj. In this case, an Element is getting replaced with a String (which is parsed and turned into an Element).

lukfugl

Regarding (content/items_to_remove.join(',')).remove, why not (content/items_to_remove).remove? Just define:

Of course, condensing and refactoring as necessary.
appleman
One request:
Ability to figure out the distinct path of an element/text node by calling a method on the node. That way, you could always get to that node assuming the document itself does not change. This is opposite of the current approach, where you know the path in advance and use it to get to the data you want. If you don’t know exactly where your data will be, this will help you find its path and find it again later on. It would be even better to analyze the node’s path and figure out the most effective and minimal path (whether in xpath or css) to get back to that node…
Perhaps the ability to figure out how deep you are in a tree, how many siblings you have, and where you rank among those siblings.
I believe you can do these things by building a recursive method using the traverse_text method and the parent method, but it would be nice if it were captured during the traversal of the tree.
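The parent-walking half of appleman’s idea can be sketched without Hpricot at all. Here is a toy node class (TinyNode, nth, and css_path are all made-up names for illustration, not Hpricot API) that reports a CSS-style path by climbing through its parents and counting same-named siblings:

```ruby
# Toy sketch of computing a node's distinct path by walking up
# through parents; none of this is Hpricot, just an illustration
# of the traversal appleman describes.
class TinyNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name, @parent, @children = name, parent, []
    parent.children << self if parent
  end

  # 1-based position among siblings that share this tag name
  def nth
    return 1 unless parent
    parent.children.select { |c| c.name == name }.index(self) + 1
  end

  # CSS-ish selector built by recursing through the parents
  def css_path
    here = "#{name}:nth-of-type(#{nth})"
    parent ? "#{parent.css_path} > #{here}" : here
  end
end

root = TinyNode.new('html')
body = TinyNode.new('body', root)
TinyNode.new('p', body)
p2 = TinyNode.new('p', body)
puts p2.css_path  # => html:nth-of-type(1) > body:nth-of-type(1) > p:nth-of-type(2)
```

A real version on Hpricot nodes would walk the same way, just using the library’s own parent and sibling accessors.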
tobinibot
What about a way to set http headers, like user-agent? Or does that already exist, and I’m just missing it?
shanethepain
tobinibot: I don’t remember seeing the user-agent being set in the source. The HTML returned from Wikipedia appends a link to the original article in the Notes section, which is probably due to an unknown user-agent.
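On tobinibot’s question: Hpricot itself doesn’t fetch pages, so the header gets set on whichever HTTP client hands it the HTML. A sketch with Ruby’s stdlib Net::HTTP (the agent string and URL here are made up):

```ruby
require 'net/http'
require 'uri'

# Build a GET request with a custom User-Agent header; nothing
# is sent over the network here, we just prepare the request.
uri = URI.parse('http://en.wikipedia.org/wiki/Ruby')
req = Net::HTTP::Get.new(uri.request_uri)
req['User-Agent'] = 'MyWikiRipper/0.1'  # hypothetical agent string

puts req['User-Agent']  # => MyWikiRipper/0.1
# To actually fetch:
#   res = Net::HTTP.start(uri.host, uri.port) { |h| h.request(req) }
```

open-uri accepts headers too, as a hash of options passed to open, if you’d rather stay with the one-liner style.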