Ripping Up Wikipedia, Subjugating It
There’s an article on Shane Vitarana’s blog about taking apart Wikipedia pages with Hpricot. His commenters also point out the Wikipedia API, which is a fantastic way of getting at the raw data—in YAML even! Still, it gives you the raw Wiki text, untransformed. Hence Hpricot.
Anyway, I just want to make a couple suggestions for his script. These aren’t a big deal, just some shortcuts in Hpricot, which aren’t documented yet. Let’s start getting them known, you know?
First, a simple one: the Element#attributes hash can be used through Element#[].
    # change /wiki/ links to point to full wikipedia path
    (content/:a).each do |link|
      unless link['href'].nil?
        if link['href'] =~ %r!^/wiki/!
          link['href'].sub!('/wiki/', 'http://en.wikipedia.org/wiki/')
        end
      end
    end
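The rewrite itself is ordinary String work; here is the same logic run on a bare string, outside Hpricot entirely, just to show what the regex and the sub are doing (the article title is a made-up example):

```ruby
# The same /wiki/ rewrite applied to a plain string, no Hpricot
# involved, to show what the regex and sub accomplish.
href = '/wiki/Ruby_(programming_language)'
if href =~ %r!^/wiki/!
  href = href.sub('/wiki/', 'http://en.wikipedia.org/wiki/')
end
puts href  # => http://en.wikipedia.org/wiki/Ruby_(programming_language)
```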
In the section where Shane loops through a bunch of CSS selector and removes everything that matches, I think it’s quicker to join the separate CSS selectors with commas, to form a large selector which Hpricot can find with a single pass.
    # remove unnecessary content and edit links
    (content/items_to_remove.join(',')).remove
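For anyone following along, the comma-join is plain Array#join; the selectors below are placeholders, not necessarily the ones from Shane’s script. Joining them builds one big CSS selector, so Hpricot only has to walk the tree once instead of once per selector:

```ruby
# Hypothetical selector list; the real items_to_remove comes
# from Shane's script. Joining with commas yields one combined
# CSS selector string for a single-pass search.
items_to_remove = ['#toc', 'div.editsection', 'table.metadata']
selector = items_to_remove.join(',')
puts selector  # => #toc,div.editsection,table.metadata
```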
And lastly, I’ve checked in a new swap method which replaces an element with an HTML fragment.
    # replace links to create new entries with plain text
    (content/"a.new").each do |link|
      link.swap(link['title'])
    end
So, yeah. Little things. Hey, are there any other methods you’re itching for?
DerGuteMoritz

Why swap and not replace? :)

Ola Bini

Maybe because it’s a whopping 3 bytes larger! =)

Manfred

Because replace replaces parts of the contents of a thing and swap replaces the complete content of something.

shane

Thanks for swap, why! I really didn’t like the end.remove syntax. I was going to hack Elements::before but this is much cleaner.

FlashHater

W00t, _why is back in the blogging mood, I am happy. I subjugate the Second Life status page and make it control my MPD server during downtimes.

why

I guess I felt like replace is used to replace an Element obj with an Element obj. In this case, an Element is getting replaced with a String (which is parsed and turned into an Element).

lukfugl

Regarding (content/items_to_remove.join(',')).remove, why not (content/items_to_remove).remove? Just define:

Of course, condensing and refactoring as necessary.
appleman
One request:
Ability to figure out the distinct path of an element/text node by calling a method on the node. That way, you could always get to that node assuming the document itself does not change. This is opposite of the current approach, where you know the path in advance and use it to get to the data you want. If you don’t know exactly where your data will be, this will help you find its path and find it again later on. It would be even better to analyze the node’s path and figure out the most effective and minimal path (whether in xpath or css) to get back to that node…
Perhaps the ability to figure out how deep you are in a tree, how many siblings you have, and where you rank among those siblings.
I believe you can do these things by building a recursive method using the traverse_text method and the parent method, but it would be nice if it were captured during the traversal of the tree.
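The parent-walking half of appleman’s idea can be sketched without Hpricot at all. Here is a toy node class (TinyNode, nth, and css_path are all made-up names for illustration, not Hpricot API) that reports a CSS-style path by climbing through its parents and counting same-named siblings:

```ruby
# Toy sketch of computing a node's distinct path by walking up
# through parents; none of this is Hpricot, just an illustration
# of the traversal appleman describes.
class TinyNode
  attr_reader :name, :parent, :children

  def initialize(name, parent = nil)
    @name, @parent, @children = name, parent, []
    parent.children << self if parent
  end

  # 1-based position among siblings that share this tag name
  def nth
    return 1 unless parent
    parent.children.select { |c| c.name == name }.index(self) + 1
  end

  # CSS-ish selector built by recursing through the parents
  def css_path
    here = "#{name}:nth-of-type(#{nth})"
    parent ? "#{parent.css_path} > #{here}" : here
  end
end

root = TinyNode.new('html')
body = TinyNode.new('body', root)
TinyNode.new('p', body)
p2 = TinyNode.new('p', body)
puts p2.css_path  # => html:nth-of-type(1) > body:nth-of-type(1) > p:nth-of-type(2)
```

A real version on Hpricot nodes would walk the same way, just using the library’s own parent and sibling accessors.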
tobinibot
What about a way to set http headers, like user-agent? Or does that already exist, and I’m just missing it?
shanethepain
tobinibot: I don’t remember seeing the user-agent being set in the source. The HTML returned from Wikipedia appends a link to the original article in the Notes section, which is probably due to an unknown user-agent.
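On tobinibot’s question: Hpricot itself doesn’t fetch pages, so the header gets set on whichever HTTP client hands it the HTML. A sketch with Ruby’s stdlib Net::HTTP (the agent string and URL here are made up):

```ruby
require 'net/http'
require 'uri'

# Build a GET request with a custom User-Agent header; nothing
# is sent over the network here, we just prepare the request.
uri = URI.parse('http://en.wikipedia.org/wiki/Ruby')
req = Net::HTTP::Get.new(uri.request_uri)
req['User-Agent'] = 'MyWikiRipper/0.1'  # hypothetical agent string

puts req['User-Agent']  # => MyWikiRipper/0.1
# To actually fetch:
#   res = Net::HTTP.start(uri.host, uri.port) { |h| h.request(req) }
```

open-uri accepts headers too, as a hash of options passed to open, if you’d rather stay with the one-liner style.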