

No, XPath on Messy HTML is Just as Easy in Ruby

by why in inspect

You think XPath is easier in Javascript than in Ruby when it comes to invalid HTML? I’ve heard this in a lot of correspondence over the past week. Because Javascript has the DOM, right?

Use HTree+REXML. HTree cleans and REXML peppers and gobbles. Here’s a hairy, little method that will save some pain:

 require 'htree'
 require 'rexml/document'
 require 'open-uri'

 # Fetch a page, let HTree tidy it, and hand the <html> element to REXML.
 def read_xhtml_from( uri )
   open( uri ) { |f| HTree.parse f }.each_child do |child|
     # only consider children that carry a tag name
     if child.respond_to? :qualified_name
       doc = ""; child.display_xml( doc )
       if child.qualified_name == 'html'
         return REXML::Document.new( doc )
       end
     end
   end
 end
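
Why bother with the HTree detour at all? Here is a rough sketch of the problem, using nothing but REXML. The helper name rexml_accepts? and both markup strings are made up for illustration; the point is only that REXML wants well-formed XML and raises on tag soup.

```ruby
require 'rexml/document'

# Hypothetical helper: true if REXML can parse the markup, false if it raises.
def rexml_accepts?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

messy = '<p>one<br>two</p>'    # <br> is never closed, as in everyday HTML
clean = '<p>one<br/>two</p>'   # same content, but well-formed

puts rexml_accepts?(messy)
puts rexml_accepts?(clean)
```

The messy string is the kind of thing a cleaning step has to repair before REXML will touch it.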

Okay, so. How to use it? That nice REXML way you’re already used to.

 html = read_xhtml_from "http://redhanded.hobix.com/"
 html.each_element( "//div[@class='entryFooter']" ) do |e|
   puts e.text( "./a[starts-with(@href, 'http://redhanded.hobix.com/')]" )
 end
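
If you want to poke at the XPath half without HTree or a network connection, here is a minimal sketch on an inline string. The markup, the example.com URLs and the method name footer_link_texts are all assumptions standing in for an already cleaned page; only the each_element and text calls mirror the snippet above.

```ruby
require 'rexml/document'

# Hypothetical, already well-formed markup standing in for a cleaned page.
XHTML = <<~XML
  <html>
    <body>
      <div class="entry">Not a footer</div>
      <div class="entryFooter"><a href="http://example.com/one">First</a></div>
      <div class="entryFooter"><a href="http://other.example/two">Second</a></div>
    </body>
  </html>
XML

# Collect the text of footer links whose href starts with the given prefix.
def footer_link_texts(xml, prefix)
  doc = REXML::Document.new(xml)
  texts = []
  doc.each_element("//div[@class='entryFooter']") do |e|
    # text(path) gives nil when no child matches the path
    texts << e.text("./a[starts-with(@href, '#{prefix}')]")
  end
  texts.compact
end

p footer_link_texts(XHTML, 'http://example.com/')
```

Footers with no matching link come back as nil from text, so the sketch drops them with compact before returning.
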
said on 26 Aug 2005 at 14:59

Interesting…I might have to use this on my Ruby Greasemonkey replacement, which I am coding as we speak.

said on 26 Aug 2005 at 15:14

Hmmm, I tried the example code and got this error:

 No such file to load -- iconv (LoadError)

This was from the encoder.rb file of HTree which obviously requires iconv. A Google search did not turn up an obvious source for iconv…where can I get it?

said on 26 Aug 2005 at 16:02

MrCode: May I propose a name for Ruby greasemonkey?


said on 26 Aug 2005 at 16:29

Very clever. What’s HTree got over HTMLTools?

said on 26 Aug 2005 at 16:31

MenTaL: Is there any significance to that name, beyond the monkey part? It doesn’t ring any bells for me.

But either way I do like it because I think paying homage to Greasemonkey (since it is clearly the inspiration for this) is a good gesture. I read the “Dive Into Greasemonkey” online book and was pretty impressed. Even though I don’t consider myself a JavaScript hacker (yet), I understood most of what I read. But opening up “web hacking” to Ruby developers is a worthwhile endeavor, even if Greasemonkey itself is a good system.

To let you know, my first code name for this project was Mortar (a play on WEBrick of course), but a Google search turned up another Ruby project of that name. The current code name is Wonderland, which I do kind of like, but I’m not that invested in it yet. That name comes from my first Hoodwink’d post where I felt like using Hoodwink’d was like being on the other side of the web’s looking glass (a la Alice in Wonderland.) But the metaphor is more applicable to just Hoodwink’d and not the entire Greasemonkey realm, so I think Starmonkey is a good contender.

said on 26 Aug 2005 at 17:10

MrCode: The Star Monkey appears in the Poignant Guide. This is why StarMonkey would be a good name. Of course, greasybacon might also be a good name, but it might tend to scare some cholesterol-sensitive folk away.

Also, thanks for creating a GreaseMonkey replacement in Ruby. Having such a thing that runs with any browser (not just the FireFox) will help boost Ruby. So now (when you’re done) we’ll be able to create Web apps in Ruby (Rails) and we’ll be able to look at Web apps with Ruby (StarMonkey?) and at that point we begin to actually take over the world. Seriously. No joke. Not kidding.

said on 26 Aug 2005 at 17:37

Hehehehe, I see now. Though I’m obviously a big fan of RedHanded and a lot of _why’s work, I have yet to read the Poignant Guide. Mostly because I’ve been programming in Ruby for a while and don’t really need to read a book on it. But I think now I’ll just have to read it, if just for the entertainment value (and also so I fully understand all the Fox, Chunky Bacon and Star Monkey jokes.)

The StarMonkey name is definitely taking the lead now.

BTW, anyone have an answer for my iconv question from above? If not, I’ll just use HTML Tools (thanks jvoorhis.)

said on 26 Aug 2005 at 18:07

MrCode, what platform are you running Ruby on? My guess would be Windows, in which case the iconv module may just not work at all.

said on 26 Aug 2005 at 18:44

Hmmm, there is a possible problem with the name StarMonkey: Google shows a video game and some companies with that name. Maybe we can “compete” with them, but it might be better to have another name. Just something to keep in mind.

I guess in the end I can name it whatever I want, but I certainly don’t mind a little community feedback, since I blatantly stole the idea from the denizens of RedHanded (though as Edison said, 1% inspiration, 99% perspiration.)

said on 26 Aug 2005 at 18:48

rcoder: Yeah these days I mostly develop on Windows, though I’ll probably attempt a switch to Linux in the next few months. But at the moment it is most convenient to develop on my Windows laptop supplied by my company.

I guess that means no iconv (and therefore no htree) for me. HTML Tools it is.

said on 26 Aug 2005 at 18:50

MrCode: don’t say that. Ruby comes with iconv.

said on 26 Aug 2005 at 19:10

 C:\>ruby -v -e "require 'iconv'"
 ruby 1.8.2 (2004-12-25) [i386-mswin32]
 c:/ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:18:in `require__': No such file to load -- iconv (LoadError)
         from c:/ruby/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:18:in `require'
         from -e:1

said on 26 Aug 2005 at 19:32

MrCode: If you run into problems with Ruby on Windows, just install Cygwin and you have a magic POSIX environment running on top of Windows with your favorite shell and everything. Cygwin has an optional Ruby package. You can have the pleasure of running two separate copies of Ruby on one machine (handy for testing too).

said on 26 Aug 2005 at 19:37

I thought GreasyBacon was pretty clever as a name for the WebProxy GreaseMonkey replacement, but then again, I think a lot of things are clever that are, in fact, not.

said on 26 Aug 2005 at 20:13

Thanks for the tip. HTree breezed through a page that I couldn’t get to work with HTML Tools. Very nice.

said on 26 Aug 2005 at 20:34

JavaScript’s DOM support is generally considered way too long-winded even by people who are sure to be right on this. (You know who you are!)

I think future goals are to simplify all this a lot, though E4X seems to be getting there already.

said on 26 Aug 2005 at 20:46

wow…. if I had known this three weeks ago, I would have been quite happy.

said on 26 Aug 2005 at 21:27

MrCode: regarding alternate names:

MonkeyBacon – it rhymes with ChunkyBacon while still evoking thoughts of grease. Potentially a very good choice, but animal rights folk will wonder if the bacon is made from Monkey which could possibly lead to riots in some areas (just so you’re prepared).

ChunkyMonkey – I’m rather fond of that Ben & Jerry’s flavor and it still has Poignant tie-ins.

Other completely different types of possible names:

Chameleon: What does this thing do? It changes the appearance of web pages chameleon-like.

Cloak-n-Dagger – why? Just ‘cuz it sounds cool.

Scarlet’s Web – has Ruby connotations (Scarlet) and Web connotations (Web) and sounds like a popular children’s book to boot. Why not be the first kid on your block to use a possessive in a project name?

RubyOnBricks (or the very similar RubyOnBlocks) – Ruby, Webrick – what more could you want and it sounds a bit like RubyOnRails which we all know is very popular these days. Downside: it sounds like Ruby lost its wheels and is up on the blocks. The Agile Programmer’s Guide to Ruby On Bricks should hit the shelves any day now.

said on 27 Aug 2005 at 04:23

Well, some time ago (last two months or so) I saw a DOM Javascript in a comments post at www.web-graphics.com (but can’t remember the post anymore).

said on 27 Aug 2005 at 09:31


The One-Click Installer really should include it; it’s not hard to get working.

I’ve just packaged the required binaries for easy extraction (632k). The package includes instructions on where to get them independently if you’re suitably paranoid. Please let me know if you’re a happy customer.

(I got my directions from the Rails wiki.)

said on 27 Aug 2005 at 10:23

Thanks a lot Dave, that did the trick.

I agree it should be included in the One-Click Installer. Maybe we should send an email to Curt Hibbs?

Still, I’m tempted to continue using HTML Tools for Wonderland since my fellow Windows users may not know the tricks needed to get iconv, and therefore HTree, working.

said on 28 Aug 2005 at 14:42

Yeah, I’ve been using HTree to do this very thing for awhile now with the FeedTools gem.

said on 30 Aug 2005 at 02:50

This is really nice, thx for posting.

I was using HTML Tools and it was way too slow. No need for htmltidy anymore. Just pure Ruby.

said on 12 Feb 2006 at 08:46

thanks dave burt, I was wondering how to get htree to work on windows as well

said on 15 Feb 2006 at 07:40

Hmm, does htree really produce real xhtml? Parsing an invalid page with htree and then feeding it to the W3C validator still shows invalid markup (albeit with far fewer errors).

Try it with: http://www.craiglarman.com/

 html = read_xhtml_from "http://www.craiglarman.com/"
 print html

Write the result to a file and paste that into http://validator.w3.org/ (adding the XHTML doctype manually).
