No, XPath on Messy HTML is Just as Easy in Ruby
You think XPath is easier in JavaScript than in Ruby when it comes to invalid HTML? I’ve heard this in a lot of correspondence over the past week. Because JavaScript has the DOM, right?
Use HTree+REXML. HTree cleans and REXML peppers and gobbles. Here’s a hairy, little method that will save some pain:
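require 'htree'
require 'rexml/document'
require 'open-uri'

# Fetch a page, let HTree clean it up, and hand the <html> element to REXML.
def read_xhtml_from( uri )
  # URI.open is the open-uri entry point on current Rubies;
  # a bare open( uri ) only fetched URLs on older versions.
  URI.open( uri ) { |f| HTree.parse f }.each_child do |child|
    # Skip nodes that are not elements (doctype, comments, whitespace).
    if child.respond_to? :qualified_name
      doc = ""; child.display_xml( doc )
      if child.qualified_name == 'html'
        return REXML::Document.new( doc )
      end
    end
  end
end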
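To see why the HTree pass is there at all, here is a minimal sketch of what REXML does with messy markup when nothing has cleaned it first. The sample string and the helper name rexml_can_parse? are made up for illustration; only REXML::Document and REXML::ParseException come from the library.

```ruby
require 'rexml/document'

# Hypothetical messy markup: the <p> and <br> are never closed.
MESSY_HTML = "<html><body><p>unclosed<br></body></html>"

# The same content written as well-formed XHTML.
CLEAN_XHTML = "<html><body><p>closed<br/></p></body></html>"

# Illustrative helper (not part of REXML): true if REXML can build a
# document from the markup as given, false if it raises a parse error.
def rexml_can_parse?( markup )
  REXML::Document.new( markup )
  true
rescue REXML::ParseException
  false
end

puts rexml_can_parse?( MESSY_HTML )
puts rexml_can_parse?( CLEAN_XHTML )
```

REXML is a strict XML parser, so the unclosed tags stop it cold. That is the gap the HTree step is filling in the method above.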
Okay, so. How to use it? That nice REXML way you’re already used to.
html = read_xhtml_from "http://redhanded.hobix.com/"
html.each_element( "//div[@class='entryFooter']" ) do |e|
  puts e.text( "./a[starts-with(@href, 'http://redhanded.hobix.com/')]" )
end
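If you want to try the XPath half without touching the network or installing HTree, a rough sketch follows. It assumes the markup has already been cleaned; the SAMPLE_XHTML string and the footer_link_texts helper are hypothetical stand-ins, while each_element and text with an XPath argument are the same REXML calls used above.

```ruby
require 'rexml/document'

# Hypothetical, already-clean XHTML standing in for what HTree would hand over.
SAMPLE_XHTML = <<~XHTML
  <html>
    <body>
      <div class="entryFooter">
        <a href="http://redhanded.hobix.com/inspect/first.html">First entry</a>
      </div>
      <div class="entryFooter">
        <a href="http://example.com/elsewhere.html">Offsite</a>
      </div>
      <div class="entry">
        <a href="http://redhanded.hobix.com/ignored.html">Not a footer</a>
      </div>
    </body>
  </html>
XHTML

# Illustrative helper: collect the text of the first link in each
# entryFooter div whose href starts with the given prefix.
def footer_link_texts( xhtml, prefix )
  doc = REXML::Document.new( xhtml )
  texts = []
  doc.each_element( "//div[@class='entryFooter']" ) do |div|
    # text returns nil when no matching <a> exists under this div.
    text = div.text( "./a[starts-with(@href, '#{prefix}')]" )
    texts << text if text
  end
  texts
end

puts footer_link_texts( SAMPLE_XHTML, "http://redhanded.hobix.com/" )
```

The only difference from the snippet above is that footers with no matching link are skipped instead of printing a blank line.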
MrCode
Interesting…I might have to use this on my Ruby Greasemonkey replacement, which I am coding as we speak.
MrCode
MenTaLguY
MrCode: May I propose a name for Ruby greasemonkey?
starmonkey
jvoorhis
Very clever. What’s HTree got over HTMLTools?
MrCode
MenTaL: Is there any significance to that name, beyond the monkey part? It doesn’t ring any bells for me.
But either way I do like it because I think paying homage to Greasemonkey (since it is clearly the inspiration for this) is a good gesture. I read the “Dive Into Greasemonkey” online book and was pretty impressed. Even though I don’t consider myself a JavaScript hacker (yet), I understood most of what I read. But opening up “web hacking” to Ruby developers is a worthwhile endeavor, even if Greasemonkey itself is a good system.
To let you know, my first code name for this project was Mortar (a play on WEBrick of course), but a Google search turned up another Ruby project of that name. The current code name is Wonderland, which I do kind of like, but I’m not that invested in it yet. That name comes from my first Hoodwink’d post where I felt like using Hoodwink’d was like being on the other side of the web’s looking glass (a la Alice in Wonderland.) But the metaphor is more applicable to just Hoodwink’d and not the entire Greasemonkey realm, so I think Starmonkey is a good contender.
:{
MrCode: The Star Monkey appears in the Poignant Guide. This is why StarMonkey would be a good name. Of course, greasybacon might also be a good name, but it might tend to scare some cholesterol-sensitive folk away.
Also, thanks for creating a GreaseMonkey replacement in Ruby. Having such a thing that runs with any browser (not just the FireFox) will help boost Ruby. So now (when you’re done) we’ll be able to create Web apps in Ruby (Rails) and we’ll be able to look at Web apps with Ruby (StarMonkey?) and at that point we begin to actually take over the world. Seriously. No joke. Not kidding.
MrCode
Hehehehe, I see now. Though I’m obviously a big fan of RedHanded and a lot of _why’s work, I have yet to read the Poignant Guide. Mostly because I’ve been programming in Ruby for a while and don’t really need to read a book on it. But I think now I’ll just have to read it, if just for the entertainment value (and also so I fully understand all the Fox, Chunky Bacon and Star Monkey jokes.)
The StarMonkey name is definitely taking the lead now.
BTW, anyone have an answer for my iconv question from above? If not, I’ll just use HTML Tools (thanks jvoorhis.)
rcoder
MrCode, what platform are you running Ruby on? My guess would be Windows, in which case the iconv module may just not work at all.
MrCode
Hmmm, there is a possible problem with the name StarMonkey: Google shows a video game and some companies with that name. Maybe we can “compete” with them, but it might be better to have another name. Just something to keep in mind.
I guess in the end I can name it whatever I want, but I certainly don’t mind a little community feedback, since I blatantly stole the idea from the denizens of RedHanded (though as Edison said, 1% inspiration, 99% perspiration.)
MrCode
rcoder: Yeah these days I mostly develop on Windows, though I’ll probably attempt a switch to Linux in the next few months. But at the moment it is most convenient to develop on my Windows laptop supplied by my company.
I guess that means no iconv (and therefore no htree) for me. HTML Tools it is.
why
MrCode: don’t say that. Ruby comes with iconv.
MrCode
pangloss
MrCode: If you run into problems with ruby on windows, just install cygwin and you have a magic posix environment running on top of windows with your favorite shell and everything. Cygwin has an optional ruby package. You can have the pleasure of running two separate copies of ruby on one machine (handy for testing too).
Danno
I thought GreasyBacon was pretty clever as a name for the WebProxy GreaseMonkey replacement, but then again, I think a lot of things are clever that are, in fact, not.
davehal
Thanks for the tip. HTree breezed through a page that I couldn’t get to work with HTML Tools. Very nice.
flgr
JavaScript’s DOM support is generally considered way too long-winded even by people who are sure to be right on this. (You know who you are!)
I think future goals are to simplify all this a lot, though E4X seems to be getting there already.
phil602
wow…. if I had known this three weeks ago, I would have been quite happy.
:{
MrCode: regarding alternate names:
MonkeyBacon – it rhymes with ChunkyBacon while still evoking thoughts of grease. Potentially a very good choice, but animal rights folk will wonder if the bacon is made from Monkey which could possibly lead to riots in some areas (just so you’re prepared).
ChunkyMonkey – I’m rather fond of that Ben & Jerry’s flavor and it still has Poignant tie-ins.
Other completely different types of possible names:
Chameleon: What does this thing do? It changes the appearance of web pages chameleon-like.
Cloak-n-Dagger – why? Just ‘cuz it sounds cool.
Scarlet’s Web – has Ruby connotations (Scarlet) and Web connotations (Web) and sounds like a popular children’s book to boot. Why not be the first kid on your block to use a possessive in a project name?
RubyOnBricks (or the very similar RubyOnBlocks) – Ruby, Webrick – what more could you want and it sounds a bit like RubyOnRails which we all know is very popular these days. Downside: it sounds like Ruby lost its wheels and is up on the blocks. The Agile Programmer’s Guide to Ruby On Bricks should hit the shelves any day now.
bmj
Well, some time ago (last two months or so) I saw a DOM JavaScript in a comments post at www.web-graphics.com (but can’t remember the post anymore).
Dave Burt
iconv
The One-Click Installer really should include it; it’s not hard to get working.
I’ve just packaged the required binaries for easy extraction (632k). The package includes instructions on where to get them independently if you’re suitably paranoid. Please let me know if you’re a happy customer.
(I got my directions from the Rails wiki.)
MrCode
Thanks a lot Dave, that did the trick.
I agree it should be included in the One-Click Installer. Maybe we should send an email to Curt Hibbs?
Still, I’m tempted to continue using HTML Tools for Wonderland since my fellow Windows users may not know the tricks needed to get iconv, and therefore HTree, working.
sporkmonger
Yeah, I’ve been using HTree to do this very thing for awhile now with the FeedTools gem.
mr/
This is really nice, thx for posting.
I was using HTML Tools and it was way too slow. No need for htmltidy anymore now. Just pure ruby.
Jason
thanks dave burt, I was wondering how to get htree to work on windows as well
Peter
Hmm, does htree really produce real xhtml? Parsing an invalid page with htree and then feeding it to the W3C validator still shows invalid markup (albeit much fewer errors).
Try it with: http://www.craiglarman.com/
Using:Read the result to a file and paste that into http://validator.w3.org/ (adding the XHTML doctype manually).