More Splatters of Hpricot
Element#siblings_at, Element#nodes_at
>> doc.at(:h3)
=> {elem <h3> {text "bottles and cans"} </h3>}
>> doc.at(:h3).siblings_at(0..2)
=> #<Hpricot::Elements[
     {elem <h3> {text "bottles and cans"} </h3>},
     {elem <p> {text "just clap your hands"} </p>},
     {elem <ul> {elem <li> {text "2 turntables"} </li>} {elem <li> {text "a microphone"} </li>} </ul>}]>
Basically: grab this element, as well as the two sibling elements below it. The nodes_at
method also counts text nodes and comments.
>> doc.at(:h3).nodes_at(0..2)
=> #<Hpricot::Elements[
     {elem <h3> {text "bottles and cans"} </h3>},
     {text "\n"},
     {elem <p> {text "just clap your hands"} </p>}]>
You can also do stuff like nodes_at(-2, 2, 5) to grab specific nodes: the ones positioned two places above, two places below and five places below the selected element.
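To make the difference between the two methods concrete, here is a small plain-Ruby model of the selection rules described above. It is only a sketch and not Hpricot's own code: the names `CHILDREN`, `offsets`, `pick`, `model_nodes_at` and `model_siblings_at` are made up for illustration, and dropping offsets that fall outside the list is an assumption.

```ruby
# A plain-Ruby model of the selection rules; this is NOT Hpricot's code.
# Each child of the parent is [kind, label]; kind is :elem, :text or :comment.
CHILDREN = [
  [:elem, "h3"],
  [:text, "\n"],
  [:elem, "p"],
  [:text, "\n"],
  [:elem, "ul"],
  [:comment, "insert mp3 of applause here"]
].freeze

# Expand arguments such as (0..2) or (-2, 2, 5) into a flat list of offsets.
def offsets(*pos)
  pos.flat_map { |arg| arg.is_a?(Range) ? arg.to_a : [arg] }
end

# Pick entries of `list` at the given offsets relative to `start`.
# Assumption: offsets that fall outside the list are silently dropped.
def pick(list, start, *pos)
  offsets(*pos)
    .map { |off| start + off }
    .select { |i| i >= 0 && i < list.size }
    .map { |i| list[i] }
end

# nodes_at-style: every child node counts as a position.
def model_nodes_at(children, node, *pos)
  start = children.index { |c| c.equal?(node) }
  pick(children, start, *pos)
end

# siblings_at-style: only element children count as positions.
def model_siblings_at(children, node, *pos)
  elems = children.select { |kind, _| kind == :elem }
  start = elems.index { |c| c.equal?(node) }
  pick(elems, start, *pos)
end

H3 = CHILDREN[0]
p model_siblings_at(CHILDREN, H3, 0..2)            # the h3, p and ul elements
p model_nodes_at(CHILDREN, H3, 0..2)               # the h3, a "\n" text node, then p
p model_nodes_at(CHILDREN, CHILDREN[2], -2, 2, 5)  # two above and two below the p
```

The only difference between the two model methods is the list they index into: all children for `model_nodes_at`, element children only for `model_siblings_at`.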
text() and comment()
>> doc.search("p/text()")
=> #<Hpricot::Elements[{text "just clap your hands"}]>
>> doc.at("//comment()")
=> {comment "<!-- insert mp3 of applause here -->"}
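If you want to poke at the same two XPath node tests without the Hpricot gem around, here is a rough equivalent using REXML instead. This is a swapped-in library, not Hpricot, and the markup is a made-up, well-formed stand-in for the page in the examples. The constant names are only for illustration.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the page used in the examples.
SAMPLE_XML = <<~DOC
  <div>
    <h3>bottles and cans</h3>
    <!-- insert mp3 of applause here -->
    <p>just clap your hands</p>
  </div>
DOC

SAMPLE_DOC = REXML::Document.new(SAMPLE_XML)

# text() selects the text nodes that are direct children of each <p>.
P_TEXTS = REXML::XPath.match(SAMPLE_DOC, "//p/text()").map(&:value)

# comment() selects comment nodes; XPath.first returns the first match.
FIRST_COMMENT = REXML::XPath.first(SAMPLE_DOC, "//comment()").string

puts P_TEXTS.inspect
puts FIRST_COMMENT.inspect
```

REXML hands back its own node objects (REXML::Text, REXML::Comment) rather than Hpricot::Elements, so only the XPath expressions carry over, not the return types.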
Element#to_original_html
>> doc = Hpricot("<p>a bunch of <b>messy <i>messy</b> html that " + "doesn't</u> match up<!_ egg _!>")
=> #<Hpricot::Doc {elem <p> {text "a bunch of "} {elem <b> {text "messy "} {elem <i> {text "messy"}} </b>} {text " html that doesn't"} {bogusetag </u>} {text " match up<!_ egg _!>"}}>
>> puts doc.to_original_html
<p>a bunch of <b>messy <i>messy</b> html that doesn't</u> match up<!_ egg _!>
XPath indices
>> doc.at("li[1]")
=> {elem <li> {text "a microphone"} </li>}
Those indices work like E:nth-of-type.
Version 0.5 of Hpricot is nearing. Please test the latest gem to help me figure out any subtleties. Also, rdoc is now included. So, yeah: HELP!
gem install hpricot --source code.whytheluckystiff.net
Boris K
Great work. TY.
Evan Farrar
Wow. That is really going to clean up the scraper I wrote last week. Thanks!
Adam
Keen, I played with hpricot a bit, but I needed to demuck some namespaces, which didn’t seem to work :(
Otherwise it’s a prime example of the mythic _why and ruby combination making the world less badder.
Dr Nic
Hpricot must be Best Gem of 2006.
Mark
I’m loving Hpricot and all the other gems that have built upon it such as WWW::Mechanize and RDig.
Any chance of pulling in the `all_text` method that WWW::Mechanize adds to Hpricot?
Also, I’ve posted some comments under ticket 13 on trac regarding Hpricot’s choking on ridiculously long .NET viewstates. Any thoughts on how this might be fixed?
I couldn’t even imagine doing my current scraping project without having it Hpricot flavoured. Great stuff!