RedHanded

More Splatters of Hpricot

by why in bits

Element#siblings_at, Element#nodes_at

  >> doc.at(:h3)
  => {elem <h3> {text "bottles and cans"} </h3>}
  >> doc.at(:h3).siblings_at(0..2)
  => #<Hpricot::Elements[
       {elem <h3> {text "bottles and cans"} </h3>}, 
       {elem <p> {text "just clap your hands"} </p>}, 
       {elem <ul> {elem <li> {text "2 turntables"} </li>}
         {elem <li> {text "a microphone"} </li>} </ul>}]>

Basically: grab this element, as well as the two siblings below it. The nodes_at method will include text elements and comments.

  >> doc.at(:h3).nodes_at(0..2)
  => #<Hpricot::Elements[
       {elem <h3> {text "bottles and cans"} </h3>},
       {text "\n"}, 
       {elem <p> {text "just clap your hands"} </p>}]>

You can also do stuff like nodes_at(-2, 2, 5) to grab specific nodes: the ones positioned two places above, two places below and five places below the selected element. (doc)
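
If the offset business is fuzzy, here is a rough sketch of the idea in plain Ruby, with no Hpricot involved. The relative_at helper is hypothetical (it is not an Hpricot method), the nodes are stand-in strings, and dropping out-of-range positions is an assumption of the sketch rather than documented behaviour. The only difference between the two calls is which list gets indexed: every child node, or just the elements.

```ruby
# Hypothetical helper, not part of Hpricot: resolve offsets relative to
# the node sitting at position `index` in its parent's list of children.
def relative_at(nodes, index, *offsets)
  positions = offsets.flat_map do |off|
    # A range like 0..2 expands to every offset it covers.
    off.is_a?(Range) ? off.to_a.map { |o| index + o } : [index + off]
  end
  # Assumption of this sketch: positions outside the list are skipped.
  positions.select { |pos| pos >= 0 && pos < nodes.length }
           .map { |pos| nodes[pos] }
end

# Stand-ins for the children of the parent: elements plus whitespace text.
children = ["h3", "\n", "p", "\n", "ul"]
# The same list with the text nodes filtered out.
elements = children.reject { |n| n == "\n" }

relative_at(children, 0, 0..2)   # in the spirit of nodes_at(0..2)
relative_at(elements, 0, 0..2)   # in the spirit of siblings_at(0..2)
```

Indexing the full list picks up the whitespace text node between the h3 and the p, while indexing the filtered list walks straight from one element to the next.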

text() and comment()

  >> doc.search("p/text()")
  => #<Hpricot::Elements[{text "just clap your hands"}]>
  >> doc.at("//comment()")
  => {comment "<!-- insert mp3 of applause here -->"}

Element#to_original_html

  >> doc = Hpricot("<p>a bunch of <b>messy <i>messy</b> html that " +
       "doesn't</u> match up<!_ egg _!>")
  => #<Hpricot::Doc {elem <p> {text "a bunch of "} {elem <b> {text "messy "} 
       {elem <i> {text "messy"}} </b>} {text " html that doesn't"} 
       {bogusetag </u>} {text " match up<!_ egg _!>"}}>
  >> puts doc.to_original_html
  <p>a bunch of <b>messy <i>messy</b> html that doesn't</u> match up<!_ egg _!>

XPath indices

  >> doc.at("li[1]")
  => {elem <li> {text "a microphone"} </li>}

Those indices work like E:nth-of-type.
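
To spell out what "like E:nth-of-type" means here: only siblings with the same tag name get counted, so text nodes and other elements in between don't shift the index. A loose sketch in plain Ruby follows. The Node struct and the nth_of_type helper are made up for illustration and are not Hpricot's implementation. The counting starts at zero only because the output above shows li[1] returning the second item; that is read off the example, not taken from Hpricot's source.

```ruby
# Hypothetical stand-in for a parsed node: a tag name and its text.
Node = Struct.new(:name, :text)

# Hypothetical helper: pick the index-th child among those sharing a
# tag name. Zero-based, to mirror the li[1] output shown above.
def nth_of_type(children, name, index)
  children.select { |child| child.name == name }[index]
end

# Children of the <ul>, with whitespace text nodes mixed in.
list = [
  Node.new("text", "\n"),
  Node.new("li", "2 turntables"),
  Node.new("text", "\n"),
  Node.new("li", "a microphone")
]

picked = nth_of_type(list, "li", 1)
puts picked.text
```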


Version 0.5 of Hpricot is nearing. Please test the latest gem to help me figure out any subtleties. Also, rdoc is now included. So, yeah: HELP!

 gem install hpricot --source code.whytheluckystiff.net

said on 27 Dec 2006 at 08:45

Great work. TY.

said on 27 Dec 2006 at 10:01

Wow. That is really going to clean up the scraper I wrote last week. Thanks!

said on 27 Dec 2006 at 10:55

Keen, I played with hpricot a bit, but I needed to demuck some namespaces, which didn’t seem to work :(

Otherwise it’s a prime example of the mythic _why and ruby combination making the world less badder.

said on 27 Dec 2006 at 18:39

Hpricot must be Best Gem of 2006.

said on 28 Dec 2006 at 03:24

I’m loving Hpricot and all the other gems that have built upon it such as WWW::Mechanize and RDig.

Any chance of pulling in the `all_text` method that WWW::Mechanize adds to Hpricot?

Also, I’ve posted some comments under ticket 13 on trac regarding Hpricot’s choking on ridiculously long .NET viewstates. Any thoughts on how this might be fixed?

I couldn’t even imagine doing my current scraping project without having it Hpricot flavoured. Great stuff!
