RedHanded » Okay, Give Hpricot 0.2 a Go

Okay, Give Hpricot 0.2 a Go

This time I’m giving a balloon out which can be used for quick testing.

http://balloon.hobix.com/hpricot

Or, if you want to install Hpricot 0.2:

gem install hpricot --source code.whytheluckystiff.net

So the Hpricot parser is basically complete. There’s still lots of fiddling ahead: it doesn’t handle JavaScript whatsoever and it’s not yet as flexible as HTree. However, it does fix a lot of HTML that RubyfulSoup and the htmltools won’t.

Here’s a benchmark parsing the Boing Boing home page fifty times. It’s a good page to test because it’s big and there are some bogus end tags, old-style tables and break tags.

                     user     system      total        real
 hpricot:       10.515625   0.000000  10.515625 ( 10.610571)
 scrapi:        32.546875   0.093750  32.640625 ( 32.923535)
 htree:         56.609375   0.023438  56.632812 ( 57.096530)
 rubyfulsoup:   29.289062   0.046875  29.335938 ( 29.586510)
 mechanize:(*) 148.132812   1.101562 149.234375 (150.621922)
 htmltok:(*)    19.632812   0.007812  19.640625 ( 19.795446)

(*) These libs are a bit more primitive: they focus only on reading documents and offer no calls for modifying them.

The mechanize benchmark parses and converts to a REXML document, since mechanize itself only gives you links, form elements, nothing complex. So this may be unfair.
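For anyone wanting to rerun this kind of comparison, here is a minimal sketch of the harness shape using Ruby's standard Benchmark module, which prints the same user/system/total/real columns as the table above. Everything named in it is an assumption for illustration: the synthetic `SAMPLE_HTML` page, the regex-based `count_posted` stand-in and the `PARSERS` table are made up so the sketch runs without any gems. Swap real parser calls in for the lambda to measure an actual library.

```ruby
require 'benchmark'

# Synthetic stand-in page with sixty p.posted paragraphs.
# This is NOT the real Boing Boing home page.
SAMPLE_HTML = "<html><body>" +
              ("<p class=\"posted\">posted by someone</p>\n" * 60) +
              "</body></html>"

# Hypothetical stand-in "parser": counts p.posted start tags with a regex.
def count_posted(html)
  html.scan(/<p\s+class="posted"\s*>/).size
end

# Label => callable. Replace the lambda with a call into a real library.
PARSERS = {
  "regex-standin:" => lambda { |html| count_posted(html) }
}

RUNS = 50 # the post parses the page fifty times

Benchmark.bm(15) do |x|
  PARSERS.each do |label, parser|
    x.report(label) { RUNS.times { parser.call(SAMPLE_HTML) } }
  end
end
```

Keeping the parsers in one table makes it easy to check that they all agree on the answer (sixty elements here) before trusting the timings.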

I didn’t include scrapi because, although it parses the page, it fails some of my other tests. For example, when using a selector to find all p.posted elements, I get back only one element with scrapi, when the others all report back sixty elements. So, I’ll post a benchmark when I understand what I’m doing wrong.

Update: Thanks to assaf, I got scrapi working with libtidy and reporting back the right answers. Thankya!

Update #2: An htmltokenizer benchmark.

05 Jul 2006 at 11:44 | 30 comments

Dan W

said on 05 Jul 2006 at 13:21

Excellent work. Who’s going to be the first to make a Ruby version of Pornolize then?

anon

said on 05 Jul 2006 at 13:38

Sounds great, I’ll give it a try on http://news.bbc.co.uk – RubyfulSoup handled that so slowly last time I tried.

netghost

said on 05 Jul 2006 at 13:46

I don’t know what I’ll use it for… but I have a deep desire to play with this. Much love for the JQuery style expressions! They made me want to use JQuery… but I couldn’t figure out what I’d use it for.

FlashHater

said on 05 Jul 2006 at 14:56

W00t! Balloon is truly useful!

serg

said on 05 Jul 2006 at 17:25

So, it works if I pass in an IO object to Hpricot.parse (from either a file or a url, like the balloon). It doesn’t work if I pass in a string (i.e., the name of the file to parse, like the example.)

assaf

said on 05 Jul 2006 at 17:26

For scrapi, you’ll have to use Tidy for now. The non-tidy parser doesn’t deal with bad HTML, which is why I’m looking for an alternative that can clean HTML well and fast.

With today’s code drop you can do something like:

# Set it to use Tidy.
Scraper::Base.tidy_options({})

# Define a scraper.
boing_boing = Scraper.define do
  array :posts
  process "p.posted", :posts=>:node
  result :posts
end

# Scrape away!
puts boing_boing.scrape(html).size

why

said on 05 Jul 2006 at 18:44

serg: Ohhh, you’re right. That example was totally wrong! Hpricot.parse takes an HTML string or an IO object containing HTML.
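To make the distinction concrete: a string is taken as the markup itself, never as a file name, while an IO-like object is read to get the markup. The sketch below only illustrates that calling convention. `html_from` is a hypothetical helper written for this example, not part of Hpricot and not a claim about how Hpricot.parse is implemented.

```ruby
require 'stringio'

# Hypothetical helper (not part of Hpricot): returns the HTML text whether
# the caller hands over an IO-like object or a String of markup.
def html_from(source)
  source.respond_to?(:read) ? source.read : source.to_s
end

markup = "<p class=\"posted\">hello</p>"

# A String is treated as HTML content.
puts html_from(markup)

# An IO-like object (a File, an open-uri handle, a StringIO) is read first.
puts html_from(StringIO.new(markup))

# A file name passed as a String is NOT opened; it comes back unchanged,
# which is why passing "index.html" where HTML is expected goes wrong.
puts html_from("index.html")
```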

assaf: Hurray, that works.

josh

said on 05 Jul 2006 at 19:28

Just to ask a really dumb question, but if I have an Element, how can I get the text found within the tag?

Thanks for the library and sorry for the question.

why

said on 05 Jul 2006 at 20:17

For now, you’ll need to loop through the children of the Element. Some of those will be Text objects which have a content property containing the string.

The next version will have an innerHTML property on every element.
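Until then, the loop might look roughly like the sketch below. It does not use Hpricot's own classes: `FakeText` and `FakeElement` are stand-in structs invented here so the example runs on its own, assuming only what is described above (elements have children, and text nodes carry a content string). Recursing into nested elements is a choice made for this example.

```ruby
# Stand-in node types, NOT Hpricot's classes; they only mimic the shape
# described above: elements hold children, text nodes hold content.
FakeText    = Struct.new(:content)
FakeElement = Struct.new(:name, :children)

# Collects the text inside a node by walking its children.
def text_of(node)
  case node
  when FakeText    then node.content
  when FakeElement then node.children.map { |child| text_of(child) }.join
  else ""
  end
end

SAMPLE_PARA = FakeElement.new("p", [
  FakeText.new("Posted by "),
  FakeElement.new("a", [FakeText.new("why")]),
  FakeText.new(" at 11:44")
])

puts text_of(SAMPLE_PARA)
```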

Jerome

said on 06 Jul 2006 at 01:24

Is there a reason there was no comparison with HTMLTokenizer? Or is that because it’s not even in the same ballpark as the rest?

Hank

said on 06 Jul 2006 at 02:03

Okay, for anyone else searching for some straight example code from the Hpricot posts, here’s some that works:

wget http://redhanded.hobix.com/index.html

require 'rubygems'
require_gem 'hpricot'
require 'open-uri'
doc = Hpricot.parse(open("index.html"))
(doc/:p/:a).each do |link|
  p link.attributes
end

why

said on 06 Jul 2006 at 09:10

Jerome: Okay. Htmltokenizer is pretty quick, but read-only. But I’m really glad you mentioned this one, because I could offer access to the Hpricot tokenizer, which would speed things up by literally an order of magnitude.

In fact, you can already get access to this by using Hpricot.scan.

 doc = Hpricot.scan(open("index.html")) do |token|
   p token
 end

Which gives you back:

 [:doctype, "html", {"system_id"=>"\"DTD/xhtml1-transitional.dtd\"", "publid_id"=>"PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "}, "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"DTD/xhtml1-transitional.dtd\">"]
 [:text, "\n", nil, "\n"]
 [:stag, "html", {"xml:lang"=>"en", "lang"=>"en", "xmlns"=>"http://www.w3.org/1999/xhtml"}, "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\" xml:lang=\"en\">"]
 [:text, "\n", nil, "\n"]
 [:stag, "head", nil, "<head>"]
 [:text, "\n", nil, "\n"]
 [:emptytag, "meta", {"content"=>"text/html; charset=utf-8", "http-equiv"=>"Content-Type"}, "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"]
 [:text, "\n", nil, "\n"]
 [:stag, "title", nil, "<title>"]
 [:text, "RedHanded &raquo; sneaking Ruby through the system", nil, "RedHanded &raquo; sneaking Ruby through the system"]
 [:etag, "title", nil, "</title>"]
 [:text, "\n", nil, "\n"]
 [:emptytag, "link", {"href"=>"http://redhanded.hobix.com/index.xml", "title"=>"RSS", "rel"=>"alternate", "type"=>"application/rss+xml"}, "<link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://redhanded.hobix.com/index.xml\" />"]
 [:text, "\n", nil, "\n"]
 ...

Basically: (1) a symbol describing the element type, (2) the tag name or text content, (3) an attributes hash, and (4) the raw string which formed this token.
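Since every token is a four-element array, a block can destructure it directly. Here is a small sketch that works on a few tokens copied from the output above rather than calling the scanner, so it runs without Hpricot installed. `SAMPLE_TOKENS` and `summarize` are names invented for this example.

```ruby
# A few tokens copied from the scan output above.
SAMPLE_TOKENS = [
  [:stag, "title", nil, "<title>"],
  [:text, "RedHanded &raquo; sneaking Ruby through the system", nil,
   "RedHanded &raquo; sneaking Ruby through the system"],
  [:etag, "title", nil, "</title>"],
  [:emptytag, "link",
   {"href"=>"http://redhanded.hobix.com/index.xml", "title"=>"RSS",
    "rel"=>"alternate", "type"=>"application/rss+xml"},
   "<link rel=\"alternate\" type=\"application/rss+xml\" title=\"RSS\" href=\"http://redhanded.hobix.com/index.xml\" />"]
]

# Hypothetical helper: tallies tokens by type and collects href attributes.
def summarize(tokens)
  counts = Hash.new(0)
  hrefs  = []
  tokens.each do |type, _name, attrs, _raw|
    counts[type] += 1
    # The attributes slot is nil when a tag has no attributes.
    href = (attrs || {})["href"]
    hrefs << href if href
  end
  [counts, hrefs]
end

counts, hrefs = summarize(SAMPLE_TOKENS)
p counts
p hrefs
```

With the real scanner, the same block body would presumably sit inside the Hpricot.scan block shown earlier, handling one token at a time.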

The scanning stage is easy. It’s the figuring out the layout of the document and coercing wellformedness that’s the spiny one.

why

said on 06 Jul 2006 at 12:13

anon: news.bbc.co.uk was broken (for me) in Hpricot 0.2, but it’s working in trunk. So is McSweeney’s (awful HTML). More, more! Any more really, really bad HTML sites I can use?

thomas

said on 06 Jul 2006 at 12:15

Found a “bug”. The Scanner fails when it encounters 

msg = “negative string size (or size too big) (ArgumentError)”

thomas

said on 06 Jul 2006 at 12:18

hey Preview shows something else than the actual Comment! Well anyways, the scanner fails when it encounters an empty HTML Comment. See if this is right 

why

said on 06 Jul 2006 at 12:31

thomas: That little oddity is fixed in trunk. McSweeney’s has one of those suckers.

thomas

said on 06 Jul 2006 at 12:44

Tried trunk, didn’t work.

 svn co https://code.whytheluckystiff.net/svn/hpricot/trunk hpricot
 cd hpricot
 rake install

“Successfully installed hpricot, version 0.2”

 require 'rubygems'
 require_gem 'hpricot', ">=0.2"

 doc = Hpricot.parse("<!-->")

Fails .. missing something?

thomas

said on 06 Jul 2006 at 12:45

sorry these comments are killing me, I should RTFM

why

said on 06 Jul 2006 at 13:05

Oh, do:

 cd hpricot
 rake ragel
 rake install

You’ll need Ragel installed to build the new scanner.

thomas

said on 06 Jul 2006 at 13:16

Thanks, that did it.

Not sure if that is of any use to you. But I needed it: http://rafb.net/paste/results/bVlGWd11.html

doc.get_elements_by_tag_name('h3').each { |tag| puts tag.inner_text }

luke redpath

said on 07 Jul 2006 at 06:27

Great little library – love it. I’ve written a small extension for Test::Unit that lets you test your Rails views using hpricot instead of the clunky assert_tag function.

Hpricot Test Extension for Rails

trans

said on 08 Jul 2006 at 12:41

Clearly Hpricot is for HTML, but how might it fare with strict XML?

need for speed

said on 08 Jul 2006 at 14:33

In the benchmark I miss a comparison with ruby-libxml!

jm

said on 10 Jul 2006 at 09:08

So does this not work on Windows?

probablyCorey

said on 14 Jul 2006 at 08:48

If you can parse this site then Hpricot is the magical!

why

said on 18 Jul 2006 at 11:58

probablyCorey: Oh, wow, that is hideous. Three nested HTML pages. Hpricot does it, but I really don’t know what’s correct in this case.

not

said on 18 Jul 2006 at 18:47

Any chance of making a “pure” ruby version?

why

said on 19 Jul 2006 at 11:25

Binaries will be out in 0.5. Watch the map.

ryan

said on 21 Jul 2006 at 15:03

I found a borken page for you. At least it broke Hpricot 0.3.

Broken


`build_node': [bug] unknown structure: [:xmlprocins, "@include(\"ocregister/includes/global/login_table.php\");", nil, nil] (Exception)

mae

said on 25 Jul 2006 at 01:59

will hpricot work in ruby 1.8.2?
