Okay, Give Hpricot 0.2 a Go
This time I’m giving a balloon out which can be used for quick testing.
Or, if you want to install Hpricot 0.2:
gem install hpricot --source code.whytheluckystiff.net
Here’s a benchmark parsing the Boing Boing home page fifty times. It’s a good page to test because it’s big and there are some bogus end tags and old-style tables and break tags.
                    user     system      total         real
hpricot:       10.515625   0.000000  10.515625 ( 10.610571)
scrapi:        32.546875   0.093750  32.640625 ( 32.923535)
htree:         56.609375   0.023438  56.632812 ( 57.096530)
rubyfulsoup:   29.289062   0.046875  29.335938 ( 29.586510)
mechanize:(*) 148.132812   1.101562 149.234375 (150.621922)
htmltok:(*)    19.632812   0.007812  19.640625 ( 19.795446)
(*) These libs are a bit more primitive. They focus only on reading documents and give you no calls for modifying them.
The mechanize benchmark parses and converts to a REXML document, since mechanize itself only gives you links, form elements, nothing complex. So this may be unfair.
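If you want to run numbers like these yourself, here’s a rough sketch of a harness built on Ruby’s standard Benchmark library. It is not the script behind the table above. The bench_parsers method, the tag_scan stand-in parser, and the sample HTML are all made up for illustration, so the sketch runs without any of the gems installed; you’d swap in a real parser call where the comment indicates.

```ruby
require 'benchmark'

# Hypothetical harness (not the original benchmark script): parses the same
# HTML string `runs` times with each parser and prints Benchmark's usual
# user/system/total/real columns.
# `parsers` maps a label to anything that responds to #call(html).
def bench_parsers(html, parsers, runs = 50)
  width = parsers.keys.map(&:length).max + 1 # room for the trailing colon
  results = {}
  Benchmark.bm(width) do |x|
    parsers.each do |name, parser|
      # x.report returns the Benchmark::Tms for this entry
      results[name] = x.report("#{name}:") { runs.times { parser.call(html) } }
    end
  end
  results
end

# Stand-in "parser" so the sketch runs on the standard library alone:
# it only collects tag names and is no substitute for a real HTML parser.
tag_scan = lambda { |html| html.scan(/<\/?([a-zA-Z][a-zA-Z0-9]*)/).flatten }

# Made-up sample page with a stray end tag, loosely in the spirit of the test page.
sample_html = "<html><body>" + ("<p class='posted'>hi<br></p></td>" * 60) + "</body></html>"

parsers = {
  "tagscan" => tag_scan
  # With the gem installed you would add something like:
  #   "hpricot" => lambda { |html| Hpricot(html) }   # assumes require 'hpricot'
}

bench_parsers(sample_html, parsers)
```

Passing the parsers in as lambdas keeps the timing loop identical for every library, so the only thing that varies between rows is the parse call itself.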
I didn’t include scrapi because, although it parses the page, it fails some of my other tests. For example, when using a selector to find all p.posted elements, I get back only one element with scrapi, when the others all report back sixty elements. So, I’ll post a benchmark when I understand what I’m doing wrong.
Update: Thanks to assaf, I got scrapi working with libtidy and reporting back the right answers. Thankya!
Update #2: An htmltokenizer benchmark.