Hpricot Strikes Back

November 3rd 19:13
by why

My my. How the sensationalist press does carry on.

Peter Cooper: On an Hpricot vs Nokogiri benchmark, Nokogiri clocked in at 7 times faster at initially loading an XML document, 5 times faster at searching for content based on an XPath, and 1.62 times faster at searching for content via a CSS-based search. These are impressive results, since Hpricot was previously considered to be quite speedy itself.

I feel just awful (just supreeeeemely lousy) that these benchmarks were only good for four days. Nokogiri is no longer seven times faster than Hpricot.

And that means these guys have to go back through all their docs and promotional materials and… wow, what a job it’s going to be. It’s just a tough situation, folks. My heart goes out to all the fine young lads who worked so hard to bring Hpricot down, only to discover, hey, boss, there she goes! Hpricot is strolling right along the boardwalk, smiling, waving, checking its watch, fit as a fiddle.

This fruit is tiny, shiny and can be spit-polished in a single weekend.


Before we get to the news… Here’s the XML those Nokogiri benchmarks are based on: sin.xml.

Someone help me here. Am I reading that right? There are six XML tags in that whole file. Is this for reals? Six, right?

<location>
<refUrl>http://wikitravel.org/en/Singapore</refUrl>
  <info>
    &lt;b&gt;Singapore&lt;/b&gt; is an island-state in Southeast
    Asia, connected by bridges to Malaysia. Founded as a British trading colony 
    in 1819, since independence it has become one of the &lt;b&gt;world's 
    most prosperous countries&lt;/b&gt; and sports the world's busiest 
    port.   Combining the skyscrapers and subways of a &lt;b&gt;modern, 
    affluent city&lt;/b&gt; with a medley of Chinese, Indian and Malay 
    influences and a &lt;b&gt;tropical climate&lt;/b&gt;, with 
    tasty food, good shopping and a vibrant nightlife scene, this Garden City 
    makes a great stopover or springboard into the region.
  </info>
</location>

This benchmark was linked all over the place last week. Does anyone look at this stuff?


Okay, great. Let’s battle!

Time for a new benchmark based on timeline.xml from John Nunemaker’s libxml vs. hpricot stuff.

gist: 21936:

                     user     system      total        real
hpricot:doc      2.630000   0.030000   2.660000 (  2.655527)
hpricot2:doc     0.340000   0.000000   0.340000 (  0.349340)
nokogiri:doc     0.600000   0.020000   0.620000 (  0.611570)
                     user     system      total        real
hpricot:xpath    1.910000   0.000000   1.910000 (  1.911496)
hpricot2:xpath   0.890000   0.010000   0.900000 (  0.897664)
nokogiri:xpath   0.060000   0.000000   0.060000 (  0.061546)
                     user     system      total        real
hpricot:css      1.880000   0.000000   1.880000 (  1.889301)
hpricot2:css     0.680000   0.010000   0.690000 (  0.677072)
nokogiri:css     benchmark.rb:77: [BUG] Bus Error
ruby 1.8.6 (2007-09-24) [i686-darwin8.11.1]

Try it yourself with the hpricot-0.6.170 gem, which includes source code, so you’ll need a compiler.
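
If you would rather roll your own harness than grab the gist, here is a minimal sketch of the shape such a script might take. The "user system total real" header is what Ruby's Benchmark.bm prints. Everything else is an assumption: the sample document, the labels, and the two plain-Ruby tag counters are hypothetical stand-ins so the sketch runs without any gems. It is not the code from the gist.

```ruby
require 'benchmark'

# Hypothetical stand-in document; the benchmark in the post used timeline.xml.
SAMPLE_XML = "<statuses>" + ("<status><text>hi</text></status>" * 200) + "</statuses>"

# Hypothetical stand-in workloads. Each one counts opening tags in plain Ruby.
# Replace these lambdas with real parser calls to compare actual libraries.
CANDIDATES = {
  "scan:doc"  => lambda { |xml| xml.scan(/<(\w+)>/).length },
  "split:doc" => lambda { |xml| xml.split("<").count { |piece| piece =~ /\A\w+>/ } },
}

# Runs every candidate `iterations` times and prints one Benchmark.bm row each.
# Returns a hash of label => Benchmark::Tms so the timings can be reused.
def run_benchmark(candidates, xml, iterations)
  results = {}
  Benchmark.bm(12) do |x|
    candidates.each do |label, work|
      results[label] = x.report(label) { iterations.times { work.call(xml) } }
    end
  end
  results
end

results = run_benchmark(CANDIDATES, SAMPLE_XML, 50)
```

Passing a label width to Benchmark.bm keeps the columns lined up, and returning the Tms objects means the numbers can be compared in code instead of by eye.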

This is only a rewrite of the parser, not the Ragel lexer. I’m actually surprised the XPath and CSS parser numbers are cut in half, basically, just by changing my object structures. I’ve just finished a new Ragel-based CSS selector parser which should cause those searches to drop dramatically. I am considering dropping XPath support this time.
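
To put a number on "cut in half", here is the arithmetic on the "real" column of the output above, as a small sketch. The timings are copied from that output; the helper names are hypothetical.

```ruby
# Wall-clock ("real") seconds copied from the benchmark output in the post.
REAL_SECONDS = {
  "doc"   => { "hpricot" => 2.655527, "hpricot2" => 0.349340, "nokogiri" => 0.611570 },
  "xpath" => { "hpricot" => 1.911496, "hpricot2" => 0.897664, "nokogiri" => 0.061546 },
  "css"   => { "hpricot" => 1.889301, "hpricot2" => 0.677072 }, # nokogiri crashed on this one
}

# How many times faster the `fast` timing is than the `slow` one.
def speedup(slow, fast)
  slow / fast
end

# Old parser vs. new parser for each test, rounded to two places.
def speedup_table(timings)
  timings.map do |test, t|
    [test, speedup(t["hpricot"], t["hpricot2"]).round(2)]
  end.to_h
end

speedup_table(REAL_SECONDS).each do |test, ratio|
  puts format("%-6s old/new = %.2fx", test, ratio)
end
```

On these figures the new parser loads the document roughly 7.6 times faster than the old one, while the XPath and CSS searches come out roughly 2.1 and 2.8 times faster.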

I haven’t got the new parser totally switched in yet. Right now you call Hpricot.scan without a block. Once I’m finished testing the two side-by-side, I’ll swap in the new parser and release 0.7.


I feel some regret posting a benchmark at all, because I don’t want to detract from my main point.

Someday, Nokogiri may be seven times faster than Hpricot. Someday it may be twelve times slower. In fact, on one single day, it may be five times faster, then fourteen times slower, then eleven-point-three times faster!

But Nokogiri has no fuzzy fruited emblem. And it does not dwell in an orchard of markup. (Such a very yummy orchard, you’d never believe!) I can put those statements in my promos and newsreels and they’ll never change.

Now begin the comments …

40 comments

jontyjont

said on November 3rd 14:14

Well said _Why!!

I think perhaps that some people are unripe fruit coloured at your excellence!

Alistair Holt

said on November 3rd 14:18

Hooray for Hpricot!

Matt Aimonetti

said on November 3rd 14:30

Great stuff, I’m actually pretty happy that nokogiri and hpricot are both trying to improve perfs. I know of people saying that the Ruby community doesn’t have an awesome parser like python’s. I’m pretty happy to see that things are improving!

Thanks _why for your valuable contribution.

-Matt

Aaron Patterson

said on November 3rd 14:31

If my only contribution is that I motivate you to update Hpricot, then I have achieved my goal.

Thanks.

_why

said on November 3rd 14:55

Aaron Patterson: Wait a minute. But I am only working on Hpricot to put some pressure on Nokogiri and LibXML. So they can live up to their potential. So that you can grow as a person.

I actually think that, for xml, Nokogiri is going to school Hpricot’s sorry ass-shaped apricot crease. I can’t turn off my flexible parsing mode, it’s built-in. I can’t do checks for well-formedness, nor am I able to parse utf-16 and utf-32. And, like I said, I’ll probably drop XPath support since JQuery did the same.

Aaron Patterson

said on November 3rd 15:05

_why: Thanks for the motivation! I can feel the pressure! urnnghhhh! ack!

defunkt

said on November 3rd 15:24

Hpricot is dead, long live Hpricot!

doki_pen

said on November 3rd 15:39

Sowwy _why. To hew wit pawsing speeds. He wins on haiw alone.

Jon

said on November 3rd 15:58

I just want to thank Aaron Patterson, _why, and the LibXML-Ruby team for putting together these great, open source libraries. It’s so amazing to have choices and to see you guys constantly upping the ante, that I just may have to start committing code to y’all.

leethal

said on November 3rd 16:30

The Return of the Epic and Forgotten Hpricot Library.

Peter Cooper

said on November 3rd 16:40

I’m proud to be considered sensationalist by the king of sensationalism himself. If it gets people talking, thinking, and doing, it’s a very rewarding thing to stir the pot as you know yourself!

_why

said on November 3rd 16:50

Get your grimey, upskirt-hungry camera lenses out of here, Peter Coooper!! And take your no-good besmirched and libel-stained microphones with you. Humpf.

Dr Nic

said on November 3rd 16:56

I wish hpricot could render its bountiful logo on my terminal after I had finished installing it.

Dr Nic

said on November 3rd 17:01

I had to change line 9 to

content = open("http://railstips.org/assets/2008/8/9/timeline.xml").read

to avoid this error

`read': No such file or directory - http://railstips.org/assets/2008/8/9/timeline.xml (Errno::ENOENT)
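
A likely explanation, offered as a sketch and not as a description of the actual script: File.read treats its argument as a local path, so handing it a URL raises Errno::ENOENT. Loading open-uri is what allows a URL to be read. On Ruby 3.0 and later, plain open no longer accepts URLs and URI.open is the call to use. The read_source helper below is hypothetical.

```ruby
require 'open-uri'

# Hypothetical helper: return the document body whether `source` is a
# local path or an http(s) URL.
def read_source(source)
  if source =~ %r{\Ahttps?://}
    # URI.open is provided by open-uri; Kernel#open stopped accepting URLs in Ruby 3.0.
    URI.open(source) { |io| io.read }
  else
    # Plain paths go through File.read, which raises Errno::ENOENT if the file is missing.
    File.read(source)
  end
end
```

With this in place, the same line works for a downloaded copy of the file and for the remote one.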

_why

said on November 3rd 17:04

And tell your goons in the Ruby Inside shirts to unhand me. I’m very delicate.

Peter Cooper

said on November 3rd 17:39

You can tell I’ve been taking Rupert Murdoch’s correspondence course by tape! Anyway, stop giving me ideas – I’m now trying to think of how I can work the word “upskirt” into my next post.

Peter Cooper

said on November 3rd 17:41

I’ve updated the Ruby Inside post to recognize the new greatness that is Hpricot.

Alex Pooley

said on November 3rd 17:45

Noooo.. whyyyyy.. people will die and children will bleed when you drop xpath support.

How else will I so elegantly extract values of nodes and attributes with just CSS selectors?

Think of the children why. The children!

Dr Nic

said on November 3rd 17:55

A quick search of this post finds that the first reference to “upskirt” is in your own comment… what were you looking at just before you commented here? :)

Dr Nic

said on November 3rd 17:55

Actually I wish that was true… I suck at in-line browser searching apparently.

collintmiller

said on November 3rd 18:05

Hear, hear!

Glad to hear about giving xpath the boot.

Never used it. And project managers go apesh*t for xpath, seemingly ignorant that we can reuse our css skillset. And just use json on the wire…

Curses

Senator the Unicorn

said on November 3rd 19:10

Removing X-Path will help a great many PTSD sufferers in dire need of a framework which doesn’t bring back terrible memories of a past where they were forced to create XSLT documents all day long.

The community thanks you ever so muchly for your consideration and kind heartedness towards these oft-forgotten tech veterans.

Chu Yeow

said on November 3rd 19:38

Just a bit of trivia: that Nokogiri vs. Hpricot benchmark was mine and it was a specific revision of Nokogiri vs. a specific revision of Hpricot that we were using on our own XML API (the http://static.bezurk.com/fragments/wikitravel/sin.xml file that had like 6 XML tags), it was never intended to be a comprehensive benchmark :) It sure convinced us to use Nokogiri at this time though since it benchmarked actual code in my own application (and thus was an entirely practical benchmark for my own purposes of deciding whether to switch).

Anyway, I’m liking the competition if it means nicer and faster code all around!

rick

said on November 3rd 19:40

When can we get one of these in the stdlib to replace rexml?

Alastair Brunton

said on November 4th 03:37

Top class why, you are my hero!

hosiawak

said on November 4th 07:33

rick: why would you want to replace rexml? Can you parse a large xml file as a stream using Hpricot or Nokogiri ?

PeeDee

said on November 4th 13:21

Benchmarks. My benchmark:

  • Time spent coding scripts to chop up html or xml to make it useful: 99%.
  • Time spent running the code on target pages: 1% (and I can do something else in the meantime).

It’s the elegance and ease of use that matters to me, _why. Thanks.

Peter Szinek

said on November 4th 14:43

Dropping XPath support would suck – for example scRUBYt! is relying solely on Hpricot XPath support, and I somewhat doubt it’s the only gem depending on Hpricot using XPaths (?).

topfunky

said on November 4th 18:20

Such drama!

I especially liked the part where nokogiri threw a Bus Error. I did not expect that.

anildigital

said on November 5th 09:58

seriously nice drama! I hope competition continues!

SeanJA

said on November 5th 12:10

Throwing buses is never the answer

stepheneb

said on November 6th 16:55

I like the xpath support in hpricot — I use it for transmogrifying xml documents … I also like that I can use it in jruby.

HULK

said on November 7th 14:47

HULK tell SeanJA: Throwing buses ALWAYS good answer!

RubyPanther

said on November 7th 19:01

I wouldn’t recommend actually using the xpath support in a real project, but I think it’s really useful to have it there for when you want to get down and dirty, or when you’re stuck sharecropping, or even debugging.

Not to mention that I for one am usually pleased by backwards compatibility.

Lawrence Pit

said on November 9th 05:05

Nokogiri released Oct 31st. Nov 7th it slipped into webrat which previously used hpricot. Nov 8th Merb releases its v1.0, which requires webrat, and hence nokogiri.

I didn’t run into a bus btw :

                     user     system      total        real
hpricot:doc      1.640000   0.030000   1.670000 (  1.694201)
hpricot2:doc     0.210000   0.000000   0.210000 (  0.219147)
nokogiri:doc     0.300000   0.010000   0.310000 (  0.314110)
                     user     system      total        real
hpricot:xpath    1.100000   0.010000   1.110000 (  1.113361)
hpricot2:xpath   0.570000   0.010000   0.580000 (  0.579962)
nokogiri:xpath   0.030000   0.000000   0.030000 (  0.036501)
                     user     system      total        real
hpricot:css      1.040000   0.010000   1.050000 (  1.058107)
hpricot2:css     0.440000   0.000000   0.440000 (  0.450744)
nokogiri:css     0.260000   0.000000   0.260000 (  0.260388)

Gem used is nokogiri-1.0.3.

Anyways, I’ll always love those fruity markup!

ryan_a

said on November 9th 21:10

Nnokogirl threw a bus? Them’s some big muscle arms!!

huh?

said on November 18th 19:44

Yes I’m new and only understand half the things said here. But let me be so bold as to ask…what is wrong with XPath? We use it all the time with Hpricot and it works well.

lzell

said on November 22nd 15:30

Another vote for keeping XPath support. What is motivating you to drop it?

Anko Painting

said on January 1st 20:37

Hey Why,
what’s going on with the hpricot bug tracker? Any plans to support 1.9.1? :)

why what the heck happened to hpricot?

said on January 2nd 16:49

I googled hpricot…
Clicked on the first result…
https://code.whytheluckystiff.net/hpricot/
Failed to Connect…

Comments are closed for this entry.