The Story of Streaming HTTP Through MouseHole (With Subsequent Adventures in Continuations and Duck Typing)
Hacking MouseHole has been considerably fun and challenging, especially since much of the work involves subclassing and reshaping existing classes. It’s like glass blowing. Overriding methods here and there, but holding on to the underlying class as much as possible.
(If you’re new, MouseHole is an idea conceived by the readers of this blog: a scriptable proxy for rewriting the web and sharing little web applications. The wiki goes into detail.)
I’m sure one of the things that deters many from using MouseHole is the word proxy. No one wants to hinder their browsing with longer wait times while a proxy gets its act together. Cleaning the HTML, parsing it, rewriting it, rebuilding it. Sounds like a lot.
So, I’ve been working on adding streaming support to the proxy. So downloads will get a progress bar in the browser. And so that images, CSS, unhandled pages, etc. will travel through the proxy quicker. It’s working well, a big improvement, but here are a couple of notes on what it took.
The Net::HTTP and WEBrick Cable
At heart, the WEBrick::HTTPProxy uses a bit of code to pass its request onto the web and then hand it back to the browser. Trimming out a bit of cruft, it looks like this:
    begin
      http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)
      http.start{
        case req.request_method
        when "GET"  then response = http.get(path, header)
        when "POST" then response = http.post(path, req.body || "", header)
        when "HEAD" then response = http.head(path, header)
        else
          raise HTTPStatus::MethodNotAllowed,
            "unsupported method `#{req.request_method}'."
        end
      }
    rescue => err
      logger.debug("#{err.class}: #{err.message}")
      raise HTTPStatus::ServiceUnavailable, err.message
    end

    # Convert Net::HTTP::HTTPResponse to WEBrick::HTTPProxy
    res.status = response.code.to_i
    choose_header(response, res)
    set_cookie(response, res)
    set_via(res)
    res.body = response.body

    # Process contents
    if handler = @config[:ProxyContentHandler]
      handler.call(req, res)
    end
So three things:
- Use Net::HTTP to retrieve the resource from the web.
- Convert the response to WEBrick’s own response object type.
- Pass the response to a handler, if one is set up.
In the case of MouseHole: yes, we’ve got a handler. Our handler checks to see if a MouseHole script wants to mess with the resource.
Now, how do we add streaming to this? The above code is using Net::HTTP to download the whole resource before moving on. We can’t have this. We want to just get the headers and let the handler decide what to do with the response body.
One other thing about WEBrick that I discovered: if you give it a response where the @body is an IO object, it’ll stream that object to the output. Perfect, right?
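The loop a server needs for this kind of streaming is short. Here is a minimal sketch of the idea, not WEBrick’s actual code: copy_body is a hypothetical helper, and StringIO objects stand in for both the upstream stream and the browser socket.

```ruby
require 'stringio'

# Hypothetical helper: copy a readable body to a socket in fixed-size
# chunks, so the whole resource never has to sit in memory at once.
def copy_body(body, socket, chunk_size = 4096)
  sent = 0
  while (chunk = body.read(chunk_size))
    break if chunk.empty?   # some readers signal EOF with "" instead of nil
    socket.write(chunk)
    sent += chunk.bytesize
  end
  sent
ensure
  body.close if body.respond_to?(:close)
end

upstream = StringIO.new("x" * 10_000)  # stands in for the remote resource
browser  = StringIO.new                # stands in for the client socket
copy_body(upstream, browser)
```

Nothing in the helper asks what class the body is. It only needs read, and optionally close.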
Pulling the Thread That Goes Back in Time
Digging around in the Net::HTTP code was taking forever and I really wanted to get something working. So I thought I’d try wrapping a Generator around the whole thing.
In my subclass of the proxy, I overrode the method containing the above code with:
    response = nil
    begin
      http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)
      g = Generator.new do |g|
        http.start do
          chunkd = proc do |gres|
            g.yield gres
            gres.read_body do |gstr|
              g.yield gstr
            end
          end
          case req.request_method
          when "GET"  then http.request_get(path, header, &chunkd)
          when "POST" then http.request_post(path, req.body || "", header, &chunkd)
          when "HEAD" then http.request_head(path, header, &chunkd)
          else
            raise WEBrick::HTTPStatus::MethodNotAllowed,
              "unsupported method `#{req.request_method}'."
          end
        end
      end
      # Use the generator to mimic an IO object
      def g.read sz = 0; next? ? self.next : nil end
      def g.size; 0 end
      def g.close; while next?; self.next; end end
      def g.is_a? klass; klass == IO ? true : super(klass); end
    rescue => err
      logger.debug("#{err.class}: #{err.message}")
      raise WEBrick::HTTPStatus::ServiceUnavailable, err.message
    end
    response = g.next

    # Convert Net::HTTP::HTTPResponse to WEBrick::HTTPProxy
    res.status = response.code.to_i
    choose_header(response, res)
    set_cookie(response, res)
    set_via(res)
    res.body = g
    def res.send_body(socket)
      if @body.respond_to? :read
        send_body_io(socket)
      else
        send_body_string(socket)
      end
    end

    # Process contents
    if handler = @config[:ProxyContentHandler]
      handler.call(req, res)
    end
This is a ridiculous hack. Maybe. When the generator is created, the code inside isn’t executed. It all gets skipped. But when I run g.next, the code whirs into motion. And the first yield passes us a response at the moment the headers are parsed, but before the body has been read.
The hackiness has to do with getting WEBrick to actually accept the generator as an IO object. Naturally, there has to be a read method, which just calls next to get chunks of the stream. Adding size and close methods made sense as well. But I actually had to override is_a? to let my duck-typed generator pass through. Ugghh.
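Generator is no longer part of current Ruby’s standard library, so the following sketch uses Enumerator, its modern relative, to show the same deferred execution. It illustrates the trick rather than reproducing the MouseHole code: fake_fetch and EnumReader are hypothetical names, and a hash plus two strings stand in for the Net::HTTP response and its body chunks.

```ruby
# Hypothetical stand-in for the HTTP request: the first value yielded is the
# "headers", everything after it is a chunk of the body.
def fake_fetch(log)
  Enumerator.new do |y|
    log << :request_started       # nothing here runs until the first #next
    y << { "status" => 200 }
    y << "first chunk, "
    y << "second chunk"
  end
end

# Wraps an enumerator in just enough of the IO interface to be read from.
class EnumReader
  def initialize(enum)
    @enum = enum
  end

  # Returns the next chunk, or nil once the enumerator is exhausted.
  def read(_size = nil)
    @enum.next
  rescue StopIteration
    nil
  end

  # Drain whatever is left, as the generator's close method did.
  def close
    nil while read
  end
end

log = []
stream = fake_fetch(log)
puts log.inspect             # => []  (the block has not started yet)
headers = stream.next        # runs the block up to the first yield
puts log.inspect             # => [:request_started]
body = EnumReader.new(stream)
puts body.read               # => first chunk,
```

The wrapper class avoids patching is_a?, but only because whoever consumes it is assumed to check for read rather than for IO.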
Overall, it works well. I didn’t even notice much slowness. Until memory got gobbled up. Which happens rather quickly with continuations, and continuations are what Generator is built on.
Punching Holes in Net::HTTP
Now that I had a decent angle on this, I decided to take a different approach: to rework Net::HTTP as an IO object. I would have loved to use OpenURI but it doesn’t support all the HTTP request methods. Also, http-access2 seemed to have the same problems as Net::HTTP.
The problem is that Net::HTTP will only work as a stream when it’s given a block. If no block is given, the whole stream is read into memory and returned.
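For reference, the block form looks something like the sketch below. stream_to is a hypothetical helper, not part of Net::HTTP; request_get and read_body are the real calls. The body can only be read chunk by chunk while the block is running, which is exactly the restriction described above.

```ruby
require 'net/http'

# Hypothetical helper: stream the body of a GET into `out` chunk by chunk.
# `http` is expected to be a started Net::HTTP session.
def stream_to(http, path, out)
  http.request_get(path) do |res|
    # Headers are available here, before any of the body has been read.
    res.read_body do |chunk|
      out << chunk
    end
  end
  out
end

# Usage sketch (needs network access; example.com is only a placeholder):
#   Net::HTTP.start("example.com", 80) do |http|
#     stream_to(http, "/", $stdout)
#   end
```

Once request_get returns, the stream is gone. There is no object left over that could be handed to WEBrick as a body.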
The gist of the hack is this:
    require 'net/http'

    module Net
      class HTTPIO < HTTP
        def request(req, body = nil, &block)
          begin_transport req
          req.exec @socket, @curr_http_version, edit_path(req.path), body
          begin
            res = HTTPResponse.read_new(@socket)
          end while HTTPContinue === res
          res.instance_eval do
            def read len = nil; ... end
            def body; true end
            def close
              req, res = @req, self
              @http.instance_eval do
                end_transport req, res
                finish
              end
            end
            def size; 0 end
            def is_a? klass; klass == IO ? true : super(klass); end
          end
          res
        end
      end
    end
Since all the request methods route through Net::HTTP#request, it was just a matter of replacing that method. We’ve come to expect this in object-oriented languages. But what pushes the lever further in Ruby is how I can also redefine portions of the HTTPResponse object (using singleton methods), reshaping it without needing to affect its original class.
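To make the singleton-method point concrete, here is a small sketch. Reply is a made-up class standing in for HTTPResponse, and make_readable is a hypothetical helper; the methods are added to one instance only, in the same instance_eval style as the hack.

```ruby
# Made-up stand-in for a response class we do not want to modify.
class Reply
  def initialize(text)
    @text = text
  end
end

# Reshape a single object: the defs inside instance_eval become singleton
# methods on `reply`, and can see its instance variables.
def make_readable(reply)
  reply.instance_eval do
    def read(_len = nil)
      chunk, @text = @text, nil   # hand over the text once, then nil
      chunk
    end

    def size
      0
    end
  end
  reply
end

patched = make_readable(Reply.new("hello"))
plain   = Reply.new("world")

patched.respond_to?(:read)  # => true
plain.respond_to?(:read)    # => false, the Reply class itself is untouched
```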
In all, I feel like some of the same old good practices could have made the hack easier:
- Please, no is_a? or === tests for IO objects. Duck type: use input.respond_to? :read.
- If you have a streaming IO class, don’t require a block in order to stream. I might need to take that stream out of scope with me. Also, what if I need to start and stop the stream?
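A small sketch of the first point. StringIO is the classic case: it reads like an IO but is not a subclass of IO, so a class check rejects it while respond_to? accepts it. body_strategy and body_strategy_strict are hypothetical names used only for this illustration.

```ruby
require 'stringio'

# Class-based check: only real IO subclasses get streamed.
def body_strategy_strict(body)
  body.is_a?(IO) ? :stream : :string
end

# Duck-typed check: anything that can be read from gets streamed.
def body_strategy(body)
  body.respond_to?(:read) ? :stream : :string
end

fake_io = StringIO.new("streamed")
body_strategy_strict(fake_io)  # => :string, StringIO is not an IO subclass
body_strategy(fake_io)         # => :stream
body_strategy("plain text")    # => :string
```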
In the end, I’m left pretty guilty myself. I’ve hacked classes for my own use but they’re pretty worthless outside of MouseHole. I gotta find a way to push this back into the original classes or something.
Kevin Ballard
Hey why, can you please add syntax colorizing to your code here? It sure would make it easier to read.
MrCode
Well it is like they say, you can never predict the strange way customers will use your products. In this case you are the customer and the product is Net::HTTP. I agree that using respond_to? is better than is_a?, and clearly here we have a good example of why.
Ruby in particular is possibly more “vulnerable to customer hacks” than other languages, because of how the classes are open. So you might as well write your code to be as flexible as possible.
BTW, this is all pretty impressive with the streaming and all. I took a quick look in the Wonderland days and it seemed tricky.
mu's
baffle! wooaang!