Screen Scrape No More…Seriously!

February 10, 2008 Parker Thompson

This week I had the pleasure of attending Dapper Camp, put on by the folks at Dapper for their user community. Mitch Kapor kicked it off with a talk on disruptive technology, openness, and innovation. We then heard from the Dapper team about new and upcoming features, and from folks like Aaron Fulkerson of MindTouch about using Dapper to repurpose data. All around it was pretty interesting.

What I love about Dapper is that it helps solve one of the big issues I see our clients have: data. We can build just about anything, but if an application needs some specific data (and many do), products must be launched with sub-par (but available) data, or worse, launches can end up being delayed. In many cases, we can end up spending a large amount of time (aka money) getting/munging data rather than developing features. Note: I also think the ease of pushing data out of apps via instant gadgets makes Dapper very interesting, but that’s a whole separate post.

The “Dapp Factory” is a Rhino-based server application and a web front-end that deals with just about any site by proxying your requests and modeling the DOM on the proxy, then recording your actions for later replay. But their secret sauce is a super-cool algorithm that figures out the structure of pages in such a way that your API can withstand changes to the target site, making your feed resilient to all but massive site overhauls. You then simply consume an XML or JSON feed, or use a simple API to dynamically construct parameterized feeds.
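To make the "consume a feed" step concrete, here is a minimal Ruby sketch of building a parameterized feed URL and parsing a JSON response. I don't have the real API in front of me, so the endpoint (`dapper.example`), the `dapp` and `format` parameter names, and the top-level `"items"` array are all assumptions for illustration, not Dapper's actual interface.

```ruby
require "json"
require "uri"

# Hypothetical endpoint; the real Dapper feed URL may look different.
FEED_ENDPOINT = "https://dapper.example/run"

# Build a feed URL for a named dapp. Extra keyword arguments are passed
# through as query parameters (the "parameterized" part of the feed).
# The "dapp" and "format" parameter names are assumptions.
def feed_url(dapp_name, format: "json", **params)
  query = { "dapp" => dapp_name, "format" => format }
  params.each { |key, value| query[key.to_s] = value.to_s }
  uri = URI(FEED_ENDPOINT)
  uri.query = URI.encode_www_form(query)
  uri.to_s
end

# Parse a JSON feed body into an array of plain hashes.
# Assumes the feed is an object with a top-level "items" array;
# a feed without one yields an empty list.
def parse_feed(json_body)
  JSON.parse(json_body).fetch("items", [])
end
```

Fetching the URL (with Net::HTTP or similar) is left out on purpose so the sketch runs without a network connection.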

There are other companies trying to make data less painful. Metaweb, for example, provides an incredibly fast graph engine and relational schemas (think RDF) that make real-time use interesting. But if the data you need isn’t in Freebase (a likelihood until they get larger), or your data is continually being updated, you will still be stuck scraping and relating the data, and that’s generally where most of the work is.

Take as a small example Dav’s awesome Vacation Planner. The concept is simple, the feature set is small, but getting the data is a pain (see article). Some sites don’t have APIs, and those that do are unstandardized, sometimes buggy, and often missing the data you need.

I could imagine writing a dapp parser akin to ActiveResource pretty easily (I hear a Ruby SDK came out of Dapper Camp, but I can’t find it). With a little more work, it would probably be easy to add cache_fu support, and Ruby modules that could be mixed into models (for asynchronous data gathering) and controllers (for serve now vs. polling) to easily support Dav’s polling mechanism.
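Here is a rough sketch of what that parser might look like. Everything in it is hypothetical: `ActiveDapp` does not exist, the `fetcher` hook and the `"items"` feed shape are my assumptions, and the in-memory hash is only a stand-in for what cache_fu would do with memcached.

```ruby
require "json"

# Hypothetical "ActiveDapp" base class. Nothing here is a real Dapper,
# Rails, or cache_fu API; it only illustrates the shape of the idea.
class ActiveDapp
  class << self
    # Callable that takes a params hash and returns a JSON string.
    # Injected so the sketch runs (and can be tested) without a network.
    attr_accessor :fetcher

    # Per-class in-memory cache keyed by params (stand-in for cache_fu).
    def cache
      @cache ||= {}
    end

    # Return records for the given params, fetching only on a cache miss.
    # Assumes the feed is a JSON object with a top-level "items" array.
    def find(params = {})
      cache[params] ||= JSON.parse(fetcher.call(params))
                            .fetch("items", [])
                            .map { |attrs| new(attrs) }
    end
  end

  def initialize(attributes)
    @attributes = attributes
  end

  # Expose feed fields as reader methods, in the spirit of ActiveResource.
  def method_missing(name, *args)
    key = name.to_s
    return @attributes[key] if args.empty? && @attributes.key?(key)
    super
  end

  def respond_to_missing?(name, include_private = false)
    @attributes.key?(name.to_s) || super
  end
end

# A model then needs almost no code of its own.
class Destination < ActiveDapp; end
```

With a fetcher assigned, `Destination.find({ "from" => "SFO" })` returns objects whose feed fields read like attributes (`destination.city`), and a repeated call with the same params is served from the cache instead of hitting the feed again.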

This would leave Dav with pretty simple model (data) code, and the luxury of focussing on whether to add Wikipedia integration for population figures or the Big Mac Index, rather than tweaking his Mechanize XPaths all weekend. I vote for the Big Mac Index.

So, the next time someone suggests you screen scrape to get that data you need, tell them to give Dapper a shot. And if anyone wants to write ‘ActiveDapp’ let me know. It could be really fun.

Note: In the spirit of full disclosure, Jon Aizen (Dapper CTO) is a friend of mine and they gave me a free t-shirt…and a sandwich (thanks).
