We're working on a project with a very large dataset: more than 2 million items. The dataset changes frequently, and those changes need to be shipped out to their respective servers, ready to be served to clients.
We decided on a queuing architecture to distribute the data: objects are serialized and pushed onto a queue. The sheer size of the dataset means we need to optimize wherever we can. There are only so many hours in a day, and there is a lot of data to move.
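The producer/consumer flow looks roughly like this. A minimal sketch: the in-memory Queue below stands in for whatever transport the real system uses, and Marshal is just one possible serializer.

```ruby
# Stand-in transport: an in-memory Queue. The real system would use a
# network-backed queue; this only illustrates the shape of the flow.
queue = Queue.new

record = { "id" => 42, "payload" => "hello" }

# Producer side: serialize the object and push the byte string.
queue.push(Marshal.dump(record))

# Consumer side: pop and deserialize, ready to serve to clients.
restored = Marshal.load(queue.pop)
```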
A question came up in standup: which serialization method is faster, YAML::dump or Marshal.dump? It seemed worth writing a quick script to work out which was appropriate for our particular situation.
The objects we serialize are simple hashes, so I wrote something representative of our situation in order to reach a clear decision.
Here’s some code:
require 'yaml'

obj = {:a => "hello", :b => "goodbye", :c => "new string",
       :d => {:da => 1, :db => 2}, :e => 1}

start = Time.now
(0..10000).each do
  ser_obj = YAML::dump(obj)
  new_obj = YAML::load(ser_obj)
end
puts "YAML::dump time"
puts Time.now - start

start = Time.now
(0..10000).each do
  ser_obj = Marshal.dump(obj)
  new_obj = Marshal.load(ser_obj)
end
puts "Marshal.dump time"
p Time.now - start
I think we all knew how the results would look. It was nice to see that for our particular case there was a clear winner.
YAML::dump time
5.397909
Marshal.dump time
0.280292
Seems fairly cut and dried to me.
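For the curious, Ruby's standard Benchmark library produces the same comparison with less boilerplate. A minimal sketch; note the string keys, since on newer Rubys YAML.load performs a safe load that refuses to deserialize Symbols by default:

```ruby
require 'benchmark'
require 'yaml'

# Same shape of object as above, but with string keys so the YAML
# round-trip also works on newer Rubys, where YAML.load rejects
# Symbols by default.
obj = { "a" => "hello", "b" => "goodbye", "c" => "new string",
        "d" => { "da" => 1, "db" => 2 }, "e" => 1 }

Benchmark.bm(8) do |bm|
  bm.report("YAML")    { 10_000.times { YAML.load(YAML.dump(obj)) } }
  bm.report("Marshal") { 10_000.times { Marshal.load(Marshal.dump(obj)) } }
end
```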
I still personally prefer YAML for comparing test results, since its output is human-readable. Maybe we'll put something in our spec_helper to use YAML for testing and Marshal in production.
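That switch could look something like the following sketch. AppSerializer and the APP_ENV check are hypothetical names of my own invention, not anything from our codebase:

```ruby
require 'yaml'

# Hypothetical wrapper: YAML under test, where human-readable dumps make
# failed comparisons easy to read; Marshal everywhere else, for speed.
module AppSerializer
  def self.engine
    ENV['APP_ENV'] == 'test' ? YAML : Marshal
  end

  def self.dump(obj)
    engine.dump(obj)
  end

  def self.load(data)
    engine.load(data)
  end
end

record = { "status" => "ok", "count" => 3 }
data = AppSerializer.dump(record)
restored = AppSerializer.load(data)
```

Both YAML and Marshal expose the same dump/load interface, which is what makes a one-line engine switch possible.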