Friday, the end of the working day, no signs of trouble. "You've got mail" - Ruby Weekly. The latest much-talked-about blog posts, jobs (skipping) ... wait, software engineer, London? The description of a vacancy and a firm ... a problem, that's interesting. "A local variable named log contains an array of hashes with timestamped events like ..." - piece of cake!
The first solution was quick to write, but it was only working with 8 elements. What if there are 10000008 elements? 32 seconds of calculation is too long. A simple optimisation, changing 'merge' to 'merge!', brought that down to 22 seconds. No way!
All the way home I was thinking the problem over, trying hard to solve it, but everything was in vain. The answer came unexpectedly, as usual, from #[email protected], where I was reminded of the 'group_by' method. Of course, how could I forget about it? I won't give the solution in this post, so as not to spoil it (it will be in my application form).
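As a reminder of what 'group_by' does, here is a minimal sketch. It is not the solution to the task: the data in SAMPLE_LOG and the helper name events_by_time are made up for illustration, and the :time key is an assumption about what a timestamped event might look like.

```ruby
# Hypothetical sample data; the real `log` from the task is not shown in this post.
SAMPLE_LOG = [
  { time: 201201, x: 2 },
  { time: 201201, y: 7 },
  { time: 201202, a: 1 },
  { time: 201202, b: 4 }
].freeze

# Enumerable#group_by returns a hash whose keys are the block results
# and whose values are arrays of the elements that produced each key.
def events_by_time(log)
  log.group_by { |event| event[:time] }
end

events_by_time(SAMPLE_LOG).each do |time, events|
  puts "#{time}: #{events.size} event(s)"
end
```

The whole array is walked once, so each event lands in its bucket without any repeated searching.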
All benchmarking was carried out on my home iMac (21.5-inch, Late 2012), 2.7 GHz i5, 8 GB.
The solution (ruby 2.1.5)
                      user     system      total        real
reduce symbols:  18.100000   1.420000  19.520000 ( 19.857926)
Not bad. Now change 'merge' to 'merge!':
                      user     system      total        real
reduce symbols:   5.770000   0.060000   5.830000 (  5.821176)
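Why does one exclamation mark matter so much? 'merge' builds a new hash on every call, while 'merge!' updates the receiver in place. The sketch below shows the difference inside 'reduce'. It is only an illustration under my own assumptions: the input (many one-key hashes) and the method names are hypothetical and are not the code that produced the numbers above.

```ruby
require 'benchmark'

# Hypothetical input: n small hashes, each with a single unique symbol key.
def sample_hashes(n)
  Array.new(n) { |i| { :"k#{i}" => i } }
end

# Hash#merge returns a brand new hash, so the accumulator
# is copied again on every step of the reduce.
def combine_with_merge(hashes)
  hashes.reduce({}) { |acc, h| acc.merge(h) }
end

# Hash#merge! mutates the accumulator and returns it, so nothing is copied.
def combine_with_merge_bang(hashes)
  hashes.reduce({}) { |acc, h| acc.merge!(h) }
end

hashes = sample_hashes(5_000)
Benchmark.bm(8) do |bm|
  bm.report('merge:')  { combine_with_merge(hashes) }
  bm.report('merge!:') { combine_with_merge_bang(hashes) }
end
```

Both methods return the same hash; only the amount of copying and garbage differs. 'merge!' is safe here because the accumulator is created inside 'reduce' and nobody else holds a reference to it.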
For these tests I used symbols as hash keys. As is well known, symbols are not swept by the GC while the program runs, so that memory is never freed.
So why not switch the hash keys to strings, so the memory can be reclaimed?
                       user     system      total        real
reduce strings :  20.730000   0.610000  21.340000 ( 21.349411)
                       user     system      total        real
reduce strings :   8.940000   0.210000   9.150000 (  9.145551)
As I expected, the calculation took more time.
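One likely reason why string keys cost more: a symbol with a given name exists only once, while every string is an object of its own, to be allocated, hashed by content and eventually collected. A minimal sketch of that difference (the helper names are mine, purely for illustration):

```ruby
# Two symbols with the same name are the very same object.
def same_symbol_object?(name)
  name.to_sym.equal?(name.to_sym)
end

# Two strings with equal content can still be two distinct objects.
def same_string_object?(name)
  name.dup.equal?(name.dup)
end

puts same_symbol_object?('user')
puts same_string_object?('user')

# Both kinds of key work the same way for lookups.
symbol_keyed = { user: 1 }
string_keyed = { 'user' => 1 }
puts symbol_keyed[:user] == string_keyed['user']
```

So the choice between symbols and strings does not change the result, only how many objects the interpreter has to create and clean up along the way.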
The GC was reworked in Ruby 1.9, again in 2.1 (generational GC) and once more in 2.2 (incremental GC and symbol GC), which is still in beta. Let's try Ruby 2.2 and see what results we get.
                       user     system      total        real
reduce symbols:   20.100000   0.170000  20.270000 ( 20.273468)
                       user     system      total        real
reduce strings :   6.620000   0.050000   6.670000 (  6.668298)
                       user     system      total        real
reduce strings :  22.200000   0.210000  22.410000 ( 22.412580)
                       user     system      total        real
reduce strings :   6.610000   0.050000   6.660000 (  6.664764)
Since the new GC in Ruby 2.2 sweeps symbols, 2.2 is a bit slower than Ruby 2.1.5.
When we work with big data, we have to be careful with our tools: a single method can make a great difference.
P.S. There is an excellent video about Ruby code optimisation.