The other day I found this piece of code at work:
case status
when 'booked'
  MyNamespace::Success
when 'cancelled', 'canceled'
  MyNamespace::Cancelled
when 'pending'
  MyNamespace::Pending
else
  MyNamespace::Unknown
end
and I remembered that in one of her talks, Sandi Metz used a Ruby Hash to select a class for a factory. Something like this:
{
  'booked' => MyNamespace::Success,
  'cancelled' => MyNamespace::Cancelled,
  'canceled' => MyNamespace::Cancelled,
  'pending' => MyNamespace::Pending,
}.fetch(status) { MyNamespace::Unknown }
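To show how such a lookup could sit inside a factory, here is a minimal sketch. The four classes are empty stubs standing in for the real ones, and the `STATUS_CLASSES` constant and `build` method are hypothetical names used only for illustration:

```ruby
module MyNamespace
  # Empty stubs standing in for the real classes, which are not shown here.
  class Success; end
  class Cancelled; end
  class Pending; end
  class Unknown; end

  # Hypothetical lookup table: status string => class to instantiate.
  STATUS_CLASSES = {
    'booked' => Success,
    'cancelled' => Cancelled,
    'canceled' => Cancelled,
    'pending' => Pending,
  }.freeze

  # Hypothetical factory method: picks the class, then instantiates it.
  # Hash#fetch only runs the block when the key is missing.
  def self.build(status)
    STATUS_CLASSES.fetch(status) { Unknown }.new
  end
end

puts MyNamespace.build('booked').class    # prints MyNamespace::Success
puts MyNamespace.build('refunded').class  # prints MyNamespace::Unknown
```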
I guessed that the Hash would be a somewhat faster approach, so I decided to benchmark it.
My constraints for the problem
To test this out, I set some limits on the scope:
- Use just random data. No fancy statistically accurate distributions.
- No classes, just regular old strings as values.
- Test just a couple of variations.
- Use the benchmark-ips gem.
The setup
Here's the setup:
First, the case version, as seen in the original code
def case_status_for(status)
  case status
  when 'booked'
    'SUCCESS'
  when 'cancelled', 'canceled'
    'CANCELLED'
  when 'pending'
    'PENDING'
  else
    'UNKNOWN'
  end
end
Then, a version using regular expressions
def case_regex_status_for(status)
  case status
  when /booked/
    'SUCCESS'
  when /cancell?ed/
    'CANCELLED'
  when /pending/
    'PENDING'
  else
    'UNKNOWN'
  end
end
Finally, the Hash approach
def hash_status_for(status)
  {
    'booked' => 'SUCCESS',
    'cancelled' => 'CANCELLED',
    'canceled' => 'CANCELLED',
    'pending' => 'PENDING',
  }.fetch(status) { 'UNKNOWN' }
end
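One detail worth keeping in mind about this version: the Hash literal sits inside the method, so it is evaluated again on every call, and each lookup also pays for building a four-entry Hash. A minimal sketch of a variant that builds the table once, assuming a constant called `STATUS_TABLE` and a method called `constant_hash_status_for` (both hypothetical names, not part of the original code):

```ruby
# Hypothetical constant: built once when the file is loaded, then reused.
STATUS_TABLE = {
  'booked' => 'SUCCESS',
  'cancelled' => 'CANCELLED',
  'canceled' => 'CANCELLED',
  'pending' => 'PENDING',
}.freeze

def constant_hash_status_for(status)
  # Hash#fetch runs the block only when the key is missing.
  STATUS_TABLE.fetch(status) { 'UNKNOWN' }
end

puts constant_hash_status_for('canceled')  # prints CANCELLED
puts constant_hash_status_for('refunded')  # prints UNKNOWN
```

Freezing the table also guards against accidental mutation, since it is now shared by every call.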
And a helper method for getting a random status from a list
def status
  ['booked', 'cancelled', 'canceled', 'pending'].sample
end
The benchmarks
Then, I added the benchmarks
require 'benchmark/ips'

Benchmark.ips do |x|
  x.config(time: 25, warmup: 2)
  x.report("case") {
    case_status_for(status)
  }
  x.report("case with regexes") {
    case_regex_status_for(status)
  }
  x.report("hash") do |times|
    hash_status_for(status)
  end
  x.compare!
end
The results
The results surpassed all my expectations. As I said, I guessed that the Hash would be faster, but we're talking about more than 25,000 times faster. And that's without even mentioning the 80,000 times compared to using regexes!
# coding: utf-8
# >> Warming up --------------------------------------
# >> case 85.888k i/100ms
# >> case with regexes 38.234k i/100ms
# >> hash 56.338k i/100ms
# >> Calculating -------------------------------------
# >> case 1.507M (± 2.6%) i/s - 37.705M in 25.032657s
# >> case with regexes 489.697k (± 0.8%) i/s - 12.273M in 25.064364s
# >> hash 39.847B (±18.7%) i/s - 670.057B
# >>
# >> Comparison:
# >> hash: 39846575958.7 i/s
# >> case: 1507310.6 i/s - 26435.54x slower
# >> case with regexes: 489696.9 i/s - 81369.87x slower
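One thing worth checking when reading these numbers: benchmark-ips treats a block that declares a parameter (such as `|times|`) as a request to receive the iteration count and run the code that many times itself. The "hash" report declares `|times|` but makes a single call, while the other two reports use plain blocks, so the three figures may not be directly comparable. As a rough cross-check, here is a sketch that times the case and Hash versions the same way using only the standard library's `Benchmark.realtime`. The iteration count and the `ITERATIONS` and `SAMPLE_STATUSES` names are arbitrary assumptions, and no particular outcome is implied:

```ruby
require 'benchmark'

def case_status_for(status)
  case status
  when 'booked' then 'SUCCESS'
  when 'cancelled', 'canceled' then 'CANCELLED'
  when 'pending' then 'PENDING'
  else 'UNKNOWN'
  end
end

def hash_status_for(status)
  {
    'booked' => 'SUCCESS',
    'cancelled' => 'CANCELLED',
    'canceled' => 'CANCELLED',
    'pending' => 'PENDING',
  }.fetch(status) { 'UNKNOWN' }
end

# Hypothetical settings for this sketch.
ITERATIONS = 200_000
SAMPLE_STATUSES = %w[booked cancelled canceled pending].freeze

# Pre-generate the inputs so both variants see exactly the same data
# and the cost of Array#sample stays out of the measurement.
inputs = Array.new(ITERATIONS) { SAMPLE_STATUSES.sample }

timings = {
  'case' => Benchmark.realtime { inputs.each { |s| case_status_for(s) } },
  'hash' => Benchmark.realtime { inputs.each { |s| hash_status_for(s) } },
}

timings.each do |label, seconds|
  puts format('%-6s %.4fs for %d calls', label, seconds, ITERATIONS)
end
```

With benchmark-ips itself, the equivalent would be to either drop the block parameter from the "hash" report or loop `times` times inside that block.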
Pros, cons and other considerations
As with every approach in programming, this is not a silver bullet, and there are some considerations to keep in mind
Readability
I don't mind the way the Hash version reads at all; in fact, I quite like it. But readability is a very subjective matter, and you might find it unreadable (I understand if that's the case).
Multiple keys and repeated values
In the example I used, the status may come as either 'canceled' or 'cancelled', and the result is the same ('CANCELLED'). In the case statement, both spellings go through the same branch, but with a Hash each spelling needs its own entry, so we have to duplicate the value. If there are too many of those, the code might become ugly, and a case statement with regular expressions could be a much better option in terms of readability.
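If the duplication becomes a nuisance, one possible way around it is to declare the spellings grouped by result and derive the flat lookup table from that. This is only a sketch: the `ALIASES` and `STATUS_BY_ALIAS` constants and the `alias_status_for` method are hypothetical names, not part of the original code.

```ruby
# Hypothetical layout: each result lists every spelling that maps to it.
ALIASES = {
  'SUCCESS' => %w[booked],
  'CANCELLED' => %w[cancelled canceled],
  'PENDING' => %w[pending],
}.freeze

# Invert the grouping once into a flat table with one entry per spelling.
STATUS_BY_ALIAS = ALIASES.each_with_object({}) do |(result, spellings), table|
  spellings.each { |spelling| table[spelling] = result }
end.freeze

def alias_status_for(status)
  STATUS_BY_ALIAS.fetch(status) { 'UNKNOWN' }
end

puts alias_status_for('canceled')   # prints CANCELLED
puts alias_status_for('cancelled')  # prints CANCELLED
puts alias_status_for('refunded')   # prints UNKNOWN
```

The lookup itself is unchanged; only the declaration moves to a shape where each result appears once.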
Regular expressions
If we need to switch using regular expressions, the Hash is definitely out of the question.
Complex selection policy
For more complex cases, for example when we need to perform some kind of calculation to select the result, we might want to use Policy classes and/or lambdas. In that case, the case statement is the better fit, because a when clause accepts anything that responds to ===, lambdas included.
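As an illustration of that kind of selection, here is a small sketch of a case statement driven by lambdas. It relies on `Proc#===`, which calls the lambda with the value being matched. The predicates and the `policy_status_for` method are hypothetical and only meant to show the shape:

```ruby
# Hypothetical predicates; each one does a small calculation before deciding.
NORMALIZE = ->(status) { status.to_s.strip.downcase }
IS_BOOKED = ->(status) { NORMALIZE.call(status) == 'booked' }
IS_CANCELLED = ->(status) { NORMALIZE.call(status).start_with?('cancel') }

def policy_status_for(status)
  case status
  when IS_BOOKED then 'SUCCESS'       # Proc#=== calls the lambda with status
  when IS_CANCELLED then 'CANCELLED'
  else 'UNKNOWN'
  end
end

puts policy_status_for(' Booked ')     # prints SUCCESS
puts policy_status_for('cancelation')  # prints CANCELLED
puts policy_status_for(nil)            # prints UNKNOWN
```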
Speed!!
Now… if you come across a case where the selection is as simple as the one in the example, using a Hash can speed up your code considerably. Especially if:
- There are a lot of options… the more the better
  Hashes have constant (O(1)) access time on average, while a case statement that tests its branches one by one has linear (O(n)) access time. Having more options therefore increases the case statement's access time, and the comparison tilts further in favor of the Hash.
- Frequently used code
  If the code is called regularly, the speed boost will make itself noticed.
Thanks for reading.
Saluti