Ruby – include? versus intersection
After my last article about intersection and arrays, I got to thinking how this could apply elsewhere.
I decided to see if I could write a faster include?() method for Array using the intersection operator. So here is what I did:
1) Tested via irb (of course)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
[pullmonkey]$ irb # build the array for conceptual testing irb(main):003:0> a = [1,2,3] => [1, 2, 3] # see how include?() does it irb(main):004:0> a.include?(2) => true irb(main):005:0> a.include?(4) => false # see what could be done with intersect irb(main):006:0> a & [2] => [2] irb(main):007:0> (a & [2]).empty? => false # this should return true, so negate it irb(main):008:0> !(a & [2]).empty? => true irb(main):009:0> !(a & [4]).empty? => false |
2) Extended the Array class with the faster_include?() method:
1 2 3 4 5 6 |
class Array def faster_include?(n) !(self & [n]).empty? end end |
3) Wrote a performance test comparing elapsed time for true and false results:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# size of the array x = 4000000 # build the array a = (1..x).to_a p "Finding #{x/2} ..." t1 = Time.now p "include?() - returns true: #{a.include?(x/2)}" t2 = Time.now p "faster_include?() - returns true: #{a.faster_include?(x/2)}" t3 = Time.now p "include?() - returns false: #{a.include?(-50)}" t4 = Time.now p "faster_include?() - returns false: #{a.faster_include?(-50)}" t5 = Time.now p "Elapsed time for include?() returning true: #{t2 - t1}." p "Elapsed time for faster_include?() returning true: #{t3 - t2}." p "Elapsed time for include?() returning false: #{t4 - t3}." p "Elapsed time for faster_include?() returning false: #{t5 - t4}." |
4) Just to see if there was a difference I got results and here they are:
Attempt 1:
1 2 3 4 5 6 7 8 9 10 |
"Finding 2000000 ..." "include?() - returns true: true" "faster_include?() - returns true: true" "include?() - returns false: false" "faster_include?() - returns false: false" "Elapsed time for include?() returning true: 0.308243." "Elapsed time for faster_include?() returning true: 0.271465." "Elapsed time for include?() returning false: 0.629375." "Elapsed time for faster_include?() returning false: 0.242673." |
Attempt 2:
1 2 3 4 5 6 7 8 9 10 |
"Finding 2000000 ..." "include?() - returns true: true" "faster_include?() - returns true: true" "include?() - returns false: false" "faster_include?() - returns false: false" "Elapsed time for include?() returning true: 0.323652." "Elapsed time for faster_include?() returning true: 0.272212." "Elapsed time for include?() returning false: 0.594751." "Elapsed time for faster_include?() returning false: 0.277882." |
Note that faster_include?() is consistent in elapsed time for a result of true and a result of false. However, include?() takes much longer in both attempts to return false than it does to return true, well actually about twice as long. include?() goes through each element, and I am using the midway point of the array in all tests, so it makes sense that include?() would return true in half the time it would take to return false. Very interesting, so now that I see performance improvement with faster_include?(), I need to do a wider spread of tests and for more attempts:
5) Better testing:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
results = {} n = 5 # size of the array [1,50,500,5000,5000000].each do |x| a = (1..x).to_a n.times do |t| # build the array t1 = Time.now a.include?(x/2) t2 = Time.now a.faster_include?(x/2) t3 = Time.now a.include?(-50) t4 = Time.now a.faster_include?(-50) t5 = Time.now results[x] ||= {} results[x]["true"] ||= [] results[x]["false"] ||= [] results[x]["true"] << (t2 - t1) - (t3 - t2) results[x]["false"] << (t4 - t3) - (t5 - t4) end detailed_results = results results[x] = {"true" => detailed_results[x]["true"].sum / n, "false" => detailed_results[x]["false"].sum / n} end pp results |
My new results:
The results are based on 5 attempts for each array for both true and false for a total of 50 attempts.
1 2 3 4 5 6 |
{5000000=>{"true"=>0.0921126, "false"=>0.4374644}, 5000=>{"true"=>5.84e-05, "false"=>0.0004404}, 500=>{"true"=>1.4e-06, "false"=>4.16e-05}, 50=>{"true"=>-3.4e-06, "false"=>1.0e-06}, 1=>{"true"=>-4.8e-06, "false"=>-3.8e-06}} |
Note that arrays with 50 items or less seem to perform better using include?(), but barely. On the other hand with an array of 5M items, you shave off half a second using faster_include?().
So there you have it, faster_include?() can keep its name 🙂
Sorry, comments for this entry are closed at this time.