PullMonkey Blog

02 Apr

Ruby - include? versus intersection


After my last article about intersection and arrays, I got to thinking how this could apply elsewhere.

I decided to see if I could write a faster include?() method for Array using the intersection operator. So here is what I did:

1) Tested via irb (of course)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

[pullmonkey]$ irb
# build the array for conceptual testing
irb(main):003:0> a = [1,2,3]
=> [1, 2, 3]
# see how include?() does it
irb(main):004:0> a.include?(2)
=> true
irb(main):005:0> a.include?(4)
=> false
# see what could be done with intersect
irb(main):006:0> a & [2]
=> [2]
irb(main):007:0> (a & [2]).empty?
=> false
# this should return true, so negate it
irb(main):008:0> !(a & [2]).empty?
=> true
irb(main):009:0> !(a & [4]).empty?
=> false

2) Extended the Array class with the faster_include?() method:

1
2
3
4
5
6

class Array
  def faster_include?(n)
    !(self & [n]).empty?
  end
end

3) Wrote a performance test comparing elapsed time for true and false results:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

# size of the array
x = 4000000
# build the array
a = (1..x).to_a
p "Finding #{x/2} ..."
t1 = Time.now
p "include?() - returns true: #{a.include?(x/2)}"
t2 = Time.now
p "faster_include?() - returns true: #{a.faster_include?(x/2)}"
t3 = Time.now
p "include?() - returns false: #{a.include?(-50)}"
t4 = Time.now
p "faster_include?() - returns false: #{a.faster_include?(-50)}"
t5 = Time.now
p "Elapsed time for include?() returning true: #{t2 - t1}."
p "Elapsed time for faster_include?() returning true: #{t3 - t2}."
p "Elapsed time for include?() returning false: #{t4 - t3}."
p "Elapsed time for faster_include?() returning false: #{t5 - t4}."

4) Just to see if there was a difference I got results and here they are:

Attempt 1:

1
2
3
4
5
6
7
8
9
10

"Finding 2000000 ..."
"include?() - returns true: true"
"faster_include?() - returns true: true"
"include?() - returns false: false"
"faster_include?() - returns false: false"
"Elapsed time for include?() returning true: 0.308243."
"Elapsed time for faster_include?() returning true: 0.271465."
"Elapsed time for include?() returning false: 0.629375."
"Elapsed time for faster_include?() returning false: 0.242673."

Attempt 2:

1
2
3
4
5
6
7
8
9
10

"Finding 2000000 ..."
"include?() - returns true: true"
"faster_include?() - returns true: true"
"include?() - returns false: false"
"faster_include?() - returns false: false"
"Elapsed time for include?() returning true: 0.323652."
"Elapsed time for faster_include?() returning true: 0.272212."
"Elapsed time for include?() returning false: 0.594751."
"Elapsed time for faster_include?() returning false: 0.277882."

Note that faster_include?() is consistent in elapsed time for a result of true and a result of false. However, include?() takes much longer in both attempts to return false than it does to return true, well actually about twice as long. include?() goes through each element, and I am using the midway point of the array in all tests, so it makes sense that include?() would return true in half the time it would take to return false. Very interesting, so now that I see performance improvement with faster_include?(), I need to do a wider spread of tests and for more attempts:

5) Better testing:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

results = {}
n = 5
# size of the array
[1,50,500,5000,5000000].each do |x|
  a = (1..x).to_a
  n.times do |t|
    # build the array
    t1 = Time.now
    a.include?(x/2)
    t2 = Time.now
    a.faster_include?(x/2)
    t3 = Time.now
    a.include?(-50)
    t4 = Time.now
    a.faster_include?(-50)
    t5 = Time.now
    results[x]          ||= {}
    results[x]["true"]  ||= []
    results[x]["false"] ||= []
    results[x]["true"]  << (t2 - t1) - (t3 - t2)
    results[x]["false"] << (t4 - t3) - (t5 - t4)
  end
  detailed_results = results
  results[x] = {"true" => detailed_results[x]["true"].sum / n, "false" => detailed_results[x]["false"].sum / n}
end
pp results

My new results:

The results are based on 5 attempts for each array for both true and false for a total of 50 attempts.

1
2
3
4
5
6

{5000000=>{"true"=>0.0921126, "false"=>0.4374644},
 5000=>{"true"=>5.84e-05, "false"=>0.0004404},
 500=>{"true"=>1.4e-06, "false"=>4.16e-05},
 50=>{"true"=>-3.4e-06, "false"=>1.0e-06},
 1=>{"true"=>-4.8e-06, "false"=>-3.8e-06}}

Note that arrays with 50 items or less seem to perform better using include?(), but barely. On the other hand with an array of 5M items, you shave off half a second using faster_include?().



So there you have it, faster_include?() can keep its name :)



Filed under: Home, development, ruby Tags:

Sorry, comments for this entry are closed at this time.