We have already seen the heapq module, which makes it really simple to always get the smallest item from a list, and therefore makes it easy to sort a list of objects. While the heapq module appends items to form a tree-like structure, the bisect module inserts items in such a way that the list stays sorted. A big difference is that adding and removing items is very cheap with the heapq module, whereas finding items is very cheap with the bisect module. If your primary purpose is searching, then bisect should be your choice.
As is the case with heapq, bisect does not really create a special data structure. It just works on a standard list and expects that list to always be sorted. It is important to understand the performance implications of this; simply adding items to the list using the bisect algorithm can be very slow because an insert on a list takes O(n). Effectively, creating a sorted list using bisect takes O(n*n), which is quite slow, especially because creating the same sorted list using heapq or sorted takes O(n * log(n)) instead.
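To make the comparison concrete, here is a minimal sketch that builds the same sorted list in the three ways mentioned above. The sample values and variable names are only illustrative, and the complexities in the comments are the ones stated in the text, not measurements.

```python
import bisect
import heapq

values = [5, 3, 1, 2]

# bisect: n inserts of O(n) each, so O(n * n) in total.
with_bisect = []
for value in values:
    bisect.insort(with_bisect, value)

# heapq: n pushes and n pops of O(log(n)) each, so O(n * log(n)) in total.
heap = []
for value in values:
    heapq.heappush(heap, value)
with_heapq = [heapq.heappop(heap) for _ in range(len(heap))]

# sorted: a single O(n * log(n)) sort.
with_sorted = sorted(values)

print(with_bisect, with_heapq, with_sorted)
```

All three approaches end up with the same list; only the amount of work needed to get there differs.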
The log(n) refers to the base 2 logarithm function. To calculate this value, the math.log2() function can be used. This results in an increase of 1 every time the number doubles in size. For n=2, the value of log(n) is 1, and consequently for n=4 and n=8, the log values are 2 and 3, respectively.
This means that 2**32 = 4294967296, the number of distinct values that fit in a 32-bit integer, has a log of 32.
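The values above can be checked with a few lines of Python. This is just a small sketch using math.log2(); the sample numbers are the ones mentioned in the text.

```python
import math

# Every time n doubles, log2(n) increases by exactly 1.
log_values = {n: math.log2(n) for n in (2, 4, 8, 2 ** 32)}

for n, value in log_values.items():
    print(f'log2({n}) = {value}')
```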
If you have a sorted structure and you only need to add a single item, then the bisect algorithm can be used for insertion. Otherwise, it's generally faster to simply append the items and call a .sort() afterwards.
To illustrate, we have these lines:
>>> import bisect
Using the regular sort:
>>> sorted_list = []
>>> sorted_list.append(5)  # O(1)
>>> sorted_list.append(3)  # O(1)
>>> sorted_list.append(1)  # O(1)
>>> sorted_list.append(2)  # O(1)
>>> sorted_list.sort()  # O(n * log(n)) = O(4 * log(4)) = O(8)
>>> sorted_list
[1, 2, 3, 5]
Using bisect:
>>> sorted_list = []
>>> bisect.insort(sorted_list, 5)  # O(n) = O(1)
>>> bisect.insort(sorted_list, 3)  # O(n) = O(2)
>>> bisect.insort(sorted_list, 1)  # O(n) = O(3)
>>> bisect.insort(sorted_list, 2)  # O(n) = O(4)
>>> sorted_list
[1, 2, 3, 5]
For a small number of items, the difference is negligible, but it grows quickly as the list gets larger. For n=4, the comparison is 4 * 1 + 8 = 12 steps for appending and sorting versus 1 + 2 + 3 + 4 = 10 steps for bisect, making the bisect solution slightly faster. But if we were to insert 1,000 items, it would be 1000 + 1000 * log(1000) = 10966 versus 1 + 2 + … + 1000 = 1000 * (1000 + 1) / 2 = 500500. So, be very careful when inserting many items.
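The arithmetic above can be reproduced with a short sketch. Note that this is only the rough step-count model used in the text, not a benchmark, and the function names sort_steps and insort_steps are hypothetical helpers introduced for this illustration.

```python
import math


def sort_steps(n):
    # n appends of O(1) each, plus one sort of roughly n * log2(n) steps.
    return n + n * math.log2(n)


def insort_steps(n):
    # The i-th insert costs up to i steps: 1 + 2 + ... + n.
    return n * (n + 1) // 2


for n in (4, 1000, 1000000):
    print(n, round(sort_steps(n)), insort_steps(n))
```

For a million items, the model already predicts a difference of several orders of magnitude in favor of appending and sorting.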
Searching within the list is very fast though; because it is sorted, we can use a very simple binary search algorithm. For example, what if we want to check whether a few numbers exist within the list?
>>> import bisect
>>> def contains(sorted_list, value):
...     i = bisect.bisect_left(sorted_list, value)
...     return i < len(sorted_list) and sorted_list[i] == value

>>> contains(sorted_list, 2)
True
>>> contains(sorted_list, 4)
False
>>> contains(sorted_list, 6)
False
As you can see, the bisect_left function finds the position at which the number is supposed to be. This is actually what the insort function does as well; it inserts the number at the correct position by searching for the location of the number.
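The two steps that insort combines can be written out by hand. The following is a minimal sketch, not the module's actual source. One detail to be aware of: bisect.insort is an alias for insort_right, so the matching search function is bisect_right; for a value that is not yet in the list, bisect_left and bisect_right return the same position.

```python
import bisect

numbers = [1, 2, 3, 5]

# Step 1: binary search for the insertion point, O(log(n)).
position = bisect.bisect_right(numbers, 4)

# Step 2: the list insert itself, which has to shift later items, O(n).
manual = list(numbers)
manual.insert(position, 4)

# bisect.insort performs both steps in a single call.
automatic = list(numbers)
bisect.insort(automatic, 4)

print(position, manual, automatic)
```

This also shows where the cost comes from: the search is cheap, but the insert into the underlying list is not.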
So how is this different from a regular value in sorted_list check? The biggest difference is that bisect does a binary search internally, which means that it starts in the middle and jumps left or right depending on whether the value we are looking for is smaller or bigger than the item at that position. To illustrate, we will search for 4 in a list of numbers from 0 to 14:
sorted_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Step 1: 4 > 7                       ^
Step 2: 4 > 3           ^
Step 3: 4 > 5                 ^
Step 4: 4 > 4              ^
As you can see, after only four steps (actually three; the fourth is just for illustration), we have found the number we searched for. Depending on the number (7, for example), it may go faster, but it will never take more than O(log(n)) steps to find a number.
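To follow these steps in code, here is a pure-Python sketch of a binary search that records every item it looks at. It is written for illustration only: trace_bisect_left is a hypothetical helper, not part of the bisect module, although it is meant to return the same position as bisect.bisect_left.

```python
import bisect


def trace_bisect_left(sorted_list, value):
    """Return the insertion point and the list of items that were probed."""
    low, high = 0, len(sorted_list)
    probed = []
    while low < high:
        middle = (low + high) // 2
        probed.append(sorted_list[middle])
        if sorted_list[middle] < value:
            # The value must be to the right of the middle item.
            low = middle + 1
        else:
            # The value is at the middle item or to the left of it.
            high = middle
    return low, probed


numbers = list(range(15))
index, probed = trace_bisect_left(numbers, 4)
print(index, probed)  # 4 [7, 3, 5, 4]
print(index == bisect.bisect_left(numbers, 4))  # True
```

Note that this variant does not stop early when it happens to hit the value; it keeps halving the range until a single position is left, which is why searching for 4 looks at four items.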
With a regular list, a search would simply walk through all items until it finds the desired item. If you're lucky, it could be the first item you encounter, but if you're unlucky, it could be the last one. In the case of 1,000 items, that would be the difference between 1,000 steps and roughly log(1000) = 10 steps.
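The difference in the number of steps can be made visible by counting comparisons. The sketch below uses a hypothetical Counted wrapper class, introduced only for this illustration, that increments a counter every time one of its instances is compared.

```python
import bisect


class Counted:
    """Hypothetical wrapper that counts every comparison made on it."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return self.value == other.value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


items = [Counted(i) for i in range(1000)]
target = Counted(999)  # worst case for a linear search: the last item

# Linear search: `in` walks through the list from the start.
Counted.comparisons = 0
found_linear = target in items
linear_comparisons = Counted.comparisons

# Binary search: bisect_left halves the search range on every step.
Counted.comparisons = 0
index = bisect.bisect_left(items, target)
binary_comparisons = Counted.comparisons

print(linear_comparisons, binary_comparisons)
```

The linear search needs 1,000 comparisons to reach the last item, whereas the binary search needs no more than 10.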
Summary
Python has quite a few very useful collections built in. Since more and more collections are added regularly, the best thing to do is simply keep track of the collections manual. And do you ever wonder how or why any of the structures works? Just look at the source here:
https://hg.python.org/cpython/file/default/Lib/collections/__init__.py
After finishing this chapter, you should be aware of both the core collections and the most important collections from the collections module, but more importantly the performance characteristics of these collections in several scenarios. Selecting the correct data structure within your applications is by far the most important performance factor that your code will ever experience, making this essential knowledge for any programmer.
Next, we will continue with functional programming, which covers lambda functions, list comprehensions, dict comprehensions, set comprehensions, and an array of related topics. This includes some background information on the mathematics involved, which could be interesting but can safely be skipped.