# Python bitarray - 60x speedup for comparisons

Published on Dec 24, 2020 by Impaktor.

This is intended to show how to use bitarray/bitstrings in python, on a beginner friendly way, to speed up code, by using more efficient data representation

## Introduction — The problem to solve

Recently I found myself needing to speed up a bottle neck in my code, that was essentially querying a dictionary if it contained a specified key (datetime object), or specifically a function:

def is_workday(date): """True if weekday, false if weekend or holiday/red day. Input is Timestamp or DateTime object""" import holidays # Saturday and Sunday are index 5, 6 return date.weekday() < 5 and date not in holidays.USA(years=date.year)

where I am using the holidays package, which defines a dictionary of holidays for each year and country. I have 57 million dates that need to be checked, and this would take ~30 hours where the above code was the main bottle neck.

So is there a faster way to see if `date`

is in the dictionary? There are
several (I’ll mention some in the end), but we’ll look at bit representation
here.

## The solution — bitarray

### What is a bit-array and a bit-mask?

The reader is familiar that computers represents numbers as “ones and zeros”. E.g. the first numbers represented in binary as (here using 3 bits):

Decimal | Binary |
---|---|

0 | 000 |

1 | 001 |

2 | 010 |

3 | 011 |

4 | 100 |

5 | 101 |

6 | 110 |

7 | 111 |

But what if we instead interpret, for instance, “2” to literally mean “string of all zeros, where second-to-last bit is flipped to 1” to represent something other than the number “2”? This is quite common practice among C/C++ programming, but not something I’ve seen done much in the python community, thus this write up.

So instead of asking `date in dictionary_of_holidays`

, we represent a date
as a bitarray, and then use a bit-mask and bitwise AND operator.

E.g. to check if the second day in a year is on a weekend or not, could do something like the following:

01000000000000... AND 00000110000011... = 00000000000000...

where the top string has a `1`

at the second bit (where we now define that
we count positions starting from left), representing the second day, and we
then operate on this bitarray with a bitwise AND together with the
bit-mask, where every (two) 7th position is 1, for Saturday and Sunday
(assuming that year started on a Monday), and the resulting output has a
’1’ if true (i.e. “the second day is on a weekend”), and all zeros if false
(“the second day of the year is on a working day”).

Naturally, we want to also include holidays in the bit-mask string, and encode these in the right relative starting positions, which will be different each year.

### Implementation

Since every fourth year in our Gregorian calendar has an extra day, the number of bits in our bitarrays will be:

```
N_bits = 366
```

i.e. for the other three years, the last bit will be unused.

We will use the bitarrays (v. 1.6.1) package, and initialize the bitarray
representation of a date as an `N_bits`

long string with all zeros, and
then flip the correct bit to a `1`

. We define the left most bit to be
position 0, and represent the first day in the year.

There are two different ways to initialize a bit array with all zeros (as I understood it), one that takes 2µs (micro seconds):

a = N_bits * bitarray('0')

another one that is almost 20 times faster, with 115ns.

a = bitarray(N_bits).setall(False)

However, this last method did not work reliably for me, either due to bug, or me doing something silly/being tired, so I opted for the first method. Besides, initializing the bitarray is not the bottle neck, nor the interesting part.

I did not find any support for shifting bits `>>`

and `<<`

operators, which
might be the traditional way to shift a bit to the correct position, but I
can access the `n`

-th bit in the bitarray element using the `[n]`

accessor,
thus a function that converts a Timestamp or DateTime object to bitarray
representation is:

from bitarray import bitarray def date_to_bit(date): """ Convert DateTime or Timestamp to bitarray representation Parameters ---------- date : datetime Returns ------- bitarray object N_bits bits long (366), with one bit flipped """ a = N_bits * bitarray('0') # flip bit of year (dayofyear [1,356]): # dayofyear = date.dayofyear # only for Timestamp (fast) dayofyear = int(date.strftime('%j')) # for Timestamp & datetime a[dayofyear-1] = True return a

if `date`

is a Timestamp object we could use `date.dayofyear`

to find the
position in the array to flip to one, but I use `strftime('%j')`

instead,
as it works on both DateTime and Timestamp objects. These give January 1st
as day “one”, so need to offset by `-1`

, since python is 0-indexed (unlike
e.g. Lua or Fortran).

### Constructing the Bitmask

In the holiday package, every date that is a holiday is stored in a dict-like object:

In [1]: import holidays In [2]: holidays.USA(years=2019) Out[2]: {datetime.date(2019, 1, 1): "New Year's Day", datetime.date(2019, 1, 21): 'Martin Luther King Jr. Day', datetime.date(2019, 2, 18): "Washington's Birthday", datetime.date(2019, 5, 27): 'Memorial Day', datetime.date(2019, 7, 4): 'Independence Day', datetime.date(2019, 9, 2): 'Labor Day', datetime.date(2019, 10, 14): 'Columbus Day', datetime.date(2019, 11, 11): 'Veterans Day', datetime.date(2019, 11, 28): 'Thanksgiving', datetime.date(2019, 12, 25): 'Christmas Day'}

It varies from country to country if Sunday and Saturday are included, so we can flip every seven bit, assuming we know the starting position for the first Sunday and/or Saturday for the year we want to encode:

bit_mask = 366 * bitarray('0') # Initialize empty bitstring bit_mask[first_saturday::7] = True # Saturdays bit_mask[first_saturday+1::7] = True # Sundays

Results in

bitarray('011000001100000.....')

We can create many masks, i.e. one for Sundays, one for Saturdays, and one for the other holidays, and combine them using a bitwise OR operation. Below this is demonstrated for holidays found in the dictionary:

import holidays for key, val in holidays.USA(years=year).items(): bit_date = N_bits * bitarray('0') # Initialize empty bitmask of lenght N_bits dayinyear = int(key.strftime('%j')) # Find index of day in a year, in interval [1,366] bit_date[dayinyear - 1] = True # Flip the correct bit to '1' bit_mask = bit_mask | bit_date # OR the mask for that day, to the other

### Benchmark

Although I’m not showing the exact code I used to build my bitmask, the new implementation of the first code, shown in the Introduction, has more lines of code, but it will be worth it:

def is_workday(date): """True if weekday, false if red or weekend Parameters ---------- date : datetime A date time object / or timestamp Returns ------- bool True if it's (approximately) a normal working (bank) day """ bit_date = date_to_bit(date) # convert date to bitarray object global bit_masks # dictionary of bit-masks for every year year = date.year # get year, e.g. "2019" if year not in bit_masks: # if a new year, construct new bitmask: bit_masks[year] = build_bitmask(year) # Bitwise AND on date and mask, is there a '1' in the resulting bitstring? return not bool((bit_date & bit_masks[year]).count())

The bitmask for each year is only constructed once, and can then be stored
in a `bit_masks`

dict, keyed on year. Then our 57 million dates can just be
represented as 366 long bitarrays.

To benchmark, we select 10k rows from a our pandas dataframe, in iPython:

In [1251]: %time tmp.time.apply(lambda x: is_workday(x)) CPU times: user 3.46 s, sys: 8 µs, total: 3.46 s Wall time: 3.47 s

3.47 seconds for 10k lines, if `apply`

be the method of our choosing, which
could be debated.

After re-defining our `is_workday`

to use our new bitarrays implementation,
we get:

In [1253]: %time tmp.time.apply(lambda x: is_workday(x)) CPU times: user 62.4 ms, sys: 0 ns, total: 62.4 ms Wall time: 61.9 ms

61.9 ms, i.e. 56 times faster! As a bonus, we got to learn how to use efficient bitarrays, which can come in handy for one-hot encoding of sets or creating digital DNA for genetic evolution, or some other neat thing.

## Epilogue

In hindsight, for my problem, I could have used caching instead, as the number of dates to look-up vastly outnumber the number of unique dates to compute. I actually implemented this as well, so the bit array logic above only runs to populate a lookup dictionary.

Also, in this post I’ve simplified the actual the problem I faced, for brevity.