PlaceIQ has divided the entire USA into square regions of 100 × 100 meters. We have precise data in CouchDB docs which tells us whether a particular 100 × 100 meter region has a particular attribute. An attribute can be a gas station, a multiplex, a fire station or any similar entity. So we know exactly which provisions each region has.
Data scientists do various sorts of analysis on this data, and for this they need fast ways to query the BigData. In particular, we know that data scientists mostly query for 5 kinds of attributes, and they want us to somehow make the query response as fast as possible just for these 5 attributes. The data set is unimaginably HUGE, as I mentioned in my previous post –
Although I can't disclose many details about the project, to give you an estimate of how long a query takes, consider this: for one of the smaller benchmarks, we recently used 1 master node, 10 mapper nodes and 10 reducer nodes (small EC2 instances), and our program ran for 55 hours – which was a big improvement over the previous version, which took around a week to complete.
Since queries normally take days to execute, even the smallest optimizations have a large effect on the outcome. Using the most suitable data structures (decisions like using HashMaps instead of ArrayLists) and writing the most optimized code is absolutely essential.
We had already been optimizing the MapReduce programs deployed on Hadoop for querying the huge CouchDB database, but this solution needed an exceptionally fast way of checking whether one of those 5 specified attributes exists in a region or not.
In other words, we needed a solution where we could provide answers without having to look up CouchDB at all – and we could compromise slightly on accuracy. After a lot of brainstorming, we zeroed in on using a probabilistic data structure: the Bloom filter. (In case you wish to learn about Bloom filters, I have written a brief introduction below.)
We created a separate Bloom filter for each of the attributes. Then we parsed the entire Couch dataset once and made entries into the Bloom filters. Initially we started with one hash table in each Bloom filter, so there were a lot of collisions. As we increased the number of hash tables, the collisions reduced, and thus the chances of returning a false positive also reduced. We eventually kept 3 hash tables of equal length but with different hash functions, and benchmarking showed that this reduced the chance of error to less than 0.5%.
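As a rough sketch of this set-up (the class and method names, the 2^20-bit table size, and the three hash functions are all my own illustrative choices here, not the actual code), one Bloom filter per attribute could look like:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class AttributeFilters {

    // One filter = three equal-length bit tables, each with its own hash function.
    static class AttributeBloomFilter {
        static final int TABLE_SIZE = 1 << 20;  // assumed size, not the real one
        final BitSet[] tables = {
            new BitSet(TABLE_SIZE), new BitSet(TABLE_SIZE), new BitSet(TABLE_SIZE)
        };

        // Three cheap hash functions derived from String.hashCode (illustrative only).
        int hash(int table, String key) {
            int h = key.hashCode();
            if (table == 1) h = h * 31 + 7;
            if (table == 2) h = h * 131 + 1009;
            return Math.floorMod(h, TABLE_SIZE);
        }

        void add(String regionId) {
            for (int i = 0; i < 3; i++) tables[i].set(hash(i, regionId));
        }

        // True means "probably present"; false means "definitely absent".
        boolean mightContain(String regionId) {
            for (int i = 0; i < 3; i++)
                if (!tables[i].get(hash(i, regionId))) return false;
            return true;
        }
    }

    public static void main(String[] args) {
        // One separate filter per attribute, as described above.
        Map<String, AttributeBloomFilter> filters = new HashMap<>();
        filters.put("gas-station", new AttributeBloomFilter());
        filters.get("gas-station").add("region-1234");

        System.out.println(filters.get("gas-station").mightContain("region-1234")); // true
        System.out.println(filters.get("gas-station").mightContain("region-9999")); // false
    }
}
```

A region look-up now touches three in-memory bit tables instead of CouchDB, which is what makes the query path so fast.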
What really mattered was that the query response time was blazing fast – we had achieved our goal!
The rest of this post is about what I have learnt about Bloom filters. Read on in case you too are interested :)
A Bit of an Introduction to Bloom Filters #NotesToSelf
Bloom filters are probabilistic data structures built to handle specific use cases over huge datasets while keeping memory consumption to a minimum. A Bloom filter is created by parsing the dataset once; after the entire dataset has been parsed, the Bloom filter can be used to quickly query whether a particular data item is in the dataset or not.
An important thing to note is that Bloom filters can return false positives but never false negatives. That means if a query on a Bloom filter says an item doesn't exist in the dataset, we can be sure of that result; but if the query says the item exists in the dataset, there is a small chance that the element might not actually exist in the dataset.
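To see this asymmetry concretely, here is a toy filter of my own making – deliberately tiny (8 bits, one hash function) so that collisions are guaranteed; a real filter would be vastly larger:

```java
import java.util.BitSet;

public class FalsePositiveDemo {
    static final int SIZE = 8;                // absurdly small, on purpose
    static final BitSet bits = new BitSet(SIZE);

    static void add(String s)             { bits.set(Math.floorMod(s.hashCode(), SIZE)); }
    static boolean mightContain(String s) { return bits.get(Math.floorMod(s.hashCode(), SIZE)); }

    public static void main(String[] args) {
        // "a".."h" happen to fill all 8 bits of the table.
        for (String s : new String[]{"a", "b", "c", "d", "e", "f", "g", "h"}) add(s);

        // Every item we added is reported present -- never a false negative.
        System.out.println(mightContain("a"));          // true
        // An item we never added is also reported present -- a false positive.
        System.out.println(mightContain("not-added"));  // true
    }
}
```

Once the table saturates, every query answers "present", which is why a real filter is sized generously relative to the number of items it holds.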
A Bloom filter (in the variant we used) consists of a number of hash tables of fixed size. Initially, all the bits in all the tables are set to zero. Each word of the dataset is hashed individually into the Bloom filter, using either a single hash function or a separate hash function per table; the value returned by the hash is taken modulo the size of the hash table, and the bit at the resulting index is set to 1.
The data structure is probabilistic because there is a low but finite probability that two words will collide in every one of the hash tables, so a query might return true for a word even though it is not present in the original dataset. However, the chances of such errors are very low, and Bloom filters can be customized, configured and optimized to better suit a given dataset.
Many variations exist. One uses a single hash function for all the tables but gives the tables different lengths, so the modulo for each table comes out different. Another implementation uses multiple hash functions but just a single hash table, in which all the generated indexes are set.
Further reading –
Additionally, you can read more about Bloom filters at http://www.javamex.com/tutorials/collections/bloom_filter.shtml –
- we allocate m bits to represent the set data;
- we write a hash function that, instead of a single hash code, produces k hash codes for a given object;
- to add an object to the set, we derive bit indexes from all k hash codes and set those bits;
- to determine if an object is in the set, we again calculate the corresponding hash codes and bit indexes, and say that it is present if and only if all corresponding bits are set.
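Those four steps could be sketched as follows (the choice of m and k, and deriving the k hash codes by double hashing, are my own assumptions – the tutorial leaves the hash construction open):

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    final BitSet bits;
    final int m;  // number of bits allocated to represent the set
    final int k;  // hash codes produced per object

    SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // i-th bit index via double hashing: index_i = h1 + i*h2 (mod m).
    int index(Object o, int i) {
        int h1 = o.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;  // second hash, illustrative
        return Math.floorMod(h1 + i * h2, m);
    }

    // Adding an object sets the bits at all k indexes.
    void add(Object o) {
        for (int i = 0; i < k; i++) bits.set(index(o, i));
    }

    // An object is reported present iff all k of its bits are set.
    boolean mightContain(Object o) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(o, i))) return false;
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        filter.add("gas-station@region-42");
        System.out.println(filter.mightContain("gas-station@region-42")); // true
    }
}
```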