[Information about Hadoop Compiled from various sources]
This week I got a chance to work on the PlaceIQ project at Kuliza. The idea behind this project is really interesting. PlaceIQ has divided the entire United States into 100×100-meter grid cells and stores location-specific data for each cell in a HUGE matrix. For example, they store whether a particular cell contains a restaurant, a gas station, etc., and they have also somehow managed to gather interesting analytics for each cell, like the population density a cell is likely to see at a particular hour of the day, and the age group and sex of the majority of that population. Data scientists do a lot of number-crunching on this HUGE database and produce analysis which they sell to willing third parties.
For example, if I were to run a campaign on the streets of the USA for my product, I would consult the data scientists for the optimal time and place for it, and for the age group and sex it appeals to, so that I can get the maximum conversion through my campaign.
As you can imagine, the PlaceIQ database is HUGE, and for this BigData project Kuliza's task was to optimize query response time for the data scientists. We use the Hadoop infrastructure for distributed computing: we frame each problem as a MapReduce job, write the Mapper and Reducer code, and then optimize the program as much as possible.
Although I can't disclose many details about the project, to give you an estimate of how long a query takes, consider this: for one of the smaller benchmarks, we recently used 1 master node, 10 mapper nodes, and 10 reducer nodes (small EC2 instances), and our program ran for 55 hours, which was a big improvement over the previous version, which took around a week to complete.
Since queries normally take days to execute, even the smallest optimizations have a large effect on total runtime. Choosing the right data structure (decisions like using a HashMap instead of an ArrayList) and writing tight code is absolutely essential.
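To see why that HashMap-vs-ArrayList decision matters, consider membership checks: `ArrayList.contains` scans every element (O(n)), while `HashSet`/`HashMap` lookups hash the key (O(1) on average). A minimal plain-Java sketch with made-up grid IDs (not project code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupDemo {
    public static void main(String[] args) {
        // Hypothetical grid-cell IDs; the real dataset is vastly larger.
        List<String> gridList = new ArrayList<>();
        Set<String> gridSet = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            String id = "grid-" + i;
            gridList.add(id);
            gridSet.add(id);
        }

        // ArrayList.contains walks the list element by element: O(n) per lookup.
        long t0 = System.nanoTime();
        boolean inList = gridList.contains("grid-99999");
        long listNanos = System.nanoTime() - t0;

        // HashSet.contains hashes the key and jumps to its bucket: O(1) on average.
        long t1 = System.nanoTime();
        boolean inSet = gridSet.contains("grid-99999");
        long setNanos = System.nanoTime() - t1;

        System.out.println(inList + " " + inSet); // true true
        System.out.println("list ns: " + listNanos + ", set ns: " + setNanos);
    }
}
```

Multiply that per-lookup gap by billions of records per job and it is easy to see how a single data-structure choice shaves hours off a run.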
Bit of introduction to Hadoop #NotesToSelf
Hadoop is a large-scale distributed batch-processing infrastructure. Batch processing is the execution of a series of programs (jobs) on a computer without manual intervention.
Hadoop includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in your cluster to hold. This lets the problem be processed in parallel on all of the machines in the cluster, computing the output as efficiently as possible.
Hadoop is designed to handle hardware failure and data congestion issues very robustly.
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in.
The Hadoop Distributed File System (HDFS) splits large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not make any data unavailable. Even though the file chunks are replicated and distributed across several machines, they form a single namespace, so their contents are universally accessible.
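The split-and-replicate idea can be sketched as a toy model in plain Java (this is an illustration of the concept, not HDFS code; the 64 MB chunk size and 3x replication used below match HDFS defaults of that era, and the round-robin placement is a simplification, since real HDFS placement is rack-aware):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ChunkDemo {
    // Toy model: split a file into fixed-size chunks and assign each chunk
    // to `replication` distinct nodes.
    static Map<Integer, List<String>> placeChunks(int fileSizeMb, int chunkSizeMb,
                                                  List<String> nodes, int replication) {
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        int chunks = (fileSizeMb + chunkSizeMb - 1) / chunkSizeMb; // ceiling division
        for (int c = 0; c < chunks; c++) {
            List<String> replicas = new ArrayList<>();
            // Round-robin replica placement across the cluster's nodes.
            for (int r = 0; r < replication; r++) {
                replicas.add(nodes.get((c + r) % nodes.size()));
            }
            placement.put(c, replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        // A 200 MB file with 64 MB chunks and 3 replicas per chunk.
        Map<Integer, List<String>> placement = placeChunks(200, 64, nodes, 3);
        System.out.println(placement.size() + " chunks"); // 4 chunks
        System.out.println(placement.get(0));             // [node1, node2, node3]
    }
}
```

With every chunk living on three nodes, any single machine can die and every chunk still has two live copies, which is exactly the failure tolerance described above.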
Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named “MapReduce.”
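The canonical first MapReduce program is word count. A minimal plain-Java simulation of the model's three phases follows; a real Hadoop job would implement this with the framework's Mapper and Reducer classes, which are deliberately not used here so the sketch runs standalone:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountModel {
    public static void main(String[] args) {
        List<String> lines = List.of("hadoop maps and reduces", "hadoop scales");

        // Map phase: emit a (word, 1) pair for every word in every input line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }

        // Shuffle phase: group values by key (Hadoop's framework does this for you).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }

        // Reduce phase: sum the grouped counts for each word.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(counts); // {and=1, hadoop=2, maps=1, reduces=1, scales=1}
    }
}
```

The key constraint the model imposes is that the map and reduce functions are pure and independent per key, which is what lets Hadoop run thousands of them in parallel on different machines.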
The best resource I found for getting started was on Yahoo Developers Network –