Todd introduces the audience to the world of huge datasets and what you can do with them, for example profiling users and customers.
Assumptions that the last 10 years have shown to be false (Hadoop has been built with these lessons in mind):
- Machines are reliable: Hadoop separates fault-tolerance logic from application logic
- Machines deserve identities: you put data in a cluster without caring which particular machine hosts it; Hadoop can swap machines in and out across the cluster
- Your analysis fits on one machine: Hadoop scales linearly with data size and analysis complexity
A typical Hadoop installation: 5 to 4000 commodity servers (8 cores, 24 GB RAM, 4 to 12 TB of disk each; 2-level network architecture, 20 to 40 nodes per rack)
The cluster nodes fall into two roles (see the configuration sketch below):
- master nodes: 1 NameNode (file system metadata) and 1 JobTracker
- slave nodes (1 to 4000): each runs a DataNode and a TaskTracker
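As a rough illustration (not from the talk), a client locates those two masters through its configuration. A minimal Java sketch, assuming Hadoop 0.20/1.x-era property names and placeholder host names:

    import org.apache.hadoop.conf.Configuration;

    public class ClusterClientConfig {
        public static Configuration clientConf() {
            // In practice these values come from core-site.xml and mapred-site.xml
            // on the client's classpath; the host names below are placeholders.
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020"); // NameNode: file system metadata
            conf.set("mapred.job.tracker", "jobtracker.example.com:8021");   // JobTracker: job scheduling
            return conf;
        }
    }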
To access the file system, you would not mount it (even if you could, with FUSE); instead you use the HDFS API (in Java).
HDFS writes data in 64 MB blocks, which get replicated across the nodes.
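A minimal sketch of that HDFS API (the file path and contents are made up; the cluster address is assumed to come from the configuration files on the classpath):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Reads core-site.xml from the classpath to find the NameNode.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits large files into 64 MB blocks
            // and replicates each block across several DataNodes.
            Path file = new Path("/user/demo/hello.txt");
            FSDataOutputStream out = fs.create(file);
            out.writeBytes("Hello HDFS\n");
            out.close();

            // Read it back through the same API.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
            fs.close();
        }
    }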
On top of HDFS you write 2 functions: map() and reduce(). They are run on the nodes containing the data, so there is no network overhead to fetch the input; the framework interprets the raw bytes as key/value pairs, and reduce() is then used to aggregate the values for each key.
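To make the map()/reduce() split concrete, here is the classic word-count example as a sketch against the Hadoop MapReduce Java API (input and output paths are passed on the command line; this is not code from the talk):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map() runs on the node that holds the input block; the framework hands it
        // the raw bytes already interpreted as (offset, line) key/value pairs.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // reduce() aggregates all the values emitted for a given key.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }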
Hadoop is not only map/reduce: with Hive you can also use SQL, and there are other tools on top of Hadoop, such as Pig (dataflow) or Sqoop (RDBMS compatibility).
Who uses Hadoop: Yahoo (>82 PB, >40,000 machines); Facebook (15 TB of data/day, 1200 machines); Twitter; etc.
Mozilla uses Hadoop to analyze crash data (when Firefox crashes, you send a report, and they collect and analyze the data).
Being written in Java brings Hadoop some good tooling (along with integration tools such as Apache Ivy, etc.) but also some bad things, such as JVM bugs and JNI libraries to add for non-standard, OS-specific features.
http://www.eclipsecon.org/2011/sessions/?page=sessions&id=2370