|
|
Todd introduced the audience to the world of huge datasets and what you can do with them, for example profiling users and customers.
False assumptions identified over the last 10 years (Hadoop has been built with these in mind):
- "Machines are reliable": Hadoop separates fault-tolerance logic from application logic
- "Machines deserve identities": I put data in a cluster and I don't care which particular machine hosts it; Hadoop can swap machines in and out across the cluster
- "Your analysis fits on one machine": Hadoop scales linearly with data size or analysis complexity
A typical Hadoop installation: 5 to 4000 commodity servers (8 cores, 24 GB RAM, 4 to 12 TB of disk; two-level network architecture, 20 to 40 nodes per rack)
The cluster is composed of:
- master nodes: 1 NameNode (metadata) and 1 JobTracker
- slave nodes (1 to 4000), each running a DataNode and a TaskTracker
To access the file system, you would not mount it (even though you could, with FUSE); you use an API, the HDFS API (in Java).
HDFS writes files in chunks of 64 MB, which are replicated across the nodes.
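To make the chunking concrete, here is a minimal back-of-the-envelope sketch. The 64 MB chunk size comes from the talk; the replication factor of 3 is an assumption (it is the usual HDFS default, but it was not quoted in the session), and the function name is made up for this illustration.

```python
import math

BLOCK_SIZE_MB = 64       # chunk size quoted in the talk
REPLICATION_FACTOR = 3   # assumption: usual HDFS default, not stated in the talk


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (number of chunks, raw storage in MB) for one file."""
    # A file is cut into fixed-size chunks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # A partial chunk only occupies the bytes it holds, so the raw usage
    # is simply the file size multiplied by the number of copies.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)  # 16 chunks, 3000 MB of raw disk
```

So a 1000 MB file ends up as 16 chunks, and with three copies of each chunk it consumes roughly 3000 MB of raw disk across the cluster.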
On top of HDFS, you write 2 functions: map() and reduce(). Hadoop runs them on the nodes that hold the data, so there is no network overhead to fetch the input. map() interprets the raw bytes as key/value pairs; reduce() is then used to aggregate the values for each key.
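The map/reduce idea can be sketched without a cluster. The following is a plain-Python simulation of a word count, not Hadoop's Java API: the function names are invented for this illustration, and the grouping step in the middle stands in for what the framework does between the two phases.

```python
from collections import defaultdict


def map_words(line):
    """map(): emit a (word, 1) pair for every word of one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """reduce(): aggregate all the values emitted for one key."""
    return key, sum(values)


def word_count(lines):
    # Run map() on every line, then reduce() on every group of values.
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


print(word_count(["Hadoop scales", "hadoop replicates data"]))
# {'hadoop': 2, 'scales': 1, 'replicates': 1, 'data': 1}
```

On a real cluster the map calls would run in parallel on the machines holding each chunk, and only the much smaller (key, value) pairs would travel over the network to the reducers.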
Hadoop is not only map/reduce: with Hive, you can also use SQL, and there are other tools on top of Hadoop such as Pig (dataflow) or Sqoop (RDBMS compatibility).
Who uses Hadoop: Yahoo (>82 PB, >40k machines); Facebook (15 TB of data per day, 1200 machines); Twitter, etc.
Mozilla uses Hadoop to analyze crash data (when Firefox crashes, you send a report, and they collect and analyze the data).
Being written in Java brings Hadoop some good tooling (along with integration tools such as Apache, Ivy, etc.) but also some drawbacks, such as JVM bugs and JNI libraries that must be added for non-standard, OS-specific features.
http://www.eclipsecon.org/2011/sessions/?page=sessions&id=2370
Michael began his talk by reminding us that Eclipse configuration lives everywhere:
- eclipse.ini
- configuration/eclipse.ini
- /home/.eclipse
- .metadata/
- $project/.settings/
- runtime option -vmargs …
Michael recommended configuring as many preferences as possible at project level.
You can manage team preferences by documenting them in a wiki, but that is tedious for the user and very hard to maintain.
Or you can use Eclipse EPF files (Eclipse Preference Files), which you import manually using the import wizard, but this has the same drawback as wiki documentation: people forget about it.
Better than that, you can manage EPF files with one of these 3 tools:
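An EPF file is plain text with one key=value preference per line, which is what makes it easy to share and to compare. As a rough sketch of the check a team tool could perform, the snippet below parses two EPF texts and lists the team preferences a workspace lacks or overrides. The helper names and the sample keys are illustrative only, and the parser is a simplification that ignores the escaping rules of Java properties files.

```python
def parse_epf(text):
    """Parse EPF text into {key: value}, skipping comments and blank lines."""
    prefs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            prefs[key.strip()] = value.strip()
    return prefs


def missing_or_different(team, local):
    """Return the team preferences that the local workspace lacks or overrides."""
    return {k: v for k, v in team.items() if local.get(k) != v}


# Hypothetical sample content; the keys are only examples.
TEAM_EPF = """\
#Team preferences
file_export_version=3.0
/instance/org.eclipse.core.resources/encoding=UTF-8
/instance/org.eclipse.ui.editors/spacesForTabs=true
"""

LOCAL_EPF = """\
file_export_version=3.0
/instance/org.eclipse.core.resources/encoding=ISO-8859-1
"""

for key, value in missing_or_different(parse_epf(TEAM_EPF), parse_epf(LOCAL_EPF)).items():
    print(f"{key} should be {value}")
```

With the sample data, the encoding and the tab setting are reported, which is the kind of reminder a manual wiki page cannot give.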
- Eclipse Team Etceteras (ETE) is a plugin that can do automatic and manual EPF import over HTTP, copy preferences between workspaces, and tell users who have not imported the EPF to download it
- Workspace Mechanic is a task-oriented configuration engine (using Groovy, Java, or other); the catch is that it is file-system based, with no possibility to transfer preferences over HTTP
- Bug 334016 (Common preferences) is an automatic EPF import over HTTP that does not ask users whether they want it
Which one to choose?
- if you just need to import over HTTP, use ETE
- if EPF files are not enough for you, use Workspace Mechanic
- if you need enforcement, use Common preferences
Then we had a demo of ETE: when Eclipse started in a new workspace, a dialog suggested importing the preferences; then, when changing workspace, a dialog offered to copy settings and preferences.
http://www.eclipsecon.org/2011/sessions/?page=sessions&id=2057
Bernhard and Frederic began their talk by defining software architecture and architectural erosion, which means that your system becomes deprecated and overloaded.
Then Bernhard took FindBugs as an example:
- back in March 2004, version 0.7.2 had only 4 packages, very simple
- a few months later, new features had been added (annotations for example), still looking good
- in May 2005, a first cyclic dependency appeared; « temporary hack » shows up in the svn log
- in June 2006, version 1.0.0 had many packages and a new cyclic dependency
- in 2009: many, many tangles
Can you still easily maintain this project?
He followed his explanation with a tool called Sotoarc to show the dependencies between classes and packages; he also mentioned the PDE dependency visualization.
To check your dependencies, you can perform an architectural inspection with dedicated architecture tools.
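The core of such an inspection, spotting a cyclic dependency between packages, can be sketched in a few lines. This is a generic depth-first search over a hand-written dependency map, not how Sotoarc or PDE work internally, and the package names are hypothetical.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of packages, or None."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, done
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for target in deps.get(node, ()):
            state = color.get(target, WHITE)
            if state == GREY:
                # target is already on the current path: we closed a loop.
                return path[path.index(target):] + [target]
            if state == WHITE:
                cycle = visit(target)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical package graph: "package -> packages it depends on".
deps = {
    "app.ui": ["app.core"],
    "app.core": ["app.util"],
    "app.util": ["app.ui"],   # the "temporary hack" that closes the loop
}
print(find_cycle(deps))  # ['app.ui', 'app.core', 'app.util', 'app.ui']
```

Run as part of a build, a check like this would flag the first cyclic dependency the day it is introduced rather than years later.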
Then Frederic explained why architectural quality is critical for an open source project: it is used by a great many consumers.
He then enumerated the risks of erosion in FOSS:
- contributors from several organizations (different processes)
- lower pressure from management
- uncertain funding
What about the Eclipse project then? Bernhard analyzed the architecture of JDT and found out that the org.eclipse.ant.ui plugin uses JDT, a dependency that could have been avoided.
Also, code duplication can cause maintenance problems; he demonstrated this with a DialogField class in Eclipse, duplicated since 2.0.
Frederic presented the recommendations of the Architecture Council.
Also, some new Eclipse projects, such as Orion, the web-based IDE, introduce new architectures.
To migrate your plugins from Eclipse 3.x to e4, you have to go through an architectural modernization process which consists of auditing, testing, etc.
To do that, he suggested using models to represent and manipulate the artifacts of existing projects, using MoDisco, so that you get the model of a plugin (sources folder, plugin.xml, etc.) in EMF.
Once you have the model represented, you can migrate it to the e4 model, for example.
You need to analyze and then modernize your tools.
http://www.eclipsecon.org/2011/sessions/?page=sessions&id=2001
|
|