@Zigo: Why I don't package Hadoop myself

A quick reply to Zigo’s post:

Well, I looked at the Bigtop efforts because I needed Hadoop packages. But they are not very useful. They have lots of issues (including empty packages, naming conflicts etc.).

I filed a few bugs, and I even uploaded my fixes to Github. Some of that went unnoticed, because Sean Owen of Cloudera decided to remove all Debian packaging from Spark. But in the end, even with these fixes, the resulting packages do not live up to Debian quality standards (not to say, they would outright violate policy).

If you wanted to package Hadoop properly, you should ditch Apache Bigtop, and instead use the existing best practises for packaging. Using any of the Bigtop work just makes your job harder, by pulling in additional dependencies like their modified Groovy.

But whatever you do, you will be stuck in .jar dependency hell. Whatever you look at, it pulls in another batch of dependencies, that all need to be properly packaged, too. Here is the dependency chain of Hadoop:

[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0:compile
[INFO] |  +- com.google.guava:guava:jar:11.0.2:compile
[INFO] |  +- org.mortbay.jetty:jetty:jar:6.1.26:compile
[INFO] |  +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO] |  +- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] |  |  \- asm:asm:jar:3.1:compile
[INFO] |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.4:compile
[INFO] |  +- commons-io:commons-io:jar:2.4:compile
[INFO] |  +- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO] |  +- commons-daemon:commons-daemon:jar:1.0.13:compile
[INFO] |  +- javax.servlet.jsp:jsp-api:jar:2.1:compile
[INFO] |  +- log4j:log4j:jar:1.2.17:compile
[INFO] |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] |  +- javax.servlet:servlet-api:jar:2.5:compile
[INFO] |  +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO] |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO] |  +- tomcat:jasper-runtime:jar:5.5.23:compile
[INFO] |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  +- io.netty:netty:jar:3.6.2.Final:compile
[INFO] |  +- xerces:xercesImpl:jar:2.9.1:compile
[INFO] |  |  \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO] |  \- org.htrace:htrace-core:jar:3.0.4:compile
[INFO] +- org.apache.hadoop:hadoop-auth:jar:2.6.0:compile
[INFO] |  +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.2.5:compile
[INFO] |  |  \- org.apache.httpcomponents:httpcore:jar:4.2.4:compile
[INFO] |  +- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:compile
[INFO] |  |  +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:compile
[INFO] |  |  +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:compile
[INFO] |  |  \- org.apache.directory.api:api-util:jar:1.0.0-M20:compile
[INFO] |  +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
[INFO] |  |  +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO] |  |  \- jline:jline:jar:0.9.94:compile
[INFO] |  \- org.apache.curator:curator-framework:jar:2.6.0:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:2.6.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:2.6.0:compile
[INFO] |  |  \- jdk.tools:jdk.tools:jar:1.6:system
[INFO] |  +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- commons-net:commons-net:jar:3.1:compile
[INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] |  |  |     +- javax.xml.stream:stax-api:jar:1.0-2:compile
[INFO] |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO] |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.9.0:compile
[INFO] |  |  \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:compile
[INFO] |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO] |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO] |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO] |  |  \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
[INFO] |  +- com.google.code.gson:gson:jar:2.2.4:compile
[INFO] |  +- com.jcraft:jsch:jar:0.1.42:compile
[INFO] |  +- org.apache.curator:curator-client:jar:2.6.0:compile
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.6.0:compile
[INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO] |     \- org.tukaani:xz:jar:1.0:compile
[INFO] +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
[INFO] |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] |  +- tomcat:jasper-compiler:jar:5.5.23:compile
[INFO] |  +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
[INFO] |  |  \- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
[INFO] |  +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
[INFO] |  |  \- ant:ant:jar:1.6.5:compile
[INFO] |  +- commons-el:commons-el:jar:1.0:compile
[INFO] |  +- hsqldb:hsqldb:jar:1.8.0.10:compile
[INFO] |  +- oro:oro:jar:2.0.8:compile
[INFO] |  \- org.eclipse.jdt:core:jar:3.1.1:compile

So the first step for packaging Hadoop would be to check which of these dependencies are not yet packaged in Debian… I guess 1/3 is not.

Maybe, we should just rip out some of these dependencies with a cluebat. For the stupid reason of making a webfrontend (which doesn’t provide a lot of functionality, and I doubt many people use it at all), Hadoop embeds not just one web server, but two: Jetty and Netty…

Things would also be easier if e.g. S3 support, htrace, the web frontend, and different data serializations were properly put into modules. Then you could postpose S3 support, for example.

As I said, the deeper you dig, the crazier it gets.

If the OpenDataPlatform efforts of Hortonworks, Pivotal and IBM were anything but a marketing gag, they would try to address these technical issues. Instead, they make things worse by specifying yet another fatter core, including Ambari, Apaches attempt to automatically make a mess out of your servers - essentially, they are now adding the ultimate root shell, for all those cases where unaudited puppet commands and “curl | sudo bash” was not bad enough:

Example:
  command1 = as_sudo(["cat,"/etc/passwd"]) + " | grep user"

(from the Ambari python documentation)

The closer you look, the more you want to rather die than use this.

P.S. I have updated the libtrove3-java package (Java collections for primitive types; but no longer the fastest such library), so that it is now in the local maven repository (/usr/share/maven-repo) and that it can be rebuilt reproducible (the build user name is no longer in the jar manifest).