09/23/2013: Technologies Used In Developing Applications Using Apache Accumulo
I was recently talking about how people train themselves for Big Data projects. The technology stack is fairly daunting. Below are the technologies that I find helpful.
I'll add to the list as I remember more.
Systems Administration Technologies
System administrators are absolutely essential to successful projects. They ensure that software is installed and configured correctly. More importantly, they ensure repeatable builds and deployments. Oh, and they really need to understand the widely varying failure modes. Read the excellently written Aphyr blog to learn about some of them.
OpenStack - The technology consists of a series of interrelated projects that control pools of processing, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering its users to provision resources through a web interface.
Puppet - Puppet Open Source is a flexible, customizable framework, available under the Apache 2.0 license, designed to help system administrators automate the many repetitive tasks they regularly perform.
Ganglia - (optional) A scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Bash - The command-line shell for Unix-based operating systems. While you learn about the Bash shell, make sure you also become proficient in a scripting language such as Perl, Ruby, or Python.
Jenkins - An application that monitors executions of repeated jobs, such as building a software project or jobs run by cron.
Gitorious - (optional) An infrastructure for hosting open source projects that use Git. It can be used inside your firewalls to provide secure access to Git repositories.
Jira - (optional) Issue Tracker software. If you're using the Agile methodology, make sure to get the Jira Agile version.
Application Developer Technologies
Vim - While you may prefer to use a nice graphical IDE like NetBeans, Eclipse or IntelliJ, you'll be totally lost if you don't understand Vim. It can easily open large files that utterly crush any IDE.
Java - The programming language for Accumulo. Actually you can probably use most JVM-based languages.
Ant - While knowledge of Ant is not required, I use it to run both Java and MapReduce jobs. Its ability to orchestrate multiple targets can prove valuable.
Git - A distributed version control system designed to handle everything from small to very large projects with speed and efficiency.
This is the version control system used by Accumulo. It seems fairly safe to say that you need at least a basic knowledge of Git to excel.
Maven - A software project management and comprehension tool. Based on the concept of a project object model
(POM), Maven can manage a project's build, reporting and documentation from a central piece of information. Apache Accumulo uses Maven to compile, test and build jar files.
Tomcat or Jetty - (optional) Web applications are the main way to interact with users and these two web servers are good to develop with.
Hadoop - The Hadoop project develops open-source software for
reliable, scalable, distributed computing. There are several distributions of the Hadoop stack (MapR, Cloudera, etc.) that you can use.
ZooKeeper - A centralized service for maintaining configuration information, naming,
providing distributed synchronization, and providing group services.
All of these kinds of services are used in some form or another by
distributed applications. Accumulo depends on ZooKeeper.
Accumulo - A sorted, distributed key/value store that provides robust, scalable, high-performance
data storage and retrieval.
Jenkins - An application that monitors executions of repeated jobs, such as building a software project or jobs run by cron.
Solr - includes powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geo-spatial search. Solr is extremely useful when integrating between Big Data and applications. You can analyze the heck out of your data and then store the results in Solr.
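To make the "sorted, distributed key/value store" description of Accumulo in the list above a little more concrete, here is a minimal in-memory sketch of the "sorted" part. It does not use the Accumulo API at all; the class name `SortedKeyValueSketch`, the method names, and the sample keys are all made up for illustration. The only point is that keeping keys in sorted order makes range scans over a key prefix natural.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration only: an in-memory stand-in for a sorted key/value store.
// Nothing here is part of the Accumulo API.
public class SortedKeyValueSketch {

    // A TreeMap keeps its keys in sorted order, which is what makes range scans cheap.
    private final TreeMap<String, String> cells = new TreeMap<>();

    public void put(String key, String value) {
        cells.put(key, value);
    }

    // Return every entry whose key is >= startKey and < endKey, in key order.
    public SortedMap<String, String> scan(String startKey, String endKey) {
        return cells.subMap(startKey, true, endKey, false);
    }

    public static void main(String[] args) {
        SortedKeyValueSketch store = new SortedKeyValueSketch();
        // Insert in arbitrary order; the store keeps the keys sorted.
        store.put("person_003", "Carol");
        store.put("person_001", "Alice");
        store.put("order_001", "widget");
        store.put("person_002", "Bob");

        // Scan only the keys that start with "person_" ('~' sorts after digits and letters).
        for (Map.Entry<String, String> entry : store.scan("person_", "person_~").entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A real Accumulo table adds distribution across servers, column families, visibility labels, and timestamps on top of this idea, so treat the sketch as a mental model rather than a description of the implementation.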
05/29/2013: Using Apache Accumulo as the backing store for Apache Gora - a tutorial
Apache Gora (http://gora.apache.org/) provides an abstraction layer to work with various
data storage engines. In this tutorial, we'll see how to use Gora with Apache Accumulo
as the storage engine.
I like to start projects with the Maven `pom.xml` file. So here is mine. It's important to
use Accumulo 1.4.3 instead of the newly released 1.5.0 because of an API incompatibility.
Otherwise, the `pom.xml` file is straightforward.
    <project ...>
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.affy</groupId>
      <artifactId>pojos-in-accumulo</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <packaging>jar</packaging>
      <name>POJOs in Accumulo</name>
      <url>http://affy.com</url>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <!-- Dependency Versions -->
        <accumulo.version>1.4.3</accumulo.version>
        <gora.version>0.3</gora.version>
        <slf4j.version>1.7.5</slf4j.version>
        <!-- Maven Plugin Dependencies -->
        <maven-compiler-plugin.version>2.3.2</maven-compiler-plugin.version>
        <maven-jar-plugin.version>2.4</maven-jar-plugin.version>
        <maven-dependency-plugin.version>2.4</maven-dependency-plugin.version>
        <maven-clean-plugin.version>2.4.1</maven-clean-plugin.version>
      </properties>

      <dependencies>
        <dependency>
          <groupId>org.apache.accumulo</groupId>
          <artifactId>accumulo-core</artifactId>
          <version>${accumulo.version}</version>
          <type>jar</type>
        </dependency>
        <dependency>
          <groupId>org.apache.accumulo</groupId>
          <artifactId>accumulo-server</artifactId>
          <version>${accumulo.version}</version>
          <type>jar</type>
        </dependency>
        <dependency>
          <groupId>org.apache.gora</groupId>
          <artifactId>gora-core</artifactId>
          <version>${gora.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.gora</groupId>
          <artifactId>gora-accumulo</artifactId>
          <version>${gora.version}</version>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
          <version>${slf4j.version}</version>
        </dependency>
        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
          <version>${slf4j.version}</version>
        </dependency>
        <!-- TEST -->
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.8.2</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>
Now create a `src/main/resources/gora.properties` file configuring Gora by
specifying how to find Accumulo.
    gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore
    gora.datastore.accumulo.mock=true
    gora.datastore.accumulo.instance=instance
    gora.datastore.accumulo.zookeepers=localhost
    gora.datastore.accumulo.user=root
    gora.datastore.accumulo.password=
There are some important items to note. Firstly, we'll be using the MockInstance of
Accumulo so that you don't actually need to have it installed. Secondly, the password
needs to be blank if you are depending on Accumulo 1.4.3; change the password to
`secret` if using an earlier version.
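If you want to double-check those two settings before running anything, a small sanity check along these lines may help. This is only a sketch: Gora reads `gora.properties` itself, and the class `GoraPropertiesCheck` and its helper methods are hypothetical names of my own. The property keys are the ones from the file above, inlined as a string so the example runs without any files on disk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of Gora: parses the settings and reports the two
// values discussed above (the mock flag and the blank password).
public class GoraPropertiesCheck {

    static Properties load(Reader reader) throws IOException {
        Properties props = new Properties();
        props.load(reader);
        return props;
    }

    // A missing key is treated as "false" here; that is an assumption made for
    // this check, not necessarily Gora's own default.
    static boolean usesMockInstance(Properties props) {
        return Boolean.parseBoolean(props.getProperty("gora.datastore.accumulo.mock", "false"));
    }

    static boolean hasBlankPassword(Properties props) {
        return props.getProperty("gora.datastore.accumulo.password", "").isEmpty();
    }

    public static void main(String[] args) throws IOException {
        // Same content as the gora.properties file shown above.
        String text = String.join("\n",
                "gora.datastore.default=org.apache.gora.accumulo.store.AccumuloStore",
                "gora.datastore.accumulo.mock=true",
                "gora.datastore.accumulo.instance=instance",
                "gora.datastore.accumulo.zookeepers=localhost",
                "gora.datastore.accumulo.user=root",
                "gora.datastore.accumulo.password=");

        Properties props = load(new StringReader(text));
        System.out.println("mock instance:  " + usesMockInstance(props));
        System.out.println("blank password: " + hasBlankPassword(props));
    }
}
```

In a real project you would more likely pass a reader over the file in `src/main/resources` instead of an inline string; the parsing is the same either way.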
That's all it takes to configure Gora. Now let's create a JSON file to define a very
simple object - a Person with just a first name. Create `src/main/avro/person.json` with the
following:
    {
      "type": "record",
      "name": "Person",
      "namespace": "com.affy.generated",
      "fields": [
        {"name": "first", "type": "string"}
      ]
    }
This is the simplest object I could think of. Not very useful for real applications, but
great for a simple proof-of-concept project.
The JSON file needs to be compiled into a Java file with the Gora compiler. Hopefully, you
have installed Gora and put its `bin` directory onto your path. Run the following to
generate the Java code:
    gora goracompiler src/main/avro/person.json src/main/java
One last bit of setup is needed. Create a `src/main/resources/gora-accumulo-mapping.xml`
file with the following:
    <gora-orm>
      <class table="people" keyClass="java.lang.String" name="com.affy.generated.Person">
        <field name="first" family="f" qualifier="q" />
      </class>
    </gora-orm>
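To see at a glance which Accumulo column each field in the mapping points at, you can read the XML with the JDK's built-in DOM parser. The sketch below is not how Gora loads the mapping; `MappingInspector` and `fieldColumns` are hypothetical names, and the mapping is inlined as a string so the example is self-contained. It simply prints each field name next to its `family:qualifier` pair.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Gora: lists field -> family:qualifier pairs
// found in a gora-accumulo-mapping.xml document.
public class MappingInspector {

    static Map<String, String> fieldColumns(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // LinkedHashMap preserves the order in which fields appear in the file.
        Map<String, String> columns = new LinkedHashMap<>();
        NodeList fields = doc.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            columns.put(field.getAttribute("name"),
                    field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
        }
        return columns;
    }

    public static void main(String[] args) throws Exception {
        // Same content as the mapping file shown above.
        String xml = "<gora-orm>"
                + "<class table=\"people\" keyClass=\"java.lang.String\""
                + " name=\"com.affy.generated.Person\">"
                + "<field name=\"first\" family=\"f\" qualifier=\"q\" />"
                + "</class>"
                + "</gora-orm>";

        fieldColumns(xml).forEach((field, column) ->
                System.out.println(field + " -> " + column));
    }
}
```

For the mapping above this lists the single field `first` against column `f:q` in the `people` table.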
Finally we get to the fun part: actually writing a Java program to create, save, and
read a Person object. The code is straightforward, so I won't explain it, just show it. Create
a `src/main/java/com/affy/Create_Save_Read_Person_Driver.java` file like this:
    package com.affy;

    import com.affy.generated.Person;
    import org.apache.avro.util.Utf8;
    import org.apache.gora.store.DataStore;
    import org.apache.gora.store.DataStoreFactory;
    import org.apache.gora.util.GoraException;
    import org.apache.hadoop.conf.Configuration;

    public class Create_Save_Read_Person_Driver {

        private void process() throws GoraException {
            // Build the object to store.
            Person person = new Person();
            person.setFirst(new Utf8("David"));
            System.out.println("Person written: " + person);

            // Ask Gora for a store keyed by String that holds Person objects.
            DataStore<String, Person> datastore =
                    DataStoreFactory.getDataStore(String.class, Person.class, new Configuration());

            // Create the backing table on first use.
            if (!datastore.schemaExists()) {
                datastore.createSchema();
            }

            // Save the person under key "001", then read it back.
            datastore.put("001", person);
            Person p = datastore.get("001");
            System.out.println("Person read: " + p);
        }

        public static void main(String[] args) throws GoraException {
            Create_Save_Read_Person_Driver driver = new Create_Save_Read_Person_Driver();
            driver.process();
        }
    }
This program has this output:
    Person written: com.affy.generated.Person@20c { "first":"David" }
    Person read: com.affy.generated.Person@20c { "first":"David" }
Hopefully, I'll be able to post more complex examples in the future.