Technologies Used In Developing Applications Using Apache Accumulo
I was recently talking about how people train themselves for Big Data projects. The technology stack is fairly daunting. Below are the technologies that I find helpful.
I'll add to the list as I remember more:
Systems Administration Technologies
System administrators are absolutely essential to successful projects. They ensure that software is installed and configured correctly. More importantly, they ensure repeatable builds and deployments. Oh, and they really need to understand the widely varying failure modes. Read the excellently written Aphyr blog to learn about some of them.
OpenStack - The technology consists of a series of interrelated projects that control pools of processing, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering its users to provision resources through a web interface.
Ganglia - (optional) A scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.
Java - The programming language for Accumulo. Actually, you can probably use most JVM-based languages.
Puppet - Puppet Open Source is a flexible, customizable framework, available under the Apache 2.0 license, designed to help system administrators automate the many repetitive tasks they regularly perform.
Bash - The command-line shell for Unix-based operating systems. While you learn about the Bash shell, make sure you also become proficient in a scripting language such as Perl, Ruby, or Python.
Jenkins - An application that monitors executions of repeated jobs, such as building a software project or jobs run by cron.
Gitorious - (optional) An infrastructure for hosting open source projects that use Git. It can be used inside your firewalls to provide secure access to Git repositories.
Jira - (optional) Issue-tracking software. If you're using the Agile methodology, make sure to get the Jira Agile version.
Application Developer Technologies
Vim - While you may prefer to use a nice graphical IDE like NetBeans, Eclipse or IntelliJ, you'll be totally lost if you don't understand Vim. It can easily open large files that utterly crush any IDE.
Ant - While knowledge of Ant is not required, I use it to run both Java and MapReduce jobs. Its ability to orchestrate multiple targets can prove valuable.
Git - A distributed version control system designed to handle everything from small to very large projects with speed and efficiency. This is the version control system used by Accumulo. It seems fairly safe to say that you need at least a basic knowledge of Git to excel.
Maven - A software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a project's build, reporting, and documentation from a central piece of information. Apache Accumulo uses Maven to compile, test, and build jar files.
Tomcat or Jetty - (optional) Web applications are the main way to interact with users, and these two web servers are good choices to develop with.
Hadoop - The Hadoop project develops open-source software for reliable, scalable, distributed computing. There are several distributions of the Hadoop stack (MapR, Cloudera, etc.) that you can use.
ZooKeeper - A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Accumulo depends on ZooKeeper.
Accumulo - A sorted, distributed key/value store that provides robust, scalable, high-performance data storage and retrieval.
Jenkins - An application that monitors executions of repeated jobs, such as building a software project or jobs run by cron.
Solr - Includes powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geo-spatial search. Solr is extremely useful when integrating between Big Data and applications. You can analyze the heck out of your data and then store the results in Solr.
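
If the "sorted key/value store" description of Accumulo in the list above feels abstract, here is a rough sketch of why sorting matters. It does not use the Accumulo API at all; it only uses a TreeMap from the Java standard library as a stand-in for a sorted table. The class name, the "row|family:qualifier" key layout, and the sample data are all made up for illustration. The point is simply that when keys are kept in sorted order, everything for one row sits together and can be fetched with a single range scan.

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustration only: a TreeMap standing in for a sorted key/value table.
// This is NOT the Accumulo client API.
public class SortedStoreSketch {

    // Hypothetical key layout for this sketch: "row|family:qualifier".
    static NavigableMap<String, String> sampleTable() {
        NavigableMap<String, String> table = new TreeMap<>();
        // Inserted out of order on purpose; the map keeps keys sorted.
        table.put("user_042|info:name", "alice");
        table.put("user_007|info:name", "bob");
        table.put("user_100|info:name", "carol");
        table.put("user_042|info:email", "alice@example.com");
        return table;
    }

    // Return every entry whose key starts with the given row ID.
    static SortedMap<String, String> scanRow(NavigableMap<String, String> table, String row) {
        String start = row + "|";
        // For ordinary text keys, this bound sorts after every key sharing the prefix.
        String end = start + Character.MAX_VALUE;
        return table.subMap(start, true, end, false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable();
        // Print all entries for one row, in key order.
        scanRow(table, "user_042").forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Designing row IDs so that the data you want to read together sorts together is a large part of working with a store like this, which is why it is worth getting comfortable with the idea before touching a real cluster.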