10/02/2012: How To Run Multiple Instances of Accumulo on One Hadoop Cluster
On the Accumulo User mailing list, Kristopher K. asked:
I built 1.5 from source last night and wanted to try it out on my existing Hadoop cluster without overwriting my current 1.4 setup. Is there a way to specify the /accumulo directory in HDFS such that you can run multiple instances?
Eric N. replied:
From the monitoring user interface, follow the Documentation link, then Configuration, and note the first property:
instance.dfs.dir
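For instance, it could be set in conf/accumulo-site.xml like this (a sketch; the /accumulo15 path is just an illustrative choice for the second instance's directory):
<property>
<name>instance.dfs.dir</name>
<value>/accumulo15</value>
</property>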
You'll also need to change all the port numbers from their defaults. And there's a port number in conf/generic_logger.xml that points to the logging port on the monitor.
For example, here are some entries from my conf/accumulo-site.xml file:
<property>
<name>master.port.client</name>
<value>10010</value>
</property>
<property>
<name>tserver.port.client</name>
<value>10011</value>
</property>
<property>
<name>gc.port.client</name>
<value>10101</value>
</property>
<property>
<name>trace.port.client</name>
<value>10111</value>
</property>
<property>
<name>monitor.port.client</name>
<value>11111</value>
</property>
<property>
<name>monitor.port.log4j</name>
<value>1560</value>
</property>
And from conf/generic_logger.xml:
<!-- Send all logging data to a centralized logger -->
<appender name="N1" class="org.apache.log4j.net.SocketAppender">
<param name="remoteHost" value="${org.apache.accumulo.core.host.log}"/>
<param name="port" value="1560"/>
<param name="application" value="${org.apache.accumulo.core.application}:${org.apache.accumulo.core.ip.localhost.hostname}"/>
<param name="Threshold" value="WARN"/>
</appender>
10/02/2012: How Accumulo Compresses Keys and Values
From the Accumulo User mailing list, Keith T. said:
There are two levels of compression in Accumulo. First, redundant
parts of the key are not stored. If the row in a key is the same as
the previous row, then it's not stored again. The same is done for
columns and timestamps. After the relative encoding is done, a block
of key-value pairs is then compressed with gzip.
As data is read from an RFile, when the row of a key is the same as
the previous key, it will just point to the previous key's row. This is
carried forward over the wire. As keys are transferred, duplicate
fields in the key are not transferred.
General consensus seemed to favor double compression - compressing both at the application level (i.e., compress the values) and letting Accumulo compress as well (i.e., the relative encoding).
In support of double compression, Ameet K. said:
I've switched to double compression as per previous posts and
it's working nicely. I see about 10-15% more compression over just
application-level Value compression.
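The relative key encoding Keith T. describes can be illustrated with a toy sketch. This is not Accumulo's actual RFile code - the class, method, and marker names below are invented for illustration - but it shows the idea: a field equal to the corresponding field of the previous key is replaced by a "same as previous" marker instead of being stored again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy sketch of relative key encoding (invented names, not Accumulo's RFile code):
// fields that repeat the previous key's fields are stored as a marker.
public class RelativeEncodingSketch {

    static final String SAME = "<same>";

    // keys: each entry is {row, column}; repeated fields become SAME in the output.
    static List<String[]> encode(List<String[]> keys) {
        List<String[]> encoded = new ArrayList<>();
        String prevRow = null;
        String prevCol = null;
        for (String[] k : keys) {
            String row = k[0].equals(prevRow) ? SAME : k[0];
            String col = k[1].equals(prevCol) ? SAME : k[1];
            encoded.add(new String[] {row, col});
            prevRow = k[0];
            prevCol = k[1];
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<String[]> keys = Arrays.asList(
                new String[] {"row1", "colA"},
                new String[] {"row1", "colB"},
                new String[] {"row2", "colB"});
        for (String[] e : encode(keys)) {
            System.out.println(Arrays.toString(e));
        }
        // prints:
        // [row1, colA]
        // [<same>, colB]
        // [row2, <same>]
    }
}
```

In the real format the block of relatively-encoded key-value pairs is then further compressed with gzip, which is why compressing values at the application level still stacks with Accumulo's own compression.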