
11/23/2013: Reading Accumulo Metadata Table to Learn How Many Entries Are In Each Tablet.

After compacting the table, you can run the following program to learn how many entries are in each tablet. Accumulo does a nice job of splitting tables by byte size, but if you have small records then it's fairly easy to run into the "Curse of the Last Reducer!" I've run into situations where some tablets had 50K entries while others had 50M.
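If you want to force the compaction first, a minimal sketch from the Accumulo shell (tableA is this post's example table; -w waits for the compaction to finish):

    compact -t tableA -w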

package com.affy;

import java.io.IOException;
import java.io.InputStream;
import java.util.Map.Entry;
import java.util.Properties;
import org.apache.accumulo.core.Constants;
import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Instance;
import org.apache.accumulo.core.client.IsolatedScanner;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.impl.Tables;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.KeyExtent;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class GetEntryCountForTable {

    public static void main(String[] args) throws IOException, AccumuloException, AccumuloSecurityException, TableNotFoundException {

        String accumuloTable = "tableA";

        Properties prop = new Properties();
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        InputStream in = loader.getResourceAsStream("accumulo.properties");
        prop.load(in);

        String user = prop.getProperty("accumulo.user");
        String password = prop.getProperty("accumulo.password");
        String instanceInfo = prop.getProperty("accumulo.instance");
        String zookeepers = prop.getProperty("accumulo.zookeepers");

        Instance instance = new ZooKeeperInstance(instanceInfo, zookeepers);

        Connector connector = instance.getConnector(user, new PasswordToken(password));

        // Map the table name to its internal id; !METADATA rows are keyed by table id.
        String tableId = Tables.getNameToIdMap(instance).get(accumuloTable);

        // Scan this table's slice of the metadata table for its data file entries.
        // An IsolatedScanner avoids reading a tablet's row in a half-updated state.
        Scanner scanner = new IsolatedScanner(connector.createScanner(Constants.METADATA_TABLE_NAME, Constants.NO_AUTHS));
        scanner.fetchColumnFamily(Constants.METADATA_DATAFILE_COLUMN_FAMILY);
        scanner.setRange(new KeyExtent(new Text(tableId), null, null).toMetadataRange());

        long fileSize = 0;
        long numEntries = 0;
        int numTablets = 0;

        // Each metadata row represents one tablet, and a tablet can have several
        // file entries. A file entry's value has the form "size,numEntries".
        Text lastRow = null;
        for (Entry<Key, Value> entry : scanner) {
            String[] components = entry.getValue().toString().split(",");
            fileSize += Long.parseLong(components[0]);
            numEntries += Long.parseLong(components[1]);

            // Count tablets by counting distinct metadata rows, not file entries.
            Text row = entry.getKey().getRow();
            if (lastRow == null || !row.equals(lastRow)) {
                numTablets++;
                lastRow = new Text(row);
            }
        }

        long average = numTablets == 0 ? 0 : numEntries / numTablets;

        System.out.println(String.format("fileSize: %,d", fileSize));
        System.out.println(String.format("numEntries: %,d", numEntries));
        System.out.println(String.format("numTablets: %,d", numTablets));
        System.out.println(String.format("average entries per tablet: %,d", average));

    }
}
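The program reads its connection settings from an accumulo.properties file on the classpath. A minimal sketch; the values below are placeholders for your own cluster:

    accumulo.user=root
    accumulo.password=secret
    accumulo.instance=instance
    accumulo.zookeepers=affy-master:2181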


11/22/2013: How to Run Accumulo Continuous Testing (well ... some of them)

Accumulo comes with a lot of tests. This note is about the scripts in the test/system/continuous directory. The README is very descriptive, so there is no need for me to discuss what the tests do; I'm just doing a show and tell. After creating an Accumulo cluster, ssh to the master node and install Parallel SSH (pssh); a quick way to confirm pssh works is sketched after the steps below.
  1. Start an Accumulo cluster using https://github.com/medined/Accumulo_1_5_0_By_Vagrant
  2. vagrant ssh master
  3. cd ~/accumulo_home/software
  4. git clone http://code.google.com/p/parallel-ssh
  5. cd parallel-ssh
  6. sudo python setup.py install
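Before leaning on the continuous scripts, you can confirm that pssh reaches the slaves. A sketch, assuming a slaves.txt file listing one slave hostname per line:

    pssh -h slaves.txt -l vagrant -i hostname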
Now you can run the continuous programs. I've created the editable files, so you can just copy my versions (step 2 below). The start-ingest.sh script starts ingest processes on the slave nodes, which was not immediately obvious to me. Watch http://affy-master:50095/ to see the ingest rate; a command-line spot check is sketched after the steps below. When you've got enough entries, run stop-ingest.sh.
  1. cd ~/accumulo_home/bin/accumulo/test/system/continuous
  2. cp /vagrant/files/config/accumulo/continuous/* .
  3. ./start-ingest.sh
  4. ./stop-ingest.sh
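If you'd rather confirm from the command line that entries are landing, you can scan a few rows. A sketch, assuming the scripts are writing to their default table name (ci; check continuous-env.sh for your configured value) and root/secret credentials:

    ~/accumulo_home/bin/accumulo/bin/accumulo shell -u root -p secret -e "scan -t ci -np" | head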
The figure below shows the ingest rate with two nodes running on my MacBook Pro inside VirtualBox. My setup won't win any speed awards!


The next scripts we'll run are the walker scripts, which walk the entries produced by the ingest script. The output from the walkers is found on the slave nodes in the /home/vagrant/accumulo_home/bin/accumulo/test/system/continuous/logs directory. Watch http://affy-master:50095/ to see the scan rate; a way to follow a walker's log is sketched after the steps below.
  1. ./start-walkers.sh
  2. ./stop-walkers.sh
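To see what a walker is doing, tail its log on one of the slave nodes. A sketch; the exact log file names may differ on your setup:

    tail -f /home/vagrant/accumulo_home/bin/accumulo/test/system/continuous/logs/*walk*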
Below is an example of the scan rate.


And finally there is the verify script, which took about 15 minutes to run on my setup. You can visit http://affy-master:50030/jobtracker.jsp to see the job running.


  1. ./run-verify.sh
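run-verify.sh launches a MapReduce job, so you can also watch its progress from the command line instead of the JobTracker page. A sketch, assuming Hadoop 1 (which the port-50030 JobTracker implies):

    hadoop job -list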