Implement Word Count in Accumulo
It seems like the first MapReduce program everyone tries is a word count. The mapper tokenizes a piece of text and outputs a "1" for each token. The reducer then adds up the "1" values to produce the count for each word.
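For comparison, the classic approach can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps described above, not a real MapReduce job; the function names `mapper` and `reducer` and the sample sentence are made up for this sketch.

```python
import re
from collections import defaultdict

def mapper(text):
    """Tokenize the text and emit a (word, 1) pair for each token."""
    for token in re.findall(r"[A-Za-z']+", text.lower()):
        yield token, 1

def reducer(pairs):
    """Add up the 1 values emitted for each word."""
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical sample input, chosen only for illustration.
counts = reducer(mapper("the quick fox and the lazy dog and the cat"))
print(counts)
```

Every word becomes a key and every occurrence contributes a "1", which is exactly the shape of data the Accumulo example stores.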
Accumulo provides the same functionality without a single line of code: you attach a SummingCombiner iterator to a table. Below is a complete example.
In fact, this example is more powerful, because the same setup can be used to sum across any time dimension.
This example shows how to sum across days. First start the Accumulo shell, then follow these steps:
    > createtable --no-default-iterators wordtrack
    wordtrack> setiter -t wordtrack -p 10 -scan -minc -majc -class org.apache.accumulo.core.iterators.user.SummingCombiner
    SummingCombiner interprets Values as Longs and adds them together. A variety of encodings (variable length, fixed length, or string) are available
    ----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: true
    ----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.:
    ----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE|FALSE>:
    ----------> set SummingCombiner parameter type, <VARLEN|FIXEDLEN|STRING|fullClassName>: STRING
Insert records for a daily rollup.
    wordtrack> insert "Robert" "2011.Nov.12" "" 1
    wordtrack> insert "Robert" "2011.Nov.12" "" 1
    wordtrack> insert "Parker" "2011.Nov.12" "" 1
    wordtrack> insert "Parker" "2011.Nov.12" "" 1
    wordtrack> insert "Parker" "2011.Nov.12" "" 1
    wordtrack> insert "Parker" "2011.Nov.23" "" 1
    wordtrack> scan
    Parker 2011.Nov.12: [] 3
    Parker 2011.Nov.23: [] 1
    Robert 2011.Nov.12: [] 2
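To make the scan output less magical, here is a rough Python model of what the combiner does with these inserts. It is a simplification, not Accumulo's implementation: it ignores timestamps, visibility labels, and compactions, and the function name `summing_scan` is invented for this sketch. It only assumes that values with the same row, column family, and column qualifier are decoded as decimal strings (the STRING type chosen above) and added together.

```python
from collections import defaultdict

# Each tuple mirrors one shell insert: (row, column family, column qualifier, value).
inserts = [
    ("Robert", "2011.Nov.12", "", "1"),
    ("Robert", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.12", "", "1"),
    ("Parker", "2011.Nov.23", "", "1"),
]

def summing_scan(entries):
    """Sum the values that share a key and return the results in sorted key order."""
    totals = defaultdict(int)
    for row, family, qualifier, value in entries:
        # STRING encoding: the stored value is the decimal text of a number.
        totals[(row, family, qualifier)] += int(value)
    return [(key, str(total)) for key, total in sorted(totals.items())]

for (row, family, qualifier), value in summing_scan(inserts):
    print(f"{row} {family}:{qualifier} [] {value}")
```

The printed lines have the same keys and totals as the shell scan: three for Parker and two for Robert on Nov 12, and one for Parker on Nov 23.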
Get all counts for a given day:
    wordtrack> scan -c 2011.Nov.12
    Parker 2011.Nov.12: [] 3
    Robert 2011.Nov.12: [] 2
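Because the day lives in the column family, other time dimensions only need a different label. The sketch below shows one way that could look: each occurrence is recorded under a daily, a monthly, and a yearly label. Only the daily format ("2011.Nov.12") comes from the example above; the monthly and yearly formats, and the helper names `rollup_families` and `record`, are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Spelled out explicitly so the labels do not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def rollup_families(day):
    """Return column family labels for the daily, monthly, and yearly rollups.

    The monthly and yearly formats are hypothetical extensions of the daily one.
    """
    month = MONTHS[day.month - 1]
    return [f"{day.year}.{month}.{day.day:02d}", f"{day.year}.{month}", f"{day.year}"]

def record(totals, word, day):
    """Add 1 to the word's counter in every rollup, as one insert per label would."""
    for family in rollup_families(day):
        totals[(word, family)] += 1

totals = defaultdict(int)
for _ in range(3):
    record(totals, "Parker", date(2011, 11, 12))
record(totals, "Parker", date(2011, 11, 23))

print(totals[("Parker", "2011.Nov.12")])  # daily count
print(totals[("Parker", "2011.Nov")])     # monthly count
```

In the shell this would correspond to issuing one extra insert per label, and a scan with `-c` on the monthly label would then return the monthly totals.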
Let's talk about that "--no-default-iterators" parameter for a moment. Normally, Accumulo configures an iterator (the VersioningIterator) that keeps only one value, the one with the latest timestamp, for each unique row/column family/column qualifier combination. If you leave that iterator in place, your counters will essentially be reset to one each time a compaction is done.
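The difference between the two behaviors is easy to see in a toy model. The functions below are illustrative stand-ins, not Accumulo code: `latest_version` mimics keeping only the newest value for a key, and `summed` mimics the combiner adding the values up.

```python
def latest_version(values):
    """Keep only the newest value; the list is ordered oldest to newest."""
    return values[-1]

def summed(values):
    """Decode each value as a decimal string and return the total as a string."""
    return str(sum(int(v) for v in values))

# The three inserts for row "Parker", column family "2011.Nov.12".
parker_values = ["1", "1", "1"]

print(latest_version(parker_values))  # the counter collapses to a single "1"
print(summed(parker_values))          # the counter keeps the full total
```

With only the newest value retained, two of the three increments are discarded, which is why the table is created without the default iterator.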