
04/24/2003: Lucene; Enhancement; The PrecedenceQuery Class.

As I was working with Lucene the other day, I imagined searching for documents about a terrorist (call him XXX) and wondered how the list of results should be ranked. So I created the PrecedenceQuery class so that I could ask for all documents containing XXX and a series of tokens such as /iraq/syria/israel/usa/united states.

A document should rank higher (be treated as more relevant) when it matches a token further to the right in the list. This type of search is applicable to many different situations. In the realm of computer applications, this technique could be used in configuration management so that global values are overridden by environment-specific values, which in turn can be overridden by machine-specific information. Here is the code for the class:

/*
 * Created on Apr 24, 2003
 *
 */
package com.affy.lucene;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/**
 * @author medined
 */
public class PrecedenceQuery extends Query {
	private Term term;
	private String delimiter;

	/**
	 * Accepts a Term whose value looks like
	 * \default\development\dmedinets. The idea
	 * is that the score of a located document
	 * is boosted higher for each successive token
	 * in the Term. The weight given to each token
	 * is double that of the previous token. I hope
	 * that this simplistic weighting scheme is ok.
	 */
	public PrecedenceQuery(Term term, String delimiter) {
		this.term = term;
		this.delimiter = delimiter;
	}

	public PrecedenceQuery(Term term) {
		this.term = term;
		this.delimiter = "/";
	}

	public Query rewrite(IndexReader reader) throws IOException {
		// Expand the delimited term into one optional TermQuery per token.
		BooleanQuery query = new BooleanQuery();
		StringTokenizer st = new StringTokenizer(this.term.text(), this.delimiter);
		int boost = 1;
		while (st.hasMoreTokens()) {
			String token = st.nextToken();
			TermQuery tq = new TermQuery(new Term(this.term.field(), token));
			tq.setBoost(boost);
			// required = false, prohibited = false: the clause is optional.
			query.add(tq, false, false);
			// Double the boost so each later token outweighs the previous one.
			boost = boost + boost;
		}
		return query;
	}

	public Query combine(Query[] queries) {
		return Query.mergeBooleanQueries(queries);
	}

	/** Prints a user-readable version of this query. */
	public String toString(String field) {
		StringBuffer buffer = new StringBuffer();
		if (!term.field().equals(field)) {
			buffer.append(term.field());
			buffer.append(":");
		}
		// No trailing '*': this query matches whole tokens, not prefixes.
		buffer.append(term.text());
		if (getBoost() != 1.0f) {
			buffer.append("^");
			buffer.append(Float.toString(getBoost()));
		}
		return buffer.toString();
	}

}
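
To make the weighting concrete, here is a minimal sketch of the doubling scheme on its own, without any Lucene classes. PrecedenceBoostSketch and boostsFor are hypothetical names used only for illustration; the loop is meant to mirror the one in rewrite() above, not to replace it.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper for illustration; not part of Lucene or of PrecedenceQuery.
class PrecedenceBoostSketch {

    // Splits the text on the delimiter characters and assigns boosts
    // 1, 2, 4, ... from left to right, as the rewrite() loop does.
    static Map<String, Integer> boostsFor(String text, String delimiter) {
        Map<String, Integer> boosts = new LinkedHashMap<>();
        StringTokenizer st = new StringTokenizer(text, delimiter);
        int boost = 1;
        while (st.hasMoreTokens()) {
            boosts.put(st.nextToken(), boost);
            boost = boost + boost;
        }
        return boosts;
    }

    public static void main(String[] args) {
        // Prints {default=1, development=2, dmedinets=4}
        System.out.println(boostsFor("default/development/dmedinets", "/"));
    }
}
```

In the configuration example, a document matching only the machine-specific token (boost 4) would outscore one matching both of the earlier tokens (1 + 2 = 3), assuming the boosts were the only factor. Real Lucene scores also depend on term frequency and other factors, so this is only a rough picture.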

04/23/2003: Lucene; Bug; TestPhrasePrefixQuery in 2003.04.21 Build Has Misleading Code?

The TestPhrasePrefixQuery looks like it is searching for "blueberry pi*" and it even seems to work at first glance. However, the test data is not extensive enough to show what is really happening.

The searching technique implemented in TestPhrasePrefixQuery will find not only "blueberry pie" but also "blueberry strudel" if both exist in the documents.

The reason is that the IndexReader.terms(Term termToMatch) method positions itself at the first term equal to or greater than termToMatch, and the enumeration it returns then yields *all* terms from that point in the index to the end.

One potential solution might be something like the following:

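String prefix = "pi";
// Regular expression for "starts with the prefix"; a bare "pi*" would only match p, pi, pii, ...
String pattern = prefix + ".*";
// Seek to the first term at or after the prefix.
TermEnum te = ir.terms(new Term("body", prefix));
// Stop at the end of the index, at a different field, or at the first non-matching term.
while (te.term() != null
        && te.term().field().equals("body")
        && te.term().text().matches(pattern)) {
    termsWithPrefix.add(te.term());
    if (!te.next()) {
        break;
    }
}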

Of course, the code above requires JDK 1.4 or later, because String.matches() was introduced in that release.
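
The seek-and-read-to-the-end behaviour can be imitated without Lucene by using a sorted set. The sketch below is only an analogy: TermSeekSketch and its methods are hypothetical names, and it ignores the fact that a real index orders terms by field first and then by text. It uses startsWith rather than a regular expression, which is a swapped-in alternative that also runs on older JDKs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary; not Lucene code.
class TermSeekSketch {

    // Mimics seeking to the first term >= start and reading to the end.
    static List<String> fromSeekPoint(TreeSet<String> terms, String start) {
        return new ArrayList<>(terms.tailSet(start, true));
    }

    // Same seek, but stops at the first term that lacks the prefix.
    static List<String> withPrefix(TreeSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList("blueberry", "pie", "pizza", "strudel", "tart"));
        // Prints [pie, pizza, strudel, tart]: everything after the seek point.
        System.out.println(fromSeekPoint(terms, "pi"));
        // Prints [pie, pizza]: only the terms that share the prefix.
        System.out.println(withPrefix(terms, "pi"));
    }
}
```

Because the terms are sorted, the loop can stop at the first term that does not share the prefix; nothing after that point can match.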

Comments?