Tuesday, July 28, 2009

Counting and Grouping Queries in Lucene

When you use a Lucene index to look up information, you get some querying facilities not found in other kinds of repositories. However, in a classic trade-off, you lose some features, such as the aggregate queries that relational databases perform easily.

Anyway, if you need this kind of operation, it can be implemented easily using hit collectors. So I've included two simple operations in lucis, counting and grouping results:

The LucisQuery object is used to decouple the index control policy (when to open and close it, etc.) from the queries themselves.
The counting query just needs the Lucene query to perform and the (optional) filter to apply. The result holds the number of documents found and the time the query took.
For the grouping query you must provide the list of field names you want to group by (in order). The query result is the same as that of the counting query plus the root group (the one corresponding to the first field name), where a group is something like (partial API shown, see the source):

So, for each collected value of the provided field you get a child group, which itself contains the groups representing the nested fields. The number of hits in a group may not equal the sum of the hits in its child groups if any of the fields is multivalued, because a document with several values for a field is counted once in the group but once in each matching child group.
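
To make the multivalued case concrete, here is a minimal sketch in plain Java. It is not the lucis API and uses no Lucene classes: the GroupSketch and Group names, the in-memory documents and the "tag" and "author" fields are all hypothetical, and a simple loop stands in for the hit collector. It only illustrates how a nested group tree could be counted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not the lucis API and not Lucene code.
class GroupSketch {

    /** A node in the group tree: a hit count plus one child per collected value. */
    static final class Group {
        int hits;
        final Map<String, Group> children = new TreeMap<>();
    }

    /**
     * Groups documents by the given fields, in order. Each document is modeled
     * as a map from field name to its (possibly multiple) values.
     */
    static Group group(List<Map<String, List<String>>> docs, List<String> fields) {
        Group root = new Group();
        for (Map<String, List<String>> doc : docs) {
            collect(root, doc, fields, 0); // stands in for a hit collector callback
        }
        return root;
    }

    // Counts the document in this group, then in the child group of every value
    // it has for the field at this depth (values assumed distinct per document).
    private static void collect(Group g, Map<String, List<String>> doc,
                                List<String> fields, int depth) {
        g.hits++;
        if (depth == fields.size()) {
            return;
        }
        List<String> values = doc.get(fields.get(depth));
        if (values == null) {
            return; // document has no value for this field
        }
        for (String value : values) {
            Group child = g.children.get(value);
            if (child == null) {
                child = new Group();
                g.children.put(value, child);
            }
            collect(child, doc, fields, depth + 1);
        }
    }

    // Builds a made-up document with one author and any number of tags.
    static Map<String, List<String>> doc(String author, String... tags) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("author", Arrays.asList(author));
        d.put("tag", Arrays.asList(tags));
        return d;
    }

    static List<Map<String, List<String>>> sampleDocs() {
        List<Map<String, List<String>>> docs = new ArrayList<>();
        docs.add(doc("ann", "lucene", "search")); // multivalued "tag" field
        docs.add(doc("ann", "lucene"));
        docs.add(doc("bob", "search"));
        return docs;
    }

    public static void main(String[] args) {
        Group root = group(sampleDocs(), Arrays.asList("tag", "author"));
        int childSum = 0;
        for (Group child : root.children.values()) {
            childSum += child.hits;
        }
        // prints "root hits = 3, sum of child hits = 4"
        System.out.println("root hits = " + root.hits + ", sum of child hits = " + childSum);
    }
}
```

The first document carries two tags, so it is counted once in the root group but once in each of the "lucene" and "search" child groups, which is why the child hits add up to more than the root's.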