Lucene is one of the most popular open source search libraries. Its scalability, robustness and versatility have led to entire enterprise search servers being built around it; Solr and Elasticsearch come immediately to mind. Twitter, with its enormous data volumes and scalability requirements, runs a search architecture built around a customized version of Lucene. Lucene offers excellent real-world functionality such as hit highlighting, spell checking, tokenizing and analyzing, but one of its most powerful and frequently used features is boosting. Most well-designed websites today offer some degree of search functionality, ranging from searching plain text content within the site to finding specific content hidden inside binary documents. Lucene, in conjunction with many other plugins and tools, plays a big part in this.

 

So what is boosting anyway?

 

One of the real-world features mentioned in the paragraph above is boosting, and you might already have experienced it without realizing while searching on some website. A good example is Google itself, where some search results are shown at the top (or in a place which catches your attention) because they come from a sponsored source. In essence, Google has boosted the sponsored result to bring it to prominence. A well-designed search interface should be able to adapt to user input and modify the search results accordingly, say when drilling down or when choosing a first among equals. This is where boosting can play a big part, so it is important to understand it thoroughly. Given the complicated inner workings of boosting in Lucene, it is better to approach the topic in phases. With that in mind, I present the first of a three-part series on what boosting is and how it works. A quick glance at what the series has to offer,

1. Part 1 (current article): What is boosting? The different types of boosting and a quick look into some of the underlying concepts like scoring and norms.

2. Part 2: A deeper look into scoring with special focus on customizing the scoring to our needs. This part will be further broken up into individual pieces covering topics such as custom query implementation, custom score providers, scoring using expressions etc.

3. Part 3: Lucene by default uses a combination of the TF-IDF vector space and Boolean models for scoring. There are many other models apart from the default one, and they will be looked into in this part. This part will complement and drill deeper into the areas covered in part two.

So let’s get started with part 1, but first a quick look at the prerequisites and the code that comes along with this article.

 

Prerequisites:
1. It is expected that the reader is aware of the basic concepts of Lucene like Document, Indexing and Analyzing, tokens, terms and querying.
2. Reader should at minimum be acquainted with the use of the basic Lucene API objects like IndexReader, IndexWriter, Query, Directory etc.

 

Code samples:
Present here is the example code to be used in conjunction with this article to understand the topic at hand. The code demonstrates the 2 types of boosting in Lucene (Indexing and Query Time) and also prints out the various scoring information associated with the results. The code is in the form of a Maven project and uses a RAMDirectory for ease of use.

 

Notes to set up and run the demo program,
1. Download the source code. The code makes use of the latest version (as of date of writing this article) of Lucene -> 4.6.
2. Run mvn package which will generate the JAR –> boost-imaginea-demo-1.0.jar
3. Place this jar along with the following Jars in a folder say “C:\Imaginea-Boost-Demo”.
         a. lucene-analyzers-common-4.6.0.jar
         b. lucene-core-4.6.0.jar
         c. lucene-queryparser-4.6.0.jar
         d. lucene-queries-4.6.0.jar
4. The program usage is as below,

 

Param 1: Type of boost:

index — Index Time Boosting

query — Query Time Boosting

both — Demo both Index and Query boosting

Param 2: Print scoring info: Either true or false

 

5. Example commands are as below,
C:\Imaginea-Boost-Demo>java -cp boost-imaginea-demo-1.0.jar;lucene-analyzers-common-4.6.0.jar;lucene-core-4.6.0.jar;lucene-queryparser-4.6.0.jar com.imaginea.boost.BoostExamples index false

 

C:\Imaginea-Boost-Demo>java -cp boost-imaginea-demo-1.0.jar;lucene-analyzers-common-4.6.0.jar;lucene-core-4.6.0.jar;lucene-queryparser-4.6.0.jar com.imaginea.boost.BoostExamples query false

 

C:\Imaginea-Boost-Demo>java -cp boost-imaginea-demo-1.0.jar;lucene-analyzers-common-4.6.0.jar;lucene-core-4.6.0.jar;lucene-queryparser-4.6.0.jar com.imaginea.boost.BoostExamples both false

First up in this article we need to pay a visit to the very important concepts of Scoring and Information Retrieval Models whose understanding will lay a good foundation towards understanding how boosting works beneath the hood.

 

Scoring:

You have most certainly run into scoring in your routine Lucene search queries; after all, Lucene sorts query results by their “score” if you don’t specify any sorting criteria. Every matching document has a score indicating how relevant it is to the search query. Lucene assigns this score after some number crunching (more on that later in this article) and presents the results sorted on it, highest first. The process begins the moment the query has been parsed and submitted to the IndexSearcher object. The first set of documents is retrieved by means of a Boolean model (see Information Retrieval Model below), which simply checks whether the document at hand contains the term/token or not. Once this basic subset of documents has been retrieved from the index, the scoring process begins and assigns a score to each document in the subset. By manipulating the score attached to a document, it is possible to selectively elevate a subset of documents and boost them to the top of the search results.
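The two phases described above (match first, then score and sort) can be illustrated with a small plain-Java sketch. This is not Lucene code: the class name, the in-memory "index" and the toy score (square root of the term frequency divided by the square root of the document length) are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; it does not use or mirror Lucene's classes.
final class MatchThenScoreSketch {

    /** A tiny in-memory "index": document name -> tokens of a single field. */
    static Map<String, List<String>> sampleDocs() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("d1", Arrays.asList("red", "suv"));
        docs.put("d2", Arrays.asList("suv", "suv", "suv", "blue"));
        docs.put("d3", Arrays.asList("white", "sedan"));
        return docs;
    }

    /** Returns the names of the matching documents, highest score first. */
    static List<String> search(Map<String, List<String>> docs, String term) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            List<String> tokens = doc.getValue();
            // Phase 1 (Boolean model): keep the document only if it contains the term.
            long freq = tokens.stream().filter(term::equals).count();
            if (freq == 0) {
                continue;
            }
            // Phase 2 (scoring): toy score, an assumption made for this sketch.
            scores.put(doc.getKey(), Math.sqrt(freq) / Math.sqrt(tokens.size()));
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((String name) -> scores.get(name)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // d3 never reaches the scoring phase because it does not contain "suv".
        System.out.println(search(sampleDocs(), "suv")); // prints [d2, d1]
    }
}
```

Raising the number attached to a document in the second phase is all that boosting does conceptually; the first phase is unaffected by it.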

 

Information Retrieval Model:    

To understand how the scoring process crunches numbers and assigns a score to each document, we need to bring in the concept of information retrieval models. The theory of information retrieval offers several models for finding the information relevant to a search query. When Lucene started out, only the Boolean and Vector Space models were implemented in it. The Vector Space model is still the default, but the first subset of documents returned by a search, before they are scored, always comes from the Boolean model, which checks for the presence of the search tokens in the documents. More recent versions of Lucene have added further information retrieval models. The complete list is as below,

1. Vector Space Model

2. Probabilistic Relevance Models. There are many flavours of this, like DFR (Divergence From Randomness) and BM25.

3. Language Models.

As mentioned earlier, Lucene by default uses the Vector Space Model. Lucene permits changing the model used for scoring through the Similarity class. We will look at implementing custom scoring and changing information retrieval models in parts 2 and 3 respectively. For now, refer to this link to understand how Lucene implements the vector space model.
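For orientation, the practical scoring function documented for Lucene's TF-IDF based similarity (TFIDFSimilarity) has roughly the following shape. The block is a transcription into LaTeX; tf, idf, coord, queryNorm and norm stand for the factors of the same name in the Lucene documentation, and boost(t) is the boost of query term t.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
  \sum_{t \in q}\Big(\mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)\Big)
```

The two factors that matter for this article are boost(t), which is supplied at query time, and norm(t,d), which is computed at indexing time and carries any boost set on the field.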

 

Different types of boosting:

Lucene supports two types of boosting. They are as below,

1. Index time boosting.

2. Query time boosting.

Index time boosting earlier covered both individual fields and the document as a whole, but the latter was discarded in later versions due to its irrelevance and other associated issues. For now, index time boosting is only possible at the field level. Let us delve into both types in greater depth.

 

Index Time Boosting:

You would have come across the following constant in the Field.Index enum (deprecated as of version 4): “ANALYZED_NO_NORMS”. Note the term “NORM”, which is relevant in the context of index time boosting. More on it in a short while, but first let us define index time boosting. Index time boosting means programmatically setting a boost on one or more fields (and thus influencing the score of the overall document) at the time of indexing. You are not actually setting the score here; the score depends on many factors (for example, the terms in the query itself), so what is being set is a number against a field which plays a part in calculating the score for a given query. This is where the norm comes into play. The norm is that one number stored against the field which affects the document’s score and thus its position in the search result pecking order. Norm is short for normalized value. The norm values are written to the index, and this can potentially (again, potentially) help query-time performance.
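To make the idea of "one number against the field" concrete, here is a minimal sketch of how the norm is derived, assuming the default TF-IDF similarity of Lucene 4.x, where the field boost is multiplied by a length factor of 1/sqrt(number of terms in the field). The class and method names are made up for this illustration. Lucene additionally squeezes the value into a single byte before writing it to the index, which loses precision; that step is left out here.

```java
// Hypothetical helper for illustration; not part of Lucene.
final class FieldNormSketch {

    /**
     * Norm of one field of one document: the index-time field boost
     * multiplied by a length factor that favours shorter fields.
     * Assumes the default TF-IDF similarity; the one-byte encoding is omitted.
     */
    static float norm(float fieldBoost, int numTerms) {
        return fieldBoost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(norm(1.0f, 1)); // unboosted field holding a single term
        System.out.println(norm(2.0f, 1)); // the same field with a boost of 2.0
        System.out.println(norm(1.0f, 4)); // a longer, unboosted field gets a smaller norm
    }
}
```

Because the boost is folded into the norm when the document is indexed, changing an index time boost later means re-indexing the document.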

When should I use index time boosting?

This pretty much depends on the business scenario at hand. Index time boosting is useful in scenarios where you know beforehand which subset of documents needs to be boosted. Let us take a real-world example: say you have a shopping site selling cars, with visitors from around the world, and the search results need to favour cars from the country of the user currently logged in. For instance, boost all products based in India for users whose current address country is India.

Let us go ahead and add some documents to the index,

public void populateIndex() {
    try {
        System.out.println(printBoostTypeInformation());
        indexWriter = new IndexWriter(ramDirectory, config);
        boostPerType("Lada Niva", "Brown", "2000000", "Russia", "SUV");
        boostPerType("Tata Aria", "Red", "1600000", "India", "SUV");
        boostPerType("Nissan Terrano", "Blue", "2000000", "Japan", "SUV");
        boostPerType("Mahindra XUV500", "Black", "1600000", "India", "SUV");
        boostPerType("Ford Ecosport", "White", "1000000", "USA", "SUV");
        boostPerType("Mahindra Thar", "White", "1200000", "India", "SUV");
        indexWriter.close();
    } catch (IOException | NullPointerException ex) {
        System.out.println("Something went wrong in this sample code -- "
                + ex.getLocalizedMessage());
    }
}

protected void boostPerType(String itemName, String itemColour,
        String itemPrice, String originOfItem, String itemType)
        throws IOException {
    Document docToAdd = new Document();
    docToAdd.add(new StringField("itemName", itemName, Field.Store.YES));
    docToAdd.add(new StringField("itemColour", itemColour, Field.Store.YES));
    docToAdd.add(new StringField("itemPrice", itemPrice, Field.Store.YES));
    docToAdd.add(new StringField("originOfItem", originOfItem, Field.Store.YES));

    TextField itemTypeField = new TextField("itemType", itemType, Field.Store.YES);
    docToAdd.add(itemTypeField);
    // Boost items made in India
    if ("India".equalsIgnoreCase(originOfItem)) {
        itemTypeField.setBoost(2.0f);
    }
    indexWriter.addDocument(docToAdd);
}

 


The cars have been added to the index in a random order. Notice these particular lines of code in the method boostPerType,

    // Boost items made in India
    if ("India".equalsIgnoreCase(originOfItem)) {
        itemTypeField.setBoost(2.0f);
    }

Here, the value of the field “originOfItem” is compared with the text “India”, and when it matches, a boost of 2.0 is assigned to the “itemType” field of that document. Let us write a query which does a term search for “suv” against the itemType field. The query would be as below,

itemType:suv

The code which performs the search is as below,

public void searchAndPrintResults() {
    try {
        IndexReader idxReader = DirectoryReader.open(ramDirectory);
        IndexSearcher idxSearcher = new IndexSearcher(idxReader);
        Query queryToSearch = new QueryParser(Version.LUCENE_46, "itemType",
                analyzer).parse(getQueryForSearch());

        System.out.println(queryToSearch);
        TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
        idxSearcher.search(queryToSearch, collector);
        ScoreDoc[] hitsTop = collector.topDocs().scoreDocs;
        System.out.println("Search produced " + hitsTop.length + " hits.");
        System.out.println("----------");
        for (int i = 0; i < hitsTop.length; ++i) {
            int docId = hitsTop[i].doc;
            Document docAtHand = idxSearcher.doc(docId);
            System.out.println(docAtHand.get("itemName") + "\t"
                    + docAtHand.get("originOfItem") + "\t"
                    + docAtHand.get("itemColour") + "\t"
                    + docAtHand.get("itemPrice") + "\t"
                    + docAtHand.get("itemType"));

            if (printExplanation) {
                Explanation explanation = idxSearcher.explain(queryToSearch,
                        hitsTop[i].doc);
                System.out.println("----------");
                System.out.println(explanation.toString());
                System.out.println("----------");
            }
        }
    } catch (IOException | ParseException ex) {
        System.out.println("Something went wrong in this sample code -- "
                + ex.getLocalizedMessage());
    } finally {
        ramDirectory.close();
    }
}

 
Let us take a look at the results. You can also try this in the demo code by running the following command,

C:\Imaginea-Boost-Demo>java -cp boost-imaginea-demo-1.0.jar;lucene-analyzers-common-4.6.0.jar;lucene-core-4.6.0.jar;lucene-queryparser-4.6.0.jar com.imaginea.boost.BoostExamples index false

The output would be as below. Notice that all the documents with the country India have been boosted.

[Image: index-without-explanation]

We can also take a look at how Lucene has calculated the score for the documents in our query result. It will be seen that the boosted cars with origin India have a much higher score than the others. You can also try this in the demo code by running the following command,

C:\Imaginea-Boost-Demo>java -cp boost-imaginea-demo-1.0.jar;lucene-analyzers-common-4.6.0.jar;lucene-core-4.6.0.jar;lucene-queryparser-4.6.0.jar com.imaginea.boost.BoostExamples index true

 
[Image: index-with-explanation]

It is seen that the boosted India-origin cars have a higher score than the ones not boosted: 1.69 versus 0.8.
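The two scores quoted above can be reproduced by hand. The sketch below assumes the default TF-IDF similarity of Lucene 4.x (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))), a single-term query with no query boost, and that all six cars contain the term “suv” exactly once in a one-term itemType field, so the norm is simply the field boost. The class and method names are invented for this illustration.

```java
// Hand calculation of the scores for the query itemType:suv; illustration only.
final class BoostScoreWorkedExample {

    /** Inverse document frequency as used by the default TF-IDF similarity. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Term frequency factor as used by the default TF-IDF similarity. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Score of one document for a single-term query with query boost 1. */
    static double score(int freq, int docFreq, int numDocs, double fieldNorm) {
        double idf = idf(docFreq, numDocs);
        double queryNorm = 1.0 / Math.sqrt(idf * idf); // normalizes the query weight
        double queryWeight = idf * queryNorm;          // works out to 1 for a single term
        double fieldWeight = tf(freq) * idf * fieldNorm;
        return queryWeight * fieldWeight;              // coord is 1 for a single-term query
    }

    public static void main(String[] args) {
        int numDocs = 6; // six cars in the index
        int docFreq = 6; // every car has itemType "SUV"
        System.out.printf("not boosted: %.4f%n", score(1, docFreq, numDocs, 1.0)); // about 0.85
        System.out.printf("boosted:     %.4f%n", score(1, docFreq, numDocs, 2.0)); // about 1.69
    }
}
```

Since every other factor is identical for all six documents, the index time boost of 2.0 is the only thing separating the two groups, and it doubles the score.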

Query Time Boosting:

We noted that in index time boosting the normalized value is assigned to a field and later used in calculating the score at the time of querying. In query time boosting the boost value is specified directly at the time of querying. You can do this either by calling the setBoost method of the various query objects or directly in the query text. Let us look at an example using the same data set of cars. There is a slight change in requirement though: it is now required that white cars are boosted to the top of the search results. Let us write a query for this,

itemColour:white ^2 OR itemType:suv

 


Note the text “^2” which follows the term itemColour:white. Here we have requested that documents with the colour white be assigned a higher score and thus boosted. Let us take a look at the results. You can also try this in the demo code by running the following command,

C:\Imaginea-Boost-Demo>java -cp boost-imaginea-demo-1.0.jar;lucene-analyzers-common-4.6.0.jar;lucene-core-4.6.0.jar;lucene-queryparser-4.6.0.jar com.imaginea.boost.BoostExamples query false
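The effect of “^2” on an OR query can be pictured with a deliberately simplified plain-Java model: every clause a document matches contributes a base weight of 1.0 multiplied by that clause's boost. This is an assumption made for illustration; real Lucene scoring also factors in tf, idf, norms, queryNorm and coord. All names below are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of query time boosting; not Lucene's scoring.
final class QueryBoostSketch {

    /** One optional clause of the query, e.g. itemColour:white^2. */
    static final class Clause {
        final String field;
        final String value;
        final double boost;

        Clause(String field, String value, double boost) {
            this.field = field;
            this.value = value;
            this.boost = boost;
        }
    }

    /** The query itemColour:white^2 OR itemType:suv expressed as clauses. */
    static List<Clause> exampleQuery() {
        return Arrays.asList(
                new Clause("itemColour", "white", 2.0),
                new Clause("itemType", "suv", 1.0));
    }

    /** Builds a minimal car document with only the two fields used by the query. */
    static Map<String, String> car(String colour, String type) {
        Map<String, String> doc = new HashMap<>();
        doc.put("itemColour", colour);
        doc.put("itemType", type);
        return doc;
    }

    /** Toy score: each matching clause adds its boost times an assumed base weight of 1.0. */
    static double score(Map<String, String> doc, List<Clause> clauses) {
        double total = 0.0;
        for (Clause clause : clauses) {
            String actual = doc.get(clause.field);
            if (actual != null && actual.equalsIgnoreCase(clause.value)) {
                total += 1.0 * clause.boost;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(score(car("White", "SUV"), exampleQuery())); // prints 3.0
        System.out.println(score(car("Black", "SUV"), exampleQuery())); // prints 1.0
    }
}
```

In this model every SUV still matches, but the white ones pick up the extra boosted contribution and therefore sort first, which is the behaviour the query above is asking for.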

 
[Image: query-without-explanation]

When should I use query time boosting?

Use it when the search results need to be driven by user input, or when you need to bring in boosts that are only known at query time. For example, you might call an external service to look up sponsored cars and boost those specifically; since this information was not available beforehand, you could not have boosted at index time.

Using the explain method of the searcher to understand what happens under the hood:

In the example code above you would have noticed the following lines,

Explanation explanation = idxSearcher.explain(queryToSearch, hitsTop[i].doc);
System.out.println("----------");
System.out.println(explanation.toString());

 
The explain method of the IndexSearcher object is a powerful tool to understand how Lucene has calculated the score and will be helpful in debugging as well.

 

—————————————————————————————————————-
Hope this part one was useful in understanding the basics of boosting. More on boosting coming up in parts 2 and 3. Please do feel free to leave any comments for feedback or corrections in the content.