Sunday, 26 July 2015

Exploring Solr Internals : The Lucene Inverted Index

Introduction

When working with Solr, understanding and properly configuring the underlying Lucene index is fundamental to taking deep control of your search.
With a better knowledge of what the index looks like and how each component is used, you can build a more performant, lightweight and efficient solution.
The scope of this blog post is to explore the different components of the Inverted Index.
It will be more about the data structures and how they contribute to provide Search related functionalities.
For the low level approaches used to store these data structures please refer to the official Lucene documentation [1]

Inverted Index

The Inverted Index is the basic data structure used by Lucene to provide Search in a corpus of documents.
It is quite similar to the index at the end of a book.
From wikipedia :
"In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents."

In Memory / On Disk

The Inverted index is the core data structure that is used to provide Search.
We are going to see in detail all the components involved.
It's important to know where the Inverted Index will be stored.
Assuming we are using a FileSystem Lucene Directory, the index will be stored on disk for durability ( we will not cover here the Commit concept and policies, so if curious see [2] ) .
Modern implementations of the FileSystem Directory leverage the OS memory mapping feature to load into memory ( RAM ) chunks of the index ( or possibly the entire index ) when necessary.
The index on the file system looks like a collection of immutable segments.
Each segment is a fully working Inverted Index, built from a set of documents.
The segment is a partition of the full index : it represents a part of it and it is fully searchable.
Each segment is composed of a number of binary files, each of them storing a particular data structure relevant to the index, in compressed form [1] .
To simplify, during the life of our index, while we are indexing data, we build segments, which are merged from time to time ( depending on the configured Merge Policy ).
But the scope of this post is not the indexing process, it is the structure of the index produced.

Hands on !

Let's take 3 documents in input, each of them with 2 simple fields, and see what the full Inverted Index will look like :

Doc0
    {  "id":"c",
        "title":"video game history"
},

Doc1
    {  "id":"a",
        "title":"game video review game"
},

Doc2
    {  "id":"b",
        "title":"game store"
},

Depending on the configuration in the schema.xml, at indexing time we generate the related data structures.
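
For example, using the Lucene API directly, the 3 documents above could be indexed with positions and offsets enabled on the title field with a sketch like the following ( a minimal example for illustration only : the index path is hypothetical and the exact API may vary slightly between Lucene versions ) :

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BuildExampleIndex {
  public static void main(String[] args) throws Exception {
    // hypothetical location for the example index
    FSDirectory dir = FSDirectory.open(Paths.get("/tmp/example-index"));
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    // title : index docs, frequencies, positions and offsets ( the richest option )
    FieldType titleType = new FieldType();
    titleType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    titleType.setTokenized(true);
    titleType.setStored(true);
    titleType.freeze();

    String[][] docs = { {"c", "video game history"},
                        {"a", "game video review game"},
                        {"b", "game store"} };
    for (String[] d : docs) {
      Document doc = new Document();
      doc.add(new StringField("id", d[0], Field.Store.YES)); // id : single, not analysed term
      doc.add(new Field("title", d[1], titleType));
      writer.addDocument(doc);
    }
    writer.close();
  }
}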

Let's see the Inverted Index in its complete form, then let's explain how each component can be used, and when to omit parts of it.


Field id

Ordinal | Term | Document Frequency | Posting List
0       | a    | 1                  | 1 : 1 : [1] : [0-1]
1       | b    | 1                  | 2 : 1 : [1] : [0-1]
2       | c    | 1                  | 0 : 1 : [1] : [0-1]


Field title

Ordinal | Term    | Document Frequency | Posting List
0       | game    | 3                  | 0 : 1 : [2] : [6-10],
        |         |                    | 1 : 2 : [1, 4] : [0-4, 18-22],
        |         |                    | 2 : 1 : [1] : [0-4]
1       | history | 1                  | 0 : 1 : [3] : [11-18]
2       | review  | 1                  | 1 : 1 : [3] : [11-17]
3       | store   | 1                  | 2 : 1 : [2] : [5-10]
4       | video   | 2                  | 0 : 1 : [1] : [0-5],
        |         |                    | 1 : 1 : [2] : [5-10]


This may look scary at the beginning, so let's analyse the different components of the data structure.

Term Dictionary

The term dictionary is a sorted skip list containing all the unique terms for the specific field.
Two operations are permitted, starting from a pointer in the dictionary :
next() -> to iterate one by one over the terms
advance(BytesRef b) -> to jump to an entry >= the input ( this operation is O(log n), where n = number of unique terms ).

An auxiliary Automaton is stored in memory, accepting a set of smartly calculated prefixes of the terms in the dictionary.
It is a weighted Automaton, and a weight is associated to each prefix ( i.e. the offset to look at in the Term Dictionary ) .
This automaton is used at query time to identify a starting point to look into the dictionary.

When we run a query ( a TermQuery for example ) :

1) we give the query in input to the In Memory Automaton, and an Offset is returned
2) we access the location associated to the Offset in the Term Dictionary
3) we advance to the BytesRef representation of the TermQuery
4) if the term is a match for the TermQuery we return the Posting List associated ( a minimal code sketch of this lookup follows )
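
Translated into code, the lookup boils down to seeking the TermsEnum of the field to the wanted term. A minimal sketch, assuming a Lucene 5.x style API ( the single-leaf wrapper and the index path are assumptions for the example, exact signatures may differ slightly between releases ) :

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class TermDictionaryLookup {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/example-index")));
    // single-leaf view of the whole index ( helper available in Lucene 5.x )
    LeafReader leaf = SlowCompositeReaderWrapper.wrap(reader);

    TermsEnum termsEnum = leaf.terms("title").iterator(); // the term dictionary for the field title
    if (termsEnum.seekExact(new BytesRef("game"))) {      // jump to the term in the sorted dictionary
      System.out.println("docFreq = " + termsEnum.docFreq());             // 3 in our example
      System.out.println("totalTermFreq = " + termsEnum.totalTermFreq()); // 4 occurrences overall
    }
    reader.close();
  }
}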

Document Frequency

This is simply the number of documents in the corpus containing the term t in the field f .
For the term game we have 3 documents in our corpus that contain the term in the field title .

Posting List

The posting list is the sorted skip list of DocIds that contain the related term.
It's used to return the documents matching the searched term.
Let's have a deep look at the complete Posting List for the term game in the field title :

0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
2 : 1 : [1] : [0-4]

Each element of this posting list is :
Document Ordinal : Term Frequency : [array of Term Positions] : [array of Term Offsets] .

Document Ordinal -> The ordinal ( internal Lucene ID ) of the document in the corpus containing the related term.
Never rely at application level on this ordinal, as it may change over time ( during segment merges for example ).
e.g.
According to the starting ordinals of each Posting List element :
Doc0 ( ordinal 0 ),
Doc1 ( ordinal 1 ) and
Doc2 ( ordinal 2 )
contain the term game in the field title .

Term Frequency -> The number of occurrences of the term in the Posting List element .
e.g.
0 : 1
Doc0 contains 1 occurrence of the term game in the field title.
1 : 2
Doc1 contains 2 occurrences of the term game in the field title.
2 : 1
Doc2 contains 1 occurrence of the term game in the field title.

Term Positions Array -> For each occurrence of the term in the Posting List element, it contains the related position in the field content .
e.g.
0 : [2]
Doc0 : the 1st occurrence of the term game in the field title occupies the 2nd position in the field content.
( "title":"video(1) game(2) history(3)" )
1 : [1, 4]
Doc1 : the 1st occurrence of the term game in the field title occupies the 1st position in the field content.
Doc1 : the 2nd occurrence of the term game in the field title occupies the 4th position in the field content.
( "title":"game(1) video(2) review(3) game(4)" )
2 : [1]
Doc2 : the 1st occurrence of the term game in the field title occupies the 1st position in the field content.
( "title":"game(1) store(2)" )

Term Offsets Array -> For each occurrence of the term in the Posting List element, it contains the related character offsets in the field content .
e.g.
0 : [6-10]
Doc0 : the 1st occurrence of the term game in the field title starts at the 6th char of the field content and ends at the 10th char ( excluded ).
( "title":"video(1) game(2) ..." -> 'game' spans the chars from 6 to 10 )
1 : [0-4, 18-22]
Doc1 : the 1st occurrence of the term game in the field title starts at the 0th char of the field content and ends at the 4th char ( excluded ).
Doc1 : the 2nd occurrence of the term game in the field title starts at the 18th char of the field content and ends at the 22nd char ( excluded ).
( "title":"game(1) video(2) review(3) game(4)" -> the two occurrences span the chars 0 to 4 and 18 to 22 )
2 : [0-4]
Doc2 : the 1st occurrence of the term game in the field title starts at the 0th char of the field content and ends at the 4th char ( excluded ).
( "title":"game(1) store(2)" -> 'game' spans the chars from 0 to 4 )

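All of the information above can be read back through the postings API. A minimal sketch, assuming a recent Lucene 5.x / 6.x style API ( index path assumed, exact signatures may differ slightly between releases ), that walks the posting list of game in title and prints doc ordinal, frequency, positions and offsets :

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class WalkPostingList {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/example-index")));
    LeafReader leaf = SlowCompositeReaderWrapper.wrap(reader);

    TermsEnum termsEnum = leaf.terms("title").iterator();
    termsEnum.seekExact(new BytesRef("game"));

    // ask for frequencies, positions and offsets ( they must have been indexed, see above )
    PostingsEnum postings = termsEnum.postings(null, PostingsEnum.ALL);
    int doc;
    while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      int freq = postings.freq();
      StringBuilder row = new StringBuilder(doc + " : " + freq + " : ");
      for (int i = 0; i < freq; i++) {
        int position = postings.nextPosition();
        row.append("[").append(position).append("] ")
           .append("[").append(postings.startOffset()).append("-").append(postings.endOffset()).append("] ");
      }
      System.out.println(row); // e.g. 1 : 2 : [1] [0-4] [4] [18-22]
    }
    reader.close();
  }
}
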
Live Documents

Live Documents is a lightweight data structure that keeps track of the documents alive in the current status of the index.
It simply associates a bit to each document : 1 ( alive ) , 0 ( deleted ).
This is used to keep the queries aligned to the status of the index, avoiding returning deleted documents.
Note : deleted documents are removed from the index data structures ( and the disk space released ) only when a segment merge happens.
This means that you can delete a set of documents, and the space in the index will be claimed back only after the first merge involving the related segments happens.
Assuming we delete Doc2, this is how the Live Documents data structure looks :

Ordinal | Alive
0       | 1
1       | 1
2       | 0

Ordinal -> The Internal Lucene document ID
Alive -> A bit that specifies if the document is alive or deleted.
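
A minimal sketch of how this bitset can be inspected through the Lucene API ( the index path is an assumption for the example ) :

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class CheckLiveDocs {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/example-index")));
    for (LeafReaderContext ctx : reader.leaves()) {   // one leaf per segment
      LeafReader leaf = ctx.reader();
      Bits liveDocs = leaf.getLiveDocs();             // null when the segment has no deletions
      for (int ord = 0; ord < leaf.maxDoc(); ord++) {
        boolean alive = (liveDocs == null) || liveDocs.get(ord);
        System.out.println("doc " + ord + " alive = " + alive);
      }
    }
    reader.close();
  }
}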

Norms

Norms is a data structure that provides length normalisation and boost factors per field per document.
A numeric value is associated to each field of each document.
This value is related to the length of the field content and possibly to an indexing time boost factor as well.
Norms are used when scoring the documents retrieved for a query.
In detail, it's a way to improve the relevancy of documents containing the term in short fields .
A boost factor can be associated at indexing time as well.
Short field contents will win over long field contents when matching a query with Norms enabled.

Field title

Doc Ordinal | Norm
0           | 0.78
1           | 0.56
2           | 0.98
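
The norms of a field can be read back as doc values. A minimal sketch, assuming a Lucene 5.x style API ( index path assumed ; note that the raw stored value is an encoded long, the decimal factors in the table above are just illustrative ) :

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;

public class ReadNorms {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/example-index")));
    LeafReader leaf = SlowCompositeReaderWrapper.wrap(reader);
    NumericDocValues norms = leaf.getNormValues("title"); // null if norms were omitted for the field
    if (norms != null) {
      for (int ord = 0; ord < leaf.maxDoc(); ord++) {
        // the raw value is an encoded long, decoded by the Similarity at scoring time
        System.out.println("doc " + ord + " encoded norm = " + norms.get(ord));
      }
    }
    reader.close();
  }
}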

Schema Configuration

When configuring a field in Solr ( or directly in Lucene ) it is possible to specify a set of field attributes to control which data structures are going to be produced .
Let's take a look at the different Lucene Index Options :

Lucene Index Option : NONE
Solr schema : indexed="false"
Description : the inverted index will not be built.
To use when : you don't need to search in your corpus of documents.

Lucene Index Option : DOCS
Solr schema : omitTermFreqAndPositions="true"
Description : the posting list for each term will simply contain the document ids ( ordinals ) and nothing else.
e.g. game -> 0, 1, 2
To use when :
- you don't need to search in your corpus with phrase or positional queries
- you don't need the score to be affected by the number of occurrences of a term in a document field

Lucene Index Option : DOCS_AND_FREQS
Solr schema : omitPositions="true"
Description : the posting list for each term will contain the document ids ( ordinals ) and the term frequency in each document.
e.g. game -> 0 : 1, 1 : 2, 2 : 1
To use when :
- you don't need to search in your corpus with phrase or positional queries
- you do need scoring to take Term Frequencies into consideration

Lucene Index Option : DOCS_AND_FREQS_AND_POSITIONS
Solr schema : default when indexed="true"
Description : the posting list for each term will contain the term positions in addition.
e.g. game -> 0 : 1 : [2], 1 : 2 : [1, 4], 2 : 1 : [1]
To use when :
- you do need to search in your corpus with phrase or positional queries
- you do need scoring to take Term Frequencies into consideration

Lucene Index Option : DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
Solr schema : storeOffsetsWithPositions="true"
Description : the posting list for each term will contain the term offsets in addition.
e.g. game ->
0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
2 : 1 : [1] : [0-4]
To use when :
- you want to use the Postings Highlighter, a fast version of highlighting that uses the posting list instead of the term vector

Norms omitted
Solr schema : omitNorms="true"
Description : the norms data structure will not be built.
To use when :
- you don't need to boost short field contents
- you don't need indexing time boosting per field
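
Putting the options together, a hypothetical schema.xml fragment ( field names and types here are assumptions, only the attributes map to the options above ) could look like :

<!-- hypothetical schema.xml fragment : field names and types are assumptions -->
<field name="id" type="string" indexed="true" stored="true"/>
<!-- DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS : phrase queries + postings based highlighting -->
<field name="title" type="text_general" indexed="true" stored="true" storeOffsetsWithPositions="true"/>
<!-- DOCS only, no norms : a pure filtering field -->
<field name="category" type="string" indexed="true" stored="true" omitTermFreqAndPositions="true" omitNorms="true"/>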


[1] Lucene File Formats
[2] Understanding commits and Tlog in Solr

Saturday, 11 July 2015

Solr : " You complete me! "

Introduction

If there's one thing that months on the Solr-user mailing list have taught me, it is that the Autocomplete feature in a Search Engine is vital and that around Solr Autocomplete there's as much hype as confusion.
In this blog post I am going to try to clarify as much as possible all the kinds of Suggesters that can be used in Solr, exploring in detail how they work and showing some real world examples.
It's not in the scope of this blog post to explore the configurations in detail.
Please use the official wiki [1] and this really interesting blog post [2] to complement this resource.
Let's start with the definition of the Suggester component.

Solr Suggester

From the official Solr wiki [1]:
" The SuggestComponent in Solr provides users with automatic suggestions for query terms. You can use this to implement a powerful auto-suggest feature in your search application.
This approach utilizes Lucene's Suggester implementation and supports all of the lookup implementations available in Lucene.
The main features of this Suggester are:
  • Lookup implementation pluggability
  • Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
  • Distributed support "
For the details of the configuration parameters I suggest the official wiki as a reference.
Our focus will be the practical use of the different Lookup Implementations, with clear examples.

Term Dictionary

The Term Dictionary defines the way the terms ( the source for the suggestions ) are retrieved.
There are different ways of retrieving the terms; we are going to focus on the DocumentDictionary ( the most common and simple to use ).
For details about the other Dictionary implementations please refer to the official documentation as usual.
The DocumentDictionary uses the Lucene Index to provide the list of possible suggestions, and specifically a field is set to be the source for these terms.

Suggester Building

Building a suggester is the process of :
  • retrieving the terms ( source for the suggestions ) from the dictionary
  • building the data structures that the Suggester requires for the lookup at query time
  • storing the data structures in memory / on disk
The produced data structures will be stored in memory in the first place.
It is suggested to additionally store the built data structures on disk : in this way they will be available without rebuilding when they are not in memory anymore.
For example when you start up Solr, the data will be loaded from disk to memory without any rebuilding being necessary.
The parameter that controls this is :
"storeDir" for the FuzzyLookup
"indexPath" for the AnalyzingInfixLookup

The built data structures will later be used by the suggester lookup strategy, at query time.
In detail, for the DocumentDictionary, during the building process, for ALL the documents in the index :
  • the stored content of the configured field is read from disk ( stored="true" is required for the field for the Suggester to work )
  • the compressed content is decompressed ( remember that Solr stores the plain content of a field applying a compression algorithm [3] )
  • the suggester data structure is built
We must be really careful here about this sentence :
"for ALL the documents" -> no delta dictionary building is happening

So take extra care every time you decide to build the Suggester !
Two suggester configuration parameters are strictly related to this observation :


buildOnCommit or buildOnOptimize : if true, the lookup data structure will be rebuilt after each soft-commit. If false ( the default ), the lookup data will be built only when requested by the query parameter suggest.build=true.
Because of the previous observation it is quite easy to understand why buildOnCommit is highly discouraged.

buildOnStartup : if true, the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found.
Again, it is highly discouraged to set this to true, or our Solr cores could take a really long time to start up.
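
With the defaults, the build is therefore an explicit operation. As an illustration ( the collection name is hypothetical, the suggester name is taken from the configurations below ), a manual build and a subsequent lookup against a /suggest request handler look like :

http://localhost:8983/solr/mycollection/suggest?suggest=true&suggest.dictionary=AnalyzingSuggester&suggest.build=true

http://localhost:8983/solr/mycollection/suggest?suggest=true&suggest.dictionary=AnalyzingSuggester&suggest.q=Video+gam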

A good consideration at this point would be to introduce a delta approach to the dictionary building.
It could be a good improvement, making more sense out of the "buildOnCommit" feature.
I will follow up verifying the technical feasibility of this solution.
Now let's step to the description of the various lookup implementations, with related examples.

Note : when using the field type "text_en" we refer to a simple English analyser with soft stemming and a stop filter enabled.
The simple corpus of documents for the examples will be the following :

[
      {
        "id":"44",
        "title":"Video gaming: the history"},
      {
        "id":"11",
        "title":"Video games are an economic business"},
      {
        "id":"55",
        "title":"The new generation of PC and Console Video games"},
      {
        "id":"33",
        "title":"Video games: multiplayer gaming"}]

And a simple synonym mapping : multiplayer, online
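
For reference, a possible definition of such a "text_en" type : this is only a hedged sketch, the exact factories are an assumption chosen to reproduce the behaviour shown in the examples ( KStem as the soft stemmer, stop words and the synonym mapping above ) :

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>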

AnalyzingLookupFactory

<lst name="suggester">
  <str name="name">AnalyzingSuggester</str>
  <str name="lookupImpl">AnalyzingLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>


Data Structure : FST
Building : for each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType.
The tokens produced are added to the Index FST.
Lookup strategy : the query is analysed, and the tokens produced are added to the query FST.
An intersection happens between the Index FST and the query FST.
The suggestions are identified starting at the beginning of the field content.
Suggestions returned : the entire content of the field .

This suggester is quite powerful as it allows providing suggestions that match the beginning of a field content, taking advantage of the analysis chain configured for the field.
In this way it is possible to provide suggestions considering synonyms, stop words, stemming and any other token filter used in the analysis.

Let's see some example:

Query to autocomplete : "Video gam"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the suggestions returned are simply the result of the prefix match. No surprises so far.

Query to autocomplete : "Video Games"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the input query is analysed, and the tokens produced are the following : "video" "game".
The analysis was applied at building time as well, producing the same stemmed terms for the beginning of the titles :
"video gaming" -> "video" "game"
"video games" -> "video" "game"
So the prefix match applies.

Query to autocomplete : "Video game econ"
Suggestions :
  • "Video games are an economic business"
Explanation : in this case we can see that the stop words were not considered when building the Index FST.
Note : position increments MUST NOT be preserved for this example to work, see the configuration details.

Query to autocomplete : "Video games online ga"
Suggestions :
  • "Video games: multiplayer gaming"
Explanation : synonym expansion has happened and the match is returned, as online and multiplayer are considered synonyms by the suggester, based on the analysis applied.


FuzzyLookupFactory

<lst name="suggester">
  <str name="name">FuzzySuggester</str>
  <str name="lookupImpl">FuzzyLookupFactory</str> 
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>



Data Structure : FST
Building : for each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType.
The tokens produced are added to the Index FST.
Lookup strategy : the query is analysed, and the tokens produced are then expanded, producing for each token all the variations allowed by the max edit distance configured for the String distance function in use ( the default is the Levenshtein distance [4] ).
The tokens finally produced are added to the query FST, keeping the variations.
An intersection happens between the Index FST and the query FST.
The suggestions are identified starting at the beginning of the field content.
Suggestions returned : the entire content of the field .

This suggester is quite powerful as it allows providing suggestions at the beginning of a field content, adding a fuzzy search on top of the analysis chain configured for the field.
In this way it is possible to provide suggestions considering synonyms, stop words, stemming and any other token filter used in the analysis, and to also support terms misspelled by the user.
It is an extension of the Analyzing lookup.

IMPORTANT : Remember the proper order of processing happening at query time :

  • FIRST, the query is analysed, and tokens produced
  • THEN, the tokens are expanded with the inflections based on the Edit distance and distance algorithm configured


Let's see some example:

Query to autocomplete : "Video gmaes"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the input query is analysed, and the tokens produced are the following : "video" "gmae".
Then the associated FST is expanded with new states containing the inflections of each token.
For example "game" will be added to the query FST because it has a distance of 1 from the original token.
And the prefix matching works fine, returning the expected suggestions.

Query to autocomplete : "Video gmaing"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the input query is analysed, and the tokens produced are the following : "video" "gma".
Then the associated FST is expanded with new states containing the inflections of each token.
For example "gam" will be added to the query FST because it has a distance of 1 from the original token.
So the prefix match applies.

Query to autocomplete : "Video gamign"
Suggestions :
  • No suggestion returned
Explanation : this can seem odd at first, but it is coherent with the lookup implementation.
The input query is analysed, and the tokens produced are the following : "video" "gamign".
Then the associated FST is expanded with new states containing the inflections of each token.
For example "gaming" will be added to the query FST because it has a distance of 1 from the original token.
But no prefix matching will apply because in the Index FST we have "game", the stemmed token for "gaming".

AnalyzingInfixLookupFactory

<lst name="suggester">
  <str name="name">AnalyzingInfixSuggester</str>
  <str name="lookupImpl">AnalyzingInfixLookupFactory</str> 
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="weightField">price</str>
  <str name="suggestAnalyzerFieldType">text_en</str>
</lst>


Data Structure : Auxiliary Lucene Index
Building : for each Document, the stored content from the field is analyzed according to the suggestAnalyzerFieldType and then additionally EdgeNgram token filtered.
Finally an auxiliary index is built with those tokens.
Lookup strategy : the query is analysed according to the suggestAnalyzerFieldType.
Then a phrase search is triggered against the Auxiliary Lucene index.
The suggestions are identified starting at the beginning of each token in the field content.
Suggestions returned : the entire content of the field .

This suggester is really common nowadays as it allows providing suggestions in the middle of a field content, taking advantage of the analysis chain configured for the field.
In this way it is possible to provide suggestions considering synonyms, stop words, stemming and any other token filter used in the analysis, and to match the suggestion based on internal tokens.

Let's see some example:

Query to autocomplete : "gaming"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the input query is analysed, and the tokens produced are the following : "game" .
In the Auxiliary Index, for each field content we have the EdgeNgram tokens :
"v","vi","vid"… , "g","ga","gam","game" .
So the match happens and the suggestions are returned.

Query to autocomplete : "ga"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the input query is analysed, and the tokens produced are the following : "ga" .
In the Auxiliary Index, for each field content we have the EdgeNgram tokens :
"v","vi","vid"… , "g","ga","gam","game" .
So the match happens and the suggestions are returned.

Query to autocomplete : "game econ"
Suggestions :
  • "Video games are an economic business"
Explanation : stop words will not appear in the Auxiliary Index.
Both "game" and "econ" will, so the match applies.

BlendedInfixLookupFactory

We are not going to describe the details of this lookup strategy as it's pretty much the same as the AnalyzingInfix.
The only difference appears in scoring the suggestions, to weight prefix matches across the matched documents : the score will be higher if a hit is closer to the start of the suggestion, or vice versa.

FSTLookupFactory

<lst name="suggester">
  <str name="name">FSTSuggester</str>
  <str name="lookupImpl">FSTLookupFactory</str> 
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
</lst>


Data Structure : FST
Building : for each Document, the stored content is added to the Index FST.
Lookup strategy : the query is added to the query FST.
An intersection happens between the Index FST and the query FST.
The suggestions are identified starting at the beginning of the field content.
Suggestions returned : the entire content of the field .

This suggester is quite simple as it allows providing suggestions matching the beginning of a field content, with an exact prefix match.

Let's see some example:

Query to autocomplete : "Video gam"
Suggestions :
  • "Video gaming: the history"
  • "Video games are an economic business"
  • "Video games: multiplayer gaming"
Explanation : the suggestions returned are simply the result of the prefix match. No surprises so far.

Query to autocomplete : "Video Games"
Suggestions :
  • No suggestions
Explanation : the input query is not analysed, and no field content in the documents starts with that exact prefix.

Query to autocomplete : "video gam"
Suggestions :
  • No suggestions
Explanation : the input query is not analysed, and no field content in the documents starts with that exact prefix.

Query to autocomplete : "game"
Suggestions :
  • No suggestions
Explanation : this lookup strategy works only at the beginning of the field content. So no suggestion is returned.


For the following lookup strategy we are going to use a slightly modified corpus of documents :

[
      {
        "id":"44",
        "title":"Video games: the history"},
      {
        "id":"11",
        "title":"Video games the historical background"},
      {
        "id":"55",
        "title":"Superman, hero of the modern time"},
      {
        "id":"33",
        "title":"the study of the hierarchical faceting"}]

FreeTextLookupFactory

<lst name="suggester">
  <str name="name">FreeTextSuggester</str>
  <str name="lookupImpl">FreeTextLookupFactory</str> 
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">title</str>
  <str name="ngrams">3</str>
  <str name="separator"> </str>
  <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
</lst>


Data Structure : FST
Building : for each Document, the stored content from the field is analyzed according to the suggestFreeTextAnalyzerFieldType.
As a last token filter, a ShingleFilter is added, with minShingle=2 and maxShingle=<ngrams>.
The final tokens produced are added to the Index FST.
Lookup strategy : the query is analysed according to the suggestFreeTextAnalyzerFieldType.
As a last token filter, a ShingleFilter is added, with minShingle=2 and maxShingle=<ngrams>.
Only the last "ngrams" tokens will be evaluated to produce the suggestions.
Suggestions returned : ngram tokens ( not the full content of the field )

This lookup strategy is completely different from the others seen so far : its main difference is that the suggestions are ngram tokens ( and NOT the full content of the field ).
We must take extra care in using this suggester as it is quite easily prone to errors. Some guidelines :

  • Don't use a heavy Analyzer : the suggested terms will come from the index, so be sure they are meaningful tokens. A really basic analyser is suggested ; stop words and stemming are not.
  • Be sure you use the proper separator ( ' ' is suggested ) , the default will be encoded as "#30;"
  • the ngrams parameter sets the number of last tokens from the query to be considered


Let's see some example:

Query to autocomplete : "video g"
Suggestions :
  • "video gaming"
  • "video games"
  • "generation"
Explanation : the input query is analysed, and the tokens produced are the following : "video g" , "g" .
The analysis was applied at building time as well, producing 2-3 shingles.
"video g" matches by prefix 2 shingles from the Index FST .
"g" matches by prefix 1 shingle from the Index FST.

Query to autocomplete : "games the h"
Suggestions :
  • "games the history"
  • "games the historical"
  • "the hierarchical"
  • "hero"
Explanation : the input query is analysed, and the tokens produced are the following : "games the h" , "the h" , "h" .
The analysis was applied at building time as well, producing 2-3 shingles.
"games the h" matches by prefix 2 shingles from the Index FST .
"the h" matches by prefix 1 shingle from the Index FST.
"h" matches by prefix 1 shingle from the Index FST.


[1] Suggester Solr wiki
[2] Solr suggester
[3] Lucene Storing Compression
[4] Levenshtein Distance

Saturday, 4 July 2015

Solr Document Classification - Part 1 - Indexing Time

Introduction

In the previous blog post [1] we explored the world of Lucene Classification and the extension to use it for Document Classification .
It comes natural to integrate Solr with the Classification module and allow Solr users to easily manage Classification out of the box .

N.B.  This is supported from Solr 6.1

Solr Classification

Taking inspiration from the work of a dear friend [2] , the integration of classification in Solr can happen on 2 sides :
  • Indexing time - through an Update Request Processor
  • Query time - through a Request Handler ( similar to the More Like This )
In this first article we are going to explore the Indexing time integration :
the Classification Update Request Processor.

Classification Update Request Processor

First of all let's describe some basic concepts :
An Update Request Processor Chain, associated to an Update Handler, is a pipeline of Update Processors that will be executed in sequence.
It takes in input the added document ( to be indexed ) and returns the document after it has been processed by all the processors in the chain in sequence.
Finally the document is indexed.
An Update Request Processor is the unit of processing of a chain : it takes in input a document and operates some processing before it is passed to the following processor in the chain, if any.

The main reason for the Update Processor is to add intermediate processing steps that can enrich, modify and possibly filter documents, before they are indexed.
It is important because the processor has a view of the entire document, so it can operate on all the fields the document is composed of.
For further details, follow the official documentation [3].

Description

The Classification Update Request Processor is a simple processor that will automatically classify a document ( the classification will be based on the latest index available ) adding a new field containing the class, before the document is indexed.
After an initial valuable index has been built with human assigned labels on the documents, thanks to this Update Request Processor it will be possible to ingest documents with automatically assigned classes.
The processing steps are quite simple :
when a document to be indexed enters the Update Processor Chain and arrives at the Classification step, this sequence of operations will be executed :
  • the latest Index Reader is retrieved from the latest opened Searcher
  • a Lucene Document Classifier is instantiated with the config parameters in the solrconfig.xml
  • a class is assigned by the classifier taking into consideration all the relevant fields of the input document
  • a new field is added to the original document, with the class
  • the document goes through the next processing step

Configuration

Let's see the detailed configuration for the Update Processor with examples :

e.g. K Nearest Neighbours Classifier
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title^1.5,content,author</str>
    <str name="classField">cat</str>
    <str name="algorithm">knn</str>
    <str name="knn.k">20</str>
    <str name="knn.minTf">1</str>
    <str name="knn.minDf">5</str>
  </processor>
</updateRequestProcessorChain>


e.g. Simple Naive Bayes Classifier
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title^1.5,content,author</str>
    <str name="classField">cat</str>
    <str name="algorithm">bayes</str>
  </processor>
</updateRequestProcessorChain>


e.g. Update Handler Configuration
<requestHandler name="/update" >
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</requestHandler>



inputFields ( mandatory ) : the list of fields ( comma separated ) to be taken into consideration for doing the classification.
Boosting syntax is supported for each field.

classField ( mandatory ) : the field that contains the class of the document. It must appear in the indexed documents .
With the knn algorithm it must be stored .
With the bayes algorithm it must be indexed and ideally not heavily analysed.

algorithm ( default : knn ) : the algorithm to use for the classification :
- knn ( K Nearest Neighbours )
- bayes ( Simple Naive Bayes )

knn.k ( default : 10 ) : advanced - the number of top docs to select from the MLT results to find the nearest neighbours

knn.minDf ( default : 1 ) : advanced - a term ( from the input text ) will be taken into consideration by the algorithm only if it appears in at least this minimum number of docs in the index

knn.minTf ( default : 1 ) : advanced - a term ( from the input text ) will be taken into consideration by the algorithm only if it appears at least this minimum number of times in the input
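
To illustrate the end-to-end behaviour ( the collection name and the field values below are purely hypothetical ), with the chain above configured as default for the /update handler, a document sent without the class field :

curl "http://localhost:8983/solr/mycollection/update?commit=true" -H "Content-Type: application/json" -d '
[{ "id":"1000",
   "title":"A new console generation is coming",
   "content":"An overview of the upcoming gaming hardware ...",
   "author":"John Doe" }]'

will reach the index with the "cat" field automatically populated, predicted from the previously indexed, human labelled documents.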

Usage

Indexing News Documents ? we can use the already indexed news with category,  to automatically tag upcoming stories with no human intervention.
E-commerce Search System ? Category assignation will require few human interaction after a valid initial corpus of products has been indexed with manually assigned category.
The possible usage for this Update Request Processor are countless.
In any scenario where we have documents with a class or category manually assigned in our Search System, the automatic Classification can be a perfect fit.
Leveraging the existent Index , the overhead for the Classification processing will be minimal.
After an initial human effort to have a good corpus of classified Documents, the Search System will be able to automatically index the class for the upcoming Documents.
Of course we must remember that for advanced classification scenarios that require in deep tuning, this solution can be not optimal.

Code

The patch will be attached to this Jira Issue :


You can play with it until it is officially supported.



Thursday, 2 July 2015

Lucene Document Classification

Introduction

Machine Learning and Search have always been strictly associated.
ML can help to improve the Search experience in a lot of ways, extracting more information from the corpus of documents, auto classifying them and clustering them.
On the other hand, Search related data structures are quite similar in content to the ML models used to solve the problems, so we can use them to build up our own algorithms.
But let's go in order...
In this article we are going to explore how important Auto Classification can be to ease the user experience in Search, and how it is possible to have an easy, out of the box, painless classification directly from Lucene, from our already existing index, without any external dependency or trained model.

Classification

From Wikipedia :
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into "spam" or "non-spam" classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
Classification is a very well known problem, and it is really easy to verify the utility of classifiers on a daily basis.
When dealing with Search Engines, it is very common for our documents to have a category ( generally assigned by a human ). Wouldn't it be really useful to be able to automatically extract the category for the documents that are missing it ? Wouldn't it be really interesting to automate that processing step to have documents automatically categorised, based on the experience ? ( which means using all the documents already categorised in our Search System )
Wouldn't it be amazing to do that without any external tool and without any additional expensive model training activity ?
We already have a lot of information in our Search System, let's use it !

Lucene Classification Module

To be able to provide a Classification capability, our system generally needs a trained model.
Classification is a problem solved by Supervised Machine Learning algorithms, which means humans need to provide a training set.
A classification training set is a set of documents, manually tagged with the correct class.
It is the "experience" that the Classification system will use to classify upcoming unseen documents.
Building a trained model is expensive and requires specific data structures to be stored.
Is it really necessary to build an external trained model when we already have in our index millions of documents, already classified by humans and ready to be searched ?
A Lucene index already has really interesting data structures that relate terms to documents ( Inverted Index ), fields to terms ( Term Vectors ) and all the information we need about Term Frequency and Document Frequency.
A corpus of documents, already tagged with the proper class in a specific field, can be a really useful resource to build our classification inside our search engine without any external tool or data structure.
Based on these observations, the Apache Lucene [1] Open Source project introduced with version 4.2 a document classification module to manage text classification using the Index data structures.
The facade for this module is the text Classifier, a simple interface ( with 3 implementations available ) :

public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   *
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type T and score
   */
  ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Get all the classes (sorted by score, descending) assigned to the given text String.
   *
   * @param text a String containing text to be classified
   * @return the whole list of {@link ClassificationResult}, the classes and scores. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(String text) throws IOException;

  /**
   * Get the first max classes (sorted by score, descending) assigned to the given text String.
   *
   * @param text a String containing text to be classified
   * @param max the number of return list elements
   * @return the whole list of {@link ClassificationResult}, the classes and scores. Cut for "max" number of elements. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(String text, int max) throws IOException;
}

The available implementations are :
  • KNearestNeighborClassifier - a k-Nearest Neighbours classifier [2] based on MoreLikeThis
  • SimpleNaiveBayesClassifier - a simplistic Lucene based Naive Bayes classifier [3]
  • BooleanPerceptronClassifier - a perceptron [4] based Boolean classifier. The weights are calculated using TermsEnum.totalTermFreq both on a per field and a per document basis, and then a corresponding FST is used for class assignment.
Let's see the first 2 in detail :

KNearestNeighborClassifier

This classifier is based on the Lucene More Like This [5],
a feature able to retrieve documents similar to a seed one ( or a seed text ), calculating the similarity between the docs on a per field basis.
It takes in input the list of fields to take into consideration ( with a relative boost factor to increase the importance of some fields over the others ).

The idea behind this algorithm is simple :
 - given a set of relevant fields for our classification
 - given a field containing the class of the document
it retrieves all the documents similar to the text in input using the MLT.

Only the documents with the class field populated are taken into consideration.
The top k documents in the result are evaluated, and the class is extracted from all of them.
Then a ranking of the retrieved classes is made, based on the frequency of the class in the top K documents.
Currently the algorithm takes into consideration only the frequency of the class to calculate its score.
One limitation is that it does not take into consideration the ranking of the class.
This means that if in the top K we have the first k/2 documents of class C1 and the second k/2 documents of class C2, both classes will have the same score [6] .

This classifier works on top of the index ; let's quickly have an overview of the constructor parameters :


leafReader : the Index Reader that will be used to read the index data structures and classify the document
analyzer : the Analyzer to be used to analyse the unseen input text
query : a filter to apply to the indexed documents. Only the ones that satisfy the query will be used for the classification
k : the number of top docs to select from the MLT results to find the nearest neighbours
minDocsFreq : a term ( from the input text ) will be taken into consideration by the algorithm only if it appears in at least this minimum number of docs in the index
minTermFreq : a term ( from the input text ) will be taken into consideration by the algorithm only if it appears at least this minimum number of times in the input
classFieldName : the field that contains the class of the document. It must appear in the indexed documents. MUST BE STORED
textFieldNames : the list of fields to be taken into consideration for doing the classification. They must appear in the indexed documents

Note : MoreLikeThis constructs a Lucene query based on terms in a document. It does this by pulling terms from the defined list of fields ( the textFieldNames parameter, above ).

You usually read that for best results the fields should have stored term vectors.
In our case the text to classify is unseen, it is not an indexed document.
So the Term Vectors are not used at all.
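
To make the parameters above concrete, here is a minimal usage sketch. It follows the constructor shape described in the table ( the exact signature differs between Lucene releases, e.g. later versions also take a Similarity ), and the index path, field names and input text are assumptions for the example :

import java.nio.file.Paths;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class KnnClassificationExample {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    LeafReader leafReader = SlowCompositeReaderWrapper.wrap(reader); // single-leaf view of the index

    KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(
        leafReader,
        new EnglishAnalyzer(),   // analyzer for the unseen input text
        null,                    // query : no filter, use the whole index
        10,                      // k : top MLT results to evaluate
        1,                       // minDocsFreq
        1,                       // minTermFreq
        "cat",                   // classFieldName ( must be stored )
        "title", "content");     // textFieldNames

    ClassificationResult<BytesRef> result =
        classifier.assignClass("The history of video games across generations");
    System.out.println("class = " + result.getAssignedClass().utf8ToString()
        + " score = " + result.getScore());
    reader.close();
  }
}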

SimpleNaiveBayesClassifier

This classifier is based on a simplistic implementation of a Naive Bayes Classifier [3].
It uses the Index to get the Term Frequencies of the terms, the Doc Frequencies and the unique terms.
First of all it extracts all the possible classes ; this is obtained by getting all the terms for the class field.
Then each class is scored based on :
- how frequent the class is among all the classified documents in the index
- how likely the tokenized text is to belong to the class
The details of the algorithm go beyond the scope of this blog post.

This classifier works on top of the index ; let's quickly have an overview of the constructor parameters :



leafReader : the Index Reader that will be used to read the index data structures and classify the document
analyzer : the Analyzer to be used to analyse the unseen input text
query : a filter to apply to the indexed documents. Only the ones that satisfy the query will be used for the classification
classFieldName : the field that contains the class of the document. It must appear in the indexed documents
textFieldNames : the list of fields to be taken into consideration for doing the classification. They must appear in the indexed documents

Note : the Naive Bayes Classifier works on terms from the index. This means it pulls from the index the tokens of the class field. Each token will be considered a class, and will have a score associated.
This means that you must be careful with the analysis you choose for the classField, and ideally use a non-tokenized field containing the class ( a copyField if necessary ).
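
Similarly, a minimal usage sketch for the Naive Bayes flavour ( same caveats as before : the signature follows the table above and may differ slightly across Lucene versions ; the index path, field names and input text are assumptions ) :

import java.nio.file.Paths;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class BayesClassificationExample {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    LeafReader leafReader = SlowCompositeReaderWrapper.wrap(reader);

    SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier(
        leafReader,
        new EnglishAnalyzer(),   // analyzer used to tokenize the unseen text
        null,                    // no filter query
        "cat",                   // classFieldName ( ideally a non-tokenized field )
        "title", "content");     // textFieldNames

    ClassificationResult<BytesRef> result =
        classifier.assignClass("The history of video games across generations");
    System.out.println("class = " + result.getAssignedClass().utf8ToString()
        + " score = " + result.getScore());
    reader.close();
  }
}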

Document Classification Extension

The original text Classifiers are perfectly fine, but what about Document classification ?
In a lot of cases the information we want to classify is actually composed of a set of fields, with one or more values each.
Each field content contributes to the classification ( with a different weight, to be precise ).
Given a simple news article, the title is much more important for the classification in comparison with the text and the author.
But even if not so relevant, even the author can play a little part in the class assignment.
Lucene's atomic unit of information is the Document.

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value.
This data structure is a perfect input for a new generation of Classifiers that will benefit from the augmented information to assign the relevant class(es).
Whenever a simple text is ambiguous, including other fields in the analysis dramatically improves the precision of the classification.
In detail :

  • the field content from the input Document will be compared with the data structures in the index related to that field only
  • an input field will be analysed according to its own analysis chain, allowing greater flexibility and precision
  • an input field can be boosted, to affect the classification more. In this way different portions of the input document will have a different relevancy in discovering the class of the document.
A new interface is provided :

public interface DocumentClassifier<T> {

  /**
   * Assign a class (with score) to the given {@link org.apache.lucene.document.Document}
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification.
   * @return a {@link org.apache.lucene.classification.ClassificationResult} holding assigned class of type T and score
   */
  ClassificationResult<T> assignClass(Document document) throws IOException;

  /**
   * Get all the classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}.
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification.
   * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(Document document) throws IOException;

  /**
   * Get the first max classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}.
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification.
   * @param max the number of return list elements
   * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Cut for "max" number of elements. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(Document document, int max) throws IOException;
}

And 2 classifiers are extended to provide the new functionality :
  • KNearestNeighborDocumentClassifier
  • SimpleNaiveBayesDocumentClassifier

The first implementation is available as a contribution for this Jira issue :

LUCENE-6631

Now that a new interface is available, it will be much easier to integrate it with Apache Solr.

Stay tuned for the Solr Classification Integration.

[1] Apache Lucene
[2] K-nearest_neighbors
[3] Naive Bayes
[4] Perceptron
[5] More Like This
[6] LUCENE-6654 - Fix proposed