Wednesday, April 5, 2017

Indexing Attachments with Elasticsearch

Introduction

Elasticsearch is a great open source tool for indexing many different types of content and providing a fast search capability.  I have been working with version 5.3 (on a CentOS 7 virtual machine) to build a tool to migrate or search NSF files using Elasticsearch as a NoSQL data store.  The information provided in this post can be used to get you started indexing common files from any source.

One important feature is the ability to index attachments.  This post walks through the steps needed to get attachment indexing working with the current ingest-attachment plugin, along with some of its current limitations.  The previous mapper-attachments plugin was deprecated in version 5.0.0.

Supported File Formats

The ingest-attachment plugin uses the Apache Tika content analysis toolkit to extract text from each file as it is processed.  The supported file formats are listed in this source module on GitHub.  There are eleven popular formats, including PDF, HTML, XLS, and PPT.  Notably, the .eml file format (for mail messages) is not supported in this release, even though there is a Tika parser available for that format.

Installation

First you need to download and install Elasticsearch.  Then follow the instructions to install the ingest-attachment plugin.  There is also a Docker image available, which I will explore in a later post.

Note that the command to install the plugin differs depending on your installation configuration. The script that you need to run (bin/elasticsearch-plugin) is relative to where you installed Elasticsearch.  On CentOS the default location is /usr/share/elasticsearch/bin/elasticsearch-plugin.
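
For example, on a CentOS 7 install from the RPM package, installing the plugin and restarting Elasticsearch looks something like this (adjust the paths for your own setup):

# install the plugin, then restart Elasticsearch so it is loaded
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment
sudo systemctl restart elasticsearch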

In the examples below my Elasticsearch installation is listening at http://localhost:9200.
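
You can confirm that the node is up (and check its version) with a quick GET:

curl 'localhost:9200?pretty'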

Set up the pipeline

Next, set up a pipeline to process the attachment data.  This is really a configuration step and only needs to be done once.  It is done with an HTTP PUT:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment content",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
'
You can get the current list of pipelines using an HTTP GET to verify it is set up:

curl -XGET 'localhost:9200/_ingest/pipeline'

In the above example, "data" is the name of the field that will be treated as attachment data.  Setting "indexed_chars" to -1 allows the entire file to be indexed (which can be resource intensive).  There are other options available.  When you PUT your document content as JSON, the value of the data field is the Base64-encoded content of your file.  It's also possible to avoid Base64 encoding the file by using the CBOR format, which I will explore in another post.
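
To give a rough idea of those other options, the processor also accepts a target_field (where the extracted data is stored, "attachment" by default) and a properties list that limits which metadata fields Tika extracts.  A trimmed-down pipeline might look something like this (the pipeline name attachment-slim is just an example; check the plugin documentation for the full option list):

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment-slim' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract selected attachment properties",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "target_field" : "attachment",
        "indexed_chars" : -1,
        "properties" : [ "content", "content_type", "language" ]
      }
    }
  ]
}
'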

Index and search for a file

As an example, suppose you want to search the contents of a text file named sampleattachment.txt.
To create the file:

echo "I like to go on the Pelham Parkway to cross the Bronx." > sampleattachment.txt

To add the content of the file to an index named "myindex", with a type named "media" and an entry ID of "99", you can use this bash script:

#!/bin/bash
filePath='sampleattachment.txt'
# Base64 encode the file contents on a single line (no wrapping)
b64encoding=$(base64 --wrap=0 "$filePath")
curl -XPUT 'localhost:9200/myindex/media/99?pipeline=attachment' -H 'Content-Type: application/json' -d "
{
\"data\" : \"$b64encoding\"
}
"
where we refer to the attachment pipeline that we set up above.
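
You can also fetch the document directly by its ID to confirm what was stored:

curl -XGET 'localhost:9200/myindex/media/99?pretty'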

Next verify that the content is indexed by performing a search:

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
"query" : { "query_string" : { "query" : "Pelham" } }
}
'
This searches for the word "Pelham" and returns the matching results.

Here are the JSON-formatted search results (which you can format with any number of JavaScript-based frameworks).  You can see that Elasticsearch has recognized the content_type as a plain-text attachment.

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25124598,
    "hits" : [
      {
        "_index" : "myindex",
        "_type" : "media",
        "_id" : "99",
        "_score" : 0.25124598,
        "_source" : {
          "data" : "SSBsaWtlIHRvIGdvIG9uIHRoZSBQZWxoYW0gUGFya3dheSB0byBjcm9zcyB0aGUgQnJvbnguCg==",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "language" : "en",
            "content" : "I like to go on the Pelham Parkway to cross the Bronx.",
            "content_length" : 56
          }
        }
      }
    ]
  }
}

Filtering Returned Fields

You can filter which fields are returned in the search results by using a source includes or excludes filter.  For example, if you want to return only the attachment fields, without the extracted content (and without the data field):

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
  "_source" : {
    "includes" : [ "attachment.*" ],
    "excludes" : [ "attachment.content" ]
  },
  "query" : { "query_string" : { "query" : "Pelham" } }
}
'
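
Alternatively, if you want everything except the raw Base64 data, an excludes filter on its own works as well; for example:

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
  "_source" : { "excludes" : [ "data" ] },
  "query" : { "query_string" : { "query" : "Pelham" } }
}
'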