Wednesday, April 5, 2017

Indexing Attachments with Elasticsearch

Introduction

Elasticsearch is a great open source tool for indexing many different types of content and providing a fast search capability.  I have been working with version 5.3 (on a CentOS 7 virtual machine) to build a tool to migrate or search NSF files using Elasticsearch as a NoSQL data store.  The information provided in this post can be used to get you started indexing common files from any source.

One important feature is the ability to index attachments.  This post walks through the steps needed to get this working with the latest ingest-attachment plugin and covers some of the plugin's current limitations.  The previous mapper-attachments plugin was deprecated in version 5.0.0.

Supported File Formats

The ingest-attachment plugin uses the Apache Tika content analysis toolkit to extract text from each file as it is processed.  The file formats supported are shown in this source module on GitHub.  There are eleven popular file formats including PDF, HTML, XLS and PPT.  Notably, the .eml file format (for mail messages) is not supported in this release even though there is a Tika parser available for that format.

Installation

First you need to download and install Elasticsearch.  Then you need to follow the instructions to install the ingest-attachment plugin.  There is also a Docker image available; I will explore its use in a later post.

Note that the command to install the plugin differs depending on your installation configuration.  The script that you need to run (bin/elasticsearch-plugin) is relative to where you installed Elasticsearch.  On CentOS the default location is /usr/share/elasticsearch/bin/elasticsearch-plugin.

In the examples below my Elasticsearch installation is listening at http://localhost:9200.

Set up the pipeline

Next, set up a pipeline to process the attachment data.  This is really a configuration step and only needs to be done once.  This is done by using HTTP PUT:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -H 'Content-Type: application/json' -d'
{
"description" : "Extract attachment content", 
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : -1
}
}
]
}
'
You can get the current list of pipelines using HTTP GET to verify that it is set up:

curl -XGET 'localhost:9200/_ingest/pipeline'

In the above example "data" is the name of the field that will be treated as attachment data.  Setting "indexed_chars" to -1 allows the entire file to be indexed (which can be resource intensive).  There are other options available.  When you PUT your document content as JSON, the value for the data field is the Base64 encoded content from your file.  It's also possible to avoid Base64 encoding the file by using the CBOR format, which I will explore in another post.
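
As a rough sketch of the encoding step described above, the helper below builds the JSON body for a single file.  The function name build_attachment_payload is hypothetical (it is not part of Elasticsearch), and the sketch assumes a base64 command is on the path.  Newlines are stripped with tr instead of a wrap option so that the same line should work with both GNU and BSD versions of base64.

```shell
#!/bin/bash
# Print a JSON request body whose "data" field holds the Base64 form of a file.
# build_attachment_payload is a hypothetical helper name, not part of Elasticsearch.
build_attachment_payload() {
  local file_path="$1"
  local encoded
  # Remove line breaks so the encoded text is a single line inside the JSON string.
  encoded=$(base64 < "$file_path" | tr -d '\n')
  printf '{"data":"%s"}\n' "$encoded"
}
```

The output can be piped straight to curl by passing -d @- so that the request body is read from standard input.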

Index and search for a file

As an example, suppose you want to search the contents of a text file named sampleattachment.txt.
To create the file:

echo "I like to go on the Pelham Parkway to cross the Bronx." > sampleattachment.txt

To add the content of the file to an index named "myindex", with a type named "media" and an entry ID of "99", you can use this bash script:

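#!/bin/bash
# Base64-encode the file on a single line (--wrap=0 is a GNU coreutils option)
filePath='sampleattachment.txt'
b64encoding=$(base64 --wrap=0 "$filePath")
# Index the document through the attachment pipeline
curl -XPUT 'localhost:9200/myindex/media/99?pipeline=attachment' -H 'Content-Type: application/json' -d "
{
\"data\" : \"$b64encoding\"
}
"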
The pipeline=attachment query parameter refers to the attachment pipeline that we set up above.

Next verify that the content is indexed by performing a search:

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
"query" : { "query_string" : { "query" : "Pelham" } }
}
'
This returns the results of a search for the word "Pelham".

Here are the JSON formatted search results (which you can format with any number of JavaScript-based frameworks).  You can see that Elasticsearch has recognized the content_type as a text attachment.

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25124598,
    "hits" : [
      {
        "_index" : "myindex",
        "_type" : "media",
        "_id" : "99",
        "_score" : 0.25124598,
        "_source" : {
          "data" : "SSBsaWtlIHRvIGdvIG9uIHRoZSBQZWxoYW0gUGFya3dheSB0byBjcm9zcyB0aGUgQnJvbnguCg==",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "language" : "en",
            "content" : "I like to go on the Pelham Parkway to cross the Bronx.",
            "content_length" : 56
          }
        }
      }
    ]
  }
}
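
To confirm that the stored data field really is the original file, the Base64 value can be decoded again.  This is a small sketch: decode_attachment_data is a hypothetical helper name, and it assumes a base64 command that accepts the --decode option.

```shell
#!/bin/bash
# Decode a Base64 string (such as the "data" value in a search hit) to stdout.
# decode_attachment_data is a hypothetical helper name used for illustration.
decode_attachment_data() {
  printf '%s' "$1" | base64 --decode
}
```

Passing the data value from the result above to decode_attachment_data should print the sentence that was written to sampleattachment.txt.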

Filtering Returned Fields

You can filter which fields are returned in the search results by using a source include or exclude filter.  For example, if we want to return only the attachment fields without the content (and exclude the data field):

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
"_source" : {
"includes" : [ "attachment.*" ],
"excludes" : [ "attachment.content" ]
},
"query" : { "query_string" : { "query" : "Pelham" } }
}
'



Thursday, October 6, 2016

High Physical Memory Usage Issue

Background

I investigated an issue where a new Hyper-V virtual machine running Windows 7 would consume most of the physical memory after exactly 5 minutes of uptime.  This would occur even if no applications were running.

Investigation

I tried using the Windows Task Manager "Processes" tab to look at the memory being used but none of the processes listed (mostly services) had anywhere close to the amount of physical memory (8 GB) allocated.

After some initial searching I found this great SysInternals utility called RAMMap: https://technet.microsoft.com/en-us/sysinternals/rammap.aspx

Running the RAMMap utility indicated that most of the memory was "Driver Locked".  Using a Google search, I found this post: https://social.technet.microsoft.com/Forums/office/en-US/d4f97391-a70c-47b1-ab05-bab4754868ac/hyperv-dynamic-memory-driver-locked?forum=winserverhyperv

I found that the Windows 7 virtual machine was configured to use "Dynamic Memory".

Solution 

After shutting down the virtual machine, I unchecked the "Enable Dynamic Memory" option in the Memory Settings for the virtual machine and set the startup memory to my fixed size.  After restarting the virtual machine, I found that the physical memory usage no longer grew after 5 minutes.

Monday, October 3, 2016

Useful TypeScript Links

TypeScript is a strongly typed open source language which compiles into JavaScript. TypeScript was originally developed by Microsoft.  The language supports interfaces, classes (including inheritance), generics and modules.

Using TypeScript enables the developer to validate the code contracts at design/compile time instead of waiting until the code is executed in the browser.  This reduces your development and testing costs and results in a more reliable site.

This YouTube video provides a good introduction to the language including how to integrate jQuery with TypeScript: Getting Started with TypeScript

There is a browser-based playground at: http://www.typescriptlang.org/Playground which you can use to try out the language.

Many TypeScript type definitions (which are very useful when incorporating other JavaScript frameworks such as jQuery) are available on GitHub: https://github.com/DefinitelyTyped/DefinitelyTyped




Wednesday, February 3, 2016

Debugging NFS (Network File System) connections

Introduction


There are a number of useful command-line utilities for debugging Network File System (NFS) connections.  These utilities are available on most Linux and other Unix-variant operating systems.

rpcinfo

This utility can be used by client computers to find out what services and protocols are supported by a given server.  This is a good starting point to find out if the NFS services are enabled on a given computer.  There are three services typically needed for NFSv3 connections: portmapper, mount and nfs.  The man pages (man rpcinfo) will provide more information about the various options available.
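
As an illustration of that first check, the sketch below reads the listing printed by rpcinfo -p and reports which of the three services are absent.  The function name missing_nfs_services is hypothetical, and the sketch assumes the usual column layout in which the service name is the last field of each row (where the mount service appears as mountd).

```shell
#!/bin/bash
# Report which services needed for NFSv3 are absent from `rpcinfo -p` output.
# Reads the listing on stdin and prints one missing service name per line.
# missing_nfs_services is a hypothetical helper name used for illustration.
missing_nfs_services() {
  local listing service
  listing=$(cat)
  for service in portmapper mountd nfs; do
    # The service name is assumed to be the last column of each row.
    if ! printf '%s\n' "$listing" | awk '{ print $NF }' | grep -qx "$service"; then
      printf '%s\n' "$service"
    fi
  done
}
```

A typical call would pipe the live listing into the function, for example rpcinfo -p nfs-server.example.com | missing_nfs_services (the host name here is a placeholder).  No output means all three services were listed.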

showmount

This utility can be used to list what directories have been exported by a given NFS server.  See the man page for the command line options on your system.
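
For example, showmount -e followed by a server name prints an "Export list for ..." header and then one exported directory per line.  The sketch below checks that listing for a particular directory; is_exported is a hypothetical helper name, and it assumes the directory is the first column of each row.

```shell
#!/bin/bash
# Succeed (exit status 0) if the given directory appears in `showmount -e` output.
# Reads the listing on stdin; is_exported is a hypothetical helper name.
is_exported() {
  local wanted="$1"
  # Skip the "Export list for ..." header line, then compare the first column.
  awk -v wanted="$wanted" 'NR > 1 && $1 == wanted { found = 1 } END { exit found ? 0 : 1 }'
}
```

It could be used as showmount -e nfs-server.example.com | is_exported /export/home, where both the host name and the path are placeholders.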

Network Protocol Analyzers

There are a number of free utilities which can be used to analyze network transactions, including Wireshark and tcpdump.  These utilities allow the user to monitor network traffic between the client and the server and log it.  The analyzers can then be used to review the log to see what individual commands were sent from the client and how the server responded.
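
As one possible way to narrow a capture to NFS traffic, the sketch below builds a tcpdump filter expression for a single server.  The helper name nfs_capture_filter is hypothetical.  It assumes the standard NFS port (2049) and portmapper port (111); the mount service often uses a port assigned at run time, so it is not covered by this filter.

```shell
#!/bin/bash
# Build a tcpdump filter expression for NFS traffic to or from one server.
# Port 2049 is the standard NFS port and 111 is the portmapper port.
# nfs_capture_filter is a hypothetical helper name used for illustration.
nfs_capture_filter() {
  local server="$1"
  printf 'host %s and (port 2049 or port 111)\n' "$server"
}
```

The expression can be passed to tcpdump as its final argument, for example tcpdump -i eth0 -w nfs.pcap "$(nfs_capture_filter nfs-server.example.com)" to log the traffic to a file, and tcpdump -r nfs.pcap to review it later.  The interface name, file name and host name are placeholders.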



Sunday, February 24, 2013

undefined symbol: tdb_transaction_start_nonblock

When enabling Samba on an openSUSE instance, I received the above error when I tried to use:

net join

to join a domain.

Apparently there are some dependency issues.  To resolve this problem I used yast to find libtdb and install it and the error went away.

However, Samba then failed to start at boot.  I discovered this second error by looking in /var/log/samba/log.smbd:

/usr/sbin/smbd: symbol lookup error: /usr/sbin/smbd: undefined symbol: wbcSidsToUnixIds

I found that this is from libwbclient0, so I used yast to install it (version 3.6.3-115.1) and this second error went away.   You may have to first stop the nmb daemon using:

rcnmb stop


After rebooting I checked the status of both nmb and smb using the following commands:

rcsmb status
and
rcnmb status

and now both daemons are running.

Thursday, October 11, 2012

Viewing the .NET Finalizer Queue

In .NET, memory is managed via a garbage collector.  Objects that define a finalizer are placed on the "Finalizer Queue", and their memory is reclaimed only after a dedicated finalizer thread has run their finalizers.  Sometimes the queue can back up (the overall system is so busy that it can't release the items fast enough) and so you may need to come up with a new resource deallocation strategy.

In order to find the problematic objects, it's useful to look at the queue at certain points when the system is under load to see what can be reclaimed sooner by implementing the IDisposable interface and freeing those objects in your code (thereby avoiding having them processed by the queue).

There is a third-party tool that lets you view this information, but there is also an alternative way using Microsoft Visual Studio, described in this article by Tess Ferrandez: .Net finalizer memory leak debugging with sos dll in visual studio.

There were a couple of noteworthy gotchas:

  1. When you connect the debugger to the running executable, you need to ensure that Native debugging is turned on.
  2. The sos.dll extension is loaded from the "immediate" window, which is different from the "command" window.  To open an immediate window you can type immed into the command window.  From the immediate window you use the .load command highlighted in the article.

The best part about this tool is that there doesn't appear to be anything you need to install.  The sos.dll is always there with .NET 2.0 or later.