Jonathan Griep

'pip' is not recognized as an internal or external command, operable program or batch file.

2018-07-28T12:10:00.000-07:00

Windows fails to find pip after installing python

After installing Python and adding the Python directory to your path environment variable, you open a cmd window and type the pip command to install a library:


c:\Users\myusername> pip install matplotlib

and you get back this error:


'pip' is not recognized as an internal or external command,
operable program or batch file.

Assuming that you have added the python directory to your path environment variable, you can execute pip this way:

c:\Users\myusername> python -m pip install matplotlib

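This works because "python -m pip" runs pip as a module through the python executable, which is already on your path. The pip command itself is usually located in the "Scripts" folder directly under the folder where python is installed, and that folder is not on your path by default.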

Alternatively you can add the Scripts directory to your path environment variable.
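To check whether the Scripts directory is on your path, a minimal Python sketch like the one below may help. It relies on the standard library sysconfig module to report where the running interpreter installs scripts such as pip. The function name scripts_dir_on_path is my own and is only for illustration.

```python
import os
import sysconfig


def scripts_dir_on_path(path_value=None):
    """Return (scripts_dir, is_on_path) for the running interpreter."""
    # sysconfig reports where this interpreter installs console scripts such as pip.
    scripts_dir = sysconfig.get_path("scripts")
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    # Normalize entries so case and trailing separators do not matter on Windows.
    wanted = os.path.normcase(os.path.normpath(scripts_dir))
    entries = [
        os.path.normcase(os.path.normpath(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return scripts_dir, wanted in entries


if __name__ == "__main__":
    directory, found = scripts_dir_on_path()
    print(f"Scripts directory: {directory}")
    print("On PATH" if found else "Not on PATH; add it or use 'python -m pip'")
```

Run it with the same python executable that you use for "python -m pip" so that it reports the matching Scripts directory.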

Indexing Attachments with Elasticsearch

2017-04-05T10:54:00.001-07:00

Introduction

Elasticsearch is a great open source tool for indexing many different types of content and providing a fast search capability. I have been working with version 5.3 (on a CentOS 7 virtual machine) to build a tool to migrate or search NSF files using Elasticsearch as a NoSQL data store. The information provided in this post can be used to get you started indexing common files from any source.

One important feature is the ability to index attachments. This post walks through the steps needed to get this working with the ingest-attachment plugin and covers some of the plugin's current limitations. The previous mapper-attachments plugin was deprecated in version 5.0.0.

Supported File Formats

The ingest-attachment plugin uses the Apache Tika content analysis toolkit to extract text from each file as it is processed. The supported file formats are listed in this source module on GitHub. There are eleven popular file formats, including PDF, HTML, XLS and PPT. Notably, the .eml file format (for mail messages) is not supported in this release even though there is a Tika parser available for that format.

Installation

First you need to download and install Elasticsearch. Then you need to follow the instructions to install the ingest-attachment plugin. There is also a Docker image available; I will explore its use in a later post.

Note that the command to install the plugin is different depending on your installation configuration. The script that you need to run (bin/elasticsearch-plugin) is relative to where you installed elasticsearch. On CentOS the default location is /usr/share/elasticsearch/bin/elasticsearch-plugin.

In the examples below my Elasticsearch installation is listening at http://localhost:9200.

Set up the pipeline

Next, set up a pipeline to process the attachment data. This is really a configuration step and only needs to be done once. It is done using an HTTP PUT:

curl -XPUT 'localhost:9200/_ingest/pipeline/attachment' -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment content",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}'

You can get the current list of pipelines using an HTTP GET to verify that it is set up:

curl -XGET 'localhost:9200/_ingest/pipeline'

In the above example "data" is the name of the field that will be treated as attachment data. Setting "indexed_chars" to -1 allows the entire file to be indexed (which can be resource intensive). There are other options available. When you PUT your document content as JSON, the value for the data field is the Base64 encoded content of your file. It's also possible to avoid Base64 encoding the file by using the CBOR format, which I will explore in another post.
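If you would rather generate the pipeline definition than type it by hand, a small Python sketch along these lines may be useful. It only builds the JSON body for the PUT request described above; the function name build_attachment_pipeline and its parameters are my own and are not part of Elasticsearch.

```python
import json


def build_attachment_pipeline(field="data", indexed_chars=-1):
    """Build the body for PUT _ingest/pipeline/attachment as a Python dict."""
    return {
        "description": "Extract attachment content",
        "processors": [
            {
                "attachment": {
                    # Name of the document field holding the Base64 encoded file.
                    "field": field,
                    # -1 asks the plugin to index the entire file.
                    "indexed_chars": indexed_chars,
                }
            }
        ],
    }


if __name__ == "__main__":
    # json.dumps guarantees balanced braces and brackets in the request body.
    print(json.dumps(build_attachment_pipeline(), indent=2))
```

Building the body as a dict and serializing it avoids the unbalanced braces and quotes that are easy to introduce when editing JSON inside a shell command.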

Index and search for a file

As an example, suppose you want to search the contents of a text file named sampleattachment.txt.

To create the file:

echo "I like to go on the Pelham Parkway to cross the Bronx." > sampleattachment.txt

To add the content of the file to an index named "myindex" with a type named "media" and an entry id of "99", you can use this bash script:

#!/bin/bash

filePath='sampleattachment.txt'

# Base64 encode the file on a single line (--wrap=0 disables line wrapping)
b64encoding=$(base64 --wrap=0 "$filePath")

curl -XPUT 'localhost:9200/myindex/media/99?pipeline=attachment' -d "
{
  \"data\" : \"$b64encoding\"
}"

where we refer to the attachment pipeline that we set up above.
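The same indexing step can be sketched in Python using only the standard library. This is an illustration under the same assumptions as the script above (Elasticsearch at localhost:9200, index "myindex", type "media", id "99"); the function names build_document_body and index_file are my own.

```python
import base64
import json
import urllib.request


def build_document_body(raw_bytes):
    """Return the JSON document whose "data" field is the Base64 encoded file."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({"data": encoded})


def index_file(file_path,
               url="http://localhost:9200/myindex/media/99?pipeline=attachment"):
    """PUT the encoded file to Elasticsearch through the attachment pipeline."""
    with open(file_path, "rb") as handle:
        body = build_document_body(handle.read())
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running Elasticsearch instance with the attachment pipeline.
    print(index_file("sampleattachment.txt"))
```

Reading the file in binary mode matters here: the pipeline expects the Base64 encoding of the raw file bytes, not of decoded text.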

Next verify that the content is indexed by performing a search:

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
  "query" : { "query_string" : { "query" : "Pelham" } }
}'

This returns the results for a search for the word "Pelham".

Here are the JSON formatted search results (which you can render with any number of JavaScript based frameworks). You can see that Elasticsearch has recognized the content_type as a text attachment.

{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.25124598,
    "hits" : [
      {
        "_index" : "myindex",
        "_type" : "media",
        "_id" : "99",
        "_score" : 0.25124598,
        "_source" : {
          "data" : "SSBsaWtlIHRvIGdvIG9uIHRoZSBQZWxoYW0gUGFya3dheSB0byBjcm9zcyB0aGUgQnJvbnguCg==",
          "attachment" : {
            "content_type" : "text/plain; charset=ISO-8859-1",
            "language" : "en",
            "content" : "I like to go on the Pelham Parkway to cross the Bronx.",
            "content_length" : 56
          }
        }
      }
    ]
  }
}
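Once you have a response like this parsed into a Python dict, pulling out the attachment fields is a short loop. The sketch below assumes the response shape shown in this post (hits.hits[]._source.attachment); the function name summarize_hits and the trimmed SAMPLE_RESPONSE are for illustration only.

```python
import json

# A trimmed copy of the search response from this post, used as sample input.
SAMPLE_RESPONSE = json.loads("""
{
  "hits": {
    "total": 1,
    "hits": [
      {
        "_id": "99",
        "_score": 0.25124598,
        "_source": {
          "attachment": {
            "content_type": "text/plain; charset=ISO-8859-1",
            "language": "en",
            "content": "I like to go on the Pelham Parkway to cross the Bronx."
          }
        }
      }
    ]
  }
}
""")


def summarize_hits(response):
    """Return one small dict per hit with the fields extracted by the plugin."""
    summaries = []
    for hit in response.get("hits", {}).get("hits", []):
        # The attachment object may be missing if _source was filtered.
        attachment = hit.get("_source", {}).get("attachment", {})
        summaries.append({
            "id": hit.get("_id"),
            "score": hit.get("_score"),
            "content_type": attachment.get("content_type"),
            "content": attachment.get("content"),
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize_hits(SAMPLE_RESPONSE):
        print(summary["id"], summary["content_type"])
```

Using .get with defaults keeps the loop from failing on hits that lack an attachment object.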

Filtering Returned Fields

You can filter which fields are returned in the search results by using a source include or exclude filter. For example, to return only the attachment fields without the content (and exclude the data field):

curl 'localhost:9200/myindex/media/_search?pretty=true' -d '
{
  "_source" : {
    "includes" : [ "attachment.*" ],
    "excludes" : [ "attachment.content" ]
  },
  "query" : { "query_string" : { "query" : "Pelham" } }
}'
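The include and exclude filters take lists of field patterns, which is easy to get wrong when writing JSON by hand. Here is a small Python sketch that builds the search body; the function name build_search_body and its defaults are my own choices for this example.

```python
import json


def build_search_body(term,
                      includes=("attachment.*",),
                      excludes=("attachment.content",)):
    """Build a query_string search body with a _source include/exclude filter."""
    return {
        "_source": {
            # Both filters are JSON arrays of field name patterns.
            "includes": list(includes),
            "excludes": list(excludes),
        },
        "query": {"query_string": {"query": term}},
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("Pelham"), indent=2))
```

With these defaults the data field is dropped because it does not match any include pattern, and attachment.content is dropped by the exclude pattern.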

High Physical Memory Usage Issue

2016-10-06T07:18:00.000-07:00

Background

I investigated an issue where a new Hyper-V virtual machine running Windows 7 would consume most of the physical memory after exactly 5 minutes of uptime. This would occur even if no applications were running.

Investigation

I tried using the Windows Task Manager "Processes" tab to look at the memory being used but none of the processes listed (mostly services) had anywhere close to the amount of physical memory (8 GB) allocated.

After some initial searching I found this great SysInternals utility called RAMMap: https://technet.microsoft.com/en-us/sysinternals/rammap.aspx

Running the RAMMap utility indicated that most of the memory was "Driver Locked". Using a Google search, I found this post: https://social.technet.microsoft.com/Forums/office/en-US/d4f97391-a70c-47b1-ab05-bab4754868ac/hyperv-dynamic-memory-driver-locked?forum=winserverhyperv

I found that the Windows 7 virtual machine was specified to use "Dynamic Memory".

Solution

After shutting down the virtual machine, I unchecked the "Enable Dynamic Memory" option in the Memory Settings for the virtual machine and set the startup memory to my fixed size. After restarting the virtual machine, I found that the physical memory usage no longer grew after 5 minutes.

Useful TypeScript Links

2016-10-03T13:46:00.000-07:00

TypeScript is a strongly typed open source language which compiles into JavaScript. TypeScript was originally developed by Microsoft. The language supports interfaces, classes (including inheritance), generics and modules.

Using TypeScript enables the developer to validate the code contracts at design/compile time instead of waiting until the code is executed in the browser. This reduces your development and testing costs and results in a more reliable site.

This YouTube video provides a good introduction to the language including how to integrate jQuery with TypeScript: Getting Started with TypeScript

There is a browser based playground at: http://www.typescriptlang.org/Playground which you can use to try out the language.

Many TypeScript type definitions (which are very useful when incorporating other JavaScript frameworks such as jQuery) are available on GitHub: https://github.com/DefinitelyTyped/DefinitelyTyped

Debugging Network File System (NFS) connections

2016-02-03T17:54:00.000-08:00

Introduction

There are a number of useful shell command line utilities that can be used to debug Network File System (NFS) connections. These utilities are available on most Linux and other Unix-variant operating systems.

rpcinfo

This utility can be used by client computers to find out what services and protocols are supported by a given server. This is a good starting point to find out if the NFS services are enabled on a given computer. There are three services typically needed for NFSv3 connections: portmapper, mount and nfs. The man pages (man rpcinfo) will provide more information about the various options available.
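If you check several servers, it can help to parse the rpcinfo output instead of reading it by eye. The Python sketch below assumes the usual five column layout printed by "rpcinfo -p" (program, version, protocol, port, service) and assumes the mount service is registered under the name "mountd"; both are assumptions to verify on your system, and the function names are my own.

```python
import subprocess

# Assumed service names as they typically appear in the last column.
REQUIRED_SERVICES = {"portmapper", "mountd", "nfs"}


def registered_services(rpcinfo_output):
    """Return the set of service names found in 'rpcinfo -p' style output."""
    services = set()
    for line in rpcinfo_output.splitlines():
        columns = line.split()
        # Data rows have five columns and start with the numeric program id;
        # the header row is skipped because its first column is not a number.
        if len(columns) == 5 and columns[0].isdigit():
            services.add(columns[4])
    return services


def missing_nfs_services(rpcinfo_output, required=REQUIRED_SERVICES):
    """Return the required services that are not registered, sorted by name."""
    return sorted(set(required) - registered_services(rpcinfo_output))


def query_server(host):
    """Run 'rpcinfo -p host' and return its standard output as text."""
    result = subprocess.run(
        ["rpcinfo", "-p", host], capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, missing_nfs_services(query_server("myserver")) returns an empty list when all three services are registered, where "myserver" is a placeholder host name.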

showmount

This utility can be used to list what directories have been exported by a given NFS server. See the man page for the command line options on your system.

Network Protocol Analyzers

There are a number of free utilities which can be used to analyze network transactions, such as Wireshark, tcpdump and others. These utilities allow the user to monitor network traffic between the client and the server and log it. The analyzers can then be used to review the log to see which individual commands were sent from the client and how the server responded.

CentOS

On CentOS, it is possible to enable logging using rpcdebug:

rpcdebug -m nfsd -s all

will send logging for the nfs server to /var/log/messages

and

rpcdebug -m nfsd -c all

will disable logging again. See "man rpcdebug" for more info.

If trying to debug a command like df:

strace df -h

will print out the system calls made while executing the command. To output to a file:

strace df -h 2> traceout.txt
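A trace file can get long, so a quick summary of which system calls were made is often a good first step. This Python sketch counts the call names in a trace, assuming each call is logged on its own line in the form name(arguments) = result. The function name count_syscalls is my own, and the sample lines in the usage note are made up for illustration.

```python
import re
from collections import Counter

# A system call line starts with the call name followed by "(".
SYSCALL_LINE = re.compile(r"^(\w+)\(")


def count_syscalls(trace_text):
    """Count how many times each system call name appears in strace output."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # traceout.txt is the file written by: strace df -h 2> traceout.txt
    with open("traceout.txt", encoding="utf-8", errors="replace") as handle:
        for name, total in count_syscalls(handle.read()).most_common(10):
            print(f"{total:6d}  {name}")
```

For an NFS problem, lines such as statfs("/mnt/share", ...) that appear with an error result or that are the last line before a hang are usually the ones to look at.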

undefined symbol: tdb_transaction_start_nonblock

2013-02-24T08:32:00.000-08:00

When enabling Samba on an openSUSE instance, I received the above error when I tried to use:

net join

to join a domain.

Apparently there are some dependency issues. To resolve this problem I used yast to find libtdb and install it and the error went away.

However, when I tried to start Samba on startup it failed. I discovered this second error by looking in /var/log/samba/log.smbd:

/var/sbin/smbd: symbol lookup error: /usr/sbin/smbd: undefined symbol: wbcSidsToUnixIds

I found that this is from libwbclient0, so I used yast to install it (version 3.6.3-115.1) and this second error went away. You may have to first stop the nmb daemon using:

rcnmb stop

After rebooting I checked the status of both nmb and smb using the following commands:

rcsmb status
and
rcnmb status

and now both daemons are running.

Viewing the .NET Finalizer Queue

2012-10-11T03:25:00.000-07:00

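In .NET, memory is managed by a garbage collector. Objects that define a finalizer are not released right away: the collector places them on the "Finalizer Queue", and a dedicated finalizer thread must run their finalizers before the memory can be reclaimed. Sometimes the queue can back up (the overall system is so busy that it can't release the items fast enough), so you may need to come up with a new resource deallocation strategy.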

To find the problematic objects, it's useful to look at the queue at certain points when the system is under load to see what can be reclaimed sooner by implementing the IDisposable interface and freeing those objects in your code (thereby avoiding having them processed by the queue).

There is a third party tool that lets you view this information, but there is also a way to do it using Microsoft Visual Studio, which is described in this article by Tess Ferrandez: .Net finalizer memory leak debugging with sos dll in visual studio.

There were a couple of noteworthy gotchas:

When you connect the debugger to the running executable, you need to ensure that Native debugging is turned on.
You load the sos.dll extension from the "immediate" window, which is different from the "command" window. To open an immediate window, you can type immed into the command window. From the immediate window, use the .load command highlighted in the article.

The best part about this approach is that there doesn't appear to be anything you need to install. The sos.dll is always there with .NET 2.0 or later.

0x1B1 - Version mismatch between executable and preexisting shared memory versions! EXITING.

2012-07-31T06:04:00.002-07:00

I recently did some debugging on this Lotus Notes issue and posted a response in the IBM dW forum:

http://www-10.lotus.com/ldd/nd85forum.nsf/5f27803bba85d8e285256bf10054620d/993bb695ac0e190785257a4c00441c76?OpenDocument&Highlight=0,tw%3F,notes,processes

Debugging Enterprise Vault 10 Indexing

2011-11-05T12:20:00.000-07:00

In debugging some indexing related issues with Symantec Enterprise Vault I found this link http://www.symantec.com/docs/TECH160420 which provides some useful information about debugging indexing.

I have found the EV Dtrace facility to be very useful for debugging in the past. For indexing, one of the tasks that generates output is EVIndexVolumesProcessor, so you should set it to generate verbose output while using Dtrace.

The Dtrace output was instrumental in leading me to this post: http://www.symantec.com/connect/forums/problems-indexing-after-upgrading#comment-5964601 The indexer was having trouble communicating with the StorageCrawler even though they were on the same machine. Disabling the firewall resolved the issue, so there must be a port that needs to be opened. More details as they become available.

About Me

2011-04-29T15:28:00.000-07:00

I run a software development company: JPG Consulting, Inc. We develop software on a contract basis as well as sell our own products on http://www.notesconnectors.com/.

This blog is going to be devoted to API issues we uncover as we do our development. Often we find the documentation and samples lacking and wind up doing a lot of digging on the web or code experimentation to find the solution.

I will try to label the relevant API on each post to make searching easier.