HANA fuzzy text search

Since connecting to HANA from MicroStrategy ran quite smoothly, I was interested in whether some of the more extensive HANA capabilities could be used through MicroStrategy.

After sitting in on sessions at TechEd in Amsterdam, my interest was piqued by HANA’s text capabilities. I had recently worked on a mobile dashboard where fuzzy search would have been a great feature.

First, I used HANA Studio to create a table with three columns (file_name, file_path, content), with content having the BLOB data type.

The next step was of course to get text data into HANA. As I wanted to limit the scope, I chose not to go the Data Services route. Instead, I found a post mentioning that Python could be used.

After some problems with the conversion to plain text, I found a tool called PDFMiner.

https://pypi.python.org/pypi/pdfminer/

It can convert a PDF to plain text. So I came up with the following script to load all PDFs in a folder as plain text into the HANA table.

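import os
import subprocess

import pyodbc

try:
    import tkFileDialog as filedialog  # Python 2
except ImportError:
    from tkinter import filedialog  # Python 3

# Ask for the folder that holds the PDFs and work from there.
path = filedialog.askdirectory(title='Please select a directory')
os.chdir(path)

# Open one connection for the whole run instead of one per file.
cnxn = pyodbc.connect('DSN=Hana;UID=USER;PWD=PASSWORD')
cursor = cnxn.cursor()

for files in os.listdir("."):
    if files.endswith(".pdf"):
        file_path = os.path.abspath(files)
        file_name = files

        # Convert the PDF to plain text with PDFMiner's pdf2txt.py script.
        output = subprocess.Popen(["PDF2TXT.py", file_path], shell=True,
                                  cwd=r'c:\Python27\scripts', stdout=subprocess.PIPE)
        content = output.communicate()[0]

        cursor.execute("insert into TECH_ED_SESSIONS VALUES (?,?,?)",
                       (file_name, file_path, content))
        cnxn.commit()

cnxn.close()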
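
The loading step can also be split into two small functions, which makes it easier to try out without a HANA system at hand. The following is only a minimal sketch under a few assumptions: collect_rows and load_rows are hypothetical helper names, convert stands for any callable that turns a PDF path into plain text (for example a wrapper around the pdf2txt.py call), and sqlite3 is imported purely as a local stand-in for the pyodbc connection. Both expose the same cursor(), executemany() and commit() calls.

```python
import os
import sqlite3  # local stand-in for the pyodbc HANA connection (assumption)


def collect_rows(folder, convert):
    """Return (file_name, file_path, content) tuples for every PDF in folder.

    `convert` is a hypothetical callable that turns a PDF path into plain text.
    """
    rows = []
    for name in sorted(os.listdir(folder)):
        # Match the extension case-insensitively so "A.PDF" is picked up too.
        if name.lower().endswith(".pdf"):
            path = os.path.abspath(os.path.join(folder, name))
            rows.append((name, path, convert(path)))
    return rows


def load_rows(cnxn, rows):
    """Insert all rows with one parameterized statement and a single commit."""
    cursor = cnxn.cursor()
    cursor.executemany("insert into TECH_ED_SESSIONS values (?, ?, ?)", rows)
    cnxn.commit()
    return len(rows)
```

With a real connection, the call would look like load_rows(pyodbc.connect(...), collect_rows(path, convert)), so that all files are committed in one go.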
 

In HANA, a normal column is fuzzy searchable, but for a column with the BLOB data type a full-text index needs to be created first.

 

CREATE FULLTEXT INDEX FTI_TECH_ED_SESSIONS ON "SWERY"."TECH_ED_SESSIONS"("CONTENT")
FUZZY SEARCH INDEX ON;

 

After that, the fuzzy search on the uploaded plain-text content ran without problems. It returned the file I had just loaded, even though "HANO" does not appear in the PDF, while HANA is mentioned frequently.

 

SELECT FILE_NAME, FILE_PATH FROM "SWERY"."TECH_ED_SESSIONS" WHERE CONTAINS(CONTENT, 'HANO', FUZZY(0.8));
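
When the search term comes from a user, as with a prompt, it is safer not to paste it into the statement by hand. The sketch below builds the same kind of CONTAINS query from Python. build_fuzzy_query is a hypothetical helper name, and the sketch assumes the driver accepts a ? bind parameter for the search term inside CONTAINS, while the fuzzy threshold is written into the statement as a checked number.

```python
def build_fuzzy_query(threshold=0.8):
    """Return the fuzzy search statement with a `?` placeholder for the search term."""
    score = float(threshold)
    # The fuzzy score is a similarity between 0 and 1; reject anything else.
    if not 0.0 <= score <= 1.0:
        raise ValueError("fuzzy threshold must be between 0 and 1")
    return (
        'select FILE_NAME from "SWERY"."TECH_ED_SESSIONS" '
        "where CONTAINS(CONTENT, ?, FUZZY({0}))".format(score)
    )


# Intended use with an open pyodbc cursor (not executed here):
# cursor.execute(build_fuzzy_query(0.8), ("HANO",))
```

Lowering the threshold makes the search more tolerant of typos, at the cost of returning more loosely related matches.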


I then used the SQL statement in a Freeform SQL report in MicroStrategy, with a value prompt for entering the text to search for. The MicroStrategy report behaved exactly as expected.

 

N.B. with thanks to Scott Wery

This article belongs to
Tags
  • Microstrategy
  • SAP HANA
Author
  • Just Blogger