Indexing for Text Documents

Introduction

As data increasing exponentially day by day by using personal computer(PC),smart phones,Global Positioning System (GPS),sensors,radio frequency identification (RFID),monitoring devices or mainly due to Social network or application(video streaming mainly) According to Zhou , in the end of 2015 7.92 zetta bytes and predicted at the time of 2020 number will increase to 35 zetta bytes data[1]. Therefore, we get inefficient result when we search for result and query or retrieval information. We can solve this problem with indexing, there are many indexing methods on different domain and different data type, Some important indexing techniques are B-tree,R-tree etc.

Data indexing

Increasing complex data is an challenging for retrieval information so here we have a solution indexing for data,and data indexing can be on tags,names and subjects or fragment of the data[1]. According to the grouping data or related data, there will be considerably decrease in retrieval time for the searching query. That’s why for retrieve data suitable indexing is required.

Benefits from indexing

  • High throughput for data processing.
  • Decrease access time for queries.
  • Reliability due to data replication.
  • Scalability.
  • Availability.

Indexing categories

  • Artificial intelligence (AI)
  • Non Artificial intelligence (NAI)

Artificial Intelligence Approach

In the Artificial Intelligence indexing on data item by seeing pattern,categorization but it take more time in information retrieval and sometimes inefficient as compared to Non Artificial Intelligence Two popular artificial Intelligence Indexing Techniques:-

  • Latent Semantic Indexing
  • Hidden Markov Model
Latent semantic indexing

Latent semantic indexing (LSI) indexing is based on identify pattern between terms and semantic content of data and it is an mathematical approach based on Singular Value Decomposition (SVD)[4].

LSI use Resource Description Framework standard for web resource and also tags by adding information about speech,noun,verb,adjective etc. Text or document assigned to categories according to contextual similarities and set of text compare with example document and categories on the basis of matching document. It support textual data only and can be used on web content such as images,audio etc. that can be converted into text format. There are some challenges for LSI in case of inverted indexing on keyword queries.

Hidden Markov Indexing

Hidden Markov indexing is based on Hidden Markov model, and its working of indexing basis on the states and transition between them. In HMI or HMM Future states predicted based on the present states or current state and it is independent of past states,states depend on query and they are stored in advance[1]. For example, Storing motion data used by Robb and it is calculated or predicting motion by acceleration and it is related with the position and the force so our next motion totally depends on the positional argument and force argument. HMM or LSI indexing on storing data document or word so it give quicker search results.

Non Artificial Intelligence Approach

In Non Artificial Intelligence, indexing does not depend on the meaning of data but depends on the most queried data or searched for particular data set. Some Non Artificial Intelligence indexing approaches are:

  • Tree Indexed Strategy
  • Hash Indexed Strategy
  • Custom Indexed Strategy
  • Inverted Indexing Strategy

Tree Indexed Strategy

There are various type of structures in tree base indexing strategy.The retrieval of details done in sorted order in case of tree based indexing[1].

B-Tree : B tree is a one dimensional access method having nodes and pointers accessing is done linearly

R-Tree: R tree is 2-3 dimensional access method and more scalable

X-Tree: X tree can be used in multi dimensional.

One common drawback of all these tree based indexing is consuming more memory space

Hash Indexed Strategy

In Hash based indexing we maintain a key value pair that’s why is used for password matching and dictionary[2]. In Big data hashing used for retrieving the similar data items. Hashing works fine with limited size of key as size of key increases computational head also increases.

Custom Indexed Strategy

It’s also tree Structure based indexing. There are two type of custom indexing strategies Generalized Search Tree (GiST), Generalized Inverted Indexed. In GiST we create some Custom or arbitrary fields with the help of these fields This support indexing on 1 dimensional data as well as multi dimensional or spatial data. Gin also have custom or arbitrary fields.It’s like B-tree Indexed which compare entries to the posting trees. due to this Gin queries are faster.

Inverted Indexing Strategy

Inverted Indexing use full stack search or key word search. the inverted indexed made up of all unique words which appeared in the document this strategy consume less space[3]. There are many challenging in types of indexing this take more time in data processing, search space is also limited and due to the synonyms it can give wrong answer

Conclusion

We analyze different type of indexing strategies these strategies can be used to solved big data problem of indexing but choosing an appropriate is a difficult task because due to data distribution these strategies gives different results.

Refrences

[1].Fatima Binta Adamu1, Adib Habbal1, Suhaidi Hassan1, R. Les Cottrell2, Bebo White2, Ibrahim Abdullahi"A Survey On Big Data Indexing Strategies" https://d1b10bmlvqabco.cloudfront.net/attach/jyimpiy46823oq/jyikyqslltv3nk/k1qgbfjqu7mc/slacpub16460.pdf

[2].“Hash-Based Indexes " http://www.csbio.unc.edu/mcmillan/Media/Comp521F12Lecture15.pdf

[3]. “Inverted Indexing for Text Retrieval” http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch4.pdf

[4].“Diffusion of latent semantic analysis as a research tool: A social network analysis approach” https://www.sciencedirect.com/science/article/abs/pii/S1751157709000819