28 October 2013

IR: simulating a term-document incidence matrix and an inverted index

(a). Write a program to create the term-document incidence matrix for a document collection. The output of this part is a file containing the binary term-by-document matrix, with headings for the rows (index terms) and the columns (DocIDs).
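
A minimal sketch of part (a) in Python, offered only as an illustration and not as the required solution. The function names `tokenize`, `incidence_matrix` and `format_matrix`, the sample documents, and the tokenization rule (lowercased runs of letters and digits) are all assumptions; the assignment does not prescribe them.

```python
import re


def tokenize(text):
    # Assumption: a token is a run of ASCII letters or digits, lowercased.
    return re.findall(r"[a-z0-9]+", text.lower())


def incidence_matrix(docs):
    """Build a binary term-document matrix.

    docs maps DocID -> document text.
    Returns (terms, doc_ids, rows), where rows[i][j] is 1 if
    terms[i] occurs in document doc_ids[j], and 0 otherwise.
    """
    doc_ids = sorted(docs)
    vocab = {doc_id: set(tokenize(docs[doc_id])) for doc_id in doc_ids}
    terms = sorted(set().union(*vocab.values()))
    rows = [[1 if term in vocab[d] else 0 for d in doc_ids] for term in terms]
    return terms, doc_ids, rows


def format_matrix(terms, doc_ids, rows):
    # Tab-separated text: a header row of DocIDs, then one row per term.
    lines = ["term\t" + "\t".join(str(d) for d in doc_ids)]
    for term, row in zip(terms, rows):
        lines.append(term + "\t" + "\t".join(str(bit) for bit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical three-document collection, used only for the demo.
    docs = {
        1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july",
    }
    print(format_matrix(*incidence_matrix(docs)))
```

The string returned by `format_matrix` can be written to a text file to produce the matrix file the task asks for. Each document is reduced to a set of terms before the matrix is built, so repeated words in a document still yield a single 1.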

(b). Write a program to generate the inverted index (consisting of a dictionary and postings) for the input document collection. There are four main steps in creating an inverted index:
 1) Collect the documents to be indexed.
 2) Tokenize the text, turning each document into a list of tokens.
 3) Do linguistic preprocessing: token normalization and stemming.
 4) Create an inverted index, consisting of a dictionary and postings.
You need not perform normalization and stemming in step 3, but you should remove stop words from the token list. The dictionary and postings are the final output, which can be in any form (binary file, text file, screen display, and so on).
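
The four steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the required solution: the names `STOP_WORDS`, `tokenize`, `build_inverted_index` and `format_index` are hypothetical, the stop list is a tiny made-up sample, and lowercasing is done only so that tokens match the stop list (the task does not require normalization).

```python
import re
from collections import defaultdict

# Assumption: a small illustrative stop list; the task does not supply one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    # Step 2: split a document into tokens (lowercased letter/digit runs).
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Build an inverted index from docs (DocID -> text).

    Returns a dict whose keys are the dictionary terms in sorted order
    and whose values are sorted postings lists of DocIDs.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():      # step 1: the collected documents
        for token in tokenize(text):       # step 2: tokenize
            if token not in STOP_WORDS:    # step 3: stop-word removal only
                postings[token].add(doc_id)
    # Step 4: sorted dictionary, each term with a sorted postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def format_index(index):
    # One line per term: term (document frequency) -> postings.
    return "\n".join(
        f"{term} ({len(ids)}) -> {', '.join(str(i) for i in ids)}"
        for term, ids in index.items()
    )


if __name__ == "__main__":
    # Hypothetical collection, used only for the demo.
    docs = {
        1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july",
    }
    print(format_index(build_inverted_index(docs)))
```

Postings are collected in sets so that a term repeated within one document adds its DocID only once, and they are sorted at the end because sorted postings lists are what make merging lists for Boolean queries straightforward.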

* php
