IIR 1.2 A first take at building an inverted index
Major steps:
- Collect the documents to be indexed.
- Tokenize the text, turning each document into a list of tokens.
- Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
- Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
The dictionary also records some statistics, such as the number of documents which contain each term.
It is assumed that each document has a unique serial number (docID): the first time a document shows up, it is given the next number. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of (term, docID) pairs.
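The steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the whitespace tokenizer, the lowercase-and-strip-punctuation normalizer, the function names, and the two toy documents are all made up for this note and are not taken from the book.

```python
from collections import defaultdict


def tokenize(text):
    """Split text into tokens on whitespace (a deliberately naive tokenizer)."""
    return text.split()


def normalize(token):
    """Lowercase and strip surrounding punctuation.

    This is a stand-in for real linguistic preprocessing.
    """
    return token.lower().strip(".,;:!?\"'()")


def build_inverted_index(docs):
    """Build {term: sorted list of docIDs} from {docID: text}."""
    # Tokenize and normalize, producing (term, docID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in tokenize(text):
            term = normalize(token)
            if term:
                pairs.append((term, doc_id))

    # Sort the pairs, drop duplicates, and group docIDs by term.
    index = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        index[term].append(doc_id)
    return dict(index)


def document_frequency(index):
    """Number of documents containing each term (a dictionary statistic)."""
    return {term: len(postings) for term, postings in index.items()}


# Hypothetical toy collection; docIDs are assigned in order of arrival.
docs = {
    1: "The quick brown fox.",
    2: "The lazy dog; the quick cat.",
}
index = build_inverted_index(docs)
df = document_frequency(index)
```

Here `index` plays the role of dictionary plus postings (each term maps to its sorted docID list), and `df` holds the per-term document counts that the dictionary would record.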
Exercise 1.1 [⋆]
Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise
Exercise 1.2 [⋆]
Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
1. Draw the term-document incidence matrix for this document collection
Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia → 2 → 31 → 54 → 101
Intersection ⇒ 2 → 31
Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
2. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Exercise 1.3 [⋆]
For the document collection shown in Exercise 1.2, what are the returned results for these queries:
1. schizophrenia AND drug
2. for AND NOT(drug OR approach)
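An AND query is answered by intersecting the postings lists of its terms, as Figure 1.5 illustrates for Brutus and Calpurnia. A minimal sketch of that intersection as a linear merge over two sorted lists follows; the function name and the use of plain Python lists are assumptions made for this note.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i, j = 0, 0
    # Walk both lists in step, advancing the pointer at the smaller docID.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


# Postings lists as shown in Figure 1.5.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
result = intersect(brutus, calpurnia)  # [2, 31]
```

Because both lists are sorted, each pointer only moves forward, so the merge takes time linear in the combined length of the two lists.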
IIR Chapter 2 The term vocabulary and postings lists
Determining the vocabulary of terms is an important step, and it is also important to note that the resulting postings lists differ from language to language. The chapter also covers how character-level processing varies, since different languages call for different ways of processing their text.
IIR Chapter 3 Dictionaries and tolerant retrieval
The search structure used for the dictionary plays a key role here, and algorithmic techniques can help improve the efficiency of search. Different kinds of queries can be answered by using indexes. During retrieval, the spelling of a query might be incorrect, which could affect the search, so the search system should be able to detect or autocorrect the word or phrase that the user intended to search for.
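One common way to detect the intended word is to compare the misspelled query term against the vocabulary using edit distance. The sketch below is an illustration only: the notes above do not specify a method, so the choice of Levenshtein distance, the `max_distance` threshold, the function names, and the toy vocabulary are all assumptions.

```python
def edit_distance(s, t):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(query_term, vocabulary, max_distance=2):
    """Return the closest vocabulary term, or None if nothing is close enough."""
    if query_term in vocabulary:
        return query_term
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary,
               key=lambda term: (edit_distance(query_term, term), term),
               default=None)
    if best is not None and edit_distance(query_term, best) <= max_distance:
        return best
    return None


# Hypothetical vocabulary for illustration.
vocabulary = {"spelling", "search", "retrieval", "index"}
correction = suggest("speloing", vocabulary)  # "spelling"
```

Scanning the whole vocabulary like this is only practical for small dictionaries; it is meant to show the idea, not an efficient implementation.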