Unit 2 Reading

IIR 1.2 A first take at building an inverted index

Major steps:
1. Collect the documents to be indexed
2. Tokenize the text, turning each document into a list of tokens
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings

The dictionary also records some statistics, such as the number of documents that contain each term (the document frequency).

It is assumed that each document has a unique serial number (docID). The docID can be assigned when the document is first encountered during index construction. The input to indexing is a list of normalized tokens for each document, which we can equally think of as a list of (term, docID) pairs.
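The steps above can be sketched in a few lines of Python. This is only a minimal sketch, not the book's implementation: the function name build_inverted_index and the two sample documents are made up for illustration, and lowercasing plus whitespace splitting stands in for real tokenization and linguistic preprocessing.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build a toy inverted index.

    docs maps docID -> raw text (hypothetical input format).
    Returns a dict: term -> (document frequency, sorted list of docIDs).
    """
    # Steps 2-3: tokenize and normalize. Assumption: whitespace split
    # and lowercasing are the only preprocessing applied here.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))

    # Step 4: sort the (term, docID) pairs, drop duplicates, and group
    # the docIDs of each term into its postings list.
    postings = defaultdict(list)
    for term, doc_id in sorted(set(pairs)):
        postings[term].append(doc_id)

    # The dictionary entry also records the document frequency.
    return {term: (len(ids), ids) for term, ids in postings.items()}


# Hypothetical two-document collection, docIDs assigned in order of arrival.
docs = {1: "Caesar was ambitious", 2: "Brutus killed Caesar"}
index = build_inverted_index(docs)
print(index["caesar"])  # (2, [1, 2])
print(index["brutus"])  # (1, [2])
```

Sorting the pairs before grouping keeps every postings list ordered by docID, which is what makes the linear-time merge of two lists possible later.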

Exercise 1.1 [⋆]

Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)

Doc 1 new home sales top forecasts

Doc 2 home sales rise in july

Doc 3 increase in home sales in july

Doc 4 july new home sales rise

Exercise 1.2 [⋆]

Consider these documents:

Doc 1 breakthrough drug for schizophrenia

Doc 2 new schizophrenia drug

Doc 3 new approach for treatment of schizophrenia

Doc 4 new hopes for schizophrenia patients

1. Draw the term-document incidence matrix for this document collection.

2. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).

Brutus → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174

Calpurnia → 2 → 31 → 54 → 101

Intersection ⇒ 2 → 31

Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.
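The intersection in Figure 1.5 can be reproduced with a two-pointer merge over the sorted postings lists. The sketch below assumes each postings list is a plain Python list of docIDs in increasing order; the function name intersect is chosen here for illustration.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID appears in both lists: keep it, advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Advance the pointer that sits on the smaller docID.
            i += 1
        else:
            j += 1
    return answer


# Postings lists taken from Figure 1.5.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each pointer only moves forward, so the number of steps is at most the sum of the two list lengths.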

Exercise 1.3 [⋆]

For the document collection shown in Exercise 1.2, what are the returned results for these queries:

1. schizophrenia AND drug

2. for AND NOT(drug OR approach)
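One way to check answers to these queries is to evaluate them with set operations over a small index built from the Exercise 1.2 documents. This is a sketch under simplifying assumptions: postings are held as Python sets rather than sorted lists, tokens are split on whitespace, and the helper name postings is made up for illustration.

```python
# Documents from Exercise 1.2, keyed by docID.
docs = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# Build term -> set of docIDs (sets stand in for postings lists here).
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)


def postings(term):
    """Return the set of docIDs containing term (empty if unseen)."""
    return index.get(term, set())


# AND is set intersection, OR is union, NOT is complement w.r.t. all docIDs.
q1 = postings("schizophrenia") & postings("drug")
q2 = postings("for") & (all_ids - (postings("drug") | postings("approach")))

print(sorted(q1))  # [1, 2]
print(sorted(q2))  # [4]
```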

IIR Chapter 2 The term vocabulary and postings lists

Determining the vocabulary of terms is an important step, and the resulting dictionary and postings lists depend on the language of the documents. The chapter also covers how character-level processing differs between languages: decoding the character sequence and deciding where tokens begin and end has to be handled differently for different languages.

IIR Chapter 3 Dictionaries and tolerant retrieval

The search structure used for the dictionary plays a key role here, and algorithmic methods can improve the efficiency of lookup. Different kinds of queries can be supported by using additional indexes. During retrieval, the query may be misspelled, which affects the results, so the search system should be able to detect the error and either correct it or suggest the word the user intended to search for.
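One common way to measure how close a misspelled query word is to a dictionary term is edit distance (Levenshtein distance). The sketch below is illustrative only: the function names edit_distance and suggest, and the three-word vocabulary, are assumptions, and a real system would not scan the whole dictionary for every query.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and first j of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[m][n]


def suggest(word, vocabulary):
    """Return the vocabulary term with the smallest edit distance to word."""
    return min(vocabulary, key=lambda term: edit_distance(word, term))


# Hypothetical vocabulary for illustration.
vocabulary = ["retrieval", "dictionary", "postings"]
print(edit_distance("retreival", "retrieval"))  # 2
print(suggest("retreival", vocabulary))         # retrieval
```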