Exercise 1.12
Assume a biword index. Give an example of a document which will be returned for a query of New York University but is actually a false positive which should not be returned.
My Answer:
分为了 New York AND York University,显然 New York University 是一个特定的词组,前面的查询肯定会返回错误的结果。
Shown below is a portion of a positional index in the format: term: doc1: position1, position2, ...; doc2: position1, position2, ...; etc.
angels: 2: 36,174,252,651; 4: 12,22,102,432; 7: 17;
fools: 2: 1,17,74,222; 4: 8,78,108,458; 7: 3,13,23,193;
fear: 2: 87,704,722,901; 4: 13,43,113,433; 7: 18,328,528;
in: 2: 3,37,76,444,851; 4: 10,20,110,470,500; 7: 5,15,25,195;
rush: 2: 2,66,194,321,702; 4: 9,69,149,429,569; 7: 4,14,404;
to: 2: 47,86,234,999; 4: 14,24,774,944; 7: 199,319,599,709;
tread: 2: 57,94,333; 4: 15,35,155; 7: 20,320;
where: 2: 67,124,393,1001; 4: 11,41,101,421,431; 7: 16,36,736;
Exercise 1.13
Which document(s) if any match each of the following queries, where each expression within quotes is a phrase query?
fools rush in
fools rush in AND angels fear to tread
My Answer:
先处理第一个查询,这里还要考虑单词出现的顺序满足 fools 第一,之后是 rush,最后是 in。这里 doc2 和 doc4,doc7都满足。
接着看第二个查询,AND 右边的部分得到 doc4,所以整个的结果就是 doc4。
Exercise 1.14
Consider the following fragment of a positional index with the format:
word: document: position, position, ; document: position,
...
Gates: 1: 3; 2: 6; 3: 2,17; 4: 1;
IBM: 4: 3; 7: 14;
Microsoft: 1: 1; 2: 1,21; 3: 3; 5: 16,22,51;
The / operator, word1 / word2 finds occurrences of word1 within words of word2 (on either side), where is a positive integer argument. Thus demands that word1 be adjacent to word2.
- Describe the set of documents that satisfy the query Gates /2 Microsoft.
- Describe each set of values for $k$ for which the query Gates / Microsoft returns a different set of documents as the answer.
My Answer:
第一问,根据定义,doc1 和 doc3 都满足。
第二问,见如下表格:
K | result |
---|---|
1 | doc3 |
2 | doc1,doc3 |
3 | doc1,doc3 |
4 | doc1,doc3 |
5 | doc1,doc2,doc3 |
... | doc1,doc2,doc3 |
Exercise 1.15
Consider the general procedure for merging two positional postings lists for a given document, to determine the document positions where a document satisfies a k clause (in general there can be multiple positions at which each term occurs in a single document). We begin with a pointer to the position of occurrence of each term and move each pointer along the list of occurrences in the document, checking as we do so whether we have a hit for k. Each move of either pointer counts as a step. Let L denote the total number of occurrences of the two terms in the document. What is the big-O complexity of the merge procedure, if we wish to have postings including positions in the result?
Exercise 1.16
Consider the adaptation of the basic algorithm for intersection of two postings lists postings-merge-algorithm to the one in Figure 2.12 (page [*]), which handles proximity queries. A naive algorithm for this operation could be , where is the sum of the lengths of the postings lists (i.e., the sum of document frequencies) and is the maximum length of a document (in tokens).
- Go through this algorithm carefully and explain how it works.
- What is the complexity of this algorithm? Justify your answer carefully.
- For certain queries and data distributions, would another algorithm be more efficient? What complexity does it have?
Exercise 1.17
Suppose we wish to use a postings intersection procedure to determine simply the list of documents that satisfy a / clause, rather than returning the list of positions, as in Figure 2.12 (page [*]). For simplicity, assume . Let $L$ denote the total number of occurrences of the two terms in the document collection (i.e., the sum of their collection frequencies). Which of the following is true? Justify your answer.
- The merge can be accomplished in a number of steps linear in and independent of , and we can ensure that each pointer moves only to the right.
- The merge can be accomplished in a number of steps linear in and independent of , but a pointer may be forced to move non-monotonically (i.e., to sometimes back up)
- The merge can require steps in some cases.
Exercise 1.18
How could an IR system combine use of a positional index and use of stop words? What is the potential problem, and how could it be handled?