Exercise 1.1

Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.)

Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise

My Answer:

terms	postings lists
forecasts	1
home	1 -> 2 -> 3 -> 4
in	2 -> 3
increase	3
july	2 -> 3 -> 4
new	1 -> 4
rise	2 -> 4
sales	1 -> 2 -> 3 -> 4
top	1

Exercise 1.2

Draw the term-document incidence matrix for this document collection.Draw the inverted index representation for this collection

Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients

My Answer:

terms/doc	Doc 1	Doc 2	Doc 3	Doc 4
approach	0	0	1	0
breakthrough	1	0	0	0
drug	1	1	0	0
for	1	0	1	1
hopes	0	0	0	1
new	0	1	1	1
of	0	0	1	0
patients	0	0	0	1
schizophrenia	1	1	1	1
treatment	0	0	1	0

terms	postings lists
approach	3
breakthrough	1
drug	1 -> 2
for	1 -> 3 -> 4
hopes	4
new	2 -> 3 -> 4
of	3
patients	4
schizophrenia	1 -> 2 -> 3 -> 4
treatment	3

Exercise 1.3

For the document collection shown in Exercise 1.2 , what are the returned results for these queries:

schizophrenia AND drug
for AND NOT(drug OR approach)

My Answer:

$1111 \ AND \ 1100 = 1100$
也就是 doc1 和 doc2。
$1011 \ AND \ NOT(1100 \ OR \ 0010) = 0001$
也就是 doc4。

Exercise 1.4

Draw the inverted index that would be built for the following document

For the queries below, can we still run through the intersection in time $O(x+y)$ , where $x$ and $y$ are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?

Brutus and not Caesar
Brutus or not Caesar

My Answer:

对于第一个查询而言，可以使用以下的算法，可以看到时间复杂度还是 $O(x+y)$ 。而对于第二个查询而言，就没有办法了，时间复杂度是 $O(N)$ ，这里的 $N$ 是总文档数。

AND NOT(p1,p2)
answer <- ()
while p1 != NIL and p2!= NIL
if docID(p1) < docID(p2)
    ADD(answer,docID(p1))
    p1<-next(p1)
else if docID(p2) < docID(p1)
    p2<-next(p2)
else
    p1<-next(p1)
    p2<-next(p2)

while p1 != NIL
    ADD(answer,docID(p1))
    p1<-next(p1)

Exercise 1.5

Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider:

(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in linear time? Linear in what? Can we do better than this?

My Answer:

对于该查询而言，假设每个单词的 posting list 的长度分别是s1,s2,s3,s4。整个查询左边的时间复杂度是 $O(s1+s2)$ ,右边的时间复杂度是 $O(s3+s4)$ 。根据上一题的讨论我们发现 AND NOT 操作的时间复杂度还是 $O(x+y)$ ，所以总的时间复杂度就是时间复杂度还是 $O(s1+s2+s3+s4)$ ，还是线性的。

总结来说，A AND B , A OR B , A AND NOT B 的时间复杂度都是 $O(x+y)$ ，但是 A OR NOT B 的时间复杂度是 $O(N)$ 。

Exercise 1.6

We can use distributive laws for (and) and (or) to rewrite queries.
Show how to rewrite the query in Exercise 1.3 into disjunctive normal form using the distributive laws.
Would the resulting query be more or less efficiently evaluated than the original form of this query?
Is this result true in general or does it depend on the words and the contents of the document collection?

(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

My Answer:

在我的理解上，我一般把集合的 AND 操作类比为乘法操作，OR 操作类比为加法操作。 $(a+b)*(c+d)$ 类比这里的 $(A \bigcup B)\bigcap(C \bigcup D)^c$ ，如果你使用 De Morgan's laws 对原查询的右边进行变形，查询可以变为 $(A \bigcup B)\bigcap(C^c \bigcap D^c)$ ，同样进行类比 $(a+b)*(c*d) = acd+bcd$ ，上式可以化简为 $(A \bigcap C^c \bigcap D^c)\bigcup(B \bigcap C^c \bigcap D^c)$ 。根据前一章的 intersection 算法以及集合 AND 运算不受顺序的影响，可以发现变形后的查询效率更高了。总的来说还是要看单词和文档的关系。

Exercise 1.7

Recommend a query processing order for:

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:

Term	Postings size
eyes	213312
kaleidoscope	87009
marmalade	107913
skies	271658
tangerine	46653
trees	316812

My Answer:

首先估算每个 OR 操作的时间复杂度， $O(46653+316812=363465)$ ， $O(107913+271658=379571)$ ， $O(87009+213312=300321)$ 。这里你可以想象很多个集合来进行 AND 操作，由于是二元操作，之后的操作依赖于前一个操作的结果，这个结果的元素数目越少越利于之后的操作，也就是越快。所以应该先对元素少的集合进行运算。也就是 :
(kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)

Exercise 1.8

If the query is:

friends AND romans AND (NOT countrymen)

how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in determining the order of query processing.

My Answer:

我们需要纪录下所有的文档数 N，以及多少文档中含有该单词，记为 E，那么当决定处理顺序的时候我们观察 $N-E$ 的大小就可以了。

Exercise 1.9

For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn't.

My Answer:

并不一定这样就是最佳的处理顺序. 假设有三个单词并且列表的大小分别是 s1=100, s2=105, s3=110。假设 s1和s2 的交集大小是100，s1和s3 的交集大小是0。如果是按照 s1,s2,s3 的顺序来处理，需要 $100+105+100+110=415$ 。但如果按照 s1,s3,s2 的顺序来处理，只需要 $100+110+0+0=210$ 。

Exercise 1.10

Write out a postings merge algorithm,for an $x$ OR $y$ query.

My Answer:

OR(p1,p2)
answer <- ()
while p1 != NIL and p2!= NIL
if docID(p1) < docID(p2)
    ADD(answer,docID(p1))
    p1<-next(p1)
else if docID(p2) < docID(p1)
    ADD(answer,docID(p2))
    p2<-next(p2)
else
    ADD(answer,docID(p1))
    p1<-next(p1)
    p2<-next(p2)

while p1 != NIL
    ADD(answer,docID(p1))
    p1<-next(p1)

while p2 != NIL
    ADD(answer,docID(p2))
    p2<-next(p2)

Exercise 1.11

How should the Boolean query $x$ AND NOT $y$ be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.

My Answer:

之前已经写过了，可以参考。

Exercise 1-1

Exercise 1.1

Exercise 1.2

Exercise 1.3

Exercise 1.4

Exercise 1.5

Exercise 1.6

Exercise 1.7

Exercise 1.8

Exercise 1.9

Exercise 1.10

Exercise 1.11

results matching ""

No results matching ""