Paper by Erik D. Demaine

Reference:: Erik D. Demaine and Alejandro López-Ortiz, “A Linear Lower Bound on Index Size for Text Retrieval”, Journal of Algorithms, volume 48, number 1, August 2003, pages 2–15. Special issue of selected papers from the 12th Annual ACM-SIAM Symposium on Discrete Algorithms, 2001.
Abstract:: Most information-retrieval systems preprocess the data to produce an auxiliary index structure. Empirically, it has been observed that there is a tradeoff between query response time and the size of the index. When indexing a large corpus, such as the web, the size of the index is an important consideration. In this case it would be ideal to produce an index that is substantially smaller than the text.
In this work we prove a linear worst-case lower bound on the size of any index that reports the location (if any) of a substring in the text in time proportional to the length of the pattern. In other words, an index supporting linear-time substring searches requires about as much space as the original text. Here “time” is measured in the number of bit probes to the text; an arbitrary amount of computation may be done on an arbitrary amount of the index. Our lower bound applies to inverted word indices as well.
Comments:: This paper is also available from the ScienceDirect.
Length:: The paper is 14 pages.
Availability:: The paper is available in PostScript (207k), gzipped PostScript (80k), and PDF (187k).; See information on file formats.; [Google Scholar search]
Related papers:: SODA2001b (A Linear Lower Bound on Index Size for Text Retrieval)