Publication view

The Discoverability of the Web (2008)
WWW 2007 / Track: Search / Session: Crawlers

Abstract
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90 % of all new content with under 3 % overhead, and 100 % of new content with 9 % overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80 % of new content during a given week may be discovered with 160 % overhead if content is recrawled fully on a monthly basis.
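The maximum cover formulation and the overhead measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes perfect knowledge of which known pages link to which new pages, as in the first part of the study, and uses the standard greedy heuristic for maximum cover. The names `greedy_cover`, `overhead`, and the toy link graph are hypothetical and are not taken from the paper. Under the abstract's definition, an overhead of 3 % would correspond to roughly 3 fetches per 100 new pages discovered.

```python
from typing import Dict, List, Set, Tuple


def greedy_cover(links: Dict[str, Set[str]], budget: int) -> Tuple[List[str], Set[str]]:
    """Greedy maximum cover (illustrative, not the paper's exact algorithm).

    `links` maps each known page to the set of new pages it links to
    (assumed to be known perfectly). At most `budget` known pages are
    fetched; each step picks the page exposing the most new pages that
    have not been discovered yet.
    """
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(budget):
        # Page with the largest marginal gain; None if `links` is empty.
        best = max(links, key=lambda page: len(links[page] - covered), default=None)
        if best is None or not (links[best] - covered):
            break  # nothing left to gain from any remaining fetch
        chosen.append(best)
        covered |= links[best]
    return chosen, covered


def overhead(fetches: int, discovered: int) -> float:
    """Average number of fetches per newly discovered page."""
    if discovered == 0:
        return float("inf")
    return fetches / discovered


# Hypothetical toy graph: three known pages linking to four new pages.
toy_links = {
    "a": {"n1", "n2", "n3"},
    "b": {"n3", "n4"},
    "c": {"n4"},
}
fetched, found = greedy_cover(toy_links, budget=3)
print(fetched, sorted(found), overhead(len(fetched), len(found)))
```

In the more realistic setting described in the abstract, the link sets would not be known in advance and would have to be estimated from historical crawl statistics; the sketch does not attempt that.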

Publication details
Download http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.77.6773
Source http://www2007.org/papers/paper592.pdf
Contributor CiteSeerX
Archive CiteSeerX - Scientific Literature Digital Library and Search Engine (United States)
Keywords Algorithms, Experimentation, Measurements, Crawling, discovery, set cover, max cover, greedy
Type text
Language English
Links 10.1.1.37.234, 10.1.1.43.8973, 10.1.1.22.3686, 10.1.1.30.2529, 10.1.1.18.1519, 10.1.1.40.4718, 10.1.1.2.1331, 10.1.1.87.8454, 10.1.1.42.9320, 10.1.1.2.4767, 10.1.1.20.8164, 10.1.1.36.6087, 10.1.1.6.1108, 10.1.1.58.107, 10.1.1.58.9676, 10.1.1.17.5734