Some web characteristics
Significant duplication
- Syntactic: about 30% (near) duplicates [Fett03]
- Semantic: ???
High linkage
- More than 8 links per page on average
Complex graph topology
Spam
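
As an illustration of what "syntactic (near) duplicates" can mean in practice, here is a minimal sketch that compares two pages by their sets of word shingles (runs of k consecutive words) using Jaccard similarity. This is a common approach to near-duplicate detection, not necessarily the exact procedure of [Fett03]; the function names, the shingle size k=3, and the 0.5 threshold are assumptions chosen for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short text: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a, doc_b, k=3, threshold=0.5):
    """Hypothetical decision rule: similarity at or above threshold."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k)) >= threshold


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
page_c = "web crawlers fetch pages by following hyperlinks"

print(jaccard(shingles(page_a), shingles(page_b)))  # 6 shared of 8 total: 0.75
print(is_near_duplicate(page_a, page_b))            # True
print(is_near_duplicate(page_a, page_c))            # False
```

Comparing every pair of pages this way does not scale to the web; large-scale systems typically compare compact sketches of the shingle sets instead.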
[Fett03] Fetterly, D., Manasse, M. and Najork, M. 2003. On the evolution of clusters of near-duplicate Web pages. Procs. 1st Latin-Amer. Web Congress, 37-45.