The Web corpus

No design or coordination

Distributed content creation, linking, democratization of publishing

Content includes truth, lies, obsolete information, contradictions ...

Unstructured (text, HTML, ...), semi-structured (XML, annotated photos), structured (databases) ...
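
To make the three categories concrete, here is a minimal sketch that stores one fact in each form and recovers the birth year from it. The example sentence, the XML layout, the `person` table and all function names are assumptions made for illustration. The regular expression is only a crude heuristic for the free-text case.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET
from typing import Optional

# The same fact in three forms (illustrative data, not from the notes).
unstructured = "Ada Lovelace was born in 1815 in London."
semi_structured = (
    "<person><name>Ada Lovelace</name>"
    "<born year='1815'>London</born></person>"
)


def year_from_text(text: str) -> Optional[int]:
    """Unstructured: no schema, so fall back to a heuristic pattern match."""
    match = re.search(r"\b(\d{4})\b", text)
    return int(match.group(1)) if match else None


def year_from_xml(doc: str) -> int:
    """Semi-structured: tags label the fields, but there is no fixed schema."""
    root = ET.fromstring(doc)
    return int(root.find("born").get("year"))


def year_from_db(name: str) -> int:
    """Structured: a typed table queried with SQL (in-memory for the sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (name TEXT, born INTEGER, city TEXT)")
    conn.execute(
        "INSERT INTO person VALUES (?, ?, ?)", ("Ada Lovelace", 1815, "London")
    )
    row = conn.execute(
        "SELECT born FROM person WHERE name = ?", (name,)
    ).fetchone()
    conn.close()
    return row[0]
```

The more structure the source carries, the less guessing the extraction code has to do: the text version would break on a sentence containing two four-digit numbers, while the SQL query would not.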

Scale much larger than previous text corpora ... but corporate records are catching up.

Growth: slowed from the initial doubling in volume every few months, but still expanding

Content can be dynamically generated
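
As a rough illustration of dynamic generation, the sketch below builds a page from the URL's query string and the time of the request instead of reading a stored file. The `render_page` function, the `q` parameter and the example URL are hypothetical, not part of any real server.

```python
from datetime import datetime, timezone
from html import escape
from urllib.parse import parse_qs, urlparse


def render_page(url: str, now: datetime) -> str:
    """Build an HTML page on demand from the request URL and the current time."""
    query = parse_qs(urlparse(url).query)
    # Hypothetical "q" parameter; fall back to a placeholder when it is absent.
    term = query.get("q", ["(none)"])[0]
    return (
        "<html><body>"
        f"<h1>Results for {escape(term)}</h1>"
        f"<p>Generated at {now.isoformat()}</p>"
        "</body></html>"
    )


url = "http://example.com/search?q=web+corpus"
first = render_page(url, datetime(2024, 1, 1, tzinfo=timezone.utc))
second = render_page(url, datetime(2024, 1, 2, tzinfo=timezone.utc))
```

Here `first` and `second` differ although the URL is identical, so no single stored document corresponds to that address. Varying the query string yields an effectively unbounded set of pages, which is one reason a fixed snapshot cannot cover such content.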