An Investigation of Documents on the World Wide Web


We present a study of the HTML structure of documents on the World Wide Web. We collected over one million HTML documents using an open source web crawler and analyzed many characteristics based on their HTML structure and content, including document size; the number and types of tags, attributes, file extensions, and protocols; the number of outlinks; the ratio of document size to the number of tags and attributes; and reading ease for various domains. The documents were crawled in such a way that they represented an accurate sample of the entire web. A similar study was conducted by Woodruff et al. [WAB96] in 1996 to examine various HTML structural characteristics. The main goal of our study was to analyze the current structure of HTML documents on the web and to see how it has changed since that 1996 study. We also used the collected HTML structural data to build a classifier that predicts the usability of a web page.
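
As a rough illustration of the kind of per-document measurements listed above, the following minimal sketch counts tags, attributes, and links in a single page and derives the size-to-tag ratio. It is not the tooling used in the report: the names StructureCounter and summarize are hypothetical, Python's standard html.parser stands in for whatever parser the study used, and every <a href> is counted as an outlink, which is a simplifying assumption.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Hypothetical helper: tallies start tags, attributes, and links."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()        # tag name -> occurrences
        self.attributes = Counter()  # attribute name -> occurrences
        self.outlinks = 0            # assumption: every <a href> counts

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            if tag == "a" and name == "href" and value:
                self.outlinks += 1


def summarize(html):
    """Return basic structural statistics for one HTML document."""
    counter = StructureCounter()
    counter.feed(html)
    counter.close()

    size = len(html.encode("utf-8"))  # document size in bytes
    n_tags = sum(counter.tags.values())
    n_attrs = sum(counter.attributes.values())
    return {
        "size_bytes": size,
        "tag_count": n_tags,
        "distinct_tags": len(counter.tags),
        "attribute_count": n_attrs,
        "outlinks": counter.outlinks,
        # Ratios are undefined for documents without tags or attributes.
        "bytes_per_tag": size / n_tags if n_tags else None,
        "bytes_per_attribute": size / n_attrs if n_attrs else None,
    }
```

Calling summarize on the text of each crawled page and aggregating the resulting dictionaries would give corpus-level distributions; how the report itself aggregated its data is not stated here.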

A. Nanavati, A. Chakraborty, D. DeAngelis, H. Godil, and T. D'Silva. An Investigation of Documents on the World Wide Web. Technical Report, The University of Texas at Austin, December 2004.

Dave DeAngelis,
Jun 4, 2009, 2:04 PM