Google Code: Web Authoring Statistics

In December 2005 we did an analysis of a sample of slightly over a billion documents, extracting information about popular class names, elements, attributes, and related metadata.

Some interesting things I picked up from the study are:

  • A whole slew of people are specifying the xml:lang attribute, which will have absolutely no effect (no HTML processor will look at that attribute; it’s an XML attribute).
  • Of the top twenty most-used attributes on body, fourteen are purely presentational.
  • The br element is a simple one, yet used on so many pages that it is the 8th most-used element. It is used more than the p element. There are very few legitimate semantic places to use this element (addresses and poems are the canonical examples), which means that most uses are probably presentational.
  • In our data sample there were twice as many pages that used the table element but didn’t use the td element.
  • The script element was used on roughly half the pages we checked.
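
For anyone curious how per-page counts like these could be reproduced on a small scale, here is a minimal sketch using only Python's standard library. It is not the method used in the study, which is not described here; the names `UsageCollector`, `usage_in`, `pages_using`, and `tables_without_td` are made up for illustration, and each element or attribute is counted once per page no matter how often it appears.

```python
from collections import Counter
from html.parser import HTMLParser


class UsageCollector(HTMLParser):
    """Records which elements and attributes appear in one document."""

    def __init__(self):
        super().__init__()
        self.elements = set()
        self.attributes = set()  # (element, attribute) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        self.elements.add(tag)
        for name, _value in attrs:
            self.attributes.add((tag, name))


def usage_in(html):
    """Return (elements, attributes) seen in a single HTML string."""
    collector = UsageCollector()
    collector.feed(html)
    collector.close()
    return collector.elements, collector.attributes


def pages_using(documents):
    """Count how many pages use each element and each (element, attribute)."""
    element_pages = Counter()
    attribute_pages = Counter()
    for html in documents:
        elements, attributes = usage_in(html)
        element_pages.update(elements)
        attribute_pages.update(attributes)
    return element_pages, attribute_pages


def tables_without_td(documents):
    """Count pages that contain a table element but no td element."""
    count = 0
    for html in documents:
        elements, _ = usage_in(html)
        if "table" in elements and "td" not in elements:
            count += 1
    return count


# Tiny hand-written sample, purely illustrative (not data from the study).
sample_pages = [
    '<html xml:lang="en"><body bgcolor="white"><p>Hi<br>there</p></body></html>',
    '<html><body><table><tr><th>x</th></tr></table><br></body></html>',
    '<html><body><script>var a = 1;</script>'
    '<table><tr><td>1</td></tr></table></body></html>',
]

if __name__ == "__main__":
    element_pages, attribute_pages = pages_using(sample_pages)
    print(element_pages.most_common(5))
    print(attribute_pages[("html", "xml:lang")])
    print(tables_without_td(sample_pages))
```

Counting pages rather than total occurrences matches how the findings above are phrased ("used on roughly half the pages"), but a different counting rule would be just as easy to swap in.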
