Microdata vocabularies

Microdata is an extension to HTML5 (the combination is sometimes called HTML5 with Microdata) that adds machine-readable structure, or semantic meaning, to HTML documents. Software searching for specific types of information can process these properties. Some search engines, Google in particular, already support microdata in HTML5 and use it to improve search results.

I am specifically interested in the following Microdata schemas :

Art Galleries :

Creative Works :

Historical landmarks :

Historical people :

Rental :

HTML microdata

One of the most advanced technologies for the semantic web is HTML microdata. HTML Microdata is a W3C Working Draft (latest version : 29 March 2012).

Most HTML tags tell the browser how to display the information included in a tag. For example <h1>Blackberry</h1> tells the browser to display the text string Blackberry in a heading 1 format. However, the HTML tag doesn’t give any information about what that text string means. Blackberry could refer to a mobile device or to a fruit and this makes it difficult for search engines to intelligently display relevant content to a user.

Microdata vocabularies provide the semantics, or meaning, of an item. Web developers can design a custom vocabulary or use vocabularies available on the web. A widely used set of vocabularies is provided by schema.org.
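
As a rough sketch of how a vocabulary can resolve the Blackberry ambiguity, the heading below is placed inside an item typed with the schema.org Product vocabulary. The description text is an illustrative assumption, not taken from a real page.

```html
<!-- Plain heading: a machine cannot tell a device from a fruit -->
<h1>Blackberry</h1>

<!-- Same heading inside an item typed as a schema.org Product -->
<div itemscope itemtype="https://schema.org/Product">
  <h1 itemprop="name">Blackberry</h1>
  <p itemprop="description">A mobile device with a physical keyboard.</p>
</div>
```

The visible page is unchanged for a human reader; only the added attributes tell software that this Blackberry is a product.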

Microdata introduces five simple global attributes (available for any element to use) that give machines context about your data :

  • itemscope – creates the item and indicates that descendants of this element contain information about it (boolean attribute)
  • itemtype – a valid URL of a vocabulary that describes the item and the context of its properties
  • itemid – indicates a unique identifier of the item
  • itemprop – indicates that the element it is set on holds the value of the named item property (strings, URLs, images, …)
  • itemref – lists the IDs of elements outside the itemscope element whose properties should also be associated with the item
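
A minimal sketch of these attributes working together, assuming the schema.org Person vocabulary. The person, the URLs and the id value "contact" are hypothetical placeholders. The itemid attribute is left out, because whether it may be used depends on the vocabulary.

```html
<!-- itemscope creates the item, itemtype names its vocabulary,
     itemref pulls in properties from the element with id="contact" -->
<div itemscope itemtype="https://schema.org/Person" itemref="contact">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Gallery curator</span>
  <a itemprop="url" href="https://example.com/jane">Home page</a>
  <img itemprop="image" src="https://example.com/jane.jpg" alt="Portrait of Jane Doe">
</div>

<!-- Not a descendant of the item, but associated with it through itemref -->
<p id="contact">
  Email: <span itemprop="email">jane@example.com</span>
</p>
```

For most elements the property value is the text content. For an a element it is the href URL, and for an img element it is the src URL.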

Google uses semantic web technologies to create rich snippets (detailed information intended to help users with specific queries) in web search results. Google suggests using microdata as the markup format. Currently Google supports rich snippets for the following content types: Reviews, People, Products, Businesses and organizations, Recipes, Events and Music.
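
To give an idea of what markup for one of these content types might look like, here is a sketch of an event described with the schema.org Event and Place vocabularies. The event name, date and venue are invented for illustration, and nothing here guarantees that a search engine will actually display a rich snippet.

```html
<div itemscope itemtype="https://schema.org/Event">
  <span itemprop="name">Opening night: Modern Landscapes</span>
  <!-- The datetime attribute carries the machine-readable date -->
  <time itemprop="startDate" datetime="2012-10-07T19:00">October 7, 2012, 7 pm</time>
  <!-- A property value can itself be an item: here the location is a Place -->
  <div itemprop="location" itemscope itemtype="https://schema.org/Place">
    <span itemprop="name">Example Art Gallery</span>
    <span itemprop="address">1 Example Street, Luxembourg</span>
  </div>
</div>
```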

Google provides a Rich Snippet Testing Tool to check that its search engine can correctly parse the structured data markup and display it in search results. A Microdata schema creator is provided by Raven.

The following list provides links to more information about microdata, followed by a list of links to specific vocabularies :

Semantic Web

Last Update : October 7, 2012

The Semantic Web is a collaborative movement led by the international standards body W3C. The Semantic Web is a Web of Data, as opposed to the existing Web of Documents. The goal of the Web of Data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network.

The Web of Data is empowered by new technologies such as RDFa (Resource Description Framework in Attributes), SPARQL, OWL (Web Ontology Language), SKOS (Simple Knowledge Organization System), Microdata and Open Graph.

HTML (HyperText Markup Language) remains the main markup language for web pages and other information that can be displayed in a web browser.

Semantic HTML refers to the semantic elements and attributes of HTML (h1, h2, …, p, …), as opposed to the presentational HTML elements and attributes (center, font, b, …). The acronym POSH was coined in 2007 for semantic HTML, as a shorthand abbreviation for “plain old semantic HTML”.
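
The difference can be sketched with the same text marked up both ways. The snippet is only an illustration and the sample sentence is an assumption.

```html
<!-- Presentational: describes only how the text looks -->
<center><font size="6"><b>Semantic Web</b></font></center>

<!-- Semantic: describes what the text is; styling is left to CSS -->
<h1>Semantic Web</h1>
<p>The Semantic Web is a <em>Web of Data</em>.</p>
```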

HTML5 introduced a few new structural elements :

  • <header> : this tag replaces the <div class="header"> commonly used in the past by most designers. The header element contains introductory information for a section or page.
  • <footer> : same as above, it replaces the well-known <div class="footer">. The footer element marks up the closing information of the current page and of each section contained in the page.
  • <nav> : replacement for <div class="navigation">. The nav element is reserved for the primary navigation. Not all link groups in a page or section need to be contained within the <nav> element.
  • <section> : this is the replacement for the generic flow container <div> when it contains related content. <div> is a block-level element with no additional semantic meaning, whereas <section> is a sectioning element that normally has a header and a footer and represents a generic document or application section.
  • <article> : the <article> element represents a portion of a page or section which can stand alone and makes sense even outside the context of the page. Like <section>, an <article> generally has a header and a footer. You should avoid nesting an <article> inside another <article>.

Figure : HTML5 tag <aside>

  • <aside> : this tag represents content that is related to the surrounding content within a section, article or web page, but could still stand alone in its own right (see figure at right). This type of content is often placed in sidebars.
  • <hgroup> : A special header element that must contain at least two <h1>-<h6> tags and nothing else. It’s a group of titles with subtitles. Make sure to maintain the <h1> – <h6> hierarchy.
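
The structural elements listed above could be combined along the following lines. This is a minimal page skeleton with placeholder titles, links and text; <hgroup> is left out to keep the sketch short.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example page</title>
</head>
<body>
  <!-- Introductory information and primary navigation for the page -->
  <header>
    <h1>Site title</h1>
    <nav>
      <a href="/">Home</a>
      <a href="/blog">Blog</a>
    </nav>
  </header>

  <!-- A section groups related content -->
  <section>
    <h2>Latest posts</h2>
    <!-- An article makes sense on its own, with its own header and footer -->
    <article>
      <header><h3>Post title</h3></header>
      <p>Post content that makes sense on its own.</p>
      <footer>Posted on October 7, 2012</footer>
    </article>
    <!-- Related content that is not part of the post itself -->
    <aside>
      <h3>Related links</h3>
      <p>Content related to the posts, but not part of them.</p>
    </aside>
  </section>

  <footer>Page footer with copyright and contact information.</footer>
</body>
</html>
```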

The technologies behind the Web of Data can be summarized as follows :

  • RDFa is a W3C Recommendation that adds a set of attribute-level extensions (rich metadata) to web documents. RDFa 1.1 was approved in June 2012. It differs from RDFa 1.0 in that it no longer relies on the XML-specific namespace mechanism, so it can be used with non-XML document types such as HTML 4 or HTML 5. eRDF is an alternative to RDFa.
  • SPARQL is an RDF query language. SPARQL 1.0 became an official W3C Recommendation on 15 January 2008.
  • OWL is a family of knowledge representation languages for authoring ontologies. In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain and the relationships among those concepts. Ontologies are the structural frameworks for organizing information and are used, among other fields, in artificial intelligence.
  • SKOS is a family of formal languages designed for the representation of structured controlled vocabularies (thesauri, classification schemes, taxonomies, …).
  • Microdata is a WHATWG specification used to nest semantics within existing content on web pages.
  • The Open Graph protocol, originally created by Facebook, enables any web page to become a rich object in a social graph.
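
Two of these technologies are written directly into HTML attributes, so they can be sketched in a single page. The example below assumes RDFa 1.1 Lite with the schema.org vocabulary and the four basic Open Graph properties. The titles, names and URLs are hypothetical.

```html
<!DOCTYPE html>
<html lang="en" prefix="og: https://ogp.me/ns#">
<head>
  <meta charset="utf-8">
  <title>Example article</title>
  <!-- Open Graph: the four basic properties of the protocol -->
  <meta property="og:title" content="Example article">
  <meta property="og:type" content="article">
  <meta property="og:url" content="https://example.com/article">
  <meta property="og:image" content="https://example.com/cover.jpg">
</head>
<body>
  <!-- RDFa 1.1 Lite: vocab and typeof together play a role comparable to
       itemscope and itemtype in microdata; property is comparable to itemprop -->
  <div vocab="https://schema.org/" typeof="Person">
    <span property="name">Jane Doe</span>
    <a property="url" href="https://example.com/jane">Home page</a>
  </div>
</body>
</html>
```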

All these technologies help computers such as search engines and web crawlers better understand what information is contained in a web page, providing better search results for users.

Another set of simple, structured open data formats, built upon existing standards, is Microformats. One difference from the other semantic technologies is that Microformats are designed for humans first and machines second.
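
As an illustration of this approach, the sketch below marks up a contact with the hCard microformat, which relies on agreed class names rather than new attributes. The person and addresses are placeholders.

```html
<!-- hCard: class names carry the meaning, the visible text stays readable -->
<div class="vcard">
  <a class="fn url" href="https://example.com/jane">Jane Doe</a>,
  <span class="title">Gallery curator</span> at
  <span class="org">Example Art Gallery</span>
  (<a class="email" href="mailto:jane@example.com">jane@example.com</a>)
</div>
```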

The following list provides links to some useful blogs and tutorials about the semantic web: