for describing people and organisations, hCalendar
for describing calendars and events, and rel-tag
for marking up tags, keywords and categories in pages such as blog posts.
Microformats have been designed to be straightforward for humans to use, with mark-up based
around existing, widely used HTML features as shown in Figure 5:
<a class="url fn" href="http://www.seadams.co.uk/">Sam Adams</a>
is a <span class="role">software developer</span>.
Figure 5. Example of an hCard describing Sam Adams.
Note in Figure 5 that the vcard class on the enclosing element indicates that its child elements
form an hCard. The subsequent classes (url, fn, role) indicate the properties their elements describe.
The major criticisms of the microformat specifications are:
Conflicts with formatting information: Microformats make wide use of the class attribute,
which is more usually employed by selectors for style sheets giving presentation instructions for
a page. While the HTML specifications permit the use of the class attribute "for general
purpose processing by user agents", overloading the attribute in this manner makes it
impossible to tell whether a class attribute is being used for styling purposes or to mark up a
data field, and conflicts can arise when microformats are introduced to existing Web sites.
Processing challenges: The ambiguity between data and format specification also makes it
impossible to extract marked-up data in a generic manner: a processor can only extract data
conforming to microformats that it knows about. In the above example, a processor cannot
know that it should associate the value of the href attribute with the url property,
and its text content with fn (full name), unless these rules are hard-coded.
Accessibility: A number of microformats use the abbr HTML element to encode text in both
human-friendly and machine-readable formats; e.g., a date-time may be encoded as:
<abbr class="dtstart" title="20110921T14:00:00+0100">Wednesday 21st
at 2 o’clock</abbr>
Unfortunately, this usage of the abbr element is not compatible with the screen readers used by
many blind and partially sighted users, which has led some organisations, most notably the BBC,
to ban the use of microformats which make use of this pattern.
Approval process / Extensibility: In order to prevent conflicts between microformat and property
names, new microformats require centralised registration, and approval through a community
process. This can make it a lengthy and sometimes difficult process to establish a microformat
for a new type of data.
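The processing challenge can be made concrete with a minimal Python sketch using only the standard library. The rules that url is read from the href attribute while fn and role are read from text content are written into the program by hand, which is exactly the hard-coding described above. The class name HCardSketch, the helper parse_hcard and the div wrapper carrying the vcard class are assumptions made for illustration; this is not a conforming microformats parser.

```python
from html.parser import HTMLParser

# Hard-coded hCard rules: where each property's value lives.
# Nothing in the mark-up itself tells a generic processor this.
VALUE_FROM_ATTRIBUTE = {"url": "href"}
VALUE_FROM_TEXT = {"fn", "role"}


class HCardSketch(HTMLParser):
    """Collects a few hCard properties from class attributes (illustrative only).

    Assumes every start tag has a matching end tag (no void elements).
    """

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._open = []  # one list of text-valued property names per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        for name in classes:
            source = VALUE_FROM_ATTRIBUTE.get(name)
            if source and attrs.get(source):
                self.properties[name] = attrs[source]
        self._open.append([c for c in classes if c in VALUE_FROM_TEXT])

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Text contributes to every text-valued property that is currently open.
        for names in self._open:
            for name in names:
                self.properties[name] = self.properties.get(name, "") + data


def parse_hcard(html):
    parser = HCardSketch()
    parser.feed(html)
    parser.close()
    return parser.properties


# The hCard from Figure 5, wrapped in an assumed <div class="vcard">.
EXAMPLE = (
    '<div class="vcard">'
    '<a class="url fn" href="http://www.seadams.co.uk/">Sam Adams</a> '
    'is a <span class="role">software developer</span>.'
    '</div>'
)
card = parse_hcard(EXAMPLE)
print(card)
```

Supporting a further microformat would mean extending the two rule tables by hand, since the class names alone do not say where a value is stored.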
The RDFa specification provides a mechanism for embedding RDF (the language of the
Semantic Web) data models into XHTML documents. RDFa brings the full power of RDF to
embedding semantic data into Web documents, and is automatically compatible with the work
of the Semantic Web community.
In contrast to microformats, RDF/RDFa embraces ‘distributed extensibility’: anyone can create
a new vocabulary. This is achieved without having to worry
about conflicting with another vocabulary’s names, by using a URL the authors control as a
namespace for the vocabulary. Technologies such as RDF Schema (RDFS) and Web Ontology
Language (OWL) enable the construction of machine-understandable descriptions of the
required structure of RDF entities. The separation between data and formatting mark-up,
combined with more strictly specified parsing rules, ensures that problems such as the url/fn
ambiguity discussed above do not arise.
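As a minimal illustration of how URL namespaces avoid name clashes, the Python sketch below expands prefixed names (CURIEs) into full URIs by concatenation. The prefix table and the function name expand_curie are assumptions for illustration; the namespace URIs are the published Dublin Core and FOAF namespaces. A real RDFa processor reads the prefix bindings from the document rather than from a fixed table.

```python
# Prefix-to-namespace table (illustrative only).
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:reference' to a full URI by simple concatenation."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return prefixes[prefix] + reference


# Two vocabularies can both define 'creator' without conflict,
# because the expanded URIs differ.
dc_creator = expand_curie("dc:creator")
dcterms_creator = expand_curie("dcterms:creator")
print(dc_creator)
print(dcterms_creator)
```

Because each vocabulary's terms live under a URL its authors control, two independently minted terms with the same local name never collide.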
HTML 4.01 Specification. Chapter 7: The global structure of an HTML document.
The microformats process http://microformats.org/wiki/process
RDFa has, however, been widely criticised for its complexity in a number of areas:
XML basis: RDFa was originally developed for use with XHTML and, as such, requires
that documents be well-formed XML. Since up-take of XHTML has been limited, the
specification has been ported to support less well-formed HTML; however, differences
between HTML and XML can cause difficulties when processing RDF in HTML.
Use of prefixes: RDFa relies on XML namespace prefixes, which, it has been argued,
"most authors simply do not understand, and which many implementors [sic] end up
getting wrong" and "lead[s] to flaky copy-and-paste behaviour". This is further
complicated by the prefixed terms (technically CURIEs, rather than QNames) appearing
in attribute values, which few (if any) authoring tools understand, QNames generally
being confined to element and attribute names.
Complex formatting rules: Depending on the context in which they appear, relationships in
RDFa are variously expressed using either a rel or a property attribute;
authors can easily be confused about which is the correct one to use for a given situation.
Using the wrong one can still generate a valid RDF graph, but not with the meaning the
author intended.
The RDFa 1.1 specification, currently under development, aims to address such concerns by:
Permitting use of full URIs as property names, rather than requiring prefixed CURIEs
Providing a mechanism for specifying a default vocabulary for a given scope within a
document, thereby removing the need to prefix property names
Permitting the external definition of standard collections of prefixes, using ‘profile’
While RDFa 1.0 is widely used, there are very few sites or applications currently supporting
RDFa 1.1.
The Microdata specification has been created during the development of HTML5, with the aim
of addressing the common use cases for embedding metadata, while avoiding some of the
concerns that are raised around microformats and RDFa. James Graham of Opera (Graham,
2009) has stated that:
“Compared to microformats I believe the HTML 5 microdata offers more
consistent parsing rules [...] and cleaner separation from the rest of the markup language.
Compared to RDFa, microdata offers a considerably simpler authoring experience which I
believe to be critical to gaining traction with a large base of users.”
Microdata introduces a set of new attributes for specifying data 'items' and their properties.
Items can be assigned a type (defined using a URL) which provides a context for prefix-less
property names, similar to the role of namespaces in RDF/RDFa. Properties may also be
specified using a URL, in which case they can be applied in any context, without requiring a
specific item type. Currently there is no mechanism for providing machine-understandable
specification of microdata vocabularies, or mapping between URL and ‘simple’ property names;
so it is not possible to mix ‘simple’ names from different vocabularies in a single item. This
contrasts with RDF/RDFa, where objects (items) can be assigned multiple classes (types), and
it is straightforward to mix property names from different vocabularies.
The microdata specification currently includes instructions for mapping microdata to JSON.
Some earlier versions of the specification have included instructions for converting HTML
Microdata to RDF, but they have been removed from the current draft.
Metadata available in scholarly works
This case study is not looking at adding new metadata to scholarly publications, but
semantically encoding metadata that is already being recorded. The focus is on bibliographic
and citation data, i.e. metadata about the publication itself and about other publications that it
cites and references.
RDFa in HTML issues http://rdfa.info/wiki/Rdfa-in-html-issues
RDFa 1.1 Nears Completion http://rdfa.info/2011/03/31/rdfa-1-1-almost-ready/
The Public Library of Science (PLoS) is an open access publisher. Alongside the conventional
HTML and PDF formatted versions of papers they publish, PLoS also makes available raw XML
versions (conforming to the U.S. National Library of Medicine Document Type Definition (NLM
DTD)). The XML files contain considerable amounts of metadata, including:
Author names and affiliations
Citation (journal title, year, volume, pages)
Cited references: titles, authors, citation (e.g., journal title, year, volume, issue, pages)
CrystalEye is a repository aggregating openly published crystallographic molecular structures
from across the Web. CrystalEye entries consist of Crystallographic Information Files and
Chemical Markup Language XML files describing the crystallographic structure, as well as,
recently, an RDF representation of information about the crystal. There is an HTML splash page
for each entry, providing a summary of the crystal structure, and linking to the various resources
(files) making up the entry. The full semantic data can already be retrieved as an RDF/XML file,
but there are core items of metadata that, if encoded in the HTML splash page, could assist
Web crawlers and browsers in respect of:
Title and authors of the crystal structure
Identity of molecular entities in the crystal structure
Citation for the original publication
Evaluation of suitability
Microformats such as rel-license and rel-tag, for example
<a href="..." rel="license">cc by 2.0</a>
<a href="http://example.com/tag/html5" rel="tag">html5</a>
are likely to be useful for adding semantics to licence statements and content tags, due to their
simplicity. However, there are currently no microformat specifications or drafts relating to
scholarly works’ more complex metadata. While there are ‘exploratory discussions’ around
citations, this process appears to have been on-going for some years, and it is likely to be some
time before a specification starts to emerge.
RDF is widely used to process data in many communities, including the handling of scholarly
metadata. This means there are already a large number of RDF vocabularies available;
examples with particular relevance to scholarly publishing include:
FOAF (Friend of a Friend)
The Public Library of Science http://www.plos.org/
PRISM (Publishing Requirements for Industry Standard Metadata)
FRBR (Functional Requirements for Bibliographic Records)
The Dublin Core vocabulary is very widely used for marking up basic metadata (e.g. title,
creator(s), description…) and is straightforward to use to mark up a resource’s title:
<h1 property="dc:title">My Really Great Paper</h1>
where the dc prefix is bound to the Dublin Core elements namespace.
Author names are also straightforward to encode using Dublin Core in RDFa:
<span property="dc:creator">Sam Adams</span>
<span property="dc:creator">John Smith</span>
And more complex descriptions of an author can be supported:
<span property="foaf:name">Sam Adams</span>
<span rel="foaf:url" resource="http://www.seadams.co.uk/" />
where the foaf prefix is bound to the FOAF namespace.
The existence of two versions of the Dublin Core vocabulary (the original 15 elements, and the
larger set of DC terms) can cause confusion for authors: strictly following the specifications, a
creator should be specified as a simple ('literal') string if using the original elements, and as an
object with properties if using the DC terms vocabulary. This means that data of the form:
<span rel="dcterms:creator">Sam Adams</span>
is not strictly permitted, although such constructs are quite commonly observed.
There are a number of RDF vocabularies for describing bibliographic data. During the course of
this case study we have evaluated the two most widely used: the Bibliographic Ontology (BIBO)
and Publishing Requirements for Industry Standard Metadata (PRISM). The two
vocabularies contain broadly equivalent terms (e.g. title, authors, journal, issue number, volume
number…); however, in order to conform strictly to their specifications they impose quite different
structures on the data. Here we have focused on marking up journal article metadata; however,
the vocabularies can also be used to mark up bibliographic data about books, reports and other
works.
The PRISM vocabulary imposes a flat structure, consisting of an article, with a list of properties
describing the bibliographic data.
Web site for the Bibliographic Ontology, known as BIBO http://bibliontology.com/
Publishing Requirements for Industry Standard Metadata (PRISM)
Figure 6. The flat data structure imposed by the PRISM vocabulary.
In contrast, BIBO imposes a nested structure where, following the specification, an article is
described as part of an issue, which is in turn part of a journal. According to BIBO's
specification, it is not permitted to use the properties in the ‘flat’ style of the PRISM structure.
However, these rules are not always observed (e.g., by some of the examples found in the
documentation on BIBO’s Web site).
Figure 7. The nested data structure imposed by the Bibliographic Ontology.
A second difference is in marking up a journal's name. While both vocabularies use the Dublin
Core title property to mark up an article's title, the PRISM vocabulary includes an explicit
publicationName term, whereas BIBO uses Dublin Core title again (this is made possible due to
the nested data structure). These differences make BIBO well suited to building databases of
bibliographic data, where it may be useful to model issues and journals explicitly. However,
PRISM's simpler data structure makes it better suited than BIBO for encoding bibliographic
metadata in documents.
<p>DOI: <a rel="prism:url" href="http://dx.doi.org/...">...</a></p>
Figure 8. Describing an article’s bibliographic information using RDFa / PRISM.
Since microdata is a relatively recent development, there are not yet many vocabularies
available. The first W3C version of the Microdata specification included a number of predefined
types and property names for describing common structures. They were removed from
subsequent drafts, but some standard vocabularies (vCard, vEvent and Licensing works) are
still included in the current WHATWG specification.
Microdata received a major boost in June 2011, when Bing, Google and Yahoo! announced a
joint initiative called schema.org to support a common set of schemas for structured data
mark-up on the Web. Schema.org has chosen to use microdata because it strikes a "balance
between the extensibility of RDFa and the simplicity of microformats". The primary benefit of
marking up data using the schema.org vocabulary is to improve one’s display in search results.
Google, for example, will display Rich Snippets in its search listings for pages containing
schema.org mark-up of supported data types, such as Events, Organisations and People.
Among its data types, schema.org includes a ScholarlyArticle type, which we can use to
describe an article:
<article itemtype="http://schema.org/ScholarlyArticle" itemscope>
Adding a title (name) to this is straightforward:
<article itemtype="http://schema.org/ScholarlyArticle" itemscope>
<h1 itemprop="name">An investigation of FUD</h1>
Author names are a little more complicated, as you have to start a new Person item, and then
attach properties to that:
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Sam Adams</span>
</span>
<span itemprop="author" itemscope itemtype="http://schema.org/Person">
<span itemprop="name">John Smith</span>
</span>
The schema.org specification does not permit the simpler:
<span itemprop="author">Sam Adams</span>,
<span itemprop="author">John Smith</span>
Although it seems likely that many examples of this approach will appear as use of the
schema.org vocabulary grows.
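The nesting of a Person item inside a ScholarlyArticle item can be seen by walking the mark-up and building a JSON-like structure. The Python sketch below does this with the standard library only. It is loosely inspired by the idea of mapping microdata to JSON, but it is not the specification's algorithm: it handles only itemscope, itemtype and itemprop, takes property values from text content, and assumes every start tag has a matching end tag. The names MicrodataSketch and extract_items are hypothetical.

```python
import json
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Builds nested dicts from itemscope/itemtype/itemprop (simplified)."""

    def __init__(self):
        super().__init__()
        self.items = []   # top-level items found in the document
        self._stack = []  # one frame per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        frame = {"item": None, "prop": attrs.get("itemprop"), "text": []}
        if "itemscope" in attrs:
            # This element starts a new item, optionally typed by a URL.
            frame["item"] = {"type": attrs.get("itemtype"), "properties": {}}
        self._stack.append(frame)

    def handle_data(self, data):
        for frame in self._stack:
            frame["text"].append(data)

    def handle_endtag(self, tag):
        if not self._stack:
            return
        frame = self._stack.pop()
        if frame["item"] is not None:
            value = frame["item"]  # the property value is a nested item
        else:
            value = "".join(frame["text"]).strip()
        owner = self._enclosing_item() if frame["prop"] else None
        if owner is not None:
            owner["properties"].setdefault(frame["prop"], []).append(value)
        elif frame["item"] is not None:
            self.items.append(frame["item"])

    def _enclosing_item(self):
        # Nearest open ancestor element that started an item.
        for frame in reversed(self._stack):
            if frame["item"] is not None:
                return frame["item"]
        return None


def extract_items(html):
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.items


EXAMPLE = """
<article itemscope itemtype="http://schema.org/ScholarlyArticle">
  <h1 itemprop="name">An investigation of FUD</h1>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Sam Adams</span>
  </span>
</article>
"""
items = extract_items(EXAMPLE)
print(json.dumps({"items": items}, indent=2))
```

With the simpler form, where the author is given as plain text, the author value in this sketch would be a string rather than a nested Person item, which is the structural difference the vocabulary rules are concerned with.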
The schema.org vocabulary for ScholarlyArticles does not support concepts such as volume and
DOI, which are needed to mark up journal papers’ bibliographic and citation data.
This leaves three options for representing such data using Microdata:
1. Extend schema.org
The specification for schema.org allows Web masters to introduce new properties for
existing schema.org classes, so we could simply introduce ‘volume’, ‘issueNumber’, ‘doi’, etc.
properties. However, this carries the risk that a property name we introduce could conflict
with another extension. It would also be difficult to document these extensions: the natural
place for a user to find information about properties of schema.org classes is on the
schema.org Web site, but there would be no information about our extensions there.
<span itemprop="journalTitle">J Interest Things</span>
2. Extend schema.org with external vocabularies
While Microdata properties whose names are plain words (e.g. ‘author’) can only be used
within the context of item types for which they are defined, if properties are named using
URLs, they can be used on items of any type, though this can end up being quite verbose:
<span itemprop="...">J Interest Things</span>
3. Use a different vocabulary
We could create a whole new Microdata vocabulary for scholarly works (possibly building on
an existing RDF vocabulary). However, this runs the risk of missing out on the
ecosystem/support that may develop around schema.org, given the dominance of its backers.
To explore the options raised above further, tools have been developed to demonstrate the
production of scholarly documents containing semantically encoded metadata:
As previously discussed, the raw XML is made available for articles published in PLoS journals.
In order to generate examples of articles with semantically marked-up metadata, an XSLT
stylesheet has been developed that transforms the XML articles into HTML5, with semantic
mark-up of embedded metadata.
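The tool described here is an XSLT stylesheet; purely to illustrate the kind of transformation involved, the Python sketch below turns a tiny NLM-style XML fragment into HTML5 carrying schema.org microdata. The sample XML is a heavily simplified assumption of the NLM structure (real PLoS files are far richer), the function name article_to_microdata is hypothetical, and only the title and author names are handled.

```python
import xml.etree.ElementTree as ET
from html import escape

# Tiny stand-in for an NLM-style article (simplified assumption).
SAMPLE_XML = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An investigation of FUD</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Adams</surname><given-names>Sam</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Smith</surname><given-names>John</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>
"""


def article_to_microdata(xml_text):
    """Render an article's title and authors as HTML5 with microdata."""
    meta = ET.fromstring(xml_text).find("front/article-meta")
    title = meta.findtext("title-group/article-title", default="")
    lines = [
        '<article itemscope itemtype="http://schema.org/ScholarlyArticle">',
        f'  <h1 itemprop="name">{escape(title)}</h1>',
    ]
    for contrib in meta.findall("contrib-group/contrib"):
        if contrib.get("contrib-type") != "author":
            continue  # skip editors and other contributor roles
        given = contrib.findtext("name/given-names", default="")
        surname = contrib.findtext("name/surname", default="")
        full_name = f"{given} {surname}".strip()
        # Each author becomes a nested Person item.
        lines += [
            '  <span itemprop="author" itemscope itemtype="http://schema.org/Person">',
            f'    <span itemprop="name">{escape(full_name)}</span>',
            '  </span>',
        ]
    lines.append("</article>")
    return "\n".join(lines)


html_out = article_to_microdata(SAMPLE_XML)
print(html_out)
```

Because the metadata is already present in the source XML, a transformation of this kind only re-expresses it; no new metadata has to be authored.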
The stylesheet has been packaged into a Web application that is accessible at:
The source code for this application, including the XSLT stylesheet, is available from
CrystalEye is powered by an instance of the Chempound data repository. Chempound
generates splash pages for data items using a templating system. The templates used to
generate splash pages for CrystalEye entries have been extended to encode core metadata:
title and authors of the crystal structure, and citation of the source publication.
The repository is available at: http://crystaleye.ch.cam.ac.uk/
Embedding semantic metadata into HTML pages is clearly a topic of current interest.
Unfortunately there is not yet a clear standard for generating this mark-up; instead there are a
number of competing formats. The strongest contenders seem to be RDFa and microdata, each
of which has advantages and disadvantages when compared to the other. Given its longer
history, RDFa is currently the more widely used of the two. On the other hand, due to its simpler
form, and the recent backing of microdata by the Web’s major search engines through the
schema.org initiative, it seems likely that large amounts of microdata will start to appear shortly.
Assuming that microdata does take off, conventions for describing scholarly works will be
needed. There are a number of options, though they all suffer from potential drawbacks:
Extend schema.org vocabularies; but the extensions could clash with someone else's.
Mint a whole new microdata vocabulary of scholarly works; but this misses out on the
ecosystem/support that may develop around schema.org, given its backers.
Use schema.org so far as possible, and import elements of other vocabularies, e.g.
BIBO/PRISM; but this would rapidly become untidy and unwieldy.
Some other option.
There are advantages and disadvantages to each of these options, but the most important
factor is consensus.
It is worth bearing in mind that the microdata specification is not yet finalised. At the same time,
the current development of the RDFa 1.1 specification appears to be addressing some of the
concerns regarding the complexity of producing RDFa.
While it is unlikely that these efforts will merge anytime in the foreseeable future, ideally a
mechanism for interoperability will develop.
There have been a number of developments since this case study was initially written:
Late in September 2011 the W3C launched a Microdata/RDFa Task Force to
analyse the relationship between the two formats.
Work is ongoing on a ‘Microdata to RDF’ specification.
The microdata specification has been changed to allow an item to have multiple item
types, so long as the types are all defined to use the same vocabulary.
Schema.org have announced that they are introducing support for RDFa 1.1 Lite,
“a very minimal subset that will work for 80% of the folks out there doing simple [...]”,
in order to “allow publishers to focus more on what
they want to say with their data, rather than on the details of its specific encoding as [...]”.
It still does not look as though the microdata and RDFa efforts are likely to merge; however, efforts are
clearly being made to improve their interoperability.
There is not yet any consensus as to whether one format will emerge as the de facto standard
for data publication on the Web. My personal feeling is that RDFa is likely to be the stronger
contender for this, since it offers the greatest flexibility and supports complex data models.
Moreover, the development of the RDFa 1.1, and especially the RDFa Lite 1.1, specifications
has made RDFa much simpler to publish than was previously the case (RDFa Lite 1.1 looks to be as
simple to use as microdata). Microdata suffers from the limitation that it cannot support the more
complex use cases for data publication, so will never be able to completely replace RDFa.
HTML Data Task Force: http://www.w3.org/wiki/Html-data-tf