1. About This Case Study
This case study looks at the fundamentals of using HTML5 for scholarly documents of all kinds,
particularly theses and courseware documents (with application to journal articles as well), but
with an eye on a much broader spectrum of resources, including those which are the subject of
other case studies in this project such as slide presentations. It will aim to establish the basic
structural and semantic building blocks for how resources should be marked up for the Web, to
increase their utility for people and machines, as well as help to ensure they can be preserved
effectively. This case study will build on work already undertaken in the Scholarly HTML
community as well as on the other HTML5 case studies in this project.
The audience for this work is tool developers building authoring systems, repositories and
publishing infrastructure for academic documents. The outcomes could be used by people
hand-coding documents, but that is not a very likely scenario. A related case study will
implement the advice from this study in a tool that allows users to create HTML5 from Microsoft
Word (using the Word 2000 HTML format, which is available on Windows from v2000 to v2010),
and a third looks at how citations embedded in a document using the guidelines presented
here can be re-formatted.
What Is Covered
The following aspects are covered in this document:
The basic structural backbone of a scholarly HTML document; how to mark up the
scope of what is the content on a page. For example, on a blog, which section is the
scholarly work as distinct from the navigation elements, advertisements, etc.?
Best practice for marking up sections within the document (whether to use nested
sections or just headings
discussion of issues like putting headings in tables and
A brief discussion of techniques for embedding rhetorical semantics in documents.
That is, the ability to distinguish an introduction or conclusion, or to mark parts of a
text such as learning objectives by drawing on XML schemas and ontologies. Some
generic advice about how to mark up other kinds of semantic relationships, such as
linking to a data file, illustrated with examples from Chemistry.
Work on metadata and semantics by other case studies incorporated into the core
The following is still in gestation:
Anchors for commenting and annotation: this requires some attention as simple
schemes such as numbering paragraphs are very limited in capability, and current
tools such as digress.it require documents to be in a particular dialect of HTML to
work. The introduction of standards in this area would allow interoperability between
commenting and annotation systems.
This work should have an impact on:
Search engine optimisation (SEO), particularly for services such as Google Scholar.
Reduced friction in moving documents through submission processes to journals, to
repositories and to review processes such as peer review, thesis examination, and
assessment, via automated metadata extraction.
Improved machine-readability for text and data-mining processes.
Improved accessibility for readers: the guidelines will take into account WCAG.
Preservation: the guidelines will assist authors and tool makers in constructing
documents which do not 'rot' as technology changes.
2. Use Case
The use case here is very broad: it is about the optimal mark-up for any kind of academic-
related document on the Web. The Web was conceived as a vehicle for scholarship, but in the
two decades since its invention, scholarly communications have taken a back seat to the driver,
commerce. The most common form of Web publishing for scholarly publications is articles in
PDF format, which readers are expected to download and manage themselves. PDF has the
advantage of capturing an absolute layout, preserving the exact look of a document, but it was
designed for capturing print, not for delivery to an increasing variety of screen sizes (and to
devices with no screen at all).
It has been recognised that the current scholarly publications landscape is not serving the
needs of the community. Publishers have built an industry around the creation of paper-like
objects which do not allow:
Delivery to any device
Re-use by humans to create new works.
Rich integration of publications with supporting data and visualisations of data.
Machine-processable semantics so that research literature can be mined, indexed and
For learning resources the situation is a little different, in that there has been some history of
Web-based delivery of materials controlled by institutions themselves.
As noted in another JISC project, HTML is the major format for the next stage of the
development of the Web, as well-structured Web resources can not only be delivered and used
on the Web, they are also the basis for creating e-books. It is clear from the extremely rapid growth
of the e-book market for commercial publishing that learning resources and research materials
will need to follow. HTML5 is essential both to the open EPUB3 standard and to market leader
Amazon’s newest format.
In both research and learning materials, there is still a distinct lack of tools for creating Web-
native resources, at least in a way accessible to typical academics. The use case of an
academic sitting down to create richly structured HTML5 academic objects, with embedded
semantics and preservation-quality mark-up, is something of a dream at this stage. The best
this case study can hope to achieve is to provide a starting point, both for a description of
what Scholarly HTML should look like and for a roadmap for tool development that allows the
scholarly Web to take its rightful place alongside the Web ‘high street’.
The solution has two parts. The first is a guide to marking up Scholarly HTML documents. The
second points to software packages and techniques that are useful in the process of marking up
documents, both existing tools, and tools developed as part of this project.
How to Mark Up Documents for Scholarly HTML
This section will become a stand-alone guide and be posted on the Scholarly HTML Website as
the core guide to structuring HTML documents. Scholarly HTML is a term used by a loose group
of people interested in bringing scholarship to the Web, or the Web back to its scholarly roots as
a publication and research platform. The group met physically once at a meeting convened by
Peter Murray-Rust at Cambridge University in March 2011.
Use HTML5, Microdata and Common Vocabularies
HTML5 is an evolving standard which codifies HTML in the context of the real world. The
Wikipedia page for HTML5 is a good starting point for pointers to the specification. This
document will assume that the reader is familiar with the HTML5 standard, in particular outline
structures and microdata; Mark Pilgrim’s Dive Into HTML5 is a good free introduction.
The #jiscPUB Project, http://jiscpub.blogs.edina.ac.uk/about/
Within the Scholarly HTML group there was a short, vigorous debate about whether or not
Scholarly HTML should be required to be well-formed XML. There were fears on both sides: on
the one hand that HTML resources would be impossible to parse reliably, and on the other hand
that making XML mandatory would be too high a bar, reducing the pool of available content
considerably. Mark Pilgrim covers very similar arguments in his chapter on the background of
HTML5, including the now-abandoned XHTML standard. The good news is that with HTML5 you do not
have to choose. The HTML5 standard specifies exactly how HTML5 should be parsed, and
once parsed it can be re-serialised as XML. So, for machine-based processing, the advice is:
use an HTML5 parser, not an XML parser. Then, if you want to use XML tools, serialise the
document as XML.
For example, here is some Python to illustrate the process using html5lib, an
implementation of the HTML5 parsing rules, and lxml. The commands below were run on Ubuntu Linux.
First, install the Python libraries you need:
sudo easy_install lxml html5lib
Then, open a Python shell (type python) and try this:
import html5lib  # Handles the 'HTML5' stuff
from html5lib import treebuilders
from lxml import etree  # Handles the XML serialization
# Ask html5lib to build an lxml tree so that lxml can serialise it
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("lxml"))
e = parser.parse("<p>This is some HTML<p>Which is very far from "
                 "being <b>XML <p>But which the HTML parser will be OK with")
print(etree.tostring(e))
The result is well-formed XML (yes, the namespace is probably wrong, but this will serve as the
input to downstream processes).
<html:p>This is some HTML</html:p>
<html:p>Which is very far from being <html:b>XML</html:b></html:p>
<html:p><html:b>But which the HTML parser will be OK with</html:b></html:p>
So, the advice for Scholarly HTML is:
Use HTML5 as per the standard including Microdata.
The Context for Pages
Scholarly works on the Web are unlikely to be stand-alone documents. They will very often be
embedded in content management systems, repositories or publisher Web sites. It is out of
scope for this case-study to consider the structure of Web pages produced by these Web
applications, but for an example of best practice in structuring HTML5 Web sites, including
navigation elements and so on, see the Common Framework case study conducted by Bilbie.
The key points in that case study involve:
Flexible design that can re-flow to any size of device.
Use of HTML5 attributes in the mark-up to provide cues to screen readers and other
element on which you can cause things to appear, via scripts. But there are several reasons to
avoid making academic works depend on particular scripts or applications, and instead look for
ways to express the meaning of a work and the parts it links to as plain HTML, with enough
information in it for scripts, etc., to come into play when needed:
Works can be archived and preserved independently of the scripts on which they
Other people can reuse the declaratively specified data in new ways
Revising the work when new applications emerge is much easier when there is a clear
separation between documents, together with data and media, as opposed to the
code that does interesting things with the documents, data and media.
A good example of this approach can be seen in this series of case studies in the work by
Adams and MacGillivray on citation formats, where the same declarative format is a
meeting point for two different projects. Another project based on a declarative format (though
not an HTML5 one) is the work by Gray on embedding 3D motion-capture models in HTML5 pages. The
embedding is done using a declarative XML format rather than via a script.
HTML5 has an <article> element which at first glance seems to be the perfect container for
scholarly works. It seems obvious that it should be used for the text of a scholarly article and
reasonable that it should be used for book chapters, course modules and so on. The problem
with this is that content management systems may also be using <article>. For example, the
WordPress default theme, at the time of writing, uses <article> to mark up posts.
So, the advice is:
Conventions for document-level mark-up:
If your scholarly work is going to be part of a stand-alone Web page, or you know that it is
appropriate in the context into which it will be published, use <article>.
If the article is going to be sent off to a publisher or posted to a blog (where, for example, the
theme might change at some point), it is safer to use the <section> element.
In either case, mark up the scholarly work with microdata semantics:
Note that the Schema.org definition for Scholarly Article is at present rather light on detail,
defining it as:
A Scholarly Article
This guide is making the assumption that, in spirit it is really the more generic
If more delicate terms are added to the Schema.org vocabularies or more appropriate terms
identified then this advice will need to change.
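As a rough illustration of document-level microdata, the sketch below marks up a <section> as a Schema.org Scholarly Article and reads the semantics back with Python's standard-library HTML parser. The item type URI http://schema.org/ScholarlyArticle and the name and author properties are assumptions about how the advice above would be applied, and the title and author values are made up. The class MicrodataSketch is hypothetical and handles only this simple case; it is not a full microdata processor.

```python
from html.parser import HTMLParser

# Hypothetical document fragment: the title and author are invented.
DOCUMENT = """
<section itemscope itemtype="http://schema.org/ScholarlyArticle">
  <h1 itemprop="name">An Example Paper</h1>
  <p>By <span itemprop="author">A. N. Author</span></p>
  <p>Body text goes here.</p>
</section>
"""


class MicrodataSketch(HTMLParser):
    """Collect itemtype URIs and the text content of itemprop elements.

    Simplification: assumes an itemprop element never contains a nested
    element with the same tag name, and ignores nested items.
    """

    def __init__(self):
        super().__init__()
        self.itemtypes = []
        self.properties = {}
        self._open_props = []  # stack of (tag, property name, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # itemscope is a boolean attribute, so only its presence matters.
        if "itemscope" in attrs and "itemtype" in attrs:
            self.itemtypes.append(attrs["itemtype"])
        if "itemprop" in attrs:
            self._open_props.append((tag, attrs["itemprop"], []))

    def handle_data(self, data):
        for _, _, parts in self._open_props:
            parts.append(data)

    def handle_endtag(self, tag):
        if self._open_props and self._open_props[-1][0] == tag:
            _, name, parts = self._open_props.pop()
            self.properties[name] = "".join(parts).strip()


parser = MicrodataSketch()
parser.feed(DOCUMENT)
print(parser.itemtypes)
print(parser.properties)
```

Because the semantics sit on the container element, the same attributes work whether the container is <article> or <section>.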
Within the section or article element chosen, the question arises how to mark up the structure of
a work with headings, sections, etc. In HTML5, documents have an outline, which can be
computed using a well-specified algorithm.
This means that the use of internal section elements within resources has no real impact on
semantics, so how you format a document depends on what is convenient or necessary:
For authoring in a text editor, or even an HTML editor, the use of sections may be an
unnecessary complication; consider using headings which are not wrapped in
For authoring in a word-processing environment, nested sections are impossible to
implement; so use heading styles and choose conversion software that can respect
for example WordDown, produced as a demonstrator for this project (Sefton
To add microdata semantics at the section level of the document, it will be necessary
to use section mark-up on which to ‘hang’ microdata attributes. This is a significant
barrier to editing with lightweight tools such as Markdown.
In published documents, using sections makes it easier for other people to copy and
paste or machine-process documents, even though they could determine the structure
of the document by computing its outline.
For the most general way of presenting documents for publication, it is possible to use
this structure where each section has an <h1> heading, even where they are nested
within each other, but this may not be encouraging re-use by others who need to edit
For documents that need to work in legacy browsers, and in content management
systems where you do not have control over the CSS used, current best-practice
advice is to use a <header> block with the document title in an <h1> element, then
use <h2> .. <h5> throughout the document, each in a section, to enable the use of
microdata semantics, and to aid others in re-using the content. (For example, loading
such a document into Microsoft Word would lose the sections, but keep the headings.)
<!-- document title-->
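The legacy-friendly pattern described above (a <header> with the title in an <h1>, then <h2> to <h5> headings each inside a section) can be checked mechanically. The following sketch, using only the Python standard library, lists headings by explicit level. It is deliberately much simpler than the HTML5 outline algorithm, which also takes sectioning elements into account; it is only adequate when heading levels are explicit, as in this pattern. The sample document and the class name HeadingOutline are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical document following the <header>/<h1> plus sectioned <h2>.. pattern.
DOCUMENT = """
<header><h1>Document title</h1></header>
<section>
  <h2>Introduction</h2>
  <p>Some text.</p>
  <section>
    <h3>Background</h3>
  </section>
</section>
<section>
  <h2>Methods</h2>
</section>
"""


class HeadingOutline(HTMLParser):
    """Record (level, text) for every h1..h6 element, in document order."""

    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # level of the heading currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = self.HEADINGS[tag]
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.outline.append((self._level, "".join(self._parts).strip()))
            self._level = None


outline_parser = HeadingOutline()
outline_parser.feed(DOCUMENT)
for level, title in outline_parser.outline:
    print("  " * (level - 1) + title)
```

Since the heading levels carry the structure on their own, the same outline survives if a tool strips the section elements, which is the re-use scenario mentioned above.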
Embedding Metadata and Semantics
Sam Adams has included a discussion of the most prominent methods of embedding semantics
in documents in his case study. He considers microformats, RDFa and microdata. As microdata
is part of the HTML5 specification, and is receiving mainstream support from major internet
companies, it is recommended as the default method of adding semantics to Scholarly HTML.
Conventions for embedded semantics and metadata:
Use the schema.org vocabularies where possible, and when they are not adequate
extend semantics by using well-documented ontologies or vocabularies maintained by
groups with an interest in scholarship.
A blog post is available as an example of some of the design considerations in using microdata;
it also reports on work done as part of this case study.
Marking up Rhetorical Semantics
It is useful to be able to mark up sections in academic publications that have different roles. A
W3C working draft on the “Ontology of Rhetorical Blocks” puts it like this:
Having the rhetorical block structure externalised and attached to the digital publications
would enable a richer and more expressive searching and browsing experience. One
would be able to quickly spot the METHODS blocks within the publication and possibly
resume the reading activity only to those, thus reducing the time usually spent on reading
the entire publication. On the other hand, being able to formulate queries for content
specific only to such blocks could already improve the quality (and possibly the quantity)
of the set of relevant publications (e.g. methods: "autosomal-dominant mutations in
For example, to mark up the major body sections of a document use mark-up like this:
<section itemtype="http://purl.org/orb/Introduction" itemscope>
Or one of the other rhetorical elements defined in the above-mentioned W3C ORB draft:
Relating the section back to the containing document still needs consideration, but using this
kind of mark-up is a first step to capturing some of the document structure that XML is often
used to describe, but in a more flexible way, that can be applied directly to Web documents.
As a simple demonstration of how this is useful, the WordDown converter that generates HTML
from Word documents uses this mark-up for a references or bibliography section:
<section itemtype="http://purl.org/orb/References" itemscope>
By using a public, well-defined and documented URI we increase the chances that software can
interoperate and that others can reuse our scholarly resources. (But note that in the model of
citations we are proposing here, detailed information about references might be stored in the
document at the point they are cited, or not at all if the author is citing by reference).
See the case study by Sefton on citation formatting, which contains draft examples.
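To show how software might take advantage of rhetorical mark-up, here is a minimal sketch that groups the text of each section under its rhetorical block name, so that a tool could, for instance, pull out only the references. It uses the http://purl.org/orb/ URIs shown above and the Python standard library only. The sample text and the class name RhetoricalBlocks are hypothetical.

```python
from html.parser import HTMLParser

ORB = "http://purl.org/orb/"

# Hypothetical fragment using the ORB item types shown in the text.
DOCUMENT = """
<section itemtype="http://purl.org/orb/Introduction" itemscope>
  <h2>Introduction</h2>
  <p>Why this matters.</p>
</section>
<section itemtype="http://purl.org/orb/References" itemscope>
  <h2>References</h2>
  <p>A hypothetical reference entry.</p>
</section>
"""


class RhetoricalBlocks(HTMLParser):
    """Map each ORB block name to the text chunks found inside that section."""

    def __init__(self):
        super().__init__()
        self.blocks = {}
        self._stack = []  # one entry per open <section>: ORB name or None

    def handle_starttag(self, tag, attrs):
        if tag != "section":
            return
        itemtype = dict(attrs).get("itemtype") or ""
        name = itemtype[len(ORB):] if itemtype.startswith(ORB) else None
        self._stack.append(name)
        if name is not None:
            self.blocks.setdefault(name, [])

    def handle_endtag(self, tag):
        if tag == "section" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text belongs to every enclosing section that has an ORB type.
        for name in self._stack:
            if name is not None:
                self.blocks[name].append(text)


block_parser = RhetoricalBlocks()
block_parser.feed(DOCUMENT)
print(block_parser.blocks["References"])
```

Matching on the full URI rather than on heading text is what lets the lookup work regardless of the language or wording of the section titles.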
Linking to Data and Supporting Documents
To link to a data set in a declarative way, use this pattern, with a generic property from the
Citation ontology (cito) and a type which is domain-specific from an appropriate vocabulary, in
this case a term associated with Chemical Markup Language:
<link href="link-to-the-data.cml" itemprop="url">
My CML file
This declarative statement of the relationship between a scholarly work and supporting data
Scholarly HTML5: experimenting on myself with microdata and Schema.org vocabs
bookmarklet, extension or added by the CMS serving the page. A worked example of using
similar declarative mark-up to embed chemical visualisations in Web pages is available in a
Many scholarly resources are made up of more than one part: theses and books have chapters;
journal issues are made up of multiple articles, reviews, etc. Courseware is very often made up
of disparate resources brought together in a lesson context, and the research object of the
future really must comprise a wide range of components, including documents, data,
provenance information and so on.
In the academic environment, the Open Archives Initiative has worked on a standard way of
describing compound objects, the Object Reuse and Exchange (ORE) standard. ORE is
complicated to understand. For the purposes of real-world Web practice that is easy to
implement, a simplified approach is needed, though as always one that draws on terms from
established ontologies and vocabularies.
Take the case of a table of contents on the Web for work that is made up of multiple parts
(NOTE: this approach is an early proposal only).
<article itemtype="http://schema.org/Book" itemscope>
<h1 itemprop="name">My book!</h1>
<span itemprop="name">Chapter One</span>
As part of this work various tools were created by the author, drawing on open source libraries:
WordDown is a Word-to-HTML5 conversion application, covered in another case
study document and hosted on the Google Code site for jiscHTML5
ReCite is a citation re-formatter that uses Citation Style Language (CSL) to reformat
the references in a page, covered in another case study (Sefton, 2012a).
Show5ource is JavaScript code, packaged as a bookmarklet for:
Extracting Microdata from HTML5 documents in JSON format
Copying and pasting the source of HTML5 documents for simple pasting into
content management systems
Alpha code for EPUB packaging of compound resources.
Also useful are these tools:
The Live Microdata tool is useful for debugging microdata and uses the same
underlying library as Show5ource.
Scholarly HTML: Fraglets of progress, http://ptsefton.com/2011/03/18/scholarly-html-fraglets-
What the OAI-ORE protocol can do for you, http://ptsefton.com/2008/10/14/what-the-oai-ore-
WordDown: jischtml5: Collection of HTML5 case studies and examples of scholarly
resources and tools for processing them, http://code.google.com/p/jischtml5/wiki/WordDown
Live Microdata, http://foolip.org/microdatajs/live/