TUGboat, Volume 32 (2011), No. 3
On the use of TeX as an authoring language for HTML5
S.K. Venkatesan
Abstract
The TeX syntax has been fairly successful at marking up a variety of
scientific and technical literature, making it an ideal authoring
syntax. The brevity of the TeX syntax makes it difficult to create
overlapping structures, which in the case of HTML has made life so
difficult for XML purists. We discuss S-expressions, the TeX syntax
and how it can help reduce the nightmare that HTML5 markup is going
to create. Apart from this we implement a new syntax for marking up
semantic information (microdata) in TeX.
1 Introduction
The brevity of TeX syntax has made it fairly successful at marking up
a variety of scientific and technical literature. On the one hand,
modern markup languages such as (X)HTML and XML have verbose syntax
which is not only difficult to author but also produces non-treelike
structures such as overlapping structures that need to be checked for
well-formedness. On the other hand, TeX and its macros are difficult
to parse and validate, compared to XML with a DTD or schema. Many XML
versions of TeX have been proposed such as TeXML [3] and XLaTeX [5]
that are intrinsically close to (La)TeX. The main advantage of such a
system is that one can introduce a validator using a DTD or schema to
check the syntax before passing it to the TeX engine.
However, XML syntax is difficult to author and in fact is prone to
producing overlapping structures that need to be avoided for it to be
well-formed, and as a result these XML versions have not become
popular for authoring. In this article, we propose something that is
quite the reverse, i.e., TeX as an authoring syntax for both XML and
HTML.
2 TeX, S-expressions and XML
Let us look at the following TeX code:
\title[lang=en]{Title of
a \textit{plain} article}
The same code in a Lisp-like S-expression would be:
(title (@ (lang="en")) ("Title of a ")
(italic "plain") ("article"))
or if one would like to treat elements and attributes
in the same way:
(title (@lang="en") ("Title of a ")
(italic "plain") ("article"))
The difference between the above two S-expressions is that the former
introduces a deliberate asymmetry between attributes and elements,
whereas the latter treats attributes on a par with elements. However,
both S-expressions can be considered as an improvement on XML as they
allow further nesting within attributes. The corresponding XML code
would be:
<title lang="en">Title of
a <italic>plain</italic> article</title>
In both TeX and XML syntax, further nesting of structures is not
possible within attributes, which makes TeX ideal for authoring XML
or HTML5.
There are further similarities between the TeX and SGML/HTML
syntaxes. Attribute minimization used in HTML, like not quoting
attribute values, is very much practiced in TeX syntax, as the rule
rather than the exception; e.g.,
\includegraphics[width=2cm]{myimage.gif}
Unlike SGML/HTML, TeX typically uses a comma as the separator between
attributes, instead of the word-space used in SGML/HTML. TeX also
allows attribute values to be skipped completely, similar to the
commonly used HTML code: <option selected>. Quite like TeX, HTML also
has the practice of shrinking multiple spaces to a single space. All
of these similarities make it clear that authoring HTML in TeX would
be an ideal proposition.
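The attribute conventions just described map onto each other quite directly. The following Python sketch shows one possible mapping; the function names `options_to_html_attrs` and `collapse_spaces` are hypothetical, and the sketch assumes that option values contain no commas (brace-protected values, as permitted by LaTeX key-value packages, are not handled).

```python
from html import escape


def options_to_html_attrs(options):
    """Map a TeX-style option list such as 'width=2cm, selected'
    to a space-separated HTML attribute string."""
    attrs = []
    for item in options.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if sep:
            # key=value: quote the value, which TeX leaves unquoted.
            attrs.append('%s="%s"' % (key.strip(), escape(value.strip(), quote=True)))
        else:
            # Bare key: a minimized attribute, as in <option selected>.
            attrs.append(key)
    return " ".join(attrs)


def collapse_spaces(text):
    """Shrink every run of whitespace to a single space."""
    return " ".join(text.split())


print(options_to_html_attrs("width=2cm"))
print(options_to_html_attrs("selected"))
print(collapse_spaces("Title of\n   a plain    article"))
```

The comma-to-space change of separator and the quoting of values are the only transformations needed in this simple case, which is what makes the two syntaxes easy to interconvert.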
3 Overlapping markup in HTML
Since HTML is marked up by humans, there tend to be many situations
with overlapping elements or other eccentric markup which do not
conform to a well-formed SGML or XML syntax. Consider the HTML
markup:
<p>Text with <i>unique <b>and</i>
strong formatting</b> issues</p>
A utility like HTML Tidy [6] or TagSoup [1] can convert this into
well-formed markup such as:
<p>Text with <i>unique </i><b><i>and</i>
strong formatting</b> issues</p>
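How such a repair can work is sketched below in Python, using the standard `html.parser` module. This is a simplified illustration and not the algorithm used by HTML Tidy or TagSoup: the class `OverlapRepairer` and the helpers `repair` and `is_well_formed` are hypothetical names, void elements such as `<br>` are not treated specially, and stray end tags are simply dropped. When an end tag does not match the innermost open element, the sketch closes the intervening elements and reopens them afterwards, so it arrives at a different but equally well-formed result from the one shown above.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class OverlapRepairer(HTMLParser):
    """Rewrite overlapping elements as properly nested ones."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag name, original start-tag text) of open elements
        self.out = []

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        self.stack.append((tag, text))
        self.out.append(text)

    def handle_endtag(self, tag):
        if tag not in [name for name, _ in self.stack]:
            return  # stray end tag: drop it
        reopened = []
        while self.stack:
            name, text = self.stack.pop()
            self.out.append("</%s>" % name)
            if name == tag:
                break
            reopened.append((name, text))
        # Reopen the elements that had to be closed early, outermost first.
        for name, text in reversed(reopened):
            self.stack.append((name, text))
            self.out.append(text)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def repair(markup):
    parser = OverlapRepairer()
    parser.feed(markup)
    parser.close()
    while parser.stack:  # close anything left open at the end
        parser.out.append("</%s>" % parser.stack.pop()[0])
    return "".join(parser.out)


def is_well_formed(markup):
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


bad = "<p>Text with <i>unique <b>and</i> strong formatting</b> issues</p>"
print(is_well_formed(bad))
print(repair(bad))
print(is_well_formed(repair(bad)))
```

That two reasonable repairs of the same input differ in where the bold and italic spans are split is a small instance of the ambiguity that overlapping markup introduces.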
However, it is not always clear what should be done with such
non-standard markup. The HTML5 specification defines clearly how such
non-standard markup should be interpreted [7] but the HTML
implementations in browsers currently deal with it differently from
each other.
W3C has been insisting for some time that the next generation of
markup should be XML-compliant like XHTML+MathML+SVG profiles, with
other intricacies such as namespaces. However, more than 99% of HTML
pages in the wild are invalid, according to the HTML4 DTD or schema.
This being the