choose from. There were several sites for the exchange of new XML DTDs,
but writing new ones is now rare.
You can of course just make up your own markup: so long as it makes
sense and you create a well-formed file, you should be able to write a CSS
or XSLT stylesheet and have your document displayed in a browser.
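As a rough illustration of what 'well-formed' means in practice, the sketch below uses the ElementTree parser from Python's standard library to test whether a document with made-up element type names parses at all. The function name check_well_formed and the recipe markup are invented for this example; any non-validating XML parser would serve equally well.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Made-up element type names: the parser does not care what they are called,
# only that every start-tag has a correctly nested end-tag.
good = "<recipe><title>Toast</title><step>Toast the bread.</step></recipe>"
bad = "<recipe><title>Toast</step></recipe>"  # mismatched end-tag

print(check_well_formed(good))
print(check_well_formed(bad))
```

The second call reports a parse error because the title start-tag is closed by a step end-tag; no DTD is involved in either check.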
TO WELL FORMED
If your files are invalid HTML (95 [...]) they can be converted to well-formed
DTDless files as follows:
1. replace any DOCTYPE Declaration with the XML Declaration
<?xml version="1.0" encoding="UTF-8"?> (or using the appropriate
2. If there was no DOCTYPE Declaration, just prepend the XML
Declaration. Either way, the XML Declaration, if used, must be line 1
of the document.
3. Change any EMPTY elements (eg every BASE, ISINDEX, LINK, META,
NEXTID and RANGE in the header, and every AREA, ATOPARA, AUDIOSCOPE,
BASEFONT, BR, CHOOSE, COL, FRAME, HR, IMG, KEYGEN, LEFT, LIMITTEXT, OF,
OVER, PARAM, RIGHT, SPACER, SPOT, TAB, and WBR in the body of the
document) so that they end with /> instead, for example
<img src="mypic.gif" alt="Picture"/>;
4. Make all element type names and attribute names lowercase;
5. Ensure there are correctly-matched explicit end-tags for all
non-EMPTY elements; eg every <para> must have a </para>, etc;
6. Escape all < and & non-markup (ie literal text) characters as &lt; and
&amp; respectively (there shouldn’t have been any isolated <
characters to start with, anyway!);
7. Ensure all attribute values are in matched quotes (values with
embedded single quotes must be in double quotes, and vice versa —
if you need both, use the &quot; character entity reference);
8. Ensure all script URIs which have & as a field separator are changed to
use &amp; or a semicolon instead.
(mathematical less-than tests, and Boolean AND conditionals) are
either given as CDATA Marked Sections, or (if browser processors
accept them) changed to use &lt; and &amp; or a semicolon
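Steps 3, 4, 6 and 7 above lend themselves to automation. The following is a minimal sketch, assuming Python and its standard xml.sax.saxutils module. The helper names make_element and make_empty_element are hypothetical, and a real conversion would work from parsed HTML rather than building strings by hand.

```python
from xml.sax.saxutils import escape, quoteattr


def _attributes(attrs):
    # Step 4: lowercase attribute names.
    # Step 7: quoteattr picks matching quotes, and falls back to &quot;
    # when the value contains both single and double quotes.
    return "".join(f" {key.lower()}={quoteattr(value)}" for key, value in attrs.items())


def make_element(name, attrs, text):
    """Build a non-EMPTY element with an explicit end-tag (step 5)."""
    name = name.lower()
    # Step 6: escape() turns literal & and < (and also >) into references.
    return f"<{name}{_attributes(attrs)}>{escape(text)}</{name}>"


def make_empty_element(name, attrs):
    """Build an EMPTY element ending in /> (step 3)."""
    return f"<{name.lower()}{_attributes(attrs)}/>"


print(make_empty_element("IMG", {"SRC": "mypic.gif", "ALT": "Picture"}))
print(make_element("P", {"TITLE": 'Tom & "Jerry"'}, "1 < 2 & 3"))
```

Letting a library function choose the quoting and escaping is less error-prone than search-and-replace, because the ampersand is escaped first and is never escaped twice.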
Be aware that some obsolete HTML browsers may not accept XML-style
EMPTY elements with the trailing slash, so the above changes may not be
backwards-compatible. An alternative is to add a dummy end-tag to all
EMPTY elements, so <img src="foo.gif"/> becomes
<img src="foo.gif"></img>. This is valid XML but you must be able to
guarantee no-one will ever put any text content inside such elements. Adding
a space before the closing slash in EMPTY elements (eg
<img src="foo.gif" />) may also fool older browsers into accepting XHTML
If you have to answer Yes to any of the questions in Question 3.5 below,
you can save yourself a lot of grief by fixing those problems first before
doing anything else. You will likely then be getting very close to having
Markup which is syntactically correct but semantically meaningless or void
should be edited out before conversion. Examples are bogus spacing devices
such as repeated empty paragraphs or linebreaks, empty tables, invisible
spacing GIFs etc. XML uses stylesheets, and CSS3 means you won’t need any
Unfortunately there is rather a lot of work to do if your files are invalid: this
is why many Webmasters now insist that only valid or well-formed files are
used (and why you should instruct your designers to do the same), in order
to avoid unnecessary manual maintenance and conversion costs later.
CHECKLIST FOR INVALID
If your HTML files fall into this category (HTML created by most WYSIWYG
editors is usually invalid) then they will almost certainly have to be
converted manually, although if the deformities are regular and carefully
constructed, the files may actually be almost well-formed, and you could
write a program or script to do as described above. The oddities you may
need to check for include:
• Do the files contain markup syntax errors? For example, are there
any missing angle-brackets, backslashes instead of forward slashes
on end-tags, or elements which nest incorrectly (eg
<B>starting <I>inside one element</B> but ending out-
• Are there elements with missing end-tags that cannot be inferred by
• Are there any URIs (eg in hrefs or srcs) which use Microsoft
Windows-style backslashes instead of normal forward slashes?
• Do the files contain markup which conflicts with HTML DTDs, such as
headings or lists inside paragraphs, list items outside list
environments, header elements like base preceding the first html,
etc? (another sloppy editor trick)
• Do the files use imaginary elements which are not in any known
HTML DTD? (large amounts of these are used in proprietary markup
systems masquerading as HTML). Although this is easy to transform
to a DTDless well-formed file (because you don’t have to define
elements in advance) most proprietary or browser-specific
extensions have never been formally defined, so it is often impossible
to work out meaningfully where the element types can be used.
• Are there any invalid (non-XML) characters in your files? Look
especially for native Apple Mac Roman-8 characters left by careless
designers; any of the illegal Windows characters (the 32 characters at
decimal codes 128–159 inclusive) inserted by Microsoft editors; and
any of the ASCII control characters 0–31 (except those permitted
like TAB, CR, and LF). These must be converted to the correct
characters in UTF-8 (or whatever you are using).
• Do your files contain invalid (old Mosaic/Netscape-style) comments?
Comments must look
<!-- like this -->
with double-dashes each end and no other double (especially not
multiple) dashes in between.
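Two of the checks in the list above (stray characters and old-style comments) can be approximated with a short script. This is only a sketch, assuming Python and text that has already been decoded to a string; the function names suspect_characters and bad_comments are made up for the example. Note that XML 1.0 itself tolerates (but discourages) code points 128 to 159: they are flagged here because in practice they usually mean a Windows codepage file was mislabelled.

```python
import re

ALLOWED_CONTROLS = {9, 10, 13}  # TAB, LF, CR

# Non-greedy match of a whole comment, capturing what lies between the delimiters.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)


def suspect_characters(text):
    """Return (offset, code point) pairs for characters the checklist warns about."""
    found = []
    for offset, char in enumerate(text):
        code = ord(char)
        if (code < 32 and code not in ALLOWED_CONTROLS) or 128 <= code <= 159:
            found.append((offset, code))
    return found


def bad_comments(text):
    """Return comments whose body contains a double dash or ends with a dash."""
    return [
        match.group(0)
        for match in COMMENT.finditer(text)
        if "--" in match.group(1) or match.group(1).endswith("-")
    ]


sample = "ok\x07 caf\x92 <!-- fine --> <!-- old -- style -->"
print(suspect_characters(sample))
print(bad_comments(sample))
```

A report like this only locates the problems: deciding which character was intended at each offset still needs a human, or at least knowledge of the original encoding.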
3.6 How do I convert XML to other file formats?
Write a conversion in a language that understands XML
While it is possible to write conversion routines by inventing your own XML
parser, it is not recommended except as an exercise for students of
computing science. All major languages have XML libraries that do all the
heavy lifting of parsing (and validating, if needed).
You do need to know what’s in the XML document before you start: there is
no magic wand that will automatically deduce what things mean and where
they are located in the file. If you have been handed some XML files out of
the blue, you will need to go and find the creator or some documentation
about them. The first 2–3 lines of the file may hold a clue as to what type of
XML they are. You will almost certainly need a copy of the DTD or Schema
to which the files have been created.
The options for programming are:
• Use a language designed for the task. XSLT2 has all the facilities for
handling XML built in from the start, and standalone processors are
available for all platforms. Many XML editors have a copy of XSLT (2,
hopefully) built in, so they offer an integrated development environment
for editing and conversion. XSLT2 conversion can also run inside
server packages like Apache Cocoon.
• Use an XML processing or pipelining package. These are (usually)
commercial products which provide extensive document management,
document database, and document conversion and editing functions,
often as part of a much larger enterprise information solution, using
XSLT2 or their own in-house systems. Two popular ones are
• Use a conventional compilable language. Java or C (or one of its many
variants) would be common; Pascal, FORTRAN, or COBOL are rare
these days, but XML libraries do exist for them. BASIC, anyone?
• Use a scripting language. Perl, Python, Tcl, VBscript, or even
Powershell are all popular, and XML libraries exist for them; the
Python ones have an excellent reputation.
• Combine XML utilities with standard shells. Here is an early example
of an XML-to-CSV routine which uses onsgmls to expose the ESIS, and
awk to reformat it. Similar processes can be developed using the
• There are downloadable (sometimes free) programs claiming to be
‘easy’ XML converters. The editor would like to hear recommendations
or warnings.
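To give a flavour of the scripting-language option, here is a minimal XML-to-CSV sketch in Python using only the standard library (xml.etree.ElementTree and csv). The catalogue, product, name and price element types and the code attribute are invented for the example: as noted above, you need to know what is in your own documents before you can write the equivalent.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: a small product catalogue.
SAMPLE = """<catalogue>
  <product code="A1"><name>Widget</name><price>9.50</price></product>
  <product code="B2"><name>Gadget &amp; case</name><price>12.00</price></product>
</catalogue>"""


def xml_to_csv(xml_text):
    """Return CSV text with one row per product element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["code", "name", "price"])
    for product in root.iter("product"):
        writer.writerow([
            product.get("code", ""),                 # attribute value
            product.findtext("name", default=""),    # text of child element
            product.findtext("price", default=""),
        ])
    return out.getvalue()


print(xml_to_csv(SAMPLE), end="")
```

The parser resolves &amp; back to a plain ampersand before the value reaches the CSV writer, which is one reason to use a real XML library rather than pattern-matching on the raw text.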
The process of converting XML to other formats is sometimes referred to as
‘down-converting’, as it may involve the unavoidable loss of information
(usually metadata) when the target format simply doesn’t have a way to
3.7 If XML is just a subset of SGML, can I use my
existing SGML tools?
Yes, if they are up to date
Yes, provided you use up-to-date SGML software which knows about the
WebSGML Adaptations TC to ISO 8879 (the features needed to support XML,
such as the variant form for EMPTY elements; some aspects of the SGML
Declaration such as NAMECASE GENERAL NO; multiple attribute token list
An alternative is to use an SGML DTD to let you create a fully-normalised
SGML file, but one which does not use empty elements; and then remove the
DocType Declaration so it becomes a well-formed DTDless XML file. Most
SGML tools now handle XML files well, and provide an option switch
between the two standards (see the pointers in Question 4.10 on page 81).
Unless there are very special reasons, you should probably plan to move
your SGML to XML anyway.
3.8 I’m used to authoring and serving HTML. Can I
learn XML easily?
Very easily, but even after nearly 20 years there is still a need for more
tutorials, simpler tools, and more open examples of XML documents.
‘Well-formed’ XML documents may look similar to HTML except for some
small but very important points of syntax.
The big practical difference is that XML has to stick to the rules. HTML
browsers let you serve them even fatally broken or ridiculously corrupt
HTML because they don’t do a formal parse but just elide all the broken bits
instead. With XML your files have to be completely correct or they simply
won’t work at all. One outstanding problem is that some browsers claiming
XML conformance are also broken, and some browsers’ support for XSLT
processing and CSS styling is still dubious at best. Try yours on the list of
real hotel web sites.
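The difference in strictness is easy to demonstrate. In the sketch below, Python's html.parser.HTMLParser stands in (very loosely) for a forgiving HTML consumer, and xml.etree.ElementTree for an XML parser; the TagCollector class and the sample markup are assumptions made for this illustration, not a model of how any particular browser behaves.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record start- and end-tags in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# The b and i elements overlap instead of nesting.
broken = "<p><b>starting <i>inside one element</b> but ending outside</i></p>"

collector = TagCollector()
collector.feed(broken)   # no complaint: the tags are simply reported as found
collector.close()

try:
    ET.fromstring(broken)
    xml_accepted = True
except ET.ParseError:
    xml_accepted = False  # the XML parser refuses the whole document

print(collector.events)
print(xml_accepted)
```

The HTML tokenizer carries on regardless, while the XML parser stops at the first mismatched end-tag and returns nothing.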
3.9 Can XML use non-Latin characters?
Yes, this is the default
Yes, the XML Specification explicitly says XML uses ISO 10646, the
international standard character repertoire which covers most known
languages. Unicode is an identical repertoire, and the two standards track
each other. The spec says (2.2): ‘All XML processors must accept the UTF-8
and UTF-16 encodings of ISO 10646...’. There is a Unicode FAQ at
While XML software may allow you to enter any Unicode character into a
document, your readers can only see the characters if their computer has a
suitable font! Not all typefaces and font files have the entire Unicode
repertoire (ones that do are huge).
UTF-8 is an encoding of Unicode into 8-bit characters: the first 128 are the
same as ASCII, and higher-order characters are used to encode anything
else from Unicode into sequences of between 2 and 4 bytes (the original
definition allowed up to 6). UTF-8 in its
single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can
continue to use ASCII for English or other languages using the Latin
alphabet without diacritics (accents). Note that UTF-8 is incompatible with
ISO 8859-1 (ISO Latin-1) after code point 127 decimal (the end of ASCII).
UTF-16 is an encoding of Unicode into 16-bit characters, which (using pairs of
16-bit units where needed) lets it represent the Basic Multilingual Plane and
the 16 supplementary planes. UTF-16 is incompatible with ASCII because it uses two
8-bit bytes per character (four bytes above U+FFFF).
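The byte counts described above can be checked directly. This is a small sketch, assuming Python, whose built-in str.encode supports both encodings; the four sample characters are arbitrary choices, and the utf-16-be codec is used only so that no byte-order mark is added to the count.

```python
SAMPLES = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001D11E",  # outside the first plane
}


def byte_lengths(char):
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(char.encode("utf-8")), len(char.encode("utf-16-be"))


for name, char in SAMPLES.items():
    utf8_len, utf16_len = byte_lengths(char)
    print(f"U+{ord(char):04X} {name}: UTF-8 {utf8_len}, UTF-16 {utf16_len}")

# ASCII text is byte-for-byte identical in UTF-8, but Latin-1 text is not.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
same_as_latin1 = "\u00e9".encode("utf-8") == "\u00e9".encode("iso-8859-1")
```

Only the plain letter A survives unchanged between ASCII, Latin-1 and UTF-8; the accented letter is one byte in Latin-1 but two in UTF-8, which is why a mislabelled encoding shows up as garbage on screen.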
Peter Flynn writes:
The encoding specification can refer to any character set your software
supports, but the XML Specification only requires that applications
support UTF-8 and UTF-16. Some of the common encodings supported by
US ASCII
Characters TAB, LF, CR, space, and the printable characters 33 to
126 (decimal) only (all other control characters are forbidden by
ISO 8859-1 (Western European Latin-1)
As ASCII plus codes 128 to 255
(decimal). Covers most (but not all) western European accented
ISO 8859-2 to 15
These other planes of ISO 8859 cover the remaining and
different sets of Latin-based alphabetic and other symbols.
AND OTHER OBSOLESCENT SETS
Some software may also
support various obsolete ‘codepages’, such as IBM-850, Microsoft
Windows-1252, Apple Macintosh Roman-8, DEC Multinational and
other non-standard character encodings, but these are generally
non-portable and should be avoided where possible.
One common practice in western Europe is to use ISO-8859-1 so that the
majority of common accented letters can be used as single bytes, and to
use character entity references or numeric entities for all other characters.
This has the advantage that such files can be opened in almost any
single-byte editor. The drawback is that numeric entities are not
mnemonic, and character entities have to be declared in a DTD or internal
subset, but if they are rare, this may not be a serious problem.
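The ISO-8859-1 practice described above can be automated. As a hedged sketch, Python's str.encode accepts an xmlcharrefreplace error handler which writes a decimal numeric character reference for anything the target encoding cannot represent. The function name and the sample text are made up for the example, and the sketch assumes literal markup characters in the text have already been escaped.

```python
def to_latin1_with_char_refs(text):
    """Encode text as ISO-8859-1, substituting numeric character
    references for characters outside that repertoire."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# e-acute and o-acute exist in Latin-1; the euro sign and the
# Polish letters (U+0141, U+017A) do not.
sample = "caf\u00e9: 3\u20ac, \u0141\u00f3d\u017a"
encoded = to_latin1_with_char_refs(sample)
print(encoded)
```

The result stays editable in a single-byte editor, at the cost of the non-mnemonic references mentioned above.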
Bertilo Wennergren writes:
UTF-16 is an encoding that represents each Unicode character of the ﬁrst
plane (the ﬁrst 64K characters) of Unicode with a 16-bit unit — in practice
with two bytes for each character. Thus it is backwards compatible with
neither ASCII nor Latin-1. UTF-16 can also access an additional 1 million
characters by a mechanism known as surrogate pairs (two 16-bit units for
‘...the mechanisms for signalling which of the two are in use, and for
bringing other encodings into play, are [...] in the discussion of character
encodings.’ The XML Specification explains how to specify in your XML file
which coded character set you are using.
‘Regardless of the specific encoding used, any character in the ISO
10646 character set may be referred to by the decimal or hexadecimal
equivalent of its bit string’: so no matter which character set you
personally use, you can still refer to specific individual characters from
elsewhere in the encoded repertoire by using &#dddd; (decimal character
code) or &#xHHHH; (hexadecimal character code, in uppercase). The
terminology can get confusing, as can the numbers: see the ISO 10646
Concept Dictionary. Rick Jelliffe has XML-ised the ISO character entity
is a very useful explanation of the need for correct encoding. There is an
excellent online database of glyphs and characters in many encodings
from the Estonian Language Institute server at http://www.eki.ee/letter/.
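The two forms of numeric character reference are interchangeable, as this sketch suggests. It assumes Python and its standard ElementTree parser; decimal_ref, hex_ref and the price element are names invented for the illustration.

```python
import xml.etree.ElementTree as ET


def decimal_ref(char):
    """Return the &#dddd; form for a single character."""
    return f"&#{ord(char)};"


def hex_ref(char):
    """Return the &#xHHHH; form, with uppercase hexadecimal digits."""
    return f"&#x{ord(char):X};"


euro = "\u20ac"
document = f"<price>{decimal_ref(euro)} {hex_ref(euro)}</price>"
print(document)

# The document itself is pure ASCII, yet the parser delivers the same
# character for both references.
parsed = ET.fromstring(document).text
```

Because the references are resolved by the parser, the application receives the euro sign twice, whatever encoding the file on disk happened to use.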
3.10 What’s a DTD and where do I get one?
A specification of document structure. You can write one or
A DTD is a description in XML Declaration Syntax of a particular type or
class of document. It sets out what names are to be used for the different
types of element, where they may occur, and how they all fit together. A
an XML document itself; and Schemas allow more extensive data-typing.
For example, if you want a document type to be able to describe Lists which
contain Items, the relevant part of your DTD might contain something like
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
This defines a list as an element type containing one or more items (that’s
the plus sign); and it defines items as element types containing just plain text
(Parsed Character Data or PCDATA). Validators read the DTD before they
read your document so that they can identify where every element type
ought to come, what they can contain, and how each relates to the other, so
that applications which need to know this in advance (processors, browsers,
editors, search engines, navigators, and databases) can set themselves up
correctly. The example above lets you create lists like this:
As explained in Question 3.2 on page 35, the indentation in the example is
just for legibility while editing: it is not required by XML. It could just as
easily be written like this:
A DTD therefore provides applications with advance notice of what names
and structures can be used in a particular document type. Using a DTD and a
validating editor means you can be certain that all documents of that
particular type will be constructed and named in a consistent and
DTDs are not required for processing well-formed documents, but they are
needed if you want to take advantage of XML’s special attribute types like the
built-in ID/IDREF cross-reference mechanism; or the use of default attribute
values; or references to external non-XML files (‘Notations’) like images; or if
you simply want a check on document validity before processing.
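As a rough illustration of what the List and Item declarations earlier in this answer constrain, the sketch below hand-codes the same two rules in Python. It is emphatically not a DTD validator (Python's standard library does not include one): conforms_to_list_model is a made-up function that checks only this one content model, and the item text is invented. It also shows that an indented and an unindented document carry the same elements.

```python
import xml.etree.ElementTree as ET

INDENTED = """<List>
  <Item>Chocolate</Item>
  <Item>Music</Item>
</List>"""

COMPACT = "<List><Item>Chocolate</Item><Item>Music</Item></List>"


def conforms_to_list_model(xml_text):
    """Check <!ELEMENT List (Item)+> and <!ELEMENT Item (#PCDATA)> by hand."""
    root = ET.fromstring(xml_text)
    if root.tag != "List" or len(root) == 0:
        return False                      # (Item)+ needs at least one Item
    if (root.text or "").strip():
        return False                      # no loose text directly in List
    for child in root:
        if child.tag != "Item" or len(child) > 0:
            return False                  # only Item, holding text only
        if (child.tail or "").strip():
            return False                  # only white-space between Items
    return True


def item_texts(xml_text):
    """Return the text of every Item, ignoring layout white-space."""
    return [item.text for item in ET.fromstring(xml_text).iter("Item")]


print(conforms_to_list_model(INDENTED), conforms_to_list_model(COMPACT))
print(item_texts(INDENTED) == item_texts(COMPACT))
```

A real validating parser reads these rules from the DTD itself instead of having them written into the program, which is the whole point of declaring them.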
There are thousands of DTDs already in existence in all kinds of areas (see
the SGML/XML Cover Pages for pointers). Many of them can be
downloaded and used freely, but some are restricted to certain industries, or
are proprietary; but you can also write your own (see the question on
creating your own DTD). Old SGML DTDs need to be converted to XML for
use with XML systems: read the question on converting SGML DTDs to
Some XML editors use a binary compiled format of DTD produced by their
own management routines to allow a single person in an organisation to be
in charge of modifications, and to distribute only an unmodifiable (binary
compiled) version to users.
The alternatives to a DTD are various forms of Schema. These provide more
extensive validation features than DTDs, including character data content
3.11 Does XML let me make up my own tags?
Yes but they’re not called tags. They’re element types.
XML lets you make up names for your own element types. If you think tags
and elements are the same thing you are already in considerable trouble:
read the rest of this question carefully.
The same applies if you are thinking in terms of ‘fields’ (see Question 4.8 on
Bob DuCharme writes:
Don’t confuse the term ‘tag’ with the term ‘element’. They are not
interchangeable. An element usually contains two different kinds of tag: a
start-tag and an end-tag, with text or more markup between them.
XML lets you decide which elements you want in your document and
then indicate your element boundaries using the appropriate start- and
end-tags for those elements. Each <!ELEMENT... declaration defines a
type of element that may be used in a document conforming to that DTD.
We call this type of element an ‘element type’. Just as the HTML DTD
includes the H1 and P element types, your document can have color or
price element types, or anything else you want.
Normal (non-empty) elements are made up of a start-tag, the
element’s content, and an end-tag. <color>red</color> is a complete
instance of the color element. <color> is only the start-tag of the
element, showing where it begins; it is not the element itself.
Empty elements are a special case that may be represented either as a
pair of start- and end-tags with nothing between them (eg
<price retail="123"></price>) or as a single empty element start-tag
that has a closing slash to tell the parser ‘don’t go looking for an end-tag to
match this’ (eg <price retail="123"/>).
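That the two ways of writing an empty element are equivalent can be confirmed with any parser. The following sketch assumes Python's standard ElementTree; the describe helper is invented for the example, and it reports what the parser sees: the element type name, the attributes, the text content, and the number of child elements.

```python
import xml.etree.ElementTree as ET


def describe(xml_text):
    """Return (name, attributes, text, child count) for the root element."""
    element = ET.fromstring(xml_text)
    return (element.tag, dict(element.attrib), element.text or "", len(element))


pair = describe('<price retail="123"></price>')   # start-tag plus end-tag
single = describe('<price retail="123"/>')        # empty-element tag
color = describe("<color>red</color>")            # a normal element

print(pair == single)
print(color)
```

After parsing there is no trace of which tags were used to mark the element up: the application is handed elements, not tags.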
3.12 How do I create my own document type?
Analyse the class of documents, and write a DTD or Schema
Document types usually need a formal description, either a DTD or a
Schema. Whilst it is possible to process well-formed XML documents
without any such description, trying to create them without one is asking for
trouble. A DTD or Schema is used with an XML editor or API interface to
guide and control the construction of the document, making sure the right
elements go in the right places.