12.2.2.4 Changing the encoding while parsing
When the parser requires the user agent to change the encoding, it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find a character encoding, or if it found a character encoding that was not the actual encoding of the file.
1. If the encoding that is already being used to interpret the input stream is a UTF-16 encoding, then set the confidence to certain and abort these steps. The new encoding is ignored; if it was anything but the same encoding, then it would be clearly incorrect.
2. If the new encoding is a UTF-16 encoding, then change it to UTF-8.
3. If the new encoding is x-user-defined, then change it to windows-1252.
4. If the new encoding is identical or equivalent to the encoding that is already being used to interpret the input stream, then set the confidence to certain and abort these steps. This happens when the encoding information found in the file matches what the encoding sniffing algorithm determined to be the encoding, and in the second pass through the parser if the first pass found that the encoding sniffing algorithm described in the earlier section failed to find the right encoding.
5. If all the bytes up to the last byte converted by the current decoder have the same Unicode interpretations in both the current encoding and the new encoding, and if the user agent supports changing the converter on the fly, then the user agent may change to the new converter for the encoding on the fly. Set the document's character encoding and the encoding used to convert the input stream to the new encoding, set the confidence to certain, and abort these steps.
6. Otherwise, navigate to the document again, with replacement enabled, and using the same source browsing context, but this time skip the encoding sniffing algorithm and instead just set the encoding to the new encoding and the confidence to certain.
Whenever possible, this should be done without actually contacting the network layer (the bytes should be re-parsed from memory), even if, e.g., the document is marked as not being cacheable. If this is not possible and contacting the network layer would involve repeating a request that uses a method other than `GET`, then instead set the confidence to certain and ignore the new encoding. The resource will be misinterpreted. User agents may notify the user of the situation, to aid in application development.
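The six steps above can be sketched roughly as follows. This is a minimal illustration, not the normative algorithm: `ParserState`, `change_encoding`, the returned action strings, and the `can_switch_on_the_fly` and `can_reparse` flags are hypothetical names introduced here, encoding equivalence is approximated by comparing lower-cased labels, and the byte comparison in step 5 relies on Python's own codecs, which may not match a browser's decoders exactly.

```python
from dataclasses import dataclass

# Assumption: these labels stand in for "a UTF-16 encoding".
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


@dataclass
class ParserState:
    encoding: str     # encoding currently used to interpret the input stream
    confidence: str   # "tentative" or "certain"
    converted: bytes  # bytes already converted by the current decoder


def _same_interpretation(data: bytes, enc_a: str, enc_b: str) -> bool:
    """True if both encodings decode the bytes to the same text."""
    try:
        return data.decode(enc_a) == data.decode(enc_b)
    except (UnicodeDecodeError, LookupError):
        return False


def change_encoding(state: ParserState, new_encoding: str,
                    can_switch_on_the_fly: bool = True,
                    can_reparse: bool = True) -> str:
    """Return "keep", "switch" or "renavigate" (hypothetical action names)."""
    current = state.encoding.strip().lower()
    new = new_encoding.strip().lower()

    # Step 1: a UTF-16 input stream always wins; the new encoding is ignored.
    if current in UTF16_LABELS:
        state.confidence = "certain"
        return "keep"
    # Step 2: a declared UTF-16 encoding is treated as UTF-8.
    if new in UTF16_LABELS:
        new = "utf-8"
    # Step 3: x-user-defined is treated as windows-1252.
    if new == "x-user-defined":
        new = "windows-1252"
    # Step 4: nothing to change if the encodings already agree.
    if new == current:
        state.confidence = "certain"
        return "keep"
    # Step 5: switch decoders in place if the bytes seen so far read the same.
    if can_switch_on_the_fly and _same_interpretation(state.converted, current, new):
        state.encoding = new
        state.confidence = "certain"
        return "switch"
    # Step 6: otherwise re-parse with the new encoding, unless that would
    # require repeating a non-GET request; then the new encoding is ignored.
    state.confidence = "certain"
    if not can_reparse:
        return "keep"
    state.encoding = new
    return "renavigate"
```

For example, a stream tentatively decoded as windows-1252 that has so far contained only ASCII bytes can switch to UTF-8 in place, whereas one that has already decoded the byte 0xE9 cannot, because that byte alone is not valid UTF-8.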
12.2.2.5 Preprocessing the input stream
The input stream consists of the characters pushed into it as the input byte stream is decoded or from the various APIs that directly manipulate the input stream.
Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and
characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF,
U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These
are all control characters or permanently undefined Unicode characters (noncharacters).
Any character that is not a Unicode character, i.e. any isolated surrogate, is a parse error. (These can only find their way into the input stream via script APIs such as document.write().)
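The two rules above amount to a membership test on code points. The sketch below is one way to express it; the function names `is_preprocessing_parse_error` and `find_parse_errors` are hypothetical, and the code assumes the input is a Python `str`, in which any surrogate code point is by construction an isolated one.

```python
def is_preprocessing_parse_error(cp: int) -> bool:
    """True if the code point is one of those listed above as a parse error."""
    # Control characters: U+0001..U+0008, U+000B, U+000E..U+001F, U+007F..U+009F.
    # U+0000, tab, LF, FF and CR are deliberately not in these ranges.
    if 0x0001 <= cp <= 0x0008 or cp == 0x000B:
        return True
    if 0x000E <= cp <= 0x001F or 0x007F <= cp <= 0x009F:
        return True
    # Noncharacters U+FDD0..U+FDEF.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Noncharacters U+FFFE and U+FFFF, and the last two code points of each
    # supplementary plane (U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    if cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE:
        return True
    # Isolated surrogates: not Unicode characters at all.
    if 0xD800 <= cp <= 0xDFFF:
        return True
    return False


def find_parse_errors(text: str) -> list[int]:
    """Return the indices in text whose characters are parse errors."""
    return [i for i, ch in enumerate(text)
            if is_preprocessing_parse_error(ord(ch))]
```

The bit mask works because every listed noncharacter outside U+FDD0 to U+FDEF has its low 16 bits equal to FFFE or FFFF.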
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF) characters are treated specially. Any LF character that immediately
follows a CR character must be ignored, and all CR characters must then be converted to LF characters. Thus, newlines in HTML DOMs are
represented by LF characters, and there are never any CR characters in the input to the tokenization stage.
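For a complete string, the newline rule above can be sketched as two replacements. The function name `normalize_newlines` is hypothetical, and the sketch assumes the whole input is available at once; a streaming implementation would also need to remember whether the previous chunk ended in CR.

```python
def normalize_newlines(text: str) -> str:
    """Drop any LF that immediately follows a CR, then turn every CR into LF."""
    # Replacing CR LF with LF has the same effect as ignoring the LF and
    # converting the CR; the second pass converts the remaining lone CRs.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

With this, `"a\r\nb\rc\nd"` becomes `"a\nb\nc\nd"`, and the output never contains a CR character.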
The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.
The insertion point is the position (just before a character or just before the end of the input stream) where content inserted using document.write() is actually inserted. The insertion point is relative to the position of the character immediately after it; it is not an absolute offset into the input stream. Initially, the insertion point is undefined.
The "EOF" character in the tables below is a conceptual character representing the end of the input stream. If the parser is a script-created parser, then the end of the input stream is reached when an explicit "EOF" character (inserted by the document.close() method) is consumed. Otherwise, the "EOF" character is not a real character in the stream, but rather the lack of any further characters.
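The terms defined above (next input character, current input character, insertion point, and "EOF") can be pictured with a small model. This is only a sketch under stated assumptions: the `InputStream` class, its attribute names, and the `EOF` sentinel are hypothetical, the stream is held as an in-memory list, and the explicit "EOF" character of script-created parsers is not modelled.

```python
EOF = object()  # conceptual end-of-stream marker, not a real character


class InputStream:
    def __init__(self, text: str = "") -> None:
        self.chars = list(text)
        self.pos = 0                 # index of the next input character
        self.current = None          # current input character (last consumed)
        self.insertion_point = None  # None models "undefined"

    def next_input_character(self):
        """First character not yet consumed, or EOF if none remain."""
        return self.chars[self.pos] if self.pos < len(self.chars) else EOF

    def consume(self):
        """Consume the next input character and make it the current one."""
        ch = self.next_input_character()
        if ch is not EOF:
            self.pos += 1
        self.current = ch
        return ch

    def insert_at_insertion_point(self, text: str) -> None:
        """Insert text just before the character the insertion point precedes."""
        if self.insertion_point is None:
            raise ValueError("insertion point is undefined")
        ip = self.insertion_point
        self.chars[ip:ip] = list(text)
        # The insertion point stays just before the same character, so its
        # numeric offset moves past the inserted text.
        self.insertion_point = ip + len(text)
```

Storing the insertion point as an index and shifting it on every insert is one way to keep it tied to the character that follows it rather than to an absolute offset.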
Note: The algorithm for changing the encoding while parsing is only invoked when a new encoding is found declared on a meta element.