Thursday, 21 November 2013

http://html5readiness.com/

12.2 Parsing HTML documents

Implemented and widely deployed
Latest Internet Explorer beta: incomplete support
Latest Firefox trunk nightly build: buggy support
Latest WebKit or Chromium trunk build: buggy support
Latest Opera beta or preview build: buggy support
JavaScript libraries, plugins, etc: unknown
This section only applies to user agents, data mining tools, and conformance checkers.
The rules for parsing XML documents into DOM trees are covered by the next section, entitled "The XHTML syntax".
User agents must use the parsing rules described in this section to generate the DOM trees from text/html resources. Together, these rules define what is referred to as the HTML parser.
While the HTML syntax described in this specification bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.
Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed Web browsers interoperably implemented a different representation — has wasted decades of productivity. This version of HTML thus returns to a non-SGML basis.
Authors interested in using SGML tools in their authoring pipeline are encouraged to use XML tools and the XML serialization of HTML.
This specification defines the parsing rules for HTML documents, whether they are syntactically correct or not. Certain points in the parsing algorithm are said to be parse errors. The error handling for parse errors is well-defined (that's the processing rules described throughout this specification), but user agents, while parsing an HTML document, may abort the parser at the first parse error that they encounter for which they do not wish to apply the rules described in this specification.
Conformance checkers must report at least one parse error condition to the user if one or more parse error conditions exist in the document and must not report parse error conditions if none exist in the document. Conformance checkers may report more than one parse error condition if more than one parse error condition exists in the document.
Parse errors are only errors with the syntax of HTML. In addition to checking for parse errors, conformance checkers will also verify that the document obeys all the other conformance requirements described in this specification.
For the purposes of conformance checkers, if a resource is determined to be in the HTML syntax, then it is an HTML document.
As stated in the terminology section, references to element types that do not explicitly specify a namespace always refer to elements in the HTML namespace. For example, if the spec talks about "a menuitem element", then that is an element with the local name "menuitem", the namespace "http://www.w3.org/1999/xhtml", and the interface HTMLMenuItemElement. Where possible, references to such elements are hyperlinked to their definition.

12.2.1 Overview of the parsing model

Ready for first implementations
Latest Internet Explorer beta: excellent support
Latest Firefox trunk nightly build: excellent support
Latest WebKit or Chromium trunk build: excellent support
Latest Opera beta or preview build: excellent support
JavaScript libraries, plugins, etc: not applicable

The input to the HTML parsing process consists of a stream of Unicode code points, which is passed through a tokenization stage followed by a tree construction stage. The output is a Document object.
Implementations that do not support scripting do not have to actually create a DOM Document object, but the DOM tree in such cases is still used as the model for the rest of the specification.
In the common case, the data handled by the tokenization stage comes from the network, but it can also come from script running in the user agent, e.g. using the document.write() API.
There is only one set of states for the tokenizer stage and the tree construction stage, but the tree construction stage is reentrant, meaning that while the tree construction stage is handling one token, the tokenizer might be resumed, causing further tokens to be emitted and processed before the first token's processing is complete.
In the following example, the tree construction stage will be called upon to handle a "p" start tag token while handling the "script" end tag token:
...
<script>
 document.write('<p>');
</script>
...
To handle these cases, parsers have a script nesting level, which must be initially set to zero, and a parser pause flag, which must be initially set to false.
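The re-entrancy described above can be illustrated with a toy model. This is only a sketch: `ToyParser`, `handle_token`, and `write` are hypothetical names, tokens are plain strings, and none of the real tree construction rules are implemented. It shows only that a "p" start tag can be handled while the "script" end tag is still being processed.

```python
class ToyParser:
    """Toy model of re-entrant token handling (not the real algorithm)."""

    def __init__(self):
        self.script_nesting_level = 0    # initially zero, as stated above
        self.parser_pause_flag = False   # initially false, as stated above
        self.log = []

    def handle_token(self, token, script=None):
        self.log.append(f"begin {token}")
        if token == "</script>" and script is not None:
            # Running the script may feed more tokens into the parser
            # before this call returns.
            self.script_nesting_level += 1
            script(self)
            self.script_nesting_level -= 1
        self.log.append(f"end {token}")

    def write(self, token):
        # Hypothetical stand-in for document.write(): the written token
        # is handled immediately, inside the outer handle_token() call.
        self.handle_token(token)


parser = ToyParser()
parser.handle_token("</script>", script=lambda p: p.write("<p>"))
print(parser.log)
```

In the log, the entries for "<p>" sit between the begin and end entries for "</script>", which is the nesting that the script nesting level is meant to track.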

12.2.2 The input byte stream

The stream of Unicode code points that comprises the input to the tokenization stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a particular character encoding, which the user agent uses to decode the bytes into characters.
For XML documents, the algorithm user agents must use to determine the character encoding is given by the XML specification. This section does not apply to XML documents. [XML]
Usually, the encoding sniffing algorithm defined below is used to determine the character encoding.
Given a character encoding, the bytes in the input byte stream must be converted to Unicode code points for the tokenizer's input stream, as described by the rules for that encoding's decoder.
Bytes or sequences of bytes in the original byte stream that did not conform to the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input byte stream) are errors that conformance checkers are expected to report.
Leading Byte Order Marks (BOMs) are not stripped by the decoder algorithms; they are stripped by the algorithm below.
The decoder algorithms describe how to handle invalid input; for security reasons, it is imperative that those rules be followed precisely. Differences in how invalid byte sequences are handled can result in, amongst other problems, script injection vulnerabilities ("XSS").
When the HTML parser is decoding an input byte stream, it uses a character encoding and a confidence. The confidence is either tentative, certain, or irrelevant. The encoding used, and whether the confidence in that encoding is tentative or certain, is used during the parsing to determine whether to change the encoding. If no encoding is necessary, e.g. because the parser is operating on a Unicode stream and doesn't have to use a character encoding at all, then the confidence is irrelevant.
Some algorithms feed the parser by directly adding characters to the input stream rather than adding bytes to the input byte stream.
12.2.2.1 Parsing with a known character encoding
When the HTML parser is to operate on an input byte stream that has a known definite encoding, then the character encoding is that encoding and the confidence is certain.
12.2.2.2 Determining the character encoding
Ready for first implementations
Latest Internet Explorer beta: unknown
Latest Firefox trunk nightly build: buggy support
Latest WebKit or Chromium trunk build: buggy support
Latest Opera beta or preview build: buggy support
JavaScript libraries, plugins, etc: unknown
In some cases, it might be impractical to unambiguously determine the encoding before parsing the document. Because of this, this specification provides for a two-pass mechanism with an optional pre-scan. Implementations are allowed, as described below, to apply a simplified parsing algorithm to whatever bytes they have available before beginning to parse the document. Then, the real parser is started, using a tentative encoding derived from this pre-parse and other out-of-band metadata. If, while the document is being loaded, the user agent discovers a character encoding declaration that conflicts with this information, then the parser can get reinvoked to perform a parse of the document with the real encoding.
User agents must use the following algorithm, called the encoding sniffing algorithm, to determine the character encoding to use when decoding a document in the first pass. This algorithm takes as input any out-of-band metadata available to the user agent (e.g. the Content-Type metadata of the document) and all the bytes available so far, and returns a character encoding and a confidence that is either tentative or certain.
  1. If the user has explicitly instructed the user agent to override the document's character encoding with a specific encoding, optionally return that encoding with the confidence certain and abort these steps.
    Typically, user agents remember such user requests across sessions, and in some cases apply them to documents in iframes as well.
  2. The user agent may wait for more bytes of the resource to be available, either in this step or at any later step in this algorithm. For instance, a user agent might wait 500ms or 1024 bytes, whichever came first. In general preparsing the source to find the encoding improves performance, as it reduces the need to throw away the data structures used when parsing upon finding the encoding information. However, if the user agent delays too long to obtain data to determine the encoding, then the cost of the delay could outweigh any performance improvements from the preparse.
    The authoring conformance requirements for character encoding declarations limit them to only appearing in the first 1024 bytes. User agents are therefore encouraged to use the prescan algorithm below (as invoked by these steps) on the first 1024 bytes, but not to stall beyond that.
  3. For each of the rows in the following table, starting with the first one and going down, if there are as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then return the encoding given in the cell in the second column of that row, with the confidence certain, and abort these steps:
    Bytes in hexadecimal    Encoding
    FE FF                   Big-endian UTF-16
    FF FE                   Little-endian UTF-16
    EF BB BF                UTF-8
    This step looks for Unicode Byte Order Marks (BOMs).
    That this step happens before the next one honoring the HTTP Content-Type header is a willful violation of the HTTP specification, motivated by a desire to be maximally compatible with legacy content. [HTTP]
  4. If the transport layer specifies a character encoding, and it is supported, return that encoding with the confidence certain, and abort these steps.
  5. Optionally prescan the byte stream to determine its encoding. The end condition is that the user agent decides that scanning further bytes would not be efficient. User agents are encouraged to only prescan the first 1024 bytes. User agents may decide that scanning any bytes is not efficient, in which case these substeps are entirely skipped.
    The aforementioned algorithm either aborts unsuccessfully or returns a character encoding. If it returns a character encoding, then this algorithm must be aborted, returning the same encoding, with confidence tentative.
  6. If the HTML parser for which this algorithm is being run is associated with a Document that is itself in a nested browsing context, run these substeps:
    1. Let new document be the Document with which the HTML parser is associated.
    2. Let parent document be the Document through which new document is nested (the active document of the parent browsing context of new document).
    3. If parent document's origin is not the same origin as new document's origin, then abort these substeps.
    4. If parent document's character encoding is not an ASCII-compatible character encoding, then abort these substeps.
    5. Return parent document's character encoding, with the confidence tentative, and abort the encoding sniffing algorithm's steps.
  7. Otherwise, if the user agent has information on the likely encoding for this page, e.g. based on the encoding of the page when it was last visited, then return that encoding, with the confidence tentative, and abort these steps.
  8. The user agent may attempt to autodetect the character encoding from applying frequency analysis or other algorithms to the data stream. Such algorithms may use information about the resource other than the resource's contents, including the address of the resource. If autodetection succeeds in determining a character encoding, and that encoding is a supported encoding, then return that encoding, with the confidence tentative, and abort these steps. [UNIVCHARDET]
    The UTF-8 encoding has a highly detectable bit pattern. Documents that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very likely to be UTF-8, while documents with byte sequences that do not match it are very likely not. User agents are therefore encouraged to search for this common encoding. [PPUTF8] [UTF8DET]
  9. Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative.
    In controlled environments or in environments where the encoding of documents can be prescribed (for example, for user agents intended for dedicated use in new networks), the comprehensive UTF-8 encoding is suggested.
    In other environments, the default encoding is typically dependent on the user's locale (an approximation of the languages, and thus often encodings, of the pages that the user is likely to frequent). The following table gives suggested defaults based on the user's locale, for compatibility with legacy content. Locales are identified by BCP 47 language tags. [BCP47] [ENCODING]
    Locale  Language                                Suggested default encoding
    ar      Arabic                                  windows-1256
    ba      Bashkir                                 windows-1251
    be      Belarusian                              windows-1251
    bg      Bulgarian                               windows-1251
    cs      Czech                                   windows-1250
    el      Greek                                   ISO-8859-7
    et      Estonian                                windows-1257
    fa      Persian                                 windows-1256
    he      Hebrew                                  windows-1255
    hr      Croatian                                windows-1250
    hu      Hungarian                               ISO-8859-2
    ja      Japanese                                Shift_JIS
    kk      Kazakh                                  windows-1251
    ko      Korean                                  euc-kr
    ku      Kurdish                                 windows-1254
    ky      Kyrgyz                                  windows-1251
    lt      Lithuanian                              windows-1257
    lv      Latvian                                 windows-1257
    mk      Macedonian                              windows-1251
    pl      Polish                                  ISO-8859-2
    ru      Russian                                 windows-1251
    sah     Yakut                                   windows-1251
    sk      Slovak                                  windows-1250
    sl      Slovenian                               ISO-8859-2
    sr      Serbian                                 windows-1251
    tg      Tajik                                   windows-1251
    th      Thai                                    windows-874
    tr      Turkish                                 windows-1254
    tt      Tatar                                   windows-1251
    uk      Ukrainian                               windows-1251
    vi      Vietnamese                              windows-1258
    zh-CN   Chinese (People's Republic of China)    GB18030
    zh-TW   Chinese (Taiwan)                        Big5
    All other locales                               windows-1252
    The contents of this table are derived from the intersection of Windows, Chrome, and Firefox defaults.
The document's character encoding must immediately be set to the value returned from this algorithm, at the same time as the user agent uses the returned value to select the decoder to use for the input byte stream.
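As a rough illustration, the BOM check (step 3), the transport-layer check (step 4), and the locale-based default (step 9) might be sketched as follows. The function name `sniff_encoding`, the `SUPPORTED` set, the short encoding labels "UTF-16BE" and "UTF-16LE", and the abbreviated `LOCALE_DEFAULTS` table are assumptions made for this example. The user override, prescan, nested browsing context, history, and autodetection steps are deliberately left out.

```python
# Step 3 table: byte order marks, checked in the order given above.
BOMS = [
    (b"\xFE\xFF", "UTF-16BE"),      # big-endian UTF-16
    (b"\xFF\xFE", "UTF-16LE"),      # little-endian UTF-16
    (b"\xEF\xBB\xBF", "UTF-8"),
]

# Hypothetical set of encodings this toy user agent supports.
SUPPORTED = {"UTF-8", "windows-1251", "windows-1252", "Shift_JIS", "GB18030"}

# A small subset of the suggested locale defaults listed above.
LOCALE_DEFAULTS = {"ja": "Shift_JIS", "ru": "windows-1251", "zh-CN": "GB18030"}


def sniff_encoding(data, transport_charset=None, locale="en"):
    """Return (encoding, confidence) for the bytes seen so far."""
    # Step 3: a BOM wins even over the transport layer.
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, "certain"
    # Step 4: encoding declared by the transport layer, if supported.
    if transport_charset in SUPPORTED:
        return transport_charset, "certain"
    # Step 9: fall back to a locale-dependent default.
    return LOCALE_DEFAULTS.get(locale, "windows-1252"), "tentative"


print(sniff_encoding(b"\xEF\xBB\xBF<p>", transport_charset="windows-1251"))
print(sniff_encoding(b"<p>", locale="ja"))
```

Checking the BOM before the transport-layer value mirrors the ordering the text above describes as a willful violation of the HTTP specification.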

When an algorithm requires a user agent to prescan a byte stream to determine its encoding, given some defined end condition, then it must run the following steps. These steps either abort unsuccessfully or return a character encoding. If at any point during these steps (including during instances of the get an attribute algorithm invoked by this one) the user agent either runs out of bytes (meaning the position pointer created in the first step below goes beyond the end of the byte stream obtained so far) or reaches its end condition, then abort the prescan a byte stream to determine its encoding algorithm unsuccessfully.
  1. Let position be a pointer to a byte in the input byte stream, initially pointing at the first byte.
  2. Loop: If position points to:
    A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')
    Advance the position pointer so that it points at the first 0x3E byte which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the 0x3C byte that was found. (The two 0x2D bytes can be the same as those in the '<!--' sequence.)
    A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)
    1. Advance the position pointer so that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or 0x2F byte (the one in the sequence of characters matched above).
    2. Let attribute list be an empty list of strings.
    3. Let got pragma be false.
    4. Let need pragma be null.
    5. Let charset be the null value (which, for the purposes of this algorithm, is distinct from an unrecognised encoding or the empty string).
    6. Attributes: Get an attribute and its value. If no attribute was sniffed, then jump to the processing step below.
    7. If the attribute's name is already in attribute list, then return to the step labeled attributes.
    8. Add the attribute's name to attribute list.
    9. Run the appropriate step from the following list, if one applies:
      If the attribute's name is "http-equiv"
      If the attribute's value is "content-type", then set got pragma to true.
      If the attribute's name is "content"
      Apply the algorithm for extracting a character encoding from a meta element, giving the attribute's value as the string to parse. If a character encoding is returned, and if charset is still set to null, let charset be the encoding returned, and set need pragma to true.
      If the attribute's name is "charset"
      Let charset be the result of getting an encoding from the attribute's value, and set need pragma to false.
    10. Return to the step labeled attributes.
    11. Processing: If need pragma is null, then jump to the step below labeled next byte.
    12. If need pragma is true but got pragma is false, then jump to the step below labeled next byte.
    13. If charset is a UTF-16 encoding, change the value of charset to UTF-8.
    14. If charset is not a supported character encoding, then jump to the step below labeled next byte.
    15. Abort the prescan a byte stream to determine its encoding algorithm, returning the encoding given by charset.
    A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)
    1. Advance the position pointer so that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII >) byte.
    2. Repeatedly get an attribute until no further attributes can be found, then jump to the step below labeled next byte.
    A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')
    A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')
    A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')
    Advance the position pointer so that it points at the first 0x3E byte (ASCII >) that comes after the 0x3C byte that was found.
    Any other byte
    Do nothing with that byte.
  3. Next byte: Move position so it points at the next byte in the input byte stream, and return to the step above labeled loop.
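The comment-skipping case of the loop has one subtlety: the two 0x2D bytes that precede the closing 0x3E may be the same ones that opened the comment, so '<!-->' counts as a complete comment. A minimal sketch of just that case follows. `skip_comment` is a hypothetical helper name, and returning `None` stands in for "ran out of bytes", at which point the prescan aborts unsuccessfully.

```python
def skip_comment(data, position):
    """Given position at the 0x3C of '<!--', return the index of the
    closing 0x3E byte, or None if the available bytes run out first."""
    if not data.startswith(b"<!--", position):
        raise ValueError("position does not point at '<!--'")
    # Start searching at the first 0x2D so that the dashes of '<!--'
    # themselves can be the two dashes before the closing '>'.
    end = data.find(b"-->", position + 2)
    return None if end == -1 else end + 2


print(skip_comment(b"<!-- x --><meta charset=utf-8>", 0))
print(skip_comment(b"<!-->rest", 0))
print(skip_comment(b"<!-- never closed", 0))
```

After this returns an index, the "next byte" step moves position one byte further, past the 0x3E.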
When the prescan a byte stream to determine its encoding algorithm says to get an attribute, it means doing this:
  1. If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x2F (ASCII /) then advance position to the next byte and redo this step.
  2. If the byte at position is 0x3E (ASCII >), then abort the get an attribute algorithm. There isn't one.
  3. Otherwise, the byte at position is the start of the attribute name. Let attribute name and attribute value be the empty string.
  4. Process the byte at position as follows:
    If it is 0x3D (ASCII =), and the attribute name is longer than the empty string
    Advance position to the next byte and jump to the step below labeled value.
    If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space)
    Jump to the step below labeled spaces.
    If it is 0x2F (ASCII /) or 0x3E (ASCII >)
    Abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
    If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)
    Append the Unicode character with code point b+0x20 to attribute name (where b is the value of the byte at position). (This converts the input to lowercase.)
    Anything else
    Append the Unicode character with the same code point as the value of the byte at position to attribute name. (It doesn't actually matter how bytes outside the ASCII range are handled here, since only ASCII characters can contribute to the detection of a character encoding.)
  5. Advance position to the next byte and return to the previous step.
  6. Spaces: If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then advance position to the next byte, then, repeat this step.
  7. If the byte at position is not 0x3D (ASCII =), abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
  8. Advance position past the 0x3D (ASCII =) byte.
  9. Value: If the byte at position is one of 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then advance position to the next byte, then, repeat this step.
  10. Process the byte at position as follows:
    If it is 0x22 (ASCII ") or 0x27 (ASCII ')
    1. Let b be the value of the byte at position.
    2. Quote loop: Advance position to the next byte.
    3. If the value of the byte at position is the value of b, then advance position to the next byte and abort the "get an attribute" algorithm. The attribute's name is the value of attribute name, and its value is the value of attribute value.
    4. Otherwise, if the value of the byte at position is in the range 0x41 (ASCII A) to 0x5A (ASCII Z), then append a Unicode character to attribute value whose code point is 0x20 more than the value of the byte at position.
    5. Otherwise, append a Unicode character to attribute value whose code point is the same as the value of the byte at position.
    6. Return to the step above labeled quote loop.
    If it is 0x3E (ASCII >)
    Abort the get an attribute algorithm. The attribute's name is the value of attribute name, its value is the empty string.
    If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)
    Append the Unicode character with code point b+0x20 to attribute value (where b is the value of the byte at position). Advance position to the next byte.
    Anything else
    Append the Unicode character with the same code point as the value of the byte at position to attribute value. Advance position to the next byte.
  11. Process the byte at position as follows:
    If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII >)
    Abort the get an attribute algorithm. The attribute's name is the value of attribute name and its value is the value of attribute value.
    If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)
    Append the Unicode character with code point b+0x20 to attribute value (where b is the value of the byte at position).
    Anything else
    Append the Unicode character with the same code point as the value of the byte at position to attribute value.
  12. Advance position to the next byte and return to the previous step.
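The "get an attribute" steps translate fairly directly into code. The following is a sketch under a few assumptions: the function returns a `(name, value, new_position)` tuple, uses `(None, None, position)` to signal that there is no attribute, and raises a hypothetical `OutOfBytes` exception when position passes the end of the available bytes (the point where the prescan would abort unsuccessfully).

```python
WHITESPACE = b"\t\n\x0c\r "


class OutOfBytes(Exception):
    """Position moved past the bytes obtained so far."""


def _lower(b):
    # Bytes 0x41-0x5A (ASCII A-Z) are lowercased by adding 0x20.
    return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)


def get_attribute(data, pos):
    def at(i):
        if i >= len(data):
            raise OutOfBytes
        return data[i]

    while at(pos) in b"\t\n\x0c\r /":          # step 1
        pos += 1
    if at(pos) == 0x3E:                        # step 2: '>' means none
        return None, None, pos
    name, value = "", ""                       # step 3
    while True:                                # steps 4-5: read the name
        b = at(pos)
        if b == 0x3D and name:
            pos += 1
            break
        if b in WHITESPACE:
            while at(pos) in WHITESPACE:       # step 6 (spaces)
                pos += 1
            if at(pos) != 0x3D:                # step 7
                return name, "", pos
            pos += 1                           # step 8
            break
        if b in b"/>":
            return name, "", pos
        name += _lower(b)
        pos += 1
    while at(pos) in WHITESPACE:               # step 9 (value)
        pos += 1
    b = at(pos)                                # step 10
    if b in b"\"'":
        quote = b
        while True:                            # quote loop
            pos += 1
            b = at(pos)
            if b == quote:
                return name, value, pos + 1
            value += _lower(b)
    if b == 0x3E:
        return name, "", pos
    value += _lower(b)
    pos += 1
    while True:                                # steps 11-12: unquoted value
        b = at(pos)
        if b in b"\t\n\x0c\r >":
            return name, value, pos
        value += _lower(b)
        pos += 1


print(get_attribute(b' CHARSET="UTF-8">', 0))
```

Both the name and the value come back lowercased, which is why later comparisons such as the one against "content-type" can be made directly.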
For the sake of interoperability, user agents should not use a pre-scan algorithm that returns different results than the one described above. (But, if you do, please at least let us know, so that we can improve this algorithm and benefit everyone...)
12.2.2.3 Character encodings
Ready for first implementations
User agents must support the encodings defined in the WHATWG Encoding standard. User agents should not support other encodings.
User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU encodings. [CESU8] [UTF7] [BOCU1] [SCSU]
Support for encodings based on EBCDIC is especially discouraged. This encoding is rarely used for publicly-facing Web content. Support for UTF-32 is also especially discouraged. This encoding is rarely used, and frequently implemented incorrectly.
This specification does not make any attempt to support EBCDIC-based encodings and UTF-32 in its algorithms; support and use of these encodings can thus lead to unexpected behavior in implementations of this specification.
12.2.2.4 Changing the encoding while parsing
Ready for first implementations
When the parser requires the user agent to change the encoding, it must run the following steps. This might happen if the encoding sniffing algorithm described above failed to find a character encoding, or if it found a character encoding that was not the actual encoding of the file.
  1. If the encoding that is already being used to interpret the input stream is a UTF-16 encoding, then set the confidence to certain and abort these steps. The new encoding is ignored; if it was anything but the same encoding, then it would be clearly incorrect.
  2. If the new encoding is a UTF-16 encoding, change it to UTF-8.
  3. If the new encoding is identical or equivalent to the encoding that is already being used to interpret the input stream, then set the confidence to certain and abort these steps. This happens when the encoding information found in the file matches what the encoding sniffing algorithm determined to be the encoding, and in the second pass through the parser if the first pass found that the encoding sniffing algorithm described in the earlier section failed to find the right encoding.
  4. If all the bytes up to the last byte converted by the current decoder have the same Unicode interpretations in both the current encoding and the new encoding, and if the user agent supports changing the converter on the fly, then the user agent may change to the new converter for the encoding on the fly. Set the document's character encoding and the encoding used to convert the input stream to the new encoding, set the confidence to certain, and abort these steps.
  5. Otherwise, navigate to the document again, with replacement enabled, and using the same source browsing context, but this time skip the encoding sniffing algorithm and instead just set the encoding to the new encoding and the confidence to certain. Whenever possible, this should be done without actually contacting the network layer (the bytes should be re-parsed from memory), even if, e.g., the document is marked as not being cacheable. If this is not possible and contacting the network layer would involve repeating a request that uses a method other than HTTP GET (or equivalent for non-HTTP URLs), then instead set the confidence to certain and ignore the new encoding. The resource will be misinterpreted. User agents may notify the user of the situation, to aid in application development.
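The decision structure of these five steps can be summarized in a small function. This is a simplified sketch: `change_encoding` and its action strings are hypothetical, encodings are compared by case-insensitive name only, and the conditions of step 4 (identical interpretation of the bytes decoded so far, plus support for switching converters) are collapsed into a single boolean. The non-GET fallback in step 5 is not modeled. In every outcome the confidence ends up as certain.

```python
UTF16 = {"utf-16be", "utf-16le"}


def change_encoding(current, new, can_switch_on_the_fly):
    """Return (encoding_to_use, action); action is one of
    'keep', 'switch', or 'reparse'."""
    # Step 1: a UTF-16 stream cannot plausibly be anything else.
    if current.lower() in UTF16:
        return current, "keep"
    # Step 2: a declared UTF-16 encoding is treated as UTF-8.
    if new.lower() in UTF16:
        new = "UTF-8"
    # Step 3: the declaration agrees with what is already in use.
    if new.lower() == current.lower():
        return current, "keep"
    # Step 4: change the converter without restarting the parse.
    if can_switch_on_the_fly:
        return new, "switch"
    # Step 5: navigate again and parse with the new encoding.
    return new, "reparse"


print(change_encoding("windows-1252", "Shift_JIS", False))
```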
12.2.2.5 Preprocessing the input stream
Ready for first implementations
The input stream consists of the characters pushed into it as the input byte stream is decoded or from the various APIs that directly manipulate the input stream.
One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present in the input stream.
The requirement to strip a U+FEFF BYTE ORDER MARK character regardless of whether that character was used to determine the byte order is a willful violation of Unicode, motivated by a desire to increase the resilience of user agents in the face of naïve transcoders.
Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
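The long list above has a regular shape: a few control-character ranges, the block U+FDD0 to U+FDEF, and the last two code points of each of the seventeen planes. A sketch of a check for it might look like this. The function name is hypothetical, and isolated surrogates and U+0000 are handled by other rules and are not covered here.

```python
def is_preprocessing_parse_error(cp):
    """True if code point cp is one of the control characters or
    noncharacters listed above."""
    if 0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F:
        return True
    if 0x007F <= cp <= 0x009F or cp == 0x000B:
        return True
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF:
    # the low 16 bits are FFFE or FFFF in every plane.
    return cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


print([hex(c) for c in (0x09, 0x0B, 0x7F, 0xFFFD, 0x1FFFE)
       if is_preprocessing_parse_error(c)])
```

Tab (U+0009), LF, FF, and CR fall outside the listed ranges and are therefore not flagged.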
Any character that is not a Unicode character, i.e. any isolated surrogate, is a parse error. (These can only find their way into the input stream via script APIs such as document.write().)
U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF) characters are treated specially. All CR characters must be converted to LF characters, and any LF characters that immediately follow a CR character must be ignored. Thus, newlines in HTML DOMs are represented by LF characters, and there are never any CR characters in the input to the tokenization stage.
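On a complete string, the newline rule amounts to turning each CR LF pair into a single LF and then each remaining CR into LF. A minimal sketch, with `normalize_newlines` as a hypothetical name; a streaming parser would instead need to remember whether the previous character was a CR.

```python
def normalize_newlines(text):
    # Replace CR LF pairs first, so the LF after a CR is dropped rather
    # than doubled; then convert every remaining lone CR to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("a\r\nb\rc\n\rd")))
```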
The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored by the requirements in this section. Initially, the next input character is the first character in the input. The current input character is the last character to have been consumed.
The insertion point is the position (just before a character or just before the end of the input stream) where content inserted using document.write() is actually inserted. The insertion point is relative to the position of the character immediately after it; it is not an absolute offset into the input stream. Initially, the insertion point is undefined.
The "EOF" character in the tables below is a conceptual character representing the end of the input stream. If the parser is a script-created parser, then the end of the input stream is reached when an explicit "EOF" character (inserted by the document.close() method) is consumed. Otherwise, the "EOF" character is not a real character in the stream, but rather the lack of any further characters.
The handling of U+0000 NULL characters varies based on where the characters are found. In general, they are ignored except where doing so could plausibly introduce an attack vector. This handling is, by necessity, spread across both the tokenization stage and the tree construction stage.

12.2.3 Parse state

Ready for first implementations
12.2.3.1 The insertion mode
Ready for first implementations
The insertion mode is a state variable that controls the primary operation of the tree construction stage.
Initially, the insertion mode is "initial". It can change to "before html", "before head", "in head", "in head noscript", "after head", "in body", "text", "in table", "in table text", "in caption", "in column group", "in table body", "in row", "in cell", "in select", "in select in table", "in template", "after body", "in frameset", "after frameset", "after after body", and "after after frameset" during the course of the parsing, as described in the tree construction stage. The insertion mode affects how tokens are processed and whether CDATA sections are supported.
Several of these modes, namely "in head", "in body", "in table", and "in select", are special, in that the other modes defer to them at various times. When the algorithm below says that the user agent is to do something "using the rules for the m insertion mode", where m is one of these modes, the user agent must use the rules described under the m insertion mode's section, but must leave the insertion mode unchanged unless the rules in m themselves switch the insertion mode to a new value.
When the insertion mode is switched to "text" or "in table text", the original insertion mode is also set. This is the insertion mode to which the tree construction stage will return.
Similarly, to parse nested template elements, a stack of template insertion modes is used. It is initially empty. The current template insertion mode is the insertion mode that was most recently added to the stack of template insertion modes. The algorithms in the sections below will push insertion modes onto this stack, meaning that the specified insertion mode is to be added to the stack, and pop insertion modes from the stack, which means that the most recently added insertion mode must be removed from the stack.

When the steps below require the UA to reset the insertion mode appropriately, it means the UA must follow these steps:
  1. Let last be false.
  2. Let node be the last node in the stack of open elements.
  3. Loop: If node is the first node in the stack of open elements, then set last to true, and, if the parser was originally created as part of the HTML fragment parsing algorithm (fragment case), set node to the context element.
  4. If node is a select element, run these substeps:
    1. If last is true, jump to the step below labeled done.
    2. Let ancestor be node.
    3. Loop: If ancestor is the first node in the stack of open elements, jump to the step below labeled done.
    4. Let ancestor be the node before ancestor in the stack of open elements.
    5. If ancestor is a template node, jump to the step below labeled done.
    6. If ancestor is a table node, switch the insertion mode to "in select in table" and abort these steps.
    7. Jump back to the step labeled loop.
    8. Done: Switch the insertion mode to "in select" and abort these steps.
  5. If node is a td or th element and last is false, then switch the insertion mode to "in cell" and abort these steps.
  6. If node is a tr element, then switch the insertion mode to "in row" and abort these steps.
  7. If node is a tbody, thead, or tfoot element, then switch the insertion mode to "in table body" and abort these steps.
  8. If node is a caption element, then switch the insertion mode to "in caption" and abort these steps.
  9. If node is a colgroup element, then switch the insertion mode to "in column group" and abort these steps.
  10. If node is a table element, then switch the insertion mode to "in table" and abort these steps.
  11. If node is a template element, then switch the insertion mode to the current template insertion mode and abort these steps.
  12. If node is a head element and last is false, then switch the insertion mode to "in head" and abort these steps.
  13. If node is a body element, then switch the insertion mode to "in body" and abort these steps.
  14. If node is a frameset element, then switch the insertion mode to "in frameset" and abort these steps. (fragment case)
  15. If node is an html element, run these substeps:
    1. If the head element pointer is null, switch the insertion mode to "before head" and abort these steps. (fragment case)
    2. Otherwise (the head element pointer is not null), switch the insertion mode to "after head" and abort these steps.
  16. If last is true, then switch the insertion mode to "in body" and abort these steps. (fragment case)
  17. Let node now be the node before node in the stack of open elements.
  18. Return to the step labeled loop.
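The reset steps above can be sketched as a single function. This is a simplified illustration, not a conforming implementation: elements are modelled as bare tag-name strings, the function returns the chosen mode instead of mutating parser state, and the name reset_insertion_mode and its parameters are assumptions made for this example.

```python
def reset_insertion_mode(open_elements, context=None, head_pointer=None,
                         current_template_mode=None):
    """Return the insertion mode selected by the reset steps.

    open_elements: non-empty list of tag names, first-added first, current node last.
    context: tag name of the context element in the fragment case, else None.
    """
    by_tag = {
        "tr": "in row",
        "tbody": "in table body", "thead": "in table body", "tfoot": "in table body",
        "caption": "in caption",
        "colgroup": "in column group",
        "table": "in table",
        "body": "in body",
        "frameset": "in frameset",
    }
    index = len(open_elements) - 1
    while True:
        node = open_elements[index]
        last = index == 0
        if last and context is not None:
            node = context  # fragment case
        if node == "select":
            if not last:
                # Walk toward the first node, looking for a table before any template.
                for ancestor in reversed(open_elements[:index]):
                    if ancestor == "template":
                        break
                    if ancestor == "table":
                        return "in select in table"
            return "in select"
        if node in ("td", "th") and not last:
            return "in cell"
        if node in by_tag:
            return by_tag[node]
        if node == "template":
            return current_template_mode
        if node == "head" and not last:
            return "in head"
        if node == "html":
            return "before head" if head_pointer is None else "after head"
        if last:
            return "in body"
        index -= 1  # move to the node before node in the stack
```

For example, with the stack html, body, table, tbody, tr, td the function returns "in cell", and with html, body, select it returns "in select".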
12.2.3.2 The stack of open elements
Initially, the stack of open elements is empty. The stack grows downwards; the topmost node on the stack is the first one added to the stack, and the bottommost node of the stack is the most recently added node in the stack (notwithstanding when the stack is manipulated in a random access fashion as part of the handling for misnested tags).
The "before html" insertion mode creates the html root element node, which is then added to the stack.
In the fragment case, the stack of open elements is initialized to contain an html element that is created as part of that algorithm. (The fragment case skips the "before html" insertion mode.)
The html node, however it is created, is the topmost node of the stack. It only gets popped off the stack when the parser finishes.
The current node is the bottommost node in this stack of open elements.
The adjusted current node is the context element if the stack of open elements has only one element in it and the parser was created by the HTML fragment parsing algorithm; otherwise, the adjusted current node is the current node.
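As a rough illustration of the two definitions above, the following sketch models the stack as a list whose last item is the bottommost (most recently added) node. The function names and the use of tag-name strings in place of element nodes are assumptions made for this example.

```python
def current_node(open_elements):
    # The bottommost node: the most recently added entry.
    return open_elements[-1]


def adjusted_current_node(open_elements, context=None):
    # context is the context element when the parser was created by the
    # HTML fragment parsing algorithm, and None otherwise.
    if len(open_elements) == 1 and context is not None:
        return context
    return current_node(open_elements)
```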
Elements in the stack of open elements fall into the following categories:
Special
The following elements have varying levels of special parsing rules: HTML's address, applet, area, article, aside, base, basefont, bgsound, blockquote, body, br, button, caption, center, col, colgroup, dd, details, dir, div, dl, dt, embed, fieldset, figcaption, figure, footer, form, frame, frameset, h1, h2, h3, h4, h5, h6, head, header, hgroup, hr, html, iframe, img, input, isindex, li, link, listing, main, marquee, menu, menuitem, meta, nav, noembed, noframes, noscript, object, ol, p, param, plaintext, pre, script, section, select, source, style, summary, table, tbody, td, template, textarea, tfoot, th, thead, title, tr, track, ul, wbr, and xmp; MathML's mi, mo, mn, ms, mtext, and annotation-xml; and SVG's foreignObject, desc, and title.
Formatting
The following HTML elements are those that end up in the list of active formatting elements: a, b, big, code, em, font, i, nobr, s, small, strike, strong, tt, and u.
Ordinary
All other elements found while parsing an HTML document.
The stack of open elements is said to have an element target node in a specific scope consisting of a list of element types list when the following algorithm terminates in a match state:
  1. Initialize node to be the current node (the bottommost node of the stack).
  2. If node is the target node, terminate in a match state.
  3. Otherwise, if node is one of the element types in list, terminate in a failure state.
  4. Otherwise, set node to the previous entry in the stack of open elements and return to step 2. (This will never fail, since the loop will always terminate in the previous step if the top of the stack — an html element — is reached.)
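The scope check above can be sketched as a walk from the current node toward the top of the stack. This is a simplified illustration: nodes and element types are both modelled as tag-name strings, and the function name is hypothetical. The boundary set used in the usage lines is a placeholder chosen for the example, not one of the lists defined by the specification.

```python
def has_element_in_specific_scope(open_elements, target, scope_types):
    """open_elements: tag names, first-added first, current node last."""
    for node in reversed(open_elements):  # start at the current node
        if node == target:
            return True   # match state
        if node in scope_types:
            return False  # failure state: hit a scope boundary first
    # Not reached when scope_types contains the element at the top of the stack.
    return False


# Placeholder boundary set, chosen only for this illustration.
boundaries = {"html", "table"}
in_scope = has_element_in_specific_scope(
    ["html", "body", "table", "tr", "td", "p"], "p", boundaries)
blocked = has_element_in_specific_scope(
    ["html", "body", "p", "table", "tr", "td"], "p", boundaries)
```

In the second call the walk reaches the table boundary before it reaches the p element, so the check ends in the failure state.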
The stack of open elements is said to have a particular element in scope when it has that element in the specific scope consisting of the following element types:
The stack of open elements is said to have a particular element in list item scope when it has that element in the specific scope consisting of the following element types:
The stack of open elements is said to have a particular element in button scope when it has that element in the specific scope consisting of the following element types:
The stack of open elements is said to have a particular element in table scope when it has that element in the specific scope consisting of the following element types:
The stack of open elements is said to have a particular element in select scope when it has that element in the specific scope consisting of all element types except the following:
Nothing happens if at any time any of the elements in the stack of open elements are moved to a new location in, or removed from, the Document tree. In particular, the stack is not changed in this situation. This can cause, amongst other strange effects, content to be appended to nodes that are no longer in the DOM.
In some cases (namely, when closing misnested formatting elements), the stack is manipulated in a random-access fashion.
12.2.3.3 The list of active formatting elements
Initially, the list of active formatting elements is empty. It is used to handle mis-nested formatting element tags.
The list contains elements in the formatting category, and scope markers. The scope markers are inserted when entering applet elements, buttons, object elements, marquees, table cells, and table captions, and are used to prevent formatting from "leaking" into applet elements, buttons, object elements, marquees, and tables.
The scope markers are unrelated to the concept of an element being in scope.
In addition, each element in the list of active formatting elements is associated with the token for which it was created, so that further elements can be created for that token if necessary.
When the steps below require the UA to push onto the list of active formatting elements an element element, the UA must perform the following steps:
  1. If there are already three elements in the list of active formatting elements after the last list marker, if any, or anywhere in the list if there are no list markers, that have the same tag name, namespace, and attributes as element, then remove the earliest such element from the list of active formatting elements. For these purposes, the attributes must be compared as they were when the elements were created by the parser; two elements have the same attributes if all their parsed attributes can be paired such that the two attributes in each pair have identical names, namespaces, and values (the order of the attributes does not matter).
    This is the Noah's Ark clause. But with three per family instead of two.
  2. Add element to the list of active formatting elements.
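The push steps above, including the limit of three matching elements, might look like the following sketch. The Element class, the MARKER sentinel, and the function names are assumptions made for this illustration; attributes are held in a plain dict, so comparison ignores order, and namespaces are reduced to a short string.

```python
from dataclasses import dataclass, field

MARKER = object()  # hypothetical sentinel standing in for a marker entry


@dataclass(eq=False)
class Element:
    tag: str
    namespace: str = "html"  # simplified stand-in for a namespace
    attrs: dict = field(default_factory=dict)  # attributes as parsed


def same_family(a, b):
    # Same tag name, namespace, and attributes (dict equality ignores order).
    return (a.tag, a.namespace, a.attrs) == (b.tag, b.namespace, b.attrs)


def push_active_formatting(afe, element):
    # Only entries after the last marker count (or the whole list if none).
    start = 0
    for i in range(len(afe) - 1, -1, -1):
        if afe[i] is MARKER:
            start = i + 1
            break
    matches = [i for i in range(start, len(afe)) if same_family(afe[i], element)]
    if len(matches) >= 3:
        del afe[matches[0]]  # drop the earliest matching element
    afe.append(element)
```

Pushing a fourth identical b element therefore leaves three entries in the list, with the oldest one removed.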
When the steps below require the UA to reconstruct the active formatting elements, the UA must perform the following steps:
  1. If there are no entries in the list of active formatting elements, then there is nothing to reconstruct; stop this algorithm.
  2. If the last (most recently added) entry in the list of active formatting elements is a marker, or if it is an element that is in the stack of open elements, then there is nothing to reconstruct; stop this algorithm.
  3. Let entry be the last (most recently added) element in the list of active formatting elements.
  4. Rewind: If there are no entries before entry in the list of active formatting elements, then jump to the step labeled create.
  5. Let entry be the entry one earlier than entry in the list of active formatting elements.
  6. If entry is neither a marker nor an element that is also in the stack of open elements, go to the step labeled rewind.
  7. Advance: Let entry be the element one later than entry in the list of active formatting elements.
  8. Create: Insert an HTML element for the token for which the element entry was created, to obtain new element.
  9. Replace the entry for entry in the list with an entry for new element.
  10. If the entry for new element in the list of active formatting elements is not the last entry in the list, return to the step labeled advance.
This has the effect of reopening all the formatting elements that were opened in the current body, cell, or caption (whichever is youngest) that haven't been explicitly closed.
The way this specification is written, the list of active formatting elements always consists of elements in chronological order with the least recently added element first and the most recently added element last (except while steps 7 to 10 of the above algorithm are being executed, of course).
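The reconstruction steps can be sketched as a rewind phase followed by a create-and-advance loop. This is a simplified illustration under stated assumptions: "insert an HTML element for the token" is reduced to creating a fresh element with the same tag and pushing it onto the stack of open elements, and the Element class, MARKER sentinel, and function name are hypothetical.

```python
MARKER = object()  # hypothetical sentinel standing in for a marker entry


class Element:
    def __init__(self, tag):
        self.tag = tag  # stands in for the element and its originating token


def reconstruct_active_formatting(afe, open_elements):
    def settled(entry):
        # A marker, or an element that is still in the stack of open elements.
        return entry is MARKER or any(entry is e for e in open_elements)

    if not afe or settled(afe[-1]):
        return  # nothing to reconstruct
    i = len(afe) - 1
    # Rewind: step back until a marker or an open element is found.
    while i > 0:
        i -= 1
        if settled(afe[i]):
            i += 1  # advance past the settled entry
            break
    # Create: reopen each remaining entry in order, replacing it in the list.
    while True:
        new_element = Element(afe[i].tag)
        open_elements.append(new_element)  # simplified "insert an HTML element"
        afe[i] = new_element
        if i == len(afe) - 1:
            break
        i += 1
```

With a list holding b and i, where only b is still open, the call creates a new i element, replaces the old list entry with it, and makes it the current node.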
When the steps below require the UA to clear the list of active formatting elements up to the last marker, the UA must perform the following steps:
  1. Let entry be the last (most recently added) entry in the list of active formatting elements.
  2. Remove entry from the list of active formatting elements.
  3. If entry was a marker, then stop the algorithm at this point. The list has been cleared up to the last marker.
  4. Go to step 1.
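A minimal sketch of the clearing steps, assuming the list is a Python list and markers are represented by a hypothetical MARKER sentinel. The guard against an empty list is an addition for safety in this example and is not part of the steps above.

```python
MARKER = object()  # hypothetical sentinel standing in for a marker entry


def clear_to_last_marker(afe):
    # Pop entries from the end until a marker has been removed.
    while afe:
        entry = afe.pop()
        if entry is MARKER:
            break
```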
12.2.3.4 The element pointers
Initially, the head element pointer and the form element pointer are both null.
Once a head element has been parsed (whether implicitly or explicitly) the head element pointer gets set to point to this node.
The form element pointer points to the last form element that was opened and whose end tag has not yet been seen. It is used to make form controls associate with forms in the face of dramatically bad markup, for historical reasons.
12.2.3.5 Other parsing state flags
The scripting flag is set to "enabled" if scripting was enabled for the Document with which the parser is associated when the parser was created, and "disabled" otherwise.
The scripting flag can be enabled even when the parser was originally created for the HTML fragment parsing algorithm, even though script elements don't execute in that case.
The frameset-ok flag is set to "ok" when the parser is created. It is set to "not ok" after certain tokens are seen.
