Changeset 57348

Timestamp:

01/24/2024 11:35:46 PM

Author:

dmsnell

Message:

HTML API: Scan all syntax tokens in a document, read modifiable text.

Since its introduction in WordPress 6.2 the HTML Tag Processor has
provided a way to scan through all of the HTML tags in a document and
then read and modify their attributes. In order to reliably do this, it
also needed to be aware of other kinds of HTML syntax, but it didn't
expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a
few helper methods to read information about or from each token. Most
significantly, this introduces the ability to read #text nodes in the
document.

What's new in the Tag Processor?
================================

 - next_token() visits every distinct syntax token in a document.
 - get_token_type() indicates what kind of token it is.
 - get_token_name() returns something akin to DOMNode.nodeName.
 - get_modifiable_text() returns the text associated with a token.
 - get_comment_type() indicates why a token represents an HTML comment.

Example usage.
==============

<?php
function strip_all_tags( $html ) {
        $text_content = '';
        $processor    = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                if ( '#text' !== $processor->get_token_type() ) {
                        continue;
                }

                $text_content .= $processor->get_modifiable_text();
        }

        return $text_content;
}

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of
every HTML element separately. Now, however, there are special tags
which it only visits once, as if those elements were void tags without
a closer.

These are special tags because their content contains no other HTML or
markup, only non-HTML content.

SCRIPT elements contain raw text which is isolated from the rest of the HTML document and fed separately into a JavaScript engine. There are complicated rules to avoid escaping the script context in the HTML. The contents are left verbatim, and character references are not decoded.

TEXTAREA and TITLE elements contain plain text which is decoded before display, e.g. transforming &amp; into &. Any markup which resembles tags is treated as verbatim text and not a tag.

IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the TEXTAREA and TITLE elements, but no character references are decoded. For example, &amp; inside a STYLE element is passed to the CSS engine as the literal string &amp; and _not_ as &.
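
The three categories above can be sketched as a small decoding rule. This is
an illustration only: sketch_decode_special_text() is a hypothetical helper,
not part of the Tag Processor, and PHP's html_entity_decode() only
approximates the character reference decoding that an HTML parser performs.

```php
<?php
/**
 * Hypothetical helper (not WordPress code): returns the text a browser would
 * see for the raw inner content of one of the special elements.
 */
function sketch_decode_special_text( string $tag_name, string $raw ): string {
    switch ( strtoupper( $tag_name ) ) {
        // Plain text whose character references are decoded before display.
        case 'TEXTAREA':
        case 'TITLE':
            return html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

        // Content that is passed along verbatim, with nothing decoded.
        case 'SCRIPT':
        case 'IFRAME':
        case 'NOEMBED':
        case 'NOFRAMES':
        case 'STYLE':
        case 'XMP':
            return $raw;

        default:
            throw new InvalidArgumentException( "{$tag_name} is not a special element." );
    }
}
```

In this sketch a TITLE containing `Is <img> &gt; <image>?` reads back with a
literal `>` in place of `&gt;`, while the same bytes inside a STYLE element are
returned unchanged.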

Because it's important not to treat this inner content separately from the
elements containing it, the Tag Processor combines them when scanning
into a single match and makes their content available as modifiable
text (see below).

This means that the Tag Processor will no longer visit a closing tag for
any of these elements unless that tag is unexpected.

    <title>There is only a single token in this line</title>
    <title>There are two tokens in this line></title></title>
    </title><title>There are still two tokens in this line></title>

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in
HTML. There are only a few kinds of tokens in HTML:

 - a tag has a name, attributes, and a closing or self-closing flag.
 - a text node, or #text node, contains plain text which is displayed in a browser and which is decoded before display.
 - a DOCTYPE declaration indicates how to parse the document.
 - a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will
recognize, some of which don't exist as concepts in HTML. These mostly
comprise XML syntax elements that aren't part of HTML (such as CDATA and
processing instructions) and invalid HTML syntax that transforms into
comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way.
A closing tag with an invalid name is considered a "funky comment." In
the browser these become HTML comments just like any other, but their
syntax is convenient for representing a variety of bits of information
in a well-defined way and which cannot be nested or recursive, given
the parsing rules handling this invalid syntax.

</1>
</%avatar_url>
</{"wp_bit": {"type": "post-author"}}>
</[post-author]>
</__( 'Save Post' );>

All of these examples become HTML comments in the browser. The content
inside the funky comment is easily parsable, whereby the only rule is
that it starts at the < and continues until the nearest >. There
can be no funky comment inside another, because that would imply having
a > inside of one, which would actually terminate the first one.
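
The "nearest >" rule can be sketched in a few lines of standalone PHP. The
function below is hypothetical and is not how the Tag Processor is
implemented: it scans naively from left to right and ignores context, so it
would also match inside SCRIPT content, attribute values, or real comments,
which a full parser must rule out first.

```php
<?php
/**
 * Hypothetical helper (not WordPress code): lists the inner text of every
 * "funky comment" found by a naive scan of $html.
 */
function sketch_find_funky_comments( string $html ): array {
    $letters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $found   = array();
    $length  = strlen( $html );
    $at      = 0;

    while ( false !== ( $at = strpos( $html, '</', $at ) ) ) {
        $name_at = $at + 2;
        if ( $name_at >= $length ) {
            break; // The input ended right after "</".
        }

        // "</" plus an ASCII letter is a normal closing tag and "</>" is an
        // empty closer; neither one is a funky comment.
        if ( 1 === strspn( $html, $letters, $name_at, 1 ) || '>' === $html[ $name_at ] ) {
            $at = $name_at;
            continue;
        }

        // Everything up to the nearest ">" belongs to the funky comment.
        $closer_at = strpos( $html, '>', $name_at );
        if ( false === $closer_at ) {
            break; // Incomplete token: there is no ">" left in the input.
        }

        $found[] = substr( $html, $name_at, $closer_at - $name_at );
        $at      = $closer_at + 1;
    }

    return $found;
}
```

Because the text cannot contain a >, a payload such as the JSON example above
can be handed to a decoder as soon as it has been extracted.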

What is modifiable text?
========================

Modifiable text is similar to the innerText property of a DOM node.
It represents the span of text for a given token which may be modified
without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this
is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where
text nodes are found. Only the special tags mentioned above have
modifiable text.

    <div class="post">Another day in HTML</div>
    └─ tag ──────────┘└─ text node ─────┘└────┴─ tag

    <title>Is <img> &gt; <image>?</title>
    │      └ modifiable text ───┘       │ "Is <img> > <image>?"
    └─ tag ─────────────────────────────┘

Text nodes
==========

Text nodes are entirely modifiable text.

    This HTML document has no tags.
    └─ modifiable text ───────────┘

Comments
========

The modifiable text inside a comment is the portion of the comment that
doesn't form its syntax. This also applies to a number of invalid comments.

    <!-- this is inside a comment -->
    │   └─ modifiable text ──────┘  │
    └─ comment token ───────────────┘

    <!-->
    This invalid comment has no modifiable text.

    <? this is an invalid comment -->
    │ └─ modifiable text ────────┘  │
    └─ comment token ───────────────┘

    <![CDATA[this is an invalid comment]]>
    │        └─ modifiable text ───────┘ │
    └─ comment token ────────────────────┘
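
For the ordinary `<!-- ... -->` form, the split between syntax and modifiable
text can be sketched as follows. sketch_comment_text() is a hypothetical
helper which assumes its input is exactly one complete comment token. It
loosely mirrors the offsets used in this patch but is not the patch's code,
and it does not cover the `<?` or CDATA-lookalike forms.

```php
<?php
/**
 * Hypothetical helper (not WordPress code): returns the modifiable text of a
 * single "<!--" comment token, or null if $token is not exactly one.
 */
function sketch_comment_text( string $token ): ?string {
    if ( 0 !== strncmp( $token, '<!--', 4 ) || '>' !== substr( $token, -1 ) ) {
        return null;
    }

    // Only dashes between "<!--" and ">": "<!-->" and "<!--->" close abruptly
    // and carry no text, while "<!---->" is a normal but empty comment.
    $dashes = strspn( $token, '-', 4 );
    if ( 4 + $dashes + 1 === strlen( $token ) ) {
        return $dashes >= 2 ? substr( $token, 4, $dashes - 2 ) : '';
    }

    // Otherwise the comment ends at the first "-->" or "--!>".
    $closer_at = strpos( $token, '--', 4 );
    while ( false !== $closer_at ) {
        $rest = substr( $token, $closer_at + 2 );
        if ( '>' === $rest || '!>' === $rest ) {
            return substr( $token, 4, $closer_at - 4 );
        }
        if ( '>' === $rest[0] || 0 === strncmp( $rest, '!>', 2 ) ) {
            return null; // The comment closes before the end of the input.
        }
        $closer_at = strpos( $token, '--', $closer_at + 1 );
    }

    return null;
}
```

The leading and trailing spaces in `<!-- this is inside a comment -->` are part
of the returned text, matching the span marked in the first diagram.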

Other token types also have modifiable text. Consult the code or tests
for further information.

Developed in https://github.com/WordPress/wordpress-develop/pull/5683
Discussed in https://core.trac.wordpress.org/ticket/60170

Follows [57575]

Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
Fixes #60170

Location:

trunk

Files:

1 added
4 edited

src/wp-includes/html-api/class-wp-html-processor.php (modified) (3 diffs)
src/wp-includes/html-api/class-wp-html-tag-processor.php (modified) (52 diffs)
tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php (modified) (1 diff)
tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php (added)
tests/phpunit/tests/html-api/wpHtmlTagProcessor.php (modified) (6 diffs)

trunk/src/wp-includes/html-api/class-wp-html-processor.php

-                      r57343
+                      r57348
     /**
-     * Static query for instructing the Tag Processor to visit every token.
+     *
-     * @access private
+     *
-     * @since 6.4.0
+     *
-     * @var array
-     */
-    const VISIT_EVERYTHING = array( 'tag_closers' => 'visit' );
-    /**
      * Holds the working state of the parser, including the stack of
      * open elements and the stack of active formatting elements.
 …
         return false;
+    }
 …
+            }
+            parent::next_tag( self::VISIT_EVERYTHING );
+            while ( parent::next_token() && '#tag' !== $this->get_token_type() ) {
+                continue;
+            }
+        }

trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php

-                      r57227
+                      r57348
  *     }
+ *
  * ## Design and limitations
+ *
 …
  * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
  * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
- *              Introduces "special" elements which act like void elements, e.g. STYLE.
+ *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
+ *              Allows scanning through all tokens and processing modifiable text, where applicable.
  */
 class WP_HTML_Tag_Processor {
 …
      * Specifies mode of operation of the parser at any given time.
+     *
-     * | State         | Meaning                                                              |
-     * | --------------|----------------------------------------------------------------------|
-     * | *Ready*       | The parser is ready to run.                                          |
-     * | *Complete*    | There is nothing left to parse.                                      |
-     * | *Incomplete*  | The HTML ended in the middle of a token; nothing more can be parsed. |
-     * | *Matched tag* | Found an HTML tag; it's possible to modify its attributes.           |
+     * | State           | Meaning                                                              |
+     * | ----------------|----------------------------------------------------------------------|
+     * | *Ready*         | The parser is ready to run.                                          |
+     * | *Complete*      | There is nothing left to parse.                                      |
+     * | *Incomplete*    | The HTML ended in the middle of a token; nothing more can be parsed. |
+     * | *Matched tag*   | Found an HTML tag; it's possible to modify its attributes.           |
+     * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
+     * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
+     * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |
+     * | *Presumptuous*  | Found an empty tag closer: `</>`.                                    |
+     * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable.     |
+     *
      * @since 6.5.0
 …
      * @see WP_HTML_Tag_Processor::STATE_READY
      * @see WP_HTML_Tag_Processor::STATE_COMPLETE
      * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE
+     * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE
      * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
+     *
      * @var string
      */
-    private $parser_state = self::STATE_READY;
+    protected $parser_state = self::STATE_READY;
+    /**
+     * What kind of syntax token became an HTML comment.
+     *
+     * Since there are many ways in which HTML syntax can create an HTML comment,
+     * this indicates which of those caused it. This allows the Tag Processor to
+     * represent more from the original input document than would appear in the DOM.
+     *
+     * @since 6.5.0
+     *
+     * @var string|null
+     */
+    protected $comment_type = null;
     /**
 …
      */
     private $tag_name_length;
     /**
 …
      */
     public function next_token() {
         $this->get_updated_html();
-        $was_at = $this->bytes_already_parsed;
         // Don't proceed if there's nothing more to scan.
         if (
             self::STATE_COMPLETE === $this->parser_state ||
             self::STATE_INCOMPLETE === $this->parser_state
+            self::STATE_INCOMPLETE === $this->parser_state
         ) {
             return false;
 …
         // Find the next tag if it exists.
         if ( false === $this->parse_next_tag() ) {
             if ( self::STATE_INCOMPLETE === $this->parser_state ) {
+            if ( self::STATE_INCOMPLETE === $this->parser_state ) {
                 $this->bytes_already_parsed = $was_at;
+            }
             return false;
+        }
 …
         // Ensure that the tag closes before the end of the document.
         if (
             self::STATE_INCOMPLETE === $this->parser_state ||
+            self::STATE_INCOMPLETE === $this->parser_state ||
             $this->bytes_already_parsed >= strlen( $this->html )
         ) {
             // Does this appropriately clear state (parsed attributes)?
             $this->parser_state         = self::STATE_INCOMPLETE;
+            $this->parser_state         = self::STATE_INCOMPLETE;
             $this->bytes_already_parsed = $was_at;
 …
         $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
         if ( false === $tag_ends_at ) {
             $this->parser_state         = self::STATE_INCOMPLETE;
+            $this->parser_state         = self::STATE_INCOMPLETE;
             $this->bytes_already_parsed = $was_at;
 …
         $this->parser_state         = self::STATE_MATCHED_TAG;
         $this->token_length         = $tag_ends_at - $this->token_starts_at;
         $this->bytes_already_parsed = $tag_ends_at;
+        $this->bytes_already_parsed = $tag_ends_at;
         /*
 …
         $t = $this->html[ $this->tag_name_starts_at ];
         if (
             ! $this->is_closing_tag &&
+            (
+            (
                 'i' === $t || 'I' === $t ||
                 'n' === $t || 'N' === $t ||
 …
+            )
         ) {
+            $tag_name = $this->get_tag();
+            if ( 'SCRIPT' === $tag_name && ! $this->skip_script_data() ) {
+                $this->parser_state         = self::STATE_INCOMPLETE;
+                $this->bytes_already_parsed = $was_at;
+                return false;
+            } elseif (
+                ( 'TEXTAREA' === $tag_name || 'TITLE' === $tag_name ) &&
+                ! $this->skip_rcdata( $tag_name )
+            ) {
+                $this->parser_state         = self::STATE_INCOMPLETE;
+                $this->bytes_already_parsed = $was_at;
+                return false;
+            } elseif (
+                (
+                    'IFRAME' === $tag_name ||
+                    'NOEMBED' === $tag_name ||
+                    'NOFRAMES' === $tag_name ||
+                    'STYLE' === $tag_name ||
+                    'XMP' === $tag_name
+                ) &&
+                ! $this->skip_rawtext( $tag_name )
+            ) {
+                $this->parser_state         = self::STATE_INCOMPLETE;
+                $this->bytes_already_parsed = $was_at;
+                return false;
+            }
+        }
+            return true;
+        }
+        $tag_name = $this->get_tag();
+        /*
+         * Preserve the opening tag pointers, as these will be overwritten
+         * when finding the closing tag. They will be reset after finding
+         * the closing to tag to point to the opening of the special atomic
+         * tag sequence.
+         */
+        $tag_name_starts_at   = $this->tag_name_starts_at;
+        $tag_name_length      = $this->tag_name_length;
+        $tag_ends_at          = $this->token_starts_at + $this->token_length;
+        $attributes           = $this->attributes;
+        $duplicate_attributes = $this->duplicate_attributes;
+        // Find the closing tag if necessary.
+        $found_closer = false;
+        switch ( $tag_name ) {
+            case 'SCRIPT':
+                $found_closer = $this->skip_script_data();
+                break;
+            case 'TEXTAREA':
+            case 'TITLE':
+                $found_closer = $this->skip_rcdata( $tag_name );
+                break;
+            /*
+             * In the browser this list would include the NOSCRIPT element,
+             * but the Tag Processor is an environment with the scripting
+             * flag disabled, meaning that it needs to descend into the
+             * NOSCRIPT element to be able to properly process what will be
+             * sent to a browser.
+             *
+             * Note that this rule makes HTML5 syntax incompatible with XML,
+             * because the parsing of this token depends on client application.
+             * The NOSCRIPT element cannot be represented in the XHTML syntax.
+             */
+            case 'IFRAME':
+            case 'NOEMBED':
+            case 'NOFRAMES':
+            case 'STYLE':
+            case 'XMP':
+                $found_closer = $this->skip_rawtext( $tag_name );
+                break;
+            // No other tags should be treated in their entirety here.
+            default:
+                return true;
+        }
+        if ( ! $found_closer ) {
+            $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
+            $this->bytes_already_parsed = $was_at;
+            return false;
+        }
+        /*
+         * The values here look like they reference the opening tag but they reference
+         * the closing tag instead. This is why the opening tag values were stored
+         * above in a variable. It reads confusingly here, but that's because the
+         * functions that skip the contents have moved all the internal cursors past
+         * the inner content of the tag.
+         */
+        $this->token_starts_at      = $was_at;
+        $this->token_length         = $this->bytes_already_parsed - $this->token_starts_at;
+        $this->text_starts_at       = $tag_ends_at + 1;
+        $this->text_length          = $this->tag_name_starts_at - $this->text_starts_at;
+        $this->tag_name_starts_at   = $tag_name_starts_at;
+        $this->tag_name_length      = $tag_name_length;
+        $this->attributes           = $attributes;
+        $this->duplicate_attributes = $duplicate_attributes;
         return true;
 …
      */
     public function paused_at_incomplete_token() {
         return self::STATE_INCOMPLETE === $this->parser_state;
+        return self::STATE_INCOMPLETE === $this->parser_state;
+    }
 …
     public function set_bookmark( $name ) {
         // It only makes sense to set a bookmark if the parser has paused on a concrete token.
+        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
+        if (
+            self::STATE_COMPLETE === $this->parser_state ||
+            self::STATE_INCOMPLETE_INPUT === $this->parser_state
+        ) {
             return false;
+        }
 …
         while ( false !== $at && $at < $doc_length ) {
+            $at = strpos( $this->html, '</', $at );
+            $at                       = strpos( $this->html, '</', $at );
+            $this->tag_name_starts_at = $at;
             // Fail if there is no possible tag closer.
 …
+            }
+            $closer_potentially_starts_at = $at;
+            $at                          += 2;
+            $at += 2;
             /*
 …
                 continue;
+            }
             $at = $this->bytes_already_parsed;
             if ( $at >= strlen( $this->html ) ) {
 …
+            }
+            if ( '>' === $html[ $at ] || '/' === $html[ $at ] ) {
+                $this->bytes_already_parsed = $closer_potentially_starts_at;
+            if ( '>' === $html[ $at ] ) {
+                $this->bytes_already_parsed = $at + 1;
+                return true;
+            }
+            if ( $at + 1 >= strlen( $this->html ) ) {
+                return false;
+            }
+            if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
+                $this->bytes_already_parsed = $at + 2;
                 return true;
+            }
 …
             if ( $is_closing ) {
                 $this->bytes_already_parsed = $closer_potentially_starts_at;
                 if ( $this->bytes_already_parsed >= $doc_length ) {
                     return false;
 …
                 if ( $this->bytes_already_parsed >= $doc_length ) {
                     $this->parser_state = self::STATE_INCOMPLETE;
+                    $this->parser_state = self::STATE_INCOMPLETE;
                     return false;
 …
                 if ( '>' === $html[ $this->bytes_already_parsed ] ) {
                     $this->bytes_already_parsed = $closer_potentially_starts_at;
+                    ;
                     return true;
+                }
 …
         $html       = $this->html;
         $doc_length = strlen( $html );
+        $at         = $this->bytes_already_parsed;
+        $was_at     = $this->bytes_already_parsed;
+        $at         = $was_at;
         while ( false !== $at && $at < $doc_length ) {
             $at = strpos( $html, '<', $at );
             /*
 …
              */
             if ( false === $at ) {
+                return false;
+                $this->parser_state         = self::STATE_TEXT_NODE;
+                $this->token_starts_at      = $was_at;
+                $this->token_length         = strlen( $html ) - $was_at;
+                $this->text_starts_at       = $was_at;
+                $this->text_length          = $this->token_length;
+                $this->bytes_already_parsed = strlen( $html );
+                return true;
+            }
 …
             if ( $tag_name_prefix_length > 0 ) {
                 ++$at;
                 $this->tag_name_length      = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length );
-                $this->tag_name_starts_at   = $at;
                 $this->bytes_already_parsed = $at + $this->tag_name_length;
                 return true;
 …
              */
             if ( $at + 1 >= $doc_length ) {
                 $this->parser_state = self::STATE_INCOMPLETE;
+                $this->parser_state = self::STATE_INCOMPLETE;
                 return false;
 …
             /*
-             * <! transitions to markup declaration open state
+             * `<!` transitions to markup declaration open state
              * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
              */
             if ( '!' === $html[ $at + 1 ] ) {
                 /*
                  * <!-- transitions to a bogus comment state – skip to the nearest -->
+                 *
                  * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
                  */
 …
                     // If it's not possible to close the comment then there is nothing more to scan.
                     if ( $doc_length <= $closer_at ) {
                         $this->parser_state = self::STATE_INCOMPLETE;
+                        $this->parser_state = self::STATE_INCOMPLETE;
                         return false;
 …
                     $span_of_dashes = strspn( $html, '-', $closer_at );
                     if ( '>' === $html[ $closer_at + $span_of_dashes ] ) {
+                        $at = $closer_at + $span_of_dashes + 1;
+                        continue;
+                        /*
+                         * @todo When implementing `set_modifiable_text()` ensure that updates to this token
+                         *       don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
+                         *       and bogus comment syntax, these leave no clear insertion point for text and
+                         *       they need to be modified specially in order to contain text. E.g. to store
+                         *       `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
+                         *       involves inserting an additional `-` into the token after the modifiable text.
+                         */
+                        $this->parser_state = self::STATE_COMMENT;
+                        $this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
+                        $this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;
+                        // Only provide modifiable text if the token is long enough to contain it.
+                        if ( $span_of_dashes >= 2 ) {
+                            $this->comment_type   = self::COMMENT_AS_HTML_COMMENT;
+                            $this->text_starts_at = $this->token_starts_at + 4;
+                            $this->text_length    = $span_of_dashes - 2;
+                        }
+                        $this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
+                        return true;
+                    }
 …
                         $closer_at = strpos( $html, '--', $closer_at );
                         if ( false === $closer_at ) {
                             $this->parser_state = self::STATE_INCOMPLETE;
+                            $this->parser_state = self::STATE_INCOMPLETE;
                             return false;
 …
                         if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
+                            $at = $closer_at + 3;
+                            continue 2;
+                            $this->parser_state         = self::STATE_COMMENT;
+                            $this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
+                            $this->token_length         = $closer_at + 3 - $this->token_starts_at;
+                            $this->text_starts_at       = $this->token_starts_at + 4;
+                            $this->text_length          = $closer_at - $this->text_starts_at;
+                            $this->bytes_already_parsed = $closer_at + 3;
+                            return true;
+                        }
+                        if ( $closer_at + 3 < $doc_length && '!' === $html[ $closer_at + 2 ] && '>' === $html[ $closer_at + 3 ] ) {
+                            $at = $closer_at + 4;
+                            continue 2;
+                        if (
+                            $closer_at + 3 < $doc_length &&
+                            '!' === $html[ $closer_at + 2 ] &&
+                            '>' === $html[ $closer_at + 3 ]
+                        ) {
+                            $this->parser_state         = self::STATE_COMMENT;
+                            $this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
+                            $this->token_length         = $closer_at + 4 - $this->token_starts_at;
+                            $this->text_starts_at       = $this->token_starts_at + 4;
+                            $this->text_length          = $closer_at - $this->text_starts_at;
+                            $this->bytes_already_parsed = $closer_at + 4;
+                            return true;
+                        }
+                    }
 …
                 /*
+                 * <![CDATA[ transitions to CDATA section state – skip to the nearest ]]>
+                 * The CDATA is case-sensitive.
+                 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
+                 */
+                if (
+                    $doc_length > $at + 8 &&
+                    '[' === $html[ $at + 2 ] &&
+                    'C' === $html[ $at + 3 ] &&
+                    'D' === $html[ $at + 4 ] &&
+                    'A' === $html[ $at + 5 ] &&
+                    'T' === $html[ $at + 6 ] &&
+                    'A' === $html[ $at + 7 ] &&
+                    '[' === $html[ $at + 8 ]
+                ) {
+                    $closer_at = strpos( $html, ']]>', $at + 9 );
+                    if ( false === $closer_at ) {
+                        $this->parser_state = self::STATE_INCOMPLETE;
+                        return false;
+                    }
+                    $at = $closer_at + 3;
+                    continue;
+                }
+                /*
+                 * <!DOCTYPE transitions to DOCTYPE state – skip to the nearest >
+                 * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
                  * These are ASCII-case-insensitive.
                  * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
 …
                     $closer_at = strpos( $html, '>', $at + 9 );
                     if ( false === $closer_at ) {
                         $this->parser_state = self::STATE_INCOMPLETE;
+                        $this->parser_state = self::STATE_INCOMPLETE;
                         return false;
+                    }
+                    $at = $closer_at + 1;
+                    continue;
+                    $this->parser_state         = self::STATE_DOCTYPE;
+                    $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                    $this->text_starts_at       = $this->token_starts_at + 9;
+                    $this->text_length          = $closer_at - $this->text_starts_at;
+                    $this->bytes_already_parsed = $closer_at + 1;
+                    return true;
+                }
 …
                  * found then the HTML was truncated inside the markup declaration.
                  */
                 $at = strpos( $html, '>', $at + 1 );
                 if ( false === $at ) {
                     $this->parser_state = self::STATE_INCOMPLETE;
+                $at = strpos( $html, '>', $at + 1 );
+                if ( false === $at ) {
+                    $this->parser_state = self::STATE_INCOMPLETE;
                     return false;
+                }
+                continue;
+                $this->parser_state         = self::STATE_COMMENT;
+                $this->comment_type         = self::COMMENT_AS_INVALID_HTML;
+                $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                $this->text_starts_at       = $this->token_starts_at + 2;
+                $this->text_length          = $closer_at - $this->text_starts_at;
+                $this->bytes_already_parsed = $closer_at + 1;
+                /*
+                 * Identify nodes that would be CDATA if HTML had CDATA sections.
+                 *
+                 * This section must occur after identifying the bogus comment end
+                 * because in an HTML parser it will span to the nearest `>`, even
+                 * if there's no `]]>` as would be required in an XML document. It
+                 * is therefore not possible to parse a CDATA section containing
+                 * a `>` in the HTML syntax.
+                 *
+                 * Inside foreign elements there is a discrepancy between browsers
+                 * and the specification on this.
+                 *
+                 * @todo Track whether the Tag Processor is inside a foreign element
+                 *       and require the proper closing `]]>` in those cases.
+                 */
+                if (
+                    $this->token_length >= 10 &&
+                    '[' === $html[ $this->token_starts_at + 2 ] &&
+                    'C' === $html[ $this->token_starts_at + 3 ] &&
+                    'D' === $html[ $this->token_starts_at + 4 ] &&
+                    'A' === $html[ $this->token_starts_at + 5 ] &&
+                    'T' === $html[ $this->token_starts_at + 6 ] &&
+                    'A' === $html[ $this->token_starts_at + 7 ] &&
+                    '[' === $html[ $this->token_starts_at + 8 ] &&
+                    ']' === $html[ $closer_at - 1 ]
+                ) {
+                    $this->parser_state    = self::STATE_COMMENT;
+                    $this->comment_type    = self::COMMENT_AS_CDATA_LOOKALIKE;
+                    $this->text_starts_at += 7;
+                    $this->text_length    -= 9;
+                }
+                return true;
+            }
 …
              */
             if ( '>' === $html[ $at + 1 ] ) {
+                ++$at;
+                continue;
+                $this->parser_state         = self::STATE_PRESUMPTUOUS_TAG;
+                $this->token_length         = $at + 2 - $this->token_starts_at;
+                $this->bytes_already_parsed = $at + 2;
+                return true;
+            }
             /*
-             * <? transitions to a bogus comment state – skip to the nearest >
+             * `<?` transitions to a bogus comment state – skip to the nearest >
              * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
              */
 …
                 $closer_at = strpos( $html, '>', $at + 2 );
                 if ( false === $closer_at ) {
-                    $this->parser_state = self::STATE_INCOMPLETE;
+                    $this->parser_state = self::STATE_INCOMPLETE_INPUT;
                     return false;
+                }
-                $at = $closer_at + 1;
-                continue;
+                $this->parser_state         = self::STATE_COMMENT;
+                $this->comment_type         = self::COMMENT_AS_INVALID_HTML;
+                $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                $this->text_starts_at       = $this->token_starts_at + 2;
+                $this->text_length          = $closer_at - $this->text_starts_at;
+                $this->bytes_already_parsed = $closer_at + 1;
+                /*
+                 * Identify a Processing Instruction node were HTML to have them.
+                 *
+                 * This section must occur after identifying the bogus comment end
+                 * because in an HTML parser it will span to the nearest `>`, even
+                 * if there's no `?>` as would be required in an XML document. It
+                 * is therefore not possible to parse a Processing Instruction node
+                 * containing a `>` in the HTML syntax.
+                 *
+                 * XML allows for more target names, but this code only identifies
+                 * those with ASCII-representable target names. This means that it
+                 * may identify some Processing Instruction nodes as bogus comments,
+                 * but it will not misinterpret the HTML structure. By limiting the
+                 * identification to these target names the Tag Processor can avoid
+                 * the need to start parsing UTF-8 sequences.
+                 *
+                 * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
+                 *                     [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
+                 *                     [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
+                 *                     [#x10000-#xEFFFF]
+                 * > NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
+                 *
+                 * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget
+                 */
+                if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) {
+                    $comment_text     = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 );
+                    $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );
+                    if ( 0 < $pi_target_length ) {
+                        $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );
+                        $this->comment_type       = self::COMMENT_AS_PI_NODE_LOOKALIKE;
+                        $this->tag_name_starts_at = $this->token_starts_at + 2;
+                        $this->tag_name_length    = $pi_target_length;
+                        $this->text_starts_at    += $pi_target_length;
+                        $this->text_length       -= $pi_target_length + 1;
+                    }
+                }
+                return true;
+            }
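
The Processing Instruction check above can be sketched as a standalone function. This is only an illustration under stated assumptions: `sketch_pi_target()` is a hypothetical name, not an API added by this changeset, and it returns only the target name instead of updating parser offsets.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): returns the ASCII target name
 * of a Processing Instruction lookalike starting at $token_starts_at, or null
 * when the bogus comment does not look like a Processing Instruction.
 */
function sketch_pi_target( string $html, int $token_starts_at ): ?string {
	// In HTML the bogus comment always ends at the nearest closing angle bracket.
	$closer_at = strpos( $html, '>', $token_starts_at + 2 );
	if ( false === $closer_at ) {
		return null;
	}

	// The byte before the closer must be a question mark.
	$token_length = $closer_at + 1 - $token_starts_at;
	if ( $token_length < 5 || '?' !== $html[ $closer_at - 1 ] ) {
		return null;
	}

	// Everything between the two-byte opener and the two-byte closer.
	$comment_text = substr( $html, $token_starts_at + 2, $token_length - 4 );

	// ASCII-only subsets of the XML NameStartChar and NameChar productions.
	$name_start = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_';
	$name_chars = $name_start . '0123456789-.';

	$target_length = strspn( $comment_text, $name_start );
	if ( 0 === $target_length ) {
		return null;
	}
	$target_length += strspn( $comment_text, $name_chars, $target_length );

	return substr( $comment_text, 0, $target_length );
}
```

Restricting both character sets to ASCII is what lets `strspn()` do all the work without decoding UTF-8. The cost is that a target with a non-ASCII name is reported as a plain bogus comment.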
 …
              * If a non-alpha starts the tag name in a tag closer it's a comment.
              * Find the first `>`, which closes the comment.
+             *
              * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name
 …
                 $closer_at = strpos( $html, '>', $at + 3 );
                 if ( false === $closer_at ) {
-                    $this->parser_state = self::STATE_INCOMPLETE;
+                    $this->parser_state = self::STATE_INCOMPLETE_INPUT;
                     return false;
+                }
-                $at = $closer_at + 1;
-                continue;
+                $this->parser_state         = self::STATE_FUNKY_COMMENT;
+                $this->token_length         = $closer_at + 1 - $this->token_starts_at;
+                $this->text_starts_at       = $this->token_starts_at + 2;
+                $this->text_length          = $closer_at - $this->text_starts_at;
+                $this->bytes_already_parsed = $closer_at + 1;
+                return true;
+            }
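
A rough sketch of how the text span of such an invalid tag closer is computed, assuming the same offsets as the hunk above. `sketch_funky_comment_text()` is a hypothetical helper for illustration only. Empty closers and real tag closers are out of scope and yield null.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): given the offset of a "</"
 * sequence, returns the text of the comment formed when a non-alpha byte
 * starts the tag name, or null when this sketch does not apply.
 */
function sketch_funky_comment_text( string $html, int $at ): ?string {
	if ( '</' !== substr( $html, $at, 2 ) || ! isset( $html[ $at + 2 ] ) ) {
		return null;
	}

	// An ASCII letter starts a real tag closer; an immediate closing angle
	// bracket forms an empty closer. Neither is handled by this sketch.
	$alpha = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
	if ( '>' === $html[ $at + 2 ] || 1 === strspn( $html, $alpha, $at + 2, 1 ) ) {
		return null;
	}

	// The comment spans to the nearest closing angle bracket.
	$closer_at = strpos( $html, '>', $at + 3 );
	if ( false === $closer_at ) {
		return null;
	}

	// The text covers everything after the two-byte opener up to the closer.
	$text_starts_at = $at + 2;
	return substr( $html, $text_starts_at, $closer_at - $text_starts_at );
}
```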
 …
         $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
         if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
-            $this->parser_state = self::STATE_INCOMPLETE;
+            $this->parser_state = self::STATE_INCOMPLETE_INPUT;
             return false;
 …
         $this->bytes_already_parsed += $name_length;
         if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
-            $this->parser_state = self::STATE_INCOMPLETE;
+            $this->parser_state = self::STATE_INCOMPLETE_INPUT;
             return false;
 …
         $this->skip_whitespace();
         if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
-            $this->parser_state = self::STATE_INCOMPLETE;
+            $this->parser_state = self::STATE_INCOMPLETE_INPUT;
             return false;
 …
             $this->skip_whitespace();
             if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
-                $this->parser_state = self::STATE_INCOMPLETE;
+                $this->parser_state = self::STATE_INCOMPLETE_INPUT;
                 return false;
 …
         if ( $attribute_end >= strlen( $this->html ) ) {
-            $this->parser_state = self::STATE_INCOMPLETE;
+            $this->parser_state = self::STATE_INCOMPLETE_INPUT;
             return false;
 …
         $this->tag_name_starts_at   = null;
         $this->tag_name_length      = null;
         $this->is_closing_tag       = null;
         $this->attributes           = array();
         $this->duplicate_attributes = null;
+    }
 …
         // Point this tag processor before the sought tag opener and consume it.
         $this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
-        return $this->next_tag( array( 'tag_closers' => 'visit' ) );
+        return $this->next_token();
+    }
 …
      */
     public function get_tag() {
-        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
+        if ( null === $this->tag_name_starts_at ) {
             return null;
+        }
 …
         $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );
-        return strtoupper( $tag_name );
+        if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
+            return strtoupper( $tag_name );
+        }
+        if (
+            self::STATE_COMMENT === $this->parser_state &&
+            self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
+        ) {
+            return $tag_name;
+        }
+        return null;
+    }
 …
             $this->is_closing_tag
         );
+    }
 …
     /**
-     * Parser Ready State
+     * Parser Ready State.
+     *
      * Indicates that the parser is ready to run and waiting for a state transition.
 …
     /**
-     * Parser Complete State
+     * Parser Complete State.
+     *
      * Indicates that the parser has reached the end of the document and there is
 …
     /**
-     * Parser Incomplete State
+     * Parser Incomplete Input State.
+     *
      * Indicates that the parser has reached the end of the document before finishing
 …
      * @access private
      */
-    const STATE_INCOMPLETE = 'STATE_INCOMPLETE';
-    /**
-     * Parser Matched Tag State
+    const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';
+    /**
+     * Parser Matched Tag State.
+     *
      * Indicates that the parser has found an HTML tag and it's possible to get
 …
      */
     const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';
+}

trunk/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php

-                      r57343
+                      r57348
      */
     public function test_can_seek_back_and_forth() {
-        $p = WP_HTML_Processor::create_fragment( '<div><p one><div><p><div two><p><div><p><div><p three>' );
+        $p = WP_HTML_Processor::create_fragment(
+            <<<'HTML'
+<div>text<p one>more stuff<div><![CDATA[this is not real CDATA]]><p><!-- hi --><div two><p><div><p>three comes soon<div><p three>' );
+HTML
+        );
         // Find first tag of interest.

trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php

-                      r57211
+                      r57348
         $p->next_tag();
-        $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </script> tag closer' );
-        $this->assertTrue( $p->is_tag_closer(), 'Indicated a <script> tag opener is a tag closer' );
+        $this->assertFalse(
+            $p->next_tag( array( 'tag_closers' => 'visit' ) ),
+            'Should not have found closing SCRIPT tag when closing an opener.'
+        );
         $p = new WP_HTML_Tag_Processor( 'abc</script>' );
 …
         $p->next_tag();
-        $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </textarea> tag closer' );
-        $this->assertTrue( $p->is_tag_closer(), 'Indicated a <textarea> tag opener is a tag closer' );
+        $this->assertFalse(
+            $p->next_tag( array( 'tag_closers' => 'visit' ) ),
+            'Should not have found closing TEXTAREA when closing an opener.'
+        );
         $p = new WP_HTML_Tag_Processor( 'abc</textarea>' );
 …
         $p->next_tag();
-        $this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </title> tag closer' );
-        $this->assertTrue( $p->is_tag_closer(), 'Indicated a <title> tag opener is a tag closer' );
+        $this->assertFalse(
+            $p->next_tag( array( 'tag_closers' => 'visit' ) ),
+            'Should not have found closing TITLE when closing an opener.'
+        );
         $p = new WP_HTML_Tag_Processor( 'abc</title>' );
 …
             'Text with comments'     => array( 'One <!-- sneaky --> comment.' ),
             'Empty tag closer'       => array( '</>' ),
             'Processing instruction' => array( '<?xml version="1.0"?>' ),
             'Combination XML-like'   => array( '<!DOCTYPE xml><?xml version=""?><!-- this is not a real document. --><![CDATA[it only serves as a test]]>' ),
 …
             'Partial CDATA'                        => array( '<![CDA' ),
             'Partially closed CDATA]'              => array( '<![CDATA[cannot escape]' ),
-            'Partially closed CDATA]>'             => array( '<![CDATA[cannot escape]>' ),
             'Unclosed IFRAME'                      => array( '<iframe><div>' ),
             'Unclosed NOEMBED'                     => array( '<noembed><div>' ),
 …
             'tag inside of CDATA'      => array(
                 'input'    => '<![CDATA[This <is> a <strong id="yes">HTML Tag</strong>]]><span>test</span>',
-                'expected' => '<![CDATA[This <is> a <strong id="yes">HTML Tag</strong>]]><span class="firstTag" foo="bar">test</span>',
+                'expected' => '<![CDATA[This <is> a <strong class="firstTag" foo="bar" id="yes">HTML Tag</strong>]]><span class="secondTag">test</span>',
             ),
         );
