Make WordPress Core

Opened 14 months ago

Closed 11 months ago

Last modified 4 weeks ago

#58517 closed enhancement (fixed)

HTML API: Introduce HTML Processor, a higher-level partner to the Tag Processor

Reported by: dmsnell's profile dmsnell Owned by: bernhard-reiter's profile Bernhard Reiter
Milestone: 6.4 Priority: normal
Severity: normal Version:
Component: HTML API Keywords: has-unit-tests 2nd-opinion has-patch
Focuses: Cc:

Description

This patch introduces the first of many iterations on the evolution of the HTML API, the HTML Processor, which is built in order to understand HTML structure including nesting, misnesting, and complicated semantic rules.

In the first iteration, the HTML Processor is arbitrarily limited to a minimal subset of functionality so that we can review it, ship it, test it, and collect feedback before moving forward. This means that this patch is more or less an extension to the Tag Processor query language, providing the ability not only to scan for a tag of a given name, but also to find an HTML element in a specific nesting path.

The HTML Processor also aborts any time it encounters:

  • a tag that isn't a P, DIV, FIGURE, FIGCAPTION, IMG, STRONG, B, EM, I, A, BIG, CODE, FONT, SMALL, STRIKE, TT, or U tag. this limit exists because many HTML elements require specific rules and we are trying to limit the number of rules introduced at once. this work is targeted at existing work in places like the image block.
  • certain misnesting constructs that evoke complicated resolution inside the HTML spec. where possible and where simple to do reliably, certain parse errors are handled. in most cases the HTML Processor aborts.

The structure of the HTML Processor is established in this patch. Further spec-compliance comes through filling out _more of the same_ kind and nature of code as is found in this patch. Certain critical HTML algorithms are partially supported, and where support requires more than is present, the HTML Processor acknowledges this and refuses to operate.

In this patch are explorations for how to verify that new HTML support is fully added (instead of allowing for partial updates that leave some code paths non-compliant). Performance is hard to measure since support is so limited at the current time, but it should generally follow the performance of the Tag Processor somewhat close as the overhead is minimized as much as practical.

A Make post will follow, and the PR in Github will see updates. Target is merging into 6.4.

Change History (34)

This ticket was mentioned in PR #4519 on WordPress/wordpress-develop by @dmsnell.


14 months ago
#1

Trac ticket: 58517-trac

## Status

  • [x] Finish documenting the main HTML Processor class
  • [x] Can we inhibit people from using the constructor directly?
    • Can't make it private because the tag processor constructor isn't private
    • Added _doing_it_wrong() to warn people
  • [ ] Prepare explanation of what it supports and how development is planned.
    • Block seeking after modifying? We can add this later, but for now it seems good before we can provide the reliability guarantees needed to ensure it.
    • No set_inner_html() or set_outer_html() for release because it introduces new problems (mostly around seeking and bookmarking)
    • Only support using breadcrumbs to find a tag, basically an extension to the Tag Processor's query language.
  • [x] Try and find HTML cases that break existing support.
  • [x] Review bookmark garbage collection, I think there are some leaks when creating them outside of the stack, which might only exist in the inner/outer HTML work.
    • ~Could we finally create an internal bookmark class, that when it disappears from scope can release the bookmark?~
    • ~If we do this, I think we would need to mirror the bookmarking system, but it would result in natural garbage collection, or would we have to _hook_ in to the bookmarking so that the outer class calls into the bookmarks.~
    • Derp, rely on the WP_HTML_Token class and release the bookmark in the destructor.
  • [x] Formalize after_push and after_pop, as they will be valuable for turning nested iteration into a "free" flag
    • Removed for now, but these will be a handy addition when we get back to implementing things like working with inner/outer content. They can come in their own explained changeset.
  • [x] Add a last_error property to determine why a failed run failed.

Lines of code counts as of 6ba5b3932612cfd7c00d30422e63ed5e1639d92b:

  • _Source lines_ 634
  • _Comment lines_ 1128
  • _Files_ 6
  • _Test lines_ 529

@dmsnell commented on PR #4519:


14 months ago
#2

@costdev thanks for your thorough code style scrutiny. I appreciate it! given the state of this PR right now, I would also be extremely grateful for higher-level feedback about the design and implementation and overall structure given that lots of code might change between now and merge. I've done my best to try and get ahead of having a hundred comments about code style, but obviously I missed a lot of cases. still, all the comments about small styling differences can make it harder to review the parts that matter with respect to function, performance, security, design.

would it help next time if I add something in the description to say this isn't ready for meticulous styling scrutiny? what would be a good way to communicate this?

@costdev commented on PR #4519:


14 months ago
#3

@dmsnell No problem and I totally understand! When a folks are already weighing in on the design/implementation side of things in a PR, I tend to look for the rest so that the PR gets good coverage in reviews, but I get that when there's a lot of iteration anticipated, this can make things hard to follow.

As we don't have three stages of PR (such as "Draft", "Ready for technical review" and "Ready for additional review"), it may be worth keeping a PR in "Draft" if a lot of iteration is anticipated prior to merge. That way, design discussions and such can take priority until the PR is marked "Ready for Review", then final reviews (including code style) can be carried out when it's marked "Ready for Review", so nothing gets missed before commit.

What do you think?

@dmsnell commented on PR #4519:


14 months ago
#4

What do you think?

Maybe something to iterate on. I don't anticipate a lot of iteration because I have been careful and we went through a lot of internal review already, but if someone _were_ to point out something at this time it could change things is more what I meant.

I've added a note at the top. Thanks again for your thoroughness; it was actually that attention to detail with the Tag Processor that led most of the functions here to have the appropriate docblock ordering already.

@dmsnell commented on PR #4519:


13 months ago
#5

@ockham I think your comments are all valid. How about we get this in trunk so we can start iterating on it, and before 6.4.0 is cut, make sure that everything we want is in place, be it state simplification, TODO items (which I intentionally left out because I was afraid people would reject @TODO comments), and additional tests.

Thanks for all your help with this.

#6 follow-up: @Bernhard Reiter
13 months ago

  • Owner set to Bernhard Reiter
  • Resolution set to fixed
  • Status changed from new to closed

In 56274:

HTML-API: Introduce minimal HTML Processor.

This patch introduces the first of many iterations on the evolution of the HTML API, the HTML Processor, which is built in order to understand HTML structure including nesting, misnesting, and complicated semantic rules.

In the first iteration, the HTML Processor is arbitrarily limited to a minimal subset of functionality so that we can review it, ship it, test it, and collect feedback before moving forward. This means that this patch is more or less an extension to the Tag Processor query language, providing the ability not only to scan for a tag of a given name, but also to find an HTML element in a specific nesting path.

The HTML Processor also aborts any time it encounters:

  • a tag that isn't a P, DIV, FIGURE, FIGCAPTION, IMG, STRONG, B, EM, I, A, BIG, CODE, FONT, SMALL, STRIKE, TT, or U tag. this limit exists because many HTML elements require specific rules and we are trying to limit the number of rules introduced at once. this work is targeted at existing work in places like the image block.
  • certain misnesting constructs that evoke complicated resolution inside the HTML spec. where possible and where simple to do reliably, certain parse errors are handled. in most cases the HTML Processor aborts.

The structure of the HTML Processor is established in this patch. Further spec-compliance comes through filling out more of the same kind and nature of code as is found in this patch. Certain critical HTML algorithms are partially supported, and where support requires more than is present, the HTML Processor acknowledges this and refuses to operate.

In this patch are explorations for how to verify that new HTML support is fully added (instead of allowing for partial updates that leave some code paths non-compliant). Performance is hard to measure since support is so limited at the current time, but it should generally follow the performance of the Tag Processor somewhat close as the overhead is minimized as much as practical.

Props dmsnell, zieladam, costdev.
Fixes #58517.

#8 @spacedmonkey
13 months ago

@bernhard-reiter Is this meant for WP 6.4? If so, this ticket should be marked as so.

#9 @audrasjb
13 months ago

  • Milestone changed from Awaiting Review to 6.4

Indeed, this was committed without any milestone, which shouldn't be the case :)
Moving to milestone 6.4.

#10 @flixos90
13 months ago

  • Milestone changed from 6.4 to Awaiting Review

@bernhard-reiter Just spotted this, and I'm curious how the new HTML Processor relates to the HTML Tag Processor? I see the title mentions "higher-level partner", but can you elaborate on that? For example, what use-cases would you rely on the HTML Processor for vs the HTML Tag Processor?

#11 @flixos90
13 months ago

  • Milestone changed from Awaiting Review to 6.4

Oops, changing the milestone was not intentional, @audrasjb and I were just apparently authoring comments at the same time :)

#12 @dmsnell
13 months ago

@flixos90 I've tried to explain in the class docblock the difference, but it's something we can continue to improve.

the short answer is that the Tag Processor lexes individual tokens within an HTML document, but is completely unaware of the HTML structure and unaware of the semantic parsing rules, such as the fact that a new <p> implicitly closes any unopened <p> found before it.

the HTML Processor thus aspires to eventually provide a reliable DOM-like interface for operating in a streaming manner just like the Tag Processor, but with the knowledge of the entire document structure including nesting, overlapping tags, missing closers, and other kinds of HTML errors.

at this stage we're only introducing the very minimal amount of support and I want to gradually bring in further support as we test and vet this code in the Gutenberg plugin. a large part of why this is so minimal is to make review and testing easier. in the code is the basic structure and skeleton necessary for fully implementing the HTML5 specification, but in almost all of the places requiring more complicated coding the processor aborts and refuses to engage with HTML it may not understand.

at the moment this patch can be viewed as an extension to the Tag Processor that allows for a more advanced query. whereas the Tag Processor can search by tag name, the HTML Processor can also search by "breadcrumbs" which is equivalent to a chain of tag names in a CSS query separated by the child combinator (>). This is challenging to implement with the Tag Processor, but the HTML Processor is building the HTML state machine to make these kinds of queries and operations easy and reliable.

#13 in reply to: ↑ 6 ; follow-up: @azaozz
13 months ago

  • Keywords 2nd-opinion added; has-patch removed
  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to Bernhard Reiter:

This patch introduces the first of many iterations on the evolution of the HTML API, the HTML Processor, which is built in order to understand HTML structure including nesting, misnesting, and complicated semantic rules.

Sorry but still thinking that building an HTML parser in PHP is not a good idea. If it was, there would have been a few around.

Also WP doesn't need this imho. JavaScript is 1000 times better at handling all the tasks that might be needed in manipulating HTML. At the end of the day no matter how well a PHP parser is maintained (which is quite questionable when it is part of WP and not a separate project) it will likely still have many unresolved edge cases and will likely be long way behind the latest in HTML standards.

For comparison doing any of the HTML parsing/manipulation tasks in JS would be faster, more secure and a lot easier :) (would generally need few lines of code instead of few thousands).

@luisherranz commented on PR #4519:


13 months ago
#14

Hey, feel free to ignore it, and I'm sure this is not the bottleneck, but wouldn't it be preferable from a performance point of view to order the tags by how common they are?

return (
  'ADDRESS' === $tag_name ||
  'APPLET' === $tag_name ||
  'AREA' === $tag_name ||
  'ARTICLE' === $tag_name ||
  // ...
switch ( $op ) {
  case '+BLOCKQUOTE':
  case '+DIV':
  case '+FIGCAPTION':
  case '+FIGURE':
  case '+P':
  // ...

Maybe also store those tag name strings in variables? (I'm not so sure if that one makes any difference).

@dmsnell commented on PR #4519:


13 months ago
#15

wouldn't it be preferable from a performance point of view to order the tags by how common they are?

I don't know. Once this is spec-complete I'd like to run some benchmarks to better understand. A switch is also possible, but we need to measure and see. For the moment I want to keep things clear from an implementation standpoint, preferring a correspondence to the HTML spec's organization over conjectured micro performance. If we discover this is a significant penalty then I'd reconsider.

Maybe also store those tag name strings in variables? (I'm not so sure if that one makes any difference).

PHP interns string literals so there's no benefit to storing them in variables.

Originally I had a more performant solution here, which involved a WP_HTML_Spec class, but for the minimal processor I removed that again for reasons of clarity and maintainability over raw performance.

I'm still not sure how fast or slow this is because it's hard to create a meaningful test document that's fully supported. My guess is that this is marginally slower than the Tag Processor.

#16 in reply to: ↑ 13 ; follow-up: @westonruter
12 months ago

Replying to azaozz:

Also WP doesn't need this imho. JavaScript is 1000 times better at handling all the tasks that might be needed in manipulating HTML. [...] For comparison doing any of the HTML parsing/manipulation tasks in JS would be faster, more secure and a lot easier :) (would generally need few lines of code instead of few thousands).

How is JavaScript faster? By doing HTML manipulation via PHP on the server, there would be no need to ship any JS, meaning much faster client-side performance in all the CWV metrics. And when using page caching, no difference in terms of TTFB.

I definitely think WP does need this, because the server-side alternatives are much worse in terms of performance or security/reliability: DOMDocument and RegEx.

This ticket was mentioned in Slack in #core by oglekler. View the logs.


11 months ago

#18 follow-up: @oglekler
11 months ago

This ticket was discussed during bug scrub.

Hi @westonruter and @azaozz, we are getting closer to Beta 1, so, please review what we have.

Add props @mukesh27

#19 @dmsnell
11 months ago

@oglekler this has been in trunk since July 20 and is supposed to be part of 6.4. Beyond some ideological opposition to properly handling HTML from within PHP, there is nothing to my knowledge that anyone has raised as objection to this.

Can you clarify what you mean by "reinventing the wheel" here? For more context you might review the API roadmap post on Make, which reviews the motivation for the different parts of the HTML API, or recent XSS alerts spanning multiple new code contributions around WordPress that wouldn't have been possible within this new API.

What follow-up would you like here? This is definitely a different sort of component being developed, but its pace and schedule are intentional in response to the sensitivity of the work and the longstanding security, convenience, performance, and reliability issues that have given rise to it.

Happy to help however I can. This early inclusion Core-first is part of an attempt to avoid typical headaches in Gutenberg > Core merges, headaches which appeared during the 6.3 process.

#20 in reply to: ↑ 18 @azaozz
11 months ago

Replying to oglekler:

we are getting closer to Beta 1, so, please review what we have.

Yep, there are some "philosophical differences" here about what is considered a "good way" to do things with HTML and what is not a good idea or a "bad way".

I have no doubt that @dmsnell will make the parsing of HTML in PHP work as well as possible, and with the least amount of edge cases. However the popular belief (not just mine but of many developers for many years) is that it is impossible to account for all edge cases. This will always be inferior to applying the same changes from JS using the browser's DOM. The difference being that PHP can only guess but actually has no idea what the DOM will look like once the browser parses the HTML. (This also includes subtle browsers differences in handling cases of "broken" HTML, etc.)

In these terms I'll continue to advocate against using this (or any other) "processing of HTML as string in PHP" in WP core, now and in the future. This would include cautioning/negative reviews for any code that uses this. I believe this is the inferior way to manipulate the HTML, and that using the browser's DOM will always be the better alternative.

Last edited 11 months ago by azaozz (previous) (diff)

#21 in reply to: ↑ 16 @azaozz
11 months ago

Replying to westonruter:

How is JavaScript faster? By doing HTML manipulation via PHP on the server, there would be no need to ship any JS, meaning much faster client-side performance in all the CWV metrics. And when using page caching, no difference in terms of TTFB.

Thinking you're perhaps considering only very simple examples here. The PHP manipulations of a long(er) HTML strings would likely slow down significantly. Perhaps try testing with 10KB, 100KB and 500KB strings. Comparing that with (native) browser DOM manipulations, it would be slower in most cases, with some exceptions (low memory in PHP will likely result in a fatal error, low browser memory will likely result in excessively sluggish performance, etc.).

I definitely think WP does need this, because the server-side alternatives are much worse in terms of performance or security/reliability: DOMDocument and RegEx.

Uh, not talking about these at all. I believe all "HTML manipulations" should be done using the browser's DOM, see above. This is by far the better way, and also the most future-proof and bugs/edge cases free.

#22 follow-up: @dmsnell
11 months ago

it is impossible to account for all edge cases. This will always be inferior to applying the same changes from JS using the browser's DOM. The difference being that PHP has no idea what the DOM will look like once the browser parses the HTML.

come on @azaozz 🙃 I thought we'd gotten through this - these comments are factually inaccurate. I really do hear you, but repeating this mantra doesn't help anyone or provide any actionable steps. it's absolutely possible to do all these things because HTML5 standardized everything that governs how the browsers and all other parsers should handle malformed HTML.

the entire purpose of the HTML API is to stop treating HTML as a string in PHP, because it provides a structural/semantic interface for doing that. you're barking against the one thing that's delivering what you've been wanting for years.

that being said I have no problem with your opinion to run code on the browser vs. the server, but this isn't the place to hold that discussion is it? "don't provide a safe structural HTML interface on the server because I personally prefer people do it in the browser?" you're advocating that we keep the status quo of treating HTML as a string on the server, advocating for keeping around the norms that introduce all the vulnerabilities and breakages you are speaking against.

I believe all "HTML manipulations" should be done using the browser's DOM

I would enjoy reviewing your patch to remove wp_kses 😀

anyway, my point was to try and be clear for the sake of @oglekler - this is not the place to hash out a philosophy of doing everything in the customer's browser vs. on the server. that cannot be resolved here, plus there are already competing CMS's that don't run PHP.

this is a practical matter about whether we want to eliminate security vulnerabilities and content breakage by providing a spec-compliant HTML parser in Core. this system eliminates knowingly-naive and faulty string replacements and regular expressions that constantly cause grief and damage WordPress' reputation. as long as people are processing HTML on the server I'd rather have a reliable system than a fundamentally-flawed one.

can we find a different place to host the philosophical argument rather than on every patch that improves the situation?

#23 in reply to: ↑ 22 @azaozz
11 months ago

Replying to dmsnell:

these comments are factually inaccurate

Hehe, sure, lets give this some time in core. If nothing comes up I'll try to get few examples where this behaves differently than manipulating the browser's DOM :)

but this isn't the place to hold that discussion is it?

Yea, perhaps, but then it seems "nowhere" is suitable. Having this documented here at least can perhaps reach few people who may have opinions on the matter.

you're advocating that we keep the status quo of treating HTML as a string on the server, advocating for keeping around the norms that introduce all the vulnerabilities and breakages you are speaking against.

No. I'd advocating for not adding more or new code that keeps repeating the same old mistakes of trying to manipulate HTML as strings in PHP. Deprecating and removing the old code that does this would be great. Unfortunately that's a really hard task while maintaining backwards compatibility.

I would enjoy reviewing your patch to remove wp_kses 😀

Yea, replacing KSES is impossible for now. It's functionality is somewhat different though, it is meant to make it impossible to save HTML that contains certain tags or attributes, even if it breaks it in the process.

On the other hand replacing KSES may not be that impossible if WP starts using node.js and headless Chromium on the server... Who knows :)

anyway, my point was to try and be clear for the sake of @oglekler

Yea. Seems this is in core and will stay there until forever no matter what!!

this is a practical matter about whether we want to eliminate security vulnerabilities and content breakage by providing a spec-compliant HTML parser in Core. this system eliminates knowingly-naive and faulty string replacements and regular expressions that constantly cause grief and damage WordPress' reputation. as long as people are processing HTML on the server I'd rather have a reliable system than a fundamentally-flawed one.

If the intent here was to only replace existing cases of manipulating HTML in PHP, I'd be +1000 for it. However as far as I see the intent is to keep adding more and more code that does this. In core and by plugins. Despite that most (all?) agree this is not the best way to do things :(

can we find a different place to host the philosophical argument rather than on every patch that improves the situation?

Hmm, not sure. Don't actually think such place exists as each new addition of code using this functionality should be considered separately imho.

Last edited 11 months ago by azaozz (previous) (diff)

#24 @dmsnell
11 months ago

@azaozz I don't want to continue these circular discussions here since everyone can examine the code for themselves to see what's accurate and not.

I started collecting a list of places people are introducing the HTML API. so far, from what I found, about half of the cases are directly replacing string-based HTML processing. the other half are more or less doing the same thing but in new features. in other words, the features would be processing HTML in PHP regardless, but they are new enough that they're leaning on the HTML API for reliability instead of regular expressions and string-replace.

HTML API introductions in the WordPress world

if you are looking for where to share opposition, I recommend that you leave it in those PRs where people are introducing what you don't like: processing HTML in PHP. this HTML API development is only about making ongoing work more reliable and about removing decades of defects, exploits, and breakage; it's not about persuading people to do more or less work on the server or in the client.

This ticket was mentioned in Slack in #core by oglekler. View the logs.


11 months ago

#26 @kirasong
11 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Looking through the end of this discussion, I see a comment from @azaozz that notes:

Hehe, sure, lets give this some time in core. If nothing comes up I'll try to get few examples where this behaves differently than manipulating the browser's DOM :)

Based on that, I am going to go ahead and re-close this ticket for now.

If there are issues that are brought up during beta, then we can address whether things should change here.

Any performance improvements can happen in new tickets.

Thanks so much everyone for your work here!

#27 @codente
10 months ago

@dmsnell Thanks for the dev note on this. We're tracking the dev note and feedback over here:
https://github.com/orgs/WordPress/projects/141/views/1?pane=issue&itemId=41804661

#28 @dmsnell
5 months ago

In 57768:

HTML API: Ensure that breadcrumbs are properly retained after seeking.

In some cases, it's possible to seek back into a location found inside
an element which has been closed before the point in the document where
the seek() was made. In these cases the breadcrumb stack is lost, and
calling get_breadcrumbs() after the seek will return the wrong information.

In this patch, the HTML Processor takes a conservative approach and
moves to the front of the document, then reparses the document until
it reaches the sought-after location. This ensures consistency on
the stack of open elements and active formats, and preserves
breadcrumbs.

Developed in https://github.com/WordPress/wordpress-develop/pull/6185
Discussed in https://core.trac.wordpress.org/ticket/60687

Props jonsurrell.
Follow-up to [60687].
See #58517.
Fixes #60687.

#29 @dmsnell
4 months ago

In 58010:

HTML API: Validate HTML Processor against external test suite from html5lib.

In this patch, the test suite from html5lib validates the tree-construction
steps in the HTML Processor to ensure that they are behaving according to the
HTML specification. This suite of tests is also used by the servo project to
test its html5ever package.

A new test module in the HTML API transforms HTML Processor output to match
the expected tree shape from the external tests. For cases where there are
tests validating behaviors of unsupported HTML tags and constructs, the tests
are marked as skipped. As the HTML API continues to expand its own support,
the number of skipped tests will automatically shrink down towards zero.

Additional tests are skipped through the SKIP_TEST array in the test runner.

Fixes #60227.
See #58517.
Props azaozz, costdev, dmsnell, hellofromtonya, jonsurrell, jorbin, swisspidy.

This ticket was mentioned in PR #6618 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#30

  • Keywords has-patch added

Trac ticket: Core-58517

Previously the HTML Process was ignoring the class_name argument in the next_tag() query if it existed. This was wrong and would lead to calling code finding the wrong tag if provided.

This patch adds the class detection code back into next_tag() to fix the bug.

Follow-up to [56274].

#31 @dmsnell
2 months ago

In 58190:

HTML API: Respect class_name query arg in HTML_Processor::next_tag()

Previously the HTML Process was ignoring the class_name argument in
the next_tag() query if it existed. This was wrong and would lead to
calling code finding the wrong tag if provided.

This patch adds the class detection code back into next_tag() to fix
the bug.

Developed in https://github.com/WordPress/wordpress-develop/pull/6618
Discussed in https://core.trac.wordpress.org/ticket/58517

See #58517.
Follow-up to [56274].
Props: dmsnell, jonsurrell.

This ticket was mentioned in PR #6980 on WordPress/wordpress-develop by @dmsnell.


4 weeks ago
#33

Trac ticket: Core-58517

Previously the HTML Processor was ignoring the tag_name argument in the next_tag() query if it existed. This was wrong adn would lead to calling code finding the very next tag, regardless of tag name, instead of the requested taag.

This patch adds the tag name detection code into next_tag() to fix the bug and ensure that next_tag() always returns only when finding a tag of the given name.

Follow-up to [56274].

Note: See TracTickets for help on using tickets.