Comparison of HTML parsers

HTML parsers are software that automatically parses Hypertext Markup Language (HTML). They serve two main purposes:

HTML traversal: offering an interface that lets programmers easily access and modify the HTML source. Canonical example: DOM parsers.
HTML cleaning: fixing invalid HTML and improving the layout and indentation style of the resulting markup. Canonical example: HTML Tidy.
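
As a rough illustration of the traversal purpose, the sketch below collects link targets from a fragment of markup. It uses Python's standard-library `html.parser.HTMLParser`, which is an event-based parser rather than a DOM parser and is not one of the parsers compared here. The names `LinkCollector` and `collect_links` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Return the href values found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(collect_links('<p><a href="/a">x</a> <A HREF="b.html">y</A></p>'))
```

A DOM-style parser would instead build a tree that can be queried and modified after parsing; the callback approach here only observes tags as they are read.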

Parser	License	Implementation language(s)	Latest release date*	HTML parsing^[1]	HTML5-compliant parsing	Clean HTML**	Update HTML***
HTML Tidy	W3C license	ANSI C	2021-07-17^[2]	Yes^[3]	Yes	Yes^[3]	Yes
HtmlUnit	Apache License 2.0	Java	2023-10-31^[4]	Yes	?	No	No
Beautiful Soup	MIT License	Python	2023-04-07^[5]	Yes	Yes	?	No
jsoup	MIT License	Java	2024-07-10^[6]	Yes	Yes	Yes	Yes

* Date of the latest release with significant changes.

** Sanitizes (generates standards-compliant web pages, reduces spam, etc.) and cleans (strips out surplus presentational tags, removes XSS code, etc.) HTML code.

*** Updates HTML 4.x to XHTML or to HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
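
The CENTER-to-DIV conversion mentioned in this note can be sketched as follows. This is a minimal illustration built on Python's standard-library `html.parser.HTMLParser`, not the method used by any parser in the table. `CenterRewriter` and `rewrite_center` are hypothetical names, and the sketch assumes reasonably well-formed input: it re-emits other markup largely as written and does not repair invalid HTML.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Hypothetical helper: re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references unconverted so they can be echoed as written.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            self.out.append('<div style="text-align:center;">')
        else:
            # Echo the original start tag text unchanged.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def rewrite_center(html):
    """Return html with each CENTER element turned into a centered DIV."""
    parser = CenterRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(rewrite_center("<CENTER>Hi &amp; bye</CENTER><p>x</p>"))
```

End tags are written back in lowercase, so the output can differ from the input in letter case even for elements that are not converted.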

References

[1] 12.2 Parsing HTML documents, HTML Standard. Archived 2013-01-16 at the Wayback Machine.
[2] HTML Tidy release 5.8.0
[3] What is Tidy?
[4] HtmlUnit 3.7.0
[5] Beautiful Soup release 4.10
[6] jsoup Java HTML Parser release 1.18.1
