CHANGES.md
NodeTraversor support for in-place DOM rewrites during NodeVisitor.head(). Current-node edits such as remove, replace, and unwrap now recover more predictably, while traversal stays within the original root subtree. This makes single-pass tree cleanup and normalization visitors easier to write, for example when unwrapping presentational elements or replacing text nodes as you walk the DOM. #2472
Cleaner may be reused across concurrent threads, and that shared Safelist instances should not be mutated while in use. #2473
re2j dependency when not present. #2459
NodeTraversor regression in 1.21.2 where removing or replacing the current node during head() could revisit the replacement node and loop indefinitely. The traversal docs now also clarify which inserted nodes are visited in the current pass. #2472
available() call throws IOException, as seen on JDK 8 HttpURLConnection. #2474
Cleaner no longer makes relative URL attributes in the input document absolute when cleaning or validating a Document. URL normalization now applies only to the cleaned output, and Safelist.isSafeAttribute() is side effect free. #2475
Cleaner no longer duplicates enforced attributes when the input Document preserves attribute case. A case-variant source attribute is now replaced by the enforced attribute in the cleaned output. #2476
HttpClient, because the JDK would silently ignore that proxy and attempt to connect directly. Those requests now fall back to the legacy HttpURLConnection transport instead, which does support SOCKS. #2468
re2j regular expression engine for regex-based CSS selectors (e.g. [attr~=regex], :matches(regex)), which ensures linear-time performance for regex evaluation. This allows safer handling of arbitrary user-supplied query regexes. To enable, add the com.google.re2j dependency to your classpath, e.g.:
<dependency>
<groupId>com.google.re2j</groupId>
<artifactId>re2j</artifactId>
<version>1.8</version>
</dependency>
(If you already have that dependency in your classpath, but you want to keep using the Java regex engine, you can disable re2j via System.setProperty("jsoup.useRe2j", "false").) You can confirm that the re2j engine has been enabled correctly by calling org.jsoup.helper.Regex.usingRe2j(). #2407
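The opt-in and opt-out rules described above can be sketched with plain JDK calls. This is a minimal illustration, not jsoup's implementation: the class `Re2jToggle` and its methods are hypothetical, and the rule "use re2j when it is on the classpath unless `jsoup.useRe2j` is `false`" is an assumption drawn from the wording of this entry. The only names taken from the entry are the property key and the `com.google.re2j` package.

```java
// Hypothetical helper, NOT part of jsoup: sketches the engine selection rule described above.
final class Re2jToggle {
    /** Property name as given in the changelog entry. */
    static final String PROPERTY = "jsoup.useRe2j";

    /** Returns true if the re2j Pattern class can be loaded from the classpath. */
    static boolean re2jOnClasspath() {
        try {
            // initialize=false: only test for presence, do not run static initializers
            Class.forName("com.google.re2j.Pattern", false, Re2jToggle.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Assumed rule: use re2j when present, unless the property is explicitly "false". */
    static boolean wouldUseRe2j() {
        String flag = System.getProperty(PROPERTY, "true");
        return re2jOnClasspath() && !"false".equalsIgnoreCase(flag);
    }

    public static void main(String[] args) {
        System.out.println("re2j on classpath: " + re2jOnClasspath());
        System.setProperty(PROPERTY, "false"); // opt out, as described in the entry
        System.out.println("would use re2j after opt-out: " + wouldUseRe2j());
    }
}
```

In an application that uses jsoup, the check to rely on is the one named in the entry, org.jsoup.helper.Regex.usingRe2j(); the sketch only shows why the dependency and the system property both matter.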
Parser#unescape(String, boolean) that unescapes HTML entities using the parser's configuration (e.g. to support error tracking), complementing the existing static utility Parser.unescapeEntities(String, boolean). #2396
org.jsoup.parser.Parser#setMaxDepth. #2421
Elements of an Element were not correctly invalidated in Node#replaceWith(Node), which could lead to incorrect results when subsequently calling Element#children(). #2391
[attr=" foo "]). Now matches align with the CSS specification and browser engines. #2380
ProxySelector.getDefault()) was ignored. Now, the system proxy is used if a per-request proxy is not set. #2388, #2390
ValidationException could be thrown in the adoption agency algorithm with particularly broken input. Now logged as a parse error. #2393
IndexOutOfBoundsException could be thrown when parsing a body fragment with crafted input. Now logged as a parse error. #2397, #2406
parent child selector) across many retained threads, their memoized results could also be retained, increasing memory use. These results are now cleared immediately after use, reducing overall memory consumption. #2411
Parser now preserves any custom TagSet applied to the parser. #2422, #2423
Tag.Void now parse and serialize like the built-in void elements: they no longer consume following content, and the XML serializer emits the expected self-closing form. #2425
The <br> element is once again classified as an inline tag (Tag.isBlock() == false), matching common developer expectations and its role as phrasing content in HTML, while pretty-printing and text extraction continue to treat it as a line break in the rendered output. #2387, #2439
Jsoup.connect(url).get(). On responses without a charset header, the initial charset sniff could sometimes (depending on buffering / available() behavior) be mistaken for end-of-stream and a partial parse reused, dropping trailing content. #2448
TagSet copies no longer mutate their template during lazy lookups, preventing cross-thread ConcurrentModificationException when parsing with shared sessions. #2453
<svg> foreignObject content nested within a <p>, which could incorrectly move the HTML subtree outside the SVG. #2452
org.jsoup.internal.Functions (for removal in v1.23.1). This was previously used to support older Android API levels without full java.util.function coverage; jsoup now requires core library desugaring so this indirection is no longer necessary. #2412
Normalizer#normalize(String, boolean) and Attribute#shouldCollapseAttribute(Document.OutputSettings). These will be removed in a future version.
Connection#sslSocketFactory(SSLSocketFactory) in favor of the new Connection#sslContext(SSLContext). Using sslSocketFactory will force the use of the legacy HttpURLConnection implementation, which does not support HTTP/2. #2370
Connection.Response#statusMessage() to return a simple loggable string message (e.g. "OK") when using the HttpClient implementation, which doesn't otherwise return any server-set status message. #2356
Attributes#size() and Attributes#isEmpty() now exclude any internal attributes (such as user data) from their count. This aligns with the attributes' serialized output and iterator. #2369
Connection#sslContext(SSLContext) to provide a custom SSL (TLS) context to requests, supporting both the HttpClient and the legacy HttpURLConnection implementations. #2370
element.child(0).remove(), and when using Parser#parseBodyFragment() to parse a large number of direct children. #2373
NodeTraversor, if a last child element was removed during the head() call, the parent would be visited twice. #2355
Attributes#size() and Attributes#isEmpty(). #2356
Element#children() on the same element concurrently, a race condition could happen when the method was generating the internal child element cache (a filtered view of its child nodes).
Since concurrent reads of DOM objects should be threadsafe without external synchronization, this method has been updated to execute atomically. #2366
:matchText pseudo-selector due to its side effects on the DOM; use the new ::textnode selector and the Element#selectNodes(String css, Class type) method instead. #2343
Connection.Response#bufferUp() in lieu of Connection.Response#readFully() which can throw a checked IOException.
Validate#ensureNotNull (replaced by typed Validate#expectNotNull); protected HTML appenders from Attribute and Node.
Selector to support direct matching against nodes such as comments and text nodes. For example, you can now find an element that follows a specific comment: ::comment:contains(prices) + p will select p elements immediately after a <!-- prices: --> comment. Supported types include ::node, ::leafnode, ::comment, ::text, ::data, and ::cdata. Node contextual selectors like ::node:contains(text), :matches(regex), and :blank are also supported. Introduced Element#selectNodes(String css) and Element#selectNodes(String css, Class nodeType) for direct node selection. #2324
TagSet#onNewTag(Consumer<Tag> customizer): register a callback that’s invoked for each new or cloned Tag when it’s inserted into the set. Enables dynamic tweaks of tag options (for example, marking all custom tags as self-closing, or everything in a given namespace as preserving whitespace).
TokenQueue and CharacterReader autocloseable, to ensure that they will release their buffers back to the buffer pool, for later reuse.
Selector#evaluatorOf(String css), as a clearer way to obtain an Evaluator from a CSS query. An alias of QueryParser.parse(String css).
TagSet) in a foreign namespace (e.g. SVG) can be configured to parse as data tags.
NodeVisitor#traverse(Node) to simplify node traversal calls (vs. importing NodeTraversor).
Connection#readFully() as a replacement for Connection#bufferUp() with an explicit IOException.
Similarly, added Connection#readBody() over Connection#body(). Deprecated Connection#bufferUp(). #2327
< and > characters are now escaped in attributes. This helps prevent a class of mutation XSS attacks. #2337
Connection to prefer using the JDK's HttpClient over HttpURLConnection, if available, to enable HTTP/2 support by default. Users can disable via -Djsoup.useHttpClient=false. #2340
script in an svg foreign context should be parsed as script data, not text. #2320
Tag#isFormSubmittable() was updating the Tag's options. #2323
<foo />)
to close HTML elements by default. Foreign content (SVG, MathML), and content parsed with the XML parser, still
supports self-closing tags. If you need specific HTML tags to support self-closing, you can register a custom tag via
the TagSet configured in Parser.tagSet(), using Tag#set(Tag.SelfClose). Standard void tags (such as <img>,
<br>, etc.) continue to behave as usual and are not affected by this
change. #2300
ChangeNotifyingArrayList, Document.updateMetaCharsetElement(), Document.updateMetaCharsetElement(boolean), HtmlTreeBuilder.isContentForTagData(String), Parser.isContentForTagData(String), Parser.setTreeBuilder(TreeBuilder), Tag.formatAsBlock(), Tag.isFormListed(), TokenQueue.addFirst(String), TokenQueue.chompTo(String), TokenQueue.chompToIgnoreCase(String), TokenQueue.consumeToIgnoreCase(String), TokenQueue.consumeWord(), TokenQueue.matchesAny(String...)
TagSet tag collection.
Their properties can impact both the parse and how content is
serialized (output as HTML or XML). #2285.Element.cssSelector() will prefer to return shorter selectors by using ancestor IDs when available and unique. E.g.
#id > div > p instead of html > body > div > div > p #2283.Elements.deselect(int index), Elements.deselect(Object o), and Elements.deselectAll() methods to remove
elements from the Elements list without removing them from the underlying DOM. Also added Elements.asList() method
to get a modifiable list of elements without affecting the DOM. (Individual Elements remain linked to the
DOM.) #2100.Connection.requestBodyStream(InputStream stream). #1122.Tag#prefix(), Tag#localName(), Attribute#prefix(), Attribute#localName(), and
Attribute#namespace() to retrieve these. #2299.Element#cssSelector() will emit
appropriately escaped selectors, and the QueryParser supports those. Added Selector.escapeCssIdentifier() and
Selector.unescapeCssIdentifier(). #2297, #2305QueryParser into a clearer recursive descent
parser. #2310.div >> p) will throw an explicit parse
exception. #2311.Parser instances threadsafe, so that inadvertent use of the same instance across threads will not lead to
errors. For actual concurrency, use Parser#newInstance() per
thread. #2314.Document to the W3C DOM in W3CDom, elements with an attribute in an undeclared namespace now
get a declaration of xmlns:prefix="undefined". This allows subsequent serialization to XML via W3CDom.asString()
to succeed. #2087.StreamParser could emit the final elements of a document twice, due to how onNodeCompleted was fired when closing out the stack. #2295.? in <?xml version="1.0"?> would
incorrectly emit an error. #2298.Element#cssSelector() on an element with combining characters in the class or ID now produces the correct output. #1984.Jsoup.connect(), when running on Java 11+, via the Java HttpClient
implementation. #2257.
System.setProperty("jsoup.useHttpClient", "true"); to enable making requests via the HttpClient instead,
which will enable HTTP/2 support, if available. This will become the default in a later version of jsoup, so now is
a good time to validate it.
HttpClient impl is not available in your JRE, requests will continue to be made via
HttpURLConnection (in HTTP/1.1 mode).
org.jsoup.UncheckedIOException (replace with java.io.UncheckedIOException);
moved previously deprecated method Element Element#forEach(Consumer) to
void Element#forEach(Consumer()). #2246
Document#updateMetaCharsetElement(boolean) and Document#updateMetaCharsetElement(), as the
setting had no effect. When Document#charset(Charset) is called, the document's meta charset or XML encoding
instruction is always set. #2247Safelist that preserves relative links, the isValid() method will now consider these
links valid. Additionally, the enforced attribute rel=nofollow will only be added to external links when configured
in the safelist. #2245Element#selectStream(String query) and Element#selectStream(Evaluator) methods, that return a Stream of
matching elements. Elements are evaluated and returned as they are found, and the stream can be
terminated early. #2092Element objects now implement Iterable, enabling them to be used in enhanced for loops.Reader via
Parser#parseFragmentInput(Reader, Element, String). #1177jsoup-examples.jar. #1702#id .class (and other similar descendant queries) by around 4.6x, by better
balancing the Ancestor evaluator's cost function in the query
planner. #2254<isindex> tags, which would autovivify a form element with labels. This is no
longer in the spec.Elements.selectFirst(String cssQuery) and Elements.expectFirst(String cssQuery), to select the first
matching element from an Elements list. #2263!. #2275< are normalized to _ to ensure valid
XML. For example, <foo<bar> becomes <foo_bar>, as XML does not allow < in element names, but HTML5
does. #2276; in an attribute name, it could not be converted to a W3C DOM element, and so subsequent XPath
queries could miss that element. Now, the attribute name is more completely
normalized. #2244Connection, skip cookies that have no name, rather than throwing a validation
exception. #2242java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
could be thrown when calling Response#body() after parsing from a URL and the buffer size was
exceeded. #2250null InputStream inputs to Jsoup.parse(InputStream stream, ...), by returning
an empty Document. #2252template tag containing an li within an open li would be parsed incorrectly, as it was not recognized as a
"special" tag (which have additional processing rules). Also, added the SVG and MathML namespace tags to the list of
special tags. #2258template tag containing a button within an open button would be parsed incorrectly, as the "in button scope"
check was not aware of the template element. Corrected other instances including MathML and SVG elements,
also. #2271
:nth-child selector with a negative digit-less step, such as :nth-child(-n+2), would be parsed incorrectly as a
positive step, and so would not match as expected. #1147
doc.charset(charset) on an empty XML document would throw an
IndexOutOfBoundsException. #2266
StructuralEvaluator (e.g., a selector ancestor chain like A B C) by
ensuring cache reset calls cascade to inner members. #2277
doc.clone().append(html) were not supported. When a document was cloned, its Parser was not cloned but was a shallow copy of the original parser. #2281
-, ., or digits were incorrectly marked as invalid and
removed. 2235
byte[] and char[]
arrays used to read and parse the input. 2186html() and Entities.escape() when the input contains UTF characters in a supplementary plane, by
around 49%. 2183
FormElement.elements() now reflect changes made to the DOM,
subsequent to the original parse. 2140
TreeBuilder, the onNodeInserted() and onNodeClosed() events are now also fired for the outermost /
root Document node. This enables source position tracking on the Document node (which was previously unset). And
it also enables the node traversor to see the outer Document node. 2182Elements#set(). 2212Element.cssSelector() would fail if the element's class contained a *
character. 2169html, it should be parsed in Quirks
Mode. 2197div:has(span + a), the has() component was not working correctly, as the inner combining
query caused the evaluator to match those against the outer's siblings, not
children. 2187:has() components in a nested :has() might incorrectly
execute. 2131Connection.Response#cookies() will provide the last one set. Generally it is better to use
the Jsoup.newSession method to maintain a cookie jar, as that
applies appropriate path selection on cookies when making requests. 1831html or body). 2204< as part of a tag name, instead of emitting it as a
character node. 2230< as the start of an attribute name, vs creating a new element. The previous behavior was
intended to parse closer to what we anticipated the author's intent to be, but that does not align to the spec or to
how browsers behave. 1483StreamParser provides a progressive parse of its input. As each Element is completed, it is
emitted via a Stream or Iterator interface. Elements returned will be complete with all their children, and an
(empty) next sibling, if applicable. Elements (or their children) may be removed from the DOM during the parse,
e.g. to conserve memory, providing a mechanism to parse an input document that would otherwise be too large to fit
into memory, yet still providing a DOM interface to the document and its elements. Additionally, the parser provides
a selectFirst(String query) / selectNext(String query), which will run the parser until a hit is found, at which
point the parse is suspended. It can be resumed via another select() call, or via the stream() or iterator()
methods. 2096Path accepting parse methods: Jsoup.parse(Path), Jsoup.parse(path, charsetName, baseUri, parser),
etc. 2055button tag configuration to include a space between multiple button elements in the Element.text()
method. 2105ns|* all elements in namespace Selector. 1811_, vs being
stripped. This should make the process clearer, and generally prevent an invalid attribute name being coerced
unexpectedly. 2143-1. 2106{, } in the path, a Malformed URL exception would
be thrown (if in development), or the URL might otherwise not be escaped correctly (if in
production). The URL encoding process has been improved to handle these characters
correctly. 2142W3CDom with a custom output Document, a Null Pointer Exception would be
thrown. 2114
:has() selector did not match correctly when using sibling combinators (e.g.
h1:has(+h2)). 2137
:empty selector incorrectly matched elements that started with a blank text node and were followed by
non-empty nodes, due to an incorrect short-circuit. 2130Element.cssSelector() would fail with "Did not find balanced marker" when building a selector for elements that had
a ( or [ in their class names. And selectors with those characters escaped would not match as
expected. 2146Entities.escape(string) to make the escaped text suitable for both text nodes and attributes (previously was
only for text nodes). This does not impact the output of Element.html() which correctly applies a minimal escape
depending on if the use will be for text data or in a quoted
attribute. 1278<base href> URL, in the normalizing regex.
2165Element.attribute(String) and Attributes.attribute(String) to more simply
obtain an Attribute object. 2069Attribute.setKey(String)), the source range is now still tracked
in Attribute.sourceRange(). 2070[*] element with any attribute selector. And also restored
support for selecting by an empty attribute name prefix ([^]). 2079parent [attr=va], other, the , OR was binding
to [attr=va] instead of parent [attr=va], causing incorrect selections. The fix includes an EvaluatorDebug class
that generates a sexpr to represent the query, allowing simpler and more thorough query parse
tests. 2073
:has evaluator held a non-thread-safe Iterator, and so if an Evaluator object was
shared across multiple concurrent threads, a NoSuchElementException may be thrown, and the selected results may be
incorrect. Now, the iterator object is a thread-local. 2088

Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in change-archive.txt.