third_party/blink/renderer/platform/wtf/text/README.md
Everything you always wanted to know but were afraid to ask
This document covers the String type in Blink, often written with an
explicit namespace as blink::String to disambiguate from string
concepts or other types. It also briefly covers associated classes
used for constructing strings (StringBuilder, StringBuffer), the
internal StringImpl class, and the special AtomicString variant.
It does not cover other text-related types or utilities (e.g.
encodings, views, line endings, etc).
A blink::String represents a sequence of zero or more Unicode code
points. A String can also represent one of two zero-length strings:
the empty string and the null string. These correspond to "" and
null in JavaScript, respectively. Both the empty and the null string
return true from String::IsEmpty() but only the null string returns
true from String::IsNull().
Unlike std::string, Blink’s String object is a pointer to a
reference counted character buffer. This design makes it easier to
share the underlying character buffer between different consumers
because multiple consumers can reference the same underlying buffer.
Nevertheless, since String is an atomically ref-counted pointer, it
should be passed as function argument by const reference and not by
value where possible.
A String can represent Unicode code points with either LChars or
UChars, which use 8 bits and 16 bits per code unit respectively.
Each LChar represents a single code unit of ISO-8859-1, more
commonly called Latin-1 (hence the L in LChar). Unlike UTF-8, this
encoding cannot represent every Unicode code point. However, also
unlike UTF-8, every representable Unicode code point can represented
with a single code unit and the code unit is simply the 8 least
significant bits of the code point. This property makes Latin-1 an
attractive encoding because we can decode a Latin-1 code unit to a
Unicode code point simply by zero-extending the LChar value.
Each UChar represents a single UTF-16 code unit (hence the U in
UChar). Unfortunately, Strings do not always contain valid UTF-16
sequences. Strings that have round-tripped through JavaScript can
contain invalid UTF-16 sequences because JavaScript isn’t required to
pair surrogates in its strings. Most code that works with Strings can
ignore this issue because they operate code-unit-by-code-unit, but
subsystems need to operate on code points outside the Basic
Multilingual Plane need to be prepared to handle unpaired surrogates.
In addition to LChar and UChar, Strings also use the type
UChar32, which is a UTF-32 code unit. UChar32 is particularly easy
to work with because every UTF-32 code unit has the same numerical
value as its corresponding Unicode code point. Practically speaking,
that means you can treat UChar32 values as if they were Unicode code
points.
The String object itself is simply a pointer to a StringImpl
object, which contains the actual character buffer. String uses a
scoped_refptr<> to automatically AddRef() and Release() the
StringImpl object on construction and destruction. The StringImpl
pointer can be zero, in which case the String represents the null
string. Typically, String objects are allocated on the stack or as
members variables. StringImpl objects are always allocated in the
heap.
StringImpl objects are (logically) immutable, but a given String
object can refer to different StringImpl objects over time. For
example, String::Replace() works by creating a new StringImpl
object with the replaced characters rather than by mutating the
original StringImpl object.
Rather than using a fixed sizeof(StringImpl) allocation size,
StringImpl object are allocated a variable amount of memory and
store their character data in the same memory allocation as the
StringImpl object itself. In a sense, the StringImpl object is a
header for the actual array of characters, be they LChars or
UChars.
The StringImpl header contains three 32 bit fields. The first is a
reference count, which is incremented and decremented by the
AddRef() and Release() functions. When the reference count reaches
zero the StringImpl object is deallocated, unless the StringImpl
is marked static (in which case the StringImpl object is never
deallocated).
The next 32 bits represent the (potentially zero) length of the
string. The length field always represents the number of code units,
regardless of whether the StringImpl uses LChars or UChars. We
use a 32 bit length on both 32-bit and 64-bit systems. To avoid
creating a string whose length is too long to represent in 32 bits, we
RELEASE_ASSERT that the length doesn’t overflow, which means we’ll
crash in a controlled way if you try to create a string that’s
absurdly long.
The final 32 bits are used to cache the string’s hash value and to
store a number of Boolean flags. We use 24 bits for the hash and
reserve 8 bits for flags. As of April 2019, the flags represent
whether the StringImpl:
AtomicString table (discussed below).LChars or UChars.We haven’t evaluated the performance trade-offs of caching the hash
value in the StringImpl object recently. It’s possible that the
cache hit rate is sufficiently low that we should remove the final 32
bit field and move the two flags into the length, reducing the length
field to 30 bits. Small changes to StringImpl can have large effects
on the overall system, which means we should measure the performance
impact of this sort of change carefully.
There are multiple interfaces for constructing strings, each of which is useful in different situations.
String constructorThe most straightforward interface for constructing strings is the
String constructor. You should use this interface when the source
data for the String is an array of LChars or UChars. The
String constructor allocates a new StringImpl object and copies
the source characters into the StringImpl object as efficiently as
possible.
The String constructor that takes a UChar array always creates a
UChar-based StringImpl object even if all the source characters
would actually fit in LChars. If your source data is an array of
UChars but you have reason to believe that the string will usually
be representable in Latin-1, you should consider using
StringImpl::Create8BitIfPossible(), which creates an LChar- or
UChar-based StringImpl object as appropriate at the cost of
checking whether all the source characters can be represented in
Latin-1.
We have multiple ways to create a new string by concatenating multiple
input strings. Use blink::StrCat() if the number of input strings is
fixed. Use StringBuilder otherwise.
StrCat()The StrCat() function is the most efficient way to concatenate strings.
We can use it if the number of input strings is known at compile time
because StrCat()'s argument is a std::initializer_list.
operator+The + operator on Blink string types is the simplest way to combine
smaller strings into larger strings. Using templates, operator+ builds
a tree of temporary objects that mirrors the tree of operator+
invocations. When the temporary objects are (implicitly) collapsed to
a String, we first compute the length of the final string and then
allocate a single StringImpl object of exactly the correct size.
After allocating the object, we copy all the characters into the
string. This approach means we copy the characters exactly once into
the correctly sized buffer, which is efficient.
Because of the complex template usages, code size generated for
operator+ is much larger than StrCat(), and its runtime speed is
slightly slower than StrCat().
StringBuilderIf you’re unable to use operator+ to build your string, for example
because you need to use a loop, you should use StringBuilder. Like
similar interfaces in other libraries, StringBuilder lets you build
a String by incrementally appending content. StringBuilder tries
to use 8-bit StringImpl objects whenever possible but will upconvert
its internal buffer to 16 bits if necessarily.
StringBuilder::Append() grows its buffer exponentially, which means
StringBuilder avoids the pathologically bad O(N^2) performance that
repeated appends/concatenation can cause. One way to further optimize
String construction when using StringBuilder is to call
StringBuilder::ReserveCapacity() with (an estimate of) the final
length of the String (in code units) before appending characters. If
you give StringBuilder an accurate estimate of the length of the
string, StringBuilder can pre-allocate the appropriate amount of
memory and avoid having to reallocate its buffer and copy your string.
StringBufferFinally, there are some cases where neither the String constructor
or StringBuilder work well. For example, sometimes rather than
having a source array of UChars from which to construct a String,
you might have a function that will write the UChars into a buffer
you provide. StringBuffer can help you in these cases by allocating
a character buffer and letting the function write into it.
Conceptually, a StringBuffer represents the underlying character
buffer from a String. However, unlike StringImpl objects,
StringBuffers are mutable. StringBuffers work well when you know
ahead of time exactly how large a buffer you need and whether you want
to use LChars or UChars. If you’re uncertain about the length of
the String you’re constructing, you probably should use
StringBuilder.
Some StringImpl objects are marked as Atomic, which means they’re
stored in a process-wide HashSet called the AtomicString table.
Rather than interacting directly with these anointed StringImpl
objects, we usually hold pointers to them via AtomicString objects
(rather than String objects). Using AtomicString rather than
String to point to an Atomic StringImpl object lets the compiler
generate (and skip!) the appropriate hash lookups in the
AtomicString table as well as use faster comparison operations with
other AtomicStrings.
Typically, constructing an AtomicString from a String object will
involve a hash lookup in the AtomicString table. If the string
represented by the String object is not present in the AtomicString
table, the StringImpl object from that String will be marked Atomic
and added to the table. If the represented string is already present in
the AtomicString table, the already-Atomic StringImpl object from
the table will be used to construct the AtomicString rather than the
StringImpl from the original String.
If you wish to construct an AtomicString from an array of LChars
or UChars, you should use the AtomicString constructor directly
rather than first constructing a String object. If the string is
already present in the AtomicString table, the AtomicString
constructor will grab a reference to the existing Atomic StringImpl
object rather than first allocating and populating a StringImpl
object as the String constructor would.
If two StringImpl objects are atomic, you can compare them for
equality by comparing their addresses rather than by comparing them
character-by-character. The reason this works is that we maintain the
invariant that no two Atomic StringImpl objects represent the same
string. Therefore, the two StringImpl objects represent the same
string if, and only if, they are actually the same StringImpl object.
We’ve overloaded operator== on AtomicString to let the compiler
generate these optimized comparisons automatically.
Because there are no duplicate Atomic StringImpl objects,
AtomicStrings are useful for coalescing duplicate strings
into a single StringImpl object, saving memory.
At startup, we create a number of Static StringImpl objects that
maintain the invariant that the least significant bit of their
reference count is always one, which means their reference count
never reaches zero and they are never deallocated. In addition to
preventing their deallocation, we also pre-populate the hash value
to ensure that Static StringImpl objects are otherwise immutable.
We first introduced these Static strings for the threaded HTML parser,
but we are gradually using them more widely in the codebase. There are
still some delicate interactions between Static strings and the
AtomicString table, but hopefully we’ll smooth over these rough
edges over time.
This document contains a brief introduction to Blink’s String class.
There are many details that are not included, but hopefully this
document has given you a good high-level understanding of Strings.
More details are available in the source, either in code or in
comments. Happy hacking!
Originally authored by Adam Barth (abarth), 5 August 2013.