third_party/blink/renderer/modules/content_extraction/readme.md
Annotated Page Content (APC) is a structured and actionable representation of a webpage's content and layout. Its primary function is to enable a deep understanding of page structure, content, and interactive elements by downstream clients, who can receive the information as a protobuf tree.
APC is designed with the following principles in mind:
The foundation of APC is the AnnotatedPageContent protobuf message, which
organizes page content into a hierarchical tree.
ContentNodes

The representation is a tree of ContentNodes. These nodes can represent
layout containers on the page, grouping related information in a structure
derived from the layout tree. This includes:

- Sectioning elements (<article>, <nav>, <section>)

ContentAttributes

Each ContentNode contains attributes that describe the element in detail:
- Text (TextInfo): The text content, along with styling information like
  size, emphasis, and color.
- Images (ImageInfo): The image's alt text or caption, its URL, and
  security origin.
- Links (AnchorData): The destination URL and the link's rel attribute.
- Forms (FormInfo, FormControlData): Includes the form's name/ID and
  data for individual controls like field name, value, and type. Password field
  values are omitted unless the user has made them visible on the page.
- Interactivity (InteractionInfo): Describes the node's interactivity
  (e.g., clickable, editable, focusable).

The following elements are under consideration for future inclusion but are not currently part of the APC structure:
- Media elements (<audio>, <video>)
- Canvas (<canvas>) and SVG (<svg>)

APC is generated by traversing Blink's layout tree, not the DOM tree. This is a critical distinction because the layout tree only includes content that is actually rendered on the page.
The generation algorithm recursively traverses the layout tree, creating a
ContentNode for each rendered object with structured content or a significant
semantic role. It extracts relevant data and organizes the nodes into a
hierarchy that preserves the visual order of the page.
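
As a rough illustration of the traversal described above, the following Python sketch walks a toy layout tree and emits a node only for objects that carry content or a semantic role, keeping children in layout order. `LayoutObject`, `ContentNode`, and `build_content_tree` are hypothetical stand-ins, not Blink's classes or the real proto schema, and hoisting the descendants of a skipped object to the nearest emitted ancestor is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LayoutObject:
    """Hypothetical stand-in for a rendered layout object."""
    tag: str
    text: str = ""
    role: Optional[str] = None  # e.g. "heading"; None if no semantic role
    children: List["LayoutObject"] = field(default_factory=list)


@dataclass
class ContentNode:
    """Hypothetical stand-in for an APC ContentNode."""
    attribute_type: str
    text: str = ""
    children: List["ContentNode"] = field(default_factory=list)


def build_content_nodes(obj: LayoutObject) -> List[ContentNode]:
    """Return the nodes produced by obj's subtree, in layout order."""
    child_nodes: List[ContentNode] = []
    for child in obj.children:
        child_nodes.extend(build_content_nodes(child))

    if obj.role is not None or obj.text:
        # The object has a semantic role or content: emit a node and nest
        # whatever its descendants produced underneath it.
        return [ContentNode(obj.role or "text", obj.text, child_nodes)]

    # Assumption: objects without a role or content are skipped and their
    # descendants are hoisted to the nearest emitted ancestor.
    return child_nodes


def build_content_tree(root: LayoutObject) -> ContentNode:
    """Wrap the traversal result in a single root node."""
    return ContentNode("root", children=build_content_nodes(root))
```

Because the input is already a layout tree, the sketch needs no visibility check: content that is not rendered never appears in the input in the first place.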
On the browser side, the raw APC proto can be converted into various consumable formats, including:
- Formats with embedded identifiers ({#ID}) that link back to the original ContentNode.

A key goal of APC is to enable reliable interactions with webpages, even when they change dynamically.
To handle dynamic page changes, an algorithm robustly identifies the target element by matching key properties like its type, interactivity, and location. If needed, it can further verify the element by comparing its text content to ensure the correct action is taken.
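
One way to picture this kind of matching is the sketch below. It is not the actual algorithm: `NodeSnapshot`, `find_target`, the distance threshold, and the choice to consult text only when several candidates remain are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class NodeSnapshot:
    """Hypothetical summary of the properties used for matching."""
    attribute_type: str
    clickable: bool
    center: Tuple[float, float]
    text: str = ""


def find_target(
    recorded: NodeSnapshot,
    candidates: List[NodeSnapshot],
    max_distance: float = 50.0,  # assumed tolerance, in CSS pixels
) -> Optional[NodeSnapshot]:
    """Return the candidate that best matches `recorded`, or None."""

    def distance(node: NodeSnapshot) -> float:
        return math.hypot(node.center[0] - recorded.center[0],
                          node.center[1] - recorded.center[1])

    # Step 1: type and interactivity must match exactly.
    same_kind = [
        c for c in candidates
        if c.attribute_type == recorded.attribute_type
        and c.clickable == recorded.clickable
    ]
    # Step 2: keep candidates close to the recorded location, nearest first.
    nearby = sorted((c for c in same_kind if distance(c) <= max_distance),
                    key=distance)
    if not nearby:
        return None
    if len(nearby) == 1:
        return nearby[0]

    # Step 3: still ambiguous, so verify by text content. Give up rather
    # than guess if the text does not single out one candidate.
    by_text = [c for c in nearby if c.text.strip() == recorded.text.strip()]
    return by_text[0] if len(by_text) == 1 else None
```

Returning `None` when the candidates cannot be told apart reflects the goal stated above: it is safer to take no action than to act on the wrong element.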
APC now supports a selective node-id policy in AIPageContentOptions:
- node_id_allowlist: Optional list of AIPageContentAttributeType values
  that should emit dom_node_id.

The main goal is to avoid assigning DOM node ids to more nodes than needed. Over-emitting ids grows Blink's DOM node id hash map, which hurts overall renderer performance after extraction finishes.
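
The policy can be read as a small decision function, sketched here in Python ahead of the exact semantics listed next. The function name, the string attribute types, and the `is_required_override` flag are hypothetical; treating "emits ids broadly" as "always emit" is a simplifying assumption.

```python
from typing import Collection, Optional


def should_emit_dom_node_id(
    attribute_type: str,
    node_id_allowlist: Optional[Collection[str]],
    is_required_override: bool,
) -> bool:
    """Decide whether a node should carry a dom_node_id (illustrative only)."""
    if node_id_allowlist is None:
        # Unset allowlist: legacy behavior. Simplified here to "always emit".
        return True
    if is_required_override:
        # Set allowlist (even if empty): required cases such as actionable
        # targets or focus/selection/label-for references always get an id.
        return True
    # Non-empty allowlist: also emit for the listed attribute types. An
    # empty allowlist never matches, so only overrides get ids.
    return attribute_type in node_id_allowlist
```

Note that `None` (unset) and an empty list behave differently in this sketch: the former keeps ids everywhere, the latter restricts them to the required override cases.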
Semantics:
- If node_id_allowlist is unset, APC preserves legacy behavior and emits ids
  broadly.
- If node_id_allowlist is set (empty or non-empty), APC always emits ids for
  required override cases (for example actionable targets and metadata-linked
  nodes such as focus/selection/label-for references).
- If node_id_allowlist is non-empty, APC also emits ids for the listed
  attribute types.

Using APC requires careful attention to privacy and security. While APC provides data to help mitigate risks, feature owners bear ultimate responsibility.
- Values masked via CSS (-webkit-text-security) are removed from the APC representation to help
  prevent sensitive credential leakage.
- Pages can use structured data
  ([isAccessibleForFree=false](https://developers.google.com/search/docs/appearance/structured-data/paywalled-content))
  to flag paid content, and APC includes this signal.

To run the unit tests for content extraction, use the following command:
autoninja -C out/Default blink_unittests && out/Default/blink_unittests --gtest_filter=AIPageContentAgentTest.*
The web tests for content extraction are located in
third_party/blink/web_tests/content_extraction/.
To run the web tests:
third_party/blink/tools/run_web_tests.py -C out/Default content_extraction
To update the web test expectations:
third_party/blink/tools/run_web_tests.py -C out/Default content_extraction --reset-results
AIPageContentOuterBoxMapToAncestorSpace reuses the GeometryMapper mapping for
both the outer_bounding_box and visible_bounding_box. It is enabled by
default because it is a stable Blink runtime feature.
AIPageContentBuildOnLoadForTesting forces every local root frame to build APC
(in actionable mode) immediately after load. This mirrors the behaviour used by
the AnnotatedPageContentExtraction Finch trial.
For example, to launch content_shell with APC-on-load enabled:
out/Default/content_shell \
--enable-blink-features=AIPageContentBuildOnLoadForTesting
A dedicated virtual test suite (content-extraction) exercises existing
layout tests with APC-on-load enabled while reusing the default geometry path:
third_party/blink/tools/run_web_tests.py \
-C out/Default \
--virtual-test-suite content-extraction