Gemini Perf Response

Excellent and incredibly detailed performance analysis request. Your breakdown of the pipeline, timing data, and specific questions are exactly what's needed to provide a high-quality analysis.

Based on the information and source code provided, here is a comprehensive performance optimization analysis.

Executive Summary: Key Bottlenecks

Your analysis is spot on. The three major performance bottlenecks, accounting for over 70% of the execution time, are:

Font Loading (24.3%): A massive 1-second upfront cost to scan and parse all system fonts, most ofwhich are never used. Your STARTUP_LATENCY.md document contains an excellent and well-thought-out plan to solve this, which is the correct approach.
Layout Loop (30.4%): This is disproportionately slow for a flat DOM. The root cause is performing expensive text shaping operations for every single syntax-highlighted <span> individually, instead of shaping larger, coalesced blocks of text.
DOM Creation (str_to_dom, 17.7%): The slowness comes from parsing thousands of individual inline style attributes. Generating CSS classes instead of inline styles will almost eliminate this bottleneck.

The overall strategy is to move from a generic, per-element processing model to a specialized, batch-oriented one that takes advantage of the highly regular structure of syntax-highlighted code.

1. Ranked List of Optimization Opportunities

Here is a ranked list of opportunities, from highest to lowest impact.

#	Opportunity	Phase Affected	Est. Impact	Effort	Description
1	Parallel File Processing	`main.rs` loop	High (2-4x speedup)	Low	Use Rayon to process each source file in parallel. This is the single biggest and easiest win.
2	Optimize HTML Generation	`str_to_dom`	High (90%+ reduction)	Medium	Switch from inline `style` attributes to CSS classes in the generated HTML. Reduces thousands of CSS parses to one.
3	Coalesce Text Runs in Layout	`layout loop`, `display list gen`	High (80%+ reduction)	Medium	Group adjacent `<span>`s with identical styles into a single "text run" to be shaped once, instead of shaping each span individually.
4	Asynchronous Font Loading	`Font pool build`	High (90%+ reduction)	High	Implement the excellent plan from your `STARTUP_LATENCY.md` to move font scanning to background threads and load fonts on-demand.
5	Skip Initial Reconciliation	`reconcile_and_invalidate`	Medium	On the first layout of a document, all nodes are new. Skip the diffing process entirely and just mark all nodes as dirty.
6	Short-circuit Font Resolution	`font resolution`	Low	For documents using a single font (like code), resolve the font chain once and apply it to all text nodes, skipping per-node lookups.
7	Fast Path for Monospace Shaping	`layout loop` (text shaping)	Low	For simple monospace fonts, use a faster character-to-glyph lookup instead of the full OpenType shaping engine.

2. Code-Level Suggestions & Analysis (Answering Your Questions)

Here are detailed answers to your questions with specific suggestions.

1. Why is layout so slow? (404ms for 12k nodes)

The algorithmic bottleneck is text shaping granularity.

Your HTML consists of thousands of small <span>s, each containing a word or a piece of punctuation. The azul-layout engine is treating each <span> as a separate inline element and running the expensive text shaping process (via text3 and allsorts) for each one individually.

This means for main.rs (12,481 nodes), you are performing roughly 12,000 separate text shaping operations. A browser engine would perform a text coalescence step: it would see adjacent text nodes with compatible styles, merge their text content into a single string, and shape that string once.

Code-Level Suggestion: Implement Text Run Coalescence

In your block formatting context (BFC) layout logic (azul-layout/src/solver3/fc.rs), before you iterate through the children of a block (like <pre>), add a pre-processing step:

Iterate through the children (LayoutNodes).
Create a new "text run" when you encounter a text node or an inline element containing only text.
If the next sibling has compatible styles (same font, size, color, etc.), append its text to the current run instead of creating a new one.
If you encounter a block-level element or an inline element with different styles, finalize the current text run.
Pass these larger, coalesced text runs to the text3 engine for shaping.

This changes the complexity from O(num_spans * cost_of_shaping) to O(num_lines * cost_of_shaping), which will be a dramatic improvement. This single change will also massively speed up Display List Generation.

2. Why is `str_to_dom` so slow? (250ms for 252KB)

The bottleneck is per-node CSS parsing.

As you suspected, the problem is not the XML parsing but the CSS parsing. Your html_generator.rs puts an inline style="..." attribute on every single <span>.

rust

// in html_generator.rs
html.push_str(&format!(
    r#"<span style="{}">{}</span>"#,
    css,
    html_escape(text)
));

The str_to_dom function in azul-core/src/xml.rs has to parse each of these thousands of style attributes individually. Each parse is a tiny, self-contained job that involves tokenizing and parsing a string like "color: #d73a49; font-weight: bold;". This is extremely inefficient.

Code-Level Suggestion: Use CSS Classes

Refactor html_generator.rs to use CSS classes.

Generate a <style> block: At the start of the HTML, create a single <style> block that defines classes for each syntax highlighting token type. syntect provides the style information; you just need to translate it to CSS classes.
rust
```
// In generate_html_header or similar
let theme_css = generate_css_classes_from_theme(theme);
let header = format!("<style>{}</style>", theme_css);
```

Assign classes to <span>s: Instead of inline styles, assign a class. You'll need a way to map a syntect::highlighting::Style to a unique class name. A HashMap<Style, String> can cache this mapping.

rust

// In generate_html_for_file
let mut style_to_class = HashMap::new();
let mut class_counter = 0;

// ... in the loop ...
let class_name = style_to_class.entry(style).or_insert_with(|| {
    class_counter += 1;
    format!("c{}", class_counter)
});

html.push_str(&format!(
    r#"<span class="{}">{}</span>"#,
    class_name,
    html_escape(text)
));

This change will reduce thousands of small parsing tasks into one larger, more efficient task for the CSS engine in azul-core, drastically speeding up str_to_dom.

3. Font Resolution Overhead

Your observation is correct. Iterating every node for a document where all text uses the same font is wasteful.

Code-Level Suggestion: Short-Circuit for Monospace

In printpdf/src/html/mod.rs inside xml_to_pdf_pages, before calling layout_document_paged_with_config, you can add a fast path.

rust

// In printpdf/src/html/mod.rs
// ... after styled_dom is created ...

let is_monospace_document = check_if_monospace(&styled_dom);

if is_monospace_document {
    // 1. Resolve the "monospace" font chain once.
    let monospace_chain = font_manager.fc_cache.resolve_font_chain(
        &["monospace".to_string()], /* ... */
    );
    // 2. Pre-populate the font_manager's chain cache with a single entry
    //    that all text nodes will use.
    font_manager.set_font_chain_cache(/* ... */);
    // 3. The layout engine will need a way to know to use this pre-resolved chain.
    //    This might require passing a flag or the pre-resolved chain itself
    //    down into the layout functions.
}

// Then, inside layout_document_paged_with_config, the call to
// collect_and_resolve_font_chains would be skipped if the flag is set.

This avoids iterating 12k nodes just to discover they all need the same font.

4. Reconcile Cost (41ms)

Yes, you can and should skip reconciliation on the initial layout.

Architectural Suggestion: Initial Layout Fast Path

The reconcile_and_invalidate function compares an old layout tree with a new one. On the very first frame, the "old tree" is non-existent. The function is spending 41ms discovering that all 12,481 nodes are new.

Create a separate code path for the initial layout:

In azul-layout/src/solver3/cache.rs, reconcile_and_invalidate should check if cache.tree.is_none().
If it is None, this is an initial layout. Instead of diffing, create the new LayoutTree and simply return a ReconciliationResult where intrinsic_dirty contains all node indices 0..num_nodes.

This completely bypasses the diffing logic and saves the 41ms.

5. Display List Generation (51ms + 52ms)

The primary cause of slowness here is the sheer number of items (19,441). Each <span> becomes one or more DisplayListItems.

The fix for this is the Text Run Coalescence described in point #1.

By coalescing text, you will reduce ~12,000 text-related items to perhaps a few hundred (one per line). This will make the generate_display_list function trivial and likely reduce its time by over 90%. No significant cloning or allocation issues are apparent from the code; the problem is the N, not the cost per item.

6. Architectural Improvements

Batch-process spans: Yes. This is the Text Run Coalescence from #1. It's the most critical optimization for the layout loop.
Fast path for text-only: Yes. For git2pdf, the content is highly predictable. You could create a specialized layout function layout_code_document that assumes a single block formatting context, no floats, no complex positioning, and performs text coalescence by default. This avoids the overhead of the generic CSS layout engine.
Incremental layout per-page: This is a memory optimization, not primarily a speed optimization for this scale. For your target of <1s, laying out the whole document at once is fine. For extremely large repositories (e.g., generating a 5000-page PDF of the Linux kernel), this would be essential to manage memory, but it adds significant complexity (handling floats or tables that cross page boundaries). Defer this until you have a proven need for it.
Different rendering approach: Yes. The fastest possible approach would be to bypass HTML entirely.
- Alternative Pipeline: Rust source -> syntect tokens -> PDF text operations.
- You could write a "direct-to-PDF" renderer. It would take the tokenized output from syntect, resolve the one font it needs (monospace), and directly generate printpdf Ops for placing text. This would be exceptionally fast as it skips XML parsing, DOM creation, CSS parsing, layout, and display list generation.
- Trade-off: You lose the flexibility of CSS. Want to add a fancy title page with flexbox? You can't. But for pure code output, this is the performance ceiling.

7. Cache Invalidation for Window Resize

Your current approach of redoing everything is simple but inefficient for an interactive GUI. The correct strategy involves invalidating only what's necessary.

Detect Change: On resize, the viewport size changes.
Identify Affected Nodes: Traverse the LayoutTree. Any node with a style that depends on the viewport size (width: 50%, height: 100vh, etc.) is dirty.
Propagate Dirtiness: Mark these nodes with DirtyFlag::Layout. This dirtiness should propagate up to their parents (as done in LayoutCache::mark_dirty).
Find Layout Roots: The highest-level dirty nodes become the new "layout roots".
Relayout: Run calculate_layout_for_subtree only for these layout roots. Clean subtrees will be skipped or simply repositioned.

The azul-layout engine seems to have the foundational pieces for this (DirtyFlag, a cache), but the logic to connect a viewport resize to a minimal set of dirty nodes needs to be implemented.

8. Memory Layout (`CssPropertyCache`)

Your observation is astute. The current CssPropertyCache structure is essentially a struct-of-arrays where each array is a Vec<BTreeMap>. This is memory-inefficient when many nodes share the same styles.

The fix for this is the move to CSS classes proposed in point #2.

By switching to classes, the architecture would change:

html_generator produces <span class="c1">...</span>.
str_to_dom attaches the class name to the NodeData.
The CSS engine parses the <style> block into a map: HashMap<ClassName, BTreeMap<CssPropertyType, CssProperty>>.
The CssPropertyCache would then store a Vec<ClassId> for each node, pointing to the shared property map.

This is the flyweight pattern. It drastically reduces memory by storing each unique set of styles only once. It also speeds up style resolution, as you only resolve the class once.

9. Text Shaping for Monospace

You are correct that for simple monospace fonts, the full shaping process is overkill.

Solution: Fast Path for Simple Monospace Fonts

In text3, when resolving a font, you can inspect its tables.

Check the post table for isFixedPitch.
Check if complex GSUB/GPOS tables are absent or minimal.
If it's a "simple" monospace font, flag it.
When shaping text with a simple monospace font, bypass allsorts. Instead, use a much faster loop:
- For each character, look up its glyph ID from the cmap.
- Look up its advance width from the hmtx table (it will be constant for most characters).
- Create the Glyph struct directly.

This avoids the overhead of the OpenType shaping state machine for fonts that don't need it.

10. Parallelism

This is a huge, low-hanging fruit.

Code-Level Suggestion: Use Rayon

Your file processing loop in git2pdf/src/main.rs is serial. Change it to be parallel.

rust

// In git2pdf/src/main.rs

// ... setup before the loop ...
let font_pool = printpdf::html::build_font_pool(/* ... */); // Crucially, this is shared

// Change this:
// for crate_info in crates_to_process { ... }

// To this:
use rayon::prelude::*;

crates_to_process
    .par_iter() // Use parallel iterator
    .for_each(|crate_info| {
        // The entire processing logic for one crate goes here.
        // It needs to be wrapped in a closure.
        // Make sure all shared data (syntax_set, theme_set, font_pool)
        // is wrapped in Arc so it can be shared safely. Your code already does this!

        // Note: The body of the loop might need to be extracted into a function
        // that returns a Result, and then you'd handle the results after the parallel loop.

        let result = (|| -> anyhow::Result<()> {
            // ... existing loop body ...
            Ok(())
        })();

        if let Err(e) = result {
            eprintln!("Error processing crate {}: {}", crate_info.name, e);
        }
    });

The SharedFontPool uses an Arc<Mutex<...>>, which will correctly serialize access to the font cache when multiple threads need to load fonts simultaneously. While there will be some lock contention during the font loading phase, the CPU-intensive layout and display list generation phases will run in parallel with minimal contention, providing a significant speedup on multi-core systems.

3. Architectural Refactoring Proposals

Short-Term (High ROI): Class-Based Styling + Text Coalescence
- Summary: Stop generating inline styles. Generate a single <style> block with classes. In the layout engine, merge adjacent elements with the same class into a single text run before shaping.
- Pros: Massive performance gain in str_to_dom, layout, and display list generation. Reduces memory. Architecturally cleaner.
- Cons: Requires coordinated changes in git2pdf (HTML generation) and azul-layout (layout logic).
Medium-Term: Specialized Code Layout Engine
- Summary: Create a "fast path" layout mode specifically for code. This mode would bypass generic flex/block layout, assume a single vertical flow, and perform text coalescence and simple line-wrapping.
- Pros: Faster than the generic engine for the git2pdf use case.
- Cons: Less flexible. Still requires the HTML -> DOM -> Style pipeline.
Long-Term (Maximum Performance): Direct-to-PDF Renderer
- Summary: Bypass HTML entirely. Create a new pipeline: syntect tokens -> PDF Ops. This would be a specialized renderer that understands syntect's output and writes printpdf operations directly.
- Pros: The absolute fastest approach. Skips multiple intermediate stages.
- Cons: High implementation effort. Loses all CSS flexibility. You are essentially writing a mini-typesetter.

4. Roadmap: 4.1s → <1s

Here is a concrete roadmap to achieve your performance target.

Phase 1: The Quick Wins (~4.1s → ~1.5s)

Parallelize File Processing: Convert the main loop in git2pdf to use rayon::par_iter(). This is the easiest and most impactful change for multi-file repositories.
- Estimated Time Reduction: 50-70% of total time on a multi-core machine.
Implement Font Loading Fix: Execute the plan in STARTUP_LATENCY.md. This removes the ~1s startup penalty, moving it to the background.
- Estimated Time Reduction: ~1s from the total time.

After this phase, the one-time font cost is gone, and file processing is parallel. The per-file time is still high, but the total wall-clock time will be much lower.

Phase 2: Fixing the Core Pipeline (~1.5s → ~0.5s)

Switch to CSS Classes: Modify html_generator.rs to generate a <style> block and use classes on <span>s.
- Target: str_to_dom time should drop from ~700ms total to <50ms.
Implement Text Run Coalescence: Modify the BFC layout logic in azul-layout to merge adjacent text nodes/spans with compatible styles before sending them to the text3 shaper.
- Target: layout loop time should drop from ~1250ms total to <200ms. This will also drastically reduce display list generation time.

After this phase, the core performance issues are resolved. The tool should now be well under your 1-second target for 5 files.

Phase 3: Fine-Tuning (<0.5s)

Implement Fast Paths: Add the short-circuits for initial reconciliation and font resolution. These are smaller gains but improve efficiency.
Monospace Shaping Fast Path: If profiling still shows significant time in allsorts, implement the simplified shaping path for simple monospace fonts.

By following this roadmap, you should be able to comfortably meet and exceed your performance goals. The combination of parallelism, class-based styling, and text coalescence will address the fundamental architectural bottlenecks in your current pipeline. Excellent work on the detailed problem description

Executive Summary: Key Bottlenecks

1. Ranked List of Optimization Opportunities

2. Code-Level Suggestions & Analysis (Answering Your Questions)

1. Why is layout so slow? (404ms for 12k nodes)

2. Why is str_to_dom so slow? (250ms for 252KB)

3. Font Resolution Overhead

4. Reconcile Cost (41ms)

5. Display List Generation (51ms + 52ms)

6. Architectural Improvements

7. Cache Invalidation for Window Resize

8. Memory Layout (CssPropertyCache)

9. Text Shaping for Monospace

10. Parallelism

3. Architectural Refactoring Proposals

4. Roadmap: 4.1s → <1s

2. Why is `str_to_dom` so slow? (250ms for 252KB)

8. Memory Layout (`CssPropertyCache`)