docs/advanced/graphql-shape-logging-runbook.md
GraphQL shape logging provides visibility into query and response complexity through structured logging and Micrometer metrics. This runbook covers interpretation, troubleshooting, and threshold tuning.
- graphql.shape.requests.total: Counter of GraphQL requests by shape
  - top_level_fields (bounded), operation_type (query/mutation/subscription)
  - top_level_fields combinations in production
- graphql.shape.field_count: Summary of total field selections per request
- graphql.shape.max_depth: Summary of maximum nesting depth per request
Structured logs are emitted to the com.datahub.graphql.shape logger (which inherits the ASYNC_GRAPHQL_DEBUG_FILE appender) when thresholds are crossed:
{
  "operation": "searchDatasets",
  "queryShape": "{search(input:...) {results {entity {... on Dataset {name ...}}}}}",
  "queryShapeHash": "a1b2c3d4",
  "fieldCount": 125,
  "maxDepth": 8,
  "durationMs": 3500,
  "responseFieldCount": 850,
  "maxArraySize": 100,
  "responseShape": "{results[100] {entity {name_ urn_ ...}}}",
  "errorCount": 0,
  "thresholdsCrossed": ["field_count", "duration"],
  "timestamp": "2026-04-01T12:34:56.789Z"
}
Key fields:
Configure via environment variables or application.yaml:
graphQL:
  shapeLogging:
    enabled: true
    fieldCountThreshold: 100              # Query field selections
    durationThresholdMs: 3000             # Request latency
    responseSizeThresholdBytes: 1048576   # ~20K fields × 50 bytes/field
    errorCountThreshold: 1                # Any errors in response
| Threshold | Default | Dev Tuning | Prod Tuning | Rationale |
|---|---|---|---|---|
| fieldCountThreshold | 100 | 50 | 200+ | Detect complex queries; stricter in dev/test |
| durationThresholdMs | 3000 | 1000 | 5000+ | Alert on slow queries; prod tolerates higher latency |
| responseSizeThresholdBytes | 1048576 | 524288 | 2097152+ | 1 MB ≈ 20K fields at ~50 bytes/field; stricter in dev |
| errorCountThreshold | 1 | 1 | 1 | Always alert on errors; field errors affect query plans |
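To make the threshold semantics concrete, here is a minimal Java sketch of how the four thresholds might be evaluated for a single request. It is not the actual GraphQLTimingInstrumentation code: the class ShapeThresholdSketch and its methods are hypothetical, the >= comparison is an assumption (chosen because errorCountThreshold: 1 is described as firing on any error), and estimating response size as responseFieldCount × 50 bytes is inferred from the comment in the configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of threshold evaluation; names are illustrative only. */
final class ShapeThresholdSketch {
    // Defaults copied from the configuration shown in this runbook.
    static final int FIELD_COUNT_THRESHOLD = 100;
    static final long DURATION_THRESHOLD_MS = 3000;
    static final long RESPONSE_SIZE_THRESHOLD_BYTES = 1_048_576;
    static final int ERROR_COUNT_THRESHOLD = 1;
    // Order-of-magnitude estimate; real payloads vary.
    static final long BYTES_PER_FIELD_ESTIMATE = 50;

    /** Assumption: response size is estimated from the leaf-value count. */
    static long estimateResponseBytes(long responseFieldCount) {
        return responseFieldCount * BYTES_PER_FIELD_ESTIMATE;
    }

    /** Returns the labels of every threshold the request meets or exceeds. */
    static List<String> thresholdsCrossed(
            int fieldCount, long durationMs, long responseFieldCount, int errorCount) {
        List<String> crossed = new ArrayList<>();
        // Assumption: "crossed" means value >= configured limit.
        if (fieldCount >= FIELD_COUNT_THRESHOLD) {
            crossed.add("field_count");
        }
        if (durationMs >= DURATION_THRESHOLD_MS) {
            crossed.add("duration");
        }
        if (estimateResponseBytes(responseFieldCount) >= RESPONSE_SIZE_THRESHOLD_BYTES) {
            crossed.add("response_size");
        }
        if (errorCount >= ERROR_COUNT_THRESHOLD) {
            crossed.add("error_count");
        }
        return crossed;
    }
}
```

With the values from the sample log entry (fieldCount 125, durationMs 3500, responseFieldCount 850, errorCount 0), thresholdsCrossed returns field_count and duration: 850 × 50 = 42,500 estimated bytes is well under the 1,048,576-byte limit.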
For Development/Staging:
For Production:
Adaptive tuning:
- thresholdsCrossed distribution for 1 week
What it means: Query selects many fields across entities.
Examples:
- search selecting 100+ fields across result unions (Dataset, Dashboard, Chart, etc.)
Action:
What it means: Deeply nested selections; may indicate federation or complex relationships.
Examples:
- /dataset/{id}/owner/manager/team/organization (5+ levels)
Action:
What it means: Query execution is slow; may indicate resolver complexity or data volume.
Examples:
Action:
- responseFieldCount and maxArraySize
What it means: Response contains many values; memory/serialization overhead.
Examples:
Action:
- maxArraySize; if >100 elements, consider pagination
What it means: Query completed but with errors; field resolution failures.
Examples:
Action:
Likely cause: Thresholds set too high or logging disabled.
Debug:
- graphQL.shapeLogging.enabled = true in config
- echo $GRAPHQL_SHAPE_DURATION_THRESHOLD_MS
- GRAPHQL_SHAPE_DURATION_THRESHOLD_MS=1
- ASYNC_GRAPHQL_DEBUG_FILE in logback.xml
Symptom: Multiple distinct queries have the same queryShapeHash.
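Cause: CRC32 collision, which is expected at scale. By the birthday bound, a 32-bit hash has roughly an 11% chance of at least one collision at about 32K distinct shapes; at 1M distinct shapes, about 116 colliding pairs are expected.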
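As a sanity check on collision expectations, the birthday approximation for a 32-bit hash can be worked out directly. This sketch assumes hash values are uniformly distributed, which is only approximately true for CRC32 over real query shapes; the class and method names are hypothetical.

```java
/** Hypothetical helper for estimating hash collisions via the birthday approximation. */
final class HashCollisionSketch {

    /** Expected number of colliding pairs among n uniformly random hashes of the given width. */
    static double expectedCollidingPairs(long n, int bits) {
        double hashSpace = Math.pow(2.0, bits);
        // n choose 2 pairs, each colliding with probability 1 / hashSpace.
        return (double) n * (n - 1) / 2.0 / hashSpace;
    }

    /** Approximate probability that at least one pair collides. */
    static double probabilityOfAnyCollision(long n, int bits) {
        return 1.0 - Math.exp(-expectedCollidingPairs(n, bits));
    }

    public static void main(String[] args) {
        System.out.println("Expected colliding pairs at 1M shapes: "
                + expectedCollidingPairs(1_000_000L, 32));
        System.out.println("P(any collision) at 32K shapes: "
                + probabilityOfAnyCollision(32_000L, 32));
    }
}
```

For 32 bits, the expected number of colliding pairs is about 116 at 1M shapes, and the chance of any collision is already around 11% near 32K shapes. Widening the hash shrinks both figures rapidly, since each extra bit halves the pair collision probability.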
Action:
- QueryShapeAnalyzer.crc32Hex() → SHA-256
- (shapeHashVersion: "sha256")
Cause: BYTES_PER_FIELD_ESTIMATE = 50 is order-of-magnitude; actual varies.
Real ranges:
If estimates are consistently off by 2x:
- BYTES_PER_FIELD_ESTIMATE in GraphQLShapeConstants
Create alerts based on these conditions:
Alert 1: Field count spike
graphql.shape.field_count{quantile="0.99"} > 300 for 5m
→ P99 queries suddenly much larger; may indicate query regression
Alert 2: Response size spike
increase(graphql.shape.requests.total{thresholdsCrossed=~"response_size"}[5m]) > 10
→ Multiple requests crossing size threshold; may indicate data explosion
Alert 3: Error rate in shapes
increase(graphql.shape.requests.total{thresholdsCrossed=~"error_count"}[5m]) / increase(graphql.shape.requests.total[5m]) > 0.05
→ >5% of logged requests have errors; investigate resolver failures
Alert 4: Anomalous depth
graphql.shape.max_depth{quantile="0.95"} > 20 for 10m
→ P95 nesting depth unusually high; may indicate federation issues
Shape logging has minimal overhead:
Threshold selection affects volume:
- metadata-service/configuration/src/main/resources/application.yaml (graphQL.shapeLogging)
- metadata-io/src/main/java/com/linkedin/metadata/system_telemetry/
  - QueryShapeAnalyzer.java: Query shape extraction and hashing
  - ResponseShapeAnalyzer.java: Response shape analysis and sampling
  - GraphQLTimingInstrumentation.java: Metric emission and threshold evaluation
Q: Why CRC32 instead of SHA-256?
A: Performance (CRC32 is ~10x faster). Collisions are acceptable for metrics aggregation. Can migrate to SHA-256 at scale.
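For reference, both hashes are available in the JDK. The following is a hypothetical sketch, not the actual QueryShapeAnalyzer code, that hashes a shape string with CRC32 and with SHA-256. UTF-8 encoding of the shape string and all names in the block are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

/** Hypothetical sketch comparing a 32-bit CRC32 shape hash with a SHA-256 digest. */
final class ShapeHashSketch {

    /** Returns the CRC32 checksum of the shape as 8 lowercase hex characters. */
    static String crc32Hex(String shape) {
        CRC32 crc = new CRC32();
        crc.update(shape.getBytes(StandardCharsets.UTF_8));
        return String.format("%08x", crc.getValue());
    }

    /** Returns the SHA-256 digest of the shape as 64 lowercase hex characters. */
    static String sha256Hex(String shape) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(shape.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every compliant Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    public static void main(String[] args) {
        String shape = "{search(input:...) {results {entity {name}}}}";
        System.out.println("crc32:  " + crc32Hex(shape));
        System.out.println("sha256: " + sha256Hex(shape));
    }
}
```

CRC32 yields 8 hex characters (32 bits), the same length as the queryShapeHash in the sample log entry, while SHA-256 yields 64. A longer hash costs more storage per log line and metric tag, which is part of the trade-off alongside speed.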
Q: Can I disable shape logging for specific queries?
A: Not currently. Filter in Grafana or post-process logs if needed.
Q: What's the difference between field count and response field count?
A: fieldCount = selections in query (e.g., {a b c} = 3). responseFieldCount = leaf values in response (e.g., array of 100 with 3 fields each = 300).
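The response-side numbers can be illustrated with a small sketch over a parsed, JSON-like response made of nested Map and List values. It is an assumption that every scalar, including null, counts as one leaf value; the class and method names are hypothetical, and this is not the ResponseShapeAnalyzer implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of response shape counting over nested Map/List data. */
final class ResponseShapeSketch {

    /** Counts leaf values: anything that is neither a Map nor a List. */
    static int countLeafValues(Object node) {
        int total = 0;
        if (node instanceof Map) {
            for (Object value : ((Map<?, ?>) node).values()) {
                total += countLeafValues(value);
            }
            return total;
        }
        if (node instanceof List) {
            for (Object element : (List<?>) node) {
                total += countLeafValues(element);
            }
            return total;
        }
        return 1; // Assumption: scalars and nulls each count as one leaf.
    }

    /** Returns the size of the largest list found anywhere in the response. */
    static int maxArraySize(Object node) {
        int max = 0;
        if (node instanceof Map) {
            for (Object value : ((Map<?, ?>) node).values()) {
                max = Math.max(max, maxArraySize(value));
            }
        } else if (node instanceof List) {
            List<?> list = (List<?>) node;
            max = list.size();
            for (Object element : list) {
                max = Math.max(max, maxArraySize(element));
            }
        }
        return max;
    }

    /** Builds {results: [{a, b, c}, ...]} with the requested number of elements. */
    static Map<String, Object> sampleResponse(int elements) {
        List<Object> results = new ArrayList<>();
        for (int i = 0; i < elements; i++) {
            Map<String, Object> item = new LinkedHashMap<>();
            item.put("a", i);
            item.put("b", "value");
            item.put("c", true);
            results.add(item);
        }
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("results", results);
        return root;
    }
}
```

For sampleResponse(100), countLeafValues returns 300 and maxArraySize returns 100, matching the "array of 100 with 3 fields each" case in the answer above, even though the query that produced it selects only a handful of fields.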