metadata-ingestion/docs/sources/presto/presto_post.md
Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.
Presto can connect to many different catalogs (Hive, PostgreSQL, MySQL, etc.). Use filtering to control what gets ingested:
source:
  type: presto
  config:
    host_port: presto.company.com:8080
    username: datahub_user
    # Only ingest specific catalogs
    database_pattern:
      allow:
        - "^hive$"
        - "^postgresql$"
      deny:
        - "system"
        - "information_schema"
source:
  type: presto
  config:
    host_port: presto.company.com:8080
    username: datahub_user
    database: hive  # Default catalog
    # Filter schemas within catalogs
    schema_pattern:
      allow:
        - "^production_.*"
        - "analytics"
      deny:
        - ".*_test$"
source:
  type: presto
  config:
    host_port: presto.company.com:8080
    username: datahub_user
    # Filter specific tables
    table_pattern:
      allow:
        - "^fact_.*"
        - "^dim_.*"
      deny:
        - ".*_tmp$"
        - ".*_staging$"
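The interaction between allow and deny lists can be easier to reason about with a small experiment. The Python sketch below is illustrative only: it assumes patterns are applied with `re.match` (anchored at the start of the string but not at the end), case-insensitively, and that a deny match overrides any allow match. The helper name `is_allowed` is hypothetical and is not part of DataHub's API. It also assumes patterns are tested against the bare table name; if the connector matches against a qualified `catalog.schema.table` name instead, patterns beginning with `^` would need adjusting.

```python
import re
from typing import Iterable


def is_allowed(name: str, allow: Iterable[str] = (".*",), deny: Iterable[str] = ()) -> bool:
    """Return True if name passes the deny list and matches an allow pattern.

    Assumption: re.match semantics (anchored at the start only) and
    case-insensitive matching; deny takes precedence over allow.
    """
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


# Patterns taken from the table_pattern example above.
allow = ["^fact_.*", "^dim_.*"]
deny = [".*_tmp$", ".*_staging$"]

for table in ["fact_sales", "dim_customer", "fact_sales_tmp", "orders"]:
    print(table, is_allowed(table, allow, deny))
```

Under these assumptions an unanchored pattern such as "analytics" also accepts "analytics_v2", and a deny entry of "system" also rejects "system_metrics". Adding `^` and `$` makes a pattern match exactly one name.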
When ingesting from multiple Presto clusters, use platform_instance:
source:
  type: presto
  config:
    host_port: prod-presto.company.com:8080
    platform_instance: "prod-presto"
This creates URNs in which the platform instance prefixes the dataset name (shown here with the default env, PROD):
urn:li:dataset:(urn:li:dataPlatform:presto,prod-presto.catalog.schema.table,PROD)
The Presto connector supports optional data profiling:
source:
  type: presto
  config:
    host_port: presto.company.com:8080
    username: datahub_user
    # Enable profiling
    profiling:
      enabled: true
      profile_table_level_only: false  # Include column-level stats
    # Limit profiling scope
    profile_pattern:
      allow:
        - "^production_.*"
Warning: Profiling can be expensive on large tables. Start with profile_table_level_only: true and expand as needed.
For Presto clusters with many catalogs and tables:
Catalog Filtering: Limit ingestion to specific catalogs:
database_pattern:
  allow:
    - "hive"
    - "postgresql"
Disable Profiling: Or restrict it to table-level statistics:
profiling:
  enabled: true
  profile_table_level_only: true
Stateful Ingestion: Only process changes:
stateful_ingestion:
  enabled: true
  remove_stale_metadata: true
information_schema tables
If you're currently using the deprecated presto-on-hive source:
Old Configuration:
source:
  type: presto-on-hive  # ← Deprecated
  config:
    host_port: metastore-db:3306
    # ...
New Configuration (Recommended):
source:
  type: hive-metastore  # ← Use this instead
  config:
    host_port: metastore-db:3306
    mode: presto  # ← Set mode to 'presto'
    emit_storage_lineage: true  # ← Now available!
    # ...
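For teams with many recipes, the rewrite above could be scripted. The following Python sketch shows one possible approach, assuming each recipe has already been loaded into a plain dictionary (for example with a YAML parser). The function name `migrate_recipe` and its `storage_lineage` parameter are hypothetical; only the keys and values shown in the two configurations above are used.

```python
import copy
from typing import Any, Dict


def migrate_recipe(recipe: Dict[str, Any], storage_lineage: bool = True) -> Dict[str, Any]:
    """Return a copy of recipe with presto-on-hive rewritten to hive-metastore.

    Recipes that use any other source type are returned unchanged.
    """
    migrated = copy.deepcopy(recipe)
    source = migrated.get("source", {})
    if source.get("type") != "presto-on-hive":
        return migrated  # nothing to migrate
    source["type"] = "hive-metastore"
    config = source.setdefault("config", {})
    config["mode"] = "presto"
    if storage_lineage:
        config["emit_storage_lineage"] = True
    return migrated


old = {"source": {"type": "presto-on-hive", "config": {"host_port": "metastore-db:3306"}}}
new = migrate_recipe(old)
print(new)
```

The function copies its input rather than editing it in place, so the original recipe remains available for comparison or rollback.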
Benefits of Migration:
| Feature | presto Connector | hive-metastore (mode: presto) |
|---|---|---|
| Connection | Direct to Presto | Direct to metastore database |
| Catalogs | All Presto catalogs | Only Hive-backed catalogs |
| Storage Lineage | Not supported | Supported |
| Column Lineage | Limited | Full support |
| View Parsing | Basic | Enhanced Presto view parsing |
| Performance | Good | Better (direct DB access) |
| Data Profiling | Supported | Not supported |
| Use Case | Multi-catalog Presto | Presto-on-Hive with lineage |
Choose the Right Connector:
presto for multi-catalog Presto deployments
hive-metastore (mode: presto) for Hive-backed tables with storage lineage
Filter Appropriately:
system, information_schema
Enable Stateful Ingestion:
Test First:
Monitor Presto Load:
Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.
Not Supported: The Presto connector cannot extract storage lineage because it doesn't have access to underlying storage locations.
Solution: Use the Hive Metastore connector with mode: presto to get storage lineage for Presto views backed by Hive.
Presto's catalog connectors (Hive, PostgreSQL, etc.) may have different metadata available. The connector extracts common metadata that works across all connectors.
Information Schema Latency: Presto's information_schema may have delays in reflecting recent DDL changes.
Large Result Sets: Catalogs with 10,000+ tables may be slow to ingest.
View Lineage Parsing: Complex Presto SQL with window functions, CTEs, or Presto-specific syntax may have incomplete lineage.
Connector-Specific Metadata: Some Presto connectors (e.g., Cassandra) have limited metadata available through information_schema.
Problem: Could not connect to Presto
Solutions:
Verify that host_port is correct and points to the Presto coordinator
curl http://<host>:<port>/v1/info
Problem: Authentication failed
Solutions:
klist)
/var/log/presto/
Problem: Not all catalogs/tables appear in DataHub
Solutions:
SHOW CATALOGS; in Presto
database_pattern
Problem: Metadata extraction takes too long
Solutions:
Problem: No lineage for Presto views
Solutions:
mode: presto
If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.
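The connectivity check can also be run from Python on the machine that executes the ingestion, which helps rule out network problems between that host and the coordinator. This is a minimal sketch using only the standard library. It assumes the coordinator serves `/v1/info` over plain HTTP without authentication, and the helper names `info_url` and `coordinator_reachable` are hypothetical.

```python
import json
import urllib.request


def info_url(host_port: str, scheme: str = "http") -> str:
    """Build the coordinator info URL from a host:port string."""
    return f"{scheme}://{host_port}/v1/info"


def coordinator_reachable(host_port: str, timeout: float = 5.0) -> bool:
    """Return True if GET /v1/info answers with HTTP 200 and a JSON body."""
    try:
        with urllib.request.urlopen(info_url(host_port), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (OSError, ValueError):
        # Covers connection errors, timeouts, HTTP errors and invalid JSON.
        return False
```

Calling `coordinator_reachable("presto.company.com:8080")` with the same value as `host_port` in the recipe returns False for refused connections, timeouts and non-JSON responses alike, so a False result only indicates that further investigation is needed.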