docs/connectors/gravitino.md
Apache Gravitino is an open-source data catalog that provides unified metadata management for various data sources and storage systems. With Gravitino, users can work with data assets such as tables (Iceberg, Hive, etc.) and filesets (collections of raw files stored on S3, GCS, Azure Blob Storage, etc.).
To use Daft with Gravitino, install Daft with the `gravitino` extra:

`pip install "daft[gravitino]"`
!!! warning "Warning"

    These APIs are in beta and may be subject to change as the Gravitino connector continues to be developed.
The connector provides:

- `catalog.get_table("...").read()` for reading catalog tables into DataFrames
- `gvfs://` URLs for seamless fileset access
- `Catalog.from_gravitino()` for connecting to a Gravitino metalake

=== "🐍 Python"
    ```python
    import daft
    from daft.catalog import Catalog

    catalog = Catalog.from_gravitino(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        username="admin",
    )

    # List all available tables
    tables = catalog.list_tables("my_catalog.my_schema")

    # Read a table (format detected automatically)
    df = catalog.get_table("my_catalog.my_schema.my_table").read()
    df.show()
    ```
`Catalog.from_gravitino` supports two authentication methods:
=== "🐍 Python"
    ```python
    from daft.catalog import Catalog

    # Simple auth with username only
    catalog = Catalog.from_gravitino(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        auth_type="simple",
        username="admin",
    )

    # OAuth2 auth with a bearer token
    catalog = Catalog.from_gravitino(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        auth_type="oauth2",
        token="my-bearer-token",
    )
    ```
Gravitino manages storage credentials through table and fileset properties. The client automatically extracts these properties and configures the corresponding storage credentials (for example, for S3, GCS, or Azure Blob Storage), so you do not need to supply them by hand.
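If you need to supply or override storage credentials yourself, Daft's `IOConfig` also accepts explicit per-store configs such as `S3Config` alongside `GravitinoConfig`. A minimal sketch, assuming an S3-backed fileset and that explicitly supplied credentials take precedence over those extracted from Gravitino properties (the precedence is an assumption, not documented behavior):

```python
from daft.io import IOConfig, GravitinoConfig, S3Config

# Combine Gravitino metadata access with explicit S3 credentials.
# Assumption: explicitly supplied credentials take precedence over
# any credentials extracted from Gravitino properties.
io_config = IOConfig(
    gravitino=GravitinoConfig(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        username="admin",
    ),
    s3=S3Config(
        key_id="my-access-key",      # hypothetical credentials
        access_key="my-secret-key",
        region_name="us-west-2",
    ),
)
# Pass io_config to any daft read/write call, as in the examples below.
```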
Daft supports reading and writing files directly in Gravitino filesets using the `gvfs://` protocol. This provides a unified interface for accessing files stored in various cloud storage systems through Gravitino's metadata management.
GVFS URLs follow this format:

`gvfs://fileset/<catalog>/<schema>/<fileset>/<path>`
Where:
- `<catalog>` - Name of the Gravitino catalog
- `<schema>` - Name of the schema within the catalog
- `<fileset>` - Name of the fileset
- `<path>` - Optional path to specific files within the fileset

=== "🐍 Python"
    ```python
    import daft
    from daft.io import IOConfig, GravitinoConfig

    # Build an IOConfig for GVFS fileset access
    io_config = IOConfig(
        gravitino=GravitinoConfig(
            endpoint="http://localhost:8090",
            metalake_name="my_metalake",
            username="admin",
        )
    )

    # Read Parquet files from a fileset
    df = daft.read_parquet(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/**/*.parquet",
        io_config=io_config,
    )

    # Read a specific file
    df = daft.read_parquet(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/data.parquet",
        io_config=io_config,
    )

    # Read CSV files
    df = daft.read_csv(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/*.csv",
        io_config=io_config,
    )

    # Use glob patterns for file discovery
    files_df = daft.from_glob_path(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/**/*.json",
        io_config=io_config,
    )
    ```
=== "🐍 Python"
    ```python
    import daft
    from daft.io import IOConfig, GravitinoConfig

    io_config = IOConfig(
        gravitino=GravitinoConfig(
            endpoint="http://localhost:8090",
            metalake_name="my_metalake",
            username="admin",
        )
    )

    # Create sample data
    df = daft.from_pydict({
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
    })

    # Write Parquet files to a fileset
    df.write_parquet(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/output.parquet",
        io_config=io_config,
    )

    # Write CSV files
    df.write_csv(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/output.csv",
        io_config=io_config,
    )

    # Write JSON files
    df.write_json(
        "gvfs://fileset/my_catalog/my_schema/my_fileset/output.json",
        io_config=io_config,
    )
    ```
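Daft's `write_parquet` also accepts `partition_cols` for partitioned output. A minimal sketch, assuming the fileset's backing store supports directory layouts (the `country` column is hypothetical):

```python
import daft
from daft.io import IOConfig, GravitinoConfig

io_config = IOConfig(
    gravitino=GravitinoConfig(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        username="admin",
    )
)

# Hypothetical data with a column to partition on
df = daft.from_pydict({
    "id": [1, 2, 3],
    "country": ["US", "DE", "US"],
})

# Partitioned write: one directory per country value,
# assuming the fileset's backing store supports directory layouts
df.write_parquet(
    "gvfs://fileset/my_catalog/my_schema/my_fileset/partitioned",
    partition_cols=["country"],
    io_config=io_config,
)
```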
`Catalog.from_gravitino` creates a Daft Catalog from a Gravitino metalake:
=== "🐍 Python"
    ```python
    from daft.catalog import Catalog

    catalog = Catalog.from_gravitino(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        username="admin",
    )

    catalog.list_tables("my_catalog.my_schema")
    ```
`Catalog.get_table` returns a Daft Table backed by the underlying Gravitino table:
=== "🐍 Python"
    ```python
    from daft.catalog import Catalog, Table

    catalog = Catalog.from_gravitino(
        endpoint="http://localhost:8090",
        metalake_name="my_metalake",
        username="admin",
    )

    table = catalog.get_table("my_catalog.my_schema.my_table")
    df = table.read()
    ```
This integration supports both legacy and current Gravitino API formats:

- Legacy: the fileset location is stored in `properties.location`
- Current: `storageLocations` with a configurable default

The client automatically detects and handles both formats for seamless compatibility.
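For illustration, resolution logic along these lines could handle both shapes. This is a sketch, not the connector's actual implementation; the dict shapes and the `default_location_name` parameter are assumptions based on the two formats described above:

```python
# A minimal sketch of resolving a fileset's storage location from
# either API shape. Dict shapes and "default_location_name" are
# illustrative assumptions, not the connector's real internals.
def resolve_storage_location(fileset: dict, default_location_name: str = "default") -> str:
    # Current format: a map of named storage locations
    locations = fileset.get("storageLocations")
    if locations:
        return locations.get(default_location_name) or next(iter(locations.values()))

    # Legacy format: a single location under properties.location
    location = fileset.get("properties", {}).get("location")
    if location:
        return location

    raise ValueError("Fileset has no resolvable storage location")


# Usage with both shapes:
legacy = {"properties": {"location": "s3://bucket/path"}}
current = {"storageLocations": {"default": "gs://bucket/path"}}
assert resolve_storage_location(legacy) == "s3://bucket/path"
assert resolve_storage_location(current) == "gs://bucket/path"
```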
Please open an issue on the Daft repository or the Gravitino repository if you have a use case that the Daft Gravitino connector does not currently cover!