docs/connectors/unity_catalog.md
Unity Catalog is an open-source catalog developed by Databricks. With Unity Catalog, you can work with data assets such as tables (Parquet, CSV, Iceberg, Delta), volumes (for storing raw files), functions, and models.
To use Daft with Unity Catalog, install Daft with the `unity` extra:

```
pip install "daft[unity]"
```
!!! warning "Warning"

    These APIs are in beta and may be subject to change as Unity Catalog continues to be developed.
Daft includes an abstraction for the Unity Catalog. For more information, see also Unity Catalog Documentation.
You can authenticate either with a personal access token (PAT) or with OAuth credentials.
To use OAuth, create a Databricks service principal and generate an OAuth secret for it. See the Databricks docs for the exact steps: Databricks OAuth M2M docs.
=== "🐍 Python (PAT)"

    ```python
    from daft.unity_catalog import UnityCatalog

    unity = UnityCatalog(
        endpoint="https://<databricks_workspace_id>.cloud.databricks.com",
        # Authentication can be retrieved from your provider of Unity Catalog
        token="my-token",
    )

    # See all available catalogs
    print(unity.list_catalogs())

    # See available schemas in a given catalog
    print(unity.list_schemas("my_catalog_name"))

    # See available tables in a given schema
    print(unity.list_tables("my_catalog_name.my_schema_name"))
    ```
=== "🐍 Python (OAuth client credentials)"

    ```python
    from daft.catalog.__unity import UnityCatalogClient as UnityCatalog
    from daft.catalog.__unity import OAuth2Credentials

    unity = UnityCatalog(
        endpoint="https://<databricks_workspace_id>.cloud.databricks.com",
        oauth=OAuth2Credentials(
            client_id="your-client-id",
            client_secret="your-client-secret",
        ),
    )

    # See all available catalogs
    print(unity.list_catalogs())

    # See available schemas in a given catalog
    print(unity.list_schemas("my_catalog_name"))

    # See available tables in a given schema
    print(unity.list_tables("my_catalog_name.my_schema_name"))
    ```
Once authenticated, you can load a table from Unity Catalog and read it into a Daft DataFrame:

=== "🐍 Python"

    ```python
    import daft

    unity_table = unity.load_table("my_catalog_name.my_schema_name.my_table_name")
    df = daft.read_deltalake(unity_table)
    df.show()
    ```
Any subsequent filter operations on the `df` DataFrame will be correctly optimized to take advantage of Delta Lake features:
=== "🐍 Python"

    ```python
    # Filter which takes advantage of partition pruning capabilities of Delta Lake
    df = df.where(df["partition_key"] < 1000)
    df.show()
    ```
See also Delta Lake for more information about how to work with the Delta Lake tables provided by the Unity Catalog.
Daft supports downloading from Unity Catalog volumes using [Expression.download()][daft.expressions.expressions.download]. File paths that start with `vol+dbfs:/` or `dbfs:/` will be downloaded using the configurations in [IOConfig.unity][daft.daft.IOConfig.unity]. These configurations can be created using `UnityCatalog.to_io_config`, or automatically derived from the global session.
=== "🐍 Python"

    ```python
    df = daft.from_pydict({
        "files": [
            "/Volumes/my_catalog/my_schema_name/my_volume_name/file1.txt",
            "/Volumes/my_catalog/my_schema_name/my_volume_name/file2.txt",
        ]
    })

    # Explicitly specify the Unity Catalog configuration
    io_config = unity.to_io_config()
    data_df = df.select(("vol+dbfs:" + df["files"]).download(io_config=io_config))
    data_df.show()

    # Or use the global session
    from daft.catalog import Catalog
    import daft.session

    catalog = Catalog.from_unity(unity)
    daft.session.attach(catalog)

    data_df = df.select(("vol+dbfs:" + df["files"]).download())
    data_df.show()
    ```
Please open issues on the Daft repository if you have any use cases that Daft does not currently cover! For the overall Daft development plan, see the Daft Roadmap.