This documentation describes the integration of MindsDB with Databricks, the world's first data intelligence platform powered by generative AI. The integration allows MindsDB to access data stored in a Databricks workspace and enhance it with AI capabilities.
<Tip>
This data source integration is thread-safe, utilizing a connection pool where each thread is assigned its own connection. When handling requests in parallel, threads retrieve connections from the pool as needed.
</Tip>

Before proceeding, ensure that MindsDB is installed and that you have access to a Databricks workspace.

<Note>
To avoid any delays, ensure that the Databricks cluster is running before executing the queries.
</Note>
Establish a connection to your Databricks workspace from MindsDB by executing the following SQL command:
```sql
CREATE DATABASE databricks_datasource
WITH
    engine = 'databricks',
    parameters = {
        "server_hostname": "adb-1234567890123456.7.azuredatabricks.net",
        "http_path": "sql/protocolv1/o/1234567890123456/1234-567890-test123",
        "access_token": "dapi1234567890ab1cde2f3ab456c7d89efa",
        "schema": "example_db"
    };
```
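Where a non-default catalog or schema is needed, the optional parameters can be supplied in the same statement. The following variant is a sketch with placeholder values; `main` is a hypothetical catalog name, and the credentials are the same placeholders as above:

```sql
CREATE DATABASE databricks_datasource
WITH
    engine = 'databricks',
    parameters = {
        "server_hostname": "adb-1234567890123456.7.azuredatabricks.net",
        "http_path": "sql/protocolv1/o/1234567890123456/1234-567890-test123",
        "access_token": "dapi1234567890ab1cde2f3ab456c7d89efa",
        "catalog": "main",
        "schema": "example_db"
    };
```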
Required connection parameters include the following:
* `server_hostname`: The server hostname for the cluster or SQL warehouse.
* `http_path`: The HTTP path of the cluster or SQL warehouse.
* `access_token`: A Databricks personal access token for the workspace.

Optional connection parameters include the following:
* `session_configuration`: Additional (key, value) pairs to set as Spark session configuration parameters. This should be provided as a JSON string.
* `http_headers`: Additional (key, value) pairs to set in HTTP headers on every RPC request the client makes. This should be provided as `"http_headers": [['Header-1', 'value1'], ['Header-2', 'value2']]`.
* `catalog`: The catalog to use for the connection. Default is `hive_metastore`.
* `schema`: The schema (database) to use for the connection. Default is `default`.

Retrieve data from a specified table by providing the integration name, catalog, schema, and table name:
```sql
SELECT *
FROM databricks_datasource.catalog_name.schema_name.table_name
LIMIT 10;
```
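Standard SQL clauses such as `WHERE`, `GROUP BY`, and `ORDER BY` can be applied to connected tables in the same way. The table and column names below are hypothetical:

```sql
SELECT car_model, SUM(quantity) AS total_quantity
FROM databricks_datasource.catalog_name.schema_name.dealer
GROUP BY car_model
ORDER BY total_quantity DESC;
```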
Run Databricks SQL queries directly on the connected Databricks workspace:
```sql
SELECT * FROM databricks_datasource (

    --Native Query Goes Here
    SELECT
        city,
        car_model,
        RANK() OVER (PARTITION BY car_model ORDER BY quantity) AS rank
    FROM dealer
    QUALIFY rank = 1

);
```
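When the connection is no longer needed, it can be removed with MindsDB's standard `DROP DATABASE` statement:

```sql
DROP DATABASE databricks_datasource;
```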