docs/en/data_source/catalog/elasticsearch_catalog.md
StarRocks supports Elasticsearch catalogs from v3.1 onwards.
StarRocks and Elasticsearch are both popular analytical systems with distinct strengths. StarRocks excels in large-scale distributed computing and supports querying data from Elasticsearch through external tables. Elasticsearch is known for its full-text search capabilities. The combination of StarRocks and Elasticsearch provides a more comprehensive OLAP solution. With Elasticsearch catalogs, you can directly analyze all indexed data in your Elasticsearch cluster by using SQL statements on StarRocks without the need for data migration.
Unlike catalogs for other data sources, an Elasticsearch catalog has only one database named default_db in it upon creation. Each Elasticsearch index is automatically mapped to a data table and mounted to the default_db database.
CREATE EXTERNAL CATALOG <catalog_name>
[COMMENT <comment>]
PROPERTIES ("key"="value", ...)
catalog_nameThe name of the Elasticsearch catalog. The naming conventions are as follows:
commentThe description of the Elasticsearch catalog. This parameter is optional.
The properties of the Elasticsearch catalog. The following table describes the properties supported for Elasticsearch catalogs.
| Parameter | Required | Default value | Description |
|---|---|---|---|
| hosts | Yes | None | The connection address of the Elasticsearch cluster. You can specify one or more addresses. StarRocks can parse the Elasticsearch version and index shard allocation from this address. StarRocks communicates with your Elasticsearch cluster based on the address returned by the GET /_nodes/http API operation. Therefore, the value of the hosts parameter must be the same as the address returned by the GET /_nodes/http API operation. Otherwise, BEs or CNs may not be able to communicate with your Elasticsearch cluster. |
| type | Yes | None | The type of the data source. Set this parameter to es when you create an Elasticsearch catalog. |
| user | No | Empty | The username that is used to log in to the Elasticsearch cluster with HTTP basic authentication enabled. Make sure that you have permissions to access paths such as /cluster/state/ nodes/http and have permissions to read the index. |
| password | No | Empty | The password that is used to log in to the Elasticsearch cluster. |
| es.type | No | _doc | The type of the index. If you want to query data in Elasticsearch 8 and later versions, you do not need to configure this parameter because the mapping types have been removed in Elasticsearch 8 and later versions. |
| es.nodes.wan.only | No | FALSE | Specifies whether StarRocks only uses the addresses specified by hosts to access the Elasticsearch cluster and fetch data.<ul><li>true: StarRocks only uses the addresses specified by hosts to access the Elasticsearch cluster and fetch data and does not sniff data nodes on which the shards of the Elasticsearch index reside. If StarRocks cannot access the addresses of the data nodes inside the Elasticsearch cluster, you need to set this parameter to true.</li><li>false: StarRocks uses the addresses specified by hosts to sniff data nodes on which the shards of the Elasticsearch cluster indexes reside. After StarRocks generates a query execution plan, BEs or CNs directly access the data nodes inside the Elasticsearch cluster to fetch data from the shards of indexes. If StarRocks can access the addresses of the data nodes inside the Elasticsearch cluster, we recommend that you retain the default value false.</li></ul> |
| es.net.ssl | No | FALSE | Specifies whether the HTTPS protocol can be used to access the Elasticsearch cluster. Only StarRocks v2.4 and later support configuring this parameter.<ul><li>true: Both the HTTPS and HTTP protocols can be used to access your Elasticsearch cluster.</li><li>false: Only the HTTP protocol can be used to access your Elasticsearch cluster.</li></ul> |
| enable_docvalue_scan | No | TRUE | Specifies whether to obtain the values of the target fields from Elasticsearch columnar storage. In most cases, reading data from columnar storage outperforms reading data from row storage. |
| enable_keyword_sniff | No | TRUE | Specifies whether to sniff TEXT-type fields in Elasticsearch based on KEYWORD-type fields. If this parameter is set to false, StarRocks performs matching after tokenization. |
The following example creates an Elasticsearch catalog named es_test:
CREATE EXTERNAL CATALOG es_test
COMMENT 'test123'
PROPERTIES
(
"type" = "es",
"es.type" = "_doc",
"hosts" = "https://xxx:9200",
"es.net.ssl" = "true",
"user" = "admin",
"password" = "xxx",
"es.nodes.wan.only" = "true"
);
StarRocks supports pushing the predicates specified in queries against Elasticsearch tables down to Elasticsearch for execution. This minimizes the distance between the query engine and the storage source and improves query performance. The following table lists the operators that can be pushed down to Elasticsearch.
| SQL syntax | Elasticsearch syntax |
|---|---|
= | term query |
in | terms query |
>=, <=, >, < | range |
and | bool.filter |
or | bool.should |
not | bool.must_not |
not in | bool.must_not + terms |
esquery | ES Query DSL |
The esquery() function can be used to push Elasticsearch queries, such as match and geoshape queries, that cannot be expressed in SQL down to Elasticsearch for filtering and processing. In the esquery() function, the first parameter that specifies a column name is used to associate with the index, and the second parameter is an Elasticsearch query's Elasticsearch Query DSL-based JSON representation enclosed in curly braces ({}). The JSON representation can and must have only one root key, such as match, geo_shape, or bool.
Match query
SELECT * FROM es_table WHERE esquery(k4, '{
"match": {
"k4": "StarRocks on elasticsearch"
}
}');
Geoshape query
SELECT * FROM es_table WHERE esquery(k4, '{
"geo_shape": {
"location": {
"shape": {
"type": "envelope",
"coordinates": [
[
13,
53
],
[
14,
52
]
]
},
"relation": "within"
}
}
}');
Boolean query
SELECT * FROM es_table WHERE esquery(k4, ' {
"bool": {
"must": [
{
"terms": {
"k1": [
11,
12
]
}
},
{
"terms": {
"k2": [
100
]
}
}
]
}
}');
count()-involved queries, run much slower on StarRocks than on Elasticsearch, because Elasticsearch can directly read the metadata related to the specified number of documents that meet the query conditions without the need to filter the requested data.