docs/en/sql-reference/sql-functions/hive_bitmap_udf.md
Hive Bitmap UDF provides UDFs that can be directly used in Hive. They can be used to generate Bitmap data and perform Bitmap-related calculations.
The Bitmap format defined by Hive Bitmap UDF is consistent with the format in StarRocks and can be directly used for loading Bitmap data into StarRocks and unloading Bitmap data from StarRocks to Hive.
Applicable scenarios:
Supported source and target data types:
com.starrocks.hive.udf.UDAFBitmapAgg
Combines multiple rows of non-null values in a column into one row of Bitmap values, which is equivalent to StarRocks' built-in aggregate function bitmap_agg.
com.starrocks.hive.udf.UDAFBitmapUnion
Calculates the union of a set of bitmaps, which is equivalent to StarRocks' built-in aggregate function bitmap_union.
com.starrocks.hive.udf.UDFBase64ToBitmap
Converts a base64-encoded string into a bitmap, which is equivalent to StarRocks' built-in function base64_to_bitmap.
com.starrocks.hive.udf.UDFBitmapAnd
Calculates the intersection of two bitmaps, which is equivalent to StarRocks' built-in function bitmap_and.
com.starrocks.hive.udf.UDFBitmapCount
Counts the number of values in the bitmap, which is equivalent to StarRocks' built-in function bitmap_count.
com.starrocks.hive.udf.UDFBitmapFromString
Converts a comma-separated string to a bitmap, equivalent to StarRocks' built-in function bitmap_from_string.
com.starrocks.hive.udf.UDFBitmapOr
Calculates the union of two bitmaps, equivalent to StarRocks' built-in function bitmap_or.
com.starrocks.hive.udf.UDFBitmapToBase64
Converts Bitmap to Base64 string, equivalent to StarRocks' built-in function bitmap_to_base64.
com.starrocks.hive.udf.UDFBitmapToString
Converts a bitmap to a comma-separated string, equivalent to StarRocks' built-in function bitmap_to_string.
com.starrocks.hive.udf.UDFBitmapXor
Calculates the set of unique elements in two Bitmaps, which is equivalent to StarRocks' built-in function bitmap_xor.
Compile and generate Hive UDF on the FE.
./build.sh --hive-udf
A JAR package hive-udf-*.jar will be generated in the fe/hive-udf/ directory.
Upload the JAR package to HDFS.
hadoop fs -put -f ./hive-udf-*.jar hdfs://<hdfs_ip>:<hdfs_port>/hive-udf-*.jar
Load the JAR package to Hive.
hive> add jar hdfs://<hdfs_ip>:<hdfs_port>/hive-udf-*.jar;
Load UDF functions.
hive> create temporary function bitmap_agg as 'com.starrocks.hive.udf.UDAFBitmapAgg';
hive> create temporary function bitmap_union as 'com.starrocks.hive.udf.UDAFBitmapUnion';
hive> create temporary function base64_to_bitmap as 'com.starrocks.hive.udf.UDFBase64ToBitmap';
hive> create temporary function bitmap_and as 'com.starrocks.hive.udf.UDFBitmapAnd';
hive> create temporary function bitmap_count as 'com.starrocks.hive.udf.UDFBitmapCount';
hive> create temporary function bitmap_from_string as 'com.starrocks.hive.udf.UDFBitmapFromString';
hive> create temporary function bitmap_or as 'com.starrocks.hive.udf.UDFBitmapOr';
hive> create temporary function bitmap_to_base64 as 'com.starrocks.hive.udf.UDFBitmapToBase64';
hive> create temporary function bitmap_to_string as 'com.starrocks.hive.udf.UDFBitmapToString';
hive> create temporary function bitmap_xor as 'com.starrocks.hive.udf.UDFBitmapXor';
Create a Hive source table.
hive> create table t_src(c1 bigint, c2 bigint) stored as parquet;
hive> insert into t_src values (1,1), (1,2), (1,3), (2,4), (2,5);
hive> select * from t_src;
1 1
1 2
1 3
2 4
2 5
Create a Hive bitmap table.
hive> create table t_bitmap(c1 bigint, c2 binary) stored as parquet;
Hive generates bitmap through UDFBitmapAgg and writes it into the bitmap table.
hive> insert into t_bitmap select c1, bitmap_agg(c2) from t_src group by c1;
Create a StarRocks Bitmap table.
mysql> create table t1(c1 int, c2 bitmap bitmap_union) aggregate key(c1) distributed by hash(c1);
Load Bitmap data into StarRocks in different ways.
mysql> insert into t1 select c1, bitmap_from_binary(c2) from files (
"path" = "hdfs://<hdfs_ip>:<hdfs_port>/<hdfs_db>/t_bitmap/*",
"format"="parquet",
"compression" = "uncompressed"
);
mysql> insert into t1 select c1, bitmap_from_binary(c2) from hive_catalog_hms.xxx_db.t_bitmap;
View the results.
mysql> select c1, bitmap_to_string(c2) from t1;
+------+----------------------+
| c1 | bitmap_to_string(c2) |
+------+----------------------+
| 1 | 1,2,3 |
| 2 | 4,5 |
+------+----------------------+
Create a Bitmap table in StarRocks and write data into this table.
mysql> create table t1(c1 int, c2 bitmap bitmap_union) aggregate key(c1) buckets 3 distributed by hash(c1);
mysql> select c1, bitmap_to_string(c2) from t1;
+------+----------------------+
| c1 | bitmap_to_string(c2) |
+------+----------------------+
| 1 | 1,2,3 |
| 2 | 4,5 |
+------+----------------------+
Create a Bitmap table in Hive.
hive> create table t_bitmap(c1 bigint, c2 binary) stored as parquet;
Export data in different ways.
mysql> insert into files (
"path" = "hdfs://<hdfs_ip>:<hdfs_port>/<hdfs_db>/t_bitmap/",
"format"="parquet",
"compression" = "uncompressed"
) select c1, bitmap_to_binary(c2) as c2 from t1;
mysql> insert into hive_catalog_hms.<hdfs_db>.t_bitmap select c1, bitmap_to_binary(c2) from t1;
View results in Hive.
hive> select c1, bitmap_to_string(c2) from t_bitmap;
1 1,2,3
2 4,5