docs/content/v2.20/manage/data-migration/bulk-import-ycql.md
Depending on the data volume imported, various bulk import tools can be used to load data into YugabyteDB. This page documents bulk import for YugabyteDB's Cassandra-compatible YCQL API.
You should first export data from existing Apache Cassandra and MySQL tables. Next, you can import the data using the various bulk load options supported by YugabyteDB.
The import process is illustrated as follows, using a generic IoT time series data use case as a running example.
Following is the schema of the destination YugabyteDB table:
```sql
CREATE KEYSPACE example;
USE example;

CREATE TABLE SensorData (
  customer_name text,
  device_id int,
  ts timestamp,
  sensor_data map<text, double>,
  PRIMARY KEY((customer_name, device_id), ts)
);
```
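To confirm that the schema behaves as expected, you can insert a single row manually and read it back before starting the bulk import. The literal values here are illustrative only:

```sql
ycqlsh> INSERT INTO example.SensorData (customer_name, device_id, ts, sensor_data)
          VALUES ('customer1', 1, '2017-11-11 12:32:01', {'temp' : 50.0, 'humidity' : 65.0});
ycqlsh> SELECT * FROM example.SensorData;
```

Note that map literals entered through ycqlsh use quoted keys (`{'temp' : 50.0}`), whereas the CSV import format described below omits the top-level quotes.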
Prepare a comma-separated values (CSV) file where each row of entries matches the column types declared in the table schema provided in Create destination table. Concretely, each CSV field must be a valid Cassandra Query Language (CQL) literal for its corresponding type, except for the top-level quotes (for example, use foo rather than 'foo' for strings).
If you do not have the data already available in a database table, you can create sample data for the import using the following example:
```sh
#!/bin/bash

# Usage: ./generate_data.sh <number_of_rows> <output_filename>
# Example: ./generate_data.sh 1000 sample.csv

if [ "$#" -ne 2 ]; then
  echo "Usage: ./generate_data.sh <number_of_rows> <output_filename>"
  echo "Example: ./generate_data.sh 1000 sample.csv"
  exit 1
fi

> "$2" # clear the output file

for i in $(seq 1 "$1"); do
  timestamp=$(date -u +"%Y-%m-%dT%H:%M:%S.000000+0000")
  echo "customer${i},${i},${timestamp},\"{temp:${i}, humidity:${i}}\"" >> "$2"
done
```
```
customer1,1,2017-11-11 12:32:1.000000+0000,"{temp:1, humidity:1}"
customer2,2,2017-11-11 12:32:2.000000+0000,"{temp:2, humidity:2}"
customer3,0,2017-11-11 12:32:3.000000+0000,"{temp:3, humidity:3}"
customer4,1,2017-11-11 12:32:4.000000+0000,"{temp:4, humidity:4}"
customer5,2,2017-11-11 12:32:5.000000+0000,"{temp:5, humidity:5}"
customer6,0,2017-11-11 12:32:6.000000+0000,"{temp:6, humidity:6}"
```
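Before importing, it can help to sanity-check the file: confirm the row count, and confirm that each row has the expected number of top-level fields (the quoted map counts as a single field, so commas inside it must not be counted). A minimal sketch, assuming the CSV layout shown above; the inline sample file stands in for your real export:

```shell
#!/bin/bash
# Sanity-check a CSV before import. A tiny sample is created inline for
# illustration; point "file" at your real export instead.
file=sample.csv
cat > "$file" <<'EOF'
customer1,1,2017-11-11 12:32:01.000000+0000,"{temp:1, humidity:1}"
customer2,2,2017-11-11 12:32:02.000000+0000,"{temp:2, humidity:2}"
EOF

echo "rows: $(wc -l < "$file")"

# Each row should have exactly 3 commas before the quoted map field begins;
# split on the double quote and count commas in the part preceding it.
bad=$(awk -F'"' '{ pre = $1; if (gsub(/,/, ",", pre) != 3) print NR }' "$file")
if [ -n "$bad" ]; then
  echo "malformed rows: $bad"
else
  echo "all rows OK"
fi
```

Rows that fail the check are reported by line number, which is usually easier to act on than a mid-import error.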
If you already have the data in an Apache Cassandra table, then use the following command to create a CSV file with that data:
```sql
ycqlsh> COPY example.SensorData TO '/path/to/sample.csv';
```
If you already have the data in a MySQL table named SensorData, then use the following command to create a CSV file with that data:
```sql
SELECT customer_name, device_id, ts, sensor_data
FROM SensorData
INTO OUTFILE '/path/to/sample.csv' FIELDS TERMINATED BY ',';
```
The import data instructions are organized by the size of the input datasets, ranging from small (megabytes of data) to larger datasets (gigabytes of data).
Cassandra's CQL shell provides the COPY FROM command, which allows importing data from CSV files:
```sql
ycqlsh> COPY example.SensorData FROM '/path/to/sample.csv';
```
By default, COPY exports timestamps in the yyyy-MM-dd HH:mm:ss.SSSZ format.
See also COPY TO.
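COPY FROM also accepts the options inherited from cqlsh, which can be useful for larger files. For example, HEADER skips a header row, DELIMITER changes the field separator, and CHUNKSIZE controls how many rows are handed to each worker at a time. A sketch using standard cqlsh option names; the values are illustrative:

```sql
ycqlsh> COPY example.SensorData FROM '/path/to/sample.csv'
          WITH DELIMITER = ',' AND HEADER = FALSE AND CHUNKSIZE = 1000;
```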
cassandra-loader is a general-purpose bulk loader for CQL that supports various types of delimited files (particularly CSV files). For details, review the README of the YugabyteDB cassandra-loader fork. Note that cassandra-loader requires quotes for collection types (for example, "[1,2,3]" rather than [1,2,3] for lists).
You can install cassandra-loader as follows:
```sh
wget https://github.com/yugabyte/cassandra-loader/releases/download/<latest-version>/cassandra-loader
chmod a+x cassandra-loader
```
Run cassandra-loader using the following command:
```sh
time ./cassandra-loader \
  -dateFormat 'yyyy-MM-dd HH:mm:ss.SSSSSSX' \
  -f sample.csv \
  -host <clusterNodeIP> \
  -schema "example.SensorData(customer_name, device_id, ts, sensor_data)"
```
For additional options, refer to cassandra-loader options.
To verify that the migration was successful, you can perform the following checks.
In YCQL, the count() query can be executed using the ycrc tool.
The tool uses the exposed partition_hash function in order to execute smaller, more manageable queries which are individually less resource intensive and tend to not time out.
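Conceptually, ycrc splits the partition_hash space (0 to 65535) into ranges, issues one count per range, and sums the per-range results. A single such sub-query looks like the following; the range bounds here are illustrative:

```sql
SELECT COUNT(*) FROM example.SensorData
  WHERE partition_hash(customer_name, device_id) >= 0
    AND partition_hash(customer_name, device_id) <= 1023;
```

Because each sub-query scans only a slice of the table, it completes well within the query timeout even on large tables.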
Set up and run the ycrc tool using the following steps:
Download the ycrc tool by compiling the source from the GitHub repository.
Run the following command to confirm that the ycrc tool is working:
```sh
./ycrc --help
```

```
YCQL Row Count (ycrc) parallelizes counting the number of rows in a table for YugabyteDB CQL, allowing count(*) on tables that otherwise would fail with query timeouts

Usage:
  ycrc <keyspace> [flags]

Flags:
  -d, --debug             Verbose logging
  -h, --help              help for ycrc
  -c, --hosts strings     Cluster to connect to (default [127.0.0.1])
  -p, --parallel int      Number of concurrent tasks (default 16)
      --password string   user password
  -s, --scale int         Scaling factor of tasks per table, an int between 1 and 10 (default 6)
      --sslca string      SSL root ca path
      --sslcert string    SSL cert path
      --sslkey string     SSL key path
      --tables strings    List of tables inside of the keyspace - default to all
  -t, --timeout int       Timeout of a single query, in ms (default 1500)
  -u, --user string       database user (default "cassandra")
      --verify            Strictly verify SSL host (off by default)
  -v, --version           version for ycrc
```
Run the ycrc tool to count the rows in a given keyspace using a command similar to the following:
```sh
./ycrc -c 127.0.0.1 example
```

```
Checking table row counts for keyspace: example
Checking row counts for: example.sensordata
Partitioning columns for example.sensordata:(customer_name,device_id)

Performing 4096 checks for example.sensordata with 16 parallel tasks
Total time: 261 ms

==========
Total Row Count example.sensordata = 60

Checking row counts for: example.emp
Partitioning columns for example.emp:(emp_id)

Performing 4096 checks for example.emp with 16 parallel tasks
Total time: 250 ms

==========
Total Row Count example.emp = 3
```
The following is an example with additional flags on the keyspace example:
```sh
./ycrc example \
  -c 127.0.0.1 \
  -u test \
  --verify \
  --password xxx \
  --tables sensordata \
  -s 7 \
  -p 32 \
  -t 3000 \
  --sslca /opt/yugabyte/certs/tests.crt
```

The `--sslca` flag needs to be specified only if client-to-node encryption is enabled; the remaining flags are described in the help output above.
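After the run, you can compare the reported totals against your source row counts with a quick script over the captured output. A sketch that parses log lines in the format shown above; the log file name, and the inline sample log standing in for a real capture, are assumptions:

```shell
#!/bin/bash
# Capture the run first, for example: ./ycrc -c 127.0.0.1 example | tee ycrc.log
# A sample log is created inline here to illustrate the parsing.
cat > ycrc.log <<'EOF'
Total Row Count example.sensordata = 60
Total Row Count example.emp = 3
EOF

# Pull out the count for one table and compare it with the expected value
# (for example, the line count of the source CSV).
count=$(awk '/Total Row Count example.sensordata/ { print $NF }' ycrc.log)
expected=60
if [ "$count" -eq "$expected" ]; then
  echo "sensordata: row count matches ($count)"
else
  echo "sensordata: expected $expected, got $count"
fi
```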