Migrate and Validate Tables between Origin and Target Cassandra Clusters.
⚠️ Please note this job has been tested with spark version 3.4.1
- Get the latest image that includes all dependencies from DockerHub
- All migration tools (
cassandra-data-migrator
+dsbulk
+cqlsh
) would be available in the/assets/
folder of the container
- All migration tools (
- Download the latest jar file from the GitHub packages area here
- Install Java8 as spark binaries are compiled with it.
- Install Spark version 3.4.1 on a single VM (no cluster necessary) where you want to run this job. Spark can be installed by running the following: -
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.4.1-bin-hadoop3-scala2.13.tgz
⚠️ Note that Version 4 of the tool is not backward-compatible with .properties files created in previous versions, and that package names have changed.
cdm.properties
file needs to be configured as applicable for the environment. Parameter descriptions and defaults are described in the file. The file can have any name, it does not need to becdm.properties
.- A simplified sample properties file configuration can be found here as cdm.properties
- A complete sample properties file configuration can be found here as cdm-detailed.properties
- Place the properties file where it can be accessed while running the job via spark-submit.
- Run the below job using
spark-submit
command as shown below:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Note:
- Above command generates a log file
logfile_name_*.txt
to avoid log output on the console. - Update the memory options (driver & executor memory) based on your use-case
- To run the job in Data validation mode, use class option
--class com.datastax.cdm.job.DiffData
as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
- Validation job will report differences as “ERRORS” in the log file as shown below
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
- Please grep for all
ERROR
from the output log files to get the list of missing and mismatched records.- Note that it lists differences by primary-key values.
- The Validation job can also be run in an AutoCorrect mode. This mode can
- Add any missing records from origin to target
- Update any mismatched records between origin and target (makes target same as origin).
- Enable/disable this feature using one or both of the below setting in the config file
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
Note:
- The validation job will never delete records from target i.e. it only adds or updates data on target
- You can also use the tool to Migrate or Validate specific partition ranges by using a partition-file with the name
./<keyspacename>.<tablename>_partitions.csv
in the below format in the current folder as input
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540
Each line above represents a partition-range (min,max
). Alternatively, you can also pass the partition-file via command-line param as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.tokenRange.partitionFile="/<path-to-file>/<csv-input-filename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
This mode is specifically useful to processes a subset of partition-ranges that may have failed during a previous run.
Note: A file named
./<keyspacename>.<tablename>_partitions.csv
is auto generated by the Migration & Validation jobs in the above format containing any failed partition ranges. No file is created if there are no failed partitions. You can use this file as an input to process any failed partition in a following run.
- The tool can be used to identify large fields from a table that may break you cluster guardrails (e.g. AstraDB has a 10MB limit for a single large field)
--class com.datastax.cdm.job.GuardrailCheck
as shown below
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
- Auto-detects table schema (column names, types, keys, collections, UDTs, etc.)
- Including counter table Counter tables
- Preserve writetimes and TTLs
- Supports migration/validation of advanced DataTypes (Sets, Lists, Maps, UDTs)
- Filter records from
Origin
usingwritetimes
and/or CQL conditions and/or a list of token-ranges - Perform guardrail checks (identify large fields)
- Supports adding
constants
as new columns onTarget
- Supports expanding
Map
columns onOrigin
into multiple records onTarget
- Fully containerized (Docker and K8s friendly)
- SSL Support (including custom cipher algorithms)
- Migrate from any Cassandra
Origin
(Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™) to any CassandraTarget
(Apache Cassandra® / DataStax Enterprise™ / DataStax Astra DB™) - Supports migration/validation from and to Azure Cosmos Cassandra
- Validate migration accuracy and performance using a smaller randomized data-set
- Supports adding custom fixed
writetime
- Validation - Log partitions range level exceptions, use the exceptions file as input for rerun
- This tool does not migrate
ttl
&writetime
at the field-level (for optimization reasons). It instead finds the field with the highestttl
& the field with the highestwritetime
within anorigin
row and uses those values on the entiretarget
row.
- Clone this repo
- Move to the repo folder
cd cassandra-data-migrator
- Run the build
mvn clean package
(Needs Maven 3.9.x) - The fat jar (
cassandra-data-migrator-4.x.x.jar
) file should now be present in thetarget
folder
Checkout all our wonderful contributors here.