documentation/sphinx/source/bulkload.rst
##############################
BulkLoad (Dev)
##############################

The BulkLoad feature works in conjunction with :doc:`BulkDump <bulkdump>` to provide a complete data migration solution.
BulkLoad takes the manifest files and SST files generated by :doc:`BulkDump <bulkdump>` and loads them efficiently into a target FoundationDB cluster.

To start a bulkload job, the user provides:

* The JobID of the :doc:`BulkDump <bulkdump>` job that produced the data
* The target key range to load
* A `blobstore URL <https://apple.github.io/foundationdb/backups.html#backup-urls>`_ containing the dump files

**Required Configuration**: BulkLoad requires the following server knobs to be enabled:

* ``--knob_shard_encode_location_metadata=1``: Enables shard-aware location metadata
* ``--knob_enable_read_lock_on_range=1``: Enables exclusive range locking during load operations
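
For an ``fdbmonitor``-managed cluster, the same knobs can be set in ``foundationdb.conf``. A minimal sketch, assuming the usual ``knob_<name> = <value>`` convention in the ``[fdbserver]`` section:

.. code-block:: ini

   [fdbserver]
   knob_shard_encode_location_metadata = 1
   knob_enable_read_lock_on_range = 1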

**Input File Structure**: BulkLoad expects the input files to be organized as produced by :doc:`BulkDump <bulkdump>`.

Currently, FDBCLI tools and low-level ManagementAPI functions are provided to submit or clear a job. These operations are achieved by issuing transactions that update the bulkload metadata and take exclusive locks on the target range. Submitting a job involves validating the input parameters, taking an exclusive read lock on the target range, and writing the job metadata. When submitting a job, the API checks whether there is any ongoing bulkload job or conflicting lock; if so, the job is rejected, otherwise it is accepted. Clearing a job releases the range lock and marks the job as cancelled in the metadata. The submit path is outlined in the sketch below.
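
This sketch shows the submit-time checks as a standard retry loop. It is an outline under assumptions, not the actual ManagementAPI internals: the helpers ``getRunningBulkLoadJob()``, ``bulkLoadJobKeyFor()``, and ``bulkLoadJobValue()`` are hypothetical, the accessor names on ``BulkLoadJobState`` are assumed, and the parameters of ``takeExclusiveReadLockOnRange()`` are assumed.

.. code-block:: cpp

   #include "fdbclient/BulkLoading.h"
   #include "fdbclient/NativeAPI.actor.h"
   #include "flow/actorcompiler.h" // must be the last include

   // Illustrative outline of job submission; helper names are hypothetical.
   ACTOR Future<Void> submitJobSketch(Database cx, BulkLoadJobState job) {
       state Transaction tr(cx);
       loop {
           try {
               tr.setOption(FDBTransactionOptions::ACCESS_SYSTEM_KEYS);
               // Reject if another BulkLoad/BulkDump job is already running.
               Optional<BulkLoadJobState> existing = wait(getRunningBulkLoadJob(&tr)); // hypothetical
               if (existing.present()) {
                   throw bulkload_task_failed();
               }
               // Take the exclusive read lock on the target range; this fails
               // with range_lock_reject if another owner already holds it.
               wait(takeExclusiveReadLockOnRange(&tr, job.getJobRange(), "BulkLoad")); // parameters assumed
               // Persist the job metadata under the \xff/bulkLoadJob/ prefix.
               tr.set(bulkLoadJobKeyFor(job.getJobId()), bulkLoadJobValue(job)); // hypothetical
               wait(tr.commit());
               return Void();
           } catch (Error& e) {
               wait(tr.onError(e));
           }
       }
   }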

FDBCLI provides interfaces to perform these operations.
For detailed usage examples and a quickstart guide, see :doc:`bulkload-user`.

ManagementAPI provides the following interfaces to perform these operations:

* ``submitBulkLoadJob(BulkLoadJobState jobState)``
* ``cancelBulkLoadJob(UID jobId)``
* ``setBulkLoadMode(int mode)``: Set ``mode = 1`` to enable; set ``mode = 0`` to disable
* ``getBulkLoadJobStatus(Database cx)``
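
A minimal usage sketch of these interfaces follows. Whether each call also takes a ``Database`` handle, and the exact arguments of ``createBulkLoadJob()``, are assumptions for illustration:

.. code-block:: cpp

   #include "fdbclient/BulkLoading.h"
   #include "fdbclient/ManagementAPI.actor.h"
   #include "flow/actorcompiler.h" // must be the last include

   // Sketch only: enable bulkload mode, build a job description from the
   // BulkDump output, and submit it. Argument lists are assumptions.
   ACTOR Future<Void> startBulkLoad(Database cx, UID dumpJobId, KeyRange range, std::string url) {
       wait(setBulkLoadMode(cx, 1)); // mode = 1 enables bulkload
       state BulkLoadJobState job = createBulkLoadJob(dumpJobId, range, url); // arguments assumed
       wait(submitBulkLoadJob(cx, job));
       return Void();
   }

A submitted job can later be stopped with ``cancelBulkLoadJob(jobId)``, and ``setBulkLoadMode(0)`` disables the mode again.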

**Job Lifecycle**:

1. The user calls ``createBulkLoadJob()`` and ``submitBulkLoadJob()``, specifying the source JobID, target range, and data location.
2. ``takeExclusiveReadLockOnRange()`` takes an exclusive read lock on the target range.
3. Job metadata is written (under the ``\xff/bulkLoadJob/`` prefix) and the task space is initialized.
4. ``bulkLoadJobManager()`` detects the new job and downloads the global ``job-manifest.txt`` file.
5. The manifest entries are partitioned into tasks (at most ``MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK`` per task); see the sketch after this list.
6. Task metadata is written (under the ``\xff/bulkLoadTask/`` prefix), which triggers data movement.
7. ``doBulkLoadTask()`` coordinates with the data movement system to load SST files into target shards.
8. Each task finishes in ``BulkLoadPhase::Complete`` or ``BulkLoadPhase::Error``.
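
Step 5 is plain chunking of the sorted manifest. A self-contained sketch of that policy, where the ``ManifestEntry`` type and function name are illustrative rather than the FDB implementation:

.. code-block:: cpp

   #include <algorithm>
   #include <string>
   #include <vector>

   // Illustrative stand-in for a job-manifest entry (one SST file plus its key range).
   struct ManifestEntry {
       std::string fileName;
       std::string beginKey;
       std::string endKey;
   };

   // Split the manifest into tasks of at most maxPerTask entries, mirroring
   // how MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK bounds the size of each task.
   std::vector<std::vector<ManifestEntry>> partitionIntoTasks(const std::vector<ManifestEntry>& manifest,
                                                              size_t maxPerTask) {
       std::vector<std::vector<ManifestEntry>> tasks;
       for (size_t i = 0; i < manifest.size(); i += maxPerTask) {
           size_t end = std::min(i + maxPerTask, manifest.size());
           tasks.emplace_back(manifest.begin() + i, manifest.begin() + end);
       }
       return tasks;
   }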

**Range Locking**: BulkLoad uses FoundationDB's range locking mechanism to ensure data consistency (see the sketch at the end of this section):

* ``registerRangeLockOwner()`` registers the BulkLoad system as a lock owner with the name "BulkLoad".
* ``takeExclusiveReadLockOnRange()`` takes an exclusive read lock on the entire job range during ``submitBulkLoadJob()``.
* ``releaseExclusiveReadLockOnRange()`` releases the lock when the job completes, is cancelled, or errors.
* Locks are stored in the ``\xff/rangeLock/`` keyspace, with owner information in ``\xff/rangeLockOwner/``.
* Transactions fail with ``range_lock_reject`` if the target range is already locked by another operation.

**Conflict Detection**: ``submitBulkLoadJob()`` checks for existing BulkLoad or BulkDump jobs and rejects the submission with ``bulkload_task_failed()`` if conflicts exist.

**Task Management**:

* Each task covers at most ``MANIFEST_COUNT_MAX_PER_BULKLOAD_TASK`` manifest entries.
* The data distributor uses a ``BulkLoadTaskCollection`` to coordinate with data movement, prevent shard boundary changes, and handle shard reassignments.
* ``getBulkLoadManifestMetadataFromEntry()`` reads the manifest metadata for each entry and raises ``bulkload_dataset_not_cover_required_range()`` if the source data does not cover the requested range.
* Tasks persist in ``\xff/bulkLoadTask/`` metadata and are automatically resumed.

**Error Handling**:

* ``cancelBulkLoadJob()`` clears all metadata and releases range locks immediately.
* Job submission fails with ``range_lock_reject`` if the target range is already locked.
* Failed tasks end in ``BulkLoadPhase::Error`` and can be acknowledged by users.

**Performance**: The ``DD_BULKLOAD_PARALLELISM`` knob controls DD-level parallelism.
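
To tie the locking calls together, here is a minimal sketch of the lock lifecycle around a job. The parameter lists of the lock helpers and the ``fdbclient/RangeLock.h`` header are assumptions; the error code name follows the standard ``error_code_<name>`` convention:

.. code-block:: cpp

   #include "fdbclient/NativeAPI.actor.h"
   #include "fdbclient/RangeLock.h" // assumed header for the range-lock helpers
   #include "flow/actorcompiler.h" // must be the last include

   // Sketch only: bracket a bulkload job with the exclusive range lock.
   // Parameter lists of the lock helpers are assumptions for illustration.
   ACTOR Future<Void> runUnderRangeLock(Database cx, KeyRange range) {
       // One-time registration of "BulkLoad" as a range-lock owner.
       wait(registerRangeLockOwner(cx, "BulkLoad", "bulk loading"));
       try {
           // Fails with range_lock_reject if another owner holds the range.
           wait(takeExclusiveReadLockOnRange(cx, range, "BulkLoad"));
           // ... tasks load SST files here; on restart they resume from the
           //     persisted \xff/bulkLoadTask/ metadata ...
           wait(releaseExclusiveReadLockOnRange(cx, range, "BulkLoad"));
       } catch (Error& e) {
           if (e.code() != error_code_range_lock_reject) {
               // The lock was taken; release it before surfacing the error.
               wait(releaseExclusiveReadLockOnRange(cx, range, "BulkLoad"));
           }
           throw;
       }
       return Void();
   }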