Back to Claude Scientific Skills

DNAnexus Data Operations

scientific-skills/dnanexus-integration/references/data-operations.md

2.38.07.9 KB
Original Source

DNAnexus Data Operations

Overview

DNAnexus provides comprehensive data management capabilities for files, records, databases, and other data objects. All data operations can be performed via the Python SDK (dxpy) or command-line interface (dx).

Data Object Types

Files

Binary or text data stored on the platform.

Records

Structured data objects with arbitrary JSON details and metadata.

Databases

Structured database objects for relational data.

Applets and Apps

Executable programs (covered in app-development.md).

Workflows

Multi-step analysis pipelines.

Data Object Lifecycle

States

Open State: Data can be modified

  • Files: Contents can be written
  • Records: Details can be updated
  • Applets: Created in closed state by default

Closed State: Data becomes immutable

  • File contents are fixed
  • Metadata fields are locked (types, details, links, visibility)
  • Objects are ready for sharing and analysis

Transitions

Create (open) → Modify → Close (immutable)

Most objects start open and require explicit closure:

python
# Close a file
file_obj.close()

Some objects can be created and closed in one operation:

python
# Create closed record
record = dxpy.new_dxrecord(details={...}, close=True)

File Operations

Uploading Files

From local file:

python
import dxpy

# Upload a file
file_obj = dxpy.upload_local_file("data.txt", project="project-xxxx")
print(f"Uploaded: {file_obj.get_id()}")

With metadata:

python
file_obj = dxpy.upload_local_file(
    "data.txt",
    name="my_data",
    project="project-xxxx",
    folder="/results",
    properties={"sample": "sample1", "type": "raw"},
    tags=["experiment1", "batch2"]
)

Streaming upload:

python
# For large files or generated data
file_obj = dxpy.new_dxfile(project="project-xxxx", name="output.txt")
file_obj.write("Line 1\n")
file_obj.write("Line 2\n")
file_obj.close()

Downloading Files

To local file:

python
# Download by ID
dxpy.download_dxfile("file-xxxx", "local_output.txt")

# Download using handler
file_obj = dxpy.DXFile("file-xxxx")
dxpy.download_dxfile(file_obj.get_id(), "local_output.txt")

Read file contents:

python
file_obj = dxpy.DXFile("file-xxxx")
with file_obj.open_file() as f:
    contents = f.read()

Download to specific directory:

python
dxpy.download_dxfile("file-xxxx", "/path/to/directory/filename.txt")

File Metadata

Get file information:

python
file_obj = dxpy.DXFile("file-xxxx")
describe = file_obj.describe()

print(f"Name: {describe['name']}")
print(f"Size: {describe['size']} bytes")
print(f"State: {describe['state']}")
print(f"Created: {describe['created']}")

Update file metadata:

python
file_obj.set_properties({"experiment": "exp1", "version": "v2"})
file_obj.add_tags(["validated", "published"])
file_obj.rename("new_name.txt")

Record Operations

Records store structured metadata with arbitrary JSON.

Creating Records

python
# Create a record
record = dxpy.new_dxrecord(
    name="sample_metadata",
    types=["SampleMetadata"],
    details={
        "sample_id": "S001",
        "tissue": "blood",
        "age": 45,
        "conditions": ["diabetes"]
    },
    project="project-xxxx",
    close=True
)

Reading Records

python
record = dxpy.DXRecord("record-xxxx")
describe = record.describe()

# Access details
details = record.get_details()
sample_id = details["sample_id"]
tissue = details["tissue"]

Updating Records

python
# Record must be open to update
record = dxpy.DXRecord("record-xxxx")
details = record.get_details()
details["processed"] = True
record.set_details(details)
record.close()

Search and Discovery

Finding Data Objects

Search by name:

python
results = dxpy.find_data_objects(
    name="*.fastq",
    project="project-xxxx",
    folder="/raw_data"
)

for result in results:
    print(f"{result['describe']['name']}: {result['id']}")

Search by properties:

python
results = dxpy.find_data_objects(
    classname="file",
    properties={"sample": "sample1", "type": "processed"},
    project="project-xxxx"
)

Search by type:

python
# Find all records of specific type
results = dxpy.find_data_objects(
    classname="record",
    typename="SampleMetadata",
    project="project-xxxx"
)

Search with state filter:

python
# Find only closed files
results = dxpy.find_data_objects(
    classname="file",
    state="closed",
    project="project-xxxx"
)
python
# Search across all accessible projects
results = dxpy.find_data_objects(
    name="important_data.txt",
    describe=True  # Include full descriptions
)

Cloning and Copying

Clone Data Between Projects

python
# Clone file to another project
new_file = dxpy.DXFile("file-xxxx").clone(
    project="project-yyyy",
    folder="/imported_data"
)

Clone Multiple Objects

python
# Clone folder contents
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results"
)

for file in files:
    file_obj = dxpy.DXFile(file['id'])
    file_obj.clone(project="project-yyyy", folder="/backup")

Project Management

Creating Projects

python
# Create a new project
project = dxpy.api.project_new({
    "name": "My Analysis Project",
    "description": "RNA-seq analysis for experiment X"
})

project_id = project['id']

Project Permissions

python
# Invite user to project
dxpy.api.project_invite(
    project_id,
    {
        "invitee": "user-xxxx",
        "level": "CONTRIBUTE"  # VIEW, UPLOAD, CONTRIBUTE, ADMINISTER
    }
)

List Projects

python
# List accessible projects
projects = dxpy.find_projects(describe=True)

for proj in projects:
    desc = proj['describe']
    print(f"{desc['name']}: {proj['id']}")

Folder Operations

Creating Folders

python
# Create nested folders
dxpy.api.project_new_folder(
    "project-xxxx",
    {"folder": "/analysis/batch1/results", "parents": True}
)

Moving Objects

python
# Move file to different folder
file_obj = dxpy.DXFile("file-xxxx", project="project-xxxx")
file_obj.move("/new_location")

Removing Objects

python
# Remove file from project (not permanent deletion)
dxpy.api.project_remove_objects(
    "project-xxxx",
    {"objects": ["file-xxxx"]}
)

# Permanent deletion
file_obj = dxpy.DXFile("file-xxxx")
file_obj.remove()

Archival

Archive Data

Archived data is moved to cheaper long-term storage:

python
# Archive a file
dxpy.api.project_archive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)

Unarchive Data

python
# Unarchive when needed
dxpy.api.project_unarchive(
    "project-xxxx",
    {"files": ["file-xxxx"]}
)

Batch Operations

Upload Multiple Files

python
import os

# Upload all files in directory
for filename in os.listdir("./data"):
    filepath = os.path.join("./data", filename)
    if os.path.isfile(filepath):
        dxpy.upload_local_file(
            filepath,
            project="project-xxxx",
            folder="/batch_upload"
        )

Download Multiple Files

python
# Download all files from folder
files = dxpy.find_data_objects(
    classname="file",
    project="project-xxxx",
    folder="/results"
)

for file in files:
    file_obj = dxpy.DXFile(file['id'])
    filename = file_obj.describe()['name']
    dxpy.download_dxfile(file['id'], f"./downloads/{filename}")

Best Practices

  1. Close Files: Always close files after writing to make them accessible
  2. Use Properties: Tag data with meaningful properties for easier discovery
  3. Organize Folders: Use logical folder structures
  4. Clean Up: Remove temporary or obsolete data
  5. Batch Operations: Group operations when processing many objects
  6. Error Handling: Check object states before operations
  7. Permissions: Verify project permissions before data operations
  8. Archive Old Data: Use archival for long-term storage cost savings