The `datamol.io` module reads and writes molecular data in several formats: SDF, SMILES, CSV, Excel, MOL blocks, Mol2, and PDB. The examples assume `import datamol as dm`.
`dm.read_sdf(filename, sanitize=True, remove_hs=True, as_df=True, mol_column='mol', ...)`

Read Structure-Data File (SDF) format.

- `filename`: Path to SDF file (supports local and remote paths via fsspec)
- `sanitize`: Apply sanitization to molecules
- `remove_hs`: Remove explicit hydrogens
- `as_df`: Return as DataFrame (True) or list of molecules (False)
- `mol_column`: Name of molecule column in DataFrame
- `n_jobs`: Enable parallel processing

Example: `df = dm.read_sdf("compounds.sdf")`

`dm.read_smi(filename, smiles_column='smiles', mol_column='mol', as_df=True, ...)`

Read SMILES file (space-delimited by default).

Example: `df = dm.read_smi("molecules.smi")`

`dm.read_csv(filename, smiles_column='smiles', mol_column=None, ...)`

Read CSV file with optional automatic SMILES-to-molecule conversion.

- `smiles_column`: Column containing SMILES strings
- `mol_column`: If specified, creates molecule objects from the SMILES column

Example: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`

`dm.read_excel(filename, sheet_name=0, smiles_column='smiles', mol_column=None, ...)`

Read Excel files with molecule handling.

- `sheet_name`: Sheet to read (index or name)
- `smiles_column`, `mol_column`: Same behavior as in `read_csv`

Example: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`

`dm.read_molblock(molblock, sanitize=True, remove_hs=True)`

Parse MOL block string (molecular structure text representation).

`dm.read_mol2file(filename, sanitize=True, remove_hs=True, cleanupSubstructures=True)`

Read Mol2 format files.

`dm.read_pdbfile(filename, sanitize=True, remove_hs=True, proximityBonding=True)`

Read Protein Data Bank (PDB) format files.

`dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximityBonding=True)`

Parse PDB block string.

`dm.open_df(filename, ...)`

Universal DataFrame reader that automatically detects the format.

Examples: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`

`dm.to_sdf(mols, filename, mol_column=None, ...)`

Write molecules to SDF file.

- `mol_column`: Column name if input is a DataFrame

Examples:

- From a list of molecules: `dm.to_sdf(mols, "output.sdf")`
- From a DataFrame: `dm.to_sdf(df, "output.sdf", mol_column="mol")`

`dm.to_smi(mols, filename, mol_column=None, ...)`

Write molecules to SMILES file with optional validation.

`dm.to_xlsx(df, filename, mol_columns=None, ...)`

Export DataFrame to Excel with rendered molecular images.

- `mol_columns`: Columns containing molecules to render as images

Example: `dm.to_xlsx(df, "molecules.xlsx", mol_columns=["mol"])`

`dm.to_molblock(mol, ...)`

Convert molecule to MOL block string.

`dm.to_pdbblock(mol, ...)`

Convert molecule to PDB block string.

`dm.save_df(df, filename, ...)`

Save DataFrame in multiple formats (CSV, Excel, Parquet, JSON).
All I/O functions support remote file paths through fsspec integration:

- `dm.read_sdf("s3://bucket/compounds.sdf")`
- `dm.read_csv("https://example.com/data.csv")`

Parameters shared by many of these functions:

- `sanitize`: Apply molecule sanitization (default: True)
- `remove_hs`: Remove explicit hydrogens (default: True)
- `as_df`: Return DataFrame vs list (default: True for most functions)
- `n_jobs`: Number of parallel jobs (-1 = all cores; None, 0, or 1 = sequential)
- `mol_column`: Name of molecule column in DataFrames
- `smiles_column`: Name of SMILES column in DataFrames