scientific-skills/exploratory-data-analysis/references/chemistry_molecular_formats.md
This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.
Description: Standard format for 3D structures of biological macromolecules
Typical Data: Atomic coordinates, residue information, secondary structure, crystal structure data
Use Cases: Protein structure analysis, molecular visualization, docking studies
Python Libraries:
- Biopython: `Bio.PDB`
- MDAnalysis: `MDAnalysis.Universe('file.pdb')`
- PyMOL: `pymol.cmd.load('file.pdb')`
- ProDy: `prody.parsePDB('file.pdb')`
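For a quick first look at a PDB file without any of the libraries above, the fixed-width ATOM/HETATM records can be sliced directly. The following is a minimal sketch, not a replacement for a real parser: `make_atom_line`, `parse_pdb_atoms`, and the sample coordinates are hypothetical, and the column positions assume the standard PDB record layout.

```python
from collections import Counter


def make_atom_line(record, serial, name, resname, chain, resseq, x, y, z, element):
    """Build one fixed-width ATOM/HETATM line (hypothetical helper for this demo)."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )


# Illustrative records only; these are not taken from a real structure.
SAMPLE_PDB = "\n".join([
    make_atom_line("ATOM", 1, "N", "ALA", "A", 1, 11.104, 6.134, -6.504, "N"),
    make_atom_line("ATOM", 2, "CA", "ALA", "A", 1, 11.639, 6.071, -5.147, "C"),
    make_atom_line("ATOM", 3, "N", "GLY", "A", 2, 12.500, 7.200, -4.900, "N"),
    make_atom_line("HETATM", 4, "O", "HOH", "B", 101, 20.000, 10.000, 5.000, "O"),
    "END",
])


def parse_pdb_atoms(text):
    """Slice ATOM/HETATM records by column (0-based slices of the 1-based spec)."""
    atoms = []
    for line in text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "record": line[0:6].strip(),
                "name": line[12:16].strip(),
                "resname": line[17:20].strip(),
                "chain": line[21],
                "resseq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
                "element": line[76:78].strip(),
            })
    return atoms


atoms = parse_pdb_atoms(SAMPLE_PDB)
chain_counts = Counter(a["chain"] for a in atoms)
residues = {(a["chain"], a["resseq"], a["resname"]) for a in atoms}
print(chain_counts, len(residues))
```

Counting atoms per chain and distinct residues is a cheap sanity check before handing the file to Biopython or MDAnalysis for anything structural.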
EDA Approach:

Description: Structured data format for crystallographic information
Typical Data: Unit cell parameters, atomic coordinates, symmetry operations, experimental data
Use Cases: Crystal structure determination, structural biology, materials science
Python Libraries:
- gemmi: `gemmi.cif.read_file('file.cif')`
- PyCifRW: `CifFile.ReadCif('file.cif')`
- Biopython: `Bio.PDB.MMCIFParser()`
EDA Approach:

Description: Chemical structure file format by MDL/Accelrys
Typical Data: 2D/3D coordinates, atom types, bond orders, charges
Use Cases: Chemical database storage, cheminformatics, drug design
Python Libraries:
- RDKit: `Chem.MolFromMolFile('file.mol')`
- Open Babel: `pybel.readfile('mol', 'file.mol')`
- ChemoPy: For descriptor calculation
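The atom and bond blocks of a V2000 molfile can be inspected with plain string slicing, which is sometimes useful when RDKit rejects a file and the cause is unclear. This sketch assumes the V2000 fixed-width layout (counts on the fourth line, element symbol in columns 32-34 of each atom line); `make_molfile`, `parse_molfile`, and the geometry are hypothetical.

```python
from collections import Counter


def make_molfile(name, atoms, bonds):
    """Assemble a minimal V2000 molfile (hypothetical helper for this demo)."""
    lines = [name, "", ""]
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    for symbol, x, y, z in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3} 0  0")
    for a, b, order in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}  0")
    lines.append("M  END")
    return "\n".join(lines)


# Illustrative heavy-atom skeleton; coordinates are not an optimized geometry.
SAMPLE_MOL = make_molfile(
    "ethanol heavy atoms (illustrative)",
    [("C", 0.0, 0.0, 0.0), ("C", 1.5, 0.0, 0.0), ("O", 2.0, 1.4, 0.0)],
    [(1, 2, 1), (2, 3, 1)],
)


def parse_molfile(text):
    """Read element symbols and (atom1, atom2, order) bonds from a V2000 block."""
    lines = text.splitlines()
    counts = lines[3]
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    symbols = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    bonds = []
    for line in lines[4 + n_atoms: 4 + n_atoms + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))
    return symbols, bonds


symbols, bonds = parse_molfile(SAMPLE_MOL)
element_counts = Counter(symbols)
bond_order_counts = Counter(order for _, _, order in bonds)
print(element_counts, bond_order_counts)
```

Fixed-width slicing matters for the counts line in particular, because three-digit atom and bond counts can run together with no separating whitespace.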
EDA Approach:

Description: Complete 3D molecular structure format with atom typing
Typical Data: Coordinates, SYBYL atom types, bond types, charges, substructures
Use Cases: Molecular docking, QSAR studies, drug discovery
Python Libraries:
- RDKit: `Chem.MolFromMol2File('file.mol2')`
- Open Babel: `pybel.readfile('mol2', 'file.mol2')`
- MDAnalysis: Can parse mol2 topology
EDA Approach:

Description: Multi-structure file format with associated data
Typical Data: Multiple molecular structures with properties/annotations
Use Cases: Chemical databases, virtual screening, compound libraries
Python Libraries:
- RDKit: `Chem.SDMolSupplier('file.sdf')`
- Open Babel: `pybel.readfile('sdf', 'file.sdf')`
- PandasTools (RDKit): For DataFrame integration
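Before loading a large SDF into RDKit, it can help to know how many records it holds and which data tags are populated. The sketch below does that with the standard library only, assuming records end with a `$$$$` line and data items use the common `> <TAG>` header followed by value lines and a blank line. The function names and sample records are hypothetical, and the empty molblocks are placeholders.

```python
from collections import Counter

# Two placeholder records; the second deliberately lacks the LogP tag.
SAMPLE_SDF = """\
mol_a
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
CPD-001

> <LogP>
1.25

$$$$
mol_b
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
CPD-002

$$$$
"""


def iter_sdf_records(text):
    """Yield each record as a list of lines, splitting on the $$$$ delimiter."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def read_data_fields(record):
    """Collect '> <TAG>' data items; assumes every header carries a <TAG> name."""
    fields, key = {}, None
    for line in record:
        if line.startswith(">"):
            key = line[line.index("<") + 1: line.rindex(">")]
            fields[key] = []
        elif key is not None:
            if line.strip() == "":
                key = None  # a blank line closes the current data item
            else:
                fields[key].append(line)
    return {k: "\n".join(v) for k, v in fields.items()}


records = list(iter_sdf_records(SAMPLE_SDF))
fields = [read_data_fields(r) for r in records]
tag_coverage = Counter(tag for f in fields for tag in f)
print(len(records), dict(tag_coverage))
```

Comparing `tag_coverage` against the record count shows at a glance which properties are missing for some compounds.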
EDA Approach:

Description: Simple Cartesian coordinate format
Typical Data: Atom types and 3D coordinates
Use Cases: Quantum chemistry, geometry optimization, molecular dynamics
Python Libraries:
- ASE: `ase.io.read('file.xyz')`
- Open Babel: `pybel.readfile('xyz', 'file.xyz')`
- cclib: For parsing QM outputs with xyz
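Because XYZ is just an atom count, a comment line, and whitespace-separated `element x y z` rows, a first inspection needs no external library. This is a minimal sketch for a single-frame file; `parse_xyz` and the water geometry are hypothetical, and multi-frame files would need a loop around the same logic.

```python
import math
from collections import Counter

# Illustrative single-frame geometry in angstroms.
SAMPLE_XYZ = """\
3
water (hypothetical geometry)
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
"""


def parse_xyz(text):
    """Return (comment, [(symbol, x, y, z), ...]) for a single-frame XYZ string."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    comment = lines[1]
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return comment, atoms


comment, atoms = parse_xyz(SAMPLE_XYZ)
element_counts = Counter(symbol for symbol, *_ in atoms)
# Unweighted geometric center of all atoms.
centroid = tuple(sum(a[i] for a in atoms) / len(atoms) for i in (1, 2, 3))
# Distance between the first two atoms (O and H here).
oh_distance = math.dist(atoms[0][1:], atoms[1][1:])
print(element_counts, centroid, round(oh_distance, 4))
```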
EDA Approach:

Description: Line notation for chemical structures
Typical Data: Text representation of molecular structure
Use Cases: Chemical databases, literature mining, data exchange
Python Libraries:
- RDKit: `Chem.MolFromSmiles(smiles)`
- Open Babel: Can parse SMILES
- DeepChem: For ML on SMILES
EDA Approach:

Description: Modified PDB format for AutoDock docking
Typical Data: Coordinates, partial charges, atom types for docking
Use Cases: Molecular docking, virtual screening
Python Libraries:
- Meeko: For PDBQT preparation
- Open Babel: Can read PDBQT
- ProDy: Limited PDBQT support
EDA Approach:

Description: Schrödinger's proprietary molecular structure format
Typical Data: Structures, properties, annotations from Schrödinger suite
Use Cases: Drug discovery, molecular modeling with Schrödinger tools
Python Libraries:
- schrodinger.structure: Requires Schrödinger installation

Description: Molecular structure file for GROMACS MD simulations
Typical Data: Atom positions, velocities, box vectors
Use Cases: Molecular dynamics simulations, GROMACS workflows
Python Libraries:
- MDAnalysis: `Universe('file.gro')`
- MDTraj: `mdtraj.load_gro('file.gro')`
- GromacsWrapper: For GROMACS integration
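A `.gro` file is fixed-width text: a title line, an atom count, one line per atom, and a final box line. The sketch below reads residue names, positions, and the box from such a string, assuming the usual `%5d%-5s%5s%5d%8.3f%8.3f%8.3f` atom layout without velocities. `make_gro_atom`, `parse_gro`, and the coordinates are hypothetical.

```python
from collections import Counter


def make_gro_atom(resid, resname, name, serial, x, y, z):
    """Format one atom line in the fixed-width .gro layout (demo helper)."""
    return f"{resid:5d}{resname:<5}{name:>5}{serial:5d}{x:8.3f}{y:8.3f}{z:8.3f}"


# One illustrative water molecule; positions are in nm, box is cubic.
SAMPLE_GRO = "\n".join([
    "Water (hypothetical)",
    "    3",
    make_gro_atom(1, "SOL", "OW", 1, 0.126, 1.624, 1.679),
    make_gro_atom(1, "SOL", "HW1", 2, 0.190, 1.661, 1.747),
    make_gro_atom(1, "SOL", "HW2", 3, 0.177, 1.568, 1.613),
    f"{1.862:10.5f}{1.862:10.5f}{1.862:10.5f}",
])


def parse_gro(text):
    """Return (title, atoms, box) from a single-frame .gro string."""
    lines = text.splitlines()
    title = lines[0]
    n_atoms = int(lines[1])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        atoms.append({
            "resid": int(line[0:5]),
            "resname": line[5:10].strip(),
            "name": line[10:15].strip(),
            "xyz": (float(line[20:28]), float(line[28:36]), float(line[36:44])),
        })
    box = tuple(float(v) for v in lines[2 + n_atoms].split())
    return title, atoms, box


title, atoms, box = parse_gro(SAMPLE_GRO)
resname_counts = Counter(a["resname"] for a in atoms)
print(title, resname_counts, box)
```

Checking that every coordinate lies within the box dimensions is a natural next step when validating a starting structure.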
EDA Approach:

Description: Output from Gaussian quantum chemistry calculations
Typical Data: Energies, geometries, frequencies, orbitals, populations
Use Cases: QM calculations, geometry optimization, frequency analysis
Python Libraries:
- cclib: `cclib.io.ccread('file.log')`
- GaussianRunPack: For Gaussian workflows

Description: Generic output file from various QM packages
Typical Data: Calculation results, energies, properties
Use Cases: QM calculations across different software
Python Libraries:
- cclib: Universal parser for QM outputs
- ASE: Can read some output formats
EDA Approach:

Description: Wavefunction data for quantum chemical analysis
Typical Data: Molecular orbitals, basis sets, density matrices
Use Cases: Electron density analysis, QTAIM analysis
Python Libraries:
- Multiwfn: Interface via Python
- Horton: For wavefunction analysis

Description: Formatted checkpoint file from Gaussian
Typical Data: Complete wavefunction data, results, geometry
Use Cases: Post-processing Gaussian calculations
Python Libraries:
- cclib: Can parse fchk files
- GaussView Python API (if available)

Description: Volumetric data on a 3D grid
Typical Data: Electron density, molecular orbitals, ESP on grid
Use Cases: Visualization of volumetric properties
Python Libraries:
- cclib: `cclib.io.ccread('file.cube')`
- ase.io: `ase.io.read('file.cube')`
- pyquante: For cube file manipulation
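The header of a cube file already says a lot about the grid: two comment lines, an atom count with the grid origin, three lines giving the number of points and step vector per axis, then the atoms, then the values. The sketch below summarizes a tiny hypothetical cube under those assumptions. `parse_cube` and all numbers are illustrative, and it assumes positive point counts and a single dataset.

```python
# Hypothetical 2x2x2 grid with one oxygen atom; values are illustrative only.
SAMPLE_CUBE = """\
Hypothetical density cube
2x2x2 grid, illustrative values only
    1    0.000000    0.000000    0.000000
    2    1.000000    0.000000    0.000000
    2    0.000000    1.000000    0.000000
    2    0.000000    0.000000    1.000000
    8    0.000000    0.500000    0.500000    0.500000
  0.10  0.20  0.30  0.40  0.50  0.60
  0.70  0.80
"""


def parse_cube(text):
    """Read grid shape, step vectors, atomic numbers and values from a cube string."""
    lines = text.splitlines()
    head = lines[2].split()
    n_atoms = int(head[0])
    origin = tuple(float(v) for v in head[1:4])
    shape, axes = [], []
    for line in lines[3:6]:
        parts = line.split()
        shape.append(int(parts[0]))
        axes.append(tuple(float(v) for v in parts[1:4]))
    atom_lines = lines[6:6 + abs(n_atoms)]
    atomic_numbers = [int(line.split()[0]) for line in atom_lines]
    # A negative atom count signals an extra orbital-index line before the data.
    start = 6 + abs(n_atoms) + (1 if n_atoms < 0 else 0)
    values = [float(v) for line in lines[start:] for v in line.split()]
    return {
        "comments": lines[:2],
        "origin": origin,
        "shape": tuple(shape),
        "axes": axes,
        "atomic_numbers": atomic_numbers,
        "values": values,
    }


grid = parse_cube(SAMPLE_CUBE)
(ax, ay, az), (bx, by, bz), (cx, cy, cz) = grid["axes"]
# Voxel volume from the scalar triple product of the three step vectors.
voxel_volume = abs(ax * (by * cz - bz * cy) - ay * (bx * cz - bz * cx) + az * (bx * cy - by * cx))
integral = sum(grid["values"]) * voxel_volume
print(grid["shape"], min(grid["values"]), max(grid["values"]), integral)
```

For an electron density cube, the integral computed this way should land near the electron count, which makes it a useful check that the grid is large and fine enough.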
EDA Approach:

Description: Binary trajectory format (CHARMM, NAMD)
Typical Data: Time series of atomic coordinates
Use Cases: MD trajectory analysis
Python Libraries:
- MDAnalysis: `Universe(topology, 'traj.dcd')`
- MDTraj: `mdtraj.load_dcd('traj.dcd', top='topology.pdb')`
- PyTraj (Amber): Limited support
EDA Approach:

Description: GROMACS compressed trajectory format
Typical Data: Compressed coordinates from MD simulations
Use Cases: Space-efficient MD trajectory storage
Python Libraries:
- MDAnalysis: `Universe(topology, 'traj.xtc')`
- MDTraj: `mdtraj.load_xtc('traj.xtc', top='topology.pdb')`
EDA Approach:

Description: Full precision GROMACS trajectory
Typical Data: Coordinates, velocities, forces from MD
Use Cases: High-precision MD analysis
Python Libraries:
- MDAnalysis: Full support
- MDTraj: Can read trr files
- GromacsWrapper
EDA Approach:

Description: Network Common Data Form trajectory
Typical Data: MD coordinates, velocities, forces
Use Cases: Amber MD simulations, large trajectory storage
Python Libraries:
- MDAnalysis: NetCDF support
- PyTraj: Native Amber analysis
- netCDF4: Low-level access
EDA Approach:

Description: Molecular topology for GROMACS
Typical Data: Atom types, bonds, angles, force field parameters
Use Cases: MD simulation setup and analysis
Python Libraries:
- ParmEd: `parmed.load_file('system.top')`
- MDAnalysis: Can parse topology

Description: Topology file for CHARMM/NAMD
Typical Data: Atom connectivity, types, charges
Use Cases: CHARMM/NAMD MD simulations
Python Libraries:
- MDAnalysis: Native PSF support
- ParmEd: Can read PSF files
EDA Approach:

Description: Amber topology and parameter file
Typical Data: System topology, force field parameters
Use Cases: Amber MD simulations
Python Libraries:
- ParmEd: `parmed.load_file('system.prmtop')`
- PyTraj: Native Amber support
EDA Approach:

Description: Amber coordinate/restart file
Typical Data: Atomic coordinates, velocities, box info
Use Cases: Starting coordinates for Amber MD
Python Libraries:
- ParmEd: Works with prmtop
- PyTraj: Amber coordinate reading
EDA Approach:

Description: Joint Committee on Atomic and Molecular Physical Data eXchange
Typical Data: Spectroscopic data (IR, NMR, MS, UV-Vis)
Use Cases: Spectroscopy data exchange and archiving
Python Libraries:
- jcamp: `jcamp.jcamp_readfile('file.jdx')`
- nmrglue: For NMR JCAMP files

Description: Standard XML format for mass spectrometry data
Typical Data: MS/MS spectra, chromatograms, metadata
Use Cases: Proteomics, metabolomics, mass spectrometry workflows
Python Libraries:
- pymzml: `pymzml.run.Reader('file.mzML')`
- pyteomics: `pyteomics.mzml.read('file.mzML')`
- MSFileReader wrappers
EDA Approach:

Description: Open XML format for MS data
Typical Data: Mass spectra, retention times, peak lists
Use Cases: Legacy MS data, metabolomics
Python Libraries:
- pymzml: Can read mzXML
- pyteomics: `pyteomics.mzxml`
- lxml: For direct XML parsing
EDA Approach:

Description: Proprietary instrument data files (Thermo, Bruker, etc.)
Typical Data: Raw instrument signals, unprocessed data
Use Cases: Direct instrument data access
Python Libraries:
- pymsfilereader: For Thermo RAW files
- ThermoRawFileParser: CLI wrapper

Description: Agilent's data folder structure
Typical Data: LC-MS, GC-MS data and metadata
Use Cases: Agilent instrument data processing
Python Libraries:
- agilent-reader: Community tools
- Chemstation Python integration

Description: Raw NMR time-domain data
Typical Data: Time-domain NMR signal
Use Cases: NMR processing and analysis
Python Libraries:
- nmrglue: `nmrglue.bruker.read('experiment_dir')` (takes the Bruker experiment directory that contains the `fid` file)
- nmrstarlib: For NMR-STAR files
EDA Approach:

Description: Processed NMR spectrum
Typical Data: Frequency-domain NMR data
Use Cases: NMR analysis and interpretation
Python Libraries:
- nmrglue: Comprehensive NMR support
- pyNMR: For processing
EDA Approach:

Description: Thermo Galactic spectroscopy format
Typical Data: IR, Raman, UV-Vis spectra
Use Cases: Spectroscopic data from various instruments
Python Libraries:
- spc: `spc.File('file.spc')`

Description: Text identifier for chemical substances
Typical Data: Layered chemical structure representation
Use Cases: Chemical database keys, structure searching
Python Libraries:
- RDKit: `Chem.MolFromInchi(inchi)`
- Open Babel: InChI conversion
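The layered structure of an InChI string can be seen by splitting on `/`: the first field is the version, the second is normally the molecular formula, and later layers start with a one-letter prefix such as `c` (connectivity) or `h` (hydrogens). The helper below is a hypothetical sketch for inspection only. It does not validate the identifier, assumes a formula layer is present, and keeps only the first occurrence of a repeated prefix.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into a dict of layers keyed by prefix letter."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("expected a string starting with 'InChI='")
    parts = body.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        # Assumption: the second field is the formula layer, which has no prefix.
        layers["formula"] = parts[1]
    for part in parts[2:]:
        # Later layers may repeat a prefix; keep the first occurrence.
        layers.setdefault(part[0], part[1:])
    return layers


# Standard InChI for ethanol.
layers = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(layers)
```

Tabulating which layer prefixes occur across a dataset is a quick way to spot entries that carry charge, stereochemistry, or isotope information.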
EDA Approach:

Description: ChemDraw drawing file format
Typical Data: 2D chemical structures with annotations
Use Cases: Chemical drawing, publication figures
Python Libraries:
- RDKit: Can import some CDXML
- Open Babel: Limited support
- ChemDraw Python API (commercial)
EDA Approach:

Description: XML-based chemical structure format
Typical Data: Chemical structures, reactions, properties
Use Cases: Semantic chemical data representation
Python Libraries:
- RDKit: CML support
- Open Babel: Good CML support
- lxml: For XML parsing
EDA Approach:

Description: Chemical reaction structure file
Typical Data: Reactants, products, reaction arrows
Use Cases: Reaction databases, synthesis planning
Python Libraries:
- RDKit: `AllChem.ReactionFromRxnFile('file.rxn')`
- Open Babel: Reaction support
EDA Approach:

Description: Multi-reaction file format
Typical Data: Multiple reactions with data
Use Cases: Reaction databases
Python Libraries:
- RDKit: RDF reading capabilities

Description: Container for scientific data arrays
Typical Data: Large arrays, metadata, hierarchical organization
Use Cases: Large dataset storage, computational results
Python Libraries:
- h5py: `h5py.File('file.h5', 'r')`
- pytables: Advanced HDF5 interface
- pandas: Can read HDF5
EDA Approach:

Description: Serialized Python objects
Typical Data: Any Python object (molecules, dataframes, models)
Use Cases: Intermediate data storage, model persistence
Python Libraries:
- pickle: Built-in serialization
- joblib: Enhanced pickling for large arrays
- dill: Extended pickle support
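When exploring a pickle of unknown content, the first questions are usually what type came back and what it contains. The sketch below round-trips a hypothetical record in memory and reports the type of each top-level entry; `describe` and the record are illustrative. Unpickling can execute arbitrary code, so this should only be done with files from a trusted source.

```python
import pickle

# Hypothetical record standing in for the contents of a pickle file.
record = {"smiles": "CCO", "descriptors": [46.07, 1], "tags": ("alcohol",)}

# Serialize and restore in memory; with a file, use pickle.load(open(path, "rb")).
blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)


def describe(obj):
    """Report the type name of an object, or of each value if it is a dict."""
    if isinstance(obj, dict):
        return {key: type(value).__name__ for key, value in obj.items()}
    return type(obj).__name__


summary = describe(restored)
print(type(restored).__name__, summary)
```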
EDA Approach:

Description: NumPy array binary format
Typical Data: Numerical arrays (coordinates, features, matrices)
Use Cases: Fast numerical data I/O
Python Libraries:
- numpy: `np.load('file.npy')`

Description: MATLAB workspace data
Typical Data: Arrays, structures from MATLAB
Use Cases: MATLAB-Python data exchange
Python Libraries:
- scipy.io: `scipy.io.loadmat('file.mat')`
- h5py: For v7.3 MAT files
EDA Approach:

Description: Tabular data in text format
Typical Data: Chemical properties, experimental data, descriptors
Use Cases: Data exchange, analysis, machine learning
Python Libraries:
- pandas: `pd.read_csv('file.csv')`
- csv: Built-in module
- polars: Fast CSV reading
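A first pass over a CSV of chemical properties typically counts rows, finds missing values per column, and summarizes one numeric column. The sketch below does this with the built-in `csv` module. `summarize_csv`, the column names, and the values are hypothetical, and empty strings are assumed to mark missing entries.

```python
import csv
import io
import statistics

# Illustrative table; compound C has no molecular weight recorded.
SAMPLE_CSV = "compound,mw,logp\nA,46.07,-0.31\nB,78.11,2.13\nC,,1.50\n"


def summarize_csv(text, numeric_column):
    """Return (row count, missing-per-column, mean of one numeric column)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    missing = {
        column: sum(1 for row in rows if row[column] == "")
        for column in reader.fieldnames
    }
    values = [float(row[numeric_column]) for row in rows if row[numeric_column] != ""]
    return len(rows), missing, statistics.mean(values)


n_rows, missing, mean_mw = summarize_csv(SAMPLE_CSV, "mw")
print(n_rows, missing, round(mean_mw, 2))
```

With pandas installed, `df.isna().sum()` and `df.describe()` cover the same ground in two calls; the standard-library version is shown here only to keep the example dependency-free.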
EDA Approach:

Description: Structured text data format
Typical Data: Chemical properties, metadata, API responses
Use Cases: Data interchange, configuration, web APIs
Python Libraries:
- json: Built-in JSON support
- pandas: `pd.read_json()`
- ujson: Faster JSON parsing
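For nested JSON such as an API response, listing every path with its value type gives a quick map of the structure before deciding how to flatten it. The following sketch walks a hypothetical document recursively; `walk` and the sample payload are illustrative, and only the first element of each list is descended into, on the assumption that list items share a shape.

```python
import json

# Hypothetical payload resembling a compound record from a web API.
SAMPLE_JSON = (
    '{"name": "ethanol", "properties": {"mw": 46.07, "logp": -0.31}, '
    '"synonyms": ["EtOH", "ethyl alcohol"]}'
)


def walk(node, path="$"):
    """Yield (path, type description) for every leaf and list in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, f"list[{len(node)}]"
        for item in node[:1]:  # sample only the first element
            yield from walk(item, f"{path}[0]")
    else:
        yield path, type(node).__name__


document = json.loads(SAMPLE_JSON)
schema = list(walk(document))
for path, kind in schema:
    print(path, kind)
```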
EDA Approach:

Description: Columnar storage format
Typical Data: Large tabular datasets stored efficiently
Use Cases: Big data, efficient columnar analytics
Python Libraries:
- pandas: `pd.read_parquet('file.parquet')`
- pyarrow: Direct parquet access
- fastparquet: Alternative implementation
EDA Approach: