scientific-skills/exploratory-data-analysis/references/proteomics_metabolomics_formats.md
This reference covers file formats specific to proteomics, metabolomics, lipidomics, and related omics workflows.
Description: Standard XML format for MS data
Typical Data: MS1 and MS2 spectra, retention times, intensities
Use Cases: Proteomics, metabolomics pipelines
Python Libraries:
pymzml: pymzml.run.Reader('file.mzML')
pyteomics.mzml: pyteomics.mzml.read('file.mzML')
pyopenms: OpenMS Python bindings
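Before reaching for any of the libraries above, a quick first look at an uncompressed mzML file is possible with the standard library alone. The sketch below streams the XML and counts spectra per MS level. It assumes the level is stored as a `cvParam` with the PSI-MS accession `MS:1000511` directly under each `spectrum` element. The function name `count_ms_levels` is hypothetical, and this is not a substitute for a real mzML reader.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS controlled-vocabulary term "ms level"


def _local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def count_ms_levels(path):
    """Count spectra per MS level while streaming the file."""
    levels = Counter()
    for _event, elem in ET.iterparse(path, events=("end",)):
        if _local(elem.tag) != "spectrum":
            continue
        level = "unknown"
        for child in elem:
            if (_local(child.tag) == "cvParam"
                    and child.get("accession") == MS_LEVEL_ACCESSION):
                level = child.get("value")
                break
        levels[level] += 1
        elem.clear()  # release peak data already seen to limit memory use
    return dict(levels)
```

The returned dictionary maps the MS level as a string (for example `"1"` or `"2"`) to the number of spectra, which is often enough to confirm that a file contains the MS2 scans a pipeline expects.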
EDA Approach:

Description: Older XML-based MS format
Typical Data: Mass spectra with metadata
Use Cases: Legacy proteomics data
Python Libraries:
pyteomics.mzxml
pymzml: Can read mzXML
EDA Approach:

Description: PSI standard for peptide identifications
Typical Data: Peptide-spectrum matches, proteins, scores
Use Cases: Search engine results, proteomics workflows
Python Libraries:
pyteomics.mzid
pyopenms: MzIdentML support
EDA Approach:

Description: TPP format for peptide identifications
Typical Data: Search results with statistical validation
Use Cases: Proteomics database search output
Python Libraries:
pyteomics.pepxml
EDA Approach:

Description: TPP protein-level identifications
Typical Data: Protein groups, probabilities, peptides
Use Cases: Protein-level analysis
Python Libraries:
pyteomics.protxml
EDA Approach:

Description: Proteomics Identifications Database format
Typical Data: Complete proteomics experiment data
Use Cases: Public data deposition (legacy)
Python Libraries:
pyteomics.pride

Description: Tab or comma-separated proteomics results
Typical Data: Peptide or protein quantification tables
Use Cases: MaxQuant, Proteome Discoverer, Skyline output
Python Libraries:
pandas: pd.read_csv() or pd.read_table()
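pandas is the route named above. As a dependency-free first check, the sketch below uses only the standard library to guess the delimiter and count empty cells per column. It assumes the header row contains more of the real delimiter than of the other candidate, and it treats empty strings, `NaN` and `NA` as missing. Both are assumptions rather than rules of any particular tool, and `profile_table` is a hypothetical name.

```python
import csv

MISSING_TOKENS = ("", "NaN", "NA")  # assumed markers for missing values


def profile_table(path):
    """Return (delimiter, row count, missing-cell count per column)."""
    with open(path, newline="") as handle:
        header_line = handle.readline()
        # Assumption: the header holds more of the true delimiter.
        tabs, commas = header_line.count("\t"), header_line.count(",")
        delimiter = "\t" if tabs > commas else ","
        handle.seek(0)
        reader = csv.DictReader(handle, delimiter=delimiter)
        missing = {name: 0 for name in reader.fieldnames}
        n_rows = 0
        for row in reader:
            n_rows += 1
            for name in missing:
                if (row.get(name) or "").strip() in MISSING_TOKENS:
                    missing[name] += 1
    return delimiter, n_rows, missing
```

Columns with a high missing count are the ones to inspect first before any normalization or filtering.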
EDA Approach:

Description: Proteome Discoverer results database
Typical Data: SQLite database with search results
Use Cases: Thermo Proteome Discoverer workflows
Python Libraries:
sqlite3: Database access

Description: Proteome Discoverer study results
Typical Data: Comprehensive search and quantification
Use Cases: PD study exports
Python Libraries:
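For the Proteome Discoverer results database, which is described above as SQLite, a reasonable first step is to list the tables and their row counts. The sketch below makes no assumption about table names, since the schema is not given here, and it opens the file read-only so that exploration cannot alter it. `list_tables` is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Read-only URI connection: exploration cannot modify the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        counts = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'  # quote identifier safely
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

From there, `PRAGMA table_info(...)` on the larger tables shows which columns are worth loading for further analysis.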

Description: Compact peptide identification format
Typical Data: Peptide sequences, modifications, scores
Use Cases: Downstream analysis input
Python Libraries:
pyteomics: XML parsing
EDA Approach:

Description: Skyline targeted proteomics document
Typical Data: Transition lists, chromatograms, results
Use Cases: Targeted proteomics (SRM/MRM/PRM)
Python Libraries:
skyline: Python API (limited)

Description: Skyline document with external files
Typical Data: Complete Skyline analysis
Use Cases: Sharing Skyline projects
Python Libraries:
zipfile: Extract for processing
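Because a shared Skyline project is a zip archive, it can be inspected before anything is unpacked. The sketch below lists member names and sizes, then extracts on request. The helper names are hypothetical, and no particular set of member files is assumed.

```python
import zipfile


def list_archive(path):
    """Return (name, uncompressed size in bytes) for each archive member."""
    with zipfile.ZipFile(path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def extract_archive(path, target_dir):
    """Extract all members into target_dir and return their names."""
    with zipfile.ZipFile(path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Listing first is cheap and shows how large the extracted project will be.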
EDA Approach:

Description: SCIEX instrument data with quantitation
Typical Data: LC-MS/MS with MRM transitions
Use Cases: SCIEX QTRAP, TripleTOF data
Python Libraries:

Description: Thermo raw instrument file
Typical Data: Full MS data from Orbitrap, Q Exactive
Use Cases: Label-free and TMT quantification
Python Libraries:
pymsfilereader: Thermo RawFileReader
ThermoRawFileParser: Cross-platform CLI
EDA Approach:

Description: Agilent data directory
Typical Data: LC-MS and GC-MS data
Use Cases: Agilent instrument workflows
Python Libraries:

Description: Standard MS format for metabolomics
Typical Data: Full scan MS, targeted MS/MS
Use Cases: Untargeted and targeted metabolomics
Python Libraries:

Description: Analytical Data Interchange for MS
Typical Data: GC-MS, LC-MS chromatography data
Use Cases: Metabolomics, GC-MS workflows
Python Libraries:
netCDF4: Low-level access
pyopenms: CDF support
xcms via R integration
EDA Approach:

Description: NIST spectral library format
Typical Data: Reference mass spectra
Use Cases: Metabolite identification, library matching
Python Libraries:
matchms: Spectral matching

Description: Mascot Generic Format for MS/MS
Typical Data: MS/MS spectra for metabolite ID
Use Cases: Spectral library searching
Python Libraries:
matchms: Metabolomics spectral analysis
pyteomics.mgf
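Mascot Generic Format is plain text, so its layout is easy to see in a small hand-written reader. The sketch below is illustrative only and is no replacement for matchms or pyteomics.mgf. It assumes each spectrum sits between `BEGIN IONS` and `END IONS`, that parameters are `KEY=VALUE` lines, and that peak lines start with a number. The function name `read_mgf` is hypothetical.

```python
COMMENT_CHARS = "#;!/"  # assumed comment markers at the start of a line


def read_mgf(path):
    """Yield one dict per spectrum: {'params': {...}, 'peaks': [(mz, intensity), ...]}."""
    spectrum = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line[0] in COMMENT_CHARS:
                continue
            if line == "BEGIN IONS":
                spectrum = {"params": {}, "peaks": []}
            elif line == "END IONS":
                if spectrum is not None:
                    yield spectrum
                spectrum = None
            elif spectrum is None:
                continue  # parameters outside any spectrum are skipped here
            elif "=" in line and not line[0].isdigit():
                key, value = line.split("=", 1)
                spectrum["params"][key] = value
            else:
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else None
                spectrum["peaks"].append((mz, intensity))
```

Iterating once and collecting `len(s["peaks"])` per spectrum gives a quick view of how many spectra are too sparse to be useful for library matching.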
EDA Approach:

Description: Standard XML format for NMR metabolomics
Typical Data: 1D/2D NMR spectra with metadata
Use Cases: NMR-based metabolomics
Python Libraries:
nmrml2isa: Format conversion

Description: JSON format for metabolomics results
Typical Data: Feature tables, annotations, metadata
Use Cases: GNPS, MetaboAnalyst, web tools
Python Libraries:
json: Standard library
pandas: JSON normalization
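JSON exports differ from tool to tool, so it helps to print the shape of the document before deciding what to flatten into a table. The sketch below walks the top levels of a parsed JSON value and reports the type and size at each path. It assumes nothing about the schema, and `describe_json` is a hypothetical helper name.

```python
import json


def describe_json(node, path="$", max_depth=3, out=None):
    """Collect 'path: type' lines for the top levels of a parsed JSON value."""
    out = [] if out is None else out
    if isinstance(node, dict):
        out.append(f"{path}: object with {len(node)} keys")
        if max_depth > 0:
            for key, value in node.items():
                describe_json(value, f"{path}.{key}", max_depth - 1, out)
    elif isinstance(node, list):
        out.append(f"{path}: array of {len(node)} items")
        if node and max_depth > 0:
            # Only the first item is inspected, as a representative sample.
            describe_json(node[0], f"{path}[0]", max_depth - 1, out)
    else:
        out.append(f"{path}: {type(node).__name__}")
    return out


def describe_json_file(path, max_depth=3):
    """Load a JSON file and describe its structure."""
    with open(path) as handle:
        return describe_json(json.load(handle), max_depth=max_depth)
```

An array of objects found this way is usually the part worth handing to pandas for normalization.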
EDA Approach:

Description: Tab-delimited feature tables
Typical Data: m/z, RT, intensities across samples
Use Cases: MZmine, XCMS, MS-DIAL output
Python Libraries:
pandas: Text file reading
EDA Approach:

Description: OpenMS detected features
Typical Data: LC-MS features with quality scores
Use Cases: OpenMS workflows
Python Libraries:
pyopenms: FeatureXML support
EDA Approach:

Description: Linked features across samples
Typical Data: Aligned features with group info
Use Cases: Multi-sample LC-MS analysis
Python Libraries:
pyopenms: ConsensusXML reading
EDA Approach:

Description: Peptide/metabolite identifications
Typical Data: MS/MS identifications with scores
Use Cases: OpenMS ID workflows
Python Libraries:
pyopenms: IdXML support
EDA Approach:

Description: LipidCreator transition list
Typical Data: Lipid transitions for targeted MS
Use Cases: Targeted lipidomics
Python Libraries:

Description: PSI tabular summary format
Typical Data: Protein/peptide/metabolite quantification
Use Cases: Publication and data sharing
Python Libraries:
pyteomics.mztab
pandas for TSV-like structure
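The "TSV-like structure" mentioned above comes from the fact that every mzTab line starts with a short section prefix. The sketch below groups rows by that prefix and pairs a header line with its data rows. It assumes the usual prefixes (`MTD` for metadata, `PRH`/`PRT` for the protein header and rows, `COM` for comments) and is not a validating parser. The helper names are hypothetical.

```python
import csv
from collections import defaultdict


def split_mztab(path):
    """Group mzTab rows by section prefix (MTD, PRH, PRT, ...)."""
    sections = defaultdict(list)
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if not row or not row[0].strip():
                continue  # blank separator lines
            if row[0] == "COM":
                continue  # comment lines
            sections[row[0]].append(row[1:])
    return dict(sections)


def section_table(sections, header_prefix, row_prefix):
    """Pair a header line (e.g. PRH) with its rows (e.g. PRT) as dicts."""
    headers = sections.get(header_prefix, [])
    if not headers:
        return []
    return [dict(zip(headers[0], row)) for row in sections.get(row_prefix, [])]
```

The list of dicts returned by `section_table` can be passed directly to `pandas.DataFrame` when a table view is wanted.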
EDA Approach:

Description: Lipid identification results
Typical Data: Lipid annotations, grades, intensities
Use Cases: Lipidomics software output
Python Libraries:
pandas: CSV reading
EDA Approach:

Description: Structure data file for metabolites
Typical Data: Chemical structures with properties
Use Cases: Metabolite database creation
Python Libraries:
RDKit: Chem.SDMolSupplier('file.sdf')
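RDKit is the tool named above for reading structures. As a lightweight check before any chemistry parsing, the sketch below counts records and tallies the data-field names in an SD file with the standard library. It assumes records end with a `$$$$` line and that data headers look like `> <FIELD_NAME>`. A structure or value line that happens to match that pattern would be miscounted, so treat the result as a rough inventory. `summarize_sdf` is a hypothetical name.

```python
import re
from collections import Counter

# Data header lines start with ">" and carry the field name in angle brackets.
FIELD_HEADER = re.compile(r"^>.*<([^>]+)>")


def summarize_sdf(path):
    """Return (record count, Counter of data-field names) for an SD file."""
    n_records = 0
    fields = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith("$$$$"):
                n_records += 1
                continue
            match = FIELD_HEADER.match(line)
            if match:
                fields[match.group(1)] += 1
    return n_records, fields
```

Fields whose count is lower than the record count are present on only some molecules.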
EDA Approach:

Description: Single molecule structure files
Typical Data: Metabolite chemical structure
Use Cases: Structure-based searches
Python Libraries:
RDKit: Chem.MolFromMolFile('file.mol')
EDA Approach:

Description: HDF5 for large omics datasets
Typical Data: Feature matrices, spectra, metadata
Use Cases: Large-scale studies, cloud computing
Python Libraries:
h5py: HDF5 access
anndata: For single-cell proteomics
EDA Approach:

Description: Serialized R analysis objects
Typical Data: Processed omics results from R packages
Use Cases: xcms, CAMERA, MSnbase workflows
Python Libraries:
pyreadr: pyreadr.read_r('file.Rdata')
rpy2: R-Python integration
EDA Approach:

Description: mzTab specific to metabolomics
Typical Data: Small molecule quantification
Use Cases: Metabolomics data sharing
Python Libraries:
pyteomics.mztab: Can parse mzTab-M
EDA Approach:

Description: Columnar storage for large tables
Typical Data: Feature matrices, metadata
Use Cases: Efficient big data omics
Python Libraries:
pandas: pd.read_parquet()
pyarrow: Direct parquet access
EDA Approach:

Description: Pickled Python objects
Typical Data: ML models, processed data
Use Cases: Workflow intermediate storage
Python Libraries:
pickle: Standard serialization
joblib: Enhanced pickling
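For intermediate storage within a workflow, a round trip through the standard `pickle` module can look like the sketch below. The helper names are hypothetical. Unpickling can execute arbitrary code, so this pattern suits only files produced by your own pipeline or another trusted source.

```python
import pickle


def save_intermediate(obj, path):
    """Write obj using the highest pickle protocol available."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_intermediate(path):
    """Load a pickle file. Only use on trusted files:
    unpickling can run arbitrary code."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

When inspecting an unfamiliar pickle from a trusted source, `type(obj)` and, for containers, `len(obj)` or the dictionary keys are usually the first things to check.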
EDA Approach:

Description: Chunked, compressed array storage
Typical Data: Multi-dimensional omics data
Use Cases: Cloud-optimized analysis
Python Libraries:
zarr: Array storage
EDA Approach: