scientific-skills/matchms/references/importing_exporting.md
This document details all file format support in matchms for loading and saving mass spectrometry data.
Matchms provides dedicated functions for loading spectra from various file formats. All import functions return generators for memory-efficient processing of large files.
from matchms.importing import load_from_mgf
# Load spectra (returns generator)
spectra_generator = load_from_mgf("spectra.mgf")
# Convert to list for processing
spectra = list(spectra_generator)
Function: load_from_mgf(filename, metadata_harmonization=True)
Description: Loads spectra from MGF files, a common format for mass spectrometry data exchange.
Parameters:
filename (str): Path to MGF filemetadata_harmonization (bool, default=True): Apply automatic metadata key harmonizationExample:
from matchms.importing import load_from_mgf
# Load with metadata harmonization
spectra = list(load_from_mgf("data.mgf"))
# Load without harmonization
spectra = list(load_from_mgf("data.mgf", metadata_harmonization=False))
MGF Format: Text-based format with BEGIN IONS/END IONS blocks containing metadata and peak lists.
Function: load_from_msp(filename, metadata_harmonization=True)
Description: Loads spectra from MSP files, commonly used for spectral libraries.
Parameters:
filename (str): Path to MSP filemetadata_harmonization (bool, default=True): Apply automatic metadata harmonizationExample:
from matchms.importing import load_from_msp
spectra = list(load_from_msp("library.msp"))
MSP Format: Text-based format with Name/MW/Comment fields followed by peak lists.
Function: load_from_mzml(filename, ms_level=2, metadata_harmonization=True)
Description: Loads spectra from mzML files, the standard XML-based format for raw mass spectrometry data.
Parameters:
filename (str): Path to mzML filems_level (int, default=2): MS level to extract (1 for MS1, 2 for MS2/tandem)metadata_harmonization (bool, default=True): Apply automatic metadata harmonizationExample:
from matchms.importing import load_from_mzml
# Load MS2 spectra (default)
ms2_spectra = list(load_from_mzml("data.mzML"))
# Load MS1 spectra
ms1_spectra = list(load_from_mzml("data.mzML", ms_level=1))
mzML Format: XML-based standard format containing raw instrument data and rich metadata.
Function: load_from_mzxml(filename, ms_level=2, metadata_harmonization=True)
Description: Loads spectra from mzXML files, an earlier XML-based format for mass spectrometry data.
Parameters:
filename (str): Path to mzXML filems_level (int, default=2): MS level to extractmetadata_harmonization (bool, default=True): Apply automatic metadata harmonizationExample:
from matchms.importing import load_from_mzxml
spectra = list(load_from_mzxml("data.mzXML"))
mzXML Format: XML-based format, predecessor to mzML.
Function: load_from_json(filename, metadata_harmonization=True)
Description: Loads spectra from JSON files, particularly GNPS-compatible JSON format.
Parameters:
filename (str): Path to JSON filemetadata_harmonization (bool, default=True): Apply automatic metadata harmonizationExample:
from matchms.importing import load_from_json
spectra = list(load_from_json("spectra.json"))
JSON Format: Structured JSON with spectrum metadata and peak arrays.
Function: load_from_pickle(filename)
Description: Loads previously saved matchms Spectrum objects from pickle files. Fast loading of preprocessed spectra.
Parameters:
filename (str): Path to pickle fileExample:
from matchms.importing import load_from_pickle
spectra = list(load_from_pickle("processed_spectra.pkl"))
Use case: Saving and loading preprocessed spectra for faster subsequent analyses.
Function: load_from_usi(usi)
Description: Loads a single spectrum from a metabolomics USI reference.
Parameters:
usi (str): Universal Spectrum Identifier stringExample:
from matchms.importing import load_from_usi
usi = "mzspec:GNPS:TASK-...:spectrum..."
spectrum = load_from_usi(usi)
USI Format: Standardized identifier for accessing spectra from online repositories.
Matchms provides functions to save processed spectra to various formats for sharing and archival.
Function: save_as_mgf(spectra, filename, write_mode='w')
Description: Saves spectra to MGF format.
Parameters:
spectra (list): List of Spectrum objects to savefilename (str): Output file pathwrite_mode (str, default='w'): File write mode ('w' for write, 'a' for append)Example:
from matchms.exporting import save_as_mgf
save_as_mgf(processed_spectra, "output.mgf")
Function: save_as_msp(spectra, filename, write_mode='w')
Description: Saves spectra to MSP format.
Parameters:
spectra (list): List of Spectrum objects to savefilename (str): Output file pathwrite_mode (str, default='w'): File write modeExample:
from matchms.exporting import save_as_msp
save_as_msp(library_spectra, "library.msp")
Function: save_as_json(spectra, filename, write_mode='w')
Description: Saves spectra to JSON format (GNPS-compatible).
Parameters:
spectra (list): List of Spectrum objects to savefilename (str): Output file pathwrite_mode (str, default='w'): File write modeExample:
from matchms.exporting import save_as_json
save_as_json(spectra, "spectra.json")
Function: save_as_pickle(spectra, filename)
Description: Saves spectra as Python pickle file. Preserves all Spectrum attributes and is fastest for loading.
Parameters:
spectra (list): List of Spectrum objects to savefilename (str): Output file pathExample:
from matchms.exporting import save_as_pickle
save_as_pickle(processed_spectra, "processed.pkl")
Advantages:
Disadvantages:
from matchms.importing import load_from_mgf
from matchms.exporting import save_as_mgf, save_as_pickle
from matchms.filtering import default_filters, normalize_intensities
from matchms.filtering import select_by_relative_intensity
# Load raw spectra
spectra = list(load_from_mgf("raw_data.mgf"))
# Process spectra
processed = []
for spectrum in spectra:
spectrum = default_filters(spectrum)
spectrum = normalize_intensities(spectrum)
spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01)
if spectrum is not None:
processed.append(spectrum)
# Save processed spectra (MGF for sharing)
save_as_mgf(processed, "processed_data.mgf")
# Save as pickle for fast reloading
save_as_pickle(processed, "processed_data.pkl")
from matchms.importing import load_from_mzml
from matchms.exporting import save_as_mgf, save_as_msp
# Convert mzML to MGF
spectra = list(load_from_mzml("data.mzML", ms_level=2))
save_as_mgf(spectra, "data.mgf")
# Convert to MSP library format
save_as_msp(spectra, "data.msp")
from matchms.importing import load_from_mgf
import glob
# Load all MGF files in directory
all_spectra = []
for mgf_file in glob.glob("data/*.mgf"):
spectra = list(load_from_mgf(mgf_file))
all_spectra.extend(spectra)
print(f"Loaded {len(all_spectra)} spectra from multiple files")
from matchms.importing import load_from_mgf
from matchms.exporting import save_as_mgf
from matchms.filtering import default_filters, normalize_intensities
# Process large file without loading all into memory
def process_spectrum(spectrum):
spectrum = default_filters(spectrum)
spectrum = normalize_intensities(spectrum)
return spectrum
# Stream processing
with open("output.mgf", 'w') as outfile:
for spectrum in load_from_mgf("large_file.mgf"):
processed = process_spectrum(spectrum)
if processed is not None:
# Write immediately without storing in memory
save_as_mgf([processed], outfile, write_mode='a')
MGF:
MSP:
JSON:
Pickle:
mzML/mzXML:
The metadata_harmonization parameter (available in most import functions) automatically standardizes metadata keys:
# Without harmonization
spectrum = load_from_mgf("data.mgf", metadata_harmonization=False)
# May have: "PRECURSOR_MZ", "Precursor_mz", "precursormz"
# With harmonization (default)
spectrum = load_from_mgf("data.mgf", metadata_harmonization=True)
# Standardized to: "precursor_mz"
Recommended: Keep harmonization enabled (default) for consistent metadata access across different data sources.
For detailed format specifications:
For complete API documentation: https://matchms.readthedocs.io/en/latest/api/matchms.importing.html https://matchms.readthedocs.io/en/latest/api/matchms.exporting.html