scientific-skills/exploratory-data-analysis/references/general_scientific_formats.md
This reference covers general-purpose scientific data formats used across multiple disciplines.
Description: Binary NumPy array format
Typical Data: N-dimensional arrays of any data type
Use Cases: Fast I/O for numerical data, intermediate results
Python Libraries:
- numpy: np.load('file.npy'), np.save()
- numpy: np.load('file.npy', mmap_mode='r') for memory-mapped access
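
As a rough illustration of the calls listed above, the following sketch saves an array, reloads it, and summarizes it through a memory map. The array contents and the file name `example.npy` are made up for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical example data; any array is handled the same way.
arr = np.arange(12, dtype=np.float64).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npy")
    np.save(path, arr)

    # A regular load reads the whole array into memory.
    loaded = np.load(path)

    # A memory-mapped load reads data from disk on demand,
    # which helps when the file is larger than available RAM.
    mapped = np.load(path, mmap_mode="r")
    summary = {
        "shape": mapped.shape,
        "dtype": str(mapped.dtype),
        "min": float(mapped.min()),
        "max": float(mapped.max()),
        "mean": float(mapped.mean()),
    }
    del mapped  # release the memory map before the directory is removed

print(summary)
```
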
EDA Approach:

Description: Multiple NumPy arrays in one file
Typical Data: Collections of related arrays
Use Cases: Saving multiple arrays together, compressed storage
Python Libraries:
- numpy: np.load('file.npz') returns dict-like object
- numpy: np.savez() or np.savez_compressed()
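
A minimal sketch of inspecting such an archive: it writes two hypothetical arrays (`signal` and `labels`) to an in-memory buffer instead of a file, then lists the stored names and shapes through the dict-like object that `np.load` returns.

```python
import io

import numpy as np

# Write two made-up arrays into an in-memory compressed archive.
buffer = io.BytesIO()
np.savez_compressed(
    buffer,
    signal=np.linspace(0.0, 1.0, 5),
    labels=np.array([0, 1, 1, 0, 1]),
)
buffer.seek(0)

# The loaded object behaves like a read-only dict of arrays.
with np.load(buffer) as archive:
    names = sorted(archive.files)
    shapes = {name: archive[name].shape for name in names}

print(names, shapes)
```
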
EDA Approach:

Description: Plain text tabular data
Typical Data: Experimental measurements, results tables
Use Cases: Universal data exchange, spreadsheet export
Python Libraries:
- pandas: pd.read_csv('file.csv')
- csv: Built-in module
- polars: High-performance CSV reading
- numpy: np.loadtxt() or np.genfromtxt()
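
When pandas is not available, a first look at a CSV file is possible with the built-in `csv` module alone. The sketch below uses made-up measurements held in a string; with a real file, `open(path, newline="")` would take the place of `io.StringIO`.

```python
import csv
import io

# Hypothetical measurements; row B has an empty "ph" cell.
raw = "sample,temperature,ph\nA,21.5,7.1\nB,22.0,\nC,20.5,6.9\n"

rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# Count empty cells per column as a quick missing-value check.
missing = {col: sum(1 for row in rows if row[col] == "") for col in columns}

# The csv module returns strings, so numeric columns need explicit conversion.
temperatures = [float(row["temperature"]) for row in rows]
mean_temperature = sum(temperatures) / len(temperatures)

print(columns, missing, round(mean_temperature, 2))
```
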
EDA Approach:

Description: Tab-delimited tabular data
Typical Data: Similar to CSV but tab-separated
Use Cases: Bioinformatics, text processing output
Python Libraries:
- pandas: pd.read_csv('file.tsv', sep='\t')
EDA Approach:

Description: Microsoft Excel binary/XML formats
Typical Data: Tabular data with formatting, formulas
Use Cases: Lab notebooks, data entry, reports
Python Libraries:
- pandas: pd.read_excel('file.xlsx')
- openpyxl: Full Excel file manipulation
- xlrd: Reading .xls (legacy)
EDA Approach:

Description: Hierarchical text data format
Typical Data: Nested data structures, metadata
Use Cases: API responses, configuration, results
Python Libraries:
- json: Built-in module
- pandas: pd.read_json()
- ujson: Faster JSON parsing
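
For nested JSON, a useful first step is often to map out which paths exist and what type each one holds. The following sketch does this with the built-in `json` module; the document content and the `describe` helper are hypothetical.

```python
import json

# Made-up document standing in for an API response or results file.
raw = (
    '{"experiment": {"id": 7, "tags": ["a", "b"], "params": {"lr": 0.01}},'
    ' "results": [1.0, 2.5]}'
)
data = json.loads(raw)


def describe(node, path="$"):
    """Yield (path, type name) pairs for every node in a parsed JSON value."""
    yield path, type(node).__name__
    if isinstance(node, dict):
        for key, value in node.items():
            yield from describe(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from describe(value, f"{path}[{index}]")


structure = dict(describe(data))
for path, kind in structure.items():
    print(path, kind)
```
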
EDA Approach:

Description: Hierarchical markup format
Typical Data: Structured data with metadata
Use Cases: Standards-based data exchange, APIs
Python Libraries:
- lxml: lxml.etree.parse()
- xml.etree.ElementTree: Built-in XML
- xmltodict: Convert XML to dict
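
A minimal sketch of surveying an XML document with the built-in `xml.etree.ElementTree`: it counts how often each tag occurs and pulls out one numeric field. The element names (`run`, `sample`, `value`) are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document; ET.parse(path).getroot() would be used for a file.
raw = (
    "<run id='1'>"
    "<sample name='A'><value>1.5</value></sample>"
    "<sample name='B'><value>2.5</value></sample>"
    "</run>"
)
root = ET.fromstring(raw)

# Tag frequencies give a quick picture of the document's structure.
tag_counts = Counter(element.tag for element in root.iter())

# Element text is always a string, so convert numeric fields explicitly.
values = [float(element.text) for element in root.iter("value")]
sample_names = [element.get("name") for element in root.iter("sample")]

print(dict(tag_counts), values, sample_names)
```
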
EDA Approach:

Description: Human-readable data serialization
Typical Data: Configuration, metadata, parameters
Use Cases: Experiment configurations, pipelines
Python Libraries:
- yaml (PyYAML): yaml.safe_load() for untrusted input; yaml.load() requires an explicit Loader argument
- ruamel.yaml: YAML 1.2 support
EDA Approach:

Description: Configuration file format
Typical Data: Settings, parameters
Use Cases: Python package configuration, settings
Python Libraries:
- tomli / tomllib: TOML reading (tomllib in Python 3.11+)
- toml: Reading and writing
EDA Approach:

Description: Simple configuration format
Typical Data: Application settings
Use Cases: Legacy configurations, simple settings
Python Libraries:
- configparser: Built-in INI parser
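
The built-in parser mentioned above can be used roughly as follows. The section and key names (`acquisition`, `rate_hz`, and so on) are hypothetical; for a file on disk, `config.read(path)` would replace `read_string`.

```python
import configparser

# Made-up settings standing in for a legacy instrument configuration.
raw = """
[acquisition]
rate_hz = 1000
channels = 4

[output]
directory = results
"""

config = configparser.ConfigParser()
config.read_string(raw)

sections = config.sections()
# Values are stored as strings; typed getters convert on access.
rate_hz = config.getint("acquisition", "rate_hz")
settings = {section: dict(config[section]) for section in sections}

print(sections, rate_hz, settings)
```
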
EDA Approach:

Description: Container for large scientific datasets
Typical Data: Multi-dimensional arrays, metadata, groups
Use Cases: Large datasets, multi-modal data, parallel I/O
Python Libraries:
- h5py: h5py.File('file.h5', 'r')
- pytables: Advanced HDF5 interface
- pandas: HDF5 storage via HDFStore
EDA Approach:

Description: Cloud-optimized chunked arrays
Typical Data: Large N-dimensional arrays
Use Cases: Cloud storage, parallel computing, streaming
Python Libraries:
- zarr: zarr.open('file.zarr')
- xarray: Zarr backend support
EDA Approach:

Description: Compressed data files
Typical Data: Any compressed text or binary
Use Cases: Compression for storage/transfer
Python Libraries:
- gzip: Built-in gzip module
- pandas: Automatic gzip handling in read functions
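
As a small sketch of the built-in module, the example below compresses a made-up text payload in memory, checks the gzip magic bytes, and reads the content back in text mode. For a file on disk, `gzip.open(path, "rt")` works the same way.

```python
import gzip
import io

# Hypothetical payload; a real workflow would start from an existing .gz file.
text = "time,value\n0,1.0\n1,1.5\n"
compressed = gzip.compress(text.encode("utf-8"))

# Every gzip stream begins with the two magic bytes 0x1f 0x8b.
is_gzip = compressed[:2] == b"\x1f\x8b"

# Text mode ("rt") decompresses and decodes in one step.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    lines = handle.read().splitlines()

print(is_gzip, lines)
```
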
EDA Approach:

Description: Bzip2 compression
Typical Data: Highly compressed files
Use Cases: Better compression than gzip
Python Libraries:
- bz2: Built-in bz2 module

Description: Archive with multiple files
Typical Data: Collections of files
Use Cases: File distribution, archiving
Python Libraries:
- zipfile: Built-in ZIP support
- pandas: Can read zipped CSVs
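
Before extracting anything, it is usually enough to list an archive's members and peek at one of them. The sketch below builds a small in-memory ZIP with invented member names and inspects it with `zipfile`.

```python
import io
import zipfile

# Build a hypothetical archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data/a.csv", "x,y\n1,2\n")
    archive.writestr("README.txt", "hypothetical archive")

# Inspect member names and uncompressed sizes without extracting to disk.
with zipfile.ZipFile(buffer) as archive:
    listing = {info.filename: info.file_size for info in archive.infolist()}
    first_line = archive.read("data/a.csv").decode("utf-8").splitlines()[0]

print(listing, first_line)
```
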
EDA Approach:

Description: Unix tape archive
Typical Data: Multiple files and directories
Use Cases: Software distribution, backups
Python Libraries:
- tarfile: Built-in TAR support
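
A minimal sketch of listing a TAR archive with the built-in module. The member name `results/scores.csv` and its content are made up; opening with mode `"r:*"` lets `tarfile` detect the compression on its own.

```python
import io
import tarfile

# Create a small gzip-compressed archive in memory (hypothetical content).
payload = b"run,score\n1,0.9\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    info = tarfile.TarInfo(name="results/scores.csv")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Reopen for reading; "r:*" auto-detects gzip, bzip2, xz, or no compression.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:*") as archive:
    members = [(m.name, m.size, m.isfile()) for m in archive.getmembers()]
    extracted = archive.extractfile("results/scores.csv").read()

print(members)
```
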
EDA Approach:

Description: Audio waveform data
Typical Data: Acoustic signals, audio recordings
Use Cases: Acoustic analysis, ultrasound, signal processing
Python Libraries:
- scipy.io.wavfile: scipy.io.wavfile.read()
- wave: Built-in module
- soundfile: Enhanced audio I/O
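
The built-in `wave` module is enough to read the basic parameters of a 16-bit PCM file. The sketch below synthesizes a short 440 Hz tone (an arbitrary choice for the example), writes it to an in-memory WAV, and reads back the channel count, sample rate, duration, and peak amplitude.

```python
import io
import math
import struct
import wave

# Synthesize 0.1 s of a half-scale 440 Hz sine as 16-bit samples.
sample_rate = 8000
frames = [
    int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(sample_rate // 10)
]

buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)   # mono
    writer.setsampwidth(2)   # 2 bytes per sample = 16-bit PCM
    writer.setframerate(sample_rate)
    writer.writeframes(struct.pack(f"<{len(frames)}h", *frames))

buffer.seek(0)
with wave.open(buffer, "rb") as reader:
    n_channels = reader.getnchannels()
    frame_rate = reader.getframerate()
    n_frames = reader.getnframes()
    # Little-endian signed 16-bit integers, one per frame for mono audio.
    samples = struct.unpack(f"<{n_frames}h", reader.readframes(n_frames))

duration_s = n_frames / frame_rate
peak = max(abs(s) for s in samples)
print(n_channels, frame_rate, duration_s, peak)
```
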
EDA Approach:

Description: MATLAB workspace variables
Typical Data: Arrays, structures, cells
Use Cases: MATLAB-Python interoperability
Python Libraries:
- scipy.io: scipy.io.loadmat()
- h5py: For MATLAB v7.3 files (HDF5-based)
- mat73: Pure Python for v7.3
EDA Approach:

Description: Time series data (especially medical)
Typical Data: EEG, physiological signals
Use Cases: Medical signal storage
Python Libraries:
- pyedflib: EDF/EDF+ reading and writing
- mne: Neurophysiology data (supports EDF)
EDA Approach:

Description: CSV with timestamp column
Typical Data: Time-indexed measurements
Use Cases: Sensor data, monitoring, experiments
Python Libraries:
- pandas: pd.read_csv() with parse_dates
EDA Approach:

Description: Geospatial vector data
Typical Data: Geographic features (points, lines, polygons)
Use Cases: GIS analysis, spatial data
Python Libraries:
- geopandas: gpd.read_file('file.shp')
- fiona: Lower-level shapefile access
- pyshp: Pure Python shapefile reader
EDA Approach:

Description: JSON format for geographic data
Typical Data: Features with geometry and properties
Use Cases: Web mapping, spatial analysis
Python Libraries:
- geopandas: Native GeoJSON support
- json: Parse as JSON then process
EDA Approach:

Description: GeoTIFF with spatial reference
Typical Data: Satellite imagery, DEMs, rasters
Use Cases: Remote sensing, terrain analysis
Python Libraries:
- rasterio: rasterio.open('file.tif')
- gdal: Geospatial Data Abstraction Library
- xarray with rioxarray: N-D geospatial arrays
EDA Approach:

Description: Self-describing array-based data
Typical Data: Climate, atmospheric, oceanographic data
Use Cases: Scientific datasets, model output
Python Libraries:
- netCDF4: netCDF4.Dataset('file.nc')
- xarray: xr.open_dataset('file.nc')
EDA Approach:

Description: Meteorological data format
Typical Data: Weather forecasts, climate data
Use Cases: Numerical weather prediction
Python Libraries:
- pygrib: GRIB file reading
- xarray with cfgrib: GRIB to xarray
EDA Approach:

Description: Older HDF format
Typical Data: NASA Earth Science data
Use Cases: Satellite data (MODIS, etc.)
Python Libraries:
- pyhdf: HDF4 access
- gdal: Can read HDF4
EDA Approach:

Description: Astronomy data format
Typical Data: Images, tables, spectra from telescopes
Use Cases: Astronomical observations
Python Libraries:
- astropy.io.fits: fits.open('file.fits')
- fitsio: Alternative FITS library
EDA Approach:

Description: Next-gen data format for astronomy
Typical Data: Complex hierarchical scientific data
Use Cases: James Webb Space Telescope data
Python Libraries:
- asdf: asdf.open('file.asdf')
EDA Approach:

Description: CERN ROOT framework format
Typical Data: High-energy physics data
Use Cases: Particle physics experiments
Python Libraries:
- uproot: Pure Python ROOT reading
- ROOT: Official PyROOT bindings
EDA Approach:

Description: Generic text-based data
Typical Data: Tab/space-delimited, custom formats
Use Cases: Simple data exchange, logs
Python Libraries:
- pandas: pd.read_csv() with custom delimiters
- numpy: np.loadtxt(), np.genfromtxt()

Description: Binary or text data
Typical Data: Instrument output, custom formats
Use Cases: Various scientific instruments
Python Libraries:
- numpy: np.fromfile() for binary
- struct: Parse binary structures
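
Parsing a binary instrument file with `struct` generally means decoding a fixed header first and then using it to read the body. The layout below (a 4-byte magic string, a 16-bit version, a 32-bit sample count, then 32-bit floats) is entirely hypothetical; a real format's layout has to come from its documentation.

```python
import struct

# Hypothetical little-endian header: magic (4 bytes), version (uint16),
# number of samples (uint32). "<" disables alignment padding.
HEADER = struct.Struct("<4sHI")

# Build a made-up file image: header followed by three float32 samples.
blob = HEADER.pack(b"DEMO", 2, 3) + struct.pack("<3f", 0.5, 1.5, 2.5)

# Decode the header, then use its sample count to decode the body.
magic, version, n_samples = HEADER.unpack_from(blob, 0)
samples = struct.unpack_from(f"<{n_samples}f", blob, HEADER.size)

print(magic, version, n_samples, samples)
```

Checking the magic bytes and comparing the expected size (`HEADER.size + 4 * n_samples`) against the actual file length are inexpensive sanity checks before trusting the decoded values.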
EDA Approach:

Description: Text logs from software/instruments
Typical Data: Timestamped events, messages
Use Cases: Troubleshooting, experiment tracking
Python Libraries:
- pandas: Structured log parsing
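
Before handing logs to pandas, the lines have to be split into fields. The sketch below does that step with the standard library only (`re` and `datetime`, which are a substitution for illustration rather than something the list above names). The log format and the messages are assumed; the resulting list of dicts could then be passed to `pd.DataFrame`.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message".
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)

# Hypothetical log excerpt, including one line that does not match.
raw = """2024-01-01 10:00:00 INFO acquisition started
2024-01-01 10:00:05 WARNING detector temperature high
2024-01-01 10:00:09 ERROR shutter timeout
malformed line
2024-01-01 10:00:12 INFO acquisition stopped"""

records, skipped = [], 0
for line in raw.splitlines():
    match = LINE.match(line)
    if match is None:
        skipped += 1  # keep a count of unparseable lines instead of failing
        continue
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S")
    records.append(record)

level_counts = Counter(record["level"] for record in records)
span_s = (records[-1]["ts"] - records[0]["ts"]).total_seconds()
print(dict(level_counts), skipped, span_s)
```
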