scientific-skills/citation-management/references/metadata_extraction.md
Comprehensive guide to extracting accurate citation metadata from DOIs, PMIDs, arXiv IDs, and URLs using various APIs and services.
Accurate metadata is essential for proper citations. This guide covers:
Format: 10.XXXX/suffix
Examples:
10.1038/s41586-021-03819-2 # Nature article
10.1126/science.aam9317 # Science article
10.1016/j.cell.2023.01.001 # Cell article
10.1371/journal.pone.0123456 # PLOS ONE article
Properties:
Where to find:
Format: 8-digit number (typically)
Examples:
34265844
28445112
35476778
Properties:
Where to find:
Format: PMC followed by numbers
Examples:
PMC8287551
PMC7456789
Properties:
Format: YYMM.NNNNN or archive/YYMMNNN
Examples:
2103.14030 # New format (since 2007)
2401.12345 # 2024 submission
arXiv:hep-th/9901001 # Old format
Properties:
Where to find:
ISBN (Books):
978-0-12-345678-9
0-123-45678-9
arXiv category:
cs.LG # Computer Science - Machine Learning
q-bio.QM # Quantitative Biology - Quantitative Methods
math.ST # Mathematics - Statistics
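The identifier formats above can be told apart with a few regular expressions. The following is a minimal sketch, not a full validator: the function name `classify_identifier` and the exact patterns are assumptions made for illustration, and a bare number is simply treated as a PMID.

```python
import re

# Permissive patterns (illustrative assumptions, not official grammars).
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
ARXIV_NEW_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")              # e.g. 2103.14030
ARXIV_OLD_RE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")  # e.g. hep-th/9901001
PMID_RE = re.compile(r"^\d{1,8}$")


def classify_identifier(raw):
    """Return 'doi', 'pmcid', 'arxiv', 'pmid', or 'unknown' for one identifier."""
    text = raw.strip()
    # An explicit "arXiv:" prefix settles the type; strip it before matching.
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
        if ARXIV_NEW_RE.match(text) or ARXIV_OLD_RE.match(text):
            return "arxiv"
        return "unknown"
    if DOI_RE.match(text):
        return "doi"
    if PMCID_RE.match(text):
        return "pmcid"
    if ARXIV_NEW_RE.match(text) or ARXIV_OLD_RE.match(text):
        return "arxiv"
    if PMID_RE.match(text):
        return "pmid"
    return "unknown"
```

The order of the checks matters: the DOI and arXiv patterns are tested before the PMID pattern so that a purely numeric fallback does not shadow them.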
Primary source for DOIs - Most comprehensive metadata for journal articles.
Base URL: https://api.crossref.org/works/
No API key required, but polite pool recommended:
Request:
GET https://api.crossref.org/works/10.1038/s41586-021-03819-2
Response (simplified):
{
"message": {
"DOI": "10.1038/s41586-021-03819-2",
"title": ["Article title here"],
"author": [
{"given": "John", "family": "Smith"},
{"given": "Jane", "family": "Doe"}
],
"container-title": ["Nature"],
"volume": "595",
"issue": "7865",
"page": "123-128",
"published-print": {"date-parts": [[2021, 7, 1]]},
"publisher": "Springer Nature",
"type": "journal-article",
"ISSN": ["0028-0836"]
}
}
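A response shaped like the one above could be fetched and flattened roughly as follows. This is a sketch under stated assumptions: the helper names (`crossref_url`, `fetch_crossref`, `summarize_message`) are hypothetical and are not taken from `extract_metadata.py`. The `mailto` query parameter is how a request identifies itself for CrossRef's polite pool.

```python
import json
import urllib.parse
import urllib.request


def crossref_url(doi, mailto=None):
    """Build the works URL; passing mailto opts into the polite pool."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")
    if mailto:
        url += "?" + urllib.parse.urlencode({"mailto": mailto})
    return url


def fetch_crossref(doi, mailto=None):
    """Perform the GET request and return the 'message' object."""
    with urllib.request.urlopen(crossref_url(doi, mailto), timeout=30) as resp:
        return json.load(resp)["message"]


def summarize_message(message):
    """Flatten the fields shown above into a plain dict."""
    # Authors without a family name (e.g. organizations) yield an empty string.
    authors = [
        f"{a.get('family', '')}, {a.get('given', '')}".strip(", ")
        for a in message.get("author", [])
    ]
    # Prefer the print date, fall back to the online date.
    date = message.get("published-print") or message.get("published-online") or {}
    parts = date.get("date-parts") or [[None]]
    return {
        "doi": message.get("DOI"),
        "title": (message.get("title") or [""])[0],
        "authors": authors,
        "journal": (message.get("container-title") or [""])[0],
        "year": parts[0][0] if parts[0] else None,
        "volume": message.get("volume"),
        "issue": message.get("issue"),
        "pages": message.get("page"),
        "type": message.get("type"),
    }
```

Array-valued fields such as `title` and `container-title` are reduced to their first element, which is a simplification; some records carry several titles.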
Always present:
DOI: Digital Object Identifier
title: Article title (array)
type: Content type (journal-article, book-chapter, etc.)
Usually present:
author: Array of author objects
container-title: Journal/book title
published-print or published-online: Publication date
volume, issue, page: Publication details
publisher: Publisher name
Sometimes present:
abstract: Article abstract
subject: Subject categories
ISSN: Journal ISSN
ISBN: Book ISBN
reference: Reference list
is-referenced-by-count: Citation count
CrossRef type field values:
journal-article: Journal articles
book-chapter: Book chapters
book: Books
proceedings-article: Conference papers
posted-content: Preprints
dataset: Research datasets
report: Technical reports
dissertation: Theses/dissertations
Specialized for biomedical literature - Curated metadata with MeSH terms.
Base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
API key recommended (free):
Step 1: EFetch for full record
GET https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?
db=pubmed&
id=34265844&
retmode=xml&
api_key=YOUR_KEY
Response: XML with comprehensive metadata
Step 2: Parse XML
Key fields:
<PubmedArticle>
<MedlineCitation>
<PMID>34265844</PMID>
<Article>
<ArticleTitle>Title here</ArticleTitle>
<AuthorList>
<Author><LastName>Smith</LastName><ForeName>John</ForeName></Author>
</AuthorList>
<Journal>
<Title>Nature</Title>
<JournalIssue>
<Volume>595</Volume>
<Issue>7865</Issue>
<PubDate><Year>2021</Year></PubDate>
</JournalIssue>
</Journal>
<Pagination><MedlinePgn>123-128</MedlinePgn></Pagination>
<Abstract><AbstractText>Abstract text here</AbstractText></Abstract>
</Article>
</MedlineCitation>
<PubmedData>
<ArticleIdList>
<ArticleId IdType="doi">10.1038/s41586-021-03819-2</ArticleId>
<ArticleId IdType="pmc">PMC8287551</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
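The two steps above (build the EFetch request, then parse the XML) might look like this with only the standard library. The function names are illustrative assumptions, and the sketch handles only the elements shown in the sample record.

```python
import urllib.parse
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmid, api_key=None):
    """Build the EFetch request URL; api_key is optional."""
    params = {"db": "pubmed", "id": str(pmid), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return EFETCH_URL + "?" + urllib.parse.urlencode(params)


def _text(node, path):
    """Return the flattened text of a descendant element, or None if absent."""
    found = node.find(path)
    if found is None:
        return None
    # itertext() keeps text nested inside inline markup such as <i> or <sub>.
    return "".join(found.itertext()).strip()


def parse_pubmed_article(article):
    """Map one <PubmedArticle> element to a flat dict of citation fields."""
    base = "MedlineCitation/Article/"
    authors = []
    for author in article.findall(base + "AuthorList/Author"):
        last = author.findtext("LastName")
        fore = author.findtext("ForeName")
        if last:  # entries without <LastName> are skipped in this sketch
            authors.append(f"{last}, {fore}" if fore else last)
    # Read only the article's own ID list, addressed by its full path.
    ids = {
        node.get("IdType"): (node.text or "").strip()
        for node in article.findall("PubmedData/ArticleIdList/ArticleId")
    }
    issue = base + "Journal/JournalIssue/"
    return {
        "pmid": _text(article, "MedlineCitation/PMID"),
        "title": _text(article, base + "ArticleTitle"),
        "authors": authors,
        "journal": _text(article, base + "Journal/Title"),
        "year": _text(article, issue + "PubDate/Year"),
        "volume": _text(article, issue + "Volume"),
        "issue": _text(article, issue + "Issue"),
        "pages": _text(article, base + "Pagination/MedlinePgn"),
        "doi": ids.get("doi"),
        "pmcid": ids.get("pmc"),
    }
```

Typical use would be `parse_pubmed_article(ET.fromstring(xml_text))` when the root element is a single `<PubmedArticle>`, or a loop over `root.iter("PubmedArticle")` for a multi-record response.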
MeSH Terms: Controlled vocabulary
<MeshHeadingList>
<MeshHeading>
<DescriptorName UI="D003920">Diabetes Mellitus</DescriptorName>
</MeshHeading>
</MeshHeadingList>
Publication Types:
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D016449">Randomized Controlled Trial</PublicationType>
</PublicationTypeList>
Grant Information:
<GrantList>
<Grant>
<GrantID>R01-123456</GrantID>
<Agency>NIAID NIH HHS</Agency>
<Country>United States</Country>
</Grant>
</GrantList>
Preprints in physics, math, CS, q-bio - Free, open access.
Base URL: http://export.arxiv.org/api/query
No API key required
Request:
GET http://export.arxiv.org/api/query?id_list=2103.14030
Response: Atom XML (field values below are illustrative)
<entry>
<id>http://arxiv.org/abs/2103.14030v2</id>
<title>Highly accurate protein structure prediction with AlphaFold</title>
<author><name>John Jumper</name></author>
<author><name>Richard Evans</name></author>
<published>2021-03-26T17:47:17Z</published>
<updated>2021-07-01T16:51:46Z</updated>
<summary>Abstract text here...</summary>
<arxiv:doi>10.1038/s41586-021-03819-2</arxiv:doi>
<category term="q-bio.BM" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
</entry>
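An Atom entry like the one above can be parsed with `xml.etree.ElementTree`, provided the namespaces are supplied. In a full API response the `<feed>` root declares the Atom namespace as the default and binds the `arxiv:` prefix; the sketch below assumes that, and the function name `parse_arxiv_entry` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URIs used by arXiv Atom feeds.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def parse_arxiv_entry(entry):
    """Map one Atom <entry> element to a flat dict of preprint fields."""

    def text(path):
        value = entry.findtext(path, default=None, namespaces=NS)
        # Collapse the line breaks arXiv inserts into long titles and abstracts.
        return " ".join(value.split()) if value else None

    abs_url = text("atom:id") or ""
    # Split ".../abs/2103.14030v2" into the bare ID and the version suffix.
    match = re.search(r"abs/(.+?)(v\d+)?$", abs_url)
    return {
        "arxiv_id": match.group(1) if match else None,
        "version": match.group(2) if match else None,
        "title": text("atom:title"),
        "authors": [
            " ".join(n.text.split())
            for n in entry.findall("atom:author/atom:name", NS)
            if n.text
        ],
        "published": text("atom:published"),
        "updated": text("atom:updated"),
        "abstract": text("atom:summary"),
        # Present only when a published version has been linked.
        "doi": text("arxiv:doi"),
        "journal_ref": text("arxiv:journal_ref"),
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
    }
```

A non-empty `doi` or `journal_ref` in the result is the signal to look up and cite the published version instead of the preprint.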
id: arXiv URL
title: Preprint title
author: Author list
published: First version date
updated: Latest version date
summary: Abstract
arxiv:doi: DOI if published
arxiv:journal_ref: Journal reference if published
category: arXiv categories
arXiv tracks versions:
v1: Initial submission
v2, v3, etc.: Revisions
Always check if preprint has been published in journal (use DOI if available).
Research datasets, software, other outputs - Assigns DOIs to non-traditional scholarly works.
Base URL: https://api.datacite.org/dois/
Similar to CrossRef but for datasets, software, code, etc.
Request:
GET https://api.datacite.org/dois/10.5281/zenodo.1234567
Response: JSON with metadata for dataset/software
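As a rough illustration, a DataCite response could be mapped onto `@misc` fields as below. This sketch assumes the JSON:API layout in which metadata sits under `data.attributes` (with `creators`, `titles`, `publisher`, `publicationYear`, `version`); check a live response before relying on these key names. The function name is hypothetical.

```python
def datacite_to_misc_fields(payload):
    """Turn a DataCite /dois/<doi> JSON payload into BibTeX @misc fields."""
    attrs = payload.get("data", {}).get("attributes", {})
    creators = [c["name"] for c in attrs.get("creators", []) if c.get("name")]
    titles = attrs.get("titles") or [{}]
    publisher = attrs.get("publisher")
    # Assumption: publisher may be a plain string or an object with a "name".
    if isinstance(publisher, dict):
        publisher = publisher.get("name")
    year = attrs.get("publicationYear")
    fields = {
        "author": " and ".join(creators),
        "title": titles[0].get("title", ""),
        "year": str(year) if year else "",
        "howpublished": publisher or "",
        "doi": attrs.get("doi", ""),
    }
    if attrs.get("version"):
        fields["note"] = f"Version {attrs['version']}"
    # Drop empty values so they are not written into the entry.
    return {name: value for name, value in fields.items() if value}
```

The repository name (for example Zenodo) usually arrives in `publisher`, which is why it is mapped to `howpublished` here.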
Required:
author: Author names
title: Article title
journal: Journal name
year: Publication year
Optional but recommended:
volume: Volume number
number: Issue number
pages: Page range (e.g., 123--145)
doi: Digital Object Identifier
url: URL if no DOI
month: Publication month
Example:
@article{Smith2024,
author = {Smith, John and Doe, Jane},
title = {Novel Approach to Protein Folding},
journal = {Nature},
year = {2024},
volume = {625},
number = {8001},
pages = {123--145},
doi = {10.1038/nature12345}
}
Required:
author or editor: Author(s) or editor(s)
title: Book title
publisher: Publisher name
year: Publication year
Optional but recommended:
edition: Edition number (if not first)
address: Publisher location
isbn: ISBN
url: URL
series: Series name
Example:
@book{Kumar2021,
author = {Kumar, Vinay and Abbas, Abul K. and Aster, Jon C.},
title = {Robbins and Cotran Pathologic Basis of Disease},
publisher = {Elsevier},
year = {2021},
edition = {10},
isbn = {978-0-323-53113-9}
}
Required:
author: Author names
title: Paper title
booktitle: Conference/proceedings name
year: Year
Optional but recommended:
pages: Page range
organization: Organizing body
publisher: Publisher
address: Conference location
month: Conference month
doi: DOI if available
Example:
@inproceedings{Vaswani2017,
author = {Vaswani, Ashish and Shazeer, Noam and others},
title = {Attention is All You Need},
booktitle = {Advances in Neural Information Processing Systems},
year = {2017},
pages = {5998--6008},
volume = {30}
}
Required:
author: Chapter author(s)
title: Chapter title
booktitle: Book title
publisher: Publisher name
year: Publication year
Optional but recommended:
editor: Book editor(s)
pages: Chapter page range
chapter: Chapter number
edition: Edition
address: Publisher location
Example:
@incollection{Brown2020,
author = {Brown, Patrick O. and Botstein, David},
title = {Exploring the New World of the Genome with {DNA} Microarrays},
booktitle = {DNA Microarrays: A Molecular Cloning Manual},
editor = {Eisen, Michael B. and Brown, Patrick O.},
publisher = {Cold Spring Harbor Laboratory Press},
year = {2020},
pages = {1--45}
}
Required:
author: Author name
title: Thesis title
school: Institution
year: Year
Optional:
type: Type (e.g., "PhD dissertation")
address: Institution location
month: Month
url: URL
Example:
@phdthesis{Johnson2023,
author = {Johnson, Mary L.},
title = {Novel Approaches to Cancer Immunotherapy},
school = {Stanford University},
year = {2023},
type = {{PhD} dissertation}
}
Required:
author: Author(s)
title: Title
year: Year
For preprints, add:
howpublished: Repository (e.g., "bioRxiv")
doi: Preprint DOI
note: Preprint ID
Example (preprint):
@misc{Zhang2024,
author = {Zhang, Yi and Chen, Li and Wang, Hui},
title = {Novel Therapeutic Targets in Alzheimer's Disease},
year = {2024},
howpublished = {bioRxiv},
doi = {10.1101/2024.01.001},
note = {Preprint}
}
Example (software):
@misc{AlphaFold2021,
author = {DeepMind},
title = {{AlphaFold} Protein Structure Database},
year = {2021},
howpublished = {Software},
url = {https://alphafold.ebi.ac.uk/},
doi = {10.5281/zenodo.5123456}
}
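The required-field lists for the entry types above lend themselves to a small lookup table. The sketch below encodes them and adds a simple formatter; both function names are illustrative assumptions and are unrelated to the bundled scripts. The two-space indentation in the output is a stylistic choice.

```python
# Required fields per entry type, as listed above.
# A tuple with several names means "at least one of these".
REQUIRED_FIELDS = {
    "article": [("author",), ("title",), ("journal",), ("year",)],
    "book": [("author", "editor"), ("title",), ("publisher",), ("year",)],
    "inproceedings": [("author",), ("title",), ("booktitle",), ("year",)],
    "incollection": [("author",), ("title",), ("booktitle",), ("publisher",), ("year",)],
    "phdthesis": [("author",), ("title",), ("school",), ("year",)],
    "misc": [("author",), ("title",), ("year",)],
}


def missing_fields(entry_type, fields):
    """List required fields that are absent or empty for this entry type."""
    missing = []
    for alternatives in REQUIRED_FIELDS[entry_type]:
        if not any(fields.get(name) for name in alternatives):
            missing.append("/".join(alternatives))
    return missing


def format_entry(entry_type, key, fields):
    """Render a BibTeX entry, skipping empty values."""
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@{entry_type}{{{key},\n{body}\n}}"
```

For example, `missing_fields("article", {"author": "Smith, John", "title": "T"})` reports `journal` and `year`, which is a cheap check to run before writing an entry to a `.bib` file.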
Best practice - Most reliable source:
# Single DOI
python scripts/extract_metadata.py --doi 10.1038/s41586-021-03819-2
# Multiple DOIs
python scripts/extract_metadata.py \
--doi 10.1038/nature12345 \
--doi 10.1126/science.abc1234 \
--output refs.bib
Process:
For biomedical literature:
# Single PMID
python scripts/extract_metadata.py --pmid 34265844
# Multiple PMIDs
python scripts/extract_metadata.py \
--pmid 34265844 \
--pmid 28445112 \
--output refs.bib
Process:
For preprints:
python scripts/extract_metadata.py --arxiv 2103.14030
Process:
Important: Always check if preprint has been published!
When you only have URL:
python scripts/extract_metadata.py \
--url "https://www.nature.com/articles/s41586-021-03819-2"
Process:
URL patterns:
# DOI URLs
https://doi.org/10.1038/nature12345
https://dx.doi.org/10.1126/science.abc123
https://www.nature.com/articles/s41586-021-03819-2
# PubMed URLs
https://pubmed.ncbi.nlm.nih.gov/34265844/
https://www.ncbi.nlm.nih.gov/pubmed/34265844
# arXiv URLs
https://arxiv.org/abs/2103.14030
https://arxiv.org/pdf/2103.14030.pdf
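For the URL patterns above, the identifier can often be read straight from the URL. The following sketch covers only the DOI resolver, PubMed, and arXiv hosts listed; publisher pages such as nature.com return `None` here because they need a page lookup. The function name `identifier_from_url` is an assumption for illustration.

```python
import re
from urllib.parse import unquote, urlparse


def identifier_from_url(url):
    """Return (kind, identifier) for a recognized URL, else None."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    path = unquote(parsed.path)

    # DOI resolver: everything after the host is the DOI.
    if host in {"doi.org", "dx.doi.org"} and path.strip("/"):
        return ("doi", path.lstrip("/"))

    # PubMed, current and legacy URL layouts.
    if host == "pubmed.ncbi.nlm.nih.gov":
        match = re.match(r"^/(\d+)/?$", path)
        if match:
            return ("pmid", match.group(1))
    if host in {"www.ncbi.nlm.nih.gov", "ncbi.nlm.nih.gov"}:
        match = re.match(r"^/pubmed/(\d+)/?$", path)
        if match:
            return ("pmid", match.group(1))

    # arXiv abstract or PDF link; version suffix and ".pdf" are dropped.
    if host in {"arxiv.org", "www.arxiv.org"}:
        match = re.match(r"^/(?:abs|pdf)/(.+?)(?:v\d+)?(?:\.pdf)?$", path)
        if match:
            return ("arxiv", match.group(1))

    return None
```

Dropping the version suffix means the result refers to the latest version of the preprint; keep the suffix instead if a specific version must be cited.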
From file with mixed identifiers:
# Create file with one identifier per line
# identifiers.txt:
# 10.1038/nature12345
# 34265844
# 2103.14030
# https://doi.org/10.1126/science.abc123
python scripts/extract_metadata.py \
--input identifiers.txt \
--output references.bib
Process:
Issue: Preprint cited, but journal version now available.
Solution:
Example:
% Originally: arXiv:2103.14030
% Published as:
@article{Jumper2021,
author = {Jumper, John and Evans, Richard and others},
title = {Highly Accurate Protein Structure Prediction with {AlphaFold}},
journal = {Nature},
year = {2021},
volume = {596},
pages = {583--589},
doi = {10.1038/s41586-021-03819-2}
}
Issue: Many authors (10+).
BibTeX practice:
Example:
@article{LargeCollaboration2024,
author = {First, Author and Second, Author and Third, Author and others},
...
}
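Truncating a long author list as in the example above can be automated. This is a minimal sketch: the function name and both thresholds (truncate at 10 or more authors, keep the first three) are assumptions drawn from the "10+" wording and the example, not a rule of BibTeX itself.

```python
def bibtex_author_field(authors, threshold=10, keep_first=3):
    """Join authors with ' and '; shorten long lists with 'and others'."""
    authors = list(authors)
    if len(authors) < threshold:
        return " and ".join(authors)
    # BibTeX styles render the literal name "others" as "et al.".
    return " and ".join(authors[:keep_first] + ["others"])
```

Journals differ on how many names must be listed, so the thresholds are best treated as per-project settings.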
Issue: Authors publish under different name formats.
Standardization:
# Common variations
John Smith
John A. Smith
John Andrew Smith
J. A. Smith
Smith, J.
Smith, J. A.
# BibTeX format (recommended)
author = {Smith, John A.}
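Converting the variations above to the recommended "Family, Given" form can be sketched as follows. The function name is hypothetical, and the approach is deliberately naive: it treats the last word as the family name, so multi-word surnames ("van der Berg"), suffixes ("Jr."), and organization names need separate handling.

```python
def to_bibtex_name(name):
    """Normalize 'John A. Smith' or 'Smith, J. A.' to 'Family, Given' form."""
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name:
        # Already in 'Family, Given' order; just tidy the spacing.
        family, _, given = name.partition(",")
        family, given = family.strip(), given.strip()
        return f"{family}, {given}" if given else family
    parts = name.split(" ")
    if len(parts) == 1:
        return parts[0]
    # Naive assumption: the final token is the family name.
    return f"{parts[-1]}, {' '.join(parts[:-1])}"
```

Normalizing the format does not merge "J. A. Smith" and "John Andrew Smith" into one person; it only makes the field syntactically consistent.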
Extraction preference:
Issue: Older papers or books without DOIs.
Solutions:
Example:
@article{OldPaper1995,
author = {Author, Name},
title = {Title Here},
journal = {Journal Name},
year = {1995},
volume = {123},
pages = {45--67},
url = {https://stable-url-here},
note = {PMID: 12345678}
}
Issue: Same work published in both.
Best practice:
If citing conference:
@inproceedings{Smith2024conf,
author = {Smith, John},
title = {Title},
booktitle = {Proceedings of NeurIPS 2024},
year = {2024}
}
If citing journal:
@article{Smith2024journal,
author = {Smith, John},
title = {Title},
journal = {Journal of Machine Learning Research},
year = {2024}
}
Extract correctly:
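@incollection: a single chapter within a book
@book: the book as a whole
editor field: the book's editor(s)
author field: the chapter's author(s)
For datasets and software, use @misc with appropriate fields: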
@misc{DatasetName2024,
author = {Author, Name},
title = {Dataset Title},
year = {2024},
howpublished = {Zenodo},
doi = {10.5281/zenodo.123456},
note = {Version 1.2}
}
Always validate extracted metadata:
python scripts/validate_citations.py extracted_refs.bib
Check:
DOIs provide:
Spot-check:
LaTeX special characters:
Protect capitalization with braces: {AlphaFold}
Accented characters: M{\"u}ller or use Unicode
Chemical formulas: H$_2$O or \ce{H2O}
Convention: FirstAuthorYEARkeyword
Smith2024protein
Doe2023machine
Johnson2024cancer
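Keys following this convention can be generated from the metadata. In the sketch below the function name, the stopword list, and the rule "first title word that is not a stopword" are all assumptions; the convention itself does not say how the keyword is chosen.

```python
import re
import unicodedata

# Small illustrative stopword list; extend as needed.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "for", "and", "to", "with"}


def _ascii_letters(text):
    """Strip accents and anything that is not an ASCII letter."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z]", "", ascii_text)


def citation_key(first_author, year, title):
    """Build a FirstAuthorYEARkeyword key from 'Family, Given', year, title."""
    family = first_author.split(",")[0]
    words = re.findall(r"[A-Za-z]+", title.lower())
    keyword = next((w for w in words if w not in STOPWORDS), "")
    return f"{_ascii_letters(family)}{year}{keyword}"
```

Two papers by the same first author in the same year can still collide on the keyword, so a uniqueness check against existing keys remains necessary.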
Most journal articles published after ~2000 have a DOI; include it whenever one exists:
doi = {10.1038/nature12345}
For non-standard sources, add note:
note = {Preprint, not peer-reviewed}
note = {Technical report}
note = {Dataset accompanying [citation]}
Metadata extraction workflow:
Use scripts to automate:
extract_metadata.py: Universal extractor
doi_to_bibtex.py: Quick DOI conversion
validate_citations.py: Verify accuracy
Always validate extracted metadata before final submission!