scientific-skills/citation-management/references/citation_validation.md
Comprehensive guide to validating citation accuracy, completeness, and formatting in BibTeX files.
Citation validation ensures:
Validation should be performed:
Purpose: Ensure DOIs are valid and resolve correctly.
DOI format:
Valid: 10.1038/s41586-021-03819-2
Valid: 10.1126/science.aam9317
Invalid: 10.1038/invalid
Invalid: doi:10.1038/... (should omit "doi:" prefix in BibTeX)
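The format rules above can be sketched as a small check. This is a minimal illustration, not the logic inside `validate_citations.py`: the names `normalize_doi` and `is_well_formed_doi` and the regular expression are assumptions. The pattern tests only the shape of the string, so passing it does not mean the DOI is registered or resolves.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a non-empty
# suffix without whitespace. It covers modern DOIs but is not a full specification.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that should not appear in a BibTeX doi field.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Strip whitespace and a leading 'doi:' or resolver URL prefix."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            return doi[len(prefix):].strip()
    return doi


def is_well_formed_doi(doi: str) -> bool:
    """Return True if the string has the shape of a bare DOI (no prefix)."""
    return bool(DOI_PATTERN.match(doi))
```

Calling `normalize_doi` first and `is_well_formed_doi` second lets a cleanup step accept values pasted with a `doi:` prefix while still rejecting them in the final BibTeX field.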
DOI resolution:
Metadata consistency:
Manual check:
Automated check (recommended):
python scripts/validate_citations.py references.bib --check-dois
Process:
Broken DOIs:
Mismatched metadata:
Missing DOIs:
Purpose: Ensure all necessary information is present.
@article:
author % REQUIRED
title % REQUIRED
journal % REQUIRED
year % REQUIRED
volume % Highly recommended
pages % Highly recommended
doi % Highly recommended for modern papers
@book:
author OR editor % REQUIRED (at least one)
title % REQUIRED
publisher % REQUIRED
year % REQUIRED
isbn % Recommended
@inproceedings:
author % REQUIRED
title % REQUIRED
booktitle % REQUIRED (conference/proceedings name)
year % REQUIRED
pages % Recommended
@incollection (book chapter):
author % REQUIRED
title % REQUIRED (chapter title)
booktitle % REQUIRED (book title)
publisher % REQUIRED
year % REQUIRED
editor % Recommended
pages % Recommended
@phdthesis:
author % REQUIRED
title % REQUIRED
school % REQUIRED
year % REQUIRED
@misc (preprints, datasets, etc.):
author % REQUIRED
title % REQUIRED
year % REQUIRED
howpublished % Recommended (bioRxiv, Zenodo, etc.)
doi OR url % At least one required
python scripts/validate_citations.py references.bib --check-required-fields
Output:
Error: Entry 'Smith2024' missing required field 'journal'
Error: Entry 'Doe2023' missing required field 'year'
Warning: Entry 'Jones2022' missing recommended field 'volume'
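A required-field check of this kind can be sketched as a lookup table plus one loop. The table below copies the REQUIRED fields listed above. The function name `missing_required_fields` and the representation of an entry as a plain dict are assumptions for illustration, not the script's actual internals.

```python
# Required fields per entry type, copied from the lists above.
# A tuple means "at least one of these fields".
REQUIRED_FIELDS = {
    "article": ["author", "title", "journal", "year"],
    "book": [("author", "editor"), "title", "publisher", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "incollection": ["author", "title", "booktitle", "publisher", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "misc": ["author", "title", "year", ("doi", "url")],
}


def missing_required_fields(entry_type: str, fields: dict) -> list:
    """Return the names of required fields that are absent or empty."""
    missing = []
    for requirement in REQUIRED_FIELDS.get(entry_type.lower(), []):
        alternatives = requirement if isinstance(requirement, tuple) else (requirement,)
        # The requirement is met if any alternative has a non-blank value.
        if not any(fields.get(name, "").strip() for name in alternatives):
            missing.append(" OR ".join(alternatives))
    return missing


# Hypothetical entry with no journal field.
entry = {"author": "Smith, John", "title": "A Title", "year": "2024"}
for name in missing_required_fields("article", entry):
    print(f"Error: Entry 'Smith2024' missing required field '{name}'")
```

Entry types that are not in the table return an empty list, so unknown types are silently accepted in this sketch.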
Purpose: Ensure consistent, correct author name formatting.
Recommended BibTeX format:
author = {Last1, First1 and Last2, First2 and Last3, First3}
Examples:
% Correct
author = {Smith, John}
author = {Smith, John A.}
author = {Smith, John Andrew}
author = {Smith, John and Doe, Jane}
author = {Smith, John and Doe, Jane and Johnson, Mary}
% For many authors
author = {Smith, John and Doe, Jane and others}
% Incorrect
author = {John Smith} % First Last format (not recommended)
author = {Smith, J.; Doe, J.} % Semicolon separator (wrong)
author = {Smith J, Doe J} % Missing commas
Suffixes (Jr., III, etc.):
author = {King, Jr., Martin Luther}
Multiple surnames (hyphenated):
author = {Smith-Jones, Mary}
Van, von, de, etc.:
author = {van der Waals, Johannes}
author = {de Broglie, Louis}
Organizations as authors:
author = {{World Health Organization}}
% Double braces make BibTeX treat the whole name as a single author
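The author conventions above can be checked with a brace-aware split on ` and `. The sketch below is an assumption about how such a check might look. The helper names `split_authors` and `author_issues` are hypothetical, and the heuristics are deliberately simple.

```python
def split_authors(value: str) -> list:
    """Split a BibTeX author field on ' and ' at brace depth 0."""
    names, depth, start, i = [], 0, 0, 0
    while i < len(value):
        ch = value[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif depth == 0 and value[i:i + 5].lower() == " and ":
            names.append(value[start:i].strip())
            start = i + 5
            i += 5
            continue
        i += 1
    names.append(value[start:].strip())
    return names


def author_issues(value: str) -> list:
    """Return human-readable problems found in an author field."""
    issues = []
    if ";" in value:
        issues.append("semicolon used as separator; use ' and '")
    for name in split_authors(value):
        # Skip the 'others' keyword and braced organization names.
        if name == "others" or (name.startswith("{") and name.endswith("}")):
            continue
        if "," not in name:
            issues.append(f"'{name}' is not in 'Last, First' form")
    return issues
```

This sketch does not catch `Smith J, Doe J`, because the string contains a comma and no ` and `. Detecting that case needs a stronger heuristic or a comparison against external metadata.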
Automated validation:
python scripts/validate_citations.py references.bib --check-authors
Checks for:
Purpose: Ensure all fields contain valid, reasonable values.
Valid years:
year = {2024} % Current/recent
year = {1953} % Watson & Crick DNA structure (historical)
year = {1665} % Hooke's Micrographia (very old)
Invalid years:
year = {24} % Two digits (ambiguous)
year = {202} % Typo
year = {2025} % Future year at the time of writing (valid only if accepted/in press)
year = {0} % Obviously wrong
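The year rules above reduce to two tests: four digits, and not later than the current year unless the work is in press. A minimal sketch follows. The function name `year_issue` and the `allow_future` switch are assumptions for illustration.

```python
import re
from datetime import date


def year_issue(value: str, allow_future: bool = False):
    """Return a description of the problem with a year value, or None if it looks fine."""
    text = value.strip()
    # Four ASCII digits with no leading zero: rejects "24", "202", and "0".
    if not re.fullmatch(r"[1-9][0-9]{3}", text):
        return "year must be a four-digit number"
    if int(text) > date.today().year and not allow_future:
        return "year is in the future; acceptable only for accepted or in-press work"
    return None
```

Comparing against `date.today()` rather than a hard-coded year keeps the check from going stale.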
Check:
volume = {123} % Numeric
volume = {12} % Valid
number = {3} % Valid
number = {S1} % Supplement issue (valid)
Invalid:
volume = {Vol. 123} % Should be just number
number = {Issue 3} % Should be just number
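Stripping labels such as "Vol." or "Issue" is a mechanical fix. The sketch below shows one way to do it. The function name `strip_volume_label` and the list of recognized labels are assumptions, and the list is not exhaustive.

```python
import re

# A label must be followed by a period or whitespace, so values such as
# "S1" or "Nov" are left untouched.
_LABEL = re.compile(r"^(?:vol(?:ume)?|no|number|issue)(?:\.\s*|\s+)", re.IGNORECASE)


def strip_volume_label(value: str) -> str:
    """Remove a leading 'Vol.', 'Volume', 'No.', 'Number', or 'Issue' label."""
    return _LABEL.sub("", value.strip())
```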
Correct format:
pages = {123--145} % En-dash (two hyphens)
pages = {e0123456} % PLOS-style article ID
pages = {123} % Single page
Incorrect format:
pages = {123-145} % Single hyphen (use --)
pages = {pp. 123-145} % Remove "pp."
pages = {123–145} % Unicode en-dash (may cause issues)
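The page-range corrections above are regular enough to automate. The following is a minimal sketch under the assumption that any hyphen or en dash in the value separates a start page from an end page. The function name `normalize_pages` is hypothetical.

```python
import re


def normalize_pages(value: str) -> str:
    """Normalize a pages value to the 'start--end' form shown above."""
    text = value.strip()
    # Drop a leading "p." or "pp." label.
    text = re.sub(r"^pp?\.\s*", "", text, flags=re.IGNORECASE)
    # Replace a Unicode en dash (U+2013) or any run of hyphens, together
    # with surrounding spaces, with exactly two hyphens.
    text = re.sub(r"\s*(?:\u2013|-+)\s*", "--", text)
    return text
```

Article identifiers that legitimately contain a hyphen would be altered by this sketch, so a real tool should apply it only to values that look like numeric ranges.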
Check:
Valid:
url = {https://www.nature.com/articles/nature12345}
url = {https://arxiv.org/abs/2103.14030}
Questionable:
url = {http://...} % HTTP instead of HTTPS
url = {file:///...} % Local file path
url = {bit.ly/...} % URL shortener (not permanent)
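The "questionable" cases above can be flagged without any network access by parsing the URL. This sketch uses the standard library's `urllib.parse.urlparse`. The function name `url_warnings` and the short list of shortener hosts are assumptions and are not exhaustive.

```python
from urllib.parse import urlparse

# Assumed, non-exhaustive list of URL shortener hosts.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl"}


def url_warnings(url: str) -> list:
    """Return warnings for a URL field; an empty list means nothing was flagged."""
    warnings = []
    cleaned = url.strip()
    parsed = urlparse(cleaned)
    if not parsed.scheme:
        # Without a scheme, urlparse puts "bit.ly/abc" entirely in .path,
        # so re-parse with a scheme to recover the host.
        warnings.append("missing scheme (expected https://)")
        parsed = urlparse("https://" + cleaned)
    elif parsed.scheme == "file":
        warnings.append("local file path")
    elif parsed.scheme == "http":
        warnings.append("HTTP instead of HTTPS")
    elif parsed.scheme != "https":
        warnings.append(f"unexpected scheme '{parsed.scheme}'")
    host = (parsed.hostname or "").lower()
    if host in SHORTENER_HOSTS:
        warnings.append("URL shortener (not permanent)")
    return warnings
```

Whether a URL is actually reachable is a separate question that requires an HTTP request. This sketch covers only the static checks.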
Purpose: Find and remove duplicate entries.
Exact duplicates (same DOI):
@article{Smith2024a,
doi = {10.1038/nature12345},
...
}
@article{Smith2024b,
doi = {10.1038/nature12345}, % Same DOI!
...
}
Near duplicates (similar title/authors):
@article{Smith2024,
title = {Machine Learning for Drug Discovery},
...
}
@article{Smith2024method,
title = {Machine learning for drug discovery}, % Same, different case
...
}
Preprint + Published:
@misc{Smith2023arxiv,
title = {AlphaFold Results},
howpublished = {arXiv},
...
}
@article{Smith2024,
title = {AlphaFold Results}, % Same paper, now published
journal = {Nature},
...
}
% Keep published version only
By DOI (most reliable):
By title similarity:
By author-year-title:
Automated detection:
python scripts/validate_citations.py references.bib --check-duplicates
Output:
Warning: Possible duplicate entries:
- Smith2024a (DOI: 10.1038/nature12345)
- Smith2024b (DOI: 10.1038/nature12345)
Recommendation: Keep one entry, remove the other.
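Detection by DOI and by title can be sketched as grouping on a normalized key. The code below is an illustration only: the entry structure (citation key mapped to a dict of fields) and the names `normalize_title` and `find_duplicates` are assumptions. It finds exact matches after normalization and does not do fuzzy matching.

```python
import re
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase a title and drop braces, punctuation, and extra whitespace."""
    text = re.sub(r"[{}]", "", title).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def find_duplicates(entries: dict) -> list:
    """Group citation keys that share a DOI or a normalized title.

    `entries` maps citation key -> dict of fields (a hypothetical structure).
    Returns a list of (reason, value, [keys]) tuples.
    """
    by_doi, by_title = defaultdict(list), defaultdict(list)
    for key, fields in entries.items():
        if fields.get("doi"):
            # DOIs are case-insensitive, so compare them in lowercase.
            by_doi[fields["doi"].strip().lower()].append(key)
        if fields.get("title"):
            by_title[normalize_title(fields["title"])].append(key)
    groups = [("same_doi", doi, keys) for doi, keys in by_doi.items() if len(keys) > 1]
    groups += [("same_title", t, keys) for t, keys in by_title.items() if len(keys) > 1]
    return groups
```

A preprint and its published version usually have different DOIs, so the title grouping is what catches that pair. The title normalization discards non-ASCII letters, which is a simplification.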
Purpose: Ensure valid BibTeX syntax.
Missing commas:
@article{Smith2024,
author = {Smith, John} % Missing comma!
title = {Title}
}
% Should be:
author = {Smith, John}, % Comma after each field
Unbalanced braces:
title = {Title with {Protected} Text % Missing closing brace
% Should be:
title = {Title with {Protected} Text}
Missing closing brace for entry:
@article{Smith2024,
author = {Smith, John},
title = {Title}
% Missing closing brace!
% Should end with:
}
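Both brace errors above (an unclosed field value and an unclosed entry) show up as a nonzero brace depth. A minimal depth counter is sketched below. The function name `brace_balance_error` is hypothetical, and the sketch ignores BibTeX's alternative quote-delimited values.

```python
def brace_balance_error(text: str):
    """Return a message for the first brace problem found, or None if balanced."""
    depth = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        skip_next = False
        for ch in line:
            if skip_next:
                skip_next = False
            elif ch == "\\":
                skip_next = True  # ignore an escaped character such as \{ or \}
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth < 0:
                    return f"line {line_no}: unexpected closing brace"
    if depth > 0:
        return f"{depth} unclosed brace(s) at end of input"
    return None
```

Running the check per entry rather than per file narrows down where the missing brace is.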
Invalid characters in keys:
@article{Smith&Doe2024, % & not allowed in key
...
}
% Use:
@article{SmithDoe2024,
...
}
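A key check can be sketched as a whitelist of characters. The safe set used here (letters, digits, underscore, hyphen, colon, period) is a conservative assumption rather than BibTeX's formal grammar, and the function names are hypothetical.

```python
import re

# Conservative assumption: only these characters are treated as safe in keys.
_UNSAFE = re.compile(r"[^A-Za-z0-9_:.\-]")


def invalid_key_chars(key: str) -> list:
    """Return the distinct characters in a citation key outside the safe set."""
    return sorted(set(_UNSAFE.findall(key)))


def suggest_key(key: str) -> str:
    """Drop unsafe characters to produce a candidate replacement key."""
    return _UNSAFE.sub("", key)
```

After renaming a key, every `\cite` command that uses the old key must be updated as well.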
Entry structure:
@TYPE{citationkey,
field1 = {value1},
field2 = {value2},
...
fieldN = {valueN}
}
Citation keys:
Field values:
year = 2024
Special characters:
{ and } for grouping
\ for LaTeX commands
{AlphaFold} to protect capitalization
{\"u}, {\'e}, {\aa} for accented characters
python scripts/validate_citations.py references.bib --check-syntax
Checks:
Run comprehensive validation:
python scripts/validate_citations.py references.bib
Checks all:
Examine validation report:
{
"total_entries": 150,
"valid_entries": 140,
"errors": [
{
"entry": "Smith2024",
"error": "missing_required_field",
"field": "journal",
"severity": "high"
},
{
"entry": "Doe2023",
"error": "invalid_doi",
"doi": "10.1038/broken",
"severity": "high"
}
],
"warnings": [
{
"entry": "Jones2022",
"warning": "missing_recommended_field",
"field": "volume",
"severity": "medium"
}
],
"duplicates": [
{
"entries": ["Smith2024a", "Smith2024b"],
"reason": "same_doi",
"doi": "10.1038/nature12345"
}
]
}
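A report with this structure is easy to summarize before deciding what to fix first. The sketch below assumes the JSON has already been loaded into a dict with the keys shown above. How the script emits the report (file or standard output) is not specified here, and `summarize_report` is a hypothetical helper.

```python
import json
from collections import Counter


def summarize_report(report: dict) -> dict:
    """Count issues by severity and list the entries that need attention."""
    issues = report.get("errors", []) + report.get("warnings", [])
    by_severity = Counter(issue.get("severity", "unknown") for issue in issues)
    flagged = sorted({issue["entry"] for issue in issues if "entry" in issue})
    return {
        "by_severity": dict(by_severity),
        "flagged_entries": flagged,
        "duplicate_groups": len(report.get("duplicates", [])),
    }


# Trimmed-down version of the report shown above.
sample = json.loads("""
{
  "errors": [
    {"entry": "Smith2024", "error": "missing_required_field", "severity": "high"},
    {"entry": "Doe2023", "error": "invalid_doi", "severity": "high"}
  ],
  "warnings": [
    {"entry": "Jones2022", "warning": "missing_recommended_field", "severity": "medium"}
  ],
  "duplicates": [
    {"entries": ["Smith2024a", "Smith2024b"], "reason": "same_doi"}
  ]
}
""")
print(summarize_report(sample))
```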
High-priority (errors):
Medium-priority (warnings):
Low-priority:
Use auto-fix for safe corrections:
python scripts/validate_citations.py references.bib \
--auto-fix \
--output fixed_references.bib
Auto-fix can:
Auto-fix cannot:
Review auto-fixed file:
# Check what changed
diff references.bib fixed_references.bib
# Review specific entries that had errors
grep -A 10 "Smith2024" fixed_references.bib
Validate after fixes:
python scripts/validate_citations.py fixed_references.bib --verbose
Should show:
✓ All DOIs valid
✓ All required fields present
✓ No duplicates found
✓ Syntax valid
✓ 150/150 entries valid
Use this checklist before final submission:
# After extraction
python scripts/extract_metadata.py --doi ... --output refs.bib
python scripts/validate_citations.py refs.bib
# After manual edits
python scripts/validate_citations.py refs.bib
# Before submission
python scripts/validate_citations.py refs.bib --strict
Don't validate manually - use scripts:
# Before auto-fix
cp references.bib references_backup.bib
# Run auto-fix
python scripts/validate_citations.py references.bib \
--auto-fix \
--output references_fixed.bib
# Review changes
diff references.bib references_fixed.bib
# If satisfied, replace
mv references_fixed.bib references.bib
Priority order:
For entries that can't be fixed:
@article{Old1950,
author = {Smith, John},
title = {Title},
journal = {Obscure Journal},
year = {1950},
volume = {12},
pages = {34--56},
note = {No DOI available for this publication}
}
Different journals have different requirements:
Check journal author guidelines!
Problem: BibTeX says 2023, CrossRef says 2024.
Cause:
Solution:
Problem: LaTeX compilation fails on special characters.
Cause:
Solution:
% Use LaTeX commands
author = {M{\"u}ller, Hans} % Müller
title = {Study of H\textsubscript{2}O} % H₂O
% Or use UTF-8 with proper LaTeX packages
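The first option, replacing accented characters with LaTeX commands, can be sketched as a lookup table. The mapping below is a small, non-exhaustive assumption covering a few common characters. `to_latex` is a hypothetical helper, not part of the validation script.

```python
import unicodedata

# Small, non-exhaustive mapping from Unicode characters to LaTeX commands.
LATEX_ESCAPES = {
    "ü": r'{\"u}',
    "ö": r'{\"o}',
    "ä": r'{\"a}',
    "é": r"{\'e}",
    "è": r"{\`e}",
    "å": r"{\aa}",
    "ñ": r"{\~n}",
}


def to_latex(text: str) -> str:
    """Replace known accented characters with brace-wrapped LaTeX commands."""
    # NFC composes "u" + combining diaeresis into the single character "ü",
    # so both encodings of the same letter hit the table.
    composed = unicodedata.normalize("NFC", text)
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in composed)
```

Characters that are not in the table pass through unchanged, so the output should still be reviewed for remaining non-ASCII text.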
Problem: Extracted metadata missing fields.
Cause:
Solution:
Problem: Same paper appears twice, not detected.
Cause:
Solution:
Validation ensures citation quality:
✓ Accuracy: DOIs resolve, metadata correct
✓ Completeness: All required fields present
✓ Consistency: Proper formatting throughout
✓ No duplicates: Each paper cited once
✓ Valid syntax: BibTeX compiles without errors
Always validate before final submission!
Use automated tools:
python scripts/validate_citations.py references.bib
Follow workflow: