skills/research/research-paper-writing/references/citation-workflow.md
This reference provides a complete workflow for managing citations programmatically, preventing AI-generated citation hallucinations, and maintaining clean bibliographies.
Research has documented significant issues with AI-generated citations, including fabricated references.
Never generate citations from memory; always verify programmatically.
| API | Coverage | Rate Limits | Best For |
|---|---|---|---|
| Semantic Scholar | 214M papers | 1 RPS (free key) | ML/AI papers, citation graphs |
| CrossRef | 140M+ DOIs | Polite pool with mailto | DOI lookup, BibTeX retrieval |
| arXiv | Preprints | 3-second delays | ML preprints, PDF access |
| OpenAlex | 240M+ works | 100K/day, 10 RPS | Open alternative to MAG |
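The rate limits in the table can be respected with a small client-side throttle. The following is a minimal sketch, not part of any of these APIs: the `Throttle` class and the `MIN_INTERVAL` mapping are hypothetical names, and the CrossRef interval is an assumption because the table only mentions the polite pool.

```python
import time
from typing import Callable, Dict, Optional

# Minimum seconds between requests, derived from the table above.
# OpenAlex's 10 RPS becomes 0.1 s. The CrossRef value is an assumption.
MIN_INTERVAL: Dict[str, float] = {
    "semantic_scholar": 1.0,
    "arxiv": 3.0,
    "openalex": 0.1,
    "crossref": 1.0,  # assumed; the table gives no number
}


class Throttle:
    """Block until `interval` seconds have passed since the previous call."""

    def __init__(
        self,
        interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

One way to use it is to create one `Throttle(MIN_INTERVAL["arxiv"])` per API and call `wait()` immediately before each request. The injectable `clock` and `sleep` arguments exist only to make the class testable without real delays.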
Need ML paper search? → Semantic Scholar
Have DOI, need BibTeX? → CrossRef content negotiation
Looking for preprint? → arXiv API
Need open data, bulk access? → OpenAlex
Google Scholar has no official API. Scraping violates ToS. Use SerpApi ($75-275/month) only if Semantic Scholar coverage is insufficient.
1. SEARCH → Query Semantic Scholar with specific keywords
↓
2. VERIFY → Confirm paper exists in 2+ sources
↓
3. RETRIEVE → Get BibTeX via DOI content negotiation
↓
4. VALIDATE → Confirm the claim appears in source
↓
5. ADD → Add verified entry to .bib file
Use Semantic Scholar for ML/AI papers:
from semanticscholar import SemanticScholar

sch = SemanticScholar()
results = sch.search_paper("transformer attention mechanism", limit=10)

for paper in results:
    ids = paper.externalIds or {}  # externalIds may be missing for some records
    print(f"Title: {paper.title}")
    print(f"Year: {paper.year}")
    print(f"DOI: {ids.get('DOI', 'N/A')}")
    print(f"arXiv: {ids.get('ArXiv', 'N/A')}")
    print(f"Citation count: {paper.citationCount}")
    print("---")
Confirm paper exists in at least two sources:
import requests
from semanticscholar import SemanticScholar

def verify_paper(doi=None, arxiv_id=None, title=None):
    """Verify paper exists in multiple sources."""
    sources_found = []

    # Check Semantic Scholar (treat a failed lookup as "not found")
    sch = SemanticScholar()
    if doi:
        try:
            paper = sch.get_paper(f"DOI:{doi}")
        except Exception:
            paper = None
        if paper:
            sources_found.append("Semantic Scholar")

    # Check CrossRef (via DOI)
    if doi:
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        if resp.status_code == 200:
            sources_found.append("CrossRef")

    # Check arXiv
    if arxiv_id:
        resp = requests.get(
            f"http://export.arxiv.org/api/query?id_list={arxiv_id}",
            timeout=10,
        )
        if "<entry>" in resp.text:
            sources_found.append("arXiv")

    return len(sources_found) >= 2, sources_found
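An identifier that resolves in two sources shows that some paper exists, not that it is the paper you meant to cite. A minimal sketch of a title comparison follows; `normalize_title`, `titles_match`, and the 0.9 threshold are illustrative assumptions rather than part of the workflow above.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and trim."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", title.lower())
    return cleaned.strip()


def titles_match(expected: str, found: str, threshold: float = 0.9) -> bool:
    """Return True when two titles agree up to minor formatting differences."""
    a, b = normalize_title(expected), normalize_title(found)
    if not a or not b:
        return False  # an empty title never counts as a match
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

The expected title would come from your search result and the found title from the second source, for example the `message.title` list in a CrossRef works response. Treat a mismatch as a reason to inspect the record by hand.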
Use DOI content negotiation so that the BibTeX comes from the registered metadata rather than from memory:
import requests

def doi_to_bibtex(doi: str) -> str:
    """Get BibTeX for a DOI via content negotiation at doi.org."""
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
        allow_redirects=True,
        timeout=10,
    )
    response.raise_for_status()
    return response.text

# Example: "Attention Is All You Need"
bibtex = doi_to_bibtex("10.48550/arXiv.1706.03762")
print(bibtex)
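A 200 response does not guarantee that the body is BibTeX, since some DOIs may resolve to a landing page instead. The sketch below is a shape check to run before saving the response; `looks_like_bibtex` is a hypothetical helper and only a heuristic, not a BibTeX parser.

```python
import re


def looks_like_bibtex(text: str) -> bool:
    """Heuristic: body starts with an '@type{key,' header and ends with '}'."""
    body = text.strip()
    has_header = re.match(r"@\w+\s*\{\s*[^,\s]+\s*,", body) is not None
    return has_header and body.endswith("}")
```

If the check fails, fall back to another source for the metadata instead of writing the response to the .bib file.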
Before citing a paper for a specific claim, verify the claim exists:
def get_paper_abstract(doi):
    """Get abstract to verify claims."""
    sch = SemanticScholar()
    paper = sch.get_paper(f"DOI:{doi}")
    return paper.abstract if paper else None

# Verify claim appears in abstract (the abstract may be missing)
abstract = get_paper_abstract("10.48550/arXiv.1706.03762")
claim = "attention mechanism"
if abstract and claim.lower() in abstract.lower():
    print("Claim appears in paper")
else:
    print("Claim not found in abstract; read the full text before citing")
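An exact substring match misses paraphrases, so a looser check can help triage which claims need a manual read. The sketch below reports how many of the claim's terms occur in the text; `claim_term_coverage` and the `min_len` cutoff are assumptions for illustration, and a high score is not proof that the paper supports the claim.

```python
import re
from typing import List, Tuple


def claim_term_coverage(
    claim: str, text: str, min_len: int = 4
) -> Tuple[float, List[str]]:
    """Return (fraction of claim terms found in text, sorted missing terms)."""

    def tokenize(s: str) -> List[str]:
        return re.findall(r"[a-z0-9]+", s.lower())

    # Ignore very short words, which are mostly function words.
    terms = sorted({w for w in tokenize(claim) if len(w) >= min_len})
    if not terms:
        return 0.0, []
    words = set(tokenize(text))
    missing = [t for t in terms if t not in words]
    return 1 - len(missing) / len(terms), missing
```

The missing terms are a useful prompt for what to look for in the full text before the citation goes into the paper.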
Add verified entry to your .bib file with consistent key format:
import re

def generate_citation_key(bibtex: str) -> str:
    """Generate consistent citation key: author_year_firstword."""
    # Extract first author's surname; handles "Last, First" and "First Last"
    author_match = re.search(r'\bauthor\s*=\s*\{([^}]+)\}', bibtex, re.I)
    if author_match:
        first = re.split(r'\s+and\s+', author_match.group(1).strip())[0]
        if ',' in first:
            first_author = first.split(',')[0].strip()
        else:
            first_author = first.split()[-1]
    else:
        first_author = "unknown"

    # Extract year
    year_match = re.search(r'\byear\s*=\s*\{?(\d{4})\}?', bibtex, re.I)
    year = year_match.group(1) if year_match else "0000"

    # Extract title first word (\b keeps "booktitle" from matching)
    title_match = re.search(r'\btitle\s*=\s*\{([^}]+)\}', bibtex, re.I)
    if title_match:
        first_word = title_match.group(1).split()[0].lower()
        first_word = re.sub(r'[^a-z]', '', first_word)
    else:
        first_word = "paper"

    return f"{first_author.lower()}_{year}_{first_word}"
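BibTeX retrieved through content negotiation arrives with whatever key the provider chose, so the generated key still has to be written into the entry. A minimal sketch, assuming one entry per string; `set_citation_key` is a hypothetical helper name.

```python
import re


def set_citation_key(bibtex: str, key: str) -> str:
    """Replace the citation key of the first entry in a BibTeX string."""
    # Group 1 is the '@type{' header, group 2 the comma after the old key.
    pattern = r"(@\w+\s*\{\s*)[^,\s]*(\s*,)"
    new_text, count = re.subn(
        pattern,
        lambda m: m.group(1) + key + m.group(2),
        bibtex,
        count=1,
    )
    if count == 0:
        raise ValueError("no BibTeX entry header found")
    return new_text
```

The replacement is passed as a function rather than a string so that backslashes in the key, if any, are not interpreted as regex escapes.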
{% raw %}
"""
Citation Manager - Verified citation workflow for ML papers.
"""

import requests
import time
from typing import Optional, List, Dict, Tuple
from dataclasses import dataclass

try:
    from semanticscholar import SemanticScholar
except ImportError:
    print("Install: pip install semanticscholar")
    SemanticScholar = None


@dataclass
class Paper:
    title: str
    authors: List[str]
    year: int
    doi: Optional[str]
    arxiv_id: Optional[str]
    venue: Optional[str]
    citation_count: int
    abstract: Optional[str]


class CitationManager:
    """Manage citations with verification."""

    def __init__(self, api_key: Optional[str] = None):
        self.sch = SemanticScholar(api_key=api_key) if SemanticScholar else None
        self.verified_papers: Dict[str, Paper] = {}

    def search(self, query: str, limit: int = 10) -> List[Paper]:
        """Search for papers using Semantic Scholar."""
        if not self.sch:
            raise RuntimeError("Semantic Scholar not available")

        results = self.sch.search_paper(query, limit=limit)
        papers = []
        for r in results:
            paper = Paper(
                title=r.title,
                authors=[a.name for a in (r.authors or [])],
                year=r.year or 0,
                doi=r.externalIds.get('DOI') if r.externalIds else None,
                arxiv_id=r.externalIds.get('ArXiv') if r.externalIds else None,
                venue=r.venue,
                citation_count=r.citationCount or 0,
                abstract=r.abstract
            )
            papers.append(paper)
        return papers

    def verify(self, paper: Paper) -> Tuple[bool, List[str]]:
        """Verify paper exists in multiple sources."""
        sources = []

        # Already found in Semantic Scholar via search
        sources.append("Semantic Scholar")

        # Check CrossRef if DOI available
        if paper.doi:
            try:
                resp = requests.get(
                    f"https://api.crossref.org/works/{paper.doi}",
                    timeout=10
                )
                if resp.status_code == 200:
                    sources.append("CrossRef")
            except Exception:
                pass

        # Check arXiv if ID available
        if paper.arxiv_id:
            try:
                resp = requests.get(
                    f"http://export.arxiv.org/api/query?id_list={paper.arxiv_id}",
                    timeout=10
                )
                if "<entry>" in resp.text and "<title>" in resp.text:
                    sources.append("arXiv")
            except Exception:
                pass

        return len(sources) >= 2, sources

    def get_bibtex(self, paper: Paper) -> Optional[str]:
        """Get BibTeX for verified paper."""
        if paper.doi:
            try:
                resp = requests.get(
                    f"https://doi.org/{paper.doi}",
                    headers={"Accept": "application/x-bibtex"},
                    timeout=10,
                    allow_redirects=True
                )
                if resp.status_code == 200:
                    return resp.text
            except Exception:
                pass

        # Fallback: generate from paper data
        return self._generate_bibtex(paper)

    def _generate_bibtex(self, paper: Paper) -> str:
        """Generate BibTeX from paper metadata."""
        # Generate citation key
        first_author = paper.authors[0].split()[-1] if paper.authors else "unknown"
        first_word = paper.title.split()[0].lower().replace(',', '').replace(':', '')
        key = f"{first_author.lower()}_{paper.year}_{first_word}"

        # Format authors
        authors = " and ".join(paper.authors) if paper.authors else "Unknown"

        bibtex = f"""@article{{{key},
  title = {{{paper.title}}},
  author = {{{authors}}},
  year = {{{paper.year}}},
  {'doi = {' + paper.doi + '},' if paper.doi else ''}
  {'eprint = {' + paper.arxiv_id + '},' if paper.arxiv_id else ''}
  {'journal = {' + paper.venue + '},' if paper.venue else ''}
}}"""
        return bibtex

    def cite(self, query: str) -> Optional[str]:
        """Full workflow: search, verify, return BibTeX."""
        # Search
        papers = self.search(query, limit=5)
        if not papers:
            return None

        # Take top result
        paper = papers[0]

        # Verify
        verified, sources = self.verify(paper)
        if not verified:
            print(f"Warning: Could only verify in {sources}")

        # Get BibTeX
        bibtex = self.get_bibtex(paper)

        # Cache
        if bibtex:
            self.verified_papers[paper.title] = paper

        return bibtex


# Usage example
if __name__ == "__main__":
    cm = CitationManager()

    # Search and cite
    bibtex = cm.cite("attention is all you need transformer")
    if bibtex:
        print(bibtex)
{% endraw %}
import time
from typing import List, Optional

def quick_cite(query: str) -> Optional[str]:
    """One-liner citation; returns None when the search finds nothing."""
    cm = CitationManager()
    return cm.cite(query)

def batch_cite(queries: List[str], output_file: str = "references.bib"):
    """Cite multiple papers and save to file."""
    cm = CitationManager()
    bibtex_entries = []

    for query in queries:
        print(f"Processing: {query}")
        bibtex = cm.cite(query)
        if bibtex:
            bibtex_entries.append(bibtex)
        time.sleep(1)  # Rate limiting

    with open(output_file, 'w') as f:
        f.write("\n\n".join(bibtex_entries))

    print(f"Saved {len(bibtex_entries)} citations to {output_file}")
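Different queries can return the same paper, which would leave duplicate entries in the output file. The following sketch filters a list of BibTeX strings before writing; `entry_identity` and `dedupe_entries` are hypothetical helpers, and matching by DOI first and citation key second is an assumption about what counts as a duplicate.

```python
import re
from typing import List, Optional


def entry_identity(bibtex: str) -> Optional[str]:
    """Return a lowercase DOI if present, else the citation key, else None."""
    doi = re.search(r'\bdoi\s*=\s*[{"]([^}"]+)[}"]', bibtex, re.I)
    if doi:
        return "doi:" + doi.group(1).strip().lower()
    key = re.match(r"\s*@\w+\s*\{\s*([^,\s]+)", bibtex)
    return "key:" + key.group(1) if key else None


def dedupe_entries(entries: List[str]) -> List[str]:
    """Keep the first entry for each identity, preserving input order."""
    seen = set()
    unique = []
    for entry in entries:
        ident = entry_identity(entry)
        if ident is not None and ident in seen:
            continue  # same DOI or key already kept
        if ident is not None:
            seen.add(ident)
        unique.append(entry)
    return unique
```

Entries with no recognizable DOI or key are kept as they are, so nothing is dropped silently.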
| Feature | BibTeX | BibLaTeX |
|---|---|---|
| Unicode support | Limited | Full |
| Entry types | Standard | Extended (@online, @dataset) |
| Customization | Limited | Highly flexible |
| Backend | bibtex | Biber (recommended) |
Recommendation: Use natbib with BibTeX for conference submissions — all major venue templates (NeurIPS, ICML, ICLR, ACL, AAAI, COLM) ship with natbib and .bst files. BibLaTeX with Biber is an option for journals or personal projects where you control the template.
% In preamble
\usepackage[
backend=biber,
style=numeric,
sorting=none
]{biblatex}
\addbibresource{references.bib}
% In document
\cite{vaswani_2017_attention}
% At end
\printbibliography
\cite{key} % Numeric: [1]
\citep{key} % Parenthetical: (Author, 2020)
\citet{key} % Textual: Author (2020)
\citeauthor{key} % Just author name
\citeyear{key} % Just year
Use format: author_year_firstword
vaswani_2017_attention
devlin_2019_bert
brown_2020_language
@inproceedings{vaswani_2017_attention,
title = {Attention Is All You Need},
author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and
Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and
Kaiser, Lukasz and Polosukhin, Illia},
booktitle = {Advances in Neural Information Processing Systems},
volume = {30},
year = {2017},
publisher = {Curran Associates, Inc.}
}
@article{hochreiter_1997_long,
title = {Long Short-Term Memory},
author = {Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
journal = {Neural Computation},
volume = {9},
number = {8},
pages = {1735--1780},
year = {1997},
publisher = {MIT Press}
}
@misc{brown_2020_language,
title = {Language Models are Few-Shot Learners},
author = {Brown, Tom and Mann, Benjamin and Ryder, Nick and others},
year = {2020},
eprint = {2005.14165},
archiveprefix = {arXiv},
primaryclass = {cs.CL}
}
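Entries like the three above can be checked for missing fields before they go into the .bib file. This is a rough sketch under stated assumptions: `REQUIRED_FIELDS` is a simplified house rule (standard BibTeX requires nothing for `@misc`), `missing_fields` is a hypothetical helper, and the regex expects each field to start on its own line.

```python
import re
from typing import Dict, List

# Fields this sketch expects per entry type. The @misc row is a house
# rule chosen for illustration, not a BibTeX requirement.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "article": ["author", "title", "journal", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "misc": ["author", "title", "year"],
}


def missing_fields(bibtex: str) -> List[str]:
    """List the expected fields that a single BibTeX entry lacks."""
    header = re.match(r"\s*@(\w+)\s*\{", bibtex)
    if not header:
        raise ValueError("not a BibTeX entry")
    entry_type = header.group(1).lower()
    # Field names are the identifiers that start a line and precede '='.
    present = {f.lower() for f in re.findall(r"^\s*(\w+)\s*=", bibtex, re.M)}
    return [f for f in REQUIRED_FIELDS.get(entry_type, []) if f not in present]
```

Entry types not listed in `REQUIRED_FIELDS` return an empty list, so the check is permissive by default.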
Issue: Semantic Scholar returns no results
Issue: DOI doesn't resolve to BibTeX
Issue: Rate limiting errors
Issue: Encoding problems in BibTeX
Use LaTeX escapes, e.g. {\"u} for ü
Before adding a citation:
APIs:
Python Libraries:
- semanticscholar: https://pypi.org/project/semanticscholar/
- arxiv: https://pypi.org/project/arxiv/
- habanero (CrossRef): https://github.com/sckott/habanero

Verification Tools: