.cursor/BUGBOT.md
The repository is separated into main code and experimental code.
Small non-invasive improvements that make experimental code more consistent with the main codebase are encouraged, but avoid large refactors.
If a PR implements a method, algorithm, or training approach from a research paper, it must also add a corresponding subsection to paper_index.md.
When reviewing such PRs, ensure that paper_index.md was updated.
Trainers in this repository are self-contained by design. Shared logic (generation, reward computation, metric logging, weight syncing, etc.) is deliberately duplicated across trainers rather than abstracted into a shared base class.
This is intentional: each trainer must be readable, modifiable, and evolvable in isolation. The base class (_BaseTrainer) provides only minimal utilities (model card generation). Everything else — vLLM generation paths, _get_per_token_logps_and_entropies, _calculate_rewards, _prepare_inputs, metric logging — is copied in full.
The tradeoff: duplication is accepted, but consistency is mandatory. When the same logic appears in multiple trainers, the duplicated blocks must stay aligned: same structure, same attribute names (`self._last_loaded_step`, `self._metrics[mode]`, …).

Consistency over correctness: this is a strong requirement. When duplicating code, reproduce it exactly, even if you believe the original has a bug. Do not silently fix the issue in your copy. Instead, keep your copy consistent with the source and report the problem so it can be fixed across all trainers in a dedicated PR. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep.
When modifying duplicated code: if you change a pattern that exists in multiple trainers (e.g., the vLLM generation path in _generate_single_turn), apply the same change to all other trainers. A fix in GRPO often implies the same fix in RLOO, and vice versa. Not propagating a change is a bug.
When reviewing: if a PR touches duplicated logic, verify that all copies are updated consistently. A common mistake is fixing one trainer and forgetting the others.
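As a sketch of what "consistent duplication" means in practice (the trainer classes and the `_maybe_sync_weights` method below are invented for illustration; only `self._last_loaded_step` comes from this document), the same block appears verbatim in each trainer:

```python
class GRPOTrainer:
    def __init__(self):
        self._last_loaded_step = -1

    def _maybe_sync_weights(self, step: int) -> bool:
        # Duplicated block: must stay identical to the copy in RLOOTrainer.
        if step != self._last_loaded_step:
            self._last_loaded_step = step
            return True
        return False


class RLOOTrainer:
    def __init__(self):
        self._last_loaded_step = -1

    def _maybe_sync_weights(self, step: int) -> bool:
        # Same block, copied in full: a fix here implies the same fix above.
        if step != self._last_loaded_step:
            self._last_loaded_step = step
            return True
        return False
```

A change to either copy (a bug fix, a new guard, a renamed attribute) belongs in the same PR for both trainers.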
This codebase values leanness and simplicity above all. Prefer straightforward, inline code over abstractions, helpers, or utilities — even at the cost of some robustness or generality.
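For instance, one pattern this favors is an explicit version comparison rather than probing objects for attributes. A minimal sketch, assuming `packaging` is installed; the `4.40.0` threshold and `supports_feature` helper are made up for illustration:

```python
from packaging import version

# Hypothetical threshold: pretend the attribute we care about landed in 4.40.0.
MIN_VERSION = version.parse("4.40.0")

def supports_feature(installed_version: str) -> bool:
    # Explicit version comparison instead of `hasattr(obj, "new_attr")`:
    # the intent ("this needs at least 4.40.0") is stated directly.
    return version.parse(installed_version) >= MIN_VERSION

print(supports_feature("4.41.2"))  # → True
print(supports_feature("4.39.0"))  # → False
```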
Concretely:

- Avoid `hasattr` and `getattr`. Their use is almost always a symptom of overly defensive programming or a disguised version check (e.g., "this attribute was added in version X"). Instead, either drop the conditional entirely or express the version check explicitly with a version comparison. There is nearly always a cleaner alternative.

Docstrings must follow the repository format below. Do not convert docstrings to other styles (Google, NumPy, etc.).
Rules:

- Put types in backticks: (`str`).
- Mark optional parameters with *optional* and spell out defaults as `defaults to <value>`.
- When the default is `None`, prefer (`str`, *optional*) instead of (`str` or `None`, *optional*, defaults to `None`).
- Write union types with `or`: `str` or `None`.
- Reference classes with the `~` shorthand: [`~transformers.PreTrainedModel`].
- Group related parameters under a `> Parameters for X:` header.

Example:
````python
def method(self, param1: str, param2: int = 1, param3: float | None = None):
    """
    Brief one-line description of what this does.

    Args:
        param1 (`str`):
            Description of required param.
        param2 (`int`, *optional*, defaults to `1`):
            Description of optional param with default.
        param3 (`float`, *optional*):
            Description of optional param without explicit default.

    Returns:
        `dict` with keys:

        - `key1` (`list[int]`):
            Description of this key.

    Examples:
        ```python
        >>> my_func("hello")
        ```
    """
````
When linking to papers, use `https://huggingface.co/papers/<id>` instead of `https://arxiv.org/abs/<id>` (the `<id>` suffix is the same in both schemes).
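The rewrite is a pure prefix swap, as this sketch shows (the helper name is hypothetical and the paper ID is a placeholder):

```python
import re

def to_hf_paper_url(url: str) -> str:
    # Swap the arXiv abstract prefix for the Hugging Face papers prefix;
    # the trailing paper ID is identical under both schemes.
    return re.sub(r"^https://arxiv\.org/abs/", "https://huggingface.co/papers/", url)

print(to_hf_paper_url("https://arxiv.org/abs/1234.56789"))
# → https://huggingface.co/papers/1234.56789
```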