.tasks/core/AI-002-create-finetuning-dataset.md
To enhance the AI-native capabilities of Spacedrive, we need to fine-tune a Large Language Model (LLM) to understand the concepts, commands, and architecture of the system. This task involves creating a high-quality dataset for this purpose.
The dataset will enable the AI agent to answer user questions accurately and translate natural language commands into structured API calls.
The dataset will be created as a JSONL file (`training_data.jsonl`), where each line is a JSON object representing a single training example. We will generate two primary types of examples.
The first type, question-and-answer (Q&A) pairs, will teach the model the fundamental concepts of Spacedrive. They will be generated by parsing the whitepaper and the technical documentation.
Example:
```json
{
  "type": "qa",
  "question": "What is the dual purpose of the Content Identity system in Spacedrive?",
  "answer": "Spacedrive's Content Identity system serves a dual purpose: it eliminates storage waste through intelligent deduplication and simultaneously acts as a data guardian by tracking redundancy across all devices, turning content identification into a holistic data protection strategy."
}
```
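To make the record shape concrete, here is a minimal Python sketch of how a Q&A pair could be appended to the JSONL file. The `append_qa_example` helper is hypothetical, not part of any existing tooling:

```python
import json

def append_qa_example(path: str, question: str, answer: str) -> None:
    """Serialize one Q&A pair as a single JSONL line (hypothetical helper)."""
    record = {"type": "qa", "question": question, "answer": answer}
    with open(path, "a", encoding="utf-8") as f:
        # json.dumps without indent never emits raw newlines,
        # so each record is guaranteed to occupy exactly one line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```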
The second type, text-to-GraphQL pairs, will train the model to act as a "semantic parser," translating user requests into API calls against a hypothetical GraphQL endpoint.
Example:
```json
{
  "type": "text-to-graphql",
  "natural_language_query": "find videos larger than 1GB that I modified in the last month, newest first",
  "graphql_query": "query { searchEntries(filter: { contentKind: { eq: \"video\" }, size: { gt: 1073741824 }, modifiedAt: { gte: \"2025-08-03T00:00:00Z\" } }, sortBy: { field: modifiedAt, direction: DESC }) { edges { node { id, name, size, modifiedAt } } } }"
}
```
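Because a malformed GraphQL string would silently poison the training set, each text-to-GraphQL record could be syntax-checked before it is written. A sketch of that idea, assuming the `graphql-core` package is available (`pip install graphql-core`); the `append_graphql_example` helper is hypothetical:

```python
import json
from graphql import parse  # graphql-core: checks syntax only, not the schema

def append_graphql_example(path: str, nl_query: str, graphql_query: str) -> None:
    """Validate and serialize one text-to-GraphQL pair (hypothetical helper)."""
    parse(graphql_query)  # raises GraphQLSyntaxError on a malformed query
    record = {
        "type": "text-to-graphql",
        "natural_language_query": nl_query,
        "graphql_query": graphql_query,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```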
We will start by creating a proof-of-concept dataset. The task is complete when a `training_data.jsonl` file exists in the project root.
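That final check could be automated with a short sanity script that confirms every line parses as JSON and carries the keys its type requires. A minimal sketch; the key sets simply mirror the two example shapes above:

```python
import json

REQUIRED_KEYS = {
    "qa": {"type", "question", "answer"},
    "text-to-graphql": {"type", "natural_language_query", "graphql_query"},
}

def check_dataset(path: str = "training_data.jsonl") -> None:
    """Fail loudly if any line is malformed or missing required keys."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate trailing blank lines
            record = json.loads(line)  # raises on malformed JSON
            missing = REQUIRED_KEYS[record["type"]] - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {missing}")
```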