data/datasets/README.md
This repository aims to provide a diverse and accessible collection of datasets that can be used to train OpenAssistant models. Our goal is to cover a wide range of topics, languages and tasks.
To see the datasets people are currently working on, please refer to the spreadsheet.
__init__.py lists the dataset names and the corresponding Hugging Face datasets.
To simplify the training process, all datasets must be UTF-8 encoded and stored in one of these two formats: Parquet, written with row_group_size=100 and index=False, or jsonl.
Four types of datasets are currently accepted, each described below.
Instruction datasets are designed to align language models with human interactions. These can take the form of question-answer, request-response, task-solution pairs, and so on. The instruction dataset must include the following columns:
Optional metadata can be supplied as a JSON string, for example {"nsfw": true}.
The next type of dataset is designed for conversations with multiple continuations. In this format, each conversation is represented as a tree, where each node is a message from the user or the assistant. For instance, Open-Assistant is collecting its data in a similar format (example).
The dataset must be a jsonl file with the following schema:
{
  "thread": {
    "text": "", # Message text
    "role": "", # Message role: "prompter" or "assistant"
    "meta": {}, # Optional message metadata, for example, message rank, safety score and so on
    "replies": [] # A list of message responses, each with the same structure as "thread"
  },
  "source": "", # Source of the conversation
  "meta": {} # Optional metadata of the conversation
}
For example:
{
  "thread": {
    "text": "What is the best programming language in 2023?",
    "role": "prompter",
    "meta": { "lang": "en" },
    "replies": [
      {
        "text": "It depends on the task that you are aiming to solve.",
        "role": "assistant",
        "meta": { "rank": 0 },
        "replies": [
          {
            "text": "I want to start learning to code",
            "role": "prompter",
            "meta": { "rank": 0 },
            "replies": []
          },
          {
            "text": "I want to make money",
            "role": "prompter",
            "meta": { "rank": 1 },
            "replies": []
          }
        ]
      },
      {
        "text": "Python is the best.",
        "role": "assistant",
        "meta": { "rank": 1 },
        "replies": []
      }
    ]
  },
  "source": "twitter",
  "meta": { "post_id": "..." }
}
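
The tree schema above can be checked and flattened with a few lines of Python. The following is a minimal sketch, not part of the OpenAssistant code base: the names `validate_node` and `iter_conversations` are hypothetical, and treating a missing "meta" or "replies" field as empty is an assumption. It validates each message node recursively and then lists every root-to-leaf path, that is, every complete conversation contained in one record.

```python
import json
from typing import Any, Iterator

ALLOWED_ROLES = ("prompter", "assistant")


def validate_node(node: Any, path: str = "thread") -> None:
    """Raise ValueError if a message node does not follow the schema."""
    if not isinstance(node, dict):
        raise ValueError(f"{path}: expected a JSON object")
    if not isinstance(node.get("text"), str):
        raise ValueError(f"{path}: 'text' must be a string")
    if node.get("role") not in ALLOWED_ROLES:
        raise ValueError(f"{path}: 'role' must be one of {ALLOWED_ROLES}")
    # Assumption: a missing "meta" or "replies" field is treated as empty.
    if not isinstance(node.get("meta", {}), dict):
        raise ValueError(f"{path}: 'meta' must be a JSON object")
    replies = node.get("replies", [])
    if not isinstance(replies, list):
        raise ValueError(f"{path}: 'replies' must be a list")
    for i, reply in enumerate(replies):
        validate_node(reply, f"{path}.replies[{i}]")


def iter_conversations(node: dict, prefix: tuple = ()) -> Iterator[tuple]:
    """Yield each root-to-leaf path as a tuple of (role, text) pairs."""
    current = prefix + ((node["role"], node["text"]),)
    replies = node.get("replies", [])
    if not replies:
        yield current
    for reply in replies:
        yield from iter_conversations(reply, current)


# One jsonl line, shortened from the example above.
line = json.dumps({
    "thread": {
        "text": "What is the best programming language in 2023?",
        "role": "prompter",
        "meta": {"lang": "en"},
        "replies": [
            {
                "text": "It depends on the task.",
                "role": "assistant",
                "meta": {"rank": 0},
                "replies": [
                    {"text": "I want to start learning to code",
                     "role": "prompter", "meta": {"rank": 0}, "replies": []},
                    {"text": "I want to make money",
                     "role": "prompter", "meta": {"rank": 1}, "replies": []},
                ],
            },
            {"text": "Python is the best.", "role": "assistant",
             "meta": {"rank": 1}, "replies": []},
        ],
    },
    "source": "twitter",
    "meta": {"post_id": "..."},
})

record = json.loads(line)
validate_node(record["thread"])
conversations = list(iter_conversations(record["thread"]))
for conversation in conversations:
    print(" -> ".join(f"{role}: {text}" for role, text in conversation))
```

Each leaf message produces one conversation, so the record above yields three. Paths are listed depth-first, in the order the replies appear in the file.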
For datasets intended for training safety models, the prosocial format is proposed. The format is given below.
Text-only datasets cover data that does not fit any of the previous types. A text-only dataset must include the following columns:
The dataset must adhere to the following requirements:
To add a new dataset to OpenAssistant, follow these steps:
1. Create an issue: Create a new issue and describe your proposal for the new dataset.
2. Create a dataset on Hugging Face: Create a dataset on Hugging Face. See below for more details.
3. Make a pull request: Add a new dataset loading script to this folder and link the issue in the pull request description. For more information, see below.
To create a new dataset on Hugging Face, follow these steps:
import pandas as pd
# Create a pandas dataframe from your dataset file(s)
df = pd.read_json(...) # or any other way
# Save the file in the Parquet format
df.to_parquet("dataset.parquet", row_group_size=100, engine="pyarrow", index=False)
Make sure the text data in the dataframe is properly encoded as UTF-8!
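
One way to check the UTF-8 requirement is to decode the raw source files strictly before loading them into a dataframe. The sketch below is illustrative only and uses just the Python standard library. The helper names `write_jsonl` and `is_valid_utf8` are hypothetical, and the file names are placeholders. It writes a jsonl file with an explicit UTF-8 encoding and then verifies files by decoding them chunk by chunk.

```python
import codecs
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text as-is instead of \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def is_valid_utf8(path, chunk_size=1 << 20):
    """Return True if the whole file decodes as UTF-8, False otherwise."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                decoder.decode(chunk)
        # final=True makes a truncated multi-byte sequence at the end an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, "good.jsonl")
    bad_path = os.path.join(tmp, "bad.jsonl")

    write_jsonl([{"text": "Grüße", "role": "prompter"}], good_path)
    # Simulate a file that was saved as Latin-1 instead of UTF-8.
    with open(bad_path, "wb") as f:
        f.write(b'{"text": "caf\xe9"}\n')

    good_ok = is_valid_utf8(good_path)
    bad_ok = is_valid_utf8(bad_path)
    with open(good_path, encoding="utf-8") as f:
        round_trip = [json.loads(row) for row in f]

print(good_ok, bad_ok)
```

Decoding in chunks with an incremental decoder keeps memory use flat for large files and still handles multi-byte characters that straddle a chunk boundary.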
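Install the huggingface_hub package:
pip install huggingface_hub
Use your access token to log in, either from the command line:
huggingface-cli login
or from a notebook:
from huggingface_hub import notebook_login
notebook_login()
Then load the Parquet file and push it to the Hub:
from datasets import Dataset
ds = Dataset.from_parquet("dataset.parquet")
ds.push_to_hub("your_huggingface_name/dataset_name")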
Update the README.md file of your dataset by visiting this link:
https://huggingface.co/datasets/your_huggingface_name/dataset_name/edit/main/README.md
(substitute your Hugging Face username and dataset name)
Add your dataset to the corresponding dictionary in __init__.py, for example:
INSTRUCTION_DATASETS = {
...,
"dataset_name": "your_huggingface_name/dataset_name"
}
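Before opening the pull request, run the pre-commit hooks:
pre-commit run
Link the issue in the pull request description, for example:
Resolves #123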