# Dataset Management

Create and manage datasets for your projects with the `ragaai_catalyst` library. This guide covers listing, creating, and extending datasets.
To start managing datasets for a specific project, initialize the `Dataset` class with your project name:

```python
from ragaai_catalyst import Dataset

# Initialize dataset management for a specific project
dataset_manager = Dataset(project_name="project_name")

# List existing datasets
datasets = dataset_manager.list_datasets()
print("Existing Datasets:", datasets)
```
You can create a new dataset by uploading a CSV file and mapping its columns to the required schema elements.

### get_schema_mapping()

Retrieves the valid schema elements that your CSV column names must map to. It helps ensure that your CSV column names align correctly with the expected schema.

```python
schema_elements = dataset_manager.get_schema_mapping()
print('Supported column names:', schema_elements)
```
### create_from_csv()

Uploads the CSV file to the server, performs schema mapping, and creates a new dataset.

Parameters:

- `csv_path` (str): Path to the CSV file.
- `dataset_name` (str): The name to assign to the new dataset created from the CSV.
- `schema_mapping` (dict): A dictionary that maps CSV columns to schema elements, in the format `{csv_column: schema_element}`.

Example usage:

```python
dataset_manager.create_from_csv(
    csv_path='path/to/your.csv',
    dataset_name='MyDataset',
    schema_mapping={'column1': 'schema_element1', 'column2': 'schema_element2'}
)
```
### Understanding schema_mapping

The `schema_mapping` parameter is crucial when creating datasets from a CSV file. It ensures that the data in your CSV file correctly maps to the schema format expected by the system.

The keys of the `schema_mapping` dictionary are the column names in your CSV file; the values are the schema elements they map to. For example, suppose your CSV file has columns `user_id` and `response_time`, and the valid schema elements for these are `user_identifier` and `response_duration`. Your `schema_mapping` would look like this:

```python
schema_mapping = {
    'user_id': 'user_identifier',
    'response_time': 'response_duration'
}
```

This mapping ensures that when the CSV is uploaded, the data in `user_id` is interpreted as `user_identifier` and `response_time` as `response_duration`, aligning the data with the system's expectations.
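Since every value in `schema_mapping` must be one of the elements returned by `get_schema_mapping()`, it can help to check the mapping locally before uploading. A minimal sketch (the `validate_mapping` helper is hypothetical, not part of `ragaai_catalyst`):

```python
def validate_mapping(schema_mapping, valid_elements):
    """Return the schema elements in the mapping that are not valid."""
    return [v for v in schema_mapping.values() if v not in valid_elements]

# Suppose get_schema_mapping() returned these elements:
valid_elements = ['user_identifier', 'response_duration', 'prompt', 'response']

schema_mapping = {
    'user_id': 'user_identifier',
    'response_time': 'response_duration'
}

print(validate_mapping(schema_mapping, valid_elements))  # → []
```

An empty list means every mapped element is supported; anything returned should be corrected before calling `create_from_csv()`.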
### add_rows()

Appends rows from a CSV file to an existing dataset:

```python
add_rows_csv_path = "path/to/your.csv"
dataset_manager.add_rows(csv_path=add_rows_csv_path, dataset_name=dataset_name)
```
### add_columns()

Adds a new, model-generated column to an existing dataset. The `text_fields` parameter holds the prompt messages, and `variables` maps the `{{placeholders}}` in the prompt to dataset columns:

```python
text_fields = [
    {
        "role": "system",
        "content": "you are an evaluator, which answers only in yes or no."
    },
    {
        "role": "user",
        "content": "are any of the {{context1}} {{feedback1}} related to broken hand"
    }
]
column_name = "column_name"
provider = "openai"
model = "gpt-4o-mini"
variables = {
    "context1": "context",
    "feedback1": "feedback"
}

dataset_manager.add_columns(
    text_fields=text_fields,
    dataset_name=dataset_name,
    column_name=column_name,
    provider=provider,
    model=model,
    variables=variables
)
```
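To see how `variables` connects the prompt to dataset columns, here is a small illustration of the `{{placeholder}}` substitution; `render_prompt` is a sketch of the templating idea, not `ragaai_catalyst` code:

```python
def render_prompt(template, variables, row):
    # Replace each {{placeholder}} with the value from the dataset
    # column that `variables` maps it to.
    for placeholder, column in variables.items():
        template = template.replace("{{" + placeholder + "}}", str(row[column]))
    return template

variables = {"context1": "context", "feedback1": "feedback"}
row = {"context": "patient notes", "feedback": "follow-up comments"}

prompt = render_prompt(
    "are any of the {{context1}} {{feedback1}} related to broken hand",
    variables, row
)
print(prompt)
# → are any of the patient notes follow-up comments related to broken hand
```

Each dataset row produces one rendered prompt, and the model's answer is stored in the new column.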
### create_from_jsonl()

Creates a new dataset from a JSONL file:

```python
dataset_manager.create_from_jsonl(
    jsonl_path='jsonl_path',
    dataset_name='MyDataset',
    schema_mapping={'column1': 'schema_element1', 'column2': 'schema_element2'}
)
```
### add_rows_from_jsonl()

Appends rows from a JSONL file to an existing dataset:

```python
dataset_manager.add_rows_from_jsonl(
    jsonl_path='jsonl_path',
    dataset_name='MyDataset',
)
```
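A JSONL file holds one JSON object per line, with each object becoming one dataset row. A quick way to produce one with the standard library (the file name and fields here are illustrative):

```python
import json

rows = [
    {"column1": "first prompt", "column2": "first response"},
    {"column1": "second prompt", "column2": "second response"},
]

# Write one JSON object per line
with open("my_dataset.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Reading it back: one dict per line
with open("my_dataset.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # → 2
```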
### create_from_df()

Creates a new dataset from a pandas DataFrame:

```python
dataset_manager.create_from_df(
    df=df,
    dataset_name='MyDataset',
    schema_mapping={'column1': 'schema_element1', 'column2': 'schema_element2'}
)
```
### add_rows_from_df()

Appends rows from a DataFrame to an existing dataset. `df.tail(2)` here appends only the last two rows:

```python
dataset_manager.add_rows_from_df(
    df=df.tail(2),
    dataset_name='MyDataset',
)
```
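For reference, a `df` like the one used above can be built directly from a dict of columns; the column names here match the illustrative `schema_mapping` keys, and pandas is assumed to be installed:

```python
import pandas as pd

# Column names should match the keys of your schema_mapping
df = pd.DataFrame({
    "column1": ["prompt one", "prompt two", "prompt three"],
    "column2": ["response one", "response two", "response three"],
})

print(df.shape)         # → (3, 2)
print(len(df.tail(2)))  # → 2
```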