docs/v3/semantic-layer/new.mdx
pai.create() method with CSV and parquet files

The simplest way to define a semantic layer schema is using the create method:
import pandasai as pai

# Load your data (in this case, a CSV)
file = pai.read_csv("data.csv")

df = pai.create(
    # Format: "organization/dataset"
    path="company/sales-data",
    # Input dataframe
    df=file,
    # Optional description
    description="Sales data from our retail stores",
    # Define the structure and metadata of your dataset's columns.
    # If not provided, all columns from the input dataframe will be included.
    columns=[
        {
            "name": "transaction_id",
            "type": "string",
            "description": "Unique identifier for each sale"
        },
        {
            "name": "sale_date",
            "type": "datetime",
            "description": "Date and time of the sale"
        }
    ]
)
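Once created, the returned dataframe can be queried in natural language with chat (assuming an LLM is configured; the question below is purely illustrative):

# Ask a question against the semantic dataset created above
response = df.chat("Which store had the highest total sales?")
print(response)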
The path uniquely identifies your dataset in the PandasAI ecosystem using the format "organization/dataset".
file = pai.read_csv("data.csv")
pai.create(
    path="acme-corp/sales-data",  # Format: "organization/dataset"
    ...
)
Type: str
The input dataframe that contains your data, typically created using pai.read_csv().
file = pai.read_csv("data.csv") # Create the input dataframe
pai.create(
    path="acme-corp/sales-data",
    df=file,  # Pass your dataframe here
    ...
)
Type: DataFrame
A clear text description that helps others understand the dataset's contents and purpose.
file = pai.read_csv("data.csv")
pai.create(
    path="company/sales-data",
    df=file,
    description="Daily sales transactions from all retail stores, including transaction IDs, dates, and amounts",
    ...
)
Type: str
Define the structure and metadata of your dataset's columns to help PandasAI understand your data better.
Note: If the columns parameter is not provided, all columns from the input dataframe will be included in the semantic layer.
When specified, only the declared columns will be included, allowing you to select specific columns for your semantic layer.
file = pai.read_csv("data.csv")

pai.create(
    path="company/sales-data",
    df=file,
    description="Daily sales transactions from all retail stores",
    columns=[
        {
            "name": "transaction_id",
            "type": "string",
            "description": "Unique identifier for each sale"
        },
        {
            "name": "sale_date",
            "type": "datetime",
            "description": "Date and time of the sale"
        },
        {
            "name": "quantity",
            "type": "integer",
            "description": "Number of units sold"
        },
        {
            "name": "price",
            "type": "float",
            "description": "Price per unit in USD"
        },
        {
            "name": "is_online",
            "type": "boolean",
            "description": "Whether the sale was made online"
        }
    ]
)
Type: list[dict]
name (str): Name of the column
type (str): Data type of the column
description (str): Clear explanation of what the column represents

pai.create() method for SQL databases

For SQL databases, you can use the create method to define your data source and schema. Here's an example using a MySQL database:
import pandasai as pai

sql_table = pai.create(
    # Format: "organization/dataset"
    path="company/health-data",
    # Optional description
    description="Heart disease dataset from MySQL database",
    # Define the source of the data, including connection details
    # and the table name
    source={
        "type": "mysql",
        "connection": {
            "host": "${DB_HOST}",
            "port": 3306,
            "user": "${DB_USER}",
            "password": "${DB_PASSWORD}",
            "database": "${DB_NAME}"
        },
        "table": "heart_data"
    }
)
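The ${DB_HOST}-style placeholders are meant to be resolved from environment variables. Assuming that behavior, make sure the variables are set before calling create. The values below are illustrative; in practice load them from a .env file or a secrets manager rather than hard-coding them:

import os

# Illustrative values only; these names match the placeholders used above
os.environ["DB_HOST"] = "localhost"
os.environ["DB_USER"] = "admin"
os.environ["DB_PASSWORD"] = "change-me"
os.environ["DB_NAME"] = "health"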
In this example:
path defines where the dataset will be stored in your project
description provides context about the dataset
source contains the connection type, connection details, and table name
You can then use this dataset like any other:
# Load the dataset
heart_data = pai.load("company/health-data")
# Query the data
response = heart_data.chat("What is the average age of patients with heart disease?")
Whenever you create a semantic layer schema using the create method, a YAML configuration file is automatically generated for you in the datasets/ directory of your project.
As an alternative, you can use a YAML schema.yaml file directly in the datasets/organization_name/dataset_name directory.
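For orientation, here is a sketch of what a complete schema.yaml might look like, assembled from the options documented below (all values are illustrative):

description: Daily sales transactions from all retail stores
source:
  type: postgres
  connection:
    host: postgres-host
    port: 5432
    database: postgres
    user: postgres
    password: ******
  table: orders
columns:
  - name: transaction_id
    type: string
    description: Unique identifier for each sale
  - name: sale_date
    type: datetime
    description: Date and time of the sale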
The following sections detail all available configuration options for your schema.yaml file:
A clear text description that helps others understand the dataset's contents and purpose.
Type: str
description: Daily sales transactions from all retail stores, including transaction IDs, dates, and amounts
Specify the data source for your dataset.
source:
  type: postgres
  connection:
    host: postgres-host
    port: 5432
    database: postgres
    user: postgres
    password: ******
  table: orders
  view: false
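As listed in the options below, a source can alternatively be described with a connection_string and a query; assuming these stand in for the connection and table fields, a sketch could look like this:

source:
  type: postgres
  connection_string: postgresql://postgres:******@postgres-host:5432/postgres
  query: SELECT * FROM orders WHERE status = 'completed'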
The available data sources depend on the installed data extensions (SQL databases, data lakehouses, yahoo_finance).
Type: dict
type (str): Type of data source
connection_string (str): Connection string for the data source
query (str): Query to retrieve data from the data source

Define the structure and metadata of your dataset's columns to help PandasAI understand your data better.
columns:
  - name: transaction_id
    type: string
    description: Unique identifier for each sale
  - name: sale_date
    type: datetime
    description: Date and time of the sale
Type: list[dict]
name (str): Name of the column (e.g., transaction_id).
type (str): Data type of the column.
"string": IDs, names, categories.
"integer": Counts, whole numbers.
"float": Prices, percentages.
"datetime": Timestamps, dates.
"boolean": Flags, true/false values.
description (str): Clear explanation of what the column represents.
Constraints:
For views, column names must be in the format [table].[column].

Apply transformations to your data to clean, convert, or anonymize it.
transformations:
  - type: anonymize
    params:
      columns:
        - transaction_id
      method: hash
  - type: convert_timezone
    params:
      columns:
        - sale_date
      from_timezone: UTC
      to_timezone: America/New_York
Type: list[dict]
type (str): Type of transformation
params (dict): Parameters for the transformation

If you want to learn more about transformations, check out the transformations documentation.
The group_by field allows you to specify which columns can be used for grouping operations. This is particularly useful for aggregation queries and data analysis.
columns:
  - name: order.date
    type: datetime
    description: Date and time of the sale
  ...
group_by:
  - order.date
  - order.status
Configuration Options:
group_by (list[str]): Columns that can be used in grouping operations. Column names should follow the format table.column

The expression field allows you to specify a SQL expression for a column. This expression will be used in the query instead of the column name.
columns:
  - name: transaction_amount
    type: float
    description: Amount of the transaction
    alias: amount
  - name: total_revenue
    type: float
    description: Total revenue including tax
    expression: "transaction_amount * (1 + tax_rate)"
    alias: revenue
Configuration Options:
alias (str): Alternative name under which the column appears in queries and results
expression (str): SQL expression used in the query in place of the column name
Best Practices: