docs/v2/semantic-agent.mdx
The SemanticAgent (currently in beta) extends the capabilities of the PandasAI library by adding a semantic layer to its results. Unlike the standard Agent, the SemanticAgent generates a JSON query, which can then be used to produce Python or SQL code. This approach ensures more accurate and interpretable outputs.
Note: Usage of the Semantic Agent in production is subject to a license. For more details, refer to the license documentation. If you plan to use it in production, contact us.
Creating an instance of the SemanticAgent is similar to creating an instance of an Agent.
from pandasai.ee.agents.semantic_agent import SemanticAgent
import pandas as pd
df = pd.read_csv('revenue.csv')
agent = SemanticAgent(df, config=config)
agent.chat("What are the top 5 revenue streams?")
The Semantic Agent operates in two main steps:
The first step is schema generation, which structures the data into a schema that the Semantic Agent can use to generate JSON queries. By default, this schema is automatically created, but you can also provide a custom schema if necessary.
By default, the SemanticAgent considers all dataframes passed to it and generates an appropriate schema.
To provide a custom schema, pass a schema parameter during the instantiation of the SemanticAgent.
salaries_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Salary": [5000, 6000, 4500, 7000, 5500],
}
)
employees_df = pd.DataFrame(
{
"EmployeeID": [1, 2, 3, 4, 5],
"Name": ["John", "Emma", "Liam", "Olivia", "William"],
"Department": ["HR", "Marketing", "IT", "Marketing", "Finance"],
}
)
schema = [
{
"name": "Employees",
"table": "Employees",
"measures": [
{
"name": "count",
"type": "count",
"sql": "EmployeeID"
}
],
"dimensions": [
{
"name": "EmployeeID",
"type": "string",
"sql": "EmployeeID"
},
{
"name": "Department",
"type": "string",
"sql": "Department"
}
],
"joins": [
{
"name": "Salaries",
"join_type":"left",
"sql": "Employees.EmployeeID = Salaries.EmployeeID"
}
]
},
{
"name": "Salaries",
"table": "Salaries",
"measures": [
{
"name": "count",
"type": "count",
"sql": "EmployeeID"
},
{
"name": "avg_salary",
"type": "avg",
"sql": "Salary"
},
{
"name": "max_salary",
"type": "max",
"sql": "Salary"
}
],
"dimensions": [
{
"name": "EmployeeID",
"type": "string",
"sql": "EmployeeID"
},
{
"name": "Salary",
"type": "string",
"sql": "Salary"
}
],
"joins": [
{
"name": "Employees",
"join_type":"left",
"sql": "Contracts.contract_code = Fees.contract_id"
}
]
}
]
agent = SemanticAgent([employees_df, salaries_df], schema=schema)
The second step involves generating a JSON query based on the schema. This query is then used to produce the Python or SQL code required for execution.
Here's an example of a JSON query generated by the SemanticAgent:
{
"type": "number",
"dimensions": [],
"measures": ["Salaries.avg_salary"],
"timeDimensions": [],
"filters": [],
"order": []
}
This query is interpreted by the Semantic Agent and converted into executable Python or SQL code.
A schema in the SemanticAgent is a comprehensive representation of the data, including tables, columns, measures, dimensions, and relationships between tables. Here's a breakdown of its components:
Measures are the quantitative metrics used in the analysis, such as sums, averages, counts, etc.
count, avg, sum, max, min).Example:
{
"name": "avg_salary",
"type": "avg",
"sql": "Salary"
}
Dimensions are the categorical variables used to slice and dice the data.
Example:
{
"name": "Department",
"type": "string",
"sql": "Department"
}
Joins define the relationships between tables, specifying how they should be connected in queries.
left, right, inner).Example:
{
"name": "Salaries",
"join_type": "left",
"sql": "Employees.EmployeeID = Salaries.EmployeeID"
}
The JSON query is a structured representation of the request, specifying what data to retrieve and how to process it. Here's a detailed look at its fields:
The type of query determines the format of the result, such as a single number, a table, or a chart.
Example:
{
"type": "number",
...
}
Columns used to group the data. In an SQL GROUP BY clause, these would be the columns listed.
Example:
{
...,
"dimensions": ["Department"]
}
Columns used to calculate data, typically involving aggregate functions like sum, average, count, etc.
Example:
{
...,
"measures": ["Salaries.avg_salary"]
}
Columns used to group the data by time, often involving date functions. Each timeDimensions entry specifies a time period and its granularity. The dateRange field allows various formats, including specific dates such as ["2022-01-01", "2023-03-31"], relative periods like "last week", "last month", "this month", "this week", "today", "this year", and "last year".
Example:
{
...,
"timeDimensions": [
{
"dimension": "Sales.time_period",
"dateRange": ["2023-01-01", "2023-03-31"],
"granularity": "day"
}
]
}
Conditions to filter the data, equivalent to SQL WHERE clauses. Each filter specifies a member, an operator, and a set of values. The operators allowed include: "equals", "notEquals", "contains", "notContains", "startsWith", "endsWith", "gt" (greater than), "gte" (greater than or equal to), "lt" (less than), "lte" (less than or equal to), "set", "notSet", "inDateRange", "notInDateRange", "beforeDate", and "afterDate".
Example:
{
...,
"filters": [
{
"member": "Ticket.category",
"operator": "notEquals",
"values": ["null"]
}
]
}
Columns used to order the data, equivalent to SQL ORDER BY clauses. Each entry in the order array specifies an identifier and the direction of sorting. The direction can be either "asc" for ascending or "desc" for descending order.
Example:
{
...,
"order": [
{
"id": "Contratti.contract_count",
"direction": "asc"
}
]
}
When these components come together, they form a complete query that the Semantic Agent can interpret and execute. Here's an example that combines all elements:
{
"type": "table",
"dimensions": ["Department"],
"measures": ["Salaries.avg_salary"],
"timeDimensions": [],
"filters": [
{
"member": "Department",
"operator": "equals",
"values": ["Marketing", "IT"]
}
],
"order": [
{
"measure": "Salaries.avg_salary",
"direction": "desc"
}
]
}
This query translates to an SQL statement like:
SELECT Department, AVG(Salary) AS avg_salary,
FROM Employees
JOIN Salaries ON Employees.EmployeeID = Salaries.EmployeeID
WHERE Department IN ('Marketing', 'IT')
GROUP BY Department
ORDER BY avg_salary DESC;