Daft provides powerful capabilities for working with JSON data and nested data structures. Whether you're processing API responses, log files, or complex hierarchical data, Daft's JSON modality makes it easy to parse, query, and manipulate structured data.
If you have a column of JSON strings, Daft provides the `.jq()` method to run jq-style filters on them. For example, to extract a value from a JSON object:
=== "🐍 Python"

    ```python
    import daft

    df = daft.from_pydict({
        "json": [
            '{"a": 1, "b": 2}',
            '{"a": 3, "b": 4}',
        ],
    })
    # Run the jq filter `.a` against every JSON string in the column.
    df = df.with_column("a", df["json"].jq(".a"))
    df.collect()
    ```

=== "⚙️ SQL"

    ```python
    import daft

    df = daft.from_pydict({
        "json": [
            '{"a": 1, "b": 2}',
            '{"a": 3, "b": 4}',
        ],
    })
    # The SQL query refers to the in-scope DataFrame `df` by name.
    df = daft.sql("""
        SELECT
            json,
            json_query(json, '.a') AS a
        FROM df
    """)
    df.collect()
    ```
```
╭──────────────────┬──────╮
│ json             ┆ a    │
│ ---              ┆ ---  │
│ Utf8             ┆ Utf8 │
╞══════════════════╪══════╡
│ {"a": 1, "b": 2} ┆ 1    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ {"a": 3, "b": 4} ┆ 3    │
╰──────────────────┴──────╯

(Showing first 2 of 2 rows)
```
Daft uses jaq as the underlying executor, so you can find the full list of supported filters in the jaq documentation.
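Note that the extracted column `a` in the output above has type `Utf8`: the filter result comes back as JSON text, not as a typed number. As a rough mental model only, the per-row effect of the `.a` filter can be sketched with the standard library. This is not how Daft or jaq is implemented, and `apply_dot_a` is a hypothetical helper introduced purely for illustration.

```python
import json


def apply_dot_a(json_text: str) -> str:
    """Mimic the jq filter `.a` on one JSON object string and return JSON text."""
    value = json.loads(json_text).get("a")  # a missing key yields None, i.e. JSON null
    return json.dumps(value)


rows = ['{"a": 1, "b": 2}', '{"a": 3, "b": 4}']

# The extracted values are still strings, mirroring the Utf8 column above.
extracted = [apply_dot_a(row) for row in rows]

# An explicit conversion is needed before treating them as numbers.
as_ints = [int(value) for value in extracted]
```

The same consideration applies to the Daft column: if the extracted values are meant to be used numerically, they likely need an explicit conversion after the filter runs.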
<!-- ### Deserializing JSON and extracting multiple fields -->

When working with nested data, such as log files, metadata, or deserialized JSON, we often need to extract specific fields or flatten the entire structure into individual columns. Daft provides two main approaches for this:

- The `[]` operator to access individual nested fields.
- `.unnest()` or the `*` wildcard to expand all nested fields into separate columns.

Consider the following example, which reads from the nebius/SWE-rebench dataset.
=== "🐍 Python"

    ```python
    import daft
    from daft import col

    swe_rebench_metadata = daft.read_parquet("hf://datasets/nebius/SWE-rebench/data/*.parquet").select("meta")
    swe_rebench_metadata.schema()
    ```
```
╭─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ column_name ┆ type                                                                                                                                                                                                                        │
╞═════════════╪═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ meta        ┆ Struct[commit_name: Utf8, failed_lite_validators: List[Utf8], has_test_patch: Boolean, is_lite: Boolean, llm_score: Struct[difficulty_score: Int64, issue_text_score: Int64, test_score: Int64], num_modified_files: Int64] │
╰─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```
We can extract a specific field from the struct with the `[]` operator. For example, to extract `difficulty_score` from the nested `llm_score` struct:
=== "🐍 Python"

    ```python
    swe_rebench_metadata.select(col("meta")["llm_score"]["difficulty_score"]).show()
    ```
```
╭──────────────────╮
│ difficulty_score │
│ ---              │
│ Int64            │
╞══════════════════╡
│ 2                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0                │
╰──────────────────╯

(Showing first 8 rows)
```
To extract all of the nested fields at once, use the [`.unnest()`][daft.expressions.Expression.unnest] expression or the `*` wildcard, either of which expands every field of the `meta` struct column into its own column.
=== "🐍 Python"

    ```python
    swe_rebench_metadata.select(daft.col("meta").unnest()).show()

    # Alternatively:
    # swe_rebench_metadata.select(daft.col("meta")["*"]).show()
    ```
```
╭─────────────┬────────────────────────────────┬────────────────┬─────────┬─────────────────────────────────────────────────────────────────────────────┬────────────────────╮
│ commit_name ┆ failed_lite_validators         ┆ has_test_patch ┆ is_lite ┆ llm_score                                                                   ┆ num_modified_files │
│ ---         ┆ ---                            ┆ ---            ┆ ---     ┆ ---                                                                         ┆ ---                │
│ Utf8        ┆ List[Utf8]                     ┆ Boolean        ┆ Boolean ┆ Struct[difficulty_score: Int64, issue_text_score: Int64, test_score: Int64] ┆ Int64              │
╞═════════════╪════════════════════════════════╪════════════════╪═════════╪═════════════════════════════════════════════════════════════════════════════╪════════════════════╡
│ head_commit ┆ [has_short_problem_statement,… ┆ true           ┆ false   ┆ {difficulty_score: 2,                                                       ┆ 5                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ [has_many_modified_files, has… ┆ true           ┆ false   ┆ {difficulty_score: 1,                                                       ┆ 5                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ [has_removed_files, has_many_… ┆ true           ┆ false   ┆ {difficulty_score: 2,                                                       ┆ 6                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ []                             ┆ true           ┆ true    ┆ {difficulty_score: 2,                                                       ┆ 1                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ []                             ┆ true           ┆ true    ┆ {difficulty_score: 0,                                                       ┆ 1                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ []                             ┆ true           ┆ true    ┆ {difficulty_score: 0,                                                       ┆ 1                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ []                             ┆ true           ┆ true    ┆ {difficulty_score: 1,                                                       ┆ 1                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ head_commit ┆ [has_hyperlinks, has_issue_re… ┆ true           ┆ false   ┆ {difficulty_score: 0,                                                       ┆ 3                  │
│             ┆                                ┆                ┆         ┆ issue_t…                                                                    ┆                    │
╰─────────────┴────────────────────────────────┴────────────────┴─────────┴─────────────────────────────────────────────────────────────────────────────┴────────────────────╯

(Showing first 8 rows)
```