docs/modalities/files.md
Daft provides powerful capabilities for working with URLs, file paths, and remote storage systems.
Whether you're loading data from local files, cloud storage, or the web, Daft's URL and file handling makes it seamless to work with distributed data sources. Daft supports working with:
- Local files: `file:///path/to/file`, `/path/to/file`
- AWS S3: `s3://bucket/path`, `s3a://bucket/path`, `s3n://bucket/path`
- Google Cloud Storage: `gs://bucket/path`
- Azure Blob Storage: `az://container/path`, `abfs://container/path`, `abfss://container/path`
- HTTP(S): `http://example.com/path`, `https://example.com/path`
- Hugging Face: `hf://dataset/name`
- Databricks Unity Catalog volumes: `vol+dbfs:/Volumes/unity/path`

`daft.from_glob_path` helps discover and size files, accepting wildcards and lists of paths. When paired with `daft.functions.download`, the two functions enable optimized distributed reads of binary data from storage. This is ideal when your data will fit into memory or when you need the entire file content at once.
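For example, here is a minimal sketch pairing the two (the bucket and glob pattern are hypothetical):

```python
import daft
from daft.functions import download

# Discover matching files (one row per file, including a "path" column),
# then download each file's contents as bytes.
# "s3://my-bucket/images/**/*.jpg" is a hypothetical glob.
df = daft.from_glob_path("s3://my-bucket/images/**/*.jpg")
df = df.with_column("data", download(daft.col("path")))
```

You can also call `.download()` directly on a column of URLs: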
=== "🐍 Python"

    ```python
    import daft

    df = daft.from_pydict(
        {
            "urls": [
                "https://www.google.com",
                "s3://daft-oss-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            ],
        }
    )
    df = df.with_column("data", df["urls"].download())
    df.collect()
    ```
=== "⚙️ SQL"

    ```python
    import daft

    df = daft.from_pydict(
        {
            "urls": [
                "https://www.google.com",
                "s3://daft-oss-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            ],
        }
    )
    df = daft.sql("""
        SELECT urls, url_download(urls) AS data FROM df
    """)
    df.collect()
    ```
╭────────────────────────────────────┬────────────────────────────────╮
│ urls                               ┆ data                           │
│ ---                                ┆ ---                            │
│ Utf8                               ┆ Binary                         │
╞════════════════════════════════════╪════════════════════════════════╡
│ https://www.google.com             ┆ b"<!doctype html><html itemsc… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ s3://daft-oss-public-data/open-im… ┆ b"\xff\xd8\xff\xe0\x00\x10JFI… │
╰────────────────────────────────────┴────────────────────────────────╯
(Showing first 2 of 2 rows)
This works well for URLs that are HTTP paths to non-HTML content (e.g. JPEG images), local file paths, and even paths to files in an object store such as AWS S3.
## daft.File Datatype

`daft.File` is particularly useful for working with large files that don't fit in memory, or when you only need to access specific portions of a file. This is a common use case when working with audio, video, or image data, where loading the entire object is prohibitive. The `daft.File` type is subclassed by the `daft.AudioFile`, `daft.ImageFile`, and `daft.VideoFile` types, which streamline common operations. It provides a Pythonic, file-like interface with random access capabilities:
=== "🐍 Python"

    ```python
    import daft
    from daft.functions import file
    from daft.io import IOConfig, S3Config

    io_config = IOConfig(s3=S3Config(anonymous=True))

    df = daft.from_pydict(
        {
            "urls": [
                "https://www.google.com",
                "s3://daft-oss-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            ],
        }
    )

    @daft.func
    def detect_file_type(file: daft.File) -> str:
        # Read just the first 12 bytes to identify the file type
        with file.open() as f:
            header = f.read(12)

        # Common file signatures (magic numbers)
        if header.startswith(b"\xff\xd8\xff"):
            return "JPEG"
        elif header.startswith(b"\x89PNG\r\n\x1a\n"):
            return "PNG"
        elif header.startswith(b"GIF87a") or header.startswith(b"GIF89a"):
            return "GIF"
        elif header.startswith(b"<!") or header.startswith(b"<html"):
            return "HTML"
        elif header.startswith(b"HTTP/"):
            return "HTTP"
        else:
            return None

    df = df.with_column(
        "file_type",
        detect_file_type(file(df["urls"], io_config=io_config)),
    )
    df.collect()
    ```
╭────────────────────────────────────┬───────────╮
│ urls                               ┆ file_type │
│ ---                                ┆ ---       │
│ Utf8                               ┆ Utf8      │
╞════════════════════════════════════╪═══════════╡
│ https://www.google.com             ┆ HTML      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ s3://daft-oss-public-data/open-im… ┆ JPEG      │
╰────────────────────────────────────┴───────────╯
(Showing first 2 of 2 rows)
The daft.File datatype provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments.
While the Python classes provide the interface, the actual implementation lives in the Rust-based PyDaftFile, which maintains optimized backends for different storage types.
This architecture allows us to implement storage-specific optimizations (like network buffering for S3 or HTTP) while presenting a consistent interface.
## daft.File works both within dataframes and as standalone objects

`daft.File` mirrors Python's file interface, but is optimized for distributed computing. Due to its lazy nature, `daft.File` does not read the file into memory until it is needed. To enforce this pattern, a `daft.File` must be used inside a context manager, like `with file.open() as f:`. This works inside `@daft.func` or `@daft.cls` user-defined functions, as well as in native Python code.
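As a standalone object, you construct a `daft.File` from a path and open it directly. A minimal sketch (the local path is hypothetical):

```python
import daft

# Construct a File from a path (local or remote); nothing is read yet.
# "/tmp/example.bin" is a hypothetical local path.
f = daft.File("/tmp/example.bin")

# Bytes are only read once the file is opened and read.
with f.open() as fh:
    header = fh.read(16)
```

Within a dataframe, the same pattern applies inside a `@daft.func`: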
```python
import daft

@daft.func
def read_header(f: daft.File) -> bytes:
    with f.open() as fh:
        return fh.read(16)

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/**")
    .with_column("file", daft.functions.file(daft.col("path")))
    .with_column("header", read_header(daft.col("file")))
    .select("path", "size", "file", "header")
)
df.show(5)
```
╭────────────────────────────────┬─────────┬────────────────────────────────┬────────────────────────────────╮
│ path                           ┆ size    ┆ file                           ┆ header                         │
│ ---                            ┆ ---     ┆ ---                            ┆ ---                            │
│ String                         ┆ Int64   ┆ File[Unknown]                  ┆ Binary                         │
╞════════════════════════════════╪═════════╪════════════════════════════════╪════════════════════════════════╡
│ hf://datasets/Eventual-Inc/sa… ┆ 2534    ┆ Unknown(path: hf://datasets/E… ┆ b"*.7z filter=lfs "            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sa… ┆ 31      ┆ Unknown(path: hf://datasets/E… ┆ b"---\r\nlicense: ap"          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sa… ┆ 822924  ┆ Unknown(path: hf://datasets/E… ┆ b"\xff\xf3\x88\xc4\x00\x00\x0… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sa… ┆ 618408  ┆ Unknown(path: hf://datasets/E… ┆ b"\xff\xf3\x88\xc4\x00\x00\x0… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sa… ┆ 1190736 ┆ Unknown(path: hf://datasets/E… ┆ b"\xff\xf3\x88\xc4\x00\x00\x0… │
╰────────────────────────────────┴─────────┴────────────────────────────────┴────────────────────────────────╯
(Showing first 5 rows)
When working with files that pack multiple records into a single blob (e.g. Paimon blob files), you can specify offset and length to read only a specific byte range instead of the entire file. Both parameters must be provided together.
```python
import daft

# Read bytes 1024-2047 from a large blob file
f = daft.File("s3://bucket/data.blob", offset=1024, length=1024)

with f.open() as fh:
    data = fh.read()  # returns exactly 1024 bytes
```
This also works inside UDFs:
```python
@daft.func
def read_record(file: daft.File) -> bytes:
    with file.open() as f:
        return f.read()

# Construct File references with per-row offsets
df = daft.from_pydict({
    "url": ["s3://bucket/blob"] * 3,
    "offset": [0, 100, 200],
    "length": [100, 100, 50],
})
```
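One way to apply those per-row byte ranges is to build the `daft.File` inside a row-wise function, using the constructor shown above. This is a sketch that folds the construction and the read into a single `@daft.func` for illustration:

```python
@daft.func
def read_range(url: str, offset: int, length: int) -> bytes:
    # Build a File scoped to the given byte range, then read it in full
    f = daft.File(url, offset=offset, length=length)
    with f.open() as fh:
        return fh.read()

df = df.with_column("record", read_range(df["url"], df["offset"], df["length"]))
df.collect()
```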
Since `daft.File` exposes a file-like interface with a `read` method, it works with any file format. For example, we can read Python source files and walk the AST for use cases like extracting functions and their signatures for codebase intelligence.
```python
import daft

@daft.func(
    return_dtype=daft.DataType.list(
        daft.DataType.struct(
            {
                "name": daft.DataType.string(),
                "signature": daft.DataType.string(),
                "docstring": daft.DataType.string(),
                "start_line": daft.DataType.int64(),
                "end_line": daft.DataType.int64(),
            }
        )
    )
)
def extract_functions(file: daft.File):
    """Extract all function definitions from a Python file."""
    import ast

    with file.open() as f:
        file_content = f.read().decode("utf-8")

    tree = ast.parse(file_content)
    results = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature = f"def {node.name}({ast.unparse(node.args)})"
            if node.returns:
                signature += f" -> {ast.unparse(node.returns)}"
            results.append({
                "name": node.name,
                "signature": signature,
                "docstring": ast.get_docstring(node),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return results


if __name__ == "__main__":
    from daft.functions import file, unnest

    # Discover Python files
    df = (
        daft.from_glob_path("~/git/Daft/daft/functions/**/*.py")  # Add your own path here
        .with_column("file", file(daft.col("path")))
        .with_column("functions", extract_functions(daft.col("file")))
        .explode("functions")
        .select(daft.col("path"), unnest(daft.col("functions")))
    )

    df.show(3)  # Show the first 3 rows of the dataframe
```
╭────────────────────────────────┬──────────────────────────────┬────────────────────────────────┬────────────────────────────────┬────────────╮
│ path                           ┆ name                         ┆ signature                      ┆ docstring                      ┆ start_line │
│ ---                            ┆ ---                          ┆ ---                            ┆ ---                            ┆ ---        │
│ String                         ┆ String                       ┆ String                         ┆ String                         ┆ Int64      │
╞════════════════════════════════╪══════════════════════════════╪════════════════════════════════╪════════════════════════════════╪════════════╡
│ file:///Users/myusername007/g… ┆ monotonically_increasing_id  ┆ def monotonically_increasing_… ┆ Generates a column of monoton… ┆ 14         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ file:///Users/myusername007/g… ┆ eq_null_safe                 ┆ def eq_null_safe(left: Expre…  ┆ Performs a null-safe equality… ┆ 52         │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ file:///Users/myusername007/g… ┆ cast                         ┆ def cast(expr: Expression, dt… ┆ Casts an expression to the gi… ┆ 68         │
╰────────────────────────────────┴──────────────────────────────┴────────────────────────────────┴────────────────────────────────┴────────────╯
(Showing first 3 rows)