Daft provides powerful capabilities for working with URLs, file paths, and remote storage systems.
Whether you're loading data from local files, cloud storage, or the web, Daft's URL and file handling makes it seamless to work with distributed data sources. Daft supports working with:
- `file:///path/to/file`, `/path/to/file`
- `s3://bucket/path`, `s3a://bucket/path`, `s3n://bucket/path`
- `gs://bucket/path`
- `az://container/path`, `abfs://container/path`, `abfss://container/path`
- `http://example.com/path`, `https://example.com/path`
- `hf://dataset/name`
- `vol+dbfs:/Volumes/unity/path`

`daft.from_glob_path` helps discover and size files, accepting wildcards and lists of paths. When paired with `daft.functions.download`, the two functions enable optimized distributed reads of binary data from storage. This is ideal when your data fits in memory or when you need the entire file content at once.
=== "🐍 Python"

    ```python
    import daft

    df = daft.from_pydict(
        {
            "urls": [
                "https://www.google.com",
                "s3://daft-oss-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            ],
        }
    )
    df = df.with_column("data", df["urls"].download())
    df.collect()
    ```
=== "⚙️ SQL"

    ```python
    import daft

    df = daft.from_pydict(
        {
            "urls": [
                "https://www.google.com",
                "s3://daft-oss-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            ],
        }
    )
    df = daft.sql("""
        SELECT urls, url_download(urls) AS data
        FROM df
    """)
    df.collect()
    ```
```
╭────────────────────────────────────┬────────────────────────────────╮
│ urls                               │ data                           │
│ ---                                │ ---                            │
│ Utf8                               │ Binary                         │
╞════════════════════════════════════╪════════════════════════════════╡
│ https://www.google.com             │ b"<!doctype html><html itemsc… │
├────────────────────────────────────┼────────────────────────────────┤
│ s3://daft-oss-public-data/open-im… │ b"\xff\xd8\xff\xe0\x00\x10JFI… │
╰────────────────────────────────────┴────────────────────────────────╯

(Showing first 2 of 2 rows)
```
This works with HTTP(S) URLs that point to non-HTML files (e.g. JPEGs), with local file paths, and with paths to files in an object store such as AWS S3.
## The daft.File Datatype

`daft.File` is particularly useful for working with large files that don't fit in memory, or when you only need to access specific portions of a file. This is a common use case when working with audio or video data, where loading the entire object is prohibitive. The `daft.File` type is subclassed by the `daft.AudioFile` and `daft.VideoFile` types, which streamline common operations. It provides a Pythonic file-like interface with random access capabilities:
=== "🐍 Python"

    ```python
    import daft
    from daft.functions import file
    from daft.io import IOConfig, S3Config

    io_config = IOConfig(s3=S3Config(anonymous=True))

    df = daft.from_pydict(
        {
            "urls": [
                "https://www.google.com",
                "s3://daft-oss-public-data/open-images/validation-images/0001eeaf4aed83f9.jpg",
            ],
        }
    )

    @daft.func
    def detect_file_type(file: daft.File) -> str:
        # Read just the first 12 bytes to identify the file type
        with file.open() as f:
            header = f.read(12)

        # Common file signatures (magic numbers)
        if header.startswith(b"\xff\xd8\xff"):
            return "JPEG"
        elif header.startswith(b"\x89PNG\r\n\x1a\n"):
            return "PNG"
        elif header.startswith(b"GIF87a") or header.startswith(b"GIF89a"):
            return "GIF"
        elif header.startswith(b"<!") or header.startswith(b"<html"):
            return "HTML"
        elif header.startswith(b"HTTP/"):
            return "HTTP"
        else:
            return None

    df = df.with_column(
        "file_type",
        detect_file_type(file(df["urls"], io_config=io_config)),
    )
    df.collect()
    ```
```
╭────────────────────────────────────┬───────────╮
│ urls                               │ file_type │
│ ---                                │ ---       │
│ Utf8                               │ Utf8      │
╞════════════════════════════════════╪═══════════╡
│ https://www.google.com             │ HTML      │
├────────────────────────────────────┼───────────┤
│ s3://daft-oss-public-data/open-im… │ JPEG      │
╰────────────────────────────────────┴───────────╯

(Showing first 2 of 2 rows)
```
The `daft.File` datatype provides first-class support for handling file data across local and remote storage, enabling seamless file operations in distributed environments. While the Python classes provide the interface, the actual implementation lives in the Rust-based `PyDaftFile`, which maintains optimized backends for different storage types. This architecture allows storage-specific optimizations (such as network buffering for S3 or HTTP) while presenting a consistent interface.
## Using daft.File in DataFrames and as a Standalone Object

`daft.File` mirrors Python's file interface, but is optimized for distributed computing. Because it is lazy, `daft.File` does not read a file into memory until it is needed. To enforce this pattern, `daft.File` must be used inside a context manager, like `with file.open() as f:`. This works within `daft.func` or `daft.cls` user-defined functions, as well as in native Python code.
```python
import daft

@daft.func
def read_header(f: daft.File) -> bytes:
    with f.open() as fh:
        return fh.read(16)

df = (
    daft.from_glob_path("hf://datasets/Eventual-Inc/sample-files/**")
    .with_column("file", daft.functions.file(daft.col("path")))
    .with_column("header", read_header(daft.col("file")))
    .select("path", "size", "file", "header")
)
df.show(5)
```
```
╭────────────────────────────────┬─────────┬────────────────────────────────┬────────────────────────────────╮
│ path                           │ size    │ file                           │ header                         │
│ ---                            │ ---     │ ---                            │ ---                            │
│ String                         │ Int64   │ File[Unknown]                  │ Binary                         │
╞════════════════════════════════╪═════════╪════════════════════════════════╪════════════════════════════════╡
│ hf://datasets/Eventual-Inc/sa… │ 2534    │ Unknown(path: hf://datasets/E… │ b"*.7z filter=lfs "            │
├────────────────────────────────┼─────────┼────────────────────────────────┼────────────────────────────────┤
│ hf://datasets/Eventual-Inc/sa… │ 31      │ Unknown(path: hf://datasets/E… │ b"---\r\nlicense: ap"          │
├────────────────────────────────┼─────────┼────────────────────────────────┼────────────────────────────────┤
│ hf://datasets/Eventual-Inc/sa… │ 822924  │ Unknown(path: hf://datasets/E… │ b"\xff\xf3\x88\xc4\x00\x00\x0… │
├────────────────────────────────┼─────────┼────────────────────────────────┼────────────────────────────────┤
│ hf://datasets/Eventual-Inc/sa… │ 618408  │ Unknown(path: hf://datasets/E… │ b"\xff\xf3\x88\xc4\x00\x00\x0… │
├────────────────────────────────┼─────────┼────────────────────────────────┼────────────────────────────────┤
│ hf://datasets/Eventual-Inc/sa… │ 1190736 │ Unknown(path: hf://datasets/E… │ b"\xff\xf3\x88\xc4\x00\x00\x0… │
╰────────────────────────────────┴─────────┴────────────────────────────────┴────────────────────────────────╯

(Showing first 5 rows)
```
Since `daft.File` works with any file type and exposes a standard `read` method, we can use it to read source code and walk the AST, for use cases like extracting functions and their signatures for codebase intelligence.
```python
import daft

@daft.func(
    return_dtype=daft.DataType.list(
        daft.DataType.struct(
            {
                "name": daft.DataType.string(),
                "signature": daft.DataType.string(),
                "docstring": daft.DataType.string(),
                "start_line": daft.DataType.int64(),
                "end_line": daft.DataType.int64(),
            }
        )
    )
)
def extract_functions(file: daft.File):
    """Extract all function definitions from a Python file."""
    import ast

    with file.open() as f:
        file_content = f.read().decode("utf-8")

    tree = ast.parse(file_content)
    results = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature = f"def {node.name}({ast.unparse(node.args)})"
            if node.returns:
                signature += f" -> {ast.unparse(node.returns)}"
            results.append(
                {
                    "name": node.name,
                    "signature": signature,
                    "docstring": ast.get_docstring(node),
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                }
            )
    return results


if __name__ == "__main__":
    from daft.functions import file, unnest

    # Discover Python files
    df = (
        daft.from_glob_path("~/git/Daft/daft/functions/**/*.py")  # Add your own path here
        .with_column("file", file(daft.col("path")))
        .with_column("functions", extract_functions(daft.col("file")))
        .explode("functions")
        .select(daft.col("path"), unnest(daft.col("functions")))
    )
    df.show(3)  # Show the first 3 rows of the dataframe
```
```
╭────────────────────────────────┬─────────────────────────────┬────────────────────────────────┬────────────────────────────────┬────────────╮
│ path                           │ name                        │ signature                      │ docstring                      │ start_line │
│ ---                            │ ---                         │ ---                            │ ---                            │ ---        │
│ String                         │ String                      │ String                         │ String                         │ Int64      │
╞════════════════════════════════╪═════════════════════════════╪════════════════════════════════╪════════════════════════════════╪════════════╡
│ file:///Users/myusername007/g… │ monotonically_increasing_id │ def monotonically_increasing_… │ Generates a column of monoton… │ 14         │
├────────────────────────────────┼─────────────────────────────┼────────────────────────────────┼────────────────────────────────┼────────────┤
│ file:///Users/myusername007/g… │ eq_null_safe                │ def eq_null_safe(left: Expre…  │ Performs a null-safe equality… │ 52         │
├────────────────────────────────┼─────────────────────────────┼────────────────────────────────┼────────────────────────────────┼────────────┤
│ file:///Users/myusername007/g… │ cast                        │ def cast(expr: Expression, dt… │ Casts an expression to the gi… │ 68         │
╰────────────────────────────────┴─────────────────────────────┴────────────────────────────────┴────────────────────────────────┴────────────╯

(Showing first 3 rows)
```
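Because the extraction logic only needs raw bytes, it can be unit-tested without a dataframe or any remote storage. A sketch that factors the AST walk out of the UDF above (`extract_signatures` is our own helper name, not a Daft function):

```python
import ast


def extract_signatures(source_bytes: bytes) -> list[str]:
    """The core AST walk from the UDF above, applied to raw source bytes."""
    tree = ast.parse(source_bytes.decode("utf-8"))
    sigs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            sig = f"def {node.name}({ast.unparse(node.args)})"
            if node.returns:
                sig += f" -> {ast.unparse(node.returns)}"
            sigs.append(sig)
    return sigs


print(extract_signatures(b"def add(a: int, b: int) -> int:\n    return a + b"))
# ['def add(a: int, b: int) -> int']
```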