python/docs/source/user_guide/loadandbehold.ipynb
!pip install pyspark==4.0.0.dev2
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Data Loading and Storage Example") \
    .getOrCreate()
This section covers how to read and write data in various formats using PySpark. You'll learn how to load data from common file types (e.g., CSV, JSON, Parquet, ORC) and store data efficiently.
CSV is one of the most common formats for data exchange. Here's how to load a CSV file into a DataFrame:
csv_df = spark.read.csv("../data/employees.csv", header=True, inferSchema=True)
csv_df.show()
Explanation:
- header=True: Treats the first line as column names.
- inferSchema=True: Automatically infers the data types of columns.
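Note that inferSchema=True requires an extra pass over the data. For large or frequently read files, you can skip inference by declaring the schema yourself. Here is a minimal sketch, assuming hypothetical column names and types for employees.csv:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema; adjust field names and types to match your file
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True),
])

typed_df = spark.read.csv("../data/employees.csv", header=True, schema=schema)
typed_df.printSchema()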
Loading JSON files is simple and allows you to handle both single-line and multi-line JSON structures:
json_df = spark.read.option("multiline", "true").json("../data/employees.json")
json_df.show()
Explanation:
- multiline="true": Allows reading multi-line JSON structures.
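By default, spark.read.json expects JSON Lines input (one JSON object per line), so the multiline option is only needed for pretty-printed files. A minimal sketch, assuming a hypothetical JSON Lines file employees.jsonl:

# Hypothetical JSON Lines file: one object per line, no extra option needed
jsonl_df = spark.read.json("../data/employees.jsonl")
jsonl_df.show()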
Parquet is a columnar format that supports efficient data compression and encoding:
parquet_df = spark.read.parquet("../data/employees.parquet")
parquet_df.show()
Tip: Parquet files are highly efficient for storing data due to columnar storage and compression.
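Because the data is stored by column, Spark only needs to read the columns a query selects. A small illustration, assuming the file contains a hypothetical name column:

# Only the selected column is read from disk, thanks to columnar storage
spark.read.parquet("../data/employees.parquet").select("name").show()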
ORC is another columnar file format, often used in Hadoop environments:
orc_df = spark.read.orc("../data/employees.orc")
orc_df.show()
Writing a DataFrame back out is just as straightforward. Here's how to save data in CSV format:
csv_df.write.csv("../data/employees_out.csv", mode="overwrite", header=True)
Explanation:
- mode="overwrite": If the directory already exists, it is replaced.
- header=True: Writes the column names as the first line.
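Besides overwrite, the writer also accepts the save modes append, ignore, and error (the default, which fails if the path already exists). A short sketch of the other modes:

# "append" adds new files to the existing output directory instead of replacing it
csv_df.write.csv("../data/employees_out.csv", mode="append", header=True)
# "ignore" silently skips the write if the path already exists
csv_df.write.csv("../data/employees_out.csv", mode="ignore", header=True)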
Parquet format is recommended for large datasets:
parquet_df.write.parquet("../data/employees_out.parquet", mode="overwrite")
Writing in ORC format works the same way:
json_df.write.orc("../data/employees_out.orc", mode="overwrite")
Tip: Parquet and ORC formats are best for efficient storage and quick reads.
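For large datasets it can also pay off to partition the output by a column, so that later reads can skip irrelevant files. A minimal sketch, assuming a hypothetical department column:

# Creates one subdirectory per distinct value of the partition column
parquet_df.write.partitionBy("department").parquet(
    "../data/employees_by_department.parquet", mode="overwrite"
)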
You can customize how data is read and written by using additional options. Here are a few examples:
spark.read.option("delimiter", ";").csv("../data/employees.csv").show(truncate=False)
spark.read.option("nullValue", "NULL").csv("../data/employees.csv").show(truncate=False)
parquet_df.write.option("compression", "gzip").parquet("../data/employees_out.parquet", mode="overwrite")
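Several options can also be set in a single call with options(). For example, combining the read options shown above:

# options() sets multiple read options at once
spark.read.options(header=True, delimiter=",", nullValue="NULL") \
    .csv("../data/employees.csv").show(truncate=False)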
See the PySpark API reference for Input/Output to check all supported functions and options.