Chapter 7: Load and Behold - Data loading, storage, file formats

python
!pip install pyspark==4.0.0.dev2
python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Data Loading and Storage Example") \
    .getOrCreate()

This section covers how to read and write data in various formats using PySpark. You'll learn how to load data from common file types (e.g., CSV, JSON, Parquet, ORC) and store data efficiently.

Reading Data

1.1 Reading CSV Files

CSV is one of the most common formats for data exchange. Here's how to load a CSV file into a DataFrame:

python
csv_df = spark.read.csv("../data/employees.csv", header=True, inferSchema=True)
csv_df.show()

Explanation:

  • header=True: Treats the first line as column names.
  • inferSchema=True: Automatically infers the data types of columns (this costs an extra pass over the data; an explicit schema, sketched below, avoids it).
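
If you already know the column layout, passing an explicit schema skips the inference pass and guarantees stable types. A minimal sketch, assuming hypothetical name, age, and salary columns:

python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical columns -- adjust to match your file.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
])

csv_df = spark.read.csv("../data/employees.csv", header=True, schema=schema)
csv_df.printSchema()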

1.2 Reading JSON Files

JSON files are simple to load, and PySpark handles both single-line (JSON Lines) and multi-line structures:

python
json_df = spark.read.option("multiline", "true").json("../data/employees.json")
json_df.show()

Explanation:

  • multiline="true": Allows a single JSON record to span multiple lines; by default, Spark expects one record per line (see the sketch below).
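
A minimal sketch of both layouts; the line-delimited file employees.jsonl is hypothetical:

python
# Default: JSON Lines, i.e. one complete JSON object per line.
jsonl_df = spark.read.json("../data/employees.jsonl")

# multiline="true": a single JSON document (or array) spread over several lines.
multiline_df = spark.read.option("multiline", "true").json("../data/employees.json")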

1.3 Reading Parquet Files

Parquet is a columnar format that supports efficient data compression and encoding:

python
parquet_df = spark.read.parquet("../data/employees.parquet")
parquet_df.show()

Tip: Parquet files are highly efficient for storing data due to columnar storage and compression.
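
Because Parquet stores values column by column, Spark reads only the columns a query touches instead of scanning whole rows. A small sketch, assuming the file has name and salary columns:

python
# Only the selected columns are read from disk (column pruning).
parquet_df.select("name", "salary").show()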

1.4 Reading ORC Files

ORC is another columnar file format, often used in Hadoop environments:

python
orc_df = spark.read.orc("../data/employees.orc")
orc_df.show()
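
The readers shown above also accept lists of paths and glob patterns, which helps when data is spread across many files or directories. A sketch with hypothetical paths:

python
# A glob pattern over a directory of ORC files (hypothetical layout).
orc_glob_df = spark.read.orc("../data/orc/*.orc")

# A list of explicit paths works the same way.
orc_list_df = spark.read.orc(["../data/2023.orc", "../data/2024.orc"])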

Writing Data

2.1 Writing Data as CSV

python
csv_df.write.csv("../data/employees_out.csv", mode="overwrite", header=True)

Explanation:

  • mode="overwrite": If the output directory exists, it is replaced (other save modes are sketched after this list).
  • header=True: Writes the column names as the first line.
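
Besides "overwrite", the writer accepts a few other save modes. A quick sketch:

python
# "append": add new files to an existing output directory.
csv_df.write.csv("../data/employees_out.csv", mode="append", header=True)

# "ignore": silently skip the write if the directory already exists.
csv_df.write.csv("../data/employees_out.csv", mode="ignore", header=True)

# "error" (the default): raise an error if the directory already exists.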

2.2 Writing Data as Parquet

Parquet format is recommended for large datasets:

python
parquet_df.write.parquet("../data/employees_out.parquet", mode="overwrite")
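
For large datasets it is often worth partitioning the output by a column that queries frequently filter on, so readers can skip entire subdirectories. A sketch, assuming a hypothetical department column:

python
# Writes one subdirectory per distinct department value (enables partition pruning).
parquet_df.write.partitionBy("department").parquet(
    "../data/employees_out_partitioned.parquet", mode="overwrite"
)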

2.3 Writing Data as ORC

python
json_df.write.orc("../data/employees_out.orc", mode="overwrite")

Tip: Parquet and ORC formats are best for efficient storage and quick reads.

Additional Options and Configurations

You can customize how data is read and written by using additional options. Here are a few examples:

Custom Delimiter in CSV:

python
spark.read.option("delimiter", ";").csv("../data/employees.csv").show(truncate=False)

Handling Null Values:

python
spark.read.option("nullValue", "NULL").csv("../data/employees.csv").show(truncate=False)

Compression Options:

python
parquet_df.write.option("compression", "gzip").parquet("../data/employees_out.parquet", mode="overwrite")
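
The format-specific shortcuts above are equivalent to the generic format()/load() and format()/save() API, which is convenient when the format name comes from configuration or when many options are chained:

python
# Generic reader, equivalent to spark.read.csv(..., header=True, inferSchema=True).
generic_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("../data/employees.csv")
)

# Generic writer, equivalent to generic_df.write.parquet(..., mode="overwrite").
generic_df.write.format("parquet").mode("overwrite").save("../data/employees_out.parquet")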

See the PySpark API reference for Input/Output for the full list of supported functions and options.