Chapter 7: Load and Behold - Data loading, storage, file formats

python
!pip install pyspark==4.0.0.dev2
python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Data Loading and Storage Example") \
    .getOrCreate()

This section covers how to read and write data in various formats using PySpark. You'll learn how to load data from common file types (e.g., CSV, JSON, Parquet, ORC) and store data efficiently.

Reading Data

1.1 Reading CSV Files

CSV is one of the most common formats for data exchange. Here's how to load a CSV file into a DataFrame:

python
csv_df = spark.read.csv("../data/employees.csv", header=True, inferSchema=True)
csv_df.show()

Explanation:

  • header=True: Treats the first line as column names.
  • inferSchema=True: Automatically infers the data types of columns (this costs an extra pass over the data; an explicit schema, sketched below, avoids it).
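
If you already know the column layout, passing an explicit schema skips the inference pass and guarantees stable types. A minimal sketch, assuming hypothetical name, age, and salary columns:

python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical columns -- adjust to match your file.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", DoubleType(), True),
])

csv_df = spark.read.csv("../data/employees.csv", header=True, schema=schema)
csv_df.printSchema()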

1.2 Reading JSON Files

JSON files are simple to load, and PySpark handles both single-line (JSON Lines) and multi-line structures:

python
json_df = spark.read.option("multiline", "true").json("../data/employees.json")
json_df.show()

Explanation:

  • multiline="true": Allows a single JSON record to span multiple lines; by default, Spark expects one record per line (see the sketch below).
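
A minimal sketch of both layouts; the line-delimited file employees.jsonl is hypothetical:

python
# Default: JSON Lines, i.e. one complete JSON object per line.
jsonl_df = spark.read.json("../data/employees.jsonl")

# multiline="true": a single JSON document (or array) spread over several lines.
multiline_df = spark.read.option("multiline", "true").json("../data/employees.json")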

1.3 Reading Parquet Files

Parquet is a columnar format that supports efficient data compression and encoding:

python
parquet_df = spark.read.parquet("../data/employees.parquet")
parquet_df.show()

Tip: Parquet files are highly efficient for storing data due to columnar storage and compression.
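
Because Parquet stores values column by column, Spark reads only the columns a query touches instead of scanning whole rows. A small sketch, assuming the file has name and salary columns:

python
# Only the selected columns are read from disk (column pruning).
parquet_df.select("name", "salary").show()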

1.4 Reading ORC Files

ORC is another columnar file format, often used in Hadoop environments:

python
orc_df = spark.read.orc("../data/employees.orc")
orc_df.show()
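
The readers shown above also accept lists of paths and glob patterns, which helps when data is spread across many files or directories. A sketch with hypothetical paths:

python
# A glob pattern over a directory of ORC files (hypothetical layout).
orc_glob_df = spark.read.orc("../data/orc/*.orc")

# A list of explicit paths works the same way.
orc_list_df = spark.read.orc(["../data/2023.orc", "../data/2024.orc"])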

Writing Data

2.1 Writing Data as CSV

python
csv_df.write.csv("../data/employees_out.csv", mode="overwrite", header=True)

Explanation:

  • mode="overwrite": If the output directory exists, it is replaced (other save modes are sketched after this list).
  • header=True: Writes the column names as the first line.
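
Besides "overwrite", the writer accepts a few other save modes. A quick sketch:

python
# "append": add new files to an existing output directory.
csv_df.write.csv("../data/employees_out.csv", mode="append", header=True)

# "ignore": silently skip the write if the directory already exists.
csv_df.write.csv("../data/employees_out.csv", mode="ignore", header=True)

# "error" (the default): raise an error if the directory already exists.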

2.2 Writing Data as Parquet

Parquet format is recommended for large datasets:

python
parquet_df.write.parquet("../data/employees_out.parquet", mode="overwrite")
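
For large datasets it is often worth partitioning the output by a column that queries frequently filter on, so readers can skip entire subdirectories. A sketch, assuming a hypothetical department column:

python
# Writes one subdirectory per distinct department value (enables partition pruning).
parquet_df.write.partitionBy("department").parquet(
    "../data/employees_out_partitioned.parquet", mode="overwrite"
)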

2.3 Writing Data as ORC

python
json_df.write.orc("../data/employees_out.orc", mode="overwrite")

Tip: Parquet and ORC formats are best for efficient storage and quick reads.

Additional Options and Configurations

You can customize how data is read and written by using additional options. Here are a few examples:

Custom Delimiter in CSV:

python
spark.read.option("delimiter", ";").csv("../data/employees.csv").show(truncate=False)

Handling Null Values:

python
spark.read.option("nullValue", "NULL").csv("../data/employees.csv").show(truncate=False)

Compression Options:

python
parquet_df.write.option("compression", "gzip").parquet("../data/employees_out.parquet", mode="overwrite")
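
The format-specific shortcuts above are equivalent to the generic format()/load() and format()/save() API, which is convenient when the format name comes from configuration or when many options are chained:

python
# Generic reader, equivalent to spark.read.csv(..., header=True, inferSchema=True).
generic_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("../data/employees.csv")
)

# Generic writer, equivalent to generic_df.write.parquet(..., mode="overwrite").
generic_df.write.format("parquet").mode("overwrite").save("../data/employees_out.parquet")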

See the PySpark API reference for Input/Output for the full list of supported functions and options.