docs/PYSPARK.md
DataHub's S3 source now makes PySpark optional through the s3-slim variant, letting users choose a lightweight installation when data lake profiling is not needed.
The S3 source includes PySpark by default for backward compatibility and profiling support. For users who only need metadata extraction without profiling, the s3-slim variant provides a ~500MB smaller installation.
Current implementation status:
Note: This change implements the SparkProfiler pattern for S3 only. The same pattern can be applied to other sources (ABS, etc.) in future PRs.
Current Version: PySpark 3.5.x (3.5.6)
PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.
```bash
pip install 'acryl-datahub[s3]'  # S3 with PySpark/profiling support
```
For installations where you don't need profiling capabilities and want to save ~500MB:
```bash
pip install 'acryl-datahub[s3-slim]'  # S3 without profiling (~500MB smaller)
```
Recommendation: Use s3-slim when profiling is not needed.
The data-lake-profiling dependencies (included in standard s3 by default):
- pyspark~=3.5.6
- pydeequ>=1.1.0

Note: In a future major release (e.g., DataHub 2.0), the s3-slim variant may become the default, and PySpark will be truly optional. This current approach provides backward compatibility while giving users time to adapt.
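For context, pip extras like these are declared in a package's build configuration. The following is a hypothetical sketch of how an s3/s3-slim split could be expressed with setuptools; the package name and the boto3 base dependency are illustrative assumptions, not DataHub's actual setup.py:

```python
# Hypothetical setup.py sketch (illustrative only, not DataHub's real build config).
from setuptools import setup

# Profiling stack pulled in only by the full "s3" extra.
data_lake_profiling = [
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
]

s3_base = ["boto3"]  # assumed metadata-only dependency, for illustration

setup(
    name="example-package",
    extras_require={
        "s3-slim": s3_base,                   # metadata extraction only
        "s3": s3_base + data_lake_profiling,  # slim deps + profiling stack
    },
)
```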
The S3 source now ships in two flavors: the standard s3 extra (full profiling support) and the s3-slim variant (metadata extraction only). The table below compares them:
| Feature | s3-slim | Standard s3 |
|---|---|---|
| Metadata extraction | ✅ Full support | ✅ Full support |
| Schema inference | ✅ Full support | ✅ Full support |
| Tags & properties | ✅ Full support | ✅ Full support |
| Data profiling | ❌ Not available | ✅ Full profiling |
| Installation size | ~200MB | ~700MB |
| Install time | Fast | Slower (PySpark build) |
| PySpark dependencies | ❌ None | ✅ PySpark 3.5.6 + PyDeequ |
When you install acryl-datahub[s3], profiling works out of the box:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true  # Works seamlessly with standard installation
      profile_table_level_only: false
```
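If you prefer running recipes from Python rather than the CLI, the same recipe can be executed with DataHub's Pipeline API. A minimal sketch, assuming the standard s3 extra is installed; the console sink and the bucket path are illustrative:

```python
# Sketch: run the recipe above programmatically (console sink assumed).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/**/*.parquet"}],
                "profiling": {"enabled": True},
            },
        },
        "sink": {"type": "console"},  # illustrative sink choice
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if ingestion reported errors
```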
When you install s3-slim, disable profiling in your config:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false  # Required for s3-slim installation
```
If you enable profiling with s3-slim installation, you'll see a clear error message at runtime:
```
RuntimeError: PySpark is not installed, but is required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3]'
```
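Internally, this kind of error comes from a lazy-import guard. A minimal sketch of the pattern, with the error message copied from above; the helper name is illustrative, not necessarily DataHub's exact code:

```python
# Sketch: lazy-import guard that raises a clear error when PySpark is missing.
def _require_pyspark() -> None:
    try:
        import pyspark  # noqa: F401  # imported only when profiling is requested
    except ImportError as e:
        raise RuntimeError(
            "PySpark is not installed, but is required for S3 profiling. "
            "Please install with: pip install 'acryl-datahub[s3]'"
        ) from e
```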
The S3 source demonstrates the recommended pattern for isolating PySpark-dependent code. This pattern can be applied to ABS and other sources in future PRs.
Architecture (currently implemented for S3 only):
- Source module (source.py) - Contains no PySpark imports at module level
- Profiling module (profiling.py) - Encapsulates all PySpark/PyDeequ logic in the SparkProfiler class
- Lazy instantiation - A SparkProfiler is created only when profiling is enabled

Key Benefits:

- The source module imports cleanly without PySpark, which is what makes -slim installations possible

Example structure:
```python
# source.py
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Import only for type checkers; nothing PySpark-related runs at import time.
    from datahub.ingestion.source.s3.profiling import SparkProfiler


class S3Source:
    profiler: Optional["SparkProfiler"]

    def __init__(self, config, ctx):
        if config.is_profiling_enabled():
            # Deferred import: PySpark is pulled in only when profiling is on.
            from datahub.ingestion.source.s3.profiling import SparkProfiler

            self.profiler = SparkProfiler(...)
        else:
            self.profiler = None
```
```python
# profiling.py
from typing import Any


class SparkProfiler:
    """Encapsulates all PySpark/PyDeequ profiling logic."""

    def init_spark(self) -> Any:
        # Spark session initialization
        ...

    def read_file_spark(self, file: str, ext: str):
        # File reading with Spark
        ...

    def get_table_profile(self, table_data, dataset_urn):
        # Table profiling coordination
        ...
```
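A quick way to verify the isolation this structure provides: importing the source module should not drag in PySpark. A sketch, assuming the module path shown in the example above:

```python
# Sketch: assert that importing the S3 source module does not import PySpark.
import importlib
import sys

importlib.import_module("datahub.ingestion.source.s3.source")
assert "pyspark" not in sys.modules, "pyspark was imported at module load time"
print("OK: no module-level PySpark import")
```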
For more details, see the Adding a Metadata Ingestion Source guide.
Problem: You installed a -slim variant but have profiling enabled in your config.
Solutions:
Recommended: Use standard installation with PySpark:
```bash
pip uninstall acryl-datahub
pip install 'acryl-datahub[s3]'  # For S3 profiling
```
Alternative: Disable profiling in your recipe:
```yaml
profiling:
  enabled: false
```
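You can also detect programmatically whether PySpark is available before deciding how to set this flag; a minimal sketch:

```python
# Sketch: detect PySpark availability without importing it eagerly.
import importlib.util

profiling_available = importlib.util.find_spec("pyspark") is not None
print(f"Set profiling.enabled to: {str(profiling_available).lower()}")
```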
Check if PySpark is installed:
```bash
# Check installed packages
pip list | grep pyspark

# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```
Expected output:
- Standard installation (s3): shows pyspark 3.5.x
- Slim installation (s3-slim): import fails or package not found

No action required! This change is fully backward compatible:
```bash
# Existing installations continue to work exactly as before
pip install 'acryl-datahub[s3]'  # Still includes PySpark by default (profiling supported)
```
Recommended: Optimize installations
- Keep acryl-datahub[s3] if you need profiling (includes PySpark)
- Switch to acryl-datahub[s3-slim] to save ~500MB if you don't

```bash
# Recommended installations
pip install 'acryl-datahub[s3]'       # S3 with profiling support
pip install 'acryl-datahub[s3-slim]'  # S3 metadata only (no profiling)
```
This implementation maintains full backward compatibility:
- Standard s3 extra includes PySpark (unchanged behavior)
- New s3-slim variant available for users who want smaller installations

DataHub Actions depends on acryl-datahub and can benefit from s3-slim when profiling is not needed:
DataHub Actions typically doesn't need data lake profiling capabilities since it focuses on reacting to metadata events, not extracting metadata from data lakes. Use s3-slim to reduce footprint:
```bash
# If Actions needs S3 metadata access but not profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3-slim]'
# Result: ~500MB smaller than standard s3 extra
```

```bash
# If Actions needs full S3 with profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3]'
# Result: Includes PySpark for profiling capabilities
```
Actions services using s3-slim deploy faster in containerized environments, since images are roughly 500MB smaller.
Actions workflows often integrate with other tools (Slack, Teams, email services). Using s3-slim reduces the overall dependency footprint alongside those integrations.
If your Actions workflow needs to trigger data lake profiling jobs, use the standard extra:
```bash
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3]'  # Includes PySpark by default
```
Common Actions use cases that DON'T need PySpark:

- Reacting to metadata change events
- Sending notifications via integrations such as Slack, Teams, or email

Rare Actions use cases that MIGHT need PySpark:

- Running or triggering data lake profiling jobs in-process
✅ Backward compatible: Standard s3 extra unchanged, existing users unaffected
✅ Smaller installations: Save ~500MB with s3-slim
✅ Faster setup: No PySpark compilation with s3-slim
✅ Flexible deployment: Choose based on profiling needs
✅ Type safety maintained: Refactored with proper code layer separation (mypy passes)
✅ Clear error messages: Runtime errors guide users to correct installation
✅ Actions-friendly: DataHub Actions benefits from reduced footprint with s3-slim
Key Takeaways:
- Use s3 if you need S3 profiling, s3-slim if you don't