docs/PYSPARK.md
DataHub's S3 source now makes PySpark optional through the s3-slim variant, letting users choose a lightweight installation when data lake profiling is not needed.
The S3 source includes PySpark by default for backward compatibility and profiling support. For users who only need metadata extraction without profiling, the s3-slim variant provides a ~500MB smaller installation.
Current implementation status:
Note: This change implements the SparkProfiler pattern for S3 only. The same pattern can be applied to other sources (ABS, etc.) in future PRs.
Current Version: PySpark 3.5.x (3.5.6)
PySpark 4.0 support is planned for a future release. Until then, all DataHub components use PySpark 3.5.x for compatibility and stability.
```bash
pip install 'acryl-datahub[s3]'  # S3 with PySpark/profiling support
```
For installations where you don't need profiling capabilities and want to save ~500MB:
```bash
pip install 'acryl-datahub[s3-slim]'  # S3 without profiling (~500MB smaller)
```
Recommendation: Use s3-slim when profiling is not needed.
The data-lake-profiling dependencies (included in standard s3 by default):
- pyspark~=3.5.6
- pydeequ>=1.1.0

Note: In a future major release (e.g., DataHub 2.0), the s3-slim variant may become the default, and PySpark will be truly optional. This current approach provides backward compatibility while giving users time to adapt.
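For context, pip extras like these are declared in a package's build configuration. The following is a hypothetical sketch of how an s3/s3-slim split could be expressed with setuptools; the package name and the boto3 base dependency are illustrative assumptions, not DataHub's actual setup.py:

```python
# Hypothetical setup.py sketch (illustrative only, not DataHub's real build config).
from setuptools import setup

# Profiling stack pulled in only by the full "s3" extra.
data_lake_profiling = [
    "pyspark~=3.5.6",
    "pydeequ>=1.1.0",
]

s3_base = ["boto3"]  # assumed metadata-only dependency, for illustration

setup(
    name="example-package",
    extras_require={
        "s3-slim": s3_base,                   # metadata extraction only
        "s3": s3_base + data_lake_profiling,  # slim deps + profiling stack
    },
)
```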
The S3 source now ships in two flavors: the standard s3 extra (full profiling support) and the s3-slim variant (metadata extraction only). The table below compares them:
| Feature | s3-slim | Standard s3 |
|---|---|---|
| Metadata extraction | ✅ Full support | ✅ Full support |
| Schema inference | ✅ Full support | ✅ Full support |
| Tags & properties | ✅ Full support | ✅ Full support |
| Data profiling | ❌ Not available | ✅ Full profiling |
| Installation size | ~200MB | ~700MB |
| Install time | Fast | Slower (PySpark build) |
| PySpark dependencies | ❌ None | ✅ PySpark 3.5.6 + PyDeequ |
When you install acryl-datahub[s3], profiling works out of the box:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: true  # Works seamlessly with standard installation
      profile_table_level_only: false
```
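If you prefer running recipes from Python rather than the CLI, the same recipe can be executed with DataHub's Pipeline API. A minimal sketch, assuming the standard s3 extra is installed; the console sink and the bucket path are illustrative:

```python
# Sketch: run the recipe above programmatically (console sink assumed).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "s3://my-bucket/data/**/*.parquet"}],
                "profiling": {"enabled": True},
            },
        },
        "sink": {"type": "console"},  # illustrative sink choice
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if ingestion reported errors
```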
When you install s3-slim, disable profiling in your config:
```yaml
source:
  type: s3
  config:
    path_specs:
      - include: s3://my-bucket/data/**/*.parquet
    profiling:
      enabled: false  # Required for s3-slim installation
```
If you enable profiling with s3-slim installation, you'll see a clear error message at runtime:
```
RuntimeError: PySpark is not installed, but is required for S3 profiling.
Please install with: pip install 'acryl-datahub[s3]'
```
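Internally, this kind of error comes from a lazy-import guard. A minimal sketch of the pattern, with the error message copied from above; the helper name is illustrative, not necessarily DataHub's exact code:

```python
# Sketch: lazy-import guard that raises a clear error when PySpark is missing.
def _require_pyspark() -> None:
    try:
        import pyspark  # noqa: F401  # imported only when profiling is requested
    except ImportError as e:
        raise RuntimeError(
            "PySpark is not installed, but is required for S3 profiling. "
            "Please install with: pip install 'acryl-datahub[s3]'"
        ) from e
```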
The S3 source demonstrates the recommended pattern for isolating PySpark-dependent code. This pattern can be applied to ABS and other sources in future PRs.
Architecture (currently implemented for S3 only):
- Source module (source.py) - Contains no PySpark imports at module level
- Profiling module (profiling.py) - Encapsulates all PySpark/PyDeequ logic in the SparkProfiler class
- Lazy instantiation - A SparkProfiler is created only when profiling is enabled

Key Benefits:

- The source module imports cleanly without PySpark, which is what makes -slim installations possible

Example structure:
```python
# source.py
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Import only for type checkers; nothing PySpark-related runs at import time.
    from datahub.ingestion.source.s3.profiling import SparkProfiler


class S3Source:
    profiler: Optional["SparkProfiler"]

    def __init__(self, config, ctx):
        if config.is_profiling_enabled():
            # Deferred import: PySpark is pulled in only when profiling is on.
            from datahub.ingestion.source.s3.profiling import SparkProfiler

            self.profiler = SparkProfiler(...)
        else:
            self.profiler = None
```
```python
# profiling.py
from typing import Any


class SparkProfiler:
    """Encapsulates all PySpark/PyDeequ profiling logic."""

    def init_spark(self) -> Any:
        # Spark session initialization
        ...

    def read_file_spark(self, file: str, ext: str):
        # File reading with Spark
        ...

    def get_table_profile(self, table_data, dataset_urn):
        # Table profiling coordination
        ...
```
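A quick way to verify the isolation this structure provides: importing the source module should not drag in PySpark. A sketch, assuming the module path shown in the example above:

```python
# Sketch: assert that importing the S3 source module does not import PySpark.
import importlib
import sys

importlib.import_module("datahub.ingestion.source.s3.source")
assert "pyspark" not in sys.modules, "pyspark was imported at module load time"
print("OK: no module-level PySpark import")
```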
For more details, see the Adding a Metadata Ingestion Source guide.
Problem: You installed a -slim variant but have profiling enabled in your config.
Solutions:
Recommended: Use standard installation with PySpark:
```bash
pip uninstall acryl-datahub
pip install 'acryl-datahub[s3]'  # For S3 profiling
```
Alternative: Disable profiling in your recipe:
```yaml
profiling:
  enabled: false
```
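You can also detect programmatically whether PySpark is available before deciding how to set this flag; a minimal sketch:

```python
# Sketch: detect PySpark availability without importing it eagerly.
import importlib.util

profiling_available = importlib.util.find_spec("pyspark") is not None
print(f"Set profiling.enabled to: {str(profiling_available).lower()}")
```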
Check if PySpark is installed:
```bash
# Check installed packages
pip list | grep pyspark

# Test import in Python
python -c "import pyspark; print(pyspark.__version__)"
```
Expected output:
- Standard installation (s3): shows pyspark 3.5.x
- Slim installation (s3-slim): import fails or package not found

No action required! This change is fully backward compatible:
```bash
# Existing installations continue to work exactly as before
pip install 'acryl-datahub[s3]'  # Still includes PySpark by default (profiling supported)
```
Recommended: Optimize installations
- Keep acryl-datahub[s3] if you need profiling (includes PySpark)
- Switch to acryl-datahub[s3-slim] to save ~500MB if you don't

```bash
# Recommended installations
pip install 'acryl-datahub[s3]'       # S3 with profiling support
pip install 'acryl-datahub[s3-slim]'  # S3 metadata only (no profiling)
```
This implementation maintains full backward compatibility:
- Standard s3 extra includes PySpark (unchanged behavior)
- New s3-slim variant available for users who want smaller installations

DataHub Actions depends on acryl-datahub and can benefit from s3-slim when profiling is not needed:
DataHub Actions typically doesn't need data lake profiling capabilities since it focuses on reacting to metadata events, not extracting metadata from data lakes. Use s3-slim to reduce footprint:
```bash
# If Actions needs S3 metadata access but not profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3-slim]'
# Result: ~500MB smaller than standard s3 extra
```

```bash
# If Actions needs full S3 with profiling
pip install acryl-datahub-actions
pip install 'acryl-datahub[s3]'
# Result: Includes PySpark for profiling capabilities
```
Actions services using s3-slim deploy faster in containerized environments, since images are roughly 500MB smaller.
Actions workflows often integrate with other tools (Slack, Teams, email services). Using s3-slim reduces the overall dependency footprint alongside those integrations.
If your Actions workflow needs to trigger data lake profiling jobs, use the standard extra:
```bash
# Actions with data lake profiling capability
pip install 'acryl-datahub-actions'
pip install 'acryl-datahub[s3]'  # Includes PySpark by default
```
Common Actions use cases that DON'T need PySpark:

- Reacting to metadata change events
- Sending notifications via integrations such as Slack, Teams, or email

Rare Actions use cases that MIGHT need PySpark:

- Running or triggering data lake profiling jobs in-process
✅ Backward compatible: Standard s3 extra unchanged, existing users unaffected
✅ Smaller installations: Save ~500MB with s3-slim
✅ Faster setup: No PySpark compilation with s3-slim
✅ Flexible deployment: Choose based on profiling needs
✅ Type safety maintained: Refactored with proper code layer separation (mypy passes)
✅ Clear error messages: Runtime errors guide users to correct installation
✅ Actions-friendly: DataHub Actions benefits from reduced footprint with s3-slim
Key Takeaways:
- Use s3 if you need S3 profiling, s3-slim if you don't