docs/release_notes/release_notes-0.16.0.rst

:orphan:

Modin 0.16.0

Key Features and Updates

  • Stability and Bugfixes
    • FIX-#4570: Replace np.bool -> np.bool_ (#4571)
    • FIX-#4543: Fix read_csv in case skiprows=<0, []> (#4544)
    • FIX-#4059: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
    • FIX-#4589: Pin protobuf<4.0.0 to fix ray (#4590)
    • FIX-#4577: Set attribute of Modin dataframe to updated value (#4588)
    • FIX-#4411: Fix binary_op between datetime64 Series and pandas timedelta (#4592)
    • FIX-#4604: Fix groupby + agg in case when multicolumn can arise (#4642)
    • FIX-#4582: Inherit custom log layer (#4583)
    • FIX-#4639: Fix storage_options usage for read_csv and read_csv_glob (#4644)
    • FIX-#4593: Ensure Modin warns when setting columns via attributes (#4621)
    • FIX-#4584: Enable pdb debug when running cloud tests (#4585)
    • FIX-#4564: Workaround import issues in Ray: auto-import pandas on python start if env var is set (#4603)
    • FIX-#4641: Reindex pandas partitions in df.describe() (#4651)
    • FIX-#2064: Fix iloc/loc assignment when dataframe is empty (#4677)
    • FIX-#4634: Check for FrozenList as by in df.groupby() (#4667)
    • FIX-#4680: Fix read_csv that started defaulting to pandas again in case of reading from a buffer and when a buffer has a non-zero starting position (#4681)
    • FIX-#4491: Wait for all partitions in parallel in benchmark mode (#4656)
    • FIX-#4358: MultiIndex loc shouldn't drop levels for full-key lookups (#4608)
    • FIX-#4658: Expand exception handling for read_* functions from s3 storages (#4659)
    • FIX-#4672: Fix incorrect warning when setting frame.index or frame.columns (#4721)
    • FIX-#4686: Propagate metadata and drain call queue in unwrap_partitions (#4697)
    • FIX-#4652: Support categorical data in from_dataframe (#4737)
    • FIX-#4756: Correctly propagate storage_options in read_parquet (#4764)
    • FIX-#4657: Use fsspec for handling s3/http-like paths instead of s3fs (#4710)
    • FIX-#4676: drain sub-virtual-partition call queues (#4695)
    • FIX-#4782: Exclude certain non-parquet files in read_parquet (#4783)
    • FIX-#4808: Set dtypes correctly after column rename (#4809)
    • FIX-#4811: Apply dataframe -> not_dataframe functions to virtual partitions (#4812)
    • FIX-#4099: Use mangled column names but keep the original when building frames from arrow (#4767)
    • FIX-#4838: Bump up modin-spreadsheet to latest master (#4839)
    • FIX-#4840: Change modin-spreadsheet version for notebook requirements (#4841)
    • FIX-#4835: Handle Pathlike paths in read_parquet (#4837)
    • FIX-#4872: Stop checking the private ray mac memory limit (#4873)
    • FIX-#4914: base_lengths should be computed from base_frame instead of self in copartition (#4915)
    • FIX-#4848: Fix rebalancing partitions when NPartitions == 1 (#4874)
    • FIX-#4927: Fix dtypes computation in dataframe.filter (#4928)
    • FIX-#4907: Implement radd for Series and DataFrame (#4908)
    • FIX-#4945: Fix _take_2d_positional that loses indexes due to filtering empty dataframes (#4951)
    • FIX-#4818, PERF-#4825: Fix where by using the new n-ary operator (#4820)
    • FIX-#3983, FIX-#4107: Materialize 'rowid' columns when selecting rows by position (#4834)
    • FIX-#4845: Fix KeyError from __getitem_bool for single row dataframes (#4845)
    • FIX-#4734: Handle Series.apply when return type is a DataFrame (#4830)
    • FIX-#4983: Set frac to None in _sample when n=0 (#4984)
    • FIX-#4993: Return _default_to_pandas in df.attrs (#4995)
    • FIX-#5043: Fix execute function in ASV utils failing when len(partitions) == 0 (#5044)
    • FIX-#4597: Refactor Partition handling of func, args, kwargs (#4715)
    • FIX-#4996: Evaluate BenchmarkMode at each function call (#4997)
    • FIX-#4022: Fixed empty data frame with index (#4910)
    • FIX-#4090: Fixed check if the index is trivial (#4936)
    • FIX-#4966: Fix to_timedelta to return Series instead of TimedeltaIndex (#5028)
    • FIX-#5042: Fix series getitem with invalid strings (#5048)
    • FIX-#4691: Fix binary operations between virtual partitions (#5049)
    • FIX-#5045: Fix ray virtual_partition.wait with duplicate object refs (#5058)
  • Performance enhancements
    • PERF-#4182: Add cell-wise execution for binary ops, fix bin ops for empty dataframes (#4391)
    • PERF-#4288: Improve perf of groupby.mean for narrow data (#4591)
    • PERF-#4772: Remove df.copy call from from_pandas since it is not needed for Ray and Dask (#4781)
    • PERF-#4325: Improve perf of multi-column assignment in __setitem__ when no new column names are being assigned (#4455)
    • PERF-#3844: Improve perf of drop operation (#4694)
    • PERF-#4727: Improve perf of concat operation (#4728)
    • PERF-#4705: Improve perf of arithmetic operations between Series objects with shared .index (#4689)
    • PERF-#4703: Improve performance in accessing ser.cat.categories, ser.cat.ordered, and ser.__array_priority__ (#4704)
    • PERF-#4305: Parallelize read_parquet over row groups (#4700)
    • PERF-#4773: Compute lengths and widths in put method of Dask partition as Ray does (#4780)
    • PERF-#4732: Avoid overwriting already-evaluated PandasOnRayDataframePartition._length_cache and PandasOnRayDataframePartition._width_cache (#4754)
    • PERF-#4862: Don't call compute_sliced_len.remote when row_labels/col_labels == slice(None) (#4863)
    • PERF-#4713: Stop overriding the ray MacOS object store size limit (#4792)
    • PERF-#4851: Compute dtypes for binary operations that can only return bool type and the right operand is not a Modin object (#4852)
    • PERF-#4842: copy should not trigger any previous computations (#4843)
    • PERF-#4849: Compute dtypes in concat also for ROW_WISE case when possible (#4850)
    • PERF-#4929: Compute dtype when using Series.dt accessor (#4930)
    • PERF-#4892: Compute lengths in rebalance_partitions when possible (#4893)
    • PERF-#4794: Compute caches in _propagate_index_objs (#4888)
    • PERF-#4860: PandasDataframeAxisPartition.deploy_axis_func should be serialized only once (#4861)
    • PERF-#4890: PandasDataframeAxisPartition.drain should be serialized only once (#4891)
    • PERF-#4870: Avoid index materialization in __getattribute__ and __getitem__ (#4911)
    • PERF-#4886: Use lazy index and columns evaluation in query method (#4887)
    • PERF-#4866: iloc function that is used in partition.mask should be serialized only once (#4901)
    • PERF-#4920: Avoid index and cache computations in take_2d_labels_or_positional unless they are needed (#4921)
    • PERF-#4999: Don't call apply in virtual partition's drain_call_queue if call_queue is empty (#4975)
    • PERF-#4268: Implement partition-parallel getitem for bool Series masks (#4753)
    • PERF-#5017: reset_index shouldn't trigger index materialization if possible (#5018)
    • PERF-#4963: Use partition width/length methods instead of _compute_axis_labels_and_lengths if index is already known (#4964)
    • PERF-#4940: Optimize categorical dtype check in concatenate (#4953)
  • Benchmarking enhancements
    • TEST-#5066: Add outer join case for TimeConcat benchmark (#5067)
    • TEST-#5083: Add merge op with categorical data (#5084)
    • FEAT-#4706: Add Modin ClassLogger to PandasDataframePartitionManager (#4707)
    • TEST-#5014: Simplify adding new ASV benchmarks (#5015)
    • TEST-#5064: Update TimeConcat benchmark with new parameter ignore_index (#5065)
    • PERF-#4944: Avoid default_to_pandas in Series.cat.codes, Series.dt.tz, and Series.dt.to_pytimedelta (#4833)
    • TEST-#5068: Add binary op benchmark for Series (#5069)
  • Refactor Codebase
    • REFACTOR-#4530: Standardize access to physical data in partitions (#4563)
    • REFACTOR-#4534: Replace logging meta class with class decorator (#4535)
    • REFACTOR-#4708: Delete combine dtypes (#4709)
    • REFACTOR-#4629: Add type annotations to modin/config (#4685)
    • REFACTOR-#4717: Improve PartitionMgr.get_indices() usage (#4718)
    • REFACTOR-#4730: make Indexer immutable (#4731)
    • REFACTOR-#4774: remove _build_treereduce_func call from _compute_dtypes (#4775)
    • REFACTOR-#4750: Delete BaseDataframeAxisPartition.shuffle (#4751)
    • REFACTOR-#4722: Stop suppressing undefined name lint (#4723)
    • REFACTOR-#4832: unify split_result_of_axis_func_pandas (#4831)
    • REFACTOR-#4796: Introduce constant for reduced column name (#4799)
    • REFACTOR-#4000: Remove code duplication for PandasOnRayDataframePartitionManager (#4895)
    • REFACTOR-#3780: Remove code duplication for PandasOnDaskDataframe (#3781)
    • REFACTOR-#4530: Unify access to physical data for any partition type (#4829)
    • REFACTOR-#4978: Align modin/core/execution/dask/common/__init__.py with modin/core/execution/ray/common/__init__.py (#4979)
    • REFACTOR-#4949: Remove code duplication in default2pandas/dataframe.py and default2pandas/any.py (#4950)
    • REFACTOR-#4976: Rename RayTask to RayWrapper in accordance with Dask (#4977)
    • REFACTOR-#4885: De-duplicated take_2d_labels_or_positional methods (#4883)
    • REFACTOR-#5005: Use finalize method instead of list comprehension + drain_call_queue (#5006)
    • REFACTOR-#5001: Remove jenkins stuff (#5002)
    • REFACTOR-#5026: Change exception names to simplify grepping (#5027)
    • REFACTOR-#4970: Rewrite base implementations of a partition's width/length (#4971)
    • REFACTOR-#4942: Remove call method in favor of register due to duplication (#4943)
    • REFACTOR-#4922: Helpers for take_2d_labels_or_positional (#4865)
    • REFACTOR-#5024: Make _row_lengths and _column_widths public (#5025)
    • REFACTOR-#5009: Use RayWrapper.materialize instead of ray.get (#5010)
    • REFACTOR-#4755: Rewrite Pandas version mismatch warning (#4965)
    • REFACTOR-#5012: Add mypy checks for singleton files in base modin directory (#5013)
    • REFACTOR-#5038: Remove unnecessary _method argument from resamplers (#5039)
    • REFACTOR-#5081: Remove c323f7fe385011ed849300155de07645.db file (#5082)
  • Pandas API implementations and improvements
    • FEAT-#4670: Implement convert_dtypes by mapping across partitions (#4671)
  • OmniSci enhancements
    • FEAT-#4913: Enabling pyhdk
  • XGBoost enhancements
  • Developer API enhancements
  • Update testing suite
    • TEST-#4508: Reduce test_partition_api pytest threads to deflake it (#4551)
    • TEST-#4550: Use much less data in test_partition_api (#4554)
    • TEST-#4610: Remove explicit installation of black/flake8 for omnisci ci-notebooks (#4609)
    • TEST-#2564: Add caching and use mamba for conda setups in GH (#4607)
    • TEST-#4557: Delete multiindex sorts instead of xfailing (#4559)
    • TEST-#4698: Stop passing invalid storage_options param (#4699)
    • TEST-#4745: Pin flake8 to <5 to workaround installation conflict (#4752)
    • TEST-#4875: XFail tests failing due to file gone missing (#4876)
    • TEST-#4879: Use pandas ensure_clean() in place of io_tests_data (#4881)
    • TEST-#4562: Use local Ray cluster in CI to resolve flaky test-compat-win (#5007)
    • TEST-#5040: Rework test_series using eval_general() (#5041)
    • TEST-#5050: Add black to pre-commit hook (#5051)
  • Documentation improvements
    • DOCS-#4552: Change default sphinx language to en to fix sphinx >= 5.0.0 build (#4553)
    • DOCS-#4628: Add to_parquet partial support notes (#4648)
    • DOCS-#4668: Set light theme for readthedocs page, remove theme switcher (#4669)
    • DOCS-#4748: Apply the Triage label to new issues (#4749)
    • DOCS-#4790: Give all templates issue type and triage labels (#4791)
    • DOCS-#4521: Document how to benchmark modin (#5020)
  • Dependencies
    • FEAT-#4598: Add support for pandas 1.4.3 (#4599)
    • FEAT-#4619: Integrate mypy static type checking (#4620)
    • FEAT-#4202: Allow dask past 2022.2.0 (#4769)
    • FEAT-#4925: Upgrade pandas to 1.4.4 (#4926)
    • TEST-#4998: Add flake8 plugins to dev requirements (#5000)
  • New Features
    • FEAT-#4463: Add experimental fuzzydata integration for testing against a randomized dataframe workflow (#4556)
    • FEAT-#4419: Extend virtual partitioning API to pandas on Dask (#4420)
    • FEAT-#4147: Add partial compatibility with Python 3.6 and pandas 1.1 (#4301)
    • FEAT-#4569: Add error message when read_ function defaults to pandas (#4647)
    • FEAT-#4725: Make index and columns lazy in Modin DataFrame (#4726)
    • FEAT-#4664: Finalize compatibility support for Python 3.6 (#4800)
    • FEAT-#4746: Sync interchange protocol with recent API changes (#4763)
    • FEAT-#4733: Support fastparquet as engine for read_parquet (#4807)
    • FEAT-#4766: Support fsspec URLs in read_csv and read_csv_glob (#4898)
    • FEAT-#4827: Implement infer_types dataframe algebra operator (#4871)
    • FEAT-#4989: Switch pandas version to 1.5 (#5037)
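
Most entries above appear to follow a `TYPE-#issue: summary (#PR)` pattern, for example `FIX-#4570: Replace np.bool -> np.bool_ (#4571)`. The following is a minimal sketch, assuming that pattern, of how such entries could be split into their parts for tooling such as changelog checks. The names `Entry`, `ENTRY_RE`, and `parse_entry` are hypothetical and are not part of Modin. Entries that deviate from the pattern (for instance, those listing two issue tags) are not handled and return `None`.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the entries in these notes (an assumption, not an
# official Modin specification): TYPE-#issue: summary (#pr)
ENTRY_RE = re.compile(
    r"^(?P<kind>[A-Z]+)-#(?P<issue>\d+):\s*"
    r"(?P<summary>.*?)"
    r"(?:\s*\(#(?P<pr>\d+)\))?$"
)


class Entry(NamedTuple):
    """One parsed release-note entry (hypothetical helper type)."""

    kind: str           # e.g. "FIX", "PERF", "FEAT"
    issue: int          # issue number after "-#"
    summary: str        # free-text description
    pr: Optional[int]   # trailing pull request number, if present


def parse_entry(line: str) -> Optional[Entry]:
    """Parse a single bullet line; return None if it does not match."""
    # Drop surrounding whitespace and a leading bullet character, if any.
    text = line.strip().lstrip("•*- ").strip()
    match = ENTRY_RE.match(text)
    if match is None:
        return None
    pr = match.group("pr")
    return Entry(
        kind=match.group("kind"),
        issue=int(match.group("issue")),
        summary=match.group("summary"),
        pr=int(pr) if pr is not None else None,
    )


if __name__ == "__main__":
    print(parse_entry("• FIX-#4570: Replace np.bool -> np.bool_ (#4571)"))
    # An entry with no pull request number yields pr=None.
    print(parse_entry("• FEAT-#4913: Enabling pyhdk"))
```

Keeping the pull request number optional means entries that omit it are still parsed and are not silently dropped.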

Contributors

@mvashishtha @NickCrews @prutskov @vnlitvinov @pyrito @suhailrehman @RehanSD @helmeleegy @anmyachev @d33bs @noloerino @devin-petersohn @YarShev @naren-ponder @jbrockmendel @ienkovich @Garra1980 @Billy2551