docs/source/package_reference/main_classes.mdx
[[autodoc]] datasets.DatasetInfo
The base class [`Dataset`] implements a Dataset backed by an Apache Arrow table.
[[autodoc]] datasets.Dataset
    - add_column
    - add_item
    - from_file
    - from_buffer
    - from_pandas
    - from_dict
    - from_list
    - from_generator
    - data
    - cache_files
    - num_columns
    - num_rows
    - column_names
    - shape
    - unique
    - flatten
    - cast
    - cast_column
    - remove_columns
    - rename_column
    - rename_columns
    - select_columns
    - class_encode_column
    - __len__
    - __iter__
    - iter
    - formatted_as
    - set_format
    - set_transform
    - reset_format
    - with_format
    - with_transform
    - __getitem__
    - cleanup_cache_files
    - map
    - filter
    - select
    - sort
    - shuffle
    - skip
    - take
    - train_test_split
    - shard
    - repeat
    - to_tf_dataset
    - push_to_hub
    - save_to_disk
    - load_from_disk
    - flatten_indices
    - to_csv
    - to_pandas
    - to_dict
    - to_json
    - to_parquet
    - to_sql
    - to_iterable_dataset
    - add_faiss_index
    - add_faiss_index_from_external_arrays
    - save_faiss_index
    - load_faiss_index
    - add_elasticsearch_index
    - load_elasticsearch_index
    - list_indexes
    - get_index
    - drop_index
    - search
    - search_batch
    - get_nearest_examples
    - get_nearest_examples_batch
    - info
    - split
    - builder_name
    - citation
    - config_name
    - dataset_size
    - description
    - download_checksums
    - download_size
    - features
    - homepage
    - license
    - size_in_bytes
    - supervised_keys
    - version
    - from_csv
    - from_json
    - from_parquet
    - from_text
    - from_sql
    - align_labels_with_mapping
[[autodoc]] datasets.concatenate_datasets
[[autodoc]] datasets.interleave_datasets
[[autodoc]] datasets.distributed.split_dataset_by_node
[[autodoc]] datasets.enable_caching
[[autodoc]] datasets.disable_caching
[[autodoc]] datasets.is_caching_enabled
[[autodoc]] datasets.Column
Dictionary with split names as keys (`'train'`, `'test'` for example), and `Dataset` objects as values.
It also has dataset transform methods like `map` or `filter`, to process all the splits at once.
[[autodoc]] datasets.DatasetDict
    - data
    - cache_files
    - num_columns
    - num_rows
    - column_names
    - shape
    - unique
    - cleanup_cache_files
    - map
    - filter
    - sort
    - shuffle
    - set_format
    - reset_format
    - formatted_as
    - with_format
    - with_transform
    - flatten
    - cast
    - cast_column
    - remove_columns
    - rename_column
    - rename_columns
    - select_columns
    - class_encode_column
    - push_to_hub
    - save_to_disk
    - load_from_disk
    - from_csv
    - from_json
    - from_parquet
    - from_text
<a id='package_reference_features'></a>
The base class [`IterableDataset`] implements an iterable Dataset backed by Python generators.
[[autodoc]] datasets.IterableDataset
    - from_file
    - from_pandas
    - from_dict
    - from_list
    - from_generator
    - remove_columns
    - select_columns
    - cast_column
    - cast
    - decode
    - __iter__
    - iter
    - map
    - rename_column
    - filter
    - shuffle
    - batch
    - skip
    - take
    - shard
    - reshard
    - repeat
    - to_csv
    - to_pandas
    - to_dict
    - to_json
    - to_parquet
    - to_sql
    - push_to_hub
    - load_state_dict
    - state_dict
    - info
    - split
    - builder_name
    - citation
    - config_name
    - dataset_size
    - description
    - download_checksums
    - download_size
    - features
    - homepage
    - license
    - size_in_bytes
    - supervised_keys
    - version
    - from_csv
    - from_json
    - from_parquet
    - from_text
[[autodoc]] datasets.IterableColumn
Dictionary with split names as keys (`'train'`, `'test'` for example), and `IterableDataset` objects as values.
[[autodoc]] datasets.IterableDatasetDict
    - map
    - filter
    - shuffle
    - with_format
    - cast
    - cast_column
    - remove_columns
    - rename_column
    - rename_columns
    - select_columns
    - push_to_hub
[[autodoc]] datasets.Features
[[autodoc]] datasets.Value
[[autodoc]] datasets.ClassLabel
[[autodoc]] datasets.LargeList
[[autodoc]] datasets.List
[[autodoc]] datasets.Sequence
[[autodoc]] datasets.Translation
[[autodoc]] datasets.TranslationVariableLanguages
[[autodoc]] datasets.Array2D
[[autodoc]] datasets.Array3D
[[autodoc]] datasets.Array4D
[[autodoc]] datasets.Array5D
[[autodoc]] datasets.Audio
[[autodoc]] datasets.Image
[[autodoc]] datasets.Video
[[autodoc]] datasets.Json
[[autodoc]] datasets.Pdf
[[autodoc]] datasets.Nifti
[[autodoc]] datasets.filesystems.is_remote_filesystem
[[autodoc]] datasets.fingerprint.Hasher