fix: replace List with Sequence in type hints for public API functions (#5354)#8112
fix: replace List with Sequence in type hints for public API functions (#5354)#8112abdulhayykhan wants to merge 6 commits intohuggingface:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to improve typing ergonomics of the datasets public API by replacing invariant list[T] argument annotations with covariant Sequence[T], addressing mypy variance errors reported in #5354.
Changes:
- Update
path_or_pathsparameters in variousfrom_*constructors to acceptSequence[PathLike]instead oflist[PathLike]. - Update
columns/input_columns/remove_columnsargument annotations to useSequence[str]in several APIs. - Update
interleave_datasets/concatenate_datasetsto acceptSequence[...]inputs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/datasets/iterable_dataset.py |
Switches several constructor/iterable-layer parameters (and from_* helpers) from list[...] to Sequence[...]. |
src/datasets/dataset_dict.py |
Updates column-selection/removal and parquet columns annotations to Sequence[str] (but introduces a duplicate Sequence import issue). |
src/datasets/combine.py |
Broadens interleave_datasets / concatenate_datasets input types from list[...] to Sequence[...]. |
src/datasets/arrow_dataset.py |
Updates from_* constructors and _map_single params to use Sequence[...] where applicable. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/datasets/dataset_dict.py
Outdated
| from collections.abc import Sequence | ||
| from functools import partial | ||
| from typing import Callable, Literal, Optional, Union | ||
| from typing import Callable, Literal, Optional, Sequence, Union |
There was a problem hiding this comment.
Sequence is imported twice (from collections.abc and again from typing). The second import shadows the first, making the collections.abc import unused and likely causing ruff F401 in CI. Keep a single Sequence import (prefer the existing collections.abc.Sequence here) and remove the duplicate from typing (or alias one of them).
| from typing import Callable, Literal, Optional, Sequence, Union | |
| from typing import Callable, Literal, Optional, Union |
| input_columns: Optional[Sequence[str]] = None, | ||
| batched: bool = False, | ||
| batch_size: Optional[int] = 1000, | ||
| drop_last_batch: bool = False, | ||
| remove_columns: Optional[list[str]] = None, | ||
| remove_columns: Optional[Sequence[str]] = None, | ||
| keep_in_memory: bool = False, |
There was a problem hiding this comment.
This change updates _map_single to accept Sequence[str] for input_columns/remove_columns, but the public Dataset.map(...) signature in this file still uses Optional[Union[str, list[str]]] for the same parameters. If the goal is to replace list[...] with Sequence[...] across the public API (per PR description/#5354), the public method annotations should be updated as well (or the PR description/scope adjusted).
| input_columns: Optional[Sequence[str]] = None, | ||
| batched: bool = False, | ||
| batch_size: Optional[int] = 1000, | ||
| drop_last_batch: bool = False, | ||
| remove_columns: Optional[list[str]] = None, | ||
| remove_columns: Optional[Sequence[str]] = None, |
There was a problem hiding this comment.
This PR updates internal iterables and several from_* constructors to use Sequence[...], but the public IterableDataset.map(...) and IterableDataset.filter(...) methods later in this file still annotate input_columns/remove_columns as list[str]. If the intent is to replace list[...] with Sequence[...] across the public API (per PR description), these public method annotations should be updated too (or the PR scope/description clarified).
|
Updated — public Dataset.map(), Dataset.filter(), IterableDataset.map() and IterableDataset.filter() signatures now also use Sequence[str] consistently. Thanks for the pointer. |
Summary
Fixes #5354
Replaces
list[X]withSequence[X]in function argument type annotationsacross the public API.
Listis invariant whileSequenceis covariant,which means passing a
listto a parameter typed asListcauses mypy errors,even though lists are valid sequences.
Changes
src/datasets/arrow_dataset.py—path_or_paths,columns,input_columns,remove_columnssrc/datasets/iterable_dataset.py— same parameters across equivalent methodssrc/datasets/combine.py—datasetsandprobabilitiesparameterssrc/datasets/dataset_dict.py—column_namesandcolumnsparametersWhat was NOT changed
Notes
PR #7953 addressed only
from_parquetinarrow_dataset.pyand had mergeconflicts. This PR covers the broader public API consistently.