Skip to content

fix: replace List with Sequence in type hints for public API functions (#5354)#8112

Open
abdulhayykhan wants to merge 6 commits intohuggingface:mainfrom
abdulhayykhan:fix/list-to-sequence-type-hints
Open

fix: replace List with Sequence in type hints for public API functions (#5354)#8112
abdulhayykhan wants to merge 6 commits intohuggingface:mainfrom
abdulhayykhan:fix/list-to-sequence-type-hints

Conversation

@abdulhayykhan
Copy link
Copy Markdown

Summary

Fixes #5354

Replaces list[X] with Sequence[X] in function argument type annotations
across the public API. List is invariant while Sequence is covariant,
which means passing a list to a parameter typed as List causes mypy errors,
even though lists are valid sequences.

Changes

  • src/datasets/arrow_dataset.pypath_or_paths, columns, input_columns, remove_columns
  • src/datasets/iterable_dataset.py — same parameters across equivalent methods
  • src/datasets/combine.pydatasets and probabilities parameters
  • src/datasets/dataset_dict.pycolumn_names and columns parameters

What was NOT changed

  • Docstrings (separate concern)
  • Internal variables and function bodies
  • Return types
  • Non-public/internal helpers

Notes

PR #7953 addressed only from_parquet in arrow_dataset.py and had merge
conflicts. This PR covers the broader public API consistently.

Copilot AI review requested due to automatic review settings April 1, 2026 09:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve typing ergonomics of the datasets public API by replacing invariant list[T] argument annotations with covariant Sequence[T], addressing mypy variance errors reported in #5354.

Changes:

  • Update path_or_paths parameters in various from_* constructors to accept Sequence[PathLike] instead of list[PathLike].
  • Update columns / input_columns / remove_columns argument annotations to use Sequence[str] in several APIs.
  • Update interleave_datasets / concatenate_datasets to accept Sequence[...] inputs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/datasets/iterable_dataset.py Switches several constructor/iterable-layer parameters (and from_* helpers) from list[...] to Sequence[...].
src/datasets/dataset_dict.py Updates column-selection/removal and parquet columns annotations to Sequence[str] (but introduces a duplicate Sequence import issue).
src/datasets/combine.py Broadens interleave_datasets / concatenate_datasets input types from list[...] to Sequence[...].
src/datasets/arrow_dataset.py Updates from_* constructors and _map_single params to use Sequence[...] where applicable.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

from collections.abc import Sequence
from functools import partial
from typing import Callable, Literal, Optional, Union
from typing import Callable, Literal, Optional, Sequence, Union
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sequence is imported twice (from collections.abc and again from typing). The second import shadows the first, making the collections.abc import unused and likely causing ruff F401 in CI. Keep a single Sequence import (prefer the existing collections.abc.Sequence here) and remove the duplicate from typing (or alias one of them).

Suggested change
from typing import Callable, Literal, Optional, Sequence, Union
from typing import Callable, Literal, Optional, Union

Copilot uses AI. Check for mistakes.
Comment on lines +3674 to 3679
input_columns: Optional[Sequence[str]] = None,
batched: bool = False,
batch_size: Optional[int] = 1000,
drop_last_batch: bool = False,
remove_columns: Optional[list[str]] = None,
remove_columns: Optional[Sequence[str]] = None,
keep_in_memory: bool = False,
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change updates _map_single to accept Sequence[str] for input_columns/remove_columns, but the public Dataset.map(...) signature in this file still uses Optional[Union[str, list[str]]] for the same parameters. If the goal is to replace list[...] with Sequence[...] across the public API (per PR description/#5354), the public method annotations should be updated as well (or the PR description/scope adjusted).

Copilot uses AI. Check for mistakes.
Comment on lines +1245 to +1249
input_columns: Optional[Sequence[str]] = None,
batched: bool = False,
batch_size: Optional[int] = 1000,
drop_last_batch: bool = False,
remove_columns: Optional[list[str]] = None,
remove_columns: Optional[Sequence[str]] = None,
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR updates internal iterables and several from_* constructors to use Sequence[...], but the public IterableDataset.map(...) and IterableDataset.filter(...) methods later in this file still annotate input_columns/remove_columns as list[str]. If the intent is to replace list[...] with Sequence[...] across the public API (per PR description), these public method annotations should be updated too (or the PR scope/description clarified).

Copilot uses AI. Check for mistakes.
@abdulhayykhan
Copy link
Copy Markdown
Author

Updated — public Dataset.map(), Dataset.filter(), IterableDataset.map() and IterableDataset.filter() signatures now also use Sequence[str] consistently. Thanks for the pointer.

@abdulhayykhan
Copy link
Copy Markdown
Author

Note: PR #8067 by @biefan also addresses path_or_paths in arrow_dataset.py. This PR covers a broader scope across 4 files and additional parameters (columns, input_columns, remove_columns, map(), filter()). Happy to coordinate with @biefan or adjust scope based on maintainer preference.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consider using "Sequence" instead of "List"

2 participants