Skip to content

fix: Prevent accidental request dropping with maxRequestsPerCrawl#3531

Merged
janbuchar merged 6 commits intomasterfrom
fix-accidental-request-dropping
Apr 13, 2026
Merged

fix: Prevent accidental request dropping with maxRequestsPerCrawl#3531
janbuchar merged 6 commits intomasterfrom
fix-accidental-request-dropping

Conversation

@janbuchar
Copy link
Copy Markdown
Contributor

@janbuchar janbuchar commented Mar 26, 2026

  • closes maxRequestsPerCrawl with RQ optimizations drops requests #3153

  • core issue: maxRequestsPerCrawl and enqueueLinks({ limit }) pre-truncated request lists before sending them to the request queue. Duplicate URLs consumed budget slots, starving actual new URLs.

  • Fix: Instead of guessing upfront which requests will be new, let the request queue tell us.

    • A new maxNewRequests option on addRequestsBatched feeds requests to the queue in budget-capped chunks, counts how many were actually new from the wasAlreadyPresent results, and stops pulling once the budget is exhausted.
    • chunkedAsyncIterable was extended to accept a dynamic size callback
    • Leftovers from the source iterator are returned as requestsOverLimit for callers to report.

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Mar 26, 2026
@janbuchar janbuchar requested review from B4nan, barjin and l2ysho March 26, 2026 17:25
@github-actions github-actions bot added this to the 137th sprint - Tooling team milestone Mar 26, 2026
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Mar 26, 2026
@janbuchar janbuchar marked this pull request as ready for review March 27, 2026 19:49
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @janbuchar !

Here are some late-night ideas I got (otherwise it looks pretty solid):

Copy link
Copy Markdown
Contributor

@l2ysho l2ysho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generator instance problem highlighted by @barjin is valid point, check example test suite in thread. My other finding is probably just a nit, feel free to resolve as you like.

@janbuchar janbuchar requested review from barjin and l2ysho April 9, 2026 12:26
Copy link
Copy Markdown
Contributor

@l2ysho l2ysho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you!

@janbuchar janbuchar merged commit b23319b into master Apr 13, 2026
9 checks passed
@janbuchar janbuchar deleted the fix-accidental-request-dropping branch April 13, 2026 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

maxRequestsPerCrawl with RQ optimizations drops requests

5 participants