feat: Add `response_cache` argument for HTTP-based crawlers for caching HTTP responses #1698

Mantisus · 2026-01-28T22:48:48Z

Description

- Add `response_cache` argument to HTTP-based crawlers for optional response caching
- Add `ContextPipeline.compose_with_skip()` for conditional middleware skipping

response_cache accepts a KeyValueStore instance. When enabled, successful responses are cached on the first request and reused on subsequent runs. Useful during development to avoid excessive load on target sites.
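The cache-on-first-request behavior described above can be sketched without any Crawlee code. This is a minimal illustration, not the PR's implementation: `InMemoryStore`, `cache_key`, and `fetch_cached` are hypothetical names, the store only mimics an async get/set interface, and hashing the method and URL into a fixed-length key is an assumption about what a stable cache key could look like.

```python
import asyncio
import hashlib
from typing import Awaitable, Callable, Dict, Optional


class InMemoryStore:
    """Hypothetical stand-in for a key-value store with async get/set."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    async def get_value(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    async def set_value(self, key: str, value: bytes) -> None:
        self._data[key] = value


def cache_key(method: str, url: str) -> str:
    # Assumption: hash method + URL so every key has the same length,
    # no matter how long the URL is.
    return hashlib.sha256(f"{method.upper()} {url}".encode()).hexdigest()


async def fetch_cached(
    url: str,
    store: InMemoryStore,
    fetch: Callable[[str], Awaitable[bytes]],
) -> bytes:
    key = cache_key("GET", url)
    cached = await store.get_value(key)
    if cached is not None:
        # Cache hit: reuse the stored body, no request is sent.
        return cached
    # Cache miss: perform the request once and remember the body.
    body = await fetch(url)
    await store.set_value(key, body)
    return body
```

With a persistent store in place of `InMemoryStore`, a second run of the same crawl would read every body from the store instead of hitting the target site.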

Inspired by #801 and #861.

Testing

Add new tests for ContextPipeline and HttpCrawler.


Pijukatel · 2026-01-29T08:18:48Z

Hi, thanks for the PR.

I am not sure we need this, though. My reasoning is that this can already be achieved by the external archiving tool, and that works both for simple HTTP-based crawlers and PlaywrightCrawler. No need for any change in Crawlee code.

This solution adds another way to do it, for HTTP-based crawlers only. It has some advantages, like easier setup, but the disadvantages are additional code that increases the complexity of the pipeline and the limitation to HTTP-based crawlers.

Mantisus · 2026-01-29T10:45:01Z

> I am not sure we need this, though. My reasoning is that this can already be achieved by the external archiving tool, and that works both for simple HTTP-based crawlers and PlaywrightCrawler. No need for any change in Crawlee code.

Hi.

Yes, I have considered this. However, I decided to try to implement this solution because there are nuances when using the archiving tool that may make it less convenient for users.

- When archiving through a proxy, the user loses the ability to use external proxy servers.
- The user needs to create two different workflows: one for recording, the second for working with recorded data.

However, I agree with the shortcomings of the proposed solution.

janbuchar · 2026-01-29T12:37:41Z

What would be the minimal change to crawlee required to allow downstream users to implement this on their own?

Mantisus · 2026-01-29T13:17:06Z

> What would be the minimal change to crawlee required to allow downstream users to implement this on their own?

The minimal change is an extension of the ContextPipeline functionality, allowing for skipping middleware.

However, to implement caching, the pipeline in AbstractHttpCrawler needs to be updated.

Therefore, users will need to implement their own subclass of AbstractHttpCrawler with a new pipeline and, based on the subclass, create HttpCrawler or a crawler with a parser.

UPD: Alternatively, we need to consider refactoring ContextPipeline to make it easier for users to add custom middleware to the middle of the pipeline. But this is probably for v2.
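To make the idea of skippable middleware concrete, here is a minimal sketch under stated assumptions. `SkippablePipeline` is a hypothetical, simplified class and not Crawlee's `ContextPipeline`; only the method name `compose_with_skip` comes from this PR, while its signature (a middleware plus a `skip_if` predicate) and the dict-based context are assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

Context = Dict[str, object]
Middleware = Callable[[Context], Awaitable[Context]]
SkipIf = Callable[[Context], bool]


class SkippablePipeline:
    """Hypothetical simplified pipeline; NOT Crawlee's ContextPipeline."""

    def __init__(self) -> None:
        self._steps: List[Tuple[Middleware, Optional[SkipIf]]] = []

    def compose(self, middleware: Middleware) -> "SkippablePipeline":
        # A step registered here always runs.
        self._steps.append((middleware, None))
        return self

    def compose_with_skip(
        self, middleware: Middleware, skip_if: SkipIf
    ) -> "SkippablePipeline":
        # A step registered here is bypassed when skip_if(context) is true.
        self._steps.append((middleware, skip_if))
        return self

    async def run(self, context: Context) -> Context:
        for middleware, skip_if in self._steps:
            if skip_if is not None and skip_if(context):
                continue
            context = await middleware(context)
        return context


def build_pipeline(cache: Dict[str, bytes], calls: List[str]) -> SkippablePipeline:
    async def load_from_cache(context: Context) -> Context:
        url = str(context["url"])
        if url in cache:
            return {**context, "response": cache[url]}
        return context

    async def make_request(context: Context) -> Context:
        url = str(context["url"])
        calls.append(url)  # record that a "real" request happened
        cache[url] = b"fetched"
        return {**context, "response": cache[url]}

    return (
        SkippablePipeline()
        .compose(load_from_cache)
        # Skip the request step once an earlier step supplied a response.
        .compose_with_skip(make_request, skip_if=lambda c: "response" in c)
    )
```

Running the pipeline twice for the same URL calls `make_request` only on the first run; the second run gets its response from `load_from_cache` and skips the request step.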

Pijukatel · 2026-01-29T13:38:52Z

> When archiving through a proxy, the user loses the ability to use external proxy servers.
>
> The user needs to create two different workflows. One for recording, the second for working with recorded data.

I am not 100% sure, but I believe that both points could be solved by the correct setting of the archiving server.

janbuchar · 2026-01-29T15:24:15Z

By the way, couldn't the same behavior be achieved by modifying the HttpClient? If so, then we could probably keep the ContextPipeline simple(r)

Mantisus · 2026-01-29T15:49:02Z

> By the way, couldn't the same behavior be achieved by modifying the HttpClient?

It is preferable that caching occurs after _handle_status_code_response. This is to avoid caching pages that require repeated requests.
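The ordering concern can be illustrated with a small sketch. It assumes nothing about Crawlee internals: `check_status` is a hypothetical stand-in for the role `_handle_status_code_response` plays in the discussion, `RETRY_STATUS_CODES` is an illustrative set, and the plain dict is a placeholder for whatever store holds the cache.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# Assumption: an illustrative set of statuses that should trigger a retry.
RETRY_STATUS_CODES = {429, 500, 502, 503}


class RetryableStatusError(Exception):
    """Raised when a response should be retried rather than used."""


def check_status(status: int) -> None:
    # Hypothetical stand-in for the crawler's status-code handling step.
    if status in RETRY_STATUS_CODES:
        raise RetryableStatusError(f"status {status} should be retried")


async def fetch_check_then_cache(
    url: str,
    cache: Dict[str, bytes],
    fetch: Callable[[str], Awaitable[Tuple[int, bytes]]],
) -> bytes:
    if url in cache:
        return cache[url]
    status, body = await fetch(url)
    # The check runs BEFORE the write, so a blocked or failed response
    # raises here and never reaches the cache.
    check_status(status)
    cache[url] = body
    return body
```

If the cache write happened inside the HTTP client instead, the 429 body would be stored before the status check ran, and every later run would replay the blocked page.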

> we could probably keep the ContextPipeline simple(r)

Yes, I understand. I agree with @Pijukatel that this update will significantly complicate the code.
At least partially, this task can be solved using the external archiving tool.

For now, I would suggest closing this PR. If we encounter future use cases that require similar updates, we can revisit it.

Mantisus added 2 commits January 28, 2026 22:15

Add response_cache argument for HTTP-based crawlers for caching HTT…

0bba64a

…P responses

standardize key length

46b36d8

Mantisus requested review from Pijukatel and janbuchar January 28, 2026 22:49

Mantisus self-assigned this Jan 28, 2026
