
Conversation

@Mantisus
Collaborator

Description

  • Add response_cache argument to HTTP-based crawlers for optional response caching
  • Add ContextPipeline.compose_with_skip() for conditional middleware skipping

response_cache accepts a KeyValueStore instance. When enabled, successful responses are cached on the first request and reused on subsequent runs. This is useful during development to avoid putting excessive load on target sites.
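
For illustration, here is a minimal sketch of how the proposed argument could be used. The response_cache parameter is the addition proposed in this PR; the rest follows the existing Crawlee for Python API.

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import KeyValueStore


async def main() -> None:
    # Open a named key-value store that will hold the cached responses.
    cache = await KeyValueStore.open(name='response-cache')

    # `response_cache` is the argument proposed in this PR: successful
    # responses are stored on the first run and reused on subsequent runs.
    crawler = HttpCrawler(response_cache=cache)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```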

Inspired by #801 and #861.

Testing

  • Add new tests for ContextPipeline and HttpCrawler.

Mantisus self-assigned this Jan 28, 2026
@Pijukatel
Collaborator

Hi, thanks for the PR.

I am not sure we need this, though. My reasoning is that this can already be achieved with an external archiving tool, and that works for both simple HTTP-based crawlers and PlaywrightCrawler. No change in Crawlee code is needed.

This solution adds another way to do it, for HTTP-based crawlers only. It has some advantages, such as easier setup, but also disadvantages: additional code that increases the complexity of the pipeline, and the limitation to HTTP-based crawlers only.

@Mantisus
Collaborator Author

> I am not sure we need this, though. My reasoning is that this can already be achieved with an external archiving tool, and that works for both simple HTTP-based crawlers and PlaywrightCrawler. No change in Crawlee code is needed.

Hi.

Yes, I have considered this. However, I decided to try implementing this solution because there are nuances to using an archiving tool that may make it less convenient for users:

  1. When archiving through a proxy, the user loses the ability to use external proxy servers.
  2. The user needs to create two different workflows: one for recording, another for working with the recorded data.

That said, I do agree that the proposed solution has these shortcomings.

@janbuchar
Collaborator

What would be the minimal change to crawlee required to allow downstream users to implement this on their own?

@Mantisus
Collaborator Author

Mantisus commented Jan 29, 2026

> What would be the minimal change to crawlee required to allow downstream users to implement this on their own?

The minimal change would be an extension of the ContextPipeline functionality that allows skipping middleware.

However, to implement caching, the pipeline in AbstractHttpCrawler would also need to be updated.

Therefore, users would need to implement their own subclass of AbstractHttpCrawler with a new pipeline and, based on that subclass, create an HttpCrawler or a crawler with a parser.

UPD: Alternatively, we could consider refactoring ContextPipeline to make it easier for users to add custom middleware to the middle of the pipeline. But that is probably for v2.
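
To illustrate, here is a rough sketch of what such middleware skipping could look like. The names and signatures are illustrative assumptions for this example, not the actual code from this PR:

```python
from collections.abc import AsyncGenerator, Callable
from typing import TypeVar

TContext = TypeVar('TContext')


def compose_with_skip(
    middleware: Callable[[TContext], AsyncGenerator[TContext, None]],
    skip_condition: Callable[[TContext], bool],
) -> Callable[[TContext], AsyncGenerator[TContext, None]]:
    """Wrap a middleware so it is bypassed when skip_condition is true."""

    async def wrapper(context: TContext) -> AsyncGenerator[TContext, None]:
        if skip_condition(context):
            # Condition met: pass the context through unchanged.
            yield context
        else:
            # Otherwise run the wrapped middleware as usual.
            async for new_context in middleware(context):
                yield new_context

    return wrapper
```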

@Pijukatel
Collaborator

> • When archiving through a proxy, the user loses the ability to use external proxy servers.
>
> • The user needs to create two different workflows: one for recording, another for working with the recorded data.

I am not 100% sure, but I believe that both points could be solved by configuring the archiving server correctly.

@janbuchar
Collaborator

By the way, couldn't the same behavior be achieved by modifying the HttpClient? If so, we could probably keep the ContextPipeline simple(r).

@Mantisus
Collaborator Author

> By the way, couldn't the same behavior be achieved by modifying the HttpClient?

It is preferable for caching to happen after _handle_status_code_response, so that pages whose responses trigger a retry are never cached.
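
As an illustration of that ordering, here is a self-contained sketch with stand-in names, not actual Crawlee internals:

```python
from collections.abc import Awaitable, Callable
from typing import Any


class RetryableStatusError(Exception):
    """Stand-in for the error that makes the crawler retry a request."""


def _handle_status_code_response(status: int) -> None:
    # Simplified stand-in for the crawler's status check.
    if status >= 400:
        raise RetryableStatusError(f'Got status {status}')


async def fetch_with_cache(
    url: str,
    fetch: Callable[[str], Awaitable[dict[str, Any]]],
    cache: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    # Serve from cache if a previous run already stored a good response.
    if url in cache:
        return cache[url]

    response = await fetch(url)

    # May raise and trigger a retry; in that case nothing is cached.
    _handle_status_code_response(response['status'])

    # Only reached for successful responses, so retried pages are never cached.
    cache[url] = response
    return response
```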

> we could probably keep the ContextPipeline simple(r)

Yes, I understand. I agree with @Pijukatel that this update would significantly complicate the code.
This task can be at least partially solved using an external archiving tool.

For now, I would suggest closing this PR. If we encounter future use cases that require similar updates, we can revisit it.
