feat: Add response_cache argument for HTTP-based crawlers for caching HTTP responses
#1698
Conversation
Hi, thanks for the PR. I am not sure we need this, though. My reasoning is that this can already be achieved by the external archiving tool, and that works both for simple This solution adds another way to do it, for the HTTP-based crawlers only. It has some advantages, like easier setup, but the disadvantages are additional code that increases the complexity of the pipeline and the limitation to HTTP-based crawlers only.
Hi. Yes, I have considered this. However, I decided to try to implement this solution because there are nuances when using the archiving tool that may make it less convenient for users.
However, I agree with the shortcomings of the proposed solution.
What would be the minimal change to crawlee required to allow downstream users to implement this on their own?
The minimal change is an extension of the However, to implement caching, the pipeline in Therefore, users will need to implement their own subclass of UPD: Alternatively, we need to consider refactoring
I am not 100% sure, but I believe that both points could be solved by the correct setting of the archiving server.
By the way, couldn't the same behavior be achieved by modifying the HttpClient? If so, then we could probably keep the ContextPipeline simple(r)
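To make the idea concrete, here is a rough sketch of caching at the client level. It does not use crawlee's actual `HttpClient` interface: `Response`, `SendRequest`, and `CachingClient` are hypothetical names invented for illustration, and a plain dict stands in for any persistent store.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Response:
    """Hypothetical, simplified response type (not crawlee's)."""

    status_code: int
    body: bytes


# Hypothetical stand-in for whatever the underlying client uses to send a request.
SendRequest = Callable[[str], Awaitable[Response]]


class CachingClient:
    """Wraps a send function and reuses successful responses by URL."""

    def __init__(self, send_request: SendRequest) -> None:
        self._send_request = send_request
        self._cache: dict[str, Response] = {}

    async def send(self, url: str) -> Response:
        cached = self._cache.get(url)
        if cached is not None:
            return cached  # Cache hit: the wrapped client is not called.

        response = await self._send_request(url)
        # Only successful (2xx) responses are stored, so failures get retried.
        if 200 <= response.status_code < 300:
            self._cache[url] = response
        return response


if __name__ == "__main__":
    async def fake_send(url: str) -> Response:
        # Assumption: a fake transport, so the sketch runs without network access.
        return Response(status_code=200, body=url.encode())

    async def main() -> None:
        client = CachingClient(fake_send)
        first = await client.send("https://example.com")
        second = await client.send("https://example.com")
        print(first is second)

    asyncio.run(main())
```

With this shape the pipeline itself would stay unchanged, since the wrapper decides on its own whether to hit the network.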
It is preferable that caching occurs after
Yes, I understand. I agree with @Pijukatel that this update will significantly complicate the code. For now, I would suggest closing this PR. If we encounter future use cases that require similar updates, we can revisit it.
Description
- Add `response_cache` argument to HTTP-based crawlers for optional response caching
- Add `ContextPipeline.compose_with_skip()` for conditional middleware skipping

`response_cache` accepts a `KeyValueStore` instance. When enabled, successful responses are cached on the first request and reused on subsequent runs. Useful during development to avoid excessive load on target sites.

Inspired by #801 and #861.
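The behavior described above can be illustrated with a small, self-contained sketch. This is not the code from this PR and does not use crawlee's classes: `Pipeline`, `build_pipeline`, and the step functions are hypothetical, a plain dict stands in for the `KeyValueStore`, and status-code checks are omitted for brevity. It only shows the general idea of skipping the fetch step when a cached response is already available.

```python
from typing import Callable, Optional

Context = dict
Step = Callable[[Context], None]
SkipIf = Callable[[Context], bool]


class Pipeline:
    """Hypothetical, simplified pipeline; steps run in the order they are added."""

    def __init__(self) -> None:
        self._steps: list[tuple[Step, Optional[SkipIf]]] = []

    def compose(self, step: Step) -> "Pipeline":
        self._steps.append((step, None))
        return self

    def compose_with_skip(self, step: Step, skip_if: SkipIf) -> "Pipeline":
        # The step is skipped whenever skip_if(context) returns True.
        self._steps.append((step, skip_if))
        return self

    def run(self, context: Context) -> Context:
        for step, skip_if in self._steps:
            if skip_if is not None and skip_if(context):
                continue
            step(context)
        return context


def build_pipeline(fetch: Callable[[str], str], cache: dict) -> Pipeline:
    """Builds a two-step pipeline: cache lookup, then a skippable fetch."""

    def load_cached(ctx: Context) -> None:
        # Assumption: the cache is keyed by URL.
        if ctx["url"] in cache:
            ctx["body"] = cache[ctx["url"]]
            ctx["from_cache"] = True

    def fetch_and_store(ctx: Context) -> None:
        ctx["body"] = fetch(ctx["url"])
        ctx["from_cache"] = False
        cache[ctx["url"]] = ctx["body"]

    return (
        Pipeline()
        .compose(load_cached)
        .compose_with_skip(fetch_and_store, skip_if=lambda ctx: "body" in ctx)
    )
```

Running the pipeline twice with the same cache and URL calls `fetch` only once; the second run takes the body from the cache and skips the fetch step.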
Testing
`ContextPipeline` and `HttpCrawler`.