
Autoscaling based on multiple cpu utilization for single process crawlers? #1119

Open
Pijukatel opened this issue Mar 27, 2025 · 1 comment
Labels
solutioning: The issue is not being implemented but only analyzed and planned.
t-tooling: Issues with this label are in the ownership of the tooling team.

Comments

@Pijukatel
Collaborator

Currently, the Autoscaled pool tries to scale up when CPU utilization is low. This causes a problem when, for example, an HTTP-based crawler (essentially a single-process crawler) runs in an environment with multiple CPUs. The other CPUs are underutilized, this is reported to the Autoscaled pool, and the pool may try to scale up even though the relevant core is already fully utilized.
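To make the effect concrete, here is a minimal sketch of how averaging over all cores can hide a saturated core. The threshold value and the function names are hypothetical and are not taken from Crawlee's implementation; the per-core numbers are made up for illustration.

```python
from statistics import mean

# Hypothetical threshold, not Crawlee's actual configuration value.
OVERLOAD_THRESHOLD = 0.8


def aggregate_utilization(per_core: list[float]) -> float:
    """Average utilization across all cores (each value between 0.0 and 1.0)."""
    return mean(per_core)


def looks_idle(per_core: list[float], threshold: float = OVERLOAD_THRESHOLD) -> bool:
    """Return True if the averaged signal suggests there is room to scale up."""
    return aggregate_utilization(per_core) < threshold


# One core saturated by the single-process crawler, three cores nearly idle.
per_core = [1.0, 0.05, 0.05, 0.05]

print(f"aggregate utilization: {aggregate_utilization(per_core):.2%}")
print(f"busiest core: {max(per_core):.2%}")
print(f"averaged signal says scale up: {looks_idle(per_core)}")
```

With these made-up numbers the average is roughly 29%, so a decision based on the averaged signal would allow scaling up even though the one core doing the work has no headroom.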

This is probably less of a problem for browser-based crawlers, since the browsers run in their own processes and can be scheduled on different cores.

Mentioned here: apify/apify-sdk-python#447 (comment)

Maybe we need more detailed information about the utilization so that each crawler can decide what is relevant for it.
(Or possibly make Crawlee in general capable of scaling up to multiple CPUs?)
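One possible shape for the "more detailed information" idea, sketched under assumptions: the snapshot carries per-core values and each crawler type picks the signal it cares about. `CpuSignal`, `CpuSnapshot`, and `is_cpu_overloaded` are hypothetical names for illustration only, not part of Crawlee's API, and the 0.8 threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean


class CpuSignal(Enum):
    """Which CPU reading a crawler treats as relevant (hypothetical)."""

    ALL_CORES = "all_cores"        # e.g. browser-based crawlers spread over cores
    BUSIEST_CORE = "busiest_core"  # e.g. single-process HTTP crawlers


@dataclass(frozen=True)
class CpuSnapshot:
    """Per-core utilization, each value between 0.0 and 1.0 (hypothetical)."""

    per_core: tuple[float, ...]

    def relevant_utilization(self, signal: CpuSignal) -> float:
        if not self.per_core:
            raise ValueError("per_core must contain at least one value")
        if signal is CpuSignal.BUSIEST_CORE:
            return max(self.per_core)
        return mean(self.per_core)


def is_cpu_overloaded(
    snapshot: CpuSnapshot, signal: CpuSignal, threshold: float = 0.8
) -> bool:
    """Return True if the crawler's chosen signal is at or above the threshold."""
    return snapshot.relevant_utilization(signal) >= threshold


snapshot = CpuSnapshot(per_core=(1.0, 0.05, 0.05, 0.05))
print(is_cpu_overloaded(snapshot, CpuSignal.ALL_CORES))
print(is_cpu_overloaded(snapshot, CpuSignal.BUSIEST_CORE))
```

Using the busiest core is only a proxy: it assumes the saturated core is the one running the crawler, which may not hold if another process is the one loading it. Measuring the crawler process's own CPU time would be a more direct signal, at the cost of ignoring contention from other processes.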

@Pijukatel Pijukatel added solutioning The issue is not being implemented but only analyzed and planned. t-tooling Issues with this label are in the ownership of the tooling team. labels Mar 27, 2025
@janbuchar
Collaborator

Some additional context: when running locally, we consider the overall system CPU utilization, not just what the Crawlee process uses. In contrast, we only consider the memory used by the current process and its children.

In the JS version, the local implementation also considers the overall system CPU load over all CPUs.


2 participants