The application crashes when both use_session_pool=False and context.session.retire() are used. #888


Closed
matecsaj opened this issue Jan 9, 2025 · 4 comments
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

matecsaj (Contributor) commented Jan 9, 2025

Python code

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext   # V0.5.0

async def main() -> None:
    crawler = PlaywrightCrawler(use_session_pool=False)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.session.retire()

    await crawler.run(['https://crawlee.dev'])

if __name__ == '__main__':
    asyncio.run(main())

Command-line output

/Users/matecsaj/PycharmProjects/wat-crawlee/venv/bin/python /Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_5.py 
[crawlee._autoscaling.snapshotter] INFO  Setting max_memory_size of this run to 8.00 GB.
[crawlee.crawlers._playwright._playwright_crawler] INFO  Current request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
│ retry_histogram               │ [0]      │
│ request_avg_failed_duration   │ None     │
│ request_avg_finished_duration │ None     │
│ requests_finished_per_minute  │ 0        │
│ requests_failed_per_minute    │ 0        │
│ request_total_duration        │ 0.0      │
│ requests_total                │ 0        │
│ crawler_runtime               │ 0.000736 │
└───────────────────────────────┴──────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] ERROR Request failed and reached maximum retries
      Traceback (most recent call last):
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/crawlers/_basic/_context_pipeline.py", line 79, in __call__
          await final_context_consumer(cast(TCrawlingContext, crawling_context))
        File "/Users/matecsaj/PycharmProjects/wat-crawlee/venv/lib/python3.13/site-packages/crawlee/router.py", line 57, in __call__
          return await self._default_handler(context)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/Users/matecsaj/Library/Application Support/JetBrains/PyCharm2024.3/scratches/scratch_5.py", line 9, in request_handler
          context.session.retire()
          ^^^^^^^^^^^^^^^^^^^^^^
      AttributeError: 'NoneType' object has no attribute 'retire'
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Error analysis: total_errors=3 unique_errors=1
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 0         │
│ requests_failed               │ 1         │
│ retry_histogram               │ [0, 0, 1] │
│ request_avg_failed_duration   │ 0.705511  │
│ request_avg_finished_duration │ None      │
│ requests_finished_per_minute  │ 0         │
│ requests_failed_per_minute    │ 11        │
│ request_total_duration        │ 0.705511  │
│ requests_total                │ 1         │
│ crawler_runtime               │ 5.27278   │
└───────────────────────────────┴───────────┘

Process finished with exit code 0
github-actions bot added the t-tooling label Jan 9, 2025
Mantisus (Collaborator) commented Jan 9, 2025

The use_session_pool=False parameter disables the SessionPool. Therefore context.session is None, and this behavior is expected.

matecsaj (Contributor, Author) commented Jan 9, 2025

Could the system handle this scenario more gracefully? Perhaps context.session.retire() could check for a None case and simply return if it encounters it.

If this scenario is something the user is expected to address, an error log with a clear explanation of the issue and guidance on how to resolve it would be helpful.

Mantisus (Collaborator) commented Jan 9, 2025

From my point of view, use_session_pool=False is not the default configuration. And a user who uses it makes a conscious choice that the system will not use sessions.

So this check may be needed either when experimenting with the code or in some complex cases that the user defines themselves.

A simple case is:

if context.session:
    context.session.retire()

B4nan (Member) commented Jan 9, 2025

Correct, if you disable the session pool, there is no session in the context. That is also why it's typed as Session | None. This is working as expected, so closing.

https://crawlee.dev/python/api/class/BasicCrawlingContext#session
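Because the attribute is typed `Session | None`, a static checker such as mypy will flag an unguarded `.retire()` call before it ever runs. A minimal sketch of the narrowing pattern, with a hypothetical `Session` stand-in and `handle` helper (not crawlee's actual API):

```python
from typing import Optional

# Hypothetical stand-in; in crawlee, BasicCrawlingContext.session
# is typed Session | None.
class Session:
    def __init__(self) -> None:
        self.retired = False

    def retire(self) -> None:
        self.retired = True

def handle(session: Optional[Session]) -> str:
    # An unguarded session.retire() here would fail type checking
    # (and raise AttributeError at runtime) because session may be None.
    if session is None:
        return "no session (pool disabled)"
    session.retire()  # session is narrowed to Session on this branch
    return "retired"

print(handle(None))       # no session (pool disabled)
print(handle(Session()))  # retired
```

Running a type checker over the handler would have surfaced this before the crawl, which is the practical benefit of the `Session | None` annotation.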

@B4nan B4nan closed this as completed Jan 9, 2025