
Internal state is preventing recrawls #769

Open · tleyden opened this issue Dec 1, 2024 · 3 comments
Labels: t-tooling (Issues with this label are in the ownership of the tooling team.)

Comments

tleyden commented Dec 1, 2024

When running in a Jupyter notebook, or via Modal.com (which seems to reuse Python VM instances across invocations), I'm noticing that the second and subsequent crawl invocations always return an empty crawl:

Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
... etc

If I restart the Jupyter kernel, or force a fresh container on Modal, it does the recrawl as expected (detailed steps to repro are below).

I tried messing with the config:

    crawler = BeautifulSoupCrawler(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
            verbose_log=True,
        ),
    )

but it didn't seem to make a difference. With debug enabled, here is the output on the 2nd run that gives more details on why it skips the crawl:

[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee.statistics._statistics] DEBUG Persisting state of the Statistics (event_data=is_migrating=False).
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Listener task completed.
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Removing listener task from the set...
[crawlee._autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Terminating - no running tasks to wait for
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee._autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
[crawlee.statistics._statistics] DEBUG Persisting state of the Statistics (event_data=is_migrating=False).

in particular: DEBUG is_finished_function reports that we are finished
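To see why, you can inspect the default RequestQueue that survives in the still-running Python VM. A minimal sketch (not from the original report), assuming this crawlee version exposes RequestQueue.is_finished(); run it in a notebook cell after the first crawl:

from crawlee.storages import RequestQueue

# The default queue is process-wide, so this opens the very same queue that
# the first crawl already drained.
rq = await RequestQueue.open()

# Every request from the first run has been handled, so the queue reports
# finished and the autoscaled pool exits immediately on the second run.
print(await rq.is_finished())  # expected: True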

Steps to repro

Run this code in a jupyter notebook:

%load_ext autoreload
%autoreload 2
%autoawait asyncio

and

from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee import EnqueueStrategy
from crawlee.configuration import Configuration

crawler = BeautifulSoupCrawler(
    configuration=Configuration(
        persist_storage=False,
        purge_on_start=True,
        verbose_log=True,
    ),
)

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url} ...')
    await context.enqueue_links(
        strategy=EnqueueStrategy.SAME_HOSTNAME,
    )

results = await crawler.run(['https://11-19-inject-broken-links.docs-7kl.pages.dev'])
results

The first time it will crawl 14 links:

┌───────────────────────────────┬──────────┐
│ requests_finished             │ 14       │
│ requests_failed               │ 0        │

the 2nd time it will crawl 0 links:

Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
... etc

I'd expect it to do the same crawl each time. If this is by design, I would say the behavior is very confusing. Ideally there would be a global crawlee.reset() to clear out everything when needed.

Workaround

  • Jupyter: restarting the kernel clears out everything and the crawl runs as expected.
  • Modal.com: forcing it to start a fresh container on each run also solves the issue.

RCA

No idea, but I did notice that on the 2nd run this code:

    async def add_request(
        self,
        request: Request,
        *,
        forefront: bool = False,
    ) -> ProcessedRequest:
        existing_queue_by_id = find_or_create_client_by_id_or_name_inner(
            resource_client_class=RequestQueueClient,
            memory_storage_client=self._memory_storage_client,
            id=self.id,
            name=self.name,
        )

        if existing_queue_by_id is None:
            raise_on_non_existing_storage(StorageTypes.REQUEST_QUEUE, self.id)

        request_model = await self._create_internal_request(request, forefront)

        async with existing_queue_by_id.file_operation_lock:
            existing_request_with_id = existing_queue_by_id.requests.get(request_model.id) 

            # We already have the request present, so we return information about it
            if existing_request_with_id is not None:
                await existing_queue_by_id.update_timestamps(has_been_modified=False)

                return ProcessedRequest(
                    id=request_model.id,
                    unique_key=request_model.unique_key,
                    was_already_present=True,
                    was_already_handled=existing_request_with_id.order_no is None,
                )

It was returning ProcessedRequest instances with was_already_handled=True.
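The same stale state shows up at the storage level. A hedged sketch (not part of the original report) that re-adds the start URL on the second run, assuming Request.from_url() and the ProcessedRequest fields quoted above:

from crawlee import Request
from crawlee.storages import RequestQueue

# The default queue survives across crawler.run() calls in the same VM, so
# re-adding a URL that run #1 already handled just reports it as present
# and handled -- nothing gets re-enqueued or re-crawled.
rq = await RequestQueue.open()
processed = await rq.add_request(
    Request.from_url('https://11-19-inject-broken-links.docs-7kl.pages.dev')
)
print(processed.was_already_present, processed.was_already_handled)  # True True on the 2nd run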

Mantisus (Collaborator) commented Dec 2, 2024

Hey @tleyden

This behavior is due to the fact that a Jupyter notebook does not clear memory until the session ends.

Because of this, all links that have already been processed are still sitting in the LRU cache of the RequestQueue.

To bypass this behavior, you should force-clear the default RequestQueue.

Create a separate cell with the following code:

from crawlee.storages import RequestQueue

request_provider = await RequestQueue.open()
await request_provider.drop()

This will open the default RequestQueue and clear the cache.
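In a notebook, one convenient pattern (just a sketch building on the snippet above, not an official API) is to drop the stale default queue right before every crawl:

from crawlee.storages import RequestQueue

async def reset_default_request_queue() -> None:
    # Open the (possibly stale) default queue and drop it, so the next
    # crawler.run() starts from an empty queue instead of a finished one.
    rq = await RequestQueue.open()
    await rq.drop()

await reset_default_request_queue()
results = await crawler.run(['https://11-19-inject-broken-links.docs-7kl.pages.dev'])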

tleyden (Author) commented Dec 2, 2024

Thanks @Mantisus, that fixes it!

I still find the behavior a bit surprising though. Shouldn't creating a new BeautifulSoupCrawler() start with a fresh RequestQueue?

If you want to re-use a RequestQueue across crawlers, it could be passed in as a param

janbuchar (Collaborator) commented

By default, we use a filesystem-backed request queue, which persists the data in case the process gets interrupted. It can also be more memory-efficient. I agree that this behavior can be confusing though... We are currently re-evaluating it.
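For the "pass it in as a param" idea, a hedged sketch of what that could look like; the request_provider keyword is an assumption about this crawlee version's crawler signature and may be named differently:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.storages import RequestQueue

# Open an explicitly named queue so its lifecycle stays visible and under
# your control, then hand it to the crawler instead of relying on the
# implicit default queue.
queue = await RequestQueue.open(name='recrawl-queue')  # name is illustrative
crawler = BeautifulSoupCrawler(request_provider=queue)  # parameter name is an assumption

# Dropping the queue later resets the state for the next run:
# await queue.drop()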
