
Internal state is preventing recrawls #769

Open · tleyden opened this issue Dec 1, 2024 · 3 comments
Labels: t-tooling (Issues with this label are in the ownership of the tooling team.)

Comments

tleyden commented Dec 1, 2024

When running in a Jupyter notebook, or via Modal.com (which seems to reuse Python VM instances across invocations), I'm noticing that the second and subsequent crawl invocations always return an empty crawl:

Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
... etc

If I restart the Jupyter kernel, or force a fresh container on Modal, it does the recrawl as expected (detailed steps to repro are below).

I tried messing with the config:

    crawler = BeautifulSoupCrawler(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
            verbose_log=True,
        ),
    )

but it didn't seem to make a difference. With debug enabled, here is the output on the 2nd run that gives more details on why it skips the crawl:

[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee.statistics._statistics] DEBUG Persisting state of the Statistics (event_data=is_migrating=False).
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Listener task completed.
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Removing listener task from the set...
[crawlee._autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Terminating - no running tasks to wait for
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee._autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
[crawlee.statistics._statistics] DEBUG Persisting state of the Statistics (event_data=is_migrating=False).

in particular: DEBUG is_finished_function reports that we are finished
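To see why, you can inspect the default RequestQueue that survives in the still-running Python VM. A minimal sketch (not from the original report), assuming this crawlee version exposes RequestQueue.is_finished(); run it in a notebook cell after the first crawl:

from crawlee.storages import RequestQueue

# The default queue is process-wide, so this opens the very same queue that
# the first crawl already drained.
rq = await RequestQueue.open()

# Every request from the first run has been handled, so the queue reports
# finished and the autoscaled pool exits immediately on the second run.
print(await rq.is_finished())  # expected: True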

Steps to repro

Run this code in a jupyter notebook:

%load_ext autoreload
%autoreload 2
%autoawait asyncio

and

from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee import EnqueueStrategy
from crawlee.configuration import Configuration

crawler = BeautifulSoupCrawler(
    configuration=Configuration(
        persist_storage=False,
        purge_on_start=True,
        verbose_log=True,
    ),
)

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url} ...')
    await context.enqueue_links(
        strategy=EnqueueStrategy.SAME_HOSTNAME,
    )

results = await crawler.run(['https://11-19-inject-broken-links.docs-7kl.pages.dev'])
results

The first time it will crawl 14 links:

┌───────────────────────────────┬──────────┐
│ requests_finished             │ 14       │
│ requests_failed               │ 0        │

the 2nd time it will crawl 0 links:

Final request statistics:
┌───────────────────────────────┬──────────┐
│ requests_finished             │ 0        │
│ requests_failed               │ 0        │
... etc

I'd expect it to do the same crawl each time. If this is by design, I would say the behavior is very confusing. Ideally there would be a global crawlee.reset() to clear out everything when needed.

Workaround

  • Jupyter: restarting the kernel clears out everything and the crawl runs as expected.
  • Modal.com: forcing it to start a fresh container on each run also solves the issue.

RCA

No idea, but I did notice that on the 2nd run this code:

    async def add_request(
        self,
        request: Request,
        *,
        forefront: bool = False,
    ) -> ProcessedRequest:
        existing_queue_by_id = find_or_create_client_by_id_or_name_inner(
            resource_client_class=RequestQueueClient,
            memory_storage_client=self._memory_storage_client,
            id=self.id,
            name=self.name,
        )

        if existing_queue_by_id is None:
            raise_on_non_existing_storage(StorageTypes.REQUEST_QUEUE, self.id)

        request_model = await self._create_internal_request(request, forefront)

        async with existing_queue_by_id.file_operation_lock:
            existing_request_with_id = existing_queue_by_id.requests.get(request_model.id) 

            # We already have the request present, so we return information about it
            if existing_request_with_id is not None:
                await existing_queue_by_id.update_timestamps(has_been_modified=False)

                return ProcessedRequest(
                    id=request_model.id,
                    unique_key=request_model.unique_key,
                    was_already_present=True,
                    was_already_handled=existing_request_with_id.order_no is None,
                )

It was returning ProcessedRequest instances with was_already_handled=True.
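The same stale state shows up at the storage level. A hedged sketch (not part of the original report) that re-adds the start URL on the second run, assuming Request.from_url() and the ProcessedRequest fields quoted above:

from crawlee import Request
from crawlee.storages import RequestQueue

# The default queue survives across crawler.run() calls in the same VM, so
# re-adding a URL that run #1 already handled just reports it as present
# and handled -- nothing gets re-enqueued or re-crawled.
rq = await RequestQueue.open()
processed = await rq.add_request(
    Request.from_url('https://11-19-inject-broken-links.docs-7kl.pages.dev')
)
print(processed.was_already_present, processed.was_already_handled)  # True True on the 2nd run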

Mantisus (Collaborator) commented Dec 2, 2024

Hey @tleyden

This behavior is due to the fact that a Jupyter notebook does not clear memory until the session ends.

Because of this, all links that have already been processed are still sitting in the LRU cache of the RequestQueue.

To bypass this behavior, you should force-clear the default RequestQueue.

Create a separate cell with the following code:

from crawlee.storages import RequestQueue

request_provider = await RequestQueue.open()
await request_provider.drop()

This will open the default RequestQueue and clear the cache.
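In a notebook, one convenient pattern (just a sketch building on the snippet above, not an official API) is to drop the stale default queue right before every crawl:

from crawlee.storages import RequestQueue

async def reset_default_request_queue() -> None:
    # Open the (possibly stale) default queue and drop it, so the next
    # crawler.run() starts from an empty queue instead of a finished one.
    rq = await RequestQueue.open()
    await rq.drop()

await reset_default_request_queue()
results = await crawler.run(['https://11-19-inject-broken-links.docs-7kl.pages.dev'])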

tleyden (Author) commented Dec 2, 2024

Thanks @Mantisus, that fixes it!

I still find the behavior a bit surprising though. Shouldn't creating a new BeautifulSoupCrawler() start with a fresh RequestQueue?

If you want to re-use a RequestQueue across crawlers, it could be passed in as a param

janbuchar (Collaborator) commented

By default, we use a filesystem-backed request queue, which persists the data in case the process gets interrupted. It can also be more memory-efficient. I agree that this behavior can be confusing though... We are currently re-evaluating it.
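For the "pass it in as a param" idea, a hedged sketch of what that could look like; the request_provider keyword is an assumption about this crawlee version's crawler signature and may be named differently:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.storages import RequestQueue

# Open an explicitly named queue so its lifecycle stays visible and under
# your control, then hand it to the crawler instead of relying on the
# implicit default queue.
queue = await RequestQueue.open(name='recrawl-queue')  # name is illustrative
crawler = BeautifulSoupCrawler(request_provider=queue)  # parameter name is an assumption

# Dropping the queue later resets the state for the next run:
# await queue.drop()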
