When running in a Jupyter notebook, or via Modal.com (which seems to reuse Python VM instances across invocations), I'm noticing that the second and subsequent crawl invocations always return an empty crawl. If I restart the Jupyter kernel, or force a fresh container on Modal, the recrawl runs as expected (detailed steps to repro below).
I tried messing with the config, but it didn't seem to make a difference. With debug enabled, here is the output on the 2nd run that gives more details on why it skips the crawl:
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Awaiting listener task...
[crawlee.statistics._statistics] DEBUG Persisting state of the Statistics (event_data=is_migrating=False).
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Listener task completed.
[crawlee.events._event_manager] DEBUG LocalEventManager.on.listener_wrapper(): Removing listener task from the set...
[crawlee._autoscaling.autoscaled_pool] DEBUG `is_finished_function` reports that we are finished
[crawlee._autoscaling.autoscaled_pool] DEBUG Terminating - no running tasks to wait for
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[crawlee._autoscaling.autoscaled_pool] DEBUG Pool cleanup finished
[crawlee.statistics._statistics] DEBUG Persisting state of the Statistics (event_data=is_migrating=False).
in particular: DEBUG is_finished_function reports that we are finished

Steps to repro
Run a crawl from a Jupyter notebook, then run it again in the same kernel. The first time it will crawl 14 links; the 2nd time it will crawl 0 links.
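A minimal sketch of that kind of notebook repro (not the exact code from the report; it assumes crawlee's BeautifulSoupCrawler API, and the import path and start URL may differ from the original):

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def run_crawl() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=20)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Crawling {context.request.url}')
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])

# Notebook cells (Jupyter allows top-level await):
await run_crawl()  # 1st run: crawls the links it finds
await run_crawl()  # 2nd run in the same kernel: finishes immediately with 0 requests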
I'd expect it to do the same crawl each time. If it's by design, I would say the behavior is very confusing. Ideally there should be a global crawlee.reset() to clear out everything when needed.
Workaround
Jupyter: restarting the kernel clears out everything and it runs as expected.
Modal.com: forcing it to start a fresh container each run also solves the issue.
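An in-process workaround also seems possible (a sketch, assuming the RequestQueue.open()/drop() API; other default storages may need the same treatment depending on what you persist): drop the default request queue before re-running the crawl.

from crawlee.storages import RequestQueue

async def reset_default_request_queue() -> None:
    # Drop the cached default queue so the next crawler.run() starts from scratch.
    rq = await RequestQueue.open()
    await rq.drop()

# In a notebook cell, before the 2nd crawl:
await reset_default_request_queue()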
RCA
No idea, but I did notice that on the 2nd run this code:
async def add_request(
    self,
    request: Request,
    *,
    forefront: bool = False,
) -> ProcessedRequest:
    existing_queue_by_id = find_or_create_client_by_id_or_name_inner(
        resource_client_class=RequestQueueClient,
        memory_storage_client=self._memory_storage_client,
        id=self.id,
        name=self.name,
    )

    if existing_queue_by_id is None:
        raise_on_non_existing_storage(StorageTypes.REQUEST_QUEUE, self.id)

    request_model = await self._create_internal_request(request, forefront)

    async with existing_queue_by_id.file_operation_lock:
        existing_request_with_id = existing_queue_by_id.requests.get(request_model.id)

        # We already have the request present, so we return information about it
        if existing_request_with_id is not None:
            await existing_queue_by_id.update_timestamps(has_been_modified=False)

            return ProcessedRequest(
                id=request_model.id,
                unique_key=request_model.unique_key,
                was_already_present=True,
                was_already_handled=existing_request_with_id.order_no is None,
            )
It was returning ProcessedRequest instances with was_already_handled=True
I still find the behavior a bit surprising though. Shouldn't creating a new BeautifulSoupCrawler() start with a fresh RequestQueue?
If you want to re-use a RequestQueue across crawlers, it could be passed in as a param
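Passing the queue explicitly should also work as a per-run workaround today, e.g. by opening a uniquely named queue for each run (a sketch; the request_manager parameter name and import paths are assumptions and may differ between crawlee versions):

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.storages import RequestQueue

async def build_crawler(run_id: str) -> BeautifulSoupCrawler:
    # A fresh, uniquely named queue per run avoids reusing the shared 'default' queue.
    rq = await RequestQueue.open(name=f'run-{run_id}')
    return BeautifulSoupCrawler(request_manager=rq)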
By default, we use a filesystem-backed request queue, which persists the data in case the process gets interrupted. It can also be more memory-efficient. I agree that this behavior can be confusing though... We are currently re-evaluating it.
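For context, the persisted data ends up under the storage directory (./storage by default), so it also survives across runs started from the same working directory. Tuning this presumably goes through Configuration; a sketch, assuming the persist_storage and purge_on_start settings (names and availability may differ between versions):

from crawlee.configuration import Configuration

config = Configuration.get_global_configuration()
config.persist_storage = False  # assumption: keep storages purely in memory
config.purge_on_start = True    # assumption: purge the default storages when a crawler starts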