Description
Is your feature request related to a problem? Please describe.
In Aleph, a task is a single unit of background work. Aleph tracks the progress of background jobs at the task level (i.e. it stores how many tasks have been processed so far and how many are still pending). If a task fails, the entire task is retried.
Cross-referencing an entire collection is implemented as a single task. This causes multiple problems:
- Aleph currently isn’t able to display information about the progress of the cross-referencing process (besides the fact that it is still running).
- If the task fails, it is retried from the beginning, re-computing matches for entities that have already been processed. Depending on the size of the Aleph instance and the size of the collection, cross-referencing can take hours, sometimes even days.
- Aleph uses the Elasticsearch scroll API to iterate over all entities in the collection in batches. The scroll API has a timeout for the maximum time between requesting two batches of entities. If fetching candidates and computing similarity scores for a batch of entities takes longer than this timeout, Elasticsearch raises an error when Aleph tries to request the next batch of entities.
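For illustration, here is a minimal sketch of the current pattern that runs into the scroll timeout, assuming the elasticsearch-py client; the index name, query, and `xref_entity` helper are placeholders rather than Aleph’s actual code:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Open a scroll context. The keep-alive ("5m") only has to cover the time
# between two scroll requests -- but in the current design, the expensive
# xref computation happens inside the loop between those requests.
resp = es.search(
    index="aleph-entities",  # placeholder index name
    scroll="5m",
    size=500,
    body={"query": {"term": {"collection_id": 123}}},
)
scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    for hit in hits:
        # Fetching candidates and computing similarity scores for 500
        # entities can easily take longer than the scroll keep-alive...
        xref_entity(hit)  # hypothetical per-entity xref helper
    # ...so by the time the next page is requested, Elasticsearch may have
    # expired the scroll context and this call raises an error.
    resp = es.scroll(scroll_id=scroll_id, scroll="5m")
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
```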
Describe the solution you'd like
Cross-referencing a collection should be split into two types of background tasks:
- An initial task should use the ES scroll API to iterate over all entities in the collection in batches, similar to what happens right now. However, it shouldn’t actually compute the xref matches for those entities as part of the same task. Instead, it should merely enqueue a separate task for every batch.
- These batch tasks then compute the actual xref matches (a rough sketch of this split follows the example below).
For example, given a collection that contains 1000 entities and a batch size of 500:
- Aleph fetches the first batch of 500 entities from ES and enqueues a separate background task to compute xref for entities 1 to 500.
- Aleph fetches the second batch of 500 entities from ES and enqueues a separate background task to compute xref for entities 501 to 1000.
- A worker computes xref for entities 1 to 500.
- A worker computes xref for entities 501 to 1000.
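A rough sketch of the proposed two-task split, under the same assumptions as above (elasticsearch-py client, placeholder index name); `queue_task`, `OP_XREF_BATCH`, and `xref_entity` are hypothetical stand-ins for whatever Aleph’s task queue and xref code actually provide:

```python
BATCH_SIZE = 500

def queue_xref(collection_id):
    """Initial task: scroll over the collection and enqueue one batch task
    per page of entity IDs. No expensive work happens inside the scroll
    loop, so the keep-alive between scroll requests is easy to respect."""
    resp = es.search(
        index="aleph-entities",  # placeholder index name
        scroll="5m",
        size=BATCH_SIZE,
        body={
            "query": {"term": {"collection_id": collection_id}},
            "_source": False,  # only the entity IDs are needed here
        },
    )
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        entity_ids = [hit["_id"] for hit in hits]
        # Hypothetical queue helper: enqueue a separate task per batch.
        queue_task(collection_id, OP_XREF_BATCH, entity_ids=entity_ids)
        resp = es.scroll(scroll_id=scroll_id, scroll="5m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]

def xref_batch(collection_id, entity_ids):
    """Batch task: compute xref matches for one batch of entities. A failure
    here is limited to this batch, and finished batches double as progress."""
    for entity_id in entity_ids:
        entity = es.get(index="aleph-entities", id=entity_id)
        xref_entity(entity)  # hypothetical candidate fetching + scoring
```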
This has several advantages:
- If one of the subtasks fails, the failure is limited to one batch of entities (if the batch size is 500, only xref matches for 500 entities would have to be recomputed).
- Aleph can display detailed progress for the cross-referencing process. (For example, if a collection contains 5000 entities and the batch size is 500, Aleph would be able to report progress in 10% increments.)
- The problems with ES scroll timeouts are gone. The process that uses the scroll API to iterate over all entities doesn’t actually compute the xref matches (which takes a lot of time and can easily exceed the scroll timeout); it only adds another task to the queue (which is very quick).
- Xref can be parallelized. Each batch of entities can be handled by a separate worker. (For example, if you have 10 workers that are otherwise idle, xref could take only 1/10th of the time to complete compared to the current implementation.)
The disadvantage is that it adds a lot of tasks to the queue with possibly large payloads (the task payload would need to include the IDs for one batch of entities). However, we’re currently migrating from Redis to RabbitMQ as the primary data store for queued tasks. In contrast to Redis, RabbitMQ stores tasks on disk, so this should be less of a problem.
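As a rough illustration of the payload size, assuming entity IDs around 50 characters long and a JSON-serialized payload (both assumptions, not measured against Aleph):

```python
import json

# Hypothetical payload for one batch task: a collection ID plus one page of
# entity IDs (500 IDs of ~50 characters each).
payload = {"collection_id": 123, "entity_ids": ["f" * 50 for _ in range(500)]}

print(len(json.dumps(payload)))  # roughly 27 KB per task
```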
Describe alternatives you've considered
Increasing scroll timeouts: We’ve done this before, but it is only a short-term solution, as it cannot be repeated indefinitely.
Additional context
- This article gives an overview of how cross-referencing is implemented in Aleph.