The arche-core REST API supports both multiple parallel transactions (since the beginning) and multiple parallel requests within a single transaction (since version 3.2). While parallel ingestion can provide a great speedup, there are some limitations and caveats which are discussed below.
Unfortunately the fact that arche-core supports parallel ingestion does not mean it can automagically protect one transaction from conflicting with another when both try to modify the same repository resource at the same time.
The risk of such conflicts is highest for named entity resources, as the same named entity is likely to appear in many different ingestions and to be referred to from many resources.
The best way to avoid the risk of lock conflicts on named entities is to only refer to them by their identifiers (URIs) and to manage their detailed metadata separately (in a separate ingestion). As merely referring to a resource does not acquire a lock on it, there will be no conflict errors (even if the referred resource does not exist yet and the first transaction obtains a lock by creating it, all other transactions won’t conflict with it because they won’t obtain a lock on the created resource).
This approach also solves another problem with named entities - overwriting a named entity resource’s curated metadata, which already exists in the repository, with dirty data coming from a new ingestion.
Please read the next chapter for a detailed description of the locking system.
When there is more than one client trying to modify the same repository resource, we must have a way to arbitrate between them. This is done by imposing the rule that whoever modifies a resource first acquires a lock on it, and the lock prevents other clients from modifying this resource until the lock owner releases it. Any other request trying to modify a locked resource receives the HTTP 409 Conflict error response (with the response body containing a more detailed description of the conflict).
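Below is a minimal sketch (in Python, using the requests library) of how a client can recognize such a lock conflict. The base URL, endpoint path and transaction header name are assumptions made for illustration only - check the arche-core REST API documentation for the real ones; the point is solely the handling of the HTTP 409 response.

    import requests

    # Assumed values - adjust them to your deployment and to the actual REST API.
    BASE = "https://arche.example.org/api"

    def update_metadata(res_id: str, ttl: str, tx_id: str) -> requests.Response:
        """Try to modify a resource's metadata within the transaction tx_id."""
        return requests.patch(
            f"{BASE}/{res_id}/metadata",          # hypothetical endpoint path
            data=ttl.encode("utf-8"),
            headers={
                "Content-Type": "text/turtle",
                "X-Transaction-Id": tx_id,        # hypothetical header name
            },
        )

    resp = update_metadata("12345", '<res> <prop> "value" .', tx_id="987")
    if resp.status_code == 409:
        # The resource is locked by another request or transaction.
        # Either wait and retry later or report the conflict to the user;
        # the response body describes the conflict in more detail.
        print("lock conflict:", resp.text)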
There are two kinds of locks in arche-core: transaction-level locks, which prevent other transactions from modifying a given resource, and within-transaction locks, which prevent other parallel requests of the same transaction from modifying it.
It’s worth noting that just referring to an already existing repository resource doesn’t acquire a lock on it, but referring to a non-existing repository resource creates it and acquires a lock on it.
Let’s consider three scenarios:
1. Repository resources: {none}
   Ingested data:
   <newRes> <property> <oldRes> .
2. Repository resources: <oldRes>
   Ingested data:
   <newRes> <property> <oldRes> .
3. Repository resources: <oldRes>
   Ingested data:
   <newRes> <property> <oldRes> .
   <oldRes> <otherProperty> "data" .
In the first case the transaction will acquire a lock on both <newRes> and <oldRes>. In the second case only a lock on <newRes> will be acquired. In the third case locks on both <newRes> and <oldRes> will be acquired, as the data of both resources will be modified.
As mentioned above, just referring to a resource doesn’t acquire a lock on it. This can lead to some corner cases. Let’s consider the following sequence of events:
1. The repository contains no resources: {no resources}
2. Transaction 1 ingests
   <res1> <someProperty> "some value" .
   and obtains a lock on <res1>.
3. Transaction 2 ingests
   <res2> <otherProperty> <res1> .
   and obtains a lock on <res2>, but there is no conflict with the transaction 1 as the <res1> was only referred to and already existed (the transaction isolation level in arche-core is read uncommitted, so the <res1> was already visible to the transaction 2).

What should happen now? We have two possible base scenarios with many subsequent decisions to be made:
1. Transaction 1 keeps its lock on the <res1> and rolls back, so the <res1> is removed. What should then happen with the <res2> metadata? Should the now-dangling reference be removed from <res2>?
2. The <res1> remains locked, so it can not be removed by the transaction 1 rollback, so the <res1> is kept. This raises two questions. First, who should hold the lock on the <res1> (or maybe it should not be locked). Second, what set of metadata the <res1> should contain (should it contain the data provided by the transaction 1 or should it be an id-only resource).
Passing the lock on the <res1> to the transaction 2 and leaving it a metadata-only resource will not cause trouble in a plain arche-core, but it may cause data completeness checks performed by deployment-specific check plugins to fail. Currently the <res1> keeps the metadata as it was provided by the transaction 1 (just because it’s the easiest scenario to implement), but this may change in the next major arche-core release.

Parallel ingestion within the same transaction is very attractive because it can provide a significant speedup. This is especially pronounced for metadata ingestions, e.g. we measured 4 to 8 times shorter ingestion times.
Unfortunately parallel ingestion within the same transaction requires special handling of corner cases which do not affect single-request-at-a-time transactions.
The easiest way to avoid these problems is to use a higher-level ingestion interface which already takes care of them (like arche-lib-ingest).
Tests on real-world data gave us the following results:
TODO
When we ingest data, we typically want to create repository resources which do not exist yet and update already existing ones. An intuitive way of doing that could be: first search for the resource, then update it if it was found or create it if it was not.
Unfortunately this approach works only if we are sure that a resource can not be created in between the search and the create/update operations (a so-called atomicity guarantee). The arche-core REST API does not provide us such a guarantee. It means it is possible that you will search for a resource, get no result, then try to create it and get the HTTP 400 unique constraint violation response, meaning a resource has been created by another parallel request in between your search and create requests. As a consequence you must trap such an error and act accordingly - try to update the resource instead.
Taking this into account, an optimal approach is to always try to create a resource and switch to an update once you get the HTTP 400 unique constraint violation response. This way the create-or-update operation always takes 1 (create) or 2 (failed create, then update) requests. In comparison, the search-update-or-create approach with the creation failure handling requires from 2 (search and update, or search and successful create) to 3 (search, failed create, update) requests per resource.
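The sketch below (Python, requests library) illustrates this create-first pattern. The create_resource() and update_resource() helpers, the endpoint paths and the way the already existing resource is addressed after a failed create are assumptions made for illustration - only the fallback on the HTTP 400 response is the point here.

    import requests

    BASE = "https://arche.example.org/api"     # assumed base URL
    HEADERS = {"Content-Type": "text/turtle"}  # transaction handling omitted

    def create_resource(meta_ttl: str) -> requests.Response:
        # Hypothetical create call - see the arche-core REST API docs
        # for the actual endpoint.
        return requests.post(f"{BASE}/metadata", data=meta_ttl.encode("utf-8"),
                             headers=HEADERS)

    def update_resource(res_url: str, meta_ttl: str) -> requests.Response:
        # Hypothetical update call - res_url addresses the already existing
        # resource (how you obtain it, e.g. from one of its identifiers,
        # depends on your setup).
        return requests.patch(f"{res_url}/metadata", data=meta_ttl.encode("utf-8"),
                              headers=HEADERS)

    def create_or_update(meta_ttl: str, res_url: str) -> requests.Response:
        resp = create_resource(meta_ttl)
        if resp.status_code == 400:
            # HTTP 400 unique constraint violation - the resource already
            # exists (possibly created by a parallel request a moment ago),
            # so fall back to updating it.
            resp = update_resource(res_url, meta_ttl)
        return resp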
You may wonder how big the risk of actually running into such conflicts is. It depends on how dense the ingested RDF metadata graph is (how many edges it has), how many parallel requests are used and what the ingestion order is. From a mathematical perspective, trying to ingest any complete subgraph (complete without distinguishing between edge properties and directions) in parallel (even with only two parallel requests) will lead to the "created in the meantime" conflict. For example, assuming all resources you are ingesting have an RDF property denoting the same author, you are certain to run into at least one "created in the meantime" conflict. And if you have a graph of N resources where every resource points to every other (with any property), it is just impossible to ingest it in parallel. Summing up, the risk of running into the "created in the meantime" conflict when using parallel requests is high and your code must be prepared for handling conflict errors, e.g. by trying to reingest the resource.
Finally, it is worth noting that the problem described above also applies to single-request-at-a-time transactions, as the creation between a search and a create may also be caused by another transaction. Anyway this scenario is pretty rare, because the probability of two transactions creating/updating exactly the same resource at exactly the same time is very low. If two transactions try to create/update the same resource, a conflict is a far more probable outcome (see the section on locks above).
As a rule of thumb, use no more than 10 parallel requests for metadata ingestion and no more than 3 for binary ingestion.
The details are quite complex but let’s try to dig a little into them.
First, by ingesting in parallel we can only reduce the impact of network latency and of the delay introduced by server-side processing. If the bottleneck is somewhere else, e.g. the network throughput, we won’t benefit from the parallelization. This is exactly why the expected benefit for binary ingestion is much lower than for the metadata one.
The second problem is the synchronization overhead. The repository has to synchronize parallel requests using transaction and within-transaction locks and this process is non-parallel by its very nature. It means handling parallel requests adds some synchronization overhead. This overhead grows faster than the gain from the parallelization and at some point outweighs it. If you push the parallelization even further, at some point it will become slower than sequential execution.
It’s worth noting that a request which can not acquire a lock does not wait indefinitely for the lock to be released - the HTTP 409 Conflict is returned instead. It means in practice you won’t observe single request handling time growing infinitely once you increase the number of parallel requests. At some point you will just start getting (very quick) HTTP 409 Conflict responses which you must handle on your own (basically by reducing the number of parallel requests and trying to send the request again).
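A minimal sketch of this pattern in Python is shown below: resources are ingested with a bounded thread pool and a request that receives HTTP 409 Conflict is retried after a short, growing pause. The ingest_one callable stands for whatever single-resource call you use (for instance the create_or_update() sketch above); the limits and delays are illustrative assumptions.

    import concurrent.futures
    import time
    from typing import Callable

    import requests

    MAX_PARALLEL = 10   # rule-of-thumb upper bound for metadata ingestion
    MAX_ATTEMPTS = 5    # illustrative retry limit

    def with_retry(ingest_one: Callable[[str], requests.Response],
                   meta_ttl: str) -> requests.Response:
        delay = 0.5
        for _ in range(MAX_ATTEMPTS):
            resp = ingest_one(meta_ttl)
            if resp.status_code != 409:
                return resp
            # 409 Conflict: the resource is locked by another parallel request.
            # Wait a little (which lowers the request pressure) and try again.
            time.sleep(delay)
            delay *= 2
        return resp  # still conflicting - let the caller decide what to do

    def ingest_all(ingest_one: Callable[[str], requests.Response],
                   resources: list[str]) -> list[requests.Response]:
        with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
            return list(pool.map(lambda ttl: with_retry(ingest_one, ttl), resources))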
TODO - how to use Repo::map()