Foreword

The arche-core REST API supports both multiple parallel transactions (since the beginning) and multiple parallel requests within a single transaction (since version 3.2). While parallel ingestion can provide a great speedup, there are some limitations and caveats, which are discussed below.

Most common problem

Unfortunately, the fact that arche-core supports parallel ingestion does not mean it can automagically protect one transaction from conflicting with another when both try to modify the same repository resource at the same time.

The risk of such conflicts is highest for named entity resources, as the same named entity is likely to appear in many different ingestions and to be referred to from many resources.

The best way to avoid the risk of lock conflicts on named entities is to refer to them only by their identifiers (URIs) and to manage their detailed metadata separately (in a separate ingestion). As merely referring to a resource doesn’t acquire a lock on it, there will be no conflict errors (even if the referred resource doesn’t exist and the first transaction obtains a lock due to its creation, all other transactions won’t conflict with it because they won’t acquire a lock on the created resource).

This approach also solves another problem with named entities - overwriting a named entity resource’s curated metadata, already existing in the repository, with dirty data coming from a new ingestion.
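For illustration, using placeholder URIs in the same notation as the examples further below: regular data ingestions would only reference the named entity,

    <someResource> <property> <namedEntity> .

while a single dedicated ingestion manages the named entity's own metadata:

    <namedEntity> <otherProperty> "curated named entity data" .

This way the data ingestions do not acquire locks on <namedEntity> (except for the one which happens to create it first) and never overwrite its curated metadata.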

Please read the next chapter for a detailed description of the locking system.

Locks

When there is more than one client trying to modify the same repository resource, we must have a way to arbitrate between them. This is done by imposing the rule that whoever modifies a resource first acquires a lock on it, and the lock prevents other clients from modifying this resource until the lock owner releases it. Any other request trying to modify a locked resource receives an HTTP 409 Conflict error response (with the response body containing a more detailed description of the conflict).

There are two kinds of locks in arche-core:

  • transaction-level locks, which arbitrate between different transactions, and
  • within-transaction locks, which arbitrate between parallel requests issued within a single transaction (see the section on parallel ingestion within the same transaction below).

It’s worth noting that just referring to an already existing repository resource doesn’t acquire a lock on it, but referring to a non-existing repository resource creates it and acquires a lock on it.

Let’s consider three scenarios:

  1. Repository resources: {none}
    Ingested data:

    <newRes> <property> <oldRes> .
  2. Repository resources: <oldRes>
    Ingested data:

    <newRes> <property> <oldRes> .
  3. Repository resources: <oldRes>
    Ingested data:

    <newRes> <property> <oldRes> .
    <oldRes> <otherProperty> "data" .

In the first case the transaction will acquire locks on both <newRes> and <oldRes>, as <oldRes> does not exist yet and referring to it creates it.
In the second case only a lock on <newRes> will be acquired, as <oldRes> already exists and is only referred to.
In the third case locks on both <newRes> and <oldRes> will be acquired, as both resources' data will be modified.

Corner cases

As mentioned above, just referring to a resource doesn’t acquire a lock on it. This can lead to some corner cases. Let’s consider the following sequence of events:

  • Initial repository state: {no resources}
  • Transaction 1 creates <res1> <someProperty> "some value" . and obtains a lock on <res1>
  • Transaction 2 creates <res2> <otherProperty> <res1> .
    As a result it obtains a lock on <res2>, but there is no conflict with transaction 1 as <res1> was only referred to and already existed (the transaction isolation level in arche-core is read uncommitted, so <res1> was already visible to transaction 2).
  • Transaction 1 rolls back.

What should happen now? We have two possible base scenarios with many subsequent decisions to be made:

  1. Transaction 1 owns <res1> and rolls back, so <res1> is removed.
    Notably, this makes transaction 2 invalid. We have basically three options:
    1. Force a rollback of transaction 2.
      The problem here is that it would be very surprising and difficult to understand for the client performing that transaction.
    2. Silently drop the problematic triple from the <res2> metadata.
      This is even more problematic - we would be silently modifying another transaction without any notification, leading to a final repository state which is impossible for the transaction 2 owner to understand. A no-go.
    3. Silently drop <res2>.
      Also a no-go, for the same reason as option 2.
  2. Transaction 2 depends on <res1>, so it can not be removed by the transaction 1 rollback and <res1> is kept.
    Two questions arise in such a scenario. First, who should own the lock on <res1> (or maybe it should not be locked at all)? Second, what set of metadata should <res1> contain (should it contain the data provided by transaction 1 or should it be an id-only resource)?
    1. When it comes to locking, the lock should probably be inherited by transaction 2. Otherwise a rollback of both transactions would not get us back to the initial repository state, which would be a serious ACID violation and therefore a no-go.
    2. When it comes to the metadata content, it is more subtle. Passing the lock on <res1> to transaction 2 and leaving it a metadata-only resource will not cause trouble in plain arche-core, but it may cause data completeness checks performed by deployment-specific check plugins to fail.
      As of arche-core 3.2, the decision was made to leave the <res1> metadata as it was provided by transaction 1 (just because it’s the easiest scenario to implement), but this may change in the next major arche-core release.

Parallel ingestion within the same transaction

Parallel ingestion within the same transaction is very attractive because it can provide a significant speedup. This is especially pronounced for metadata ingestions, e.g. we measured 4 to 8 times shorter ingestion times.

Unfortunately, parallel ingestion within the same transaction requires special handling of corner cases which do not affect single-request-at-once transactions.

The easiest way to avoid these problems is to use a higher-level ingestion interface which already takes care of them (like arche-lib-ingest).
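If you implement the client side yourself, the sketch below shows the general shape of sending metadata requests in parallel within a single transaction. It is only an illustration, not the actual arche-core API: the base URL, endpoint paths, transaction header name and response format are all assumptions - check the arche-core REST API documentation of your deployment.

    # A minimal sketch of parallel metadata ingestion within one transaction.
    # Endpoints, header names and response formats below are assumptions.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    BASE = 'https://repo.example.org/api'   # hypothetical repository base URL
    MAX_WORKERS = 10                        # rule-of-thumb cap for metadata ingestion

    def begin_transaction() -> str:
        # assumed endpoint starting a transaction and returning its id
        resp = requests.post(f'{BASE}/transaction')
        resp.raise_for_status()
        return resp.json()['transactionId']   # assumed response format

    def ingest_one(tx_id: str, turtle: str) -> int:
        # one resource ingestion = one HTTP request within the given transaction
        resp = requests.post(
            f'{BASE}/metadata',                     # assumed creation endpoint
            headers={'X-Transaction-Id': tx_id,     # assumed header name
                     'Content-Type': 'text/turtle'},
            data=turtle.encode('utf-8'),
        )
        return resp.status_code

    tx_id = begin_transaction()
    resources = ['<res1> <property> "value 1" .',
                 '<res2> <property> "value 2" .']

    # cap the number of parallel requests and send them through a thread pool
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        statuses = list(pool.map(lambda t: ingest_one(tx_id, t), resources))

    # committing or rolling back the transaction is omitted for brevity

The client still has to inspect the returned status codes and handle HTTP 400 and HTTP 409 responses as described in the following sections.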

Speedup

Tests on real-world data gave us the following results:

TODO

Atomic create-or-update

When we ingest data, we typically want to create repository resources which do not exist yet and update already existing ones. An intuitive way of doing that could be:

  • Search for a resource
  • If it was found, update it
  • If it was not found, create it

Unfortunately this approach works only if we are sure that a resource can not be created in between the search and update operations (a so-called atomicity guarantee). The arche-core REST API does not provide such a guarantee. This means it is possible that you search for a resource, get no result, then try to create it and get an HTTP 400 unique constraint violation, meaning the resource has been created by another parallel request in between your search and create requests. As a consequence you must trap such an error and act accordingly - try to update the resource instead.

Taking this into account, an optimal approach is to always try to create a resource and fall back to an update once you get the HTTP 400 unique constraint violation response. This way the create-or-update operation is always executed with 1 (create) or 2 (failed create, then update) requests. In comparison, the search-then-update-or-create approach with creation failure handling requires from 2 (search and update, or search and successful create) to 3 (search, failed create, update) requests per resource.
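A minimal sketch of this create-first strategy is shown below. The base URL, endpoint paths, header names and, in particular, the way the already existing resource is located after a failed create are assumptions made purely for illustration - consult the arche-core REST API documentation for the real interface.

    # Create-first strategy: always try to create, fall back to an update on
    # an HTTP 400 unique constraint violation. All endpoints and payload
    # formats are assumptions, not the actual arche-core API.
    import requests

    BASE = 'https://repo.example.org/api'   # hypothetical repository base URL

    def create_or_update(tx_id: str, turtle: str) -> requests.Response:
        headers = {'X-Transaction-Id': tx_id,    # assumed header name
                   'Content-Type': 'text/turtle'}
        # 1. optimistically try to create the resource
        resp = requests.post(f'{BASE}/metadata', headers=headers,
                             data=turtle.encode('utf-8'))
        if resp.status_code != 400:
            return resp                          # created, or some other error
        # 2. HTTP 400 unique constraint violation: the resource already exists.
        #    How to locate it is deployment-specific; here we assume, purely
        #    for illustration, that the error response body carries its URL.
        existing_url = resp.json()['uri']
        return requests.patch(f'{existing_url}/metadata', headers=headers,
                              data=turtle.encode('utf-8'))

With this approach the happy path costs one request and the conflict path costs two, matching the request counts discussed above.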

You may wonder how big the risk of actually running into such conflicts is. It depends on how dense the ingested RDF metadata graph is (how many edges it has), how many parallel requests are used and what the ingestion order is. From a mathematical perspective, trying to ingest any complete subgraph (complete without distinguishing between edge properties and directions) in parallel (even with only two parallel requests) will lead to a created-in-the-meantime conflict. For example, if all resources you are ingesting have an RDF property denoting the same author, you are certain to run into at least one created-in-the-meantime conflict. And if you have a graph of N resources where every resource points to every other one (with any property), it is just impossible to ingest it in parallel without such conflicts. Summing up, the risk of running into a created-in-the-meantime conflict when using parallel requests is high, and your code must be prepared to handle conflict errors, e.g. by trying to reingest the resource.

Finally, it is worth noting that the problem described above also applies to single-request-at-once transactions, as the creation between a search and a create may also be caused by another transaction. This scenario is pretty rare, though, because the probability of creating/updating exactly the same resource at exactly the same time in multiple transactions is very low. If two transactions try to create/update the same resource, a lock conflict is the far more probable outcome (see the section on locks above).

How many parallel requests make sense?

As a rule of thumb, no more than 10 parallel requests for metadata ingestion and no more than 3 for binary ingestion.

The details are quite complex, but let’s try to dig into them a little.

First, by ingesting in parallel we can only reduce the impact of network latency and of the delay introduced by server-side processing. If the bottleneck is somewhere else, e.g. the network’s throughput, we won’t benefit from parallelization. This is exactly why the expected benefit for binary ingestion is much lower than for metadata ingestion.

The second problem is the synchronization overhead. The repository has to synchronize parallel requests using transaction and within-transaction locks, and this process is non-parallel by its very nature. It means that handling parallel requests adds some synchronization overhead. This overhead grows faster than the gain from parallelization and at some point outweighs it. If you push parallelization far enough, it will eventually become slower than sequential execution.

It’s worth noting that:

  • The more time-consuming the server-side processing is, the more room there is for parallelization. E.g. if you don’t run any pre-resource-modification plugins in your repository, you will stop benefiting from parallelization much earlier than if you run time-consuming plugins (it’s simple - the bigger the lock-unrelated overhead on the server side, the bigger the performance gap to be filled up by the growing synchronization overhead).
  • To avoid “the locks spiral of death” (a situation when locks are acquired more slowly than new lock requests come in, leading to a stall of the whole database), arche-core acquires locks with a timeout. When the timeout is reached, an HTTP 409 Conflict is simply returned. In practice this means you won’t observe single-request handling times growing infinitely as you increase the number of parallel requests. At some point you will just start getting (very quick) HTTP 409 Conflict responses which you must handle on your own, basically by reducing the number of parallel requests and sending the request again (see the sketch after this list).
    • Depending on the underlying database speed, it should be able to handle from a few to a few dozen locks per second. The default timeout is 1 second but it can be tuned by the repository admin (e.g. a fast database with a 2-second timeout should be able to handle around 150 locks per second, while a slow database with a 0.25-second timeout may have problems handling even two parallel requests at once).
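A simple client-side strategy for such HTTP 409 Conflict responses is to back off and retry the conflicting request, which in practice reduces the number of requests hitting the repository at the same time. A minimal, illustrative sketch (send_request stands for whatever zero-argument function performs a single arche-core request in your code):

    # Retry a request failing with HTTP 409 Conflict, waiting a bit longer
    # between attempts so that the lock pressure on the repository decreases.
    import time
    import requests

    def with_conflict_retry(send_request, max_attempts: int = 5,
                            base_delay: float = 1.0) -> requests.Response:
        # send_request is a placeholder for your own zero-argument function
        # performing a single arche-core request and returning the response
        for attempt in range(max_attempts):
            resp = send_request()
            if resp.status_code != 409:
                return resp                         # success or a non-conflict error
            time.sleep(base_delay * (attempt + 1))  # back off before retrying
        return resp                                 # still conflicting - give up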

arche-lib utils for parallel requests

TODO - how to use Repo::map()