Conventions

Introduction

ARCHE Suite uses RDF metadata but doesn’t provide a SPARQL endpoint [1].

Instead of a SPARQL endpoint, ARCHE Suite provides its own REST API (referred to below simply as the “ARCHE API”). This API doesn’t give you as much flexibility as SPARQL, but it is much simpler, delivers data much faster and covers most everyday use cases.

This document supplements the technical OpenAPI documentation of the ARCHE API with practical examples illustrating its capabilities.

Metadata retrieval performance

Maximizing metadata retrieval performance comes down to a few simple rules:

Examples and discussion

Let’s compare two implementations of the following scenario: “fetch the title and last modification date of https://arche.acdh.oeaw.ac.at/api/8274 and the titles of all resources it refers to”:

  1. First fetch the https://arche.acdh.oeaw.ac.at/api/8274 metadata, then fetch all resources it points to one-by-one:

    import datetime
    import requests
    import rdflib

    API_PREFIX = 'https://arche.acdh.oeaw.ac.at/api/'

    t0 = datetime.datetime.now()
    # First request: metadata of the resource itself
    response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=resource&format=application/n-triples')
    resMeta = rdflib.Graph()
    resMeta.parse(data=response.text, format='nt')
    n = 1
    # Iterate over a snapshot, because the loop adds triples to the graph
    for s, p, o in list(resMeta):
        # Follow only triples pointing to other repository resources
        if str(o).startswith(API_PREFIX):
            response = requests.get(f'{o}/metadata?readMode=resource&format=application/n-triples')
            resMeta.parse(data=response.text, format='nt')
            n += 1

    print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read, {n} requests made")

    resulting in Elapsed time 0:00:16.629970, 465 triples read, 28 requests made.

  2. Fetch metadata of https://arche.acdh.oeaw.ac.at/api/8274 and all resources it refers to in one request by using the right readMode:

    t0 = datetime.datetime.now()
    response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=0_0_1_0&format=application/n-triples')
    resMeta = rdflib.Graph()
    resMeta.parse(data=response.text, format='nt')
    print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")

    resulting in Elapsed time 0:00:01.151311, 465 triples read.

As we can see, using one request instead of 28 reduced the time from 16.6 s to around 1.2 s. Moreover, the faster code is also shorter and simpler.

This effect depends largely on network latency: it will be less pronounced if you make requests over a local network and more pronounced if you make them over a slow network.
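As a rough back-of-the-envelope check, the two timings reported above can be turned into an estimate of the cost of each additional request. This is only a sketch: it assumes the whole time difference is due to the 27 extra requests, and the variable names are made up for illustration.

```python
# Timings reported above, in seconds
one_by_one_s = 16.629970  # 28 requests
single_s = 1.151311       # 1 request

# Assumption: the whole difference is caused by the extra requests
extra_requests = 28 - 1
overhead_per_request_s = (one_by_one_s - single_s) / extra_requests

print(f"~{overhead_per_request_s:.2f} s per additional request")
# prints: ~0.57 s per additional request
```

On a faster or slower connection this per-request cost changes, which is why the overall gain varies with network latency.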

Now let’s take a look at how much time we can save by fetching only the RDF properties we really want.

First, let’s just adapt the previous scenario. (You may skip analyzing the exact request URL, as we will get back to it later. For now it’s enough to trust that it does the job and to focus on the results.)

t0 = datetime.datetime.now()
response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=0_0_1_0&format=application/n-triples&relativesProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[1]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasUpdatedDate')
resMeta = rdflib.Graph()
resMeta.parse(data=response.text, format='nt')
print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")

resulting in Elapsed time 0:00:01.041821, 39 triples read.

Here we saved only 0.11 s, which corresponds to around 10% of the response time.

That doesn’t look like a huge gain but, hey, we’ve only saved 465 - 39 = 426 triples. What if we saved thousands of triples?

To test that, let’s fetch the title and last modification date of https://arche.acdh.oeaw.ac.at/api/8274 and the titles of all its children (there are a few hundred of them):

  1. By simply fetching all RDF data of the resource and its children:

    t0 = datetime.datetime.now()
    response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=1_0_0_0&format=application/n-triples')
    resMeta = rdflib.Graph()
    resMeta.parse(data=response.text, format='nt')
    print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")

    resulting in Elapsed time 0:00:05.231554, 23810 triples read.

  2. By limiting the set of retrieved RDF properties:

    t0 = datetime.datetime.now()
    response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=1_0_0_0&format=application/n-triples&relativesProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[1]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasUpdatedDate')
    resMeta = rdflib.Graph()
    resMeta.parse(data=response.text, format='nt')
    print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")

    resulting in Elapsed time 0:00:01.082643, 794 triples read.

Now the time went down from 5.2 s to 1.1 s, which is a noticeable gain.

By the way, the gain per saved triple is comparable in both scenarios; it is just so small for a single triple that it only becomes noticeable for large triple counts.
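The per-triple comparison can be checked with a few lines of arithmetic on the timings and triple counts reported above. The helper name below is hypothetical and the calculation assumes the whole time difference comes from the skipped triples.

```python
def gain_per_triple_ms(slow_s, fast_s, slow_triples, fast_triples):
    """Time saved per skipped triple, in milliseconds."""
    return (slow_s - fast_s) / (slow_triples - fast_triples) * 1000

# Resources referred to: 465 vs 39 triples
small = gain_per_triple_ms(1.151311, 1.041821, 465, 39)
# Children: 23810 vs 794 triples
large = gain_per_triple_ms(5.231554, 1.082643, 23810, 794)

print(f"{small:.2f} ms vs {large:.2f} ms per saved triple")
# prints: 0.26 ms vs 0.18 ms per saved triple
```

Both values are of the same order of magnitude, a fraction of a millisecond per triple.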

readMode in detail

As we already saw, choosing the right readMode parameter value is crucial for using the ARCHE REST API effectively. So which values can it take?

The full description of the readMode consists of four numbers separated by underscores: {children depth}_{parents depth}_{pointed to}_{pointed from}, where:

Examples

Let’s take the following sample repository structure (circles are resources with the black one being the requested one, black arrows are parent RDF properties, blue and red arrows are any other RDF properties):

  • readMode 0_0_0_0 matches only the requested resource (R0).
  • readMode 0_0_1_0 matches the requested resource (R0) and all resources it points to (with any RDF property): R2, R10 and R12.
  • readMode 0_0_0_1 matches the requested resource (R0) and all resources pointing to it (with any RDF property): R9, R3, R4 and R13.
  • readMode 0_1_0_0 matches the requested resource (R0) and its first order parent (R2).
  • readMode 0_2_0_0 matches the requested resource (R0) and its parents up to the second order (R1 and R2).
  • readMode 1_0_0_0 matches the requested resource (R0) and its first order children (R3 and R4).
  • readMode 2_0_0_0 matches the requested resource (R0) and its children up to the second order (R3, R4, R5 and R6).
  • readMode 2_1_1_0 is a union of the results for 2_0_0_0 (R0, R3, R4, R5, R6), 0_1_0_0 (R0, R2) and 0_0_1_0 (R0, R2, R10, R12), so all in all it covers R0, R2, R3, R4, R5, R6, R10 and R12.
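The matching rules from the examples above can be replayed on a toy in-memory model. This is only an illustrative sketch, not how the ARCHE API is implemented: the edge lists are reconstructed from the examples, the assignment of R5 and R6 to R3 and R4 is an assumption, and the names walk() and match() are hypothetical.

```python
# Edges are (subject, object) pairs.
# Assumption: R5 is a child of R3 and R6 is a child of R4.
PARENT_EDGES = {("R0", "R2"), ("R2", "R1"), ("R3", "R0"), ("R4", "R0"),
                ("R5", "R3"), ("R6", "R4")}
OTHER_EDGES = {("R0", "R10"), ("R0", "R12"), ("R9", "R0"), ("R13", "R0")}

def walk(start, depth, step):
    """Collect resources reachable from start in at most depth steps."""
    found, frontier = set(), {start}
    for _ in range(depth):
        frontier = {n for f in frontier for n in step(f)} - found
        if not frontier:  # nothing new, so large depths stop early
            break
        found |= frontier
    return found

def match(resource, read_mode):
    """Return the set of resources matched by a four-number readMode."""
    children, parents, pointed_to, pointed_from = (int(x) for x in read_mode.split("_"))
    all_edges = PARENT_EDGES | OTHER_EDGES
    matched = {resource}
    # Children point to their parent with the parent property
    matched |= walk(resource, children, lambda r: {s for s, o in PARENT_EDGES if o == r})
    matched |= walk(resource, parents, lambda r: {o for s, o in PARENT_EDGES if s == r})
    if pointed_to:
        matched |= {o for s, o in all_edges if s == resource}
    if pointed_from:
        matched |= {s for s, o in all_edges if o == resource}
    return matched

print(sorted(match("R0", "2_1_1_0"), key=lambda r: int(r[1:])))
# prints: ['R0', 'R2', 'R3', 'R4', 'R5', 'R6', 'R10', 'R12']
```

The result for 2_1_1_0 is simply the union of the sets collected for each of the four components, as in the last example above.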

readMode shorthands

A few of the most popular readMode values have shorthand labels which can be used instead:

shorthand         readMode value     remarks
resource          0_0_0_0
ids               0_0_0_0            limits fetched RDF properties to the label ({repoCfg}$.schema.label) only
neighbors         0_0_1_1
relatives         999999_999999_1_0
relativesOnly     999999_999999_0_0
relativesReverse  999999_999999_1_1
parents           0_999999_1_0
parentsOnly       0_999999_0_0
parentsReverse    0_999999_1_1
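For reference, the shorthands can be expanded into their four components on the client side. The API accepts the shorthand labels directly, so the sketch below is only a reading aid; the ReadMode type and the parse_read_mode() helper are hypothetical names, not part of any ARCHE library.

```python
from typing import NamedTuple

class ReadMode(NamedTuple):
    children_depth: int
    parents_depth: int
    pointed_to: int
    pointed_from: int

# Values copied from the shorthand list above.
# Note: "ids" additionally limits the fetched properties to the label.
SHORTHANDS = {
    "resource": "0_0_0_0",
    "ids": "0_0_0_0",
    "neighbors": "0_0_1_1",
    "relatives": "999999_999999_1_0",
    "relativesOnly": "999999_999999_0_0",
    "relativesReverse": "999999_999999_1_1",
    "parents": "0_999999_1_0",
    "parentsOnly": "0_999999_0_0",
    "parentsReverse": "0_999999_1_1",
}

def parse_read_mode(value: str) -> ReadMode:
    """Expand a shorthand (if any) and split it into its four numbers."""
    expanded = SHORTHANDS.get(value, value)
    return ReadMode(*(int(part) for part in expanded.split("_")))

print(parse_read_mode("parents"))
# prints: ReadMode(children_depth=0, parents_depth=999999, pointed_to=1, pointed_from=0)
```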

Limiting the retrieved RDF properties set

As we saw in the Metadata retrieval performance chapter, fetching only the RDF properties we are actually interested in can speed up metadata retrieval significantly.

This can be achieved with the resourceProperties and relativesProperties request parameters. The first one lists the RDF properties we want to fetch for the requested resource and the second one the RDF properties to be fetched for other resources collected according to the readMode.

Remarks:

Example

Let’s split the URL from the Metadata retrieval performance chapter into a (more) human-readable form (note that RDF property URIs passed in the request query have to be URL-encoded, hence %3A%2F%2F in place of ://, %2F instead of / and %23 instead of #):

https://arche.acdh.oeaw.ac.at/api/8274/metadata
  ?readMode=1_0_0_0
  &format=application/n-triples
  &relativesProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle
  &resourceProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle
  &resourceProperties[1]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasUpdatedDate
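Encoding the property URIs by hand is error-prone, so it may be easier to assemble such a URL programmatically. The sketch below uses only the Python standard library; build_metadata_url() and its parameters are hypothetical names chosen for illustration, and it reproduces exactly the URL shown above.

```python
from urllib.parse import quote

SCHEMA = "https://vocabs.acdh.oeaw.ac.at/schema#"

def build_metadata_url(resource_url, read_mode, resource_props=(),
                       relatives_props=(), fmt="application/n-triples"):
    """Assemble a metadata request URL with URL-encoded property URIs."""
    parts = [f"readMode={read_mode}", f"format={fmt}"]
    for i, prop in enumerate(relatives_props):
        # safe='' makes quote() encode ':', '/' and '#' as well
        parts.append(f"relativesProperties[{i}]={quote(prop, safe='')}")
    for i, prop in enumerate(resource_props):
        parts.append(f"resourceProperties[{i}]={quote(prop, safe='')}")
    return f"{resource_url}/metadata?" + "&".join(parts)

url = build_metadata_url(
    "https://arche.acdh.oeaw.ac.at/api/8274",
    "1_0_0_0",
    resource_props=[SCHEMA + "hasTitle", SCHEMA + "hasUpdatedDate"],
    relatives_props=[SCHEMA + "hasTitle"],
)
print(url)
```

The resulting string can be passed directly to requests.get() as in the earlier examples.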

Query parameters vs request headers

The ARCHE API allows various parameters to be provided either as request query parameters or as HTTP headers. Why are two methods available, and is there a difference between them?

All in all, just use the method you find more convenient in a given context.

Remarks:


  1. There are many reasons for that, ranging from performance and hardware resource consumption, through ensuring data consistency, to the ability to enforce access rights. Discussing them is beyond the scope of this document.