RDF property URIs are often shortened using the following prefixes:
acdh https://vocabs.acdh.oeaw.ac.at/schema#
acdhi https://id.acdh.oeaw.ac.at
Examples are provided either as plain URLs or in Python. The Python examples assume that the datetime, requests and rdflib libraries are loaded in the following way:
import datetime
import requests
import rdflib
The {repoCfg}$.X.Y syntax means a $.X.Y JSON path over the repository configuration returned by its describe REST API endpoint, e.g. {repoCfg}$.schema.label on https://arche.acdh.oeaw.ac.at/api resolves to https://vocabs.acdh.oeaw.ac.at/schema#hasTitle.
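To make the notation concrete, here is a minimal sketch of resolving such a path in Python. The `resolve_repo_cfg` helper and the hard-coded `sample_cfg` excerpt are illustrative assumptions (a real client would load the configuration from the describe endpoint), and only simple dotted paths of the `$.X.Y` form are handled.

```python
def resolve_repo_cfg(cfg, path):
    """Resolve a simple '$.X.Y' JSON path against a nested dict."""
    if not path.startswith('$.'):
        raise ValueError(f'unsupported path: {path}')
    value = cfg
    for key in path[2:].split('.'):
        value = value[key]  # raises KeyError if the path does not exist
    return value


# Hypothetical excerpt of a repository configuration, for illustration only
sample_cfg = {
    'schema': {
        'label': 'https://vocabs.acdh.oeaw.ac.at/schema#hasTitle',
    },
}

label_property = resolve_repo_cfg(sample_cfg, '$.schema.label')
print(label_property)
```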
The ARCHE API term means “the REST API provided by the arche-core” with its OpenAPI specification available here.
The requested resource means both the resource requested by the resource metadata REST API endpoint and all resources matched by the search condition in the search REST API endpoints (both the GET and the POST ones).
ARCHE Suite uses RDF metadata but doesn’t provide a SPARQL endpoint [1].
Instead of the SPARQL endpoint ARCHE Suite provides its own REST API (later on just “ARCHE API”). This API doesn’t give you as much flexibility as SPARQL but is much simpler, delivers data much faster and covers most everyday use cases.
This document supplements the technical OpenAPI documentation of the ARCHE API with practical examples illustrating its capabilities.
Maximizing metadata retrieval performance comes down to a few simple rules:
- Make as few requests as possible by choosing the right readMode.
- Fetch only the RDF properties you need with the resourceProperties and relativesProperties parameters.
- Request the application/n-triples format because it’s definitely the fastest, and if you think you can avoid parsing the RDF with a dedicated parser library by choosing the serialization format smartly (e.g. by requesting application/json or application/ld+json), then you are wrong.
Let’s compare two implementations of a “fetch the title and last modification date of https://arche.acdh.oeaw.ac.at/api/8274 and the titles of all resources it refers to” scenario:
First fetch the https://arche.acdh.oeaw.ac.at/api/8274 metadata, then fetch all resources it points to one-by-one:
t0 = datetime.datetime.now()
response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=resource&format=application/n-triples')
resMeta = rdflib.Graph()
resMeta.parse(data=response.text, format='nt')
n = 1
for i in list(resMeta):  # iterate over a snapshot because the graph grows inside the loop
    if str(i[2]).startswith('https://arche.acdh.oeaw.ac.at/api/'):
        response = requests.get(f'{i[2]}/metadata?readMode=resource&format=application/n-triples')
        resMeta.parse(data=response.text, format='nt')
        n += 1
print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read, {n} requests made")
resulting in
Elapsed time 0:00:16.629970, 465 triples read, 28 requests made
Fetch metadata of https://arche.acdh.oeaw.ac.at/api/8274 and all resources it refers to in one request by using the right readMode:
t0 = datetime.datetime.now()
response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=0_0_1_0&format=application/n-triples')
resMeta = rdflib.Graph()
resMeta.parse(data=response.text, format='nt')
print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")
resulting in
Elapsed time 0:00:01.151311, 465 triples read
As we see, using one request instead of 28 reduced the time from 16.6 s to around 1.2 s. Moreover, the faster code is also shorter and simpler.
This effect depends largely on the network latency: it will be less pronounced if you make requests over a local network and more pronounced if you make them over a slow one.
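As a rough back-of-the-envelope estimate, the timings reported above can be turned into a cost per additional request. The sketch assumes the extra time is spread evenly over the 27 additional requests, which is a simplification since the individual responses differ in size.

```python
# Timings reported above, in seconds
many_requests_time = 16.629970   # 28 requests
single_request_time = 1.151311   # 1 request

extra_requests = 28 - 1
overhead_per_request = (many_requests_time - single_request_time) / extra_requests
speedup = many_requests_time / single_request_time

print(f"~{overhead_per_request:.2f} s per additional request, ~{speedup:.1f}x faster overall")
```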
Now let’s take a look at how much time we can save by fetching only the RDF properties we really want.
First, let’s just adapt the previous scenario (you may skip analyzing the exact request URL as we will get back to it later; for now it’s enough to trust that it does the job and focus on the results):
t0 = datetime.datetime.now()
response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=0_0_1_0&format=application/n-triples&relativesProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[1]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasUpdatedDate')
resMeta = rdflib.Graph()
resMeta.parse(data=response.text, format='nt')
print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")
resulting in
Elapsed time 0:00:01.041821, 39 triples read
Here we saved only 0.11 s, corresponding to around 10% of the response time.
It doesn’t look like a huge gain but hey, we’ve only saved 465 - 39 = 426 triples. What if we save thousands of triples?
To test that, let’s fetch the title and last modification date of https://arche.acdh.oeaw.ac.at/api/8274 and the titles of all its children (there are a few hundred of them):
By just fetching all RDF data of the resource and its children
t0 = datetime.datetime.now()
response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=1_0_0_0&format=application/n-triples')
resMeta = rdflib.Graph()
resMeta.parse(data=response.text, format='nt')
print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")
resulting in
Elapsed time 0:00:05.231554, 23810 triples read
By limiting the set of retrieved RDF properties
t0 = datetime.datetime.now()
response = requests.get('https://arche.acdh.oeaw.ac.at/api/8274/metadata?readMode=1_0_0_0&format=application/n-triples&relativesProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle&resourceProperties[1]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasUpdatedDate')
resMeta = rdflib.Graph()
resMeta.parse(data=response.text, format='nt')
print(f"Elapsed time {datetime.datetime.now() - t0}, {len(resMeta)} triples read")
resulting in
Elapsed time 0:00:01.082643, 794 triples read
Now the time went down from 5.2 s to 1.1 s, which is a noticeable gain.
By the way, the gain per saved triple is comparable in both scenarios. It is just so small for a single triple that it only becomes noticeable for large triple counts.
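The per-triple gain can be checked with simple arithmetic on the timings and triple counts reported above. This is only a sketch based on single measurements, so the resulting numbers should be read as orders of magnitude rather than precise values.

```python
# (time without filtering [s], time with filtering [s], triples without, triples with)
scenarios = {
    'referred resources (0_0_1_0)': (1.151311, 1.041821, 465, 39),
    'children (1_0_0_0)': (5.231554, 1.082643, 23810, 794),
}

gain_per_triple_ms = {}
for name, (t_all, t_filtered, n_all, n_filtered) in scenarios.items():
    # time saved divided by the number of triples we did not have to fetch
    gain_per_triple_ms[name] = 1000 * (t_all - t_filtered) / (n_all - n_filtered)
    print(f"{name}: {gain_per_triple_ms[name]:.2f} ms saved per skipped triple")
```

Both values come out as a fraction of a millisecond per triple.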
As we already saw, choosing the right readMode parameter value is crucial for effective ARCHE REST API usage. So which values can it take?
The full readMode value consists of four numbers separated by underscores: {children depth}_{parents depth}_{pointed to}_{pointed from}, where:
- {children depth} and {parents depth} specify how far from the requested resource in the repository structure we want to go (by following the RDF property indicated by the parentProperty request parameter),
- {pointed to} and {pointed from} are binary flags (they take only the values 0 or 1) and indicate whether resources pointed to by the requested resource and resources pointing to the requested resource should be included, respectively.
Let’s take the following sample repository structure (circles are resources with the black one being the requested one, black arrows are parent RDF properties, blue and red arrows are any other RDF properties):
- 0_0_0_0 matches only the requested resource (R0).
- 0_0_1_0 matches the requested resource (R0) and all resources it points to (with any RDF property): R2, R10 and R12.
- 0_0_0_1 matches the requested resource (R0) and all resources pointing to it (with any RDF property): R9, R3, R4 and R13.
- 0_1_0_0 matches the requested resource (R0) and its first order parent (R2).
- 0_2_0_0 matches the requested resource (R0) and its parents up to the second order (R1 and R2).
- 1_0_0_0 matches the requested resource (R0) and its first order children (R3 and R4).
- 2_0_0_0 matches the requested resource (R0) and its children up to the second order (R3, R4, R5 and R6).
- 2_1_1_0 is a union of the results for 2_0_0_0 (R0, R3, R4, R5, R6), 0_1_0_0 (R0, R2) and 0_0_1_0 (R0, R2, R10, R12), so all in all it covers R0, R2, R3, R4, R5, R6, R10 and R12.
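The matching rules listed above can be sketched as a small simulation over a toy graph. The edge list below is an assumption reconstructed from the listed results (for example, which of R3 and R4 is the parent of R5 and R6 is a guess, and R7 is a hypothetical third order child added to show the depth limit). The `match` and `walk` helpers are illustrative only and are not part of the ARCHE API.

```python
PARENT = 'parent'
# (subject, property, object) triples; hypothetical reconstruction of the sample structure
EDGES = [
    ('R0', PARENT, 'R2'), ('R2', PARENT, 'R1'),
    ('R3', PARENT, 'R0'), ('R4', PARENT, 'R0'),
    ('R5', PARENT, 'R3'), ('R6', PARENT, 'R4'),
    ('R7', PARENT, 'R5'),                      # third order child, hypothetical
    ('R0', 'other', 'R10'), ('R0', 'other', 'R12'),
    ('R9', 'other', 'R0'), ('R13', 'other', 'R0'),
]


def walk(start, depth, step):
    """Collect resources reachable from start in at most `depth` steps."""
    found, frontier = set(), {start}
    for _ in range(depth):
        frontier = {n for r in frontier for n in step(r)} - found - {start}
        if not frontier:  # nothing new, so very large depths finish early
            break
        found |= frontier
    return found


def match(read_mode, start='R0'):
    """Return the set of resources matched by a four-part readMode value."""
    children, parents, to_flag, from_flag = (int(v) for v in read_mode.split('_'))
    matched = {start}
    matched |= walk(start, children, lambda r: {s for s, p, o in EDGES if p == PARENT and o == r})
    matched |= walk(start, parents, lambda r: {o for s, p, o in EDGES if p == PARENT and s == r})
    if to_flag:    # resources the requested resource points to (any property)
        matched |= {o for s, p, o in EDGES if s == start}
    if from_flag:  # resources pointing to the requested resource (any property)
        matched |= {s for s, p, o in EDGES if o == start}
    return matched


for mode in ['0_0_0_0', '0_0_1_0', '0_0_0_1', '0_2_0_0', '2_0_0_0', '2_1_1_0']:
    print(mode, sorted(match(mode), key=lambda r: int(r[1:])))
```

In this sketch the two flags are applied to the requested resource only, which is how the list above describes them.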
A few of the most popular readMode values have shorthand labels which can be used instead:
shorthand | readMode value | remarks |
---|---|---|
resource | 0_0_0_0 | |
ids | 0_0_0_0 | limits fetched RDF properties to the label ({repoCfg}$.schema.label) only |
neighbors | 0_0_1_1 | |
relatives | 999999_999999_1_0 | |
relativesOnly | 999999_999999_0_0 | |
relativesReverse | 999999_999999_1_1 | |
parents | 0_999999_1_0 | |
parentsOnly | 0_999999_0_0 | |
parentsReverse | 0_999999_1_1 | |
As we saw in the Metadata retrieval performance chapter, fetching only the RDF properties we are actually interested in can speed up the metadata retrieval significantly.
This can be achieved with the resourceProperties and relativesProperties request parameters.
The first one lists the RDF properties we want to fetch for the requested resource and the second one lists the RDF properties to be fetched for the other resources collected according to the readMode.
Remarks:
- RDF properties have to be given as full URIs, e.g. https://vocabs.acdh.oeaw.ac.at/schema#hasTitle (acdh:hasTitle won’t work).
- The properties can also be passed with an HTTP header as a comma-separated list, e.g. X-RESOURCE-PROPERTIES: https://vocabs.acdh.oeaw.ac.at/schema#hasTitle,https://vocabs.acdh.oeaw.ac.at/schema#hasUpdatedDate.
- […] relativesProperties list.
Let’s split the URL from the Metadata retrieval performance chapter into a (more) human-readable form (note that RDF property URIs passed in the request query have to be URL-encoded, hence %3A%2F%2F in place of ://, %2F instead of / and %23 instead of #):
https://arche.acdh.oeaw.ac.at/api/8274/metadata
?readMode=1_0_0_0
&format=application/n-triples
&relativesProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle
&resourceProperties[0]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasTitle
&resourceProperties[1]=https%3A%2F%2Fvocabs.acdh.oeaw.ac.at%2Fschema%23hasUpdatedDate
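Rather than encoding the property URIs by hand, such a URL can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The `metadata_url` helper and its argument names are illustrative assumptions, not part of the ARCHE API.

```python
from urllib.parse import quote

BASE = 'https://arche.acdh.oeaw.ac.at/api'
ACDH = 'https://vocabs.acdh.oeaw.ac.at/schema#'


def metadata_url(res_id, read_mode, resource_props=(), relatives_props=(),
                 fmt='application/n-triples'):
    """Build a metadata request URL with URL-encoded RDF property URIs."""
    parts = [f'readMode={read_mode}', f'format={fmt}']
    # safe='' makes quote() encode ':', '/' and '#' as well
    parts += [f'relativesProperties[{i}]={quote(p, safe="")}'
              for i, p in enumerate(relatives_props)]
    parts += [f'resourceProperties[{i}]={quote(p, safe="")}'
              for i, p in enumerate(resource_props)]
    return f'{BASE}/{res_id}/metadata?' + '&'.join(parts)


url = metadata_url(
    8274, '1_0_0_0',
    resource_props=[ACDH + 'hasTitle', ACDH + 'hasUpdatedDate'],
    relatives_props=[ACDH + 'hasTitle'],
)
print(url)
```

The resulting string is the same URL as the one split into parts above, joined back into a single line.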
The ARCHE API allows various parameters to be passed either as request query parameters or as HTTP headers. Why are two methods available and is there a difference between them?
- […] @paramName syntax used by the https://hdl.handle.net resolver). In contrast, most HTTP clients and libraries either preserve HTTP headers across redirects by default or can be easily set up to do so. This ensures that the parameters you want to pass to the ARCHE API survive intermediate redirects and reach the ARCHE API.
All in all, just use the method you find more convenient in a given context.
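As an illustration of the header-based variant, the sketch below builds the X-RESOURCE-PROPERTIES header shown in the remarks of the previous chapter. The `resource_properties_header` helper is a hypothetical name, and the header name is taken from that example. Other deployments may use different header names, so treat it as an assumption.

```python
ACDH = 'https://vocabs.acdh.oeaw.ac.at/schema#'


def resource_properties_header(properties):
    """Return a headers dict carrying the property list as a comma-separated value.

    Unlike query parameters, the URIs are passed as-is (not URL-encoded).
    """
    return {'X-RESOURCE-PROPERTIES': ','.join(properties)}


headers = resource_properties_header([ACDH + 'hasTitle', ACDH + 'hasUpdatedDate'])
print(headers)
```

With the requests library such a dict could be passed as `requests.get(url, headers=headers)`.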
Remarks:
- The names of the HTTP headers are defined in the repository configuration under {repoCfg}$.rest.headers.
[1] There are many reasons for that, ranging from performance and hardware resource consumption, through assuring data consistency, up to the ability to enforce access rights. Discussing them is beyond the scope of this document.