Documentation for the ARCHE repository software stack
You need:
- Docker installed on your machine (all services run in Docker containers).
Remarks:
- Depending on your Docker setup you may need to add the --privileged switch to docker run commands (or privileged: true in the docker-compose.yaml).

In this step we will just set up an ARCHE Suite instance as it is used at the ACDH-CH.
It creates a fully-fledged setup with some optional services and also ingests some data into the repository.
Run:
docker run --name arche-suite -p 80:80 -e CFG_BRANCH=arche -e ADMIN_PSWD='myAdminPassword' -d acdhch/arche
then follow the startup progress with
docker logs -f arche-suite
and wait until you see something like:
##########
# Starting supervisord
##########
2023-06-22 08:44:35,458 INFO Included extra file "/home/www-data/config/supervisord.conf.d/postgresql.conf" during parsing
2023-06-22 08:44:35,458 INFO Included extra file "/home/www-data/config/supervisord.conf.d/tika.conf" during parsing
2023-06-22 08:44:35,459 INFO RPC interface 'supervisor' initialized
2023-06-22 08:44:35,459 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-06-22 08:44:35,459 INFO supervisord started with pid 1253
2023-06-22 08:44:36,462 INFO spawned: 'initScripts' with pid 1255
2023-06-22 08:44:36,463 INFO spawned: 'apache2' with pid 1256
2023-06-22 08:44:36,464 INFO spawned: 'postgresql' with pid 1257
2023-06-22 08:44:36,464 INFO spawned: 'tika' with pid 1258
2023-06-22 08:44:36,465 INFO spawned: 'txDaemon' with pid 1259
2023-06-22 08:44:37,496 INFO success: initScripts entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-06-22 08:44:37,496 INFO success: apache2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-06-22 08:44:37,496 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-06-22 08:44:37,496 INFO success: tika entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-06-22 08:44:37,496 INFO success: txDaemon entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Now hit CTRL+c.
At this point the repository is up and running.
The initialization scripts may still be ingesting data, though. Follow their progress with
docker exec -ti -u www-data arche-suite tail -f log/initScripts.log
and wait until you see (this may take a few minutes, as two big controlled vocabularies are being imported):
##########
# INIT SCRIPTS ENDED
##########
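You can already make a quick smoke test of the REST API (the same describe endpoint is used later in this guide):
curl -i http://127.0.0.1/api/describe
It should return HTTP 200 and a YAML description of the repository configuration.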
Now log into the container:
docker exec -ti -u www-data arche-suite bash
and create a collection.ttl file describing a top collection resource and a TEI-XML resource according to the ACDH-CH metadata schema:
echo '
@prefix n1: <https://vocabs.acdh.oeaw.ac.at/schema#>.
<https://id.acdh.oeaw.ac.at/traveldigital> a n1:TopCollection;
    n1:hasIdentifier <https://hdl.handle.net/21.11115/0000-000C-29F3-4>;
    n1:hasPid "https://hdl.handle.net/21.11115/0000-000C-29F3-4"^^<http://www.w3.org/2001/XMLSchema#anyURI>;
    n1:hasTitle "travel!digital Collection"@en;
    n1:hasDescription "A digital collection of early German travel guides on non-European countries which were released by the Baedeker publishing house between 1875 and 1914. The collection consists of the travel!digital Corpus (XML/TEI transcriptions of first editions (5 volumes) including structural, semantic and linguistic annotations), the travel!digital Facsimiles (scans and photographs of the historical prints), the travel!digital Auxiliary Files (a TEI schema of the annotations applied in the corpus, and a list of term labels for indexing names of persons annotated in the corpus), and the travel!digital Thesaurus (a SKOS vocabulary covering designations of groups and selected sights annotated in the corpus).\n The collection was created within the GO!DIGITAL 1.0 project \"travel!digital. Exploring People and Monuments in Baedeker Guidebooks (1875-1914)\", Project-Nr.: ÖAW0204.\n Image creation was done in 2004 at the Austrian Academy of Sciences (AAC-Austrian Academy Corpus)."@en;
    n1:hasSubject "Karl Baedeker"@en,
        "historical travel guides"@en;
    n1:hasHosting <https://id.acdh.oeaw.ac.at/arche>;
    n1:hasRightsHolder <https://d-nb.info/gnd/1001454-8>;
    n1:hasLicensor <https://d-nb.info/gnd/1123037736>;
    n1:hasMetadataCreator <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasCurator <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasCreator <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasDepositor <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasContact <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasOwner <https://d-nb.info/gnd/1123037736>.
<https://id.acdh.oeaw.ac.at/traveldigital/Corpus/Baedeker-Mittelmeer_1909.xml> a n1:Resource;
    n1:hasTitle "Karl Baedeker: Das Mittelmeer. Handbuch für Reisende: Digital Edition"@en;
    n1:hasCategory <http://purl.org/dc/dcmitype/Dataset>;
    n1:hasDepositor <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasMetadataCreator <https://id.acdh.oeaw.ac.at/uczeitschner>;
    n1:hasOwner <https://d-nb.info/gnd/1123037736>;
    n1:hasRightsHolder <https://d-nb.info/gnd/1001454-8>;
    n1:hasLicensor <https://d-nb.info/gnd/1123037736>;
    n1:hasLicense <https://creativecommons.org/licenses/by/4.0/>;
    n1:hasHosting <https://id.acdh.oeaw.ac.at/arche>;
    n1:isPartOf <https://id.acdh.oeaw.ac.at/traveldigital>.
' > collection.ttl
Install the metadata ingestion scripts:
composer require acdh-oeaw/arche-ingest
and import the metadata:
~/vendor/bin/arche-import-metadata collection.ttl http://127.0.0.1/api admin myAdminPassword
which should report:
created http://127.0.0.1/api/11305 (1/2)
created http://127.0.0.1/api/11306 (2/2)
Then download a sample binary file:
mkdir sampledata
curl https://arche.acdh.oeaw.ac.at/api/29688 > sampledata/Baedeker-Mittelmeer_1909.xml
and upload it into the repository:
~/vendor/bin/arche-import-binary \
sampledata \
https://id.acdh.oeaw.ac.at/traveldigital/Corpus \
http://127.0.0.1/api admin myAdminPassword
At this point we have a repository with some data in it.
We can check it out in a few ways:
- fetch a resource's metadata by appending /metadata to its URL ({originalURL}/metadata),
- request a particular RDF serialization of the metadata by appending ?format={RDF format MIME},
  e.g. http://127.0.0.1/api/11305/metadata?format=application/rdf%2Bxml
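For example (the resource ids come from the ingestion output above and may differ in your instance):
curl 'http://127.0.0.1/api/11305/metadata'
curl 'http://127.0.0.1/api/11305/metadata?format=text/turtle'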
Now let's take a step back and make a step-by-step installation starting from a minimal setup.
Let's say we want to set up a repository served under the my.domain domain.
Let's go:
First, we need to make my.domain point to our own machine.
This can be done by adding a line to the /etc/hosts file
(you need admin rights for that; by the way, this file also exists under Windows - google it):
127.0.0.1 my.domain
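You can verify the mapping with, e.g.:
getent hosts my.domain
which should print 127.0.0.1 my.domain (on Linux; on other systems just try ping my.domain).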
Fetch the ARCHE Suite Docker image configuration:
git clone --depth 1 --branch master https://github.com/acdh-oeaw/arche-docker-config.git
Prepare the local configuration:
cp arche-docker-config/yaml/local.yaml.sample arche-docker-config/yaml/local.yaml
and edit it so it looks as follows:
rest:
  urlBase: http://my.domain
  pathBase: /api/
As we will run the database in a separate container, remove the config starting the PostgreSQL inside the arche-core container:
rm arche-docker-config/supervisord.conf.d/postgresql.conf
Create a directory for the logs:
mkdir log
and prepare a docker-compose.yaml:
services:
  # container for "external" Postgresql database
  postgresql:
    image: postgis/postgis:15-master
    volumes:
      - postgresql:/var/lib/postgresql/data
    networks:
      - backend
    environment:
      - "POSTGRES_PASSWORD=${MYREPO_DB_PSWD}"
  # container for the arche-core
  arche-core:
    image: acdhch/arche
    volumes:
      - data:/home/www-data/data
      - ./arche-docker-config:/home/www-data/config
      - ./log:/home/www-data/log
    environment:
      - PG_HOST=postgresql
      - PG_USER=postgres
      - "PG_PSWD=${MYREPO_DB_PSWD}"
      - PG_DBNAME=arche
      - "USER_UID=${USER_UID}"
      - "USER_GID=${USER_GID}"
      - "ADMIN_PSWD=${ADMIN_PSWD}"
    ports:
      - "80:80"
    networks:
      - backend
      - bridge
    depends_on:
      - postgresql
networks:
  backend:
    driver: bridge
  bridge:
volumes:
  postgresql:
  data:
and an .env
file containing private environment variables:
MYREPO_DB_PSWD=strongPassword
ADMIN_PSWD=anotherStrongPassword
USER_UID=number reported by running id -u
USER_GID=number reported by running id -g
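A minimal sketch generating the .env file from the shell (substitute your own passwords):
cat > .env <<EOF
MYREPO_DB_PSWD=strongPassword
ADMIN_PSWD=anotherStrongPassword
USER_UID=$(id -u)
USER_GID=$(id -g)
EOF
You can then check the resolved configuration with docker compose config before starting anything.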
Start everything with:
docker compose up
and once the startup finishes, run:
curl -i http://my.domain/api/describe
You should get a YAML file describing the repository instance configuration which might be relevant from the client perspective.
Congratulations, at this point we have the repository backbone up and running.
If something did not work, you can inspect:
- the output of the docker compose up command,
- the log files in the log directory, especially the error.log, rest.log and initScripts.log.

ARCHE Suite itself does not enforce any metadata schema. You can use whichever you want.
Still, some ARCHE components define concepts which have to be mapped to RDF properties to make everything work, e.g. the resource identifier, label or file name (see the mapping below).
Here and now let's create a mapping for the arche-core only, assuming we want to use Dublin Core wherever suitable and artificial predicates for everything else (especially for the technical predicates used by the API).
To do that please modify the top part of the schema section of the arche-docker-config/yaml/schema.yaml file so it looks as follows:
schema:
  id: http://purl.org/dc/terms/identifier
  parent: http://purl.org/dc/terms/isPartOf
  label: http://purl.org/dc/terms/title
  delete: delete://delete
  searchMatch: search://match
  searchOrder: search://order
  searchOrderValue: search://orderValue
  searchFts: search://fts
  searchCount: search://count
  binarySize: http://purl.org/dc/terms/extent
  fileName: file://name
  mime: http://purl.org/dc/terms/format
  hash: file://hash
  modificationDate: http://purl.org/dc/terms/modified
  modificationUser: http://purl.org/dc/terms/contributor
  binaryModificationDate: file://modified
  binaryModificationUser: file://modifiedUser
  creationDate: http://purl.org/dc/terms/created
  creationUser: http://purl.org/dc/terms/creator
and then restart the arche-core by hitting CTRL+c
on the console where you run docker compose up
and running docker compose up
again.
Let’s ingest one metadata-only resource and a TEI-XML file as its child.
Create a sampleData folder and a sampleData/metadata.ttl file in it containing:
@prefix dc: <http://purl.org/dc/terms/>.
<http://id.namespace/collection1> dc:title "Sample collection"@en .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> dc:title "Sample TEI-XML"@en ;
dc:isPartOf <http://id.namespace/collection1> .
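If you prefer to do it from the shell, the same pattern we used for the collection.ttl works:
mkdir sampleData
echo '
@prefix dc: <http://purl.org/dc/terms/>.
<http://id.namespace/collection1> dc:title "Sample collection"@en .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> dc:title "Sample TEI-XML"@en ;
    dc:isPartOf <http://id.namespace/collection1> .
' > sampleData/metadata.ttl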
Then run the dockerized ingestion scripts (with the sampleData directory available under /data):
docker run --rm -ti --network host -v ./sampleData:/data acdhch/arche-ingest
and import the metadata inside the container:
vendor/bin/arche-import-metadata /data/metadata.ttl http://my.domain/api admin ADMIN_PSWD_as_set_in_.env_file
which should report:
created http://my.domain/api/1 (1/2)
created http://my.domain/api/2 (2/2)
Then download the sample TEI-XML file:
curl https://arche.acdh.oeaw.ac.at/api/29688 > /data/Baedeker-Mittelmeer_1909.xml
and upload it:
vendor/bin/arche-import-binary \
/data \
http://id.namespace \
http://my.domain/api admin ADMIN_PSWD_as_set_in_.env_file \
--skip not_exist
Processing /data/Baedeker-Mittelmeer_1909.xml (1/2 50%): update + upload http://my.domain/api/2
exit
Now we have some rudimentary data and we can check if our metadata schema has been picked up.
Fetch the metadata of the TEI-XML binary.
It is http://my.domain/api/2 in my case but check your ingestion logs for yours.
curl -u 'admin:ADMIN_PSWD_as_set_in_.env_file' 'http://my.domain/api/2/metadata?readMode=resource'
which in my case resulted in:
@prefix n0: <http://my.domain/api/>.
@prefix n1: <file://>.
@prefix n2: <http://purl.org/dc/terms/>.
@prefix n3: <http://id.namespace/>.
@prefix n4: <https://vocabs.acdh.oeaw.ac.at/schema#>.
<http://my.domain/api/2> n1:modified "2023-07-06T11:56:53.559750"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
    n2:title "Sample TEI-XML"@en;
    n2:creator "admin";
    n2:isPartOf <http://my.domain/api/1>;
    n2:identifier <http://id.namespace/Baedeker-Mittelmeer_1909.xml>;
    n2:extent "32380001"^^<http://www.w3.org/2001/XMLSchema#integer>;
    n2:identifier <http://my.domain/api/2>;
    n2:modified "2023-07-06T11:56:53.659461"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
    n4:aclRead "admin";
    n1:modifiedUser "admin";
    n1:hash "sha1:ad8a457099d70990f6d936182f0e3b2c35a19ad6";
    n2:contributor "admin";
    n4:aclWrite "admin";
    n1:name "Baedeker-Mittelmeer_1909.xml";
    n2:format "application/xml";
    n2:created "2023-07-06T11:56:30.757867"^^<http://www.w3.org/2001/XMLSchema#dateTime>.
We can see that:
- dc:title, dc:identifier and dc:isPartOf look exactly like we set them up in the sampleData/metadata.ttl,
- dc:creator "admin" appeared because we used the admin account for the ingestion,
- dc:extent "32380001"^^<http://www.w3.org/2001/XMLSchema#integer> and <file://hash> "sha1:ad8a457099d70990f6d936182f0e3b2c35a19ad6" appeared because the arche-core computed them while storing the uploaded file,
- <file://name> "Baedeker-Mittelmeer_1909.xml" appeared because the upload script provided this information while uploading the file,
- <https://vocabs.acdh.oeaw.ac.at/schema#aclWrite> "admin" appeared as well, but this is discussed in the next chapter.
You can also try:
curl -u 'admin:ADMIN_PSWD_as_set_in_.env_file' 'http://my.domain/api/2/metadata?readMode=neighbors'
curl -u 'admin:ADMIN_PSWD_as_set_in_.env_file' 'http://my.domain/api/2'
curl -i http://my.domain/api/2
The arche-ingest repository provides scripts automating metadata and binary data upload into the repository.
Access control is based on roles, which are a generalization of the user and group concepts.
- During the repository initialization an admin role is created.
- There is a special public role which is used to indicate unauthenticated users.
You can create and modify users using the {repo base URL}/user REST API endpoint (see the swagger API documentation for details), e.g.
curl -u 'admin:ADMIN_PSWD_as_set_in_.env_file' 'http://my.domain/api/user'
Let's create a user bob belonging to the creators role:
curl -i -u 'admin:anotherStrongPassword' 'http://my.domain/api/user/bob' \
-X PUT \
-H 'Content-Type: application/json' \
--data-binary '{"groups": ["creators"], "password": "randomPassword"}'
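To verify the new account works, you can authenticate as bob against the same endpoint (assuming a user is allowed to read their own data):
curl -u 'bob:randomPassword' 'http://my.domain/api/user/bob'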
The access control settings are stored in the accessControl
section of the arche-docker-config/yaml/repo.yaml
file.
In our case it should look more or less as follows:
accessControl:
  publicRole: public
  adminRoles:
    - admin
  create:
    # who can create new resources
    allowedRoles:
      - creators
    # rights assigned to the creator upon resource creation
    creatorRights:
      - read
      - write
    # rights assigned to other roles upon resource creation
    assignRoles:
      read: []
  defaultAction:
    read: deny
    write: deny
  enforceOnMetadata: true
  schema:
    read: https://vocabs.acdh.oeaw.ac.at/schema#aclRead
    write: https://vocabs.acdh.oeaw.ac.at/schema#aclWrite
  db:
    connStr: 'pgsql: user={PG_USER_PREFIX}repo dbname={PG_DBNAME} host={PG_HOST} port={PG_PORT}'
    table: users
    userCol: user_id
    dataCol: data
  authMethods:
    - class: \zozlak\auth\authMethod\TrustedHeader
      parameters:
        - HTTP_EPPN
    - class: \zozlak\auth\authMethod\HttpBasic
      parameters:
        - repo
    - class: \zozlak\auth\authMethod\Guest
      parameters:
        - public
Let's analyze it step by step:
- publicRole: public defines the name of the role used to indicate unauthenticated users.
- The adminRoles list (here containing only admin) defines the admin roles. Admin rights are needed to create new roles. Also, having the admin rights allows you to freely create, read, modify and delete repository resources.
- The create section:
  create:
    allowedRoles:
      - creators
    creatorRights:
      - read
      - write
    assignRoles:
      read: []
  defines who can create new repository resources and what default access rights are set on newly created resources. Here:
  - only roles listed in allowedRoles (here creators) can create new resources,
  - the creator gets read and write rights on the newly created resource (removing write from this list would enforce creation of immutable repository resources, at least unless you are an admin),
  - assignRoles defines rights granted to other roles upon resource creation; e.g. assignRoles: read: [public] would make every newly created resource publicly readable (here no such rights are granted).
- The defaultAction section:
  defaultAction:
    read: deny
    write: deny
  determines what to do if no access control rule has been matched. In this case the access is denied. You can e.g. consider setting write: allow.
- enforceOnMetadata: true enforces the read access rights to be applied also to the metadata (the write access rights are always applied both to the metadata and the resource binary content).
- The schema section defines the RDF properties used to store the access control information in the metadata. You can choose them however you want, just do the adjustment before you start ingesting resources into the repository.
- The db section contains internal config we will not dig into.
- The authMethods section contains the configuration of the zozlak/auth authentication framework. In this case:
  - first the EPPN HTTP header is checked and, if it exists, the role name is taken from the header value. This is quite a common integration scenario for single sign-on authentication methods like the Shibboleth Apache module,
  - then HTTP basic authentication against the roles stored in the repository database is tried,
  - if everything else fails, the public role is assumed.

Now let's try to use the bob role we created in the examples at the beginning of the chapter to allow public read rights on the TEI-XML resource (http://my.domain/api/2 in my case).
First, we need to allow bob to modify the resource, which is currently possible only for the admin role (and all roles with admin privileges).
This can be done:
- by adding bob to the accessControl.adminRoles list in the arche-docker-config/yaml/repo.yaml and restarting the repository's Docker container - but we will not use this, as we do not want bob to become an admin,
- by granting bob write rights on the resource, precisely by adding the {accessControl.schema.write} "bob" triple (in our case <https://vocabs.acdh.oeaw.ac.at/schema#aclWrite> "bob") to the resource's metadata. This is the solution we prefer.
For that let's create a sampleData/acl1.ttl file:
After ingesting it, bob should be able to grant public read rights by importing a sampleData/acl2.ttl:
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "public" .
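A shell sketch creating both files:
cat > sampleData/acl1.ttl <<'EOF'
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclWrite> "bob", "admin" .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "bob", "admin" .
EOF
cat > sampleData/acl2.ttl <<'EOF'
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "public" .
EOF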
Let's ingest both metadata files the same way we did before.
Run the dockerized ingestion scripts (with the sampleData directory available under /data):
docker run --rm -ti --network host -v ./sampleData:/data acdhch/arche-ingest
First, try to ingest the acl1.ttl as the bob user:
vendor/bin/arche-import-metadata /data/acl1.ttl http://my.domain/api bob randomPassword
and note it failed to perform the update:
(...)
Ingested resources count: 0 errors count: 0
Then repeat the import as the admin:
vendor/bin/arche-import-metadata /data/acl1.ttl http://my.domain/api admin ADMIN_PSWD_as_set_in_.env_file
which should succeed
(...)
Importing http://id.namespace/Baedeker-Mittelmeer_1909.xml (1/1)
updating http://my.domain/api/2 (1/1)
Ingested resources count: 1 errors count: 0
Now bob has write rights on the resource and can grant the public read rights by ingesting acl2.ttl:
vendor/bin/arche-import-metadata /data/acl2.ttl http://my.domain/api bob randomPassword
Finally, leave the container:
exit
and check that the resource metadata and content are now publicly readable:
curl -i 'http://my.domain/api/2/metadata?readMode=resource'
curl 'http://my.domain/api/2'
Now let's try to add an OAI-PMH service to our repository to make it harvestable by external aggregators.
We can deploy it either in the same Docker container as the arche-core or in a separate one. In this example we will use the first approach.
First, add acdh-oeaw/arche-oaipmh to the packages list in arche-docker-config/composer.json:
{
  "require": {
    "acdh-oeaw/arche-core": "^4",
    "zozlak/yaml-merge": "^1",
    "acdh-oeaw/arche-oaipmh": "^4.2"
  }
}
This will make arche-oaipmh be installed on the Docker container startup.
Then create the arche-docker-config/run.d/oaipmh.sh file initializing the OAI-PMH service under the {repoBaseUrl}/oaipmh path (and remember to make it executable with chmod +x arche-docker-config/run.d/oaipmh.sh):
#!/bin/bash
if [ ! -d /home/www-data/docroot/oaipmh ]; then
    su -l www-data -c 'mkdir /home/www-data/docroot/oaipmh'
    su -l www-data -c 'ln -s /home/www-data/vendor /home/www-data/docroot/oaipmh/vendor'
fi
su -l www-data -c 'cp /home/www-data/vendor/acdh-oeaw/arche-oaipmh/.htaccess /home/www-data/docroot/oaipmh/.htaccess'
su -l www-data -c 'cp /home/www-data/vendor/acdh-oeaw/arche-oaipmh/index.php /home/www-data/docroot/oaipmh/index.php'
# compile the final OAI-PMH config file from arche-docker-config/yaml/oaipmh.yaml
# and arche-docker-config/yaml/local.yaml
CMD=/home/www-data/vendor/zozlak/yaml-merge/bin/yaml-edit.php
CFGD=/home/www-data/config/yaml
rm -f /home/www-data/docroot/oaipmh/config.yaml $CFGD/config-oaipmh.yaml
su -l www-data -c "$CMD --src $CFGD/oaipmh.yaml --src $CFGD/local.yaml $CFGD/config-oaipmh.yaml"
su -l www-data -c "ln -s $CFGD/config-oaipmh.yaml /home/www-data/docroot/oaipmh/config.yaml"
Then prepare the OAI-PMH service configuration in arche-docker-config/yaml/oaipmh.yaml:
oai:
  info:
    repositoryName: my repository
    baseURL: http://my.domain/oaipmh/
    earliestDatestamp: "1900-01-01T00:00:00Z"
    adminEmail: admin@my.domain
    granularity: YYYY-MM-DDThh:mm:ssZ
  # the guest user is created during the arche-core initialization and we can reuse it
  dbConnStr: "pgsql: user=guest dbname=postgres host=postgresql"
  cacheDir: ""
  logging:
    file: /home/www-data/log/oaipmh.log
    level: info
  deleted:
    deletedClass: \acdhOeaw\arche\oaipmh\deleted\Tombstone
    deletedRecord: transient
  search:
    searchClass: \acdhOeaw\arche\oaipmh\search\BaseSearch
    dateProp: http://purl.org/dc/terms/modified
    idNmsp: http://my.domain
    id: http://purl.org/dc/terms/identifier
    searchMatch: search://match
    searchCount: search://count
    repoBaseUrl: http://my.domain/api/
    resumptionTimeout: 120
    resumptionDir: "tmp"
    resumptionKeepAlive: 600
  sets:
    setClass: \acdhOeaw\arche\oaipmh\set\NoSets
  formats:
    oai_dc:
      metadataPrefix: oai_dc
      schema: http://www.openarchives.org/OAI/2.0/oai_dc.xsd
      metadataNamespace: http://www.openarchives.org/OAI/2.0/oai_dc/
      class: \acdhOeaw\arche\oaipmh\metadata\DcMetadata
For more details please look at the sample config provided in the arche-oaipmh git repository and at the metadata format classes documentation.
Finally, restart the repository Docker container by hitting CTRL+c on the console where you run docker compose up and running docker compose up again.
We should have the OAI-PMH service with a very basic configuration running now. You can try:
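For example, the standard OAI-PMH protocol requests (the endpoint path follows the baseURL we set in the config above):
curl 'http://my.domain/oaipmh/?verb=Identify'
curl 'http://my.domain/oaipmh/?verb=ListMetadataFormats'
curl 'http://my.domain/oaipmh/?verb=ListRecords&metadataPrefix=oai_dc'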
As we internally store metadata in the Dublin Core schema, it was possible to use the very simple
metadata format class \acdhOeaw\arche\oaipmh\metadata\DcMetadata
which does not require any
additional config.
In real-world scenarios you will almost certainly need to prepare templates using the \acdhOeaw\arche\oaipmh\metadata\LiveCmdiMetadata class, which will map your internal metadata schema into the schema you want to provide to an external aggregator.
Please read this documentation and the template examples used at the ACDH-CH.
Now an advanced topic - plugging your own logic into the arche-core.
This is possible in two ways: by registering PHP functions as handlers or by connecting external handlers through an AMQP broker (e.g. RabbitMQ).
When using PHP function handlers, for the txBegin, txCommit and txRollback events your handler signature should be:
function myHandler(
    string $event,
    int $transactionId,
    array<int> $idsOfResourcesModifiedByTheTransaction
): void
and for all other events (get, getMetadata, create, updateBinary, updateMetadata, delete, deleteTombstone) your handler signature should be:
function myHandler(
    string $event,
    EasyRdf\Resource $resourceMetadata,
    ?string $pathToBinaryPayload
): EasyRdf\Resource
and it should return the final metadata of the resource (as an EasyRdf\Resource
object).
Handlers are registered in the rest.handlers section of the arche-docker-config/yaml/repo.yaml file, e.g.
rest:
  handlers:
    create:
      - type: function
        function: myClassName::myHandler
When the AMQP integration is used, the message sent to your handler's queue is a JSON object whose content depends on the event.
For the txBegin, txCommit and txRollback events it contains:
- method - the event name,
- transactionId - the identifier of the transaction,
- resourceIds - an array containing internal ids of the repository resources modified by this transaction (e.g. [3, 8, 125]).
For all other events (get, getMetadata, create, updateBinary, updateMetadata, delete, deleteTombstone) it contains:
- method - the event name, e.g. get, updateMetadata, etc.,
- path - the path to the resource binary content,
- uri - the full resource URI, e.g. http://my.domain/api/2,
- id - the numeric internal id of the resource, e.g. 2,
- metadata - the RDF metadata of the resource in the application/n-triples format.
The handler response should be a JSON object with a status field.
For the txBegin, txCommit and txRollback events, that is it.
For other events the response from the handler should also provide the target resource's metadata serialized in application/n-triples in the metadata property.
A non-zero status makes the request fail with the HTTP code equal to the status property value and the response body containing the optional message property value (if it is missing, a default error message is used).
By the way, it is possible to mix both methods.
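For illustration, a hypothetical updateMetadata message could look roughly like this (the path value is made up here):
{
  "method": "updateMetadata",
  "path": "/path/to/the/binary",
  "uri": "http://my.domain/api/2",
  "id": 2,
  "metadata": "<http://my.domain/api/2> <http://purl.org/dc/terms/title> \"Sample TEI-XML\"@en ."
}
and a corresponding successful handler response:
{
  "status": 0,
  "metadata": "<http://my.domain/api/2> <http://purl.org/dc/terms/title> \"Sample TEI-XML\"@en ."
}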
As the first approach (PHP handlers with no AMQP) is well illustrated in our own ARCHE Suite deployment code (you may look here and here), in this tutorial we will take the second approach and implement a simple metadata consistency check handler in Python over the AMQP.
First, extend the services section of the docker-compose.yaml with a RabbitMQ container:
services:
  rabbitmq:
    image: rabbitmq
    networks:
      - backend
Then add a container which will run our handlers, also in the services section of the docker-compose.yaml:
services:
  handlers:
    image: python:3.11
    networks:
      - backend
    volumes:
      - ./handlers:/opt
    entrypoint: /opt/start.sh
Create the handlers directory and an executable (chmod +x) handlers/start.sh script in it:
#!/bin/bash
# install required python modules
pip install pika
pip install rdflib
# give rabbitmq some time to start
sleep 10
# start our handlers service
python3 /opt/handlers.py
Our handler will check if a dc:license triple is defined in the resource metadata.
Create the handlers/handlers.py so it contains:
import pika
import json
from rdflib import Graph, URIRef

# this is our handler
def checkMeta(channel, deliver, msgProperties, body):
    message = json.loads(body.decode('utf-8'))
    g = Graph()
    # the metadata comes serialized as application/n-triples
    g.parse(data=message['metadata'], format='nt')
    # if http://purl.org/dc/terms/license is not provided,
    # report an error (HTTP 400 with a custom message)
    if (None, URIRef('http://purl.org/dc/terms/license'), None) not in g:
        retMsg = {
            'status': 400,
            'message': 'dc:license triple is missing'
        }
    else:
        retMsg = {
            'status': 0,
            'metadata': g.serialize(format='nt')
        }
    channel.basic_publish(
        exchange=deliver.exchange,
        routing_key=msgProperties.reply_to,
        body=json.dumps(retMsg),
        properties=msgProperties
    )

# boilerplate code coupling our handler with the AMQP broker
connCfg = pika.ConnectionParameters(host='rabbitmq')
connection = pika.BlockingConnection(connCfg)
channel = connection.channel()
channel.basic_qos(0, 1, False)
# create the 'onModify' queue
channel.queue_declare(queue='onModify')
# set up a handler function
channel.basic_consume(
    queue='onModify',
    on_message_callback=checkMeta,
    auto_ack=True
)
channel.start_consuming()
Finally, register the handlers in the rest.handlers section of the arche-docker-config/yaml/repo.yaml:
handlers:
  rabbitMq:
    host: rabbitmq
    port: 5672
    user: guest # default settings of the rabbitmq docker image
    password: guest # default settings of the rabbitmq docker image
    timeout: 5 # handler execution timeout in seconds
    exceptionOnTimeout: true
  methods:
    create:
      - type: rpc
        queue: onModify # must match the queue name in Python code
    updateBinary: []
    updateMetadata:
      - type: rpc
        queue: onModify # must match the queue name in Python code
    txCommit: []
Restart the repository Docker container by hitting CTRL+c on the console where you run docker compose up and running docker compose up again.
Then start the ingestion container:
docker run --rm -ti --network host -v ./sampleData:/data acdhch/arche-ingest
and try to ingest the metadata.ttl file we used in the previous chapter:
vendor/bin/arche-import-metadata /data/metadata.ttl http://my.domain/api admin ADMIN_PSWD_as_set_in_.env_file
You should get something like
(...)
Importing http://id.namespace/Baedeker-Mittelmeer_1909.xml (2/2)
updating http://my.domain/api/1 (1/2)
updating http://my.domain/api/2 (2/2)
ERROR while processing http://id.namespace/collection1: 400 dc:license triple is missing(GuzzleHttp\Exception\ClientException)
ERROR while processing http://id.namespace/Baedeker-Mittelmeer_1909.xml: 400 dc:license triple is missing(GuzzleHttp\Exception\ClientException)
(...)
which is exactly what we wanted - our metadata check procedure rejected the metadata update requests with the HTTP code 400 (Bad Request) because the dc:license RDF triple was missing.
You can also inspect the handler communication details in the rest.log in the log directory.
Now adjust the sampleData/metadata.ttl adding dc:license triples for both resources, e.g.:
@prefix dc: <http://purl.org/dc/terms/>.
<http://id.namespace/collection1> dc:title "Sample collection"@en ;
    dc:license "CC BY-NC" .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> dc:title "Sample TEI-XML"@en ;
    dc:isPartOf <http://id.namespace/collection1> ;
    dc:license "CC BY-NC" .
and re-run the import:
vendor/bin/arche-import-metadata /data/metadata.ttl http://my.domain/api admin ADMIN_PSWD_as_set_in_.env_file
This time the ingestion should succeed. Again, you can inspect the handler communication details in the rest.log.
Remarks about a production environment usage:
- If you want to handle multiple events (txBegin/txCommit/txRollback/get/getMetadata/create/updateBinary/updateMetadata/delete/deleteTombstone), you should assign each of them a unique queue name.
- The fixed sleep waiting for the AMQP broker is fragile; you may want a more robust startup procedure in the handlers/start.sh.
- Proper error handling and logging should be added, either in the handlers/start.sh or in the handlers/handlers.py.

Other remarks:
You probably want to assign PIDs to resources in your repository.
You can depend on an external PID service for that (e.g. https://www.handle.net/) but this requires constant maintenance (e.g. if you migrate your repository base URL, you need to update all the handles in the external service on your own).
Alternatively you can set up your own PID service based on the ARCHE Suite repository metadata.
For that you just need a dedicated (sub)domain and a deployment of the arche-resolver module.
The arche-resolver will also allow you to provide users with content type negotiation, e.g. redirect them to a service converting the resource as it is stored in the repository to another format.
Let's try (deploying the resolver module within the arche-core Docker container like we did for the OAI-PMH):
First, choose the PID domain, e.g. my.pid.
Map it to 127.0.0.1 in the /etc/hosts just like we did at the very beginning for the my.domain:
(...)
127.0.0.1 my.pid
Add acdh-oeaw/arche-resolver to the packages list in arche-docker-config/composer.json:
{
  "require": {
    "acdh-oeaw/arche-core": "^4",
    "zozlak/yaml-merge": "^1",
    "acdh-oeaw/arche-oaipmh": "^4.2",
    "acdh-oeaw/arche-resolver": "^3"
  }
}
Create the arche-docker-config/yaml/resolver.yaml file containing the resolver module config (consider reading the comments provided in the code below):
schema:
  # This section defines RDF properties used to describe dissemination services.
  # Here we just reuse the same settings we use at our OEAW deployment.
  # See https://acdh-oeaw.github.io/arche-docs/aux/dissemination_services.html
  # for details.
  dissService:
    class: https://vocabs.acdh.oeaw.ac.at/schema#DisseminationService
    location: https://vocabs.acdh.oeaw.ac.at/schema#serviceLocation
    returnFormat: https://vocabs.acdh.oeaw.ac.at/schema#hasReturnType
    matchProperty: https://vocabs.acdh.oeaw.ac.at/schema#matchesProp
    matchValue: https://vocabs.acdh.oeaw.ac.at/schema#matchesValue
    matchRequired: https://vocabs.acdh.oeaw.ac.at/schema#isRequired
    revProxy: https://vocabs.acdh.oeaw.ac.at/schema#serviceRevProxy
    parameterClass: https://vocabs.acdh.oeaw.ac.at/schema#DisseminationServiceParameter
    parameterDefaultValue: https://vocabs.acdh.oeaw.ac.at/schema#hasDefaultValue
    parameterRdfProperty: https://vocabs.acdh.oeaw.ac.at/schema#usesRdfProperty
    hasService: https://vocabs.acdh.oeaw.ac.at/schema#hasDissService
resolver:
  logging:
    file: /home/www-data/log/resolver.log
    level: warning
  idProtocol: http
  idHost: my.pid
  idPathBase: ''
  defaultDissService: raw
  # quick redirects for dissemination formats provided by the arche-core
  fastTrack:
    raw: ''
    application/octet-stream: ''
    rdf: /metadata
    text/turtle: /metadata
    application/n-triples: /metadata
    application/rdf+xml: /metadata
    application/ld+json: /metadata
  repositories:
    # the resolver is capable of searching against multiple arche-core
    # instances but we have only one so we set up only one
    # see https://github.com/acdh-oeaw/arche-docker-config/blob/arche/yaml/resolver.yaml
    # for a multi-repo setup
    main:
      baseUrl: http://my.domain/api
Create the arche-docker-config/run.d/resolver.sh file initializing the resolver (and remember to make it executable with chmod +x arche-docker-config/run.d/resolver.sh):
#!/bin/bash
if [ ! -d /home/www-data/docroot/resolver ]; then
    su -l www-data -c 'mkdir /home/www-data/docroot/resolver'
    su -l www-data -c 'ln -s /home/www-data/vendor /home/www-data/docroot/resolver/vendor'
fi
su -l www-data -c 'cp /home/www-data/vendor/acdh-oeaw/arche-resolver/.htaccess /home/www-data/docroot/resolver/.htaccess'
su -l www-data -c 'cp /home/www-data/vendor/acdh-oeaw/arche-resolver/index.php /home/www-data/docroot/resolver/index.php'
# compile the final resolver config file from arche-docker-config/yaml/schema.yaml,
# arche-docker-config/yaml/resolver.yaml and arche-docker-config/yaml/local.yaml
CMD=/home/www-data/vendor/zozlak/yaml-merge/bin/yaml-edit.php
CFGD=/home/www-data/config/yaml
rm -f /home/www-data/docroot/resolver/config.yaml $CFGD/config-resolver.yaml
su -l www-data -c "$CMD --src $CFGD/schema.yaml --src $CFGD/resolver.yaml --src $CFGD/local.yaml $CFGD/config-resolver.yaml"
su -l www-data -c "ln -s $CFGD/config-resolver.yaml /home/www-data/docroot/resolver/config.yaml"
Create the arche-docker-config/sites-enabled/my.pid.conf file providing the webserver configuration for the my.pid domain:
<VirtualHost *:80>
    DocumentRoot /home/www-data/docroot/resolver
    ServerName my.pid
    <Directory /home/www-data/docroot/resolver>
        Options All
        AllowOverride All
        Require all granted
    </Directory>
</VirtualHost>
Finally, restart the repository Docker container by hitting CTRL+c on the console where you run docker compose up and running docker compose up again.
At this point we have the resolver service ready. What we still need to do to make it useful is to assign our repository resources identifiers in the my.pid domain:
Adjust the sampleData/metadata.ttl adding dc:identifier triples in the my.pid domain and setting the read access rights to public (we restricted them in our ACL experiments a few chapters before), e.g.:
@prefix dc: <http://purl.org/dc/terms/>.
<http://id.namespace/collection1> dc:title "Sample collection"@en ;
    dc:license "CC BY-NC" ;
    dc:identifier <http://my.pid/collection1> ;
    <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "public" .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> dc:title "Sample TEI-XML"@en ;
    dc:isPartOf <http://id.namespace/collection1> ;
    dc:license "CC BY-NC" ;
    dc:identifier <http://my.pid/mittelmeer> ;
    <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "public" .
Then ingest it like before:
docker run --rm -ti --network host -v ./sampleData:/data acdhch/arche-ingest
vendor/bin/arche-import-metadata /data/metadata.ttl http://my.domain/api admin ADMIN_PSWD_as_set_in_.env_file
(in case of problems inspect the rest.log)
Now we should be able to play with the resolver (we will use curl so we can see what is going on in detail and are not affected by web browsers' content negotiation settings - browsers always request responses in text/html).
First, resolve a PID (with the -i curl option we ask for the HTTP response headers to be displayed):
curl -i 'http://my.pid/mittelmeer'
You should get something like:
HTTP/1.1 302 Found
(...)
Location: http://my.domain/api/2
(...)
which means a resource has been found and you are redirected to its actual location - http://my.domain/api/2.
To follow the redirect, add curl's -L option:
curl -L 'http://my.pid/mittelmeer'
Now you should get the TEI-XML repository resource content.
Next, let's request the resource in the application/n-triples format:
curl -i -H 'Accept: application/n-triples' 'http://my.pid/mittelmeer'
Now you should get something like:
HTTP/1.1 302 Found
(...)
Location: http://my.domain/api/2/metadata
As you can see the redirect location is different this time.
This is because we requested the repository resource to be
disseminated in a particular format (which is configured
in the resolver config - see the resolver.fastTrack
section
of the arche-docker-config/yaml/resolver.yaml).
curl -L -H 'Accept: application/n-triples' 'http://my.pid/mittelmeer'
and we should get something like:
<http://my.domain/api/2> <http://purl.org/dc/terms/format> "application/xml" .
<http://my.domain/api/2> <http://purl.org/dc/terms/created> "2023-07-06T11:56:30.757867"^^<http://www.w3.org/2001/XMLSchema#dateTime> .
(...)
<http://my.domain/api/1> <http://purl.org/dc/terms/title> "Sample collection"@en .
(...)
being the metadata of the requested resource (as well as of the resources its metadata directly points to, e.g. here the parent collection) in the application/n-triples format.
As we set resolver.defaultDissService to raw in the arche-docker-config/yaml/resolver.yaml, requesting a format which is not configured ends up in raw, which just redirects to the resource's repository URL:
curl -i -H 'Accept: application/foo' 'http://my.pid/mittelmeer'
should return:
HTTP/1.1 302 Found
(...)
Location: http://my.domain/api/2
(...)
And resolving a non-existing PID:
curl -i 'http://my.pid/bar'
should result in something like:
HTTP/1.1 404 Not Found
(...)
Summing up, the nice thing about having the PID service is that the PIDs stay valid even if your repository moves - e.g. after a base URL migration you only need to adjust the arche-docker-config/yaml/resolver.yaml accordingly.
Closing remarks:
From time to time you might want to perform a batch update of the metadata. The most common scenario is changes made to the metadata schema.
Performing such changes using the REST API is technically possible but very troublesome and time-consuming. Instead, you can directly access the metadata database and modify it using SQL queries.
Let's say we want to change the RDF predicates used to store the access control information:
- https://vocabs.acdh.oeaw.ac.at/schema#aclRead into acl://read
- https://vocabs.acdh.oeaw.ac.at/schema#aclWrite into acl://write
First, we modify the accessControl.schema settings in the arche-docker-config/yaml/repo.yaml, but this only affects future interactions through the REST API, so we also have to update all already existing triples not to mess everything up.
Fortunately it's pretty straightforward:
- Find the name of the Docker container running the postgresql with docker ps - it will be the one with postgresql in the name (e.g. tmp-postgresql-1 in my case).
- Run psql arche inside of it with:
docker exec -ti -u postgres tmp-postgresql-1 psql arche
- Execute:
BEGIN;
UPDATE metadata SET property = 'acl://read' WHERE property = 'https://vocabs.acdh.oeaw.ac.at/schema#aclRead';
UPDATE metadata SET property = 'acl://write' WHERE property = 'https://vocabs.acdh.oeaw.ac.at/schema#aclWrite';
COMMIT;
\q
The direct database access can also be used to analyze the metadata, e.g. to quickly compute the distribution of RDF predicates:
SELECT property, count(*) FROM metadata GROUP BY 1 ORDER BY 2 DESC;
A little more information on the database structure is provided here.
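The same can be done without entering the psql shell, e.g. to inspect the values of a chosen property (a sketch assuming the metadata table's property and value columns and the container name from the example above):
docker exec -ti -u postgres tmp-postgresql-1 psql arche \
    -c "SELECT value, count(*) FROM metadata WHERE property = 'http://purl.org/dc/terms/format' GROUP BY 1 ORDER BY 2 DESC;"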
Congratulations, you completed your first dive into the ARCHE Suite world!
You managed to set up quite a range of services: the core repository module, the OAI-PMH service, the PIDs resolver. You managed to ingest a little data into the repository. You played a little with the access control and metadata schema and you also set up your very own metadata consistency check handler.
That being said, in most cases we only touched each topic briefly and there is still a lot to discover (OAI-PMH templates, the resolver's dissemination services, implementing your complete checks logic using handlers, ARCHE Suite modules not mentioned in this guide, etc.).
Also, we have not touched the topic of your repository GUI at all (explanation here).
You would surely benefit from some help dealing with all that missing stuff, so please do not hesitate to contact us.