ARCHE Suite documentation

Documentation for the ARCHE repository software stack


Pretty extensive tour through the ARCHE Suite deployment

Table of contents

Prerequisites

You need:

Remarks:

Really quick and dirty setup

In this step we will just set up an ARCHE Suite instance as it is used at the ACDH-CH.

It creates a fully-fledged setup with some optional services and also ingests some data into the repository.

  1. Start it up with:
    docker run --name arche-suite -p 80:80 -e CFG_BRANCH=arche -e ADMIN_PSWD='myAdminPassword' -d acdhch/arche
    
  2. Run:
    docker logs -f arche-suite
    

    and wait until you see something like:

    ##########
    # Starting supervisord
    ##########
       
    2023-06-22 08:44:35,458 INFO Included extra file "/home/www-data/config/supervisord.conf.d/postgresql.conf" during parsing
    2023-06-22 08:44:35,458 INFO Included extra file "/home/www-data/config/supervisord.conf.d/tika.conf" during parsing
    2023-06-22 08:44:35,459 INFO RPC interface 'supervisor' initialized
    2023-06-22 08:44:35,459 CRIT Server 'unix_http_server' running without any HTTP authentication checking
    2023-06-22 08:44:35,459 INFO supervisord started with pid 1253
    2023-06-22 08:44:36,462 INFO spawned: 'initScripts' with pid 1255
    2023-06-22 08:44:36,463 INFO spawned: 'apache2' with pid 1256
    2023-06-22 08:44:36,464 INFO spawned: 'postgresql' with pid 1257
    2023-06-22 08:44:36,464 INFO spawned: 'tika' with pid 1258
    2023-06-22 08:44:36,465 INFO spawned: 'txDaemon' with pid 1259
    2023-06-22 08:44:37,496 INFO success: initScripts entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2023-06-22 08:44:37,496 INFO success: apache2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2023-06-22 08:44:37,496 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2023-06-22 08:44:37,496 INFO success: tika entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    2023-06-22 08:44:37,496 INFO success: txDaemon entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
    

    Now hit CTRL+C.
    At this point the repository is up and running.

  3. Wait until initial data is imported.
    Run:
    docker exec -ti -u www-data arche-suite tail -f log/initScripts.log
    

    and wait until you see (this may take a few minutes as two big controlled vocabularies are being imported):

    ##########
    # INIT SCRIPTS ENDED
    ##########
    
  4. Import a sample collection with a resource:
    • Log into the docker container with:
      docker exec -ti -u www-data arche-suite bash
      
    • Create a collection.ttl describing a top collection resource and a TEI-XML resource according to the ACDH-CH metadata schema:
      echo '
      @prefix n1: <https://vocabs.acdh.oeaw.ac.at/schema#>.
      
      <https://id.acdh.oeaw.ac.at/traveldigital> a n1:TopCollection;
          n1:hasIdentifier <https://hdl.handle.net/21.11115/0000-000C-29F3-4>;
          n1:hasPid "https://hdl.handle.net/21.11115/0000-000C-29F3-4"^^<http://www.w3.org/2001/XMLSchema#anyURI>;
          n1:hasTitle "travel!digital Collection"@en;
          n1:hasDescription "A digital collection of early German travel guides on non-European countries which were released by the Baedeker publishing house between 1875 and 1914. The collection consists of the travel!digital Corpus (XML/TEI transcriptions of first editions (5 volumes) including structural, semantic and linguistic annotations), the travel!digital Facsimiles (scans and photographs of the historical prints), the travel!digital Auxiliary Files (a TEI schema of the annotations applied in the corpus, and a list of term labels for indexing names of persons annotated in the corpus), and the travel!digital Thesaurus (a SKOS vocabulary covering designations of groups and selected sights annotated in the corpus).\n The collection was created within the GO!DIGITAL 1.0 project \"travel!digital. Exploring People and Monuments in Baedeker Guidebooks (1875-1914)\", Project-Nr.: ÖAW0204.\n Image creation was done in 2004 at the Austrian Academy of Sciences (AAC-Austrian Academy Corpus)."@en;
          n1:hasSubject "Karl Baedeker"@en,
              "historical travel guides"@en;
          n1:hasHosting <https://id.acdh.oeaw.ac.at/arche>;
          n1:hasRightsHolder <https://d-nb.info/gnd/1001454-8>;
          n1:hasLicensor <https://d-nb.info/gnd/1123037736>;
          n1:hasMetadataCreator <https://id.acdh.oeaw.ac.at/uczeitschner>;
          n1:hasCurator <https://id.acdh.oeaw.ac.at/uczeitschner>;
          n1:hasCreator <https://id.acdh.oeaw.ac.at/uczeitschner>; 
          n1:hasDepositor <https://id.acdh.oeaw.ac.at/uczeitschner>;
          n1:hasContact <https://id.acdh.oeaw.ac.at/uczeitschner>;
          n1:hasOwner <https://d-nb.info/gnd/1123037736>.
      
      <https://id.acdh.oeaw.ac.at/traveldigital/Corpus/Baedeker-Mittelmeer_1909.xml> a n1:Resource;
          n1:hasTitle "Karl Baedeker: Das Mittelmeer. Handbuch für Reisende: Digital Edition"@en;
          n1:hasCategory <http://purl.org/dc/dcmitype/Dataset>;
          n1:hasDepositor <https://id.acdh.oeaw.ac.at/uczeitschner>;
          n1:hasMetadataCreator <https://id.acdh.oeaw.ac.at/uczeitschner>;
          n1:hasOwner <https://d-nb.info/gnd/1123037736>;
          n1:hasRightsHolder <https://d-nb.info/gnd/1001454-8>;
          n1:hasLicensor <https://d-nb.info/gnd/1123037736>;
          n1:hasLicense <https://creativecommons.org/licenses/by/4.0/>;
          n1:hasHosting <https://id.acdh.oeaw.ac.at/arche>;
          n1:isPartOf <https://id.acdh.oeaw.ac.at/traveldigital>.
      ' > collection.ttl
      
    • Ingest the metadata into the repository with:
      composer require acdh-oeaw/arche-ingest
      ~/vendor/bin/arche-import-metadata collection.ttl http://127.0.0.1/api admin myAdminPassword
      
      • note down the URLs of created resources reported in the log, e.g.:
            created http://127.0.0.1/api/11305 (1/2)
            created http://127.0.0.1/api/11306 (2/2)
        
    • Download and ingest the TEI-XML resource binary:
      mkdir sampledata
      curl https://arche.acdh.oeaw.ac.at/api/29688 > sampledata/Baedeker-Mittelmeer_1909.xml
      ~/vendor/bin/arche-import-binary \
          sampledata \
          https://id.acdh.oeaw.ac.at/traveldigital/Corpus \
          http://127.0.0.1/api admin myAdminPassword
      
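The `created http://…` lines in the import log are the only record of the new resources' URLs, so it can help to save them for later steps. A small sketch of extracting them from a saved log (the file names are arbitrary; the sample lines are the ones shown above):

```shell
# Save the sample log lines from the import run above into a file
cat > ingest.log <<'EOF'
created http://127.0.0.1/api/11305 (1/2)
created http://127.0.0.1/api/11306 (2/2)
EOF
# Keep only the "created ..." lines and print the second field (the URL)
grep '^created ' ingest.log | awk '{print $2}' > created-urls.txt
cat created-urls.txt
```

This produces one resource URL per line, ready to be pasted into later API calls.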

At this point we have a repository with some data in it.
We can check it out in a few ways:

Step-by-step installation

Now let’s take a step back and do a step-by-step installation starting from a minimal setup.

Installing arche-core

Let’s say we want to set up a repository:

Let’s go:

Congratulations, at this point we have the repository backbone up and running.

Troubleshooting

If something did not work, you can inspect:

Deciding on metadata schema

ARCHE Suite itself does not enforce any metadata schema. You can use whichever you want.

Still, some ARCHE components define concepts which have to be mapped to RDF properties to make everything work, e.g.

Here and now let’s create a mapping for the arche-core only, assuming we want to use Dublin Core wherever suitable and artificial predicates for everything else (especially for the technical predicates used by the API).

To do that please modify the top part of the schema section of the arche-docker-config/yaml/schema.yaml so it looks as follows:

schema:
    id: http://purl.org/dc/terms/identifier 
    parent: http://purl.org/dc/terms/isPartOf
    label: http://purl.org/dc/terms/title
    delete: delete://delete
    searchMatch: search://match
    searchOrder: search://order
    searchOrderValue: search://orderValue
    searchFts: search://fts
    searchCount: search://count
    binarySize: http://purl.org/dc/terms/extent
    fileName: file://name
    mime: http://purl.org/dc/terms/format
    hash: file://hash
    modificationDate: http://purl.org/dc/terms/modified
    modificationUser: http://purl.org/dc/terms/contributor
    binaryModificationDate: file://modified
    binaryModificationUser: file://modifiedUser
    creationDate: http://purl.org/dc/terms/created
    creationUser: http://purl.org/dc/terms/creator

and then restart the arche-core by hitting CTRL+C in the console where docker compose up is running and then running docker compose up again.

Ingesting some data

Let’s ingest one metadata-only resource and a TEI-XML file as its child.

Now we have some rudimentary data and we can check if our metadata schema has been picked up.

Fetch the metadata of the TEI-XML binary.
It is http://my.domain/api/2 in my case but check your ingestion logs for yours.

curl -u 'admin:ADMIN_PSWD_as_set_in_.env_file' 'http://my.domain/api/2/metadata?readMode=resource'

which in my case resulted in:

@prefix n0: <http://my.domain/api/>.
@prefix n1: <file://>.
@prefix n2: <http://purl.org/dc/terms/>.
@prefix n3: <http://id.namespace/>.
@prefix n4: <https://vocabs.acdh.oeaw.ac.at/schema#>.

<http://my.domain/api/2> n1:modified "2023-07-06T11:56:53.559750"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
    n2:title "Sample TEI-XML"@en;
    n2:creator "admin";
    n2:isPartOf <http://my.domain/api/1>;
    n2:identifier <http://id.namespace/Baedeker-Mittelmeer_1909.xml>;
    n2:extent "32380001"^^<http://www.w3.org/2001/XMLSchema#integer>;
    n2:identifier <http://my.domain/api/2>;
    n2:modified "2023-07-06T11:56:53.659461"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
    n4:aclRead "admin";
    n1:modifiedUser "admin";
    n1:hash "sha1:ad8a457099d70990f6d936182f0e3b2c35a19ad6";
    n2:contributor "admin";
    n4:aclWrite "admin";
    n1:name "Baedeker-Mittelmeer_1909.xml";
    n2:format "application/xml";
    n2:created "2023-07-06T11:56:30.757867"^^<http://www.w3.org/2001/XMLSchema#dateTime>.

We can see that:

You can also try:

The arche-ingest repository provides scripts automating the upload of metadata and binary data into the repository.

Access control

Access control is based on roles, which are a generalization of the user and group concepts.

You can create and modify users using the {repo base URL}/user REST API endpoint (see swagger API documentation for details), e.g.
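As a purely hypothetical sketch of what such a call can look like (the endpoint path, payload shape and credentials below are assumptions for illustration only - the swagger documentation is the authoritative source), we can prepare a JSON body for creating a user `bob`:

```shell
# HYPOTHETICAL sketch - payload shape and endpoint path are assumptions;
# check the swagger API documentation for the real ones.
PAYLOAD='{"password": "bobPassword"}'
echo "$PAYLOAD" > user-bob.json
cat user-bob.json
# The request itself would then go to the /user endpoint, e.g.:
# curl -u admin:ADMIN_PSWD -X PUT -H 'Content-Type: application/json' \
#      -d @user-bob.json 'http://my.domain/api/user/bob'
```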

The access control settings are stored in the accessControl section of the arche-docker-config/yaml/repo.yaml file.

In our case it should look more or less as follows:

accessControl:
    publicRole: public
    adminRoles:
        - admin
    create:
        # who can create new resources
        allowedRoles:
            - creators
        # rights assigned to the creator upon resource creation
        creatorRights:
            - read
            - write
        # rights assigned to other roles upon resource creation
        assignRoles:
            read: []
    defaultAction:
        read: deny
        write: deny
    enforceOnMetadata: true
    schema:
        read: https://vocabs.acdh.oeaw.ac.at/schema#aclRead
        write: https://vocabs.acdh.oeaw.ac.at/schema#aclWrite
    db:
        connStr: 'pgsql: user={PG_USER_PREFIX}repo dbname={PG_DBNAME} host={PG_HOST} port={PG_PORT}'
        table: users
        userCol: user_id
        dataCol: data
    authMethods:
        - class: \zozlak\auth\authMethod\TrustedHeader
          parameters:
            - HTTP_EPPN
        - class: \zozlak\auth\authMethod\HttpBasic
          parameters:
             - repo
        - class: \zozlak\auth\authMethod\Guest
          parameters:
             - public

Let’s analyze it step-by-step:

Now let’s try to use the bob role we created in the examples at the beginning of the chapter to allow public read rights on the TEI-XML resource (http://my.domain/api/2 in my case).

First, we need to allow bob to modify the resource, which is currently possible only for the admin role (and all roles with admin privileges). This can be done by granting bob the aclWrite right on the resource.

For that let’s create a sampleData/acl1.ttl file:

<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclWrite> "bob", "admin" .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "bob", "admin" .

After ingesting it, bob should be able to grant public read rights by importing a sampleData/acl2.ttl:

<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "public" .

Let’s ingest both metadata files the same way we did before:
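Following the same pattern as the `echo '…' > collection.ttl` step earlier, the two files can be created and ingested like this (a sketch; the triples are the ones shown above, while the credentials in the commented-out commands are the example ones used throughout this guide):

```shell
# Create the two ACL metadata files with the triples shown above
mkdir -p sampleData
cat > sampleData/acl1.ttl <<'EOF'
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclWrite> "bob", "admin" .
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "bob", "admin" .
EOF
cat > sampleData/acl2.ttl <<'EOF'
<http://id.namespace/Baedeker-Mittelmeer_1909.xml> <https://vocabs.acdh.oeaw.ac.at/schema#aclRead> "public" .
EOF
# Then ingest them with arche-import-metadata as before - acl1.ttl as admin,
# acl2.ttl as bob (needs a running repository, hence commented out):
# ~/vendor/bin/arche-import-metadata sampleData/acl1.ttl http://my.domain/api admin ADMIN_PSWD
# ~/vendor/bin/arche-import-metadata sampleData/acl2.ttl http://my.domain/api bob bobPassword
```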

Setting up a basic OAI-PMH

Now let’s try to add an OAI-PMH service to our repository to make it harvestable by external aggregators.

We can deploy it either in the same docker container as the arche-core or in a separate one. In this example we will use the first approach.

We should have the OAI-PMH service with a very basic configuration running now. You can try:

As we internally store metadata in the Dublin Core schema, it was possible to use the very simple metadata format class \acdhOeaw\arche\oaipmh\metadata\DcMetadata, which does not require any additional config. In real-world scenarios you will almost certainly need to prepare templates using the \acdhOeaw\arche\oaipmh\metadata\LiveCmdiMetadata class, which will map your internal metadata schema to the schema you want to provide to an external aggregator. Please read this documentation and the template examples used at the ACDH-CH.
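For reference, OAI-PMH requests are plain HTTP GETs distinguished by a verb parameter. Assuming the service was deployed under an /oaipmh/ path on our domain (an assumption of this example setup), the first requests to try look like this dry run:

```shell
# Compose a few standard OAI-PMH requests (the base URL is an assumed
# deployment path; the verbs are standard OAI-PMH 2.0)
BASE='http://my.domain/oaipmh/'
for QUERY in 'verb=Identify' 'verb=ListMetadataFormats' 'verb=ListRecords&metadataPrefix=oai_dc'; do
    echo "${BASE}?${QUERY}"
done > oai-requests.txt
cat oai-requests.txt
```

Each of these URLs can then be fetched with curl or opened in a browser once the service is up.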

Plugging in checks on the data ingestion

Now an advanced topic - plugging your own logic into the arche-core.

This is possible in two ways:

By the way, it is possible to mix both methods.

As the first approach (PHP handlers with no AMQP) is well illustrated in our own ARCHE Suite deployment code (you may look here and here), in this tutorial we will take the second approach and implement a simple metadata consistency check handler in Python over AMQP.

Remarks about production environment usage:

Other remarks:

Adding a PIDs resolver and content format negotiation

You probably want to assign PIDs to resources in your repository.

You can depend on an external PIDs service for that (e.g. https://www.handle.net/) but this requires constant maintenance (e.g. if you migrate your repository base URL, you need to update all the handles in the external service on your own).

Alternatively you can set up your own PIDs service based on the ARCHE Suite repository metadata. For that you just need a dedicated (sub)domain and a deployment of the arche-resolver module. The arche-resolver will also allow you to provide users with content type negotiation, e.g. redirect them to a service converting the resource as it is stored in the repository into another format.

Let’s try (deploying the resolver module within the arche-core Docker container like we did for the OAI-PMH):

At this point we have the resolver service ready. What we still need to do to make it useful is to assign our repository resources identifiers in the my.pid domain:

Now we should be able to play with the resolver (we will use curl so we can see in detail what is going on and are not affected by web browser content negotiation settings - browsers always request responses as text/html):
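A sketch of what such curl calls can look like (my.pid is the placeholder domain from this guide and the resource name comes from the earlier ingestion example; your identifiers will differ). The calls need the deployed resolver, so they are shown here as a dry run:

```shell
# Compose the resolver requests to try (my.pid is the placeholder PID domain)
PID='https://my.pid/Baedeker-Mittelmeer_1909.xml'
{
    # default resolution - follow redirects to the repository resource
    echo "curl -i -L '$PID'"
    # explicit content negotiation - ask for a specific format instead
    echo "curl -i -L -H 'Accept: application/xml' '$PID'"
} > resolver-requests.txt
cat resolver-requests.txt
```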

Summing up, the nice things about having our own PIDs service are:

Closing remarks:

Batch-updating metadata

From time to time you might want to perform a batch-update of the metadata. The most common scenario is changes made to the metadata schema.

Performing such changes using the REST API is technically possible but very troublesome and time-consuming. Instead, you can directly access the metadata database and modify it using SQL queries.

Let’s say we want to change the RDF predicates used to store access control information:

First, we modify the accessControl.schema settings in the arche-docker-config/yaml/repo.yaml, but this only affects future interactions through the REST API, so we also have to update all already existing triples to keep everything consistent.

Fortunately it’s pretty straightforward:
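As a hedged sketch: a single SQL UPDATE on the metadata table (the same table queried in the analysis example below, which has a property column) swaps the predicate in every existing triple. The new predicate URI here is a made-up placeholder, and the psql invocation in the comment is just one possible way to run it:

```shell
# HYPOTHETICAL sketch: rewrite every triple using the old aclRead predicate;
# the NEW URI is a made-up placeholder - adjust it to your actual schema.
OLD='https://vocabs.acdh.oeaw.ac.at/schema#aclRead'
NEW='https://my.domain/schema#aclRead'
SQL="UPDATE metadata SET property = '${NEW}' WHERE property = '${OLD}';"
echo "$SQL" > batch-update.sql
cat batch-update.sql
# Run it against the repository database, e.g. from inside the container:
# docker exec -ti -u www-data arche-suite psql -f batch-update.sql
```

Repeat the same pattern for the aclWrite predicate.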

Direct database access can also be used to analyze the metadata, e.g. to quickly compute the distribution of RDF predicate values:

SELECT property, count(*) FROM metadata GROUP BY 1 ORDER BY 2 DESC;

A little more information on the database structure is provided here.

Further considerations

Congratulations, you completed your first dive into the ARCHE Suite world!

You managed to set up quite a range of services: the core repository module, the OAI-PMH service, the PIDs resolver. You managed to ingest a little data into the repository. You played a little with the access control and metadata schema and you also set up your very own metadata consistency check handler.

That being said, in most cases we only touched on each topic and there is still a lot to discover (OAI-PMH templates, the resolver’s dissemination services, implementing your complete checks logic using handlers, dealing with the ARCHE Suite modules not mentioned in this guide, etc.).

Also, we have not touched the topic of your repository GUI at all (explanation here).

You would surely benefit from some help dealing with all that missing stuff so please do not hesitate to contact us.