Documentation

Indexer
in package

Ingests files into the repository

Tags
author

zozlak

Table of Contents

Constants

ERRMODE_CONTINUE  = 'continue'
ERRMODE_FAIL  = 'fail'
ERRMODE_INCLUDE  = 'include'
ERRMODE_PASS  = 'pass'
FILTER_MATCH  = 1
FILTER_SKIP  = 2
NETWORKERROR_SLEEP  = 3
PID_KEEP  = 1
PID_PASS  = 2
SKIP_BINARY_EXIST  = 4
SKIP_EXIST  = 2
SKIP_NONE  = 0
SKIP_NOT_EXIST  = 1
VERSIONING_ALWAYS  = 2
VERSIONING_DATE  = 4
VERSIONING_DIGEST  = 3
VERSIONING_NONE  = 1

Properties

$debug  : bool
Turns debug messages on
$autoCommit  : int
Number of resources after which a commit is triggered automatically (0 - no auto commit)
$binaryClass  : NamedNodeInterface
URI of an RDF class assigned to indexed binary resources.
$collectionClass  : NamedNodeInterface
URI of an RDF class assigned to indexed collections.
$depth  : int
How many subsequent subdirectories should be indexed.
$directory  : string
Base ingestion path to be substituted with the $idPrefix to form a binary id.
$filter  : string
Regular expression for matching child resource file names.
$filterNot  : string
Regular expression for excluding child resource file names.
$flatStructure  : bool
Should children be attached directly to the RepoResource, or should each subdirectory result in a separate collection resource containing its children?
$idgen  : FileId
$idPrefix  : string|null
Namespace substituted for the $directory in the ingested binary path to form a binary id.
$includeEmpty  : bool
Should resources be created for empty directories.
$metaLookup  : MetaLookupInterface
An object providing metadata when given a resource file path
$metaLookupRequire  : bool
Should files without external metadata (provided by the `$metaLookup` object) be skipped.
$parent  : RepoResource
RepoResource whose children are created by the Indexer
$pidPass  : int
Should PIDs (epic handles) be migrated to the new version of a resource during versioning.
$repo  : Repo
Repository connection
$schema  : Schema
Repository schema
$skipMode  : int
Should files existing (or not existing) in the repository be skipped?
$uploadSizeLimit  : int
Maximum size of a child resource (in bytes) resulting in the creation of binary resources.
$uriNorm  : UriNormalizer
$versioningMode  : int
Should new versions of binary resources already existing in the repository be created (if not, an existing resource is simply overwritten).

Methods

__construct()  : mixed
Creates the Indexer object.
import()  : array<string|int, RepoResource|ClientException>
Performs the indexing.
setAutoCommit()  : Indexer
Controls the automatic commit behaviour.
setBinaryClass()  : Indexer
Sets default RDF class for imported binary resources.
setCollectionClass()  : Indexer
Sets default RDF class for imported collections.
setDepth()  : Indexer
Sets maximum indexing depth.
setFilter()  : Indexer
Sets file name filter for child resources.
setFlatStructure()  : Indexer
Sets whether child resources are attached directly to the indexed RepoResource (`$isFlat` equal to `true`) or a separate collection repository resource is created for each subdirectory (`$isFlat` equal to `false`).
setIncludeEmptyDirs()  : Indexer
Sets if repository resources should be created for empty directories.
setMetaLookup()  : Indexer
Sets a class providing metadata for indexed files.
setParent()  : Indexer
Sets the parent resource for the files in the `$directory` (constructor parameter) directory.
setSkip()  : Indexer
Defines if (and how) resources should be skipped from indexing based on their existence (or non-existence) in the repository.
setUploadSizeLimit()  : Indexer
Sets size threshold for uploading child resources as binary resources.
setVersioning()  : Indexer
Defines if new versions of binary resources should be created or if they should be simply overwritten with a new binary payload.
createFile()  : File
listFiles()  : array<string|int, File>
Gets the list of files/dirs matching the filename filters and the depth limit.

Constants

ERRMODE_CONTINUE

public mixed ERRMODE_CONTINUE = 'continue'

ERRMODE_FAIL

public mixed ERRMODE_FAIL = 'fail'

ERRMODE_INCLUDE

public mixed ERRMODE_INCLUDE = 'include'

ERRMODE_PASS

public mixed ERRMODE_PASS = 'pass'

FILTER_MATCH

public mixed FILTER_MATCH = 1

FILTER_SKIP

public mixed FILTER_SKIP = 2

NETWORKERROR_SLEEP

public mixed NETWORKERROR_SLEEP = 3

SKIP_BINARY_EXIST

public mixed SKIP_BINARY_EXIST = 4

SKIP_EXIST

public mixed SKIP_EXIST = 2

SKIP_NONE

public mixed SKIP_NONE = 0

SKIP_NOT_EXIST

public mixed SKIP_NOT_EXIST = 1

VERSIONING_ALWAYS

public mixed VERSIONING_ALWAYS = 2

VERSIONING_DATE

public mixed VERSIONING_DATE = 4

VERSIONING_DIGEST

public mixed VERSIONING_DIGEST = 3

VERSIONING_NONE

public mixed VERSIONING_NONE = 1

Properties

$debug

Turns debug messages on

public static bool $debug = false

$autoCommit

Number of resources after which a commit is triggered automatically (0 - no auto commit)

private int $autoCommit = 0

$binaryClass

URI of an RDF class assigned to indexed binary resources.

private NamedNodeInterface $binaryClass

$collectionClass

URI of an RDF class assigned to indexed collections.

private NamedNodeInterface $collectionClass

$depth

How many subsequent subdirectories should be indexed.

private int $depth = \PHP_INT_MAX

$directory

Base ingestion path to be substituted with the $idPrefix to form a binary id.

private string $directory

$filter

Regular expression for matching child resource file names.

private string $filter = ''

$filterNot

Regular expression for excluding child resource file names.

private string $filterNot = ''

$flatStructure

Should children be attached directly to the RepoResource, or should each subdirectory result in a separate collection resource containing its children?

private bool $flatStructure = false

$idPrefix

Namespace substituted for the $directory in the ingested binary path to form a binary id.

private string|null $idPrefix = null

$includeEmpty

Should resources be created for empty directories.

private bool $includeEmpty = false

Ignored if $flatStructure equals true.

$metaLookupRequire

Should files without external metadata (provided by the `$metaLookup` object) be skipped.

private bool $metaLookupRequire = false

$pidPass

Should PIDs (epic handles) be migrated to the new version of a resource during versioning.

private int $pidPass = self::PID_KEEP

$skipMode

Should files existing (or not existing) in the repository be skipped?

private int $skipMode = self::SKIP_NONE
Tags
see
setSkip()

$uploadSizeLimit

Maximum size of a child resource (in bytes) resulting in the creation of binary resources.

private int $uploadSizeLimit = -1

For child resources bigger than this limit, "RDF only" repository resources will be created.

Special value of -1 means "import all no matter their size"

$versioningMode

Should new versions of binary resources already existing in the repository be created (if not, an existing resource is simply overwritten).

private int $versioningMode = self::VERSIONING_NONE

Methods

__construct()

Creates the Indexer object.

public __construct(string $directory, string $idPrefix, Repo $repo) : mixed

It's important to understand how the file/directory paths are mapped to repository resource identifiers:

  • first the $directory is stripped from the path
  • the rest is URL-encoded but with '/' characters being preserved
  • at the end the $idPrefix is prepended

e.g. for a file /foo/foo bar/baz with $directory /foo/ and $idPrefix https://id.nmsp/foobar/, it will be done as follows:

  • /foo/foo bar/baz => foo bar/baz
  • foo bar/baz => foo%20bar/baz
  • foo%20bar/baz => https://id.nmsp/foobar/foo%20bar/baz

It's worth mentioning that the presence (or lack) of slashes at the end of $directory and $idPrefix doesn't matter (it's standardized internally).
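The three mapping steps can be sketched in a few lines of PHP. This is an illustrative sketch only: `pathToId()` is a hypothetical helper, not part of the Indexer API, and the use of `rawurlencode()` per path segment is an assumption about how the URL-encoding is done.

```php
<?php
// Hypothetical helper (not part of the Indexer class) mirroring the
// documented path-to-identifier mapping.
function pathToId(string $path, string $directory, string $idPrefix): string {
    // Trailing slashes are normalized, so their presence does not matter.
    $directory = rtrim($directory, '/') . '/';
    $idPrefix  = rtrim($idPrefix, '/') . '/';

    if (strncmp($path, $directory, strlen($directory)) !== 0) {
        throw new InvalidArgumentException("$path is not located in $directory");
    }

    // 1. strip the $directory from the path
    $relative = substr($path, strlen($directory));

    // 2. URL-encode each path segment while preserving the '/' separators
    //    (assumption: rawurlencode() matches the encoding used internally)
    $encoded = implode('/', array_map('rawurlencode', explode('/', $relative)));

    // 3. prepend the $idPrefix
    return $idPrefix . $encoded;
}

// prints https://id.nmsp/foobar/foo%20bar/baz
echo pathToId('/foo/foo bar/baz', '/foo', 'https://id.nmsp/foobar'), "\n";
```

Passing `/foo/` and `https://id.nmsp/foobar/` (with trailing slashes) gives the same result in this sketch.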

Parameters
$directory : string

path to be indexed.

$idPrefix : string

prefix used to create repository resource identifiers from file/directory paths (see above).

$repo : Repo

repository connection object.

import()

Performs the indexing.

public import([string $errorMode = self::ERRMODE_FAIL ][, int $concurrency = 3 ][, int $retries = 6 ]) : array<string|int, RepoResource|ClientException>
Parameters
$errorMode : string = self::ERRMODE_FAIL

what should happen if an error is encountered? One of:

  • Indexer::ERRMODE_FAIL - the first encountered error throws an exception.
  • Indexer::ERRMODE_PASS - the first encountered error turns off the autocommit but ingestion is continued. When all resources are processed and if there were no errors, an array of RepoResource objects is returned. If there was an error, an exception is thrown.
  • Indexer::ERRMODE_INCLUDE - the first encountered error turns off the autocommit but ingestion is continued. The returned array contains RepoResource objects for successful ingestions and Exception objects for failed ones.
  • Indexer::ERRMODE_CONTINUE - just continue the import no matter if errors were encountered. The returned array contains RepoResource objects for successful ingestions and Exception objects for failed ones.
$concurrency : int = 3

number of parallel requests to the repository allowed during the import

$retries : int = 6

how many ingestion attempts should be made if the repository resource is locked by another request or a network connection error occurs

Tags
throws
IndexerException
Return values
array<string|int, RepoResource|ClientException>

a list of RepoResource objects representing indexed resources
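With Indexer::ERRMODE_INCLUDE or Indexer::ERRMODE_CONTINUE the returned array mixes resources and exception objects, so callers usually need to separate them. A minimal sketch, assuming failures are represented by objects implementing PHP's `Throwable`; `splitImportResults()` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper: separates successful ingestions from failed ones
// in an array shaped like the one returned by import().
function splitImportResults(array $results): array {
    $ok     = [];
    $failed = [];
    foreach ($results as $key => $item) {
        if ($item instanceof Throwable) {
            $failed[$key] = $item; // an exception object for a failed ingestion
        } else {
            $ok[$key] = $item;     // anything else is treated as a resource
        }
    }
    return ['ok' => $ok, 'failed' => $failed];
}

// Stand-in data: stdClass plays the role of a RepoResource here.
$demo = splitImportResults([
    'a' => new stdClass(),
    'b' => new RuntimeException('resource locked'),
]);
echo count($demo['ok']), ' ok, ', count($demo['failed']), " failed\n"; // prints "1 ok, 1 failed"
```

In real code the argument would be the value returned by `import(Indexer::ERRMODE_INCLUDE)`.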

setAutoCommit()

Controls the automatic commit behaviour.

public setAutoCommit(int $count) : Indexer

Even when you use autocommit, you should commit your transaction after Indexer::import() (the only exception is when you set auto commit to 1, forcing each and every resource to be committed separately, but you probably don't want to do that for performance reasons).
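The pattern described above could look roughly like the sketch below. It is written against loosely typed objects so it stays self-contained: `ingestInTransaction()` is a hypothetical helper, and the `begin()`, `commit()` and `rollback()` methods on the repository connection are assumptions not documented on this page.

```php
<?php
// Hypothetical helper: wraps an import in a transaction and issues the
// final commit that is still needed when auto commit is enabled.
// Assumption: $repo exposes begin(), commit() and rollback().
function ingestInTransaction(object $repo, object $indexer, int $autoCommit = 0): array {
    $indexer->setAutoCommit($autoCommit);
    $repo->begin();
    try {
        $resources = $indexer->import();
        // Resources ingested since the last automatic commit are still
        // uncommitted at this point, hence the explicit commit.
        $repo->commit();
        return $resources;
    } catch (Throwable $e) {
        // Only the not yet committed part can be rolled back.
        $repo->rollback();
        throw $e;
    }
}
```

Note that with auto commit enabled a rollback cannot undo batches that were already committed automatically.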

Parameters
$count : int

number of resources after which a commit is triggered automatically (0 - no auto commit)

Return values
Indexer

setBinaryClass()

Sets default RDF class for imported binary resources.

public setBinaryClass(string|NamedNodeInterface $class) : Indexer

Overrides the setting read from the cfg::defaultBinaryClass configuration property.

Parameters
$class : string|NamedNodeInterface
Return values
Indexer

setCollectionClass()

Sets default RDF class for imported collections.

public setCollectionClass(string|NamedNodeInterface $class) : Indexer

Overrides the setting read from the cfg::defaultCollectionClass configuration property.

Parameters
$class : string|NamedNodeInterface
Return values
Indexer

setDepth()

Sets maximum indexing depth.

public setDepth(int $depth) : Indexer
Parameters
$depth : int

maximum indexing depth (0 - only initial resource dir, 1 - also its direct subdirectories, etc.)

Return values
Indexer
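To make the depth numbering concrete, here is a minimal sketch of a depth-limited walk over an in-memory directory tree. `listWithinDepth()` and the nested-array representation (arrays are directories, `null` values are files) are purely illustrative assumptions; the sketch lists files only and says nothing about how the Indexer treats the directories themselves.

```php
<?php
// Hypothetical helper: lists files reachable within $maxDepth levels of
// subdirectories (0 - only the initial directory, 1 - also its direct
// subdirectories, etc.).
function listWithinDepth(array $tree, int $maxDepth, int $level = 0, string $prefix = ''): array {
    $paths = [];
    foreach ($tree as $name => $node) {
        $path = $prefix . $name;
        if (is_array($node)) {
            // a subdirectory: descend only while the depth limit allows it
            if ($level < $maxDepth) {
                $paths = array_merge($paths, listWithinDepth($node, $maxDepth, $level + 1, $path . '/'));
            }
        } else {
            $paths[] = $path; // a file
        }
    }
    return $paths;
}

$tree = [
    'a.txt' => null,
    'sub'   => ['b.txt' => null, 'deep' => ['c.txt' => null]],
];
echo implode(', ', listWithinDepth($tree, 0)), "\n"; // prints "a.txt"
echo implode(', ', listWithinDepth($tree, 1)), "\n"; // prints "a.txt, sub/b.txt"
```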

setFilter()

Sets file name filter for child resources.

public setFilter(string $filter[, int $type = self::FILTER_MATCH ]) : Indexer

You can choose if file names must match or must not match (skip) the filter using the $type parameter. You can set both match and skip filters by calling setFilter() two times (once with $type = Indexer::FILTER_MATCH and a second time with $type = Indexer::FILTER_SKIP).

Filter is applied only to file names but NOT to directory names.

Parameters
$filter : string

regular expression conformant with preg_replace()

$type : int = self::FILTER_MATCH

decides if $filter is a match or skip filter (can be one of Indexer::FILTER_MATCH and Indexer::FILTER_SKIP)

Return values
Indexer
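The combined effect of a match filter and a skip filter on file names can be sketched as follows. `passesFilters()` is a hypothetical helper, and two details are assumptions rather than documented behaviour: that an empty filter means "no filtering" and that the patterns are evaluated with `preg_match()`.

```php
<?php
// Hypothetical helper: a file name is kept when it matches the match
// filter (if any) and does not match the skip filter (if any).
function passesFilters(string $fileName, string $match = '', string $skip = ''): bool {
    $matches = $match === '' || preg_match($match, $fileName) === 1;
    $skipped = $skip !== '' && preg_match($skip, $fileName) === 1;
    return $matches && !$skipped;
}

$names = ['a.xml', 'b.xml.bak', 'c.txt'];
$kept  = array_values(array_filter(
    $names,
    fn(string $n): bool => passesFilters($n, '/\.xml/', '/\.bak$/')
));
echo implode(', ', $kept), "\n"; // prints "a.xml"
```

With the Indexer itself the equivalent configuration would be two setFilter() calls, one per filter type.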

setFlatStructure()

Sets whether child resources are attached directly to the indexed RepoResource (`$isFlat` equal to `true`) or a separate collection repository resource is created for each subdirectory (`$isFlat` equal to `false`).

public setFlatStructure(bool $isFlat) : Indexer
Parameters
$isFlat : bool
Return values
Indexer

setIncludeEmptyDirs()

Sets if repository resources should be created for empty directories.

public setIncludeEmptyDirs(bool $include) : Indexer
Parameters
$include : bool

should resources be created for empty directories

Tags
see
setFlatStructure()
Return values
Indexer

setMetaLookup()

Sets a class providing metadata for indexed files.

public setMetaLookup(MetaLookupInterface $metaLookup[, bool $require = false ]) : Indexer
Parameters
$metaLookup : MetaLookupInterface
$require : bool = false

should files lacking external metadata be skipped

Return values
Indexer

setSkip()

Defines if (and how) resources should be skipped from indexing based on their existence (or non-existence) in the repository.

public setSkip(int $skipMode) : Indexer
Parameters
$skipMode : int

mode. Any combination of the following (combined with a bitwise OR or a sum):

  • Indexer::SKIP_NONE (default) - import all resources,
  • Indexer::SKIP_NOT_EXIST - skip ingestion of all resources which don't exist at the beginning of the ingestion.
  • Indexer::SKIP_EXIST - skip ingestion of all resources which exist at the beginning of the ingestion.
  • Indexer::SKIP_BINARY_EXIST - skip ingestion of all resources which already exist and either already have binary payload or are metadata only resources.
Return values
Indexer
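Because the SKIP_* values are distinct powers of two, they behave as bit flags. The sketch below redefines them locally (with the values from the constants table above) only to stay self-contained; `hasSkipFlag()` is a hypothetical helper, not part of the Indexer API.

```php
<?php
// Local copies of the Indexer::SKIP_* values listed on this page.
const SKIP_NONE         = 0;
const SKIP_NOT_EXIST    = 1;
const SKIP_EXIST        = 2;
const SKIP_BINARY_EXIST = 4;

// Hypothetical helper: checks whether a given flag is set in a skip mode.
function hasSkipFlag(int $mode, int $flag): bool {
    return $flag !== 0 && ($mode & $flag) === $flag;
}

// A bitwise OR and a sum give the same result for distinct flags.
$mode = SKIP_NOT_EXIST | SKIP_BINARY_EXIST;
var_dump($mode === SKIP_NOT_EXIST + SKIP_BINARY_EXIST); // bool(true)
var_dump(hasSkipFlag($mode, SKIP_BINARY_EXIST));        // bool(true)
var_dump(hasSkipFlag($mode, SKIP_EXIST));               // bool(false)
```

With the real class the same mode would be passed as `Indexer::SKIP_NOT_EXIST | Indexer::SKIP_BINARY_EXIST`.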

setUploadSizeLimit()

Sets size threshold for uploading child resources as binary resources.

public setUploadSizeLimit(int $limit) : Indexer

For files bigger than this threshold, "pure RDF" repository resources will be created containing full metadata but no binary content.

Parameters
$limit : int

maximum size in bytes; 0 will cause no files to be uploaded, the special value of -1 (default) will cause all files to be uploaded no matter their size

Return values
Indexer
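The decision described by the limit can be sketched as a small predicate. `shouldUploadBinary()` is a hypothetical helper; treating a file whose size exactly equals the limit as uploadable is an assumption based on the wording "maximum size", not something this page states.

```php
<?php
// Hypothetical helper: decides whether a file of a given size would be
// uploaded as a binary or ingested as a metadata-only resource.
function shouldUploadBinary(int $fileSize, int $limit = -1): bool {
    if ($limit === -1) {
        return true;  // special value: upload everything regardless of size
    }
    if ($limit === 0) {
        return false; // documented: 0 causes no files to be uploaded
    }
    // Assumption: a file exactly at the limit still counts as within it.
    return $fileSize <= $limit;
}

var_dump(shouldUploadBinary(5000000));       // bool(true)  - default limit of -1
var_dump(shouldUploadBinary(5000000, 1024)); // bool(false) - bigger than the limit
var_dump(shouldUploadBinary(512, 1024));     // bool(true)  - within the limit
```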

setVersioning()

Defines if new versions of binary resources should be created or if they should be simply overwritten with a new binary payload.

public setVersioning(int $versioningMode[, int $migratePid = self::PID_KEEP ]) : Indexer
Parameters
$versioningMode : int

mode - either Indexer::VERSIONING_NONE, Indexer::VERSIONING_ALWAYS, Indexer::VERSIONING_DIGEST or Indexer::VERSIONING_DATE

$migratePid : int = self::PID_KEEP

should PIDs (epic handles) be migrated to the new version - either Indexer::PID_KEEP or Indexer::PID_PASS

Tags
throws
BadMethodCallException
Return values
Indexer

createFile()

private createFile(SplFileInfo $file) : File
Parameters
$file : SplFileInfo
Return values
File

listFiles()

Gets the list of files/dirs matching the filename filters and the depth limit.

private listFiles(SplFileInfo $dir, int $level) : array<string|int, File>
Parameters
$dir : SplFileInfo
$level : int
Return values
array<string|int, File>

        