Indexer
Ingests files into the repository
Table of Contents
Constants
- ERRMODE_CONTINUE = 'continue'
- ERRMODE_FAIL = 'fail'
- ERRMODE_INCLUDE = 'include'
- ERRMODE_PASS = 'pass'
- FILTER_MATCH = 1
- FILTER_SKIP = 2
- NETWORKERROR_SLEEP = 3
- PID_KEEP = 1
- PID_PASS = 2
- SKIP_BINARY_EXIST = 4
- SKIP_EXIST = 2
- SKIP_NONE = 0
- SKIP_NOT_EXIST = 1
- VERSIONING_ALWAYS = 2
- VERSIONING_DATE = 4
- VERSIONING_DIGEST = 3
- VERSIONING_NONE = 1
Properties
- $debug : bool
- Turns debug messages on
- $autoCommit : int
- Number of resources after which a commit is automatically triggered (0 - no auto commit)
- $binaryClass : NamedNodeInterface
- URI of an RDF class assigned to indexed binary resources.
- $collectionClass : NamedNodeInterface
- URI of an RDF class assigned to indexed collections.
- $depth : int
- Maximum number of nested subdirectory levels to index.
- $directory : string
- Base ingestion path to be substituted with the $idPrefix to form a binary id.
- $filter : string
- Regular expression for matching child resource file names.
- $filterNot : string
- Regular expression for excluding child resource file names.
- $flatStructure : bool
- Should children be attached directly to the RepoResource, or should each subdirectory result in a separate collection resource containing its children.
- $idgen : FileId
- $idPrefix : string|null
- Namespace substituted for the $directory part of the ingested binary path to form a binary id.
- $includeEmpty : bool
- Should resources be created for empty directories.
- $metaLookup : MetaLookupInterface
- An object providing metadata when given a resource file path
- $metaLookupRequire : bool
- Should files without external metadata (provided by the `$metaLookup` object) be skipped.
- $parent : RepoResource
- RepoResource whose children are created by the Indexer
- $pidPass : int
- Should PIDs (epic handles) be migrated to the new version of a resource during versioning.
- $repo : Repo
- Repository connection
- $schema : Schema
- Repository schema
- $skipMode : int
- Should files existing (or not existing) in the repository be skipped?
- $uploadSizeLimit : int
- Maximum size (in bytes) of a child resource for which a binary resource is created.
- $uriNorm : UriNormalizer
- $versioningMode : int
- Should new versions of binary resources already existing in the repository be created (if not, an existing resource is simply overwritten).
Methods
- __construct() : mixed
- Creates the Indexer object.
- import() : array<string|int, RepoResource|ClientException>
- Performs the indexing.
- setAutoCommit() : Indexer
- Controls the automatic commit behaviour.
- setBinaryClass() : Indexer
- Sets default RDF class for imported binary resources.
- setCollectionClass() : Indexer
- Sets default RDF class for imported collections.
- setDepth() : Indexer
- Sets maximum indexing depth.
- setFilter() : Indexer
- Sets file name filter for child resources.
- setFlatStructure() : Indexer
- Sets whether child resources are attached directly to the indexed RepoResource (`$isFlat` equal to `true`) or a separate collection repository resource is created for each subdirectory (`$isFlat` equal to `false`).
- setIncludeEmptyDirs() : Indexer
- Sets if repository resources should be created for empty directories.
- setMetaLookup() : Indexer
- Sets a class providing metadata for indexed files.
- setParent() : Indexer
- Sets the parent resource for the files in the `$directory` (constructor parameter) directory.
- setSkip() : Indexer
- Defines if (and how) resources should be skipped from indexing based on whether they already exist in the repository.
- setUploadSizeLimit() : Indexer
- Sets size threshold for uploading child resources as binary resources.
- setVersioning() : Indexer
- Defines if new versions of binary resources should be created or if they should be simply overwritten with a new binary payload.
- createFile() : File
- listFiles() : array<string|int, File>
- Gets the list of files/dirs matching the filename filters and the depth limit.
Constants
ERRMODE_CONTINUE
public mixed ERRMODE_CONTINUE = 'continue'
ERRMODE_FAIL
public mixed ERRMODE_FAIL = 'fail'
ERRMODE_INCLUDE
public mixed ERRMODE_INCLUDE = 'include'
ERRMODE_PASS
public mixed ERRMODE_PASS = 'pass'
FILTER_MATCH
public mixed FILTER_MATCH = 1
FILTER_SKIP
public mixed FILTER_SKIP = 2
NETWORKERROR_SLEEP
public mixed NETWORKERROR_SLEEP = 3
PID_KEEP
public mixed PID_KEEP = 1
PID_PASS
public mixed PID_PASS = 2
SKIP_BINARY_EXIST
public mixed SKIP_BINARY_EXIST = 4
SKIP_EXIST
public mixed SKIP_EXIST = 2
SKIP_NONE
public mixed SKIP_NONE = 0
SKIP_NOT_EXIST
public mixed SKIP_NOT_EXIST = 1
VERSIONING_ALWAYS
public mixed VERSIONING_ALWAYS = 2
VERSIONING_DATE
public mixed VERSIONING_DATE = 4
VERSIONING_DIGEST
public mixed VERSIONING_DIGEST = 3
VERSIONING_NONE
public mixed VERSIONING_NONE = 1
Properties
$debug
Turns debug messages on
public static bool $debug = false
$autoCommit
Number of resources after which a commit is automatically triggered (0 - no auto commit)
private int $autoCommit = 0
$binaryClass
URI of an RDF class assigned to indexed binary resources.
private NamedNodeInterface $binaryClass
$collectionClass
URI of an RDF class assigned to indexed collections.
private NamedNodeInterface $collectionClass
$depth
Maximum number of nested subdirectory levels to index.
private int $depth = \PHP_INT_MAX
$directory
Base ingestion path to be substituted with the $idPrefix to form a binary id.
private string $directory
$filter
Regular expression for matching child resource file names.
private string $filter = ''
$filterNot
Regular expression for excluding child resource file names.
private string $filterNot = ''
$flatStructure
Should children be attached directly to the RepoResource, or should each subdirectory result in a separate collection resource containing its children.
private bool $flatStructure = false
$idgen
private FileId $idgen
$idPrefix
Namespace substituted for the $directory part of the ingested binary path to form a binary id.
private string|null $idPrefix = null
$includeEmpty
Should resources be created for empty directories.
private bool $includeEmpty = false
Skipped if $flatStructure equals true.
$metaLookup
An object providing metadata when given a resource file path
private MetaLookupInterface $metaLookup
$metaLookupRequire
Should files without external metadata (provided by the `$metaLookup` object) be skipped.
private bool $metaLookupRequire = false
$parent
RepoResource whose children are created by the Indexer
private RepoResource $parent
$pidPass
Should PIDs (epic handles) be migrated to the new version of a resource during versioning.
private int $pidPass = self::PID_KEEP
$repo
Repository connection
private Repo $repo
$schema
Repository schema
private Schema $schema
$skipMode
Should files existing (or not existing) in the repository be skipped?
private int $skipMode = self::SKIP_NONE
$uploadSizeLimit
Maximum size (in bytes) of a child resource for which a binary resource is created.
private int $uploadSizeLimit = -1
For child resources bigger than this limit, an "RDF only" repository resource is created.
The special value of -1 means "import all, no matter their size".
$uriNorm
private UriNormalizer $uriNorm
$versioningMode
Should new versions of binary resources already existing in the repository be created (if not, an existing resource is simply overwritten).
private int $versioningMode = self::VERSIONING_NONE
Methods
__construct()
Creates the Indexer object.
public __construct(string $directory, string $idPrefix, Repo $repo) : mixed
It's important to understand how file/directory paths are mapped to repository resource identifiers:
- first, the $directory prefix is stripped from the path
- the rest is URL-encoded, with '/' characters preserved
- finally, the $idPrefix is prepended
E.g. for a file /foo/foo bar/baz with $directory /foo/ and $idPrefix https://id.nmsp/foobar/, the mapping proceeds as follows:
- /foo/foo bar/baz => foo bar/baz
- foo bar/baz => foo%20bar/baz
- foo%20bar/baz => https://id.nmsp/foobar/foo%20bar/baz
The presence or absence of a trailing slash in $directory and $idPrefix doesn't matter (both are normalized internally).
Parameters
- $directory : string - path to be indexed.
- $idPrefix : string - prefix used to create repository resource identifiers from file/directory paths (see above).
- $repo : Repo - repository connection object.
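The three mapping steps described for the constructor can be sketched in plain PHP. This is only an illustration of the documented behaviour, not the library's actual code; the function name `pathToId()` is hypothetical, and the way trailing slashes are normalized here is an assumption.

```php
<?php
// Sketch only: mirrors the three documented mapping steps.
// pathToId() is a hypothetical helper, not part of the Indexer API.
function pathToId(string $path, string $directory, string $idPrefix): string
{
    // Normalize trailing slashes so their presence or absence doesn't matter.
    $directory = rtrim($directory, '/') . '/';
    $idPrefix  = rtrim($idPrefix, '/') . '/';

    // Step 1: strip the $directory prefix from the path.
    if (strncmp($path, $directory, strlen($directory)) !== 0) {
        throw new InvalidArgumentException("$path is not inside $directory");
    }
    $relative = substr($path, strlen($directory));

    // Step 2: URL-encode each path segment, preserving the '/' separators.
    $encoded = implode('/', array_map('rawurlencode', explode('/', $relative)));

    // Step 3: prepend the identifier prefix.
    return $idPrefix . $encoded;
}

// Prints https://id.nmsp/foobar/foo%20bar/baz
echo pathToId('/foo/foo bar/baz', '/foo', 'https://id.nmsp/foobar'), "\n";
```

Encoding segment by segment with `rawurlencode()` is one simple way to keep the slashes intact while turning spaces into `%20`.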
import()
Performs the indexing.
public import([string $errorMode = self::ERRMODE_FAIL ][, int $concurrency = 3 ][, int $retries = 6 ]) : array<string|int, RepoResource|ClientException>
Parameters
- $errorMode : string = self::ERRMODE_FAIL - what should happen if an error is encountered? One of:
  - Indexer::ERRMODE_FAIL - the first encountered error throws an exception.
  - Indexer::ERRMODE_PASS - the first encountered error turns off the autocommit but ingestion continues. Once all resources are processed, an array of RepoResource objects is returned if there were no errors; otherwise an exception is thrown.
  - Indexer::ERRMODE_INCLUDE - the first encountered error turns off the autocommit but ingestion continues. The returned array contains RepoResource objects for successful ingestions and Exception objects for failed ones.
  - Indexer::ERRMODE_CONTINUE - the import continues regardless of any errors encountered. The returned array contains RepoResource objects for successful ingestions and Exception objects for failed ones.
- $concurrency : int = 3 - number of parallel requests to the repository allowed during the import
- $retries : int = 6 - how many ingestion attempts should be made if the repository resource is locked by another request or a network connection error occurs
Return values
array<string|int, RepoResource|ClientException> - a list of RepoResource objects representing indexed resources
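With ERRMODE_INCLUDE or ERRMODE_CONTINUE the returned array mixes resource objects and exception objects, so callers usually need to separate the two. A minimal sketch of that step follows; `partitionResults()` is a hypothetical helper, and `stdClass` merely stands in for RepoResource so the snippet runs without the library.

```php
<?php
/**
 * Splits an import()-style result list into successes and failures.
 * Hypothetical helper, not part of the Indexer API.
 *
 * @return array{0: array, 1: array} [successes, failures], original keys preserved
 */
function partitionResults(array $results): array
{
    $ok = [];
    $failed = [];
    foreach ($results as $key => $item) {
        if ($item instanceof Throwable) {
            $failed[$key] = $item; // failed ingestion, reported as an exception object
        } else {
            $ok[$key] = $item;     // successful ingestion
        }
    }
    return [$ok, $failed];
}

// Stand-in data: stdClass plays the role of RepoResource here.
$results = [new stdClass(), new RuntimeException('locked'), new stdClass()];
[$ok, $failed] = partitionResults($results);
printf("%d ingested, %d failed\n", count($ok), count($failed));
```

Checking against `Throwable` rather than a concrete exception class keeps the sketch independent of which exception type the repository client actually raises.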
setAutoCommit()
Controls the automatic commit behaviour.
public setAutoCommit(int $count) : Indexer
Even when you use autocommit, you should commit your transaction after Indexer::import() (the only exception is when you set auto commit to 1, forcing each and every resource to be committed separately, but you probably don't want to do that for performance reasons).
Parameters
- $count : int - number of resources after which a commit is automatically triggered (0 - no auto commit)
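To see why a final commit is still needed with autocommit enabled, consider the arithmetic below. It assumes (this is not confirmed by the documentation) that an automatic commit fires after every `$count`-th ingested resource; `uncommittedAfterImport()` is a hypothetical helper used only for illustration.

```php
<?php
// Assumption: an automatic commit fires after every $autoCommit-th resource.
// uncommittedAfterImport() is a hypothetical helper, not part of the Indexer API.
function uncommittedAfterImport(int $total, int $autoCommit): int
{
    if ($autoCommit <= 0) {
        return $total; // auto commit disabled: nothing is committed automatically
    }
    return $total % $autoCommit; // resources ingested since the last automatic commit
}

echo uncommittedAfterImport(1050, 100), "\n"; // 50 resources still await a commit
echo uncommittedAfterImport(1050, 1), "\n";   // 0, every resource was committed on its own
echo uncommittedAfterImport(1050, 0), "\n";   // 1050, nothing was committed automatically
```

Under this assumption only a count of 1 guarantees that nothing is left uncommitted, which matches the exception mentioned in the description.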
Return values
Indexer
setBinaryClass()
Sets default RDF class for imported binary resources.
public setBinaryClass(string|NamedNodeInterface $class) : Indexer
Overrides the setting read from the cfg::defaultBinaryClass configuration property.
Parameters
- $class : string|NamedNodeInterface
Return values
Indexer
setCollectionClass()
Sets default RDF class for imported collections.
public setCollectionClass(string|NamedNodeInterface $class) : Indexer
Overrides the setting read from the cfg::defaultCollectionClass configuration property.
Parameters
- $class : string|NamedNodeInterface
Return values
Indexer
setDepth()
Sets maximum indexing depth.
public setDepth(int $depth) : Indexer
Parameters
- $depth : int - maximum indexing depth (0 - only initial resource dir, 1 - also its direct subdirectories, etc.)
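The depth semantics (0 for the initial directory only, 1 to include its direct subdirectories, and so on) can be illustrated with a small recursive listing. The sketch below works on a nested array instead of the file system so that it is self-contained; `listWithDepth()` and the array layout are assumptions made for illustration and do not reflect how the private listFiles() method is implemented.

```php
<?php
/**
 * Lists file paths in a nested array tree (array = directory, anything else = file),
 * descending at most $depth directory levels below the starting directory.
 * Hypothetical helper, not part of the Indexer API.
 */
function listWithDepth(array $dir, int $depth, string $prefix = ''): array
{
    $files = [];
    foreach ($dir as $name => $entry) {
        if (!is_array($entry)) {
            $files[] = $prefix . $name;
        } elseif ($depth > 0) {
            // Descend one level and spend one unit of the remaining depth.
            $files = array_merge($files, listWithDepth($entry, $depth - 1, "$prefix$name/"));
        }
    }
    return $files;
}

$tree = ['a.txt' => 1, 'sub' => ['b.txt' => 1, 'deep' => ['c.txt' => 1]]];
print_r(listWithDepth($tree, 0)); // a.txt only
print_r(listWithDepth($tree, 1)); // a.txt and sub/b.txt
```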
Return values
Indexer
setFilter()
Sets file name filter for child resources.
public setFilter(string $filter[, int $type = self::FILTER_MATCH ]) : Indexer
You can choose whether file names must match or must not match (skip) the filter using the $type parameter. You can set both match and skip filters by calling setFilter() twice (once with $type = Indexer::FILTER_MATCH and once with $type = Indexer::FILTER_SKIP).
The filter is applied only to file names, NOT to directory names.
Parameters
- $filter : string - regular expression conformant with preg_replace()
- $type : int = self::FILTER_MATCH - decides if $filter is a match or skip filter (one of Indexer::FILTER_MATCH and Indexer::FILTER_SKIP)
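How a match filter and a skip filter might combine can be sketched as follows. This is an assumption about the behaviour, not the library's code: an empty pattern is treated as "no constraint" (suggested by the `''` defaults of `$filter` and `$filterNot`), and `passesFilters()` is a hypothetical helper.

```php
<?php
// Assumption: an empty pattern means "no constraint".
// passesFilters() is a hypothetical helper, not part of the Indexer API.
function passesFilters(string $fileName, string $match = '', string $skip = ''): bool
{
    if ($match !== '' && preg_match($match, $fileName) !== 1) {
        return false; // a match filter is set, but the name doesn't match it
    }
    if ($skip !== '' && preg_match($skip, $fileName) === 1) {
        return false; // a skip filter is set, and the name matches it
    }
    return true;
}

var_dump(passesFilters('scan.tif', '/\.tiff?$/i', '/^\./'));    // bool(true)
var_dump(passesFilters('.hidden.tif', '/\.tiff?$/i', '/^\./')); // bool(false)
var_dump(passesFilters('notes.txt', '/\.tiff?$/i'));            // bool(false)
```

Note that the patterns include their delimiters (here `/.../`), as PCRE functions in PHP require.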
Return values
Indexer
setFlatStructure()
Sets whether child resources are attached directly to the indexed RepoResource (`$isFlat` equal to `true`) or a separate collection repository resource is created for each subdirectory (`$isFlat` equal to `false`).
public setFlatStructure(bool $isFlat) : Indexer
Parameters
- $isFlat : bool
Return values
Indexer
setIncludeEmptyDirs()
Sets if repository resources should be created for empty directories.
public setIncludeEmptyDirs(bool $include) : Indexer
Parameters
- $include : bool - should resources be created for empty directories
Return values
Indexer
setMetaLookup()
Sets a class providing metadata for indexed files.
public setMetaLookup(MetaLookupInterface $metaLookup[, bool $require = false ]) : Indexer
Parameters
- $metaLookup : MetaLookupInterface
- $require : bool = false - should files lacking external metadata be skipped
Return values
Indexer
setParent()
Sets the parent resource for the files in the `$directory` (constructor parameter) directory.
public setParent(RepoResource $resource) : Indexer
Parameters
- $resource : RepoResource
Return values
Indexer
setSkip()
Defines if (and how) resources should be skipped from indexing based on whether they already exist in the repository.
public setSkip(int $skipMode) : Indexer
Parameters
- $skipMode : int - mode. Any combination (bitwise OR or sum) of:
  - Indexer::SKIP_NONE (default) - import all resources,
  - Indexer::SKIP_NOT_EXIST - skip ingestion of all resources which don't exist at the beginning of the ingestion.
  - Indexer::SKIP_EXIST - skip ingestion of all resources which exist at the beginning of the ingestion.
  - Indexer::SKIP_BINARY_EXIST - skip ingestion of all resources which already exist and either already have binary payload or are metadata only resources.
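Because the skip modes are distinct powers of two, combining them with a bitwise OR gives the same value as adding them, and each flag can be tested independently. The sketch below copies the constant values from the Constants section; the local constant names and the `hasFlag()` helper are stand-ins for illustration, not the library's internals.

```php
<?php
// Values copied from the documented constants; the names are local stand-ins.
const SKIP_NONE = 0;
const SKIP_NOT_EXIST = 1;
const SKIP_EXIST = 2;
const SKIP_BINARY_EXIST = 4;

// Hypothetical helper: true when $flag is a non-zero flag set in $mode.
function hasFlag(int $mode, int $flag): bool
{
    return $flag !== 0 && ($mode & $flag) === $flag;
}

$mode = SKIP_NOT_EXIST | SKIP_BINARY_EXIST;

var_dump($mode === SKIP_NOT_EXIST + SKIP_BINARY_EXIST); // bool(true): OR and sum agree
var_dump(hasFlag($mode, SKIP_BINARY_EXIST));            // bool(true)
var_dump(hasFlag($mode, SKIP_EXIST));                   // bool(false)
```

The bitwise OR is the safer habit of the two: adding the same flag twice changes the value, while OR-ing it twice does not.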
Return values
Indexer
setUploadSizeLimit()
Sets size threshold for uploading child resources as binary resources.
public setUploadSizeLimit(int $limit) : Indexer
For files bigger than this threshold, a "pure RDF" repository resource is created, containing full metadata but no binary content.
Parameters
- $limit : int - maximum size in bytes; 0 causes no files to be uploaded, the special value of -1 (default) causes all files to be uploaded no matter their size
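The documented limit values translate into a simple decision, sketched below. `uploadsBinary()` is a hypothetical helper; the handling of a file exactly at the limit, and of an empty file with a limit of 0, are assumptions, since the documentation does not spell out those boundary cases.

```php
<?php
// Hypothetical helper, not part of the Indexer API.
function uploadsBinary(int $fileSize, int $limit = -1): bool
{
    if ($limit === -1) {
        return true; // documented special value: upload everything
    }
    // Assumptions: a file exactly at the limit is still uploaded,
    // and a limit of 0 uploads nothing, not even empty files.
    return $limit > 0 && $fileSize <= $limit;
}

var_dump(uploadsBinary(5000000));    // bool(true): no limit set
var_dump(uploadsBinary(2048, 1024)); // bool(false): becomes a metadata-only resource
var_dump(uploadsBinary(1024, 1024)); // bool(true)
var_dump(uploadsBinary(10, 0));      // bool(false): a limit of 0 uploads nothing
```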
Return values
Indexer
setVersioning()
Defines if new versions of binary resources should be created or if they should be simply overwritten with a new binary payload.
public setVersioning(int $versioningMode[, int $migratePid = self::PID_KEEP ]) : Indexer
Parameters
- $versioningMode : int - mode, one of Indexer::VERSIONING_NONE, Indexer::VERSIONING_ALWAYS, Indexer::VERSIONING_DIGEST or Indexer::VERSIONING_DATE
- $migratePid : int = self::PID_KEEP - should PIDs (epic handles) be migrated to the new version; either Indexer::PID_KEEP or Indexer::PID_PASS
Return values
Indexer
createFile()
private createFile(SplFileInfo $file) : File
Parameters
- $file : SplFileInfo
Return values
File
listFiles()
Gets the list of files/dirs matching the filename filters and the depth limit.
private listFiles(SplFileInfo $dir, int $level) : array<string|int, File>
Parameters
- $dir : SplFileInfo
- $level : int