Detection and Classification of Historic Watermarks using neural networks and nearest neighbor search
https://zenodo.org/records/10698260
The history of paper in Europe dates back to the 12th century, when papermaking technology arrived from China by way of the Islamic world. The first paper mills were established in Spain and Italy, followed by France and Germany in the 13th century. Initially, paper was produced by scooping pulp made from linen fibers with a sieve. These sieves were often fitted with a metal wire bent into a design or pattern, e.g., animals or letters. Because the wire differed in thickness from the surrounding area of the sieve, its pattern was imprinted on the paper. We call this imprint a watermark. (Hunter 1978) (Damberger 2006) (Vereinigung der Österreichischen Papierindustrie 2023)
The study of watermarks is significant for the historical humanities. Watermarks offer valuable insights into the origins of paper, helping to identify the papermaker, the mill, and the period in which a specific piece of paper was made. This information plays a crucial role in dating and verifying historical documents. Because watermark designs changed over time, they allow documents to be dated precisely. By identifying watermarks in various documents, one can establish connections between manuscripts, trace paper sources, and track trade routes and distribution networks. However, identifying and comparing watermarks is currently difficult and time-consuming. (Barrett 2022) (Fuller 2002)
Historians ask the German National Library (DNB) to identify watermarks, typically by sending image attachments by mail. Then, a DNB expert physically goes into the archive to find the same or a similar watermark (e.g., from the same mill but a different period). Since this manual process requires highly trained and experienced people, it is both slow and expensive. Thus, there is a bottleneck for watermark recognition since the DNB lacks resources to quickly handle all requests.
We build a “human in the middle” model to help historians efficiently search for watermarks on their own. We provide the user with a ranking of the images most similar to their query image. The user then manually compares their watermark with the results of the nearest neighbor (NN) search. Non-experts can easily carry out this simple comparison in a reasonable time.
This work presents a novel approach to automatically finding similar watermarks, based on the digitized collection of historical papers, watermarks, and traced watermarks from the German Museum of Books and Writing and the DNB (Deutsche Nationalbibliothek 2023). Previous approaches either used various image processing techniques or trained neural networks for classification (Stewart 1995) (Belov 1999). The former lack robustness, while the latter cannot deal with unseen groups of watermarks. Furthermore, existing approaches only include high-contrast tracings in the database, e.g., (Picard 2016) (Pondenkandath, et al. 2020) (Deng 2009) (Shen 2019). Limiting the database to tracings excludes much data from training and makes it impossible to find watermarks that have no corresponding tracing. As the DNB continuously adds new watermarks to the database, we aimed for an approach that adapts to new watermarks. Moreover, as the watermarks are only loosely labeled, the network must not depend on paired input-output examples.
In our approach, we create an unpaired dataset of watermarks and preprocessed tracings (sketches). Using this dataset, we train a CycleGAN neural network that generates a sketch of the watermark present in a scan of a historical paper (Zhu 2017). With this model, we generate sketches for all watermarks in the dataset and combine them with the sketches produced by preprocessing the tracings. We use a pre-trained ResNet18 neural network to extract a feature vector from each sketch. Finally, we use the Spotify Annoy library for an efficient approximate NN search over the entire database (Bernhardsson 2018).
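The final retrieval step can be illustrated with a minimal sketch. It assumes that feature vectors have already been extracted from the sketches, and it swaps in an exact brute-force search for Annoy's approximate search so that it runs with the Python standard library only. The distance used is the angular distance (Euclidean distance between normalized vectors), which is one of the metrics Annoy offers; whether the pipeline uses this metric is an assumption. All function names, identifiers, and toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between the L2-normalized vectors: sqrt(2 - 2*cos(u, v))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))


def nearest_neighbors(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k database entries closest to the query, nearest first.

    Exact brute-force search; an approximate index would avoid scanning
    every entry at the cost of occasionally missing a true neighbor.
    """
    scored = [(item_id, angular_distance(query, vec)) for item_id, vec in database.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical 3-dimensional feature vectors; real ones would come from the
# feature extractor and have far more dimensions.
toy_database = {
    "tracing_bull_head": [1.0, 0.0, 0.0],
    "scan_bull_head": [0.9, 0.1, 0.0],
    "tracing_letter_p": [0.0, 1.0, 0.0],
    "scan_crown": [0.0, 0.0, 1.0],
}
query_vector = [0.8, 0.2, 0.0]
ranking = nearest_neighbors(query_vector, toy_database, k=2)
for item_id, distance in ranking:
    print(f"{item_id}: {distance:.3f}")
```

The ranked list returned here corresponds to what the user would inspect manually. Because scans and tracings share one feature space, both kinds of entries can appear in the same ranking.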
To test the pipeline, we selected a test set with 22 classes of watermark-tracing groups, ranging from 2 to 167 observations per class. Running the pipeline on the test watermarks against a database of over 6200 digitized watermarks and traced watermarks, we find a corresponding tracing within the 25 NNs in 50% of cases, and within the 50 NNs in over 68% of cases.
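One plausible way to compute such a figure is a hit rate at k: the fraction of queries for which at least one item of the same class appears among the first k neighbors. The sketch below is an assumption about the metric, not the evaluation code of this work; the function name, class labels, and toy rankings are hypothetical.

```python
from typing import Dict, List


def hit_rate_at_k(
    rankings: Dict[str, List[str]],
    query_labels: Dict[str, str],
    item_labels: Dict[str, str],
    k: int,
) -> float:
    """Fraction of queries with at least one same-class item in their top k."""
    if not rankings:
        raise ValueError("at least one query is required")
    hits = 0
    for query_id, ranked_ids in rankings.items():
        target = query_labels[query_id]
        # A query counts as a hit if any of its first k neighbors shares its class.
        if any(item_labels[item_id] == target for item_id in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)


# Toy data: two queries ranked against three database items.
item_labels = {"a": "letter", "b": "bull", "c": "crown"}
query_labels = {"q1": "bull", "q2": "crown"}
rankings = {"q1": ["a", "b", "c"], "q2": ["a", "b", "c"]}

for k in (1, 2, 3):
    print(k, hit_rate_at_k(rankings, query_labels, item_labels, k))
```

The hit rate can only stay equal or grow as k increases, since a larger k extends the inspected prefix of each ranking.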
The pipeline shows promising results that apply to different scenarios. Non-experts can identify their watermark by examining fewer than 50 watermarks, with a success rate of roughly 70%. We anticipate even better results with 100-150 NNs. Watermark experts can retrieve similar watermarks based on image content and thereby uncover correlations of scientific importance. Moreover, the database integrates easily with DNB metadata, which provides additional details on each watermark.
Training on a larger dataset or using a transformer-based model could enhance the pipeline and database, making it a primary resource for both experts and non-experts in historical watermark research.
Code: https://github.com/EvgheniiBeriozchin/watermark-detection
Bibliography
- Barrett, Timothy et al. 2022. "European Papermaking Techniques 1300-1800." University of Iowa.
- Belov, V.V. and Esipova, V.A. and Kalaida, V.T. and Klimkin, V.M. 1999. "Physical and Mathematical Methods for the Visualization and Identification of Watermarks." Solanus.
- Bernhardsson, Erik. 2018. "ANNOY library." Spotify.
- Bounou, Oumayma, Tom Monnier, Ilaria Pastrolin, Xi Shen, and Christine Benevent et al. 2020. "A Web Application for Watermark Recognition." Journal of Data Mining & Digital Humanities, 07 14.
- Damberger, Joachim. 2006. "Geschichte der Papierherstellung." (LWF - aktuell).
- Deng, Jia and Dong, Wei and Socher, Richard and Li-Jia Li, Kai Li and Li Fei-Fei. 2009. "ImageNet: A Large-Scale Hierarchical Image Database." 2009 IEEE Conference on Computer Vision and Pattern Recognition. Miami, Florida: Institute of Electrical and Electronics Engineers.
- Deutsche Nationalbibliothek. 2023. Historical Paper Collections. 06. https://www.dnb.de/EN/Sammlungen/DBSM/PapierhistorischeSammlungen/papierhistorischeSammlungen.
- Fuller, Neathery Batsell. 2002. "A Brief History Of Paper." St. Louis Community College.
- Hunter, D. 1978. Papermaking: The History and Technique of an Ancient Craft. Dover Publications.
- Picard, David and Henn, Thomas and Dietz, Georg. 2016. Non-negative dictionary learning for paper watermark similarity. Conference, Pacific Grove, United States: Asilomar Conference on Signals, Systems, and Computers.
- Pondenkandath, V., M. Alberti, N. Eichenberger, R. Ingold, and M Liwicki. 2020. "Cross-Depicted Historical Motif Categorization and Retrieval with Deep Learning." Journal of Imaging 6, 71.
- Shen, X., Pastrolin, I., Bounou, O., Gidaris, S., Smith, M., Poncet, O., & Aubry, M. 2019. "Large-Scale Historical Watermark Recognition: dataset and a new consistency-based approach." 2020 25th International Conference on Pattern Recognition (ICPR). Milan: International Association of Pattern Recognition.
- Stewart, D. and Scharf, R. A. and Arney, J. S. 1995. "Techniques for digital image capture of watermarks." Journal of Imaging Science and Technology.
- Vereinigung der Österreichischen Papierindustrie. 2023. Papier macht Schule - Geschichte der Papierproduktion. Accessed June 2023. https://www.papiermachtschule.at/papierproduktion/geschichte/.
- Zhu, Jun-Yan and Park, Taesung and Isola, Phillip and Efros, Alexei A. 2017. "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks." Computer Vision (ICCV).