Orbis Pictus – Reviving the Book for the Cultural and Creative Sectors
In terms of the NAKI III program, the project focuses on priority No. 16, Methods of identification, documentation, recording, and interpretation of national immovable and movable cultural heritage, and, within the secondary priorities, on No. 19, Care for national movable cultural heritage in collection-building institutions using modern tools and applications for storage, preservation, and presentation, and No. 25, Applied research and the use of its results to support art and artistic crafts.
The aim of the project is to open up the graphic content of digital libraries to the public, especially to users from the creative and cultural sectors, who will be able to use the graphic entities they find for their creative work.
In terms of novelty, the most significant result of the project will be the AnnoPage tool, which will use machine learning methods to identify graphic entities on scanned pages. It will then categorize the entities found and automatically add a brief text description to each of them. AnnoPage will provide information of sufficient quality to be deployed in a production environment. The output of the system will thus be a metadata set containing both the position of the entity within the scanned page (so that it can be virtually cut out or marked on the page) and text data characterizing the entity, obtained by analyzing the entity and the rest of the page on which it is located. The metadata sets obtained in this way will be used in other systems and tools.
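To make the shape of such a metadata set concrete, the following minimal sketch models one record in Python. The field names, the identifier format, and the pixel-based bounding box are assumptions made for illustration only; the project description does not specify the actual AnnoPage schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class GraphicEntity:
    """One graphic entity detected on a scanned page (hypothetical schema)."""
    page_id: str      # identifier of the scanned page (assumed format)
    category: str     # assigned category, e.g. "map" or "portrait"
    description: str  # brief, automatically generated text description
    x: int            # left edge of the bounding box, in pixels
    y: int            # top edge of the bounding box, in pixels
    width: int        # bounding box width, in pixels
    height: int       # bounding box height, in pixels

    def crop_box(self) -> tuple:
        """Return (left, top, right, bottom) for virtually cutting out the entity."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


# An invented example record; none of these values come from real data.
entity = GraphicEntity(
    page_id="uuid:example-page",
    category="map",
    description="Map of a town with a river",
    x=120, y=340, width=800, height=600,
)

# Serialize the record so that other systems and tools could consume it.
record = json.dumps(asdict(entity), ensure_ascii=False)
```

A record of this kind carries both pieces of information named above: the position (here via `crop_box()`) and the descriptive text.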
The AnnoPage system will be directly linked to the PeopleGator system, whose input will be the image entities that AnnoPage marks as graphic objects depicting people. PeopleGator will use machine learning methods to identify people who appear in multiple images within image entities, thus enabling the creation of a virtual graph linking documents depicting these people. Another input to the system will be sets of images of already identified people, obtained from various open sources (Wikimedia, obalkyknih.cz, and others). By linking these images with images found in digital libraries, it will be possible to build a database of identified persons, the content of which will be presented both through an open API and in a searchable web interface.
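One way to picture the linking step is as clustering: faces whose feature vectors lie close together are treated as the same person, and the documents they come from are grouped. The sketch below is an assumption-laden illustration, not PeopleGator's method. The three-dimensional vectors are invented toy data (real vectors would come from a face-recognition model), and the distance threshold `MAX_DISTANCE` and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Toy data: (document id, face feature vector). Values are invented.
faces = [
    ("doc-A", [0.90, 0.10, 0.00]),
    ("doc-B", [0.88, 0.12, 0.05]),
    ("doc-C", [0.00, 0.20, 0.95]),
    ("doc-D", [0.02, 0.18, 0.97]),
]

MAX_DISTANCE = 0.2  # assumed threshold below which two faces count as one person


def find(parent, i):
    """Union-find lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def group_documents(faces, max_distance):
    """Group document ids whose faces are within max_distance of each other."""
    parent = list(range(len(faces)))
    for i in range(len(faces)):
        for j in range(i + 1, len(faces)):
            if math.dist(faces[i][1], faces[j][1]) <= max_distance:
                parent[find(parent, i)] = find(parent, j)  # merge the two groups
    groups = defaultdict(set)
    for i, (doc_id, _) in enumerate(faces):
        groups[find(parent, i)].add(doc_id)
    return sorted(sorted(group) for group in groups.values())


groups = group_documents(faces, MAX_DISTANCE)
```

Each resulting group corresponds to one presumed person, and the documents inside a group are the ones a virtual graph would connect.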
Another result of the project will be a web interface that communicates in the background with the API of the Kramerius digital library system. It will focus on searching and presenting graphic elements of digitized documents and will provide a set of practical tools for advanced work with the graphic content of the digital library and its further use. The system will make it easier to work with document excerpts and share them, to search within categories of graphic entities identified in documents by AnnoPage, and to virtually link documents depicting the same persons. It will also include a reader optimized for reading and exporting multi-page graphic documents (graphics, comics, cuts, etc.) and will enable the comparison of several selected elements (graphs, maps, images of persons, etc.).
A methodology will standardize work with identified graphic entities. It will propose their categorization, a method for their unambiguous identification and referencing, a method for recording the information obtained in structured metadata records, and recommendations on how to approach their indexing in the context of a digital library or other systems.
All of the above tools will then be integrated with the Czech Digital Library (CDL) to form the Czech Digital Library – Orbis Pictus, which will enable all libraries connected to the CDL to use these tools (at the time of writing, the CDL covers approximately 75% of the content digitized in Czech libraries). This will also significantly expand the amount of content that the Orbis Pictus project offers in an advanced form, allowing all users to make further use of the non-text parts of documents.
OmniOMR – Recognition of Musical Notation in Digital Libraries Using Machine Learning
In terms of the NAKI III program, the project focuses primarily on thematic priority No. 11, National and cultural identity in the preservation, documentation, and recording of cultural heritage in the areas of folk culture and traditions, music, theater, and film, and, as part of secondary priorities, No. 16 Methods of identification, documentation, recording, and interpretation of national immovable and movable cultural heritage and No. 19 Care for national movable cultural heritage in collection-building institutions using modern storage, preservation, and presentation tools and applications.
The project implements both the detection and recognition of musical notation (Optical Music Recognition, OMR) in digital library collections and a related user interface focused on searching for musical documents and within musical documents.
Currently, library information systems process musical notation only bibliographically. The Moravian Library was the first in the Czech Republic to introduce the cataloguing of musical manuscripts and old prints by entering a record of the musical incipit (the first few bars or notes) in Plaine & Easie Code syntax in accordance with the recommendations of RISM (https://www.jstor.org/stable/23504707). However, existing library systems do not allow further work with notation recorded in this way, and for search purposes musical incipits are not equivalent to records of entire compositions. When music is digitized, only digital images are stored, and these undergo text OCR at most. If musical notation occurs in a book that is not processed as music, it is not identified in any way. At present, digitized music records cannot be searched in the way that documents processed using OCR can be searched in full text. Furthermore, it is not possible to systematically search for documents of musical culture in mixed media.
The importance of automatic identification of musical notation in large volumes of digitized documents is demonstrated, for example, by the recent discovery of the first evidence of the oldest layer of polyphony (Notre Dame organum) in the Czech Republic in the collections of the National Library (https://www.literarky.cz/kultura/1775-narodni-knihovna-hlasi-unikatni-objev-fragment-sesti-skladeb-ze-13-stoleti).
The Makarius application will enable the indexing and searching of musical notation detected and recognized by the OmniOMR service in library collections. The system will be designed as a separate functional unit that can be connected via API to a digital library system (e.g., Kramerius) or discovery system (e.g., VuFind) used by Czech libraries. The application will enable the indexing of musical notation records obtained from the OmniOMR service and will implement appropriate algorithms for evaluating the similarity of two musical notation records. The indexing process will also include a process of searching for similar records that Makarius has already processed. The index will also contain links to scans of musical notation records and the URLs of their presentation and other necessary metadata. Through the API, it will be possible to search for musical scores or display musical scores similar to the selected score. The Makarius system can be developed from the beginning of the project using test data, which will create space for identifying any shortcomings, testing suitable indexing and search algorithms, and ensuring the overall user-friendliness of the application.
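The description does not say which similarity algorithms Makarius will use. As one illustrative possibility, the sketch below compares two melodies by converting them to sequences of pitch intervals (so that transposed versions of the same melody match) and computing a normalized edit distance between those sequences. The representation of a melody as a list of MIDI pitch numbers and all function names are assumptions.

```python
def intervals(pitches):
    """Convert absolute pitches into successive intervals (in semitones)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def melodic_similarity(pitches_a, pitches_b):
    """Return a similarity in [0, 1]; 1.0 means identical interval sequences."""
    ia, ib = intervals(pitches_a), intervals(pitches_b)
    longest = max(len(ia), len(ib))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ia, ib) / longest


theme = [60, 62, 64, 65, 67]       # C D E F G as MIDI pitch numbers
transposed = [67, 69, 71, 72, 74]  # the same melody a fifth higher
variant = [60, 62, 64, 67, 67]     # the same melody with one changed note

same = melodic_similarity(theme, transposed)
close = melodic_similarity(theme, variant)
```

Working on intervals rather than absolute pitches is a design choice of this sketch: it makes the measure insensitive to key, at the cost of ignoring rhythm entirely.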
semAnt – Semantic Explorer of Textual Cultural Heritage
Czech libraries and archives contain a huge amount of digitized documents. The possibilities for their online presentation and searchability have improved significantly in recent years. A large part of the digitized printed documents has already been processed using OCR and is therefore searchable in full text. Tools for the automatic transcription of old prints and handwritten documents already exist, and their complete processing is now only a matter of time.
However, the full-text search used in library systems is the simplest possible. It can usually find different forms of a word, but it cannot work with meaning. Finding documents on a specific topic is therefore very laborious. In contrast, current web search engines work with the meanings of words, allowing users to find texts that do not contain the exact search term but correspond to the search topic in a more general sense.
The main goal of this project is therefore to improve search in the full-text representation of digitized documents at the level of text meaning and to improve natural navigation between thematically similar documents. We will provide users with full-text search enhanced by an understanding of the meaning of queries, and with the ability to search by parts of text (e.g., paragraphs) while simultaneously specifying the topic that interests them in the given text. The system will work with automatically identified topics, but will also allow users to define their own topics based on examples from the texts.
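The combination of meaning-based ranking and a user-defined topic can be sketched with vector representations of paragraphs. Everything concrete below is an assumption for illustration: the two-dimensional vectors are invented (real ones would come from a language model that the project description does not name), the topic is modeled simply as the average of the example paragraphs' vectors, and the function names and threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def centroid(vectors):
    """Component-wise mean of several vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Invented paragraph vectors; the axes loosely stand for "agriculture" and "music".
paragraphs = {
    "p1": [0.9, 0.1],  # e.g. a harvest report
    "p2": [0.8, 0.3],  # e.g. a note on crop prices
    "p3": [0.1, 0.9],  # e.g. a concert review
}

# A user defines a topic by pointing at example paragraphs.
topic = centroid([paragraphs["p1"], paragraphs["p2"]])


def rank(query_vec, paragraphs, topic_vec=None, min_topic_score=0.0):
    """Rank paragraph ids by similarity to the query, optionally within a topic."""
    scored = []
    for pid, vec in paragraphs.items():
        if topic_vec is not None and cosine(vec, topic_vec) < min_topic_score:
            continue  # paragraph is not about the requested topic
        scored.append((cosine(query_vec, vec), pid))
    return [pid for _, pid in sorted(scored, reverse=True)]


query = [0.7, 0.4]  # invented vector standing in for an encoded user query
ranked = rank(query, paragraphs)
ranked_in_topic = rank(query, paragraphs, topic_vec=topic, min_topic_score=0.9)
```

Here the query never has to share a word with a paragraph to match it; the topic filter then restricts the ranked list to paragraphs resembling the user's examples.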
We will also use the project's ability to identify topics in texts for overview visualizations of the frequency of topic occurrences and their mutual interactions. This will make it possible to track the development of topics over time, their continuity and changes, or their connection to known named entities such as places and people.
The results of the project will be used both by the general public in their routine work with library systems and by the scientific community for higher-quality analysis of text data. At the same time, we hope that parts of the project will find application in software for analyzing contemporary media and social networks.