Why Indexing Matters More Than Ever
Heritage institutions, archives, and libraries face an ever-growing challenge: how to keep vast and diverse collections accessible in an era when materials multiply faster than they can be catalogued. Text, images, audio, and video transcripts demand structured entry points for discovery, yet manual indexing can no longer keep pace. Controlled vocabularies—authority files, thesauri, and custom taxonomies—have long been the foundation of consistent, interoperable access. Now, with the help of AI, it is possible to apply them at scale and with a depth that would have been unthinkable only a few years ago.
Some may ask whether keywords are still necessary in a time when AI search promises to understand natural language queries directly. But the value of controlled and custom vocabularies lies in something different: they provide stable, shared points of access that cut across institutions, languages, and research practices. Keywords anchor discovery, support interoperability, and allow collections to be explored not only through ad hoc questions but also through structured perspectives. By combining the scalability of AI with the rigor of controlled vocabularies, institutions can expand both the efficiency and the depth of their indexing practices.
The Challenge of Controlled Vocabularies
Applying controlled vocabularies to large collections has always been labor-intensive, but it remains indispensable. Authority files such as the Gemeinsame Normdatei (GND, the German Integrated Authority File) or the Library of Congress Subject Headings (LCSH) make it possible to connect a photograph, a transcript, and a policy document under the same concept, even when described in different languages or with different terminology. Yet the very strength of controlled vocabularies—their rigor—also makes them difficult to scale: they demand precise matches, careful disambiguation, and awareness of hierarchical relations.
AI systems bring both promise and pitfalls to this task. Many authority files have hundreds of thousands of entries. This is not a challenge one can solve by “just asking ChatGPT” in its web interface. It requires a dedicated system that can store and access the vocabulary efficiently, and a precise methodology for matching terms. In such vocabularies, a concept is never an island: it exists within a broader network of narrower, broader, and related terms. Without sensitivity to these structures, automated indexing risks producing inconsistent or noisy results, or overlooking crucial concepts altogether.
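To make this concrete, here is a minimal sketch of how such a networked concept might be represented in code. The Concept class, its field names, and the identifiers are illustrative only, not an actual GND or LCSH schema:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """One vocabulary entry with its SKOS-style relations (illustrative schema)."""
    concept_id: str                                       # authority-file identifier
    pref_label: str                                       # preferred label
    alt_labels: list[str] = field(default_factory=list)   # synonyms / variant labels
    broader: list[str] = field(default_factory=list)      # IDs of broader concepts
    narrower: list[str] = field(default_factory=list)     # IDs of narrower concepts
    related: list[str] = field(default_factory=list)      # IDs of related concepts

# Hypothetical identifiers and labels, for illustration only:
tram = Concept(
    concept_id="ex:0042",
    pref_label="Trams",
    alt_labels=["Streetcars"],
    broader=["ex:0040"],    # e.g. "Public transport"
    related=["ex:0043"],    # e.g. "Light rail"
)
vocabulary = {c.concept_id: c for c in [tram]}   # keyed for fast lookup
```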
Our Approach in 2025
At aureka, we began working on this problem in 2023 with a proof of concept for the Berlin City Museum, where we tested automated indexing on a collection of long biographical interviews. Since then, we have expanded and refined our method, leveraging the latest developments in AI for this task. The foundation of our approach is embeddings—mathematical representations of semantic meaning expressed as vectors. By measuring the distance between these vectors, we can map objects—whether text passages, image descriptions, or transcript segments—into a semantic space and compare them with vocabulary terms.
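As a simplified illustration, the comparison step can be sketched as a nearest-neighbour search over term vectors. The embedding model itself is left abstract here (the vectors are assumed to come from any sentence-embedding model), and at production scale one would use a vector database rather than a Python loop:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_terms(segment_vec: np.ndarray,
               term_vecs: dict[str, np.ndarray],
               k: int = 10) -> list[tuple[str, float]]:
    """Return the k vocabulary terms closest to a text segment in embedding space."""
    scored = [(term, cosine_similarity(segment_vec, vec))
              for term, vec in term_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```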
Early on, however, we realized that plain embeddings of vocabulary terms are not enough. To work reliably with authority files, embeddings must be enriched with hierarchical relations. By incorporating broader, narrower, and related terms into the embedding space, indexing does not stop at surface matches but captures the full conceptual landscape. Equally, simply retrieving the “closest” vectors is insufficient: embeddings capture broad semantic similarity, so vector distance alone lacks the precision that controlled vocabularies demand. This is where LLMs add value: they filter out terms that sit close in the vector space but are not a suitable match for the specific context.
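A hedged sketch of both steps, building on the Concept class above: the text embedded for each term is enriched with the labels of its broader, narrower, and related concepts, and an LLM judgment (represented here by a stand-in llm_is_match callable, not a real API) prunes candidates that are close in vector space but wrong for the context:

```python
def term_embedding_text(concept, vocab) -> str:
    """Build the text that gets embedded for one concept, enriched with its
    hierarchical context so the vector reflects more than the bare label."""
    parts = [concept.pref_label] + concept.alt_labels
    parts += [f"broader: {vocab[c].pref_label}" for c in concept.broader if c in vocab]
    parts += [f"narrower: {vocab[c].pref_label}" for c in concept.narrower if c in vocab]
    parts += [f"related: {vocab[c].pref_label}" for c in concept.related if c in vocab]
    return "; ".join(parts)

def filter_candidates(segment: str, candidates: list[str], llm_is_match) -> list[str]:
    """Keep only terms the LLM judges appropriate for this specific segment.
    llm_is_match(segment, term) -> bool stands in for a real LLM call."""
    return [term for term in candidates if llm_is_match(segment, term)]
```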
Another essential step is chunking. Large documents or transcripts often contain multiple themes, each of which deserves indexing. By dividing text (or descriptive metadata for images and audiovisuals) into smaller segments, we maximize the number of candidate terms that can be retrieved. These terms are then sorted by relevance and aggregated according to clear criteria, resulting in a balanced set of keywords for each object.
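The chunking and aggregation logic might look like the following; the chunk size, the retrieve function, and the scoring scheme are placeholders for the retrieval pipeline sketched above, not our production parameters:

```python
from collections import Counter

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Naive word-window chunking; a real system would respect sentence
    or paragraph boundaries."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def index_document(text: str, retrieve, top_n: int = 15) -> list[str]:
    """Retrieve candidate terms per chunk, then aggregate by how strongly
    each term was matched across the whole document.
    retrieve(chunk) -> list[(term, score)] stands in for the embedding
    search plus LLM filter described above."""
    totals: Counter = Counter()
    for chunk in chunk_text(text):
        for term, score in retrieve(chunk):
            totals[term] += score
    return [term for term, _ in totals.most_common(top_n)]
```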
This workflow creates a system that is both scalable and precise. It enables institutions to index collections of any size with controlled or custom vocabularies, producing rich and consistent access points without sacrificing interpretability or compliance with cataloguing standards.
What This Unlocks for Heritage Institutions
Automated indexing with controlled and custom vocabularies is not just about efficiency. It opens new possibilities for how collections can be explored, understood, and connected. By combining the scalability of embeddings with the precision of hierarchical vocabularies and the contextual judgment of LLMs, institutions can achieve an indexing depth that was previously unattainable. Thousands of items can now be described consistently, revealing patterns and connections across entire collections.
For archivists, librarians, and heritage professionals, this means more than saving time. It ensures consistency by applying vocabularies rigorously across heterogeneous materials. It enhances discoverability by making it possible to navigate collections not only through keywords but also through the broader and narrower concepts that link them. And it strengthens interoperability by allowing institutions to align their catalogues with international standards and with each other.
In this sense, automated indexing is not a replacement for curatorial expertise but an amplifier of it. By taking over the repetitive task of large-scale vocabulary assignment, AI frees experts to focus on refining vocabularies, interpreting results, and shaping new research and public engagement possibilities.