Information Panopticon


The AI-Taxonomy Disconnect


“Me, I disconnect from you.” – Gary Numan, Me! I Disconnect from You

I have seen a significant decline in the number of taxonomist positions available. Wondering if I was in a myopic bubble based on my location or choice of job boards, I’ve been asking colleagues if they see the same thing. The general consensus is yes: there are fewer taxonomist jobs available, even as foundational data structures become more important with AI tools. I have seen an increase, or at least a steady availability, of ontologist jobs, many requiring more technical expertise (the ability to create Python data ingestion pipelines and retrieval augmented generation (RAG) systems) than I have seen in the past. Again, this may be a matter of where I live or where I am seeking jobs, but jobs in the taxonomy and ontology field seem to be becoming more technical and less focused on the business operations side: working with stakeholders to provide analysis and guidance on semantic frameworks.

I suspect this shift is driven by a wider adoption of AI tools and a contraction in the job market. Employers seem to be seeking a single resource to do both the business and technical sides of implementing semantic structures and the foundational components for building out applications. I wonder if another factor is also the belief that large language models (LLMs) are a replacement for semantic models. If true, there are several reasons I can see that may be driving these beliefs.

Speed to Business

As I have touched upon before in my blog (Friction and Complexity and The Taxonomy Tortoise and the ML Hare), the manual, slower, but deliberate curation of controlled vocabularies can be seen as a roadblock to business speed and agility. There are several valid points here.

One area of pushback I see frequently in organizations is the response time from initially requesting a new concept, taxonomy branch, or vocabulary until it is available in production. The ownership of the concepts and data is, to some degree, taken out of the hands of the business users and put under the control of taxonomists who incorporate these concepts into centralized semantic models (taxonomies, thesauri, ontologies). Even with governance models including service level agreements stating turnaround times and taxonomy availability, not every group in the organization is going to see this centralized service as a benefit. Rather than wait for enterprise-level support, stakeholders may develop workarounds in services and tools to support their own use cases. As we know, this decentralization of schemas and tools creates a fragmented landscape: differing terminology, inconsistent functionality across the tools used to manage taxonomies, and divergent processes for handling data and content and tagging them with metadata.

Similarly, enterprise taxonomists serving many areas of the organization may seem too domain agnostic to serve the variety of use cases supported by controlled vocabularies. While taxonomists do not need to be domain experts in the areas covered by semantic models, the perception may be that the time they spend ramping up in a domain would be better spent having the subject matter experts in those domains build their own models. There is some validity here if the domains are truly standalone and operate to serve only those domain use cases. However, tying together various domain areas to form enterprise-wide knowledge graphs seems to be the direction most organizations want to go. If that’s the case, then a centralized taxonomy team as a service to the entire enterprise makes a lot of sense.

Given these counterpoints to slowly developing semantic models, why not, one may ask, simply ask machine learning models to provide the schemes we need to organize and optimize information?

Words Are Words Are Worlds

Are LLMs seen as a replacement for, or at least a viable alternative to, taxonomies (using the term as a broad umbrella for all controlled vocabularies)?

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) and provide the core capabilities of modern chatbots. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on (Wikipedia).

Certainly, building your own language model reflecting all of the ways natural language queries could be asked is a non-starter when high-performing public LLMs are available at our fingertips. The chatbots ChatGPT, Google Gemini, and Microsoft Copilot are some of the more familiar tools based on foundational LLMs. One of the primary, and arguably most successful, use cases for these tools is language generation. When prompted, a chatbot can generate text, format this text based on instructions or examples, and produce a slick product which, short of a quick review to ensure the content checks out, is ready to go. These LLMs are based on a “vast amount of text”; in short, far more text than you will ever be able to provide to train a whole model.

What LLMs are missing, however, is your context. There are many, fairly easy, methods for providing them context. You can allow them access to your documents, at work or at home, so the chatbot can “see” the content of your documents, the structures, and even the writing style. That is very much your context, allowing the chatbot to compose in a way that reflects your home or work artifacts and generate new text combining the existing LLM with your specific context. Using additional context moves an LLM from generic to specific, from a wider world of words to your world of words.
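
The “provide your context” pattern described above is, at its simplest, retrieval augmented generation (RAG): fetch the most relevant of your documents and prepend it to the prompt before the LLM ever sees the question. Here is a minimal sketch, using plain word-overlap scoring in place of the vector embeddings a real system would use, and with the actual LLM call omitted; the document texts are invented for illustration.

```python
from collections import Counter

def tokenize(text: str) -> Counter:
    """Bag-of-words token counts; a real system would use embeddings."""
    return Counter(text.lower().split())

def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    scored = sorted(documents,
                    key=lambda d: sum((tokenize(d) & q).values()),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our style guide requires sentence case in all headings.",
    "Quarterly revenue is reported in the finance portal.",
]
prompt = build_prompt("What case do headings use?", docs)
```

The point of the sketch is only that the generic model never changes; your world of words enters through the retrieved context in the prompt.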

At enterprise scale, the same method applies. If your organization is large, has a lot of public exposure and presence, and has a history of public interaction on the Internet and in social media, LLMs are going to know a lot about you even without additional context. I think, however, that the same reason the market for reusable, one-size-fits-all taxonomies has never taken off applies here: every organization believes, whether true or not, that it is unique and special. Certain industries like healthcare, life sciences, pharmaceuticals, and finance have well-defined, and often extremely complex, ontologies they can adopt; most other industries, and most functions within a company, do not. In my experience, marketing is a great example. Despite the common needs, I have never seen a marketing department adopt any public standard. They build from scratch even when a majority of the terms are commonly available.

In these cases, building taxonomies and ontologies to add context specific to your organization provides LLMs with both words and structures modeling the world of your domain. In fact, it is becoming more common that the development of these taxonomies is done through human-chatbot interaction. A taxonomist can provide glossaries, metadata schemas exported as spreadsheets, and documents to give chatbots the raw materials to extract entities, cluster topics, compare values across documents, and perform other processes that once required text analytics tools and, sometimes, human intervention in the form of rule writing. The speed of taxonomy and ontology development is increasing. Like other iterative feedback processes, the taxonomist and LLM work together to create domain schemas in the form of taxonomies and ontologies that provide additional guidance to future “manual” and automated processes with LLMs in the mix.
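
The entity-extraction and topic-clustering work described above can be sketched crudely. The example below groups glossary terms into candidate taxonomy branches by shared head noun, a deliberately simple stand-in for what an LLM or text analytics tool would do; the glossary terms are invented for illustration.

```python
from collections import defaultdict

def cluster_by_head_noun(terms: list[str]) -> dict[str, list[str]]:
    """Group terms whose last word matches into candidate branches."""
    branches: dict[str, list[str]] = defaultdict(list)
    for term in terms:
        head = term.lower().split()[-1]  # last word as candidate broader concept
        branches[head].append(term)
    return dict(branches)

glossary = ["Email Campaign", "Print Campaign", "Landing Page", "Product Page"]
clusters = cluster_by_head_noun(glossary)
# e.g. {"campaign": ["Email Campaign", "Print Campaign"], "page": [...]}
```

In the iterative loop the post describes, output like this is a draft the taxonomist reviews and corrects, not a finished hierarchy.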

Pure speculation on my part, but is the niche and sometimes still esoteric and obscure field of taxonomy and ontology design being replaced, for better or for worse, with LLM use? Or, more specifically, is it being viewed as a replacement for taxonomy and ontology building and the experts who do it? As I stated above, I have seen a shift from the business of taxonomies to the technology of ontologies.

I Disconnect from You

In my opinion, the roles of a taxonomist and of a technically skilled ontologist are still separate. While many people in the industry have the skills to do the work of both, the paths to the two roles have been different. Many taxonomists have library science degrees. They likely have technical skills, but are more focused on the information science aspect of taxonomy and ontology development, interfacing directly with the business and providing other services, such as research, business analysis, and support for use cases relying on semantic models. Ontologists are typically computer scientists who can code and develop the technical infrastructure for ontologies. Finding people who can do both, or who want to, has not been common. This may be changing. Certainly the job postings are asking for both, with, in my view, a leaning toward the technical.

Again, speculatively, is there a shift toward more technical resources in support of rising AI use in organizations? Is there a move to cut out the intermediary roles of taxonomists to let the business owners and technical implementers of taxonomies and ontologies interface more directly? If so, what do organizations lose in the process?

The Connect

If I were reading this blog, I would think the author was trying to sell me on the value of taxonomists with the “soft” skills of research, business stakeholder interaction, and translating business requirements into taxonomies for the more technical resources who support their implementation and move them into production. After catching up on a few seasons of Landman and hearing in my head at some point in every episode, “Brought to you by [insert name of large oil company]”, maybe I’m a little sensitive to reading between the lines. You don’t have to read between the lines here: I am selling you on the value of taxonomists, for all of the reasons listed above. If there is indeed a shift from the business skills of taxonomists to the technical skills of ontologists, instead of employing the excellent skills of both, then your organization is missing out on a valuable resource who can work alongside AI technologies to bring business requirements and practical domain-building skills to bear.

In summary, I believe there are two necessary components to bridge the seeming disconnect between AI and the foundational data quality governance needed to make AI operational:

  1. A taxonomist who can
    1. build taxonomies and ontologies to create domain-specific semantic models representing the business, 
    2. provide business analysis and requirements for technical implementation, and
    3. be the human-in-the-loop working with AI tools to continue building out, expanding, and governing semantic models, and
  2. Technical engineers who can
    1. operationalize ontologies by building data pipelines to AI tools, and
    2. focus on the engineering aspects of sharing out and productionizing ontologies for use across the enterprise.
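As a rough illustration of the engineering half of this pairing, operationalizing a taxonomy in a pipeline can be as simple as a dictionary-based tagging step that applies controlled-vocabulary labels to content before it reaches downstream AI tools. The vocabulary and taxonomy paths below are invented for illustration; a production pipeline would load them from the governed semantic model rather than hard-code them.

```python
# Hypothetical controlled vocabulary mapping trigger terms to taxonomy paths.
taxonomy = {
    "invoice": "Finance > Billing",
    "campaign": "Marketing > Campaigns",
    "onboarding": "HR > Employee Lifecycle",
}

def tag(text: str, vocab: dict[str, str]) -> list[str]:
    """Return each taxonomy path whose trigger term appears in the text."""
    lowered = text.lower()
    return sorted(path for term, path in vocab.items() if term in lowered)

tags = tag("Please review the Q3 invoice for the spring campaign.", taxonomy)
```

The taxonomist governs the vocabulary; the engineer wires steps like this into ingestion and retrieval pipelines, which is exactly the division of labor argued for above.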

An enterprise with both of these roles isn’t creating unnecessary resource overhead or additional layers effectively slowing the path from automation to implementation; rather, the two roles work in harmony to clarify requirements and optimize the use of AI in a variety of applications meeting business use cases. Taxonomists and ontologists are bridges between the business and the technical implementation enabling business needs.

Semantic Data 2025 Themes and Trends


“Trust in me, just in me / Shut your eyes and trust in me” – The Jungle Book

I attended the Henry Stewart Semantic Data Conference, co-located with HS DAM, in New York City a few weeks ago. As I’ve done with KMWorld in the past, I’m going to summarize some themes and trends I took away from both conferences, with an emphasis on Semantic Data.

Inevitable AI

The most common theme of both conferences, unsurprisingly, was artificial intelligence (AI) in all of its forms, applications, and impact. Broadly speaking, the key takeaway across all of the presentations and discussions was: this is happening. Whether it’s baked into digital asset management (DAM) systems (hint: it is), thrown wildly at use cases until something sticks, or carefully managed with strict governance, guardrails to protect the organization, its people, and the people they serve, and measurement to understand the effectiveness of different large language models (LLMs), AI is happening. So what do we, as digital asset and semantic data professionals, do about it? What is our role in the use of AI in the organization and in the public sphere? What are our responsibilities?

From the Semantic Data Conference, several themes emerged:

  • Organizations are going to experiment with generative AI models to develop workable pipelines with humans in the loop;
  • Context is key, and organizations can develop domain-specific and constrained semantic models to be used in conjunction with external LLMs;
  • It’s incumbent upon all of us to develop valid, organization-specific, curated training data sets that give machine learning models the context to output reasonable results.

Themes from the Digital Asset Management Conference included:

  • AI can speed up the generation of assets and the automated application of metadata to those assets;
  • Access to clean, curated metadata is critical, both from taxonomies and sources like data lakes;
  • Metadata as a source of truth for embedded AI can lead to better analytics;
  • Asset provenance is essential for usage and rights management, especially when AI is involved.

Metadata Is Critical

That’s it. That’s the story. Metadata is critical. It has been, and it will continue to be. But, maybe, organizations are more aware of the importance of metadata because of the lightning fast rise of AI. Metadata is critically important as applied to digital assets, and semantic metadata powers better asset connections, discovery, personalization, and analytics.

Core to the importance of metadata is the importance of trust. Metadata quality must be trusted. The data and content to which metadata is applied must be trusted. Quality, trusted data leads to quality, trusted content and training sets that can feed into AI pipelines. Similarly, legal and reputational risks can be mitigated by ensuring the quality of information and data, especially as applied to compliance and usage rights.

Since semantic models are a source of truth for quality metadata, taxonomies and ontologies will grow more complex over time as they expand to support a variety of use cases. Complexity sounds like a negative, but the world is complex, and semantic models are meant to represent organizational domains, which are by necessity complex. Complex semantic models support a variety of use cases, even if they do take more conscientious planning, development, and governance. Within these complex models are fit-for-purpose structures addressing use cases.

As with AI processes, developing, managing, and governing metadata in all its forms involves humans in the loop. Even as the identification, extraction, and application of metadata improves with AI, humans need to be involved in the process to add, remove, and quality check automatically applied metadata. As pipeline processes improve, reaching a specified threshold of metadata accuracy may reduce the need for human intervention and review.

Context and Trust

If I had to boil the conference down to two keywords (or, maybe, if I could only apply two metadata tags to the conference), they would be context and trust. Data and content require context, and semantic models are one way to provide it, whether for machine learning pipelines or direct human interaction with content.