Information Panopticon


The AI-Taxonomy Disconnect

https://pixabay.com/photos/telephone-contact-us-call-contact-4440525/

“Me, I disconnect from you.” – Gary Numan, Me! I Disconnect from You

I have seen a significant decline in the number of taxonomist positions available. Wondering if I was in a myopic bubble based on my location or choice of job boards, I’ve been asking colleagues if they see the same thing. The general consensus is, yes, there are fewer taxonomist jobs available even as foundational data structures become more important with AI tools. I have seen an increase in, or at least a steady availability of, ontologist jobs, many requiring more technical expertise (the ability to create Python data ingestion pipelines and retrieval augmented generation (RAG) systems) than I have seen in the past. Again, this may be a matter of where I live or where I am seeking jobs, but it seems jobs in the taxonomy and ontology field are becoming more technical and less focused on the business operations side of working with stakeholders to provide analysis and guidance on semantic frameworks.

I suspect this shift is driven by a wider adoption of AI tools and a contraction in the job market. Employers seem to be seeking a single resource to do both the business and technical sides of implementing semantic structures and the foundational components for building out applications. I wonder if another factor is also the belief that large language models (LLMs) are a replacement for semantic models. If true, there are several reasons I can see that may be driving these beliefs.

Speed to Business

As I have touched upon before in my blog (Friction and Complexity and The Taxonomy Tortoise and the ML Hare), the manual, slower, but deliberate curation of controlled vocabularies can be seen as a roadblock to business speed and agility. There are several valid points here.

One area of pushback I see frequently in organizations is the response time from initially requesting a new concept, taxonomy branch, or vocabulary until it is available in production. The ownership of the concepts and data is, to some degree, taken out of the hands of the business users and put under the control of taxonomists who incorporate these concepts into centralized semantic models (taxonomies, thesauri, ontologies). Even with governance models including service level agreements stating turnaround times and taxonomy availability, not every group in the organization is going to see this centralized service as a benefit. Rather than wait for enterprise-level support, stakeholders may develop workarounds in services and tools to support their own use cases. As we know, this decentralizing of schemas and tools creates a fragmented landscape: differing terminology, varying functionality in the tools available to manage taxonomies, and inconsistent processes for handling data and content and tagging it with metadata.

Similarly, enterprise taxonomists serving many areas of the organization may seem too domain agnostic to serve the variety of use cases served by controlled vocabularies. While taxonomists do not need to be domain experts in the areas covered by semantic models, the perception may be that their time spent ramping up in a domain would be better spent having the subject matter experts in those domains build their own models. There is some validity here if the domains are truly standalone and operate to serve only those domain use cases. However, tying together various domain areas to form enterprise-wide knowledge graphs seems to be the direction most organizations want to go. If that’s the case, then a centralized taxonomy team as a service to the entire enterprise makes a lot of sense.

Given these counterpoints to slowly developing semantic models, why not, one may ask, simply ask machine learning models to provide the schemes we need to organize and optimize information?

Words Are Words Are Worlds

Are LLMs seen as a replacement for, or at least a viable alternative to, taxonomies (using the term as a broad umbrella for all controlled vocabularies)?

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) and provide the core capabilities of modern chatbots. LLMs can be fine-tuned for specific tasks or guided by prompt engineering. These models acquire predictive power regarding syntax, semantics, and ontologies inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on (Wikipedia).

Certainly building your own language model reflecting all of the ways that natural language queries could be asked is a non-starter when there are clearly highly performing, public LLMs available at our fingertips. The chatbots ChatGPT, Google Gemini, and Microsoft Copilot are some of the more familiar tools based on foundational LLMs. One of the primary, and arguably most successful, use cases for these tools is language generation. When prompted, a chatbot can generate text, format this text based on instructions or examples, and produce a slick product which, short of a quick review to ensure the content checks out, is ready to go. These LLMs are based on a “vast amount of text”; in short, a lot more text than you will be able to provide to train a whole model.

What LLMs are missing, however, is your context. There are many, fairly easy, methods for providing them context. You can allow them access to your documents, at work or at home, so the chatbot can “see” the content of your documents, the structures, and even the writing style. That is very much your context, allowing the chatbot to compose in a way that reflects your home or work artifacts and generate new text combining the existing LLM with your specific context. Using additional context moves an LLM from generic to specific, from a wider world of words to your world of words.
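As a minimal sketch of this pattern (all names here are illustrative, not a specific product’s API), grounding a prompt in your own documents can be as simple as retrieving the most relevant text and prepending it to the question; a trivial keyword-overlap retriever stands in for the vector search a real RAG system would use:

```python
# Minimal sketch of grounding an LLM prompt in your own documents.
# A keyword-overlap retriever stands in for a real vector search;
# the assembled prompt would then be handed to whatever chatbot API you use.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend retrieved context so the model answers in *your* world of words."""
    context = "\n---\n".join(retrieve(query, documents))
    return f"Use only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our style guide capitalizes product names.",
    "Quarterly reports are tagged with the finance concept.",
    "The cafeteria menu changes weekly.",
]
prompt = build_prompt("How are quarterly reports tagged?", docs)
# prompt now carries the finance-tagging document as context for the LLM.
```

The point of the sketch is the shape of the flow, not the retrieval method: the generic model stays as-is, and your context rides along in the prompt.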

At enterprise scale, the same method applies. If your organization is large, has a lot of public exposure and presence, and has a history of public interaction on the Internet and in social media, LLMs are going to know a lot about you even without additional context. I think, however, that the reason the market for reusable, one-size-fits-all taxonomies has never taken off is that every organization believes, whether true or not, that it is unique and special. Outside of certain industries like healthcare, life sciences, pharmaceuticals, and finance, which have well-defined, and often extremely complex, ontologies they can adopt, most industries and functions within a company have no such standards. In my experience, marketing is a great example. Despite the common needs, I have never seen a marketing department adopt any public standard. They build from scratch even when a majority of the terms are commonly available.

In these cases, building taxonomies and ontologies to add context specific to your organization provides LLMs with both words and structures modeling the world of your domain. In fact, it is becoming more common that the development of these taxonomies is being done with human-chatbot interaction. A taxonomist can provide glossaries, metadata schemas output in spreadsheets, and documents to give chatbots the raw materials to extract entities, cluster topics, compare values across documents, and perform other processes that once required text analytics tools and, sometimes, human intervention in the form of rule writing. The speed of taxonomy and ontology development is increasing. Like other iterative feedback processes, the taxonomist and LLM work together to create domain schemas in the form of taxonomies and ontologies that provide additional guidance to future “manual” and automated processes with LLMs in the mix.

Pure speculation on my part, but is the niche and sometimes still esoteric and obscure field of taxonomy and ontology design being replaced, for better or for worse, with LLM use? Or, more specifically, is it being viewed as a replacement for taxonomy and ontology building and the experts who do it? As I stated above, I have seen a shift from the business of taxonomies to the technology of ontologies.

I Disconnect from You

In my opinion, the roles of a taxonomist and of a technically skilled ontologist are still separate. While many people in the industry have the skills to do the work of both, the paths to the two roles have been different. Many taxonomists have library science degrees. They likely have technical skills, but are more focused on the information science aspect of taxonomy and ontology development, interfacing directly with the business and providing other services, such as research, business analysis, and support for use cases relying on semantic models. Ontologists are typically computer scientists who can code and develop the technical infrastructure for ontologies. Finding resources who can, or who like to, do both has not been common. This may be changing. Certainly the open roles are asking for both, with, in my view, a leaning toward the technical.

Again, speculatively, is there a shift toward more technical resources in support of rising AI use in organizations? Is there a move to cut out the intermediary roles of taxonomists to let the business owners and technical implementers of taxonomies and ontologies interface more directly? If so, what do organizations lose in the process?

The Connect

If I were reading this blog, I would think the author was trying to sell you on the value of taxonomists: those with more of the “soft” skills of research, business stakeholder interaction, and translation of business requirements into taxonomies that the more technical resources can implement and move to actionable production. After catching up on a few seasons of Landman and hearing in my head at some point in every episode, “Brought to you by [insert name of large oil company]”, maybe I’m a little sensitive to reading between the lines. You don’t have to. I am selling you on the value of taxonomists, for all of the reasons I’ve listed above. If there is indeed a shift from the business skills of taxonomists to the technical skills of ontologists, instead of having the excellent skills of both, then your organization is missing out on a valuable resource who can work alongside AI technologies to bring business requirements and practical domain-building skills to bear.

In summary, I believe there are two necessary components to bridge the seeming disconnect between AI and the foundational data quality governance needed to make AI operational:

  1. A taxonomist who can
    1. build taxonomies and ontologies to create domain-specific semantic models representing the business, 
    2. provide business analysis and requirements for technical implementation, and
    3. be the human-in-the-loop working with AI tools to continue building out, expanding, and governing semantic models, and
  2. Technical engineers who can
    1. operationalize ontologies by building data pipelines to AI tools, and
    2. focus on the engineering aspects of sharing out and productionizing ontologies for use across the enterprise.

An enterprise with both of these roles isn’t creating unnecessary resource overhead or additional layers that slow the path from automation to implementation; rather, the two roles work in harmony to clarify requirements and optimize the use of AI in a variety of applications meeting business use cases. Taxonomists and ontologists are bridges between the business and the technical implementation that enables business needs.

