Information Panopticon


The Taxonomy Tortoise and the ML Hare

“I knew I shoulda’ taken that left turn at Albuquerque.” – Bugs Bunny

For better or worse, much of my childhood was informed by Looney Tunes, Monty Python, and a diet of science fiction ranging from the profound to the disjointedly camp. As such, I expect the absurd and am wildly skeptical of easy answers. That same foundation of science fiction books and films compels me to speculate that artificial intelligence will become an ever more real presence in our lives, with actions ranging from locking us out of airlocks and starting global thermonuclear war to answering our most pressing global problems.

The long-promised advantages of artificial intelligence finally seem to be reaching a point at which they can be put to enterprise use, including parsing, and even understanding, large amounts of text and data at speed. These recent successes raise the question: if machine learning models can operate on data at high volume and velocity, why shouldn’t they be used to come up with answers on the fly from large amounts of data internal or external to an organization? Well, in fact, they already are, and, in my opinion, they should be, but not without some acknowledgment of absurdity and a certain degree of skepticism.

I’m a firm believer in defining semantic models in the form of taxonomies and ontologies to be used as a foundational schema for an organization’s data. One of the arguments against investment in taxonomies is the time it takes to create them and the amount of maintenance required to sustain them. In a world in which what is trending changes frequently, user tastes are fickle, and the jargon associated with these trends passes quickly, the desire to skip the tortoise-like pace of building taxonomies in favor of other, faster technologies is tempting. But, as the hare who lost the race laments, “I knew I shoulda’ taken that left turn at Albuquerque.” Or, to put it another way: let’s check the map before we go racing off in the wrong direction.

Let’s talk semantics. Put simply, ontologies are semantic structures that define one or more domains. They describe the types of things in a domain (classes), how those things can relate to each other (relationships, predicates, or edges), the labeled fields used to describe them (properties), and the instances of things (subjects, objects, or, more plainly, taxonomy concepts). An ontology describing the general domain, together with taxonomies containing the specific instances within one or more domains, can serve as a map of your organization. These semantic structures represent the organization in all its complexity: they specify the concepts important to the company and how those concepts relate to each other, to data, and to content. Once data or content is connected, we can call the entire structure a knowledge graph.
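The pieces named above, classes, relationships, properties, and taxonomy concepts, can be sketched as a tiny triple store. This is a toy illustration in Python; every name in it (the product, the predicates) is invented for the example, not drawn from any real ontology:

```python
# A minimal triple store: every assertion is a (subject, predicate, object) triple.
triples = set()

def assert_fact(s, p, o):
    triples.add((s, p, o))

# Classes: the types of things in the domain.
assert_fact("Product", "is_a", "Class")
assert_fact("Color", "is_a", "Class")

# Instances (taxonomy concepts) typed against those classes.
assert_fact("TrailRunner 5", "instance_of", "Product")
assert_fact("Blue", "instance_of", "Color")

# Relationships (predicates) connecting instances to each other.
assert_fact("TrailRunner 5", "has_color", "Blue")

# Properties: labeled fields describing a thing.
assert_fact("TrailRunner 5", "preferred_label", "TrailRunner 5 Shoe")

def objects(subject, predicate):
    """All objects asserted for a given subject-predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("TrailRunner 5", "has_color"))  # {'Blue'}
```

In practice this shape is exactly what RDF and SKOS standardize; the point here is only that classes, relationships, properties, and concepts all reduce to the same subject-predicate-object pattern.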

In short, ontologies, taxonomies, and content are the organization’s view of itself, the world, and where it lives in it.

Large language models (LLMs) have the ability to generate text, answer natural language questions, and classify content. Most publicly available LLMs, like ChatGPT, are trained on publicly available information. It is also possible to supply these LLMs with your own training sets of documents and language samples to develop answers more applicable to your own organization. Wisely, many organizations tightly control what information can be presented to these AI tools to avoid company information leaks or supplying competitors with proprietary information.

What’s lacking in these hare-rapid models, however, is the organizational perspective. They are very good at answering general questions and making factual assertions from text, but they require tailored training content, built with specific use cases in mind, to generate answers specific to an organization’s needs. There can be a temptation to feed one of these models a large quantity of organizational content to train it faster. However, the span of topics, language, jargon, and acronyms used in an organization can yield unsatisfying or unpredictable results. Imagine, if you will, the amount and variety of content in any one of your company’s content management systems. Now imagine asking a machine learning model to analyze and make sense of it all without guidance. You can index all of your own content, but without a framework, what sense does it make?

At this moment, the hare and the tortoise must strike a deal if they both want to win. To improve the performance of LLMs and other machine learning models, a domain ontology specific to your organization, defining the concepts, their synonyms and acronyms, and how they relate to each other, can be supplied to the model as a schema. Semantic models are, after all, assertions in the form of triple statements (subject-predicate-object). Ontologies establish factual statements as determined by your organization’s use cases and, hence, provide patterns which machine learning models can use. Conceptual proximity can be derived from taxonomy hierarchies (these concepts are closely related because they share a parent-child relationship) and from associative relationships (these concepts, separated across several taxonomies, are nonetheless closely related because a direct associative relationship links them). Semantic models provide factual statements, built slowly over time from business use cases, which can augment and improve LLMs.
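The two proximity signals described above, shared hierarchy and direct associative links, can be sketched as a crude scoring function. All taxonomy terms below are hypothetical stand-ins, and the scores are arbitrary; real systems (e.g., SKOS-based ones) use `broader`/`narrower`/`related` properties for the same idea:

```python
# Hypothetical taxonomy fragments: parent-child (broader) links and
# a cross-taxonomy associative link.
broader = {
    "Running Shoes": "Footwear",
    "Hiking Boots": "Footwear",
    "Trail Maps": "Navigation",
}
related = {("Hiking Boots", "Trail Maps")}  # direct associative relationship

def proximity(a, b):
    """Toy relatedness score: 2 for a direct associative link,
    1 for taxonomy siblings (shared parent), 0 otherwise."""
    if (a, b) in related or (b, a) in related:
        return 2
    if broader.get(a) is not None and broader.get(a) == broader.get(b):
        return 1
    return 0

print(proximity("Running Shoes", "Hiking Boots"))  # siblings under Footwear -> 1
print(proximity("Hiking Boots", "Trail Maps"))     # associative link -> 2
print(proximity("Running Shoes", "Trail Maps"))    # no relationship -> 0
```

The interesting case is the second one: the two concepts live in different hierarchies, yet the associative link makes them score as more related than the siblings, which is exactly the signal a flat keyword index cannot provide.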

Not only can we think of semantic models as a collection of factual statements according to your organizational domain and use cases, we can also think of them as a summary, requiring the LLM to ingest far less information to reach the same factual conclusion. For example, you could provide the model with a huge amount of training data stating that a particular SKU-level product is available in the color blue. If this is a factual assertion in your semantic models (Product name has color Blue), however, then the fact can be tagged to a single product representation in a database and, in turn, applied to thousands of real-world SKU instances. Semantic models distill thousands of instances of truth across an organization into a compact collection of ontology structural elements and taxonomic instances. To borrow a joke from Steven Wright, who tells us he has a map of the United States which is actual size, your organizational map can be drawn at a much smaller scale.
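The SKU example above can be made concrete with a small sketch. The product name, SKU numbering scheme, and sizes here are all invented for illustration; the point is that one product-level assertion fans out to every SKU instance:

```python
# One product-level fact in the semantic model...
product_facts = {("TrailRunner 5", "has_color", "Blue")}

# ...and many real-world SKU instances of that product
# (8 sizes x 3 widths = 24 SKUs, all hypothetical).
skus = {f"TR5-{size}-{width}": "TrailRunner 5"
        for size in range(6, 14) for width in ("N", "M", "W")}

def sku_facts(sku):
    """Project product-level assertions down onto a single SKU."""
    product = skus[sku]
    return {(sku, p, o) for s, p, o in product_facts if s == product}

print(len(skus))             # 24
print(sku_facts("TR5-9-M"))  # {('TR5-9-M', 'has_color', 'Blue')}
```

One triple in the model stands in for 24 SKU-level facts here; at enterprise scale, the same single assertion stands in for thousands, which is the "smaller-scale map" the paragraph describes.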

Yes, it’s certainly true that given large amounts of data, machine learning models or text analytics can identify all kinds of important concepts. These concepts (and the fact assertions between them) can be a great pipeline to feed into taxonomy and ontology construction. But I am skeptical of machine learning models generating taxonomies and ontologies from organizational data and content unless there is heavy human-in-the-loop curation to reconcile the absurdities that, I believe, inevitably creep in. And, yes, it’s certainly true that this curation proceeds at a tortoise pace, but once these concepts and assertions are built into semantic models, the ongoing maintenance and governance demand far less time and effort.

Those slow semantic model builds enable fast-moving machine learning models and LLMs to be grounded in organizational truths, allowing for expansion, augmentation, and question-answering at a much faster pace but backed with foundational truths as asserted by your organization.

Be the tortoise first and foremost and the hare will follow.


3 Comments


  2. Marjorie Hlava says:

    Great post! I agree; it seems to me that the taxonomy/ontology/thesaurus not only provides perspective, as you point out, but also provides guard rails for the training. Keeping the human in the loop ensures that the point of view of the organization is represented in the resulting enriched collection. Tagging the data first allows the training model to provide significantly superior results. It does not work to train on the collection and then add the tagging from the taxonomy.
