Category Archives: taxonomy

The Taxonomy Tortoise and the ML Hare

03/26/2024 12:27

“I knew I shoulda’ taken that left turn at Albuquerque.” – Bugs Bunny

For better or worse, much of my childhood was informed by Looney Tunes, Monty Python, and a diet of science fiction ranging from the profound to the disjointedly camp. As such, I expect the absurd and am wildly skeptical of easy answers. Additionally, my foundation of science fiction books and films compels me to speculate that artificial intelligence will become an increasingly real presence in our lives, with actions ranging from locking us out of airlocks and starting global thermonuclear war to providing answers to our most pressing global problems.

The long-promised advantages of artificial intelligence seem finally to be reaching a point at which they can be utilized for enterprise purposes, including parsing, and even understanding, large amounts of text and data at rapid speed. The recent successes raise the question: if machine learning models can operate on data at high volume and velocity, then why shouldn’t they be used to come up with answers on the fly based on large amounts of data internal or external to an organization? Well, in fact, they already are, and, in my opinion, they should be, but not without some acknowledgment of absurdity and a certain degree of skepticism.

I’m a firm believer in defining semantic models in the form of taxonomies and ontologies to be used as a foundational schema for an organization’s data. One of the arguments against investment in taxonomies is the time it takes to create them and the amount of maintenance they require. In a world in which what is trending changes frequently, user tastes are fickle, and the jargon associated with these trends passes quickly, it is tempting to skip the tortoise-like pace of building taxonomies in favor of other, faster technologies. But, as the hare who lost the race to the tortoise laments, “I knew I shoulda’ taken that left turn at Albuquerque.” Or, let’s consider checking the map before we go racing off in the wrong direction.

Let’s talk semantics. Putting it simply, ontologies are semantic structures which define one or more domains. They describe the types of things in the domain (classes), how these things can relate to each other (relationships, predicates, or edges), what labeled fields are used to describe these things (properties), and the instances of things (subjects, objects, or, more plainly, taxonomy concepts). Ontologies describing the general domain and taxonomies including the specific instances within one or more domains can be created as a map of your organization. These semantic structures represent the organization in all of its complexity. They specify the concepts important to the company and how these concepts relate to each other, data, and content. Once data or content is added, we can call this entire structure a knowledge graph.
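To make those four building blocks concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular taxonomy or ontology tool stores its data; the class names, the `ex:` identifiers, and the `KnowledgeGraph` type are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A taxonomy concept: an instance of one ontology class."""
    uri: str    # hypothetical identifier
    label: str  # a property (labeled field) describing the concept
    cls: str    # the ontology class the concept belongs to


@dataclass
class KnowledgeGraph:
    classes: set = field(default_factory=set)
    # relationship name -> (class of subject, class of object)
    relationships: dict = field(default_factory=dict)
    concepts: dict = field(default_factory=dict)
    triples: set = field(default_factory=set)

    def add_concept(self, concept: Concept) -> None:
        if concept.cls not in self.classes:
            raise ValueError(f"Unknown class: {concept.cls}")
        self.concepts[concept.uri] = concept

    def assert_triple(self, subject_uri, predicate, object_uri) -> None:
        # The ontology decides which kinds of things may be related.
        domain, range_ = self.relationships[predicate]
        subject = self.concepts[subject_uri]
        obj = self.concepts[object_uri]
        if subject.cls != domain or obj.cls != range_:
            raise ValueError(f"'{predicate}' expects {domain} -> {range_}")
        self.triples.add((subject_uri, predicate, object_uri))


graph = KnowledgeGraph(
    classes={"Product", "Color"},
    relationships={"has color": ("Product", "Color")},
)
graph.add_concept(Concept("ex:court-shoe", "Court shoe", "Product"))
graph.add_concept(Concept("ex:blue", "Blue", "Color"))
graph.assert_triple("ex:court-shoe", "has color", "ex:blue")
```

The ontology part (classes and relationships) constrains what the taxonomy part (concepts and triples) is allowed to say, which is the sense in which it acts as a schema.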

In short, ontologies, taxonomies, and content are the organization’s view of itself, the world, and where it lives in it.

Large language models (LLMs) have the ability to generate text, answer natural language questions, and classify content. Most publicly available LLMs, like ChatGPT, are trained on publicly available information. It is also possible to supply these LLMs with your own training sets of documents and language samples to develop answers more applicable to your own organization. Wisely, many organizations tightly control what information can be presented to these AI tools to avoid company information leaks or supplying competitors with proprietary information.

What’s lacking in using these hare-rapid models, however, is the organizational perspective. They are very good at answering general questions and making factual assertions from text, but they require tailored training content with specific use cases in mind to generate answers specific to an organization’s needs. There can be a temptation to feed one of these models a large quantity of organizational content to train them faster. However, the span of topics, language, jargon, and acronyms used in an organization can yield unsatisfying or unpredictable results. Imagine, if you will, the amount and variety of content in any one of your company’s content management systems. Now imagine asking a machine learning model to analyze and make sense of it all without guidance. You can index all of your own content, but without a framework, what sense does it make?

At this moment, the hare and the tortoise must strike a deal if they both want to win. To improve the performance of LLMs and other machine learning models, a domain topology specific to your organization defining the concepts, their synonyms and acronyms, and how they relate to each other, can be used as a schema input into the model. Semantic models are, after all, assertions in the form of triple statements (subject-predicate-object). Ontologies establish factual statements as determined by your organization’s use cases and, hence, provide patterns which can be used by machine learning models. Lexical proximity can be gathered from taxonomy hierarchies (these concepts are more closely related because they share a parent-child relationship) and associative relationships (these concepts, separated across several taxonomies, are actually very closely related because they have a direct associative relationship between them). Semantic models provide factual statements, built slowly over time based on business use cases, which can augment and improve LLMs.
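As a rough sketch of how proximity might be read off a semantic model, the Python below treats hierarchical and associative relationships as edges and counts the hops between two concepts. The concepts and the breadth-first approach are my own illustrative assumptions; real systems may weight relationship types differently.

```python
from collections import deque

# Hypothetical sample data: child -> parent (hierarchical relationships)
broader = {
    "Basketball shoes": "Footwear",
    "Running shoes": "Footwear",
    "Footwear": "Products",
    "Basketball": "Sports",
}
# Hypothetical associative relationship across two taxonomies
related = [("Basketball shoes", "Basketball")]


def build_neighbors(broader, related):
    """Treat every hierarchical or associative link as an undirected edge."""
    neighbors = {}
    for a, b in list(broader.items()) + list(related):
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def distance(neighbors, start, goal):
    """Number of relationships on the shortest path, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt == goal:
                return hops + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


with_related = build_neighbors(broader, related)
hierarchy_only = build_neighbors(broader, [])
```

With the associative relationship in place, Basketball shoes and Basketball are one hop apart; with the hierarchies alone, they are not connected at all, even though Basketball shoes and Running shoes are two hops apart through their shared parent.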

Not only can we think of semantic models as a collection of factual statements according to your organizational domain and use cases, we can also think of them as a summary, requiring the LLM to ingest a lot less information to reach the same factual conclusion. For example, you can provide the model with a huge amount of training data stating that a particular SKU-level product is available in the color blue. If this is a factual assertion in your semantic models (Product name has color Blue), however, then this fact can be tagged to a single product representation in a database and, in turn, applied to thousands of real-world SKU instances. Semantic models distill thousands of instances of truths across an organization into a collection of ontology structural elements and taxonomic instances. To cite a joke by Steven Wright, in which the comic tells us he has a map of the United States which is actual size: your organizational map can be represented at a much smaller scale.
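The summarizing effect can be sketched in a few lines of Python. The product identifier, SKU numbering, and lookup function are invented for illustration; the point is only that the color is asserted once and inherited by every SKU.

```python
# One assertion in the semantic model (hypothetical identifiers)
product_facts = {
    "ex:product/court-shoe": {"has color": "Blue"},
}

# Thousands of real-world SKU instances all point at the same product
sku_to_product = {
    f"SKU-{n:05d}": "ex:product/court-shoe" for n in range(1, 5001)
}


def sku_fact(sku, predicate):
    """Resolve a fact for a SKU by looking it up on its product concept."""
    product = sku_to_product[sku]
    return product_facts[product].get(predicate)


color = sku_fact("SKU-00042", "has color")
```

Five thousand SKUs resolve to the same answer from a single stored statement, and changing that one statement changes the answer for all of them.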

Yes, it’s certainly true that given large amounts of data, machine learning models or text analytics can identify all kinds of important concepts. These concepts (and fact assertions between concepts) can be a great pipeline to feed into taxonomy and ontology construction. I am skeptical of machine learning models generating taxonomies and ontologies based on organizational data and content unless there is heavy human-in-the-loop curation to reconcile those absurdities which I believe inevitably creep in. And, yes, it’s certainly true that this curation is potentially at a tortoise pace, but once these concepts and assertions are built into semantic models, the ongoing maintenance and governance demands less time and effort.

Those slow semantic model builds enable fast-moving machine learning models and LLMs to be grounded in organizational truths, allowing for expansion, augmentation, and question-answering at a much faster pace but backed with foundational truths as asserted by your organization.

Be the tortoise first and foremost and the hare will follow.

Semantics and Risk

02/22/2024 10:47

“There are no facts, only interpretations.” – Friedrich Nietzsche

Behind every creation–a work of art, musical composition, outstanding sporting performance–there is a creator. Behind a creator’s creation are copyright laws and license agreements detailing how that creation can be bought, sold, represented, and reused. In retail, the works and likenesses of artists and creators appear on merchandise which may be globally distributed. Whether it is the recognizable Jordan Jumpman icon or Bob Marley’s face on a t-shirt, there are contractual obligations and regulations which must be followed to avoid risk and exposure for both the creator and the contracting company. The use of names, images, and likenesses is captured in contractual language, defining the way an image or other work can be used and providing the legal foundations to make sure royalties are paid to the artist and their representative company.

In many organizations that have arrangements with artists or athletes, there are dedicated systems for managing contract documents, designed specifically to handle that content and to apply the appropriate metadata to describe it. The digital assets protected by these contracts, such as works of art, images, video, audio, and the like, can be stored in a digital asset management (DAM) system, which specializes in maintaining technical, administrative, structural, descriptive, preservation, usage, and rights metadata. In digital visual works, these rights and permissions may be captured as rights metadata accompanying the image file. Similarly, audio works can carry this information on physical media and, more commonly, embedded in the individual audio files and the bound collection of an album. Rights and permissions are thus often associated with the artistic creation in both physical and electronic formats.

The metadata to populate values in these platforms can be created and maintained in a taxonomy management system to ensure the application of consistent values in one or more consuming systems. Taxonomies can contain centralized, standardized values describing asset usage and rights and can also provide descriptive metadata values; that is, the metadata that describes what the content is about or what is represented in it. In the world of semantic technologies, taxonomies and ontologies model domains to provide meaning which can otherwise be lost when flattening or decontextualizing metadata attributes. Since taxonomy and ontology management systems were not designed to be content or digital asset management systems, the taxonomy values and the ontology structures which define their use can potentially become detached from the content they describe. The semantic model lives in one system for application across one or more consuming applications while the objects themselves live in those systems, likely containing a mix of system-specific metadata, like technical descriptions of asset size and format and descriptive metadata coming from a taxonomy.

Semantic modeling is an act of establishing veracity. Taxonomies and ontologies model the domains of the organization, including describing the concepts in taxonomies and the relationships between them. Agreeing on what preferred label form to use, which concepts are synonyms, and establishing hierarchical and associative relationships between concepts are all actions to model the truth; or, more accurately, your truth based on the knowledge domain and how it will be used. For instance, modeling fixed relationships between an authority file of athlete names and associated taxonomy concepts could produce statements like the following:

Athlete name has team Team name

Athlete name has product Product name

Team name has geography Geographical location

Each of these statements asserts a truth between one or more concepts at any given time. When that truth changes, but the model doesn’t change to reflect it, there is a drift between the current legal standing stated in contracts and how this is represented in the organization’s schema. When creative works or artist likenesses are surfaced on a front end user experience, such as linking a search term to a product or presenting audio or video content in a streaming service, the assumption is that the party responsible for the platform is presenting content for which it is legally liable.
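One way to picture this drift is as a set difference between what the model asserts and what the contracts currently support. The sketch below is only an assumption about how such a check might look; the athlete, team, and product names are placeholders, and in practice the contract facts would come from the legal or contract management system.

```python
# What the semantic model currently asserts (placeholder names)
model_triples = {
    ("Athlete A", "has team", "Team X"),
    ("Athlete A", "has product", "Product P"),
    ("Team X", "has geography", "City 1"),
}

# What the contracts say today (placeholder names)
contract_facts = {
    ("Athlete A", "has team", "Team Y"),
    ("Athlete A", "has product", "Product P"),
    ("Team X", "has geography", "City 1"),
}


def find_drift(model, source_of_truth):
    """Statements the model still asserts but should not, and vice versa."""
    return {
        "stale": model - source_of_truth,
        "missing": source_of_truth - model,
    }


drift = find_drift(model_triples, contract_facts)
```

Here the model still says Athlete A plays for Team X, so that statement is reported as stale and the Team Y statement as missing.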

Ironically, the attempt to standardize taxonomy values and their relationships for application across different systems to mitigate the risk of inaccurate metadata values being applied to content and to drive functionality like search and personalized recommendations can in fact introduce new risks. When defining the organization’s domains, it is important to bear in mind how that schema is applied to the actual “things” in other systems and what modeling a full domain truth might mean for how those assets are discovered and used. Attempts to model accurately and truthfully, when applied to content, can inadvertently reveal or persist truths which should not be exposed or have changed meaning over time.

For example, selling athlete merchandise associated with a team is common practice, since fans support teams and the athletes who play for those teams. However, facts which seem immutable, like team names and rosters, can change over time. I live in Oakland, California, home of the Raiders. Well, former home of the Raiders, who are the Las Vegas Raiders who were the Oakland Raiders who were the Los Angeles Raiders who were the Oakland Raiders. How many Raiders players have come and gone over the years and have played for multiple teams throughout their careers? Keeping on top of this kind of high-velocity data changing at scale can be extremely difficult. The same can be said of company-sponsored athletes. Contracts change, and products once associated with or endorsed by an athlete may no longer be associated with that athlete. Selling one of these products out of contract can have serious financial and reputational repercussions.

Taxonomy and ontology governance processes must be firmly established and followed in order to make it possible to represent the company domain accurately while mitigating risk. A tight working relationship between the taxonomy team and business representatives in legal and/or marketing is a first priority. Whenever a contract changes, a request must be submitted to reflect this change in the semantic models so tagging moving forward is accurate. Changes to semantic modeling to reflect the new truth don’t only happen in the taxonomy management system. The change must be propagated to consuming systems and content on a known schedule. If the change is immediate, the tagging must be changed immediately. If the change goes into effect on a given date or at the end of an established period, the change must be made and pushed to meet that date.

Changing available concepts and tagging practices from a fixed point moving forward is relatively easy compared to other requirements, which may include the untagging or retagging of content to ensure it is not retrieved by search or otherwise discovered by consumers in the user experience. If an association between a team and an athlete is no longer valid, the company may have the opportunity to sell product through an established period of time, allowing the liquidation of as much outdated product stock as possible. After that time, however, discovering a product and purchasing it with the contract no longer being in effect can open the organization to significant risk.
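The effective dates and sell-through period described above could be carried on the assertion itself. The following Python is a sketch under that assumption; the `TimedAssertion` type, its field names, and the sample dates are hypothetical rather than a feature of any specific taxonomy management system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class TimedAssertion:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the contract is still in effect
    sell_through_days: int = 0       # grace period after the contract ends

    def is_discoverable(self, on: date) -> bool:
        """True if content tagged with this assertion may be surfaced on `on`."""
        if on < self.valid_from:
            return False
        if self.valid_to is None:
            return True
        return on <= self.valid_to + timedelta(days=self.sell_through_days)


endorsement = TimedAssertion(
    "Athlete A", "has product", "Product P",
    valid_from=date(2024, 1, 1),
    valid_to=date(2024, 6, 30),
    sell_through_days=30,
)
```

A consuming system that filters on `is_discoverable` would stop surfacing the product once the sell-through window closes, without waiting for someone to untag the content by hand. The same check keeps a concept hidden before its start date.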

The same risk applies to content not yet ready for prime-time. Any new product release, if tagged and made available to the user experience before a release date, can damage a company’s product launch and severely impact the time and planning that went into organizing an effective campaign. Internally, the risk still exists. If artist or product information is pre-planned in the organization’s metadata, there must be processes in place to keep this information known only to those who need to know it. That means allowing permissioned access to planned artist and product concepts to only those who need to know to avoid information leaks ahead of a release date. While organizations have contracts stipulating how internal employees handle private company information, it is possible for a product release statement or other content to be discovered internally and intentionally or unintentionally shared to social media sources.

The examples I’ve provided here are in the domains of copyrighted work, but modeling against risk and exposure is even more critical in industries like medicine, health and safety, and manufacturing, just to name a few. While modeling truth is the ultimate goal of creating and curating semantic models, all concepts and their relationships which may be used to power a user experience, such as seeking prescription medicine or medical advice, safety procedures and equipment, and best practices in manufacturing, need to be carefully evaluated before being put into production in an application. While financial and reputational risks and penalties can be damaging, they pale in comparison to inappropriately modeled information people rely on to make potentially life and death decisions.

True to their name, semantic models are meant to be meaningful and act as a source of truth. Actively maintaining taxonomies and ontologies to be in line with truth as it changes includes a vital set of governance processes which must be carefully planned and executed to ensure the company is not exposed to risk. Unfortunately, there may be domains which can’t be modeled in taxonomies because the supporting processes don’t exist or can’t be maintained and executed upon efficiently enough to demonstrate low risk. If these domain areas are important to the organization, a strong case must be made to establish and maintain the processes and resources necessary to exploit beneficial modeling before the company is put at risk in the first place. An organization might not view semantic models as a potential source of risk, so it is up to semantic architects, taxonomists, and ontologists to be conscious of modeling repercussions and actively pursue governance processes to ensure their work is an accurate and up-to-date source of truth.

Polyhierarchy and the Dissolution of Meaning

01/02/2024 13:33

https://pixabay.com/illustrations/red-pattern-abstract-background-2703887/

“Everything is everything/What is meant to be, will be.” – Lauryn Hill

Polyhierarchy

Polyhierarchy is “a controlled vocabulary structure in which some terms belong to more than one hierarchy. For example, rose might be a narrower term under both flowers and perennials in a horticulture vocabulary” (ANSI/NISO Z39.19-2005 (R2010), Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies).

While the ANSI/NISO Z39.19-2005 (R2010) standard is still my go-to for foundational taxonomy principles and may provide validation for using concepts in more than one location, I try to avoid polyhierarchy as much as possible. I see it as a construct necessary only in rare situations and because many systems are unable to consume taxonomical concepts in any other way than their actual location in a hierarchy. Specifically, I don’t like polyhierarchy which is 1) abused out of necessity to suit use cases consuming systems cannot otherwise meet, or 2) used to solve many, differing use cases. To me, polyhierarchy is the enemy of specificity; it is the forward slash of the taxonomy world…the imprecision and indecision of the either/or.

There is a conflict between the construction of one or more taxonomies for semantic accuracy and how those taxonomies are displayed because of the inability to transform and restructure taxonomies to meet different, real-world use cases. If the use case demands a concept be more than one thing in more than one place, it must be put in all of those locations in the originating taxonomies to suit navigational needs.

My former colleague and contemporary taxonomy practitioner, Bob Kasenchak, wrote in his blog post “On Polyhierarchy”, “The most common misuse of polyhierarchy is overuse: the tendency to give terms multiple parents without sufficient reason.” I agree. This statement gets to my main objection with polyhierarchy in that when it is overused, semantic precision is diluted. When everything is everything, nothing is anything.

Polyhierarchy in Navigational and Information Access Taxonomies

People have different ways of searching for information and, in an online world in which a user can start in any number of locations and expect to get to the information they want, polyhierarchical taxonomies facilitate navigating to information through multiple pathways.

A common and familiar use case for polyhierarchy is in navigational taxonomies used in online retail. Consumers may require multiple entry points in product hierarchies to find what they are looking for. Using a search engine to get to a product display page in the first place is a common scenario in findability, while searching directly on the retailer’s website is often a consumer’s next choice. However, once on a website, users may use navigational structures and filters to get to specific products. Even if the navigational browse taxonomy is displayed as a flat list rather than a hierarchy, having multiple points of entry is going to lead consumers to the product they are seeking.

For example, one might expect to find Basketball shoes under Men, Women, Unisex, AND Kids. One may also expect to find Basketball shoes under Sports > Basketball. Given the current trends in athleisure apparel, one might also expect to locate Basketball shoes under Casual or Lifestyle. These divergences in meaning account for both a consumer’s individual browsing paths and competing notions of what Basketball shoes are worn to do. For a consumer, Basketball shoes may be just as easily in one category as another without any conflicting meanings.

Supporting this use case in one or more back end systems powering a front end experience may demand a concept be placed in more than one location in a taxonomy management system because the downstream system(s) can only consume concepts exactly as they appear in a hierarchy. In this scenario, you are forced to set up taxonomies that look like the following:

Kids’ shoes
    Basketball shoes

Men’s shoes
    Basketball shoes

Unisex shoes
    Basketball shoes

Women’s shoes
    Basketball shoes

Sports
    Basketball
        Basketball shoes

In the Basketball shoes example, the concept isn’t inherently a member of all the locations it is listed, but is listed in all locations as a way to facilitate user access to products through navigation. Even in this oversimplified taxonomy model, the repetition of the concept is becoming unwieldy.

Sometimes products really are two different things which can’t, or shouldn’t, be reconciled. The Z39.19 standard provides the example that a piano is both a percussion and stringed instrument. Therefore, on a website which sells many kinds of musical instruments, listing pianos under both seems sensible. Similarly, for a retailer selling toasters, ovens, and toaster ovens, we might expect to see Toaster ovens listed under concepts like Ovens and Countertop appliances.

The same principle applies when accessing informational content. A country can be a part of a continent and also part of a designated geographical region spanning more than one continent. For example, Denmark is both a part of Europe and EMEA (Europe, Middle East, and Africa). In a hierarchy, the construction may look like this:

Continents
    Europe
        Denmark

Geographical Regions
    EMEA
        Denmark

These use cases illustrate a need for polyhierarchy even in cases in which the back end systems may not support the need well.

Polyhierarchy in Semantic Taxonomies

Taxonomies which adhere to more stringent guidelines, which I will term semantic taxonomies, are those which follow taxonomy construction and maintenance standards in an attempt to arrive at more regular, logical structures to reduce or eliminate ambiguity. Building logical, semantic taxonomies has several long-term advantages.

First, adhering to the simple principle of placing a concept in its single best location mitigates problems with system interoperability. In some cases, downstream systems consuming from a taxonomy management system can only recognize a single instance of a concept, most likely because they cannot reconcile multiple occurrences of a label with exactly the same string of characters. Another potential issue is that consuming systems may not allow a concept with the same GUID, whatever its label, to exist in more than one location. In well-structured semantic models, any polyhierarchical concept should have only one GUID or URI, rather than being a separate instance with exactly the same label but a different identifier in each location. A system with these limitations receives the hierarchy Kids’ shoes > Basketball shoes from the example above first on import and ignores each subsequent instance as it reconciles matching label strings.
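The import behavior described here can be sketched as two toy importers in Python. Neither function represents a real product; the identifier `c-100` and the row layout are assumptions made for the example.

```python
# Hypothetical export rows: (concept identifier, label, parent)
rows = [
    ("c-100", "Basketball shoes", "Kids' shoes"),
    ("c-100", "Basketball shoes", "Men's shoes"),
    ("c-100", "Basketball shoes", "Sports > Basketball"),
]


def import_first_label_wins(rows):
    """A consuming system that reconciles on label: later placements are ignored."""
    imported = {}
    for _concept_id, label, parent in rows:
        if label not in imported:
            imported[label] = parent
    return imported


def import_by_identifier(rows):
    """A consuming system that keys on the identifier and keeps every parent."""
    concepts = {}
    for concept_id, label, parent in rows:
        entry = concepts.setdefault(concept_id, {"label": label, "parents": []})
        entry["parents"].append(parent)
    return concepts


label_based = import_first_label_wins(rows)
id_based = import_by_identifier(rows)
```

The label-based importer ends up with Basketball shoes only under Kids' shoes, while the identifier-based importer keeps one concept with three parents. Which of the two a downstream system resembles determines whether polyhierarchy survives the trip.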

Second, maintaining models requiring many polyhierarchical concepts becomes more difficult as more instances, and more semantically different domains, are covered by the taxonomies. Using the same form for a concept label with a single URI or GUID for multiple purposes can eventually cause a maintenance breakdown in which the concept loses semantic precision and scope and appears in locations with different logical underpinnings, especially using relationships with unique semantic meanings.

Finally, building semantic taxonomies supports the root purpose of taxonomic structures and ontologies: to define concepts so they are unambiguous. My taxonomy 101 go-to is the “is a…” principle. As a fundamental premise, I reject the idea that, in most cases, a concept cannot be placed in a single best location expressing its intrinsic meaning. Is a toaster an appliance? Yes. Is an oven an appliance? Yes. Based on this, it’s easy enough to put toasters and ovens in their place.

Polyhierarchy also has acceptable use in semantic taxonomies. A concept can truly be a member of two categories which are overlapping or mutually exclusive. Our Denmark example above is a case in which a concept is a member of two categories. A homograph, like Mercury, is an example of a concept which has several, mutually exclusive, meanings.

However, in both cases, there are modeling choices which avoid polyhierarchy but depend on having the right functionality available. If the taxonomy tool supports associative relationships and consuming systems can use both hierarchical and associative relationships, the modeling may include a semantically named relationship in place of a standard hierarchical relationship. The associative relationship “is part of geographical region” can be used to create a specific semantic relationship to the concept EMEA, allowing Denmark to be a child of Europe but not of EMEA.

Continents
    Europe
        Denmark is part of geographical region EMEA

Geographical Regions
    EMEA
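The Denmark modeling above might be sketched in Python as follows, assuming a simple child-to-parent map for the hierarchy and a set of triples for associative relationships. The function names are my own.

```python
# Hierarchical relationships: each concept has a single parent
broader = {
    "Denmark": "Europe",
    "Europe": "Continents",
    "EMEA": "Geographical Regions",
}

# Associative relationship carrying the regional membership
associative = {
    ("Denmark", "is part of geographical region", "EMEA"),
}


def ancestors(concept):
    """Walk up the single hierarchy from a concept to its top concept."""
    chain = []
    while concept in broader:
        concept = broader[concept]
        chain.append(concept)
    return chain


def members_of_region(region):
    """Concepts linked to a region through the named associative relationship."""
    return sorted(
        s for s, p, o in associative
        if p == "is part of geographical region" and o == region
    )
```

Denmark keeps one hierarchical location under Europe, yet a consuming system that can read associative relationships can still answer the question of which countries belong to EMEA.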

In the Mercury example, the Z39.19 standard suggests the use of parenthetical qualifiers so the concept appears in mutually exclusive domains which may very well all appear in one thesaurus:

Planets
    Mercury (planet)

Metals
    Mercury (metal)

Space vehicles
    Mercury (space vehicle)

One of the challenges, especially in retail taxonomy concepts, is that concepts are rarely a single term. Returning to our Toasters and Ovens example, the concept Toaster oven is intrinsically two concepts, not one, because we have introduced a pattern of stacking nouns (toaster + oven) to create a new, compound concept. Even more frequently, adjectives modify nouns so that a single label includes more than one independent, atomic concept. For the concept Men’s basketball shoes, the pattern is gender + sport + product. Sticking with our notion of a semantic taxonomy, the three separate concepts can easily belong to three, mutually exclusive schemes covering Gender, Sports, and Products. When the new concept is created, it’s easy to see how concepts find polyhierarchical locations in different schemes to support navigation.
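One alternative to minting the compound concept is to tag each item with one atomic concept from each scheme and combine them at query time. The Python below is a sketch of that idea only; the scheme contents, the catalog entries, and the `find` function are made up for illustration.

```python
# Three mutually exclusive schemes of atomic concepts (sample values)
schemes = {
    "Gender": {"Men", "Women", "Kids", "Unisex"},
    "Sports": {"Basketball", "Running"},
    "Products": {"Shoes", "Shorts"},
}

# Each item is tagged with one concept per scheme instead of a compound concept
catalog = [
    {"sku": "A1", "Gender": "Men", "Sports": "Basketball", "Products": "Shoes"},
    {"sku": "A2", "Gender": "Women", "Sports": "Basketball", "Products": "Shoes"},
    {"sku": "A3", "Gender": "Men", "Sports": "Running", "Products": "Shorts"},
]


def find(catalog, **facets):
    """Return SKUs matching every requested scheme=concept pair."""
    for scheme, value in facets.items():
        if value not in schemes.get(scheme, ()):
            raise ValueError(f"{value!r} is not a concept in scheme {scheme!r}")
    return [
        item["sku"] for item in catalog
        if all(item.get(scheme) == value for scheme, value in facets.items())
    ]


mens_basketball_shoes = find(catalog, Gender="Men", Sports="Basketball", Products="Shoes")
all_basketball_shoes = find(catalog, Sports="Basketball", Products="Shoes")
```

Men's basketball shoes becomes a combination of three tags rather than a concept needing a home in three hierarchies, and dropping the gender facet widens the result without any remodeling.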

What a thing is versus what it is used for can also be problematic and demands a shift in thinking or, rather, an exact definition of the modeling approach used across a set of taxonomies to maintain consistent semantic principles. Again, I stick with what a thing is. My favorite example is James Bond’s exploding pen from GoldenEye. Is the pen a writing utensil? Yes. Is the pen a weapon? Well…in this case it is. In the narrow perspective of spycraft, perhaps a pen is a weapon, but it is not inherently a weapon. In the Bond universe, a pen could very well appear in a taxonomy of weapons, but, as above, there are concept form and modeling choices which would alleviate the confusion. Rather than Pen, would it not then be entered as Exploding pen? Similarly, Bond has used a Rocket pen and a Poison pen. Once we modify these concepts, they then can find themselves in one best place in a taxonomy of weapons.

Why consider alternate modeling practices to avoid polyhierarchy if the standards and tool functionality allow it? In addition to the three reasons noted in this section, there is planning for unknown domain expansions in attempts to future-proof taxonomies for additional, currently unknown use cases.

Polyhierarchy across a Graph

A fundamental problem in modeling taxonomies is trying to serve two masters by including both semantic structures following logical rules and the useful, though typically less semantically precise, structures required for navigation. By trying to model for both purposes, there are inevitable conflicts which cause compromises in structure and meaning.

Different types of polyhierarchical instances living in the same domain attempting to address conflicting use cases cause the hierarchical taxonomies and the ontologies which provide logical modeling practices for the overall graph to experience semantic drift. While the human mind can understand seeing Dog food as a narrower term for both Pet food and Dogs, a system can only accept the strings it is given.

Using inconsistent modeling practices, like using different types of hierarchical or associative relationships for the same concept, causes concepts to drift from tightly bound semantic meaning, structural context, and scope. As the meaning expands to address more use cases, the precision wanes. As I said earlier, when everything is everything, nothing is anything. In other words, concept meanings become less precise and eventually concepts shift to mean what they are, what they are used for, where they are located in a navigational taxonomy virtual folder structure, who owns the concept, and on and on. The meaning erodes.

So what? We can see the concept in context and figure out what the meaning is, right? So why bother being so tightly bound to the concept meaning? A good use case example is using taxonomies to build machine learning models. The imprecision of having Basketball shoes under multiple parents to provide specific paths for gender navigation while also having the concept nested under sports requires that the model be trained to understand that a basketball shoe is not a sport but is used for the sport of basketball. The more connections a concept has to other concepts through hierarchical and associative relationships, the more imprecise it becomes across the graph. While hierarchical structures are useful, graphs are even more so, providing the logical underpinnings for machine learning models, knowledge graphs, recommendation systems, semantic search, etc. Precise meaning becomes more important with each use case.

Polyhierarchy isn’t necessarily to be forbidden in semantic structures, but I propose using it sparingly, when a concept has truly more than one meaning, and for semantic structures which can then be transformed to provide concepts in any hierarchical structure for consuming systems and navigational use.
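As a closing sketch of that last point, the Python below keeps Basketball shoes in one hierarchical location and derives the navigational paths from named associative relationships. The relationship names, the Footwear parent, and the path templates are all hypothetical choices for the example, not a prescribed method.

```python
# Single best location in the semantic taxonomy (hypothetical parent)
broader = {"Basketball shoes": "Footwear"}

# Named associative relationships (hypothetical relationship names)
associative = [
    ("Basketball shoes", "is used for", "Basketball"),
    ("Basketball shoes", "is offered for", "Men"),
    ("Basketball shoes", "is offered for", "Women"),
    ("Basketball shoes", "is offered for", "Kids"),
]

# How each relationship is rendered as a navigation path for a consuming system
nav_templates = {
    "is used for": "Sports > {target} > {concept}",
    "is offered for": "{target} > {concept}",
}


def navigation_paths(concept):
    """Derive display paths without duplicating the concept in the source model."""
    paths = [f"{broader[concept]} > {concept}"]
    for subject, predicate, target in associative:
        if subject == concept and predicate in nav_templates:
            paths.append(
                nav_templates[predicate].format(target=target, concept=concept)
            )
    return paths


paths = navigation_paths("Basketball shoes")
```

The source model says what the concept is once, and the navigation hierarchy becomes an output generated for whichever consuming system needs it.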