Information Panopticon

Category Archives: taxonomy

The AI Bot Wars

https://pixabay.com/illustrations/ai-genrerated-robot-typewriter-9004499/

“A robot may not injure a human being or, through inaction, allow a human being to come to harm.” – Isaac Asimov, I, Robot

Someday in the distant future, automated artificial intelligence bots will wage misinformation, disinformation, fake news, and propaganda (University of Montana) campaigns directly against each other as a form of information and psychological warfare aimed at civilian populations. These campaigns will serve to erode trust, sow confusion, and create chaos within an enemy’s society. Hot wars, waged by humans or by drones and robots, will only be necessary as mop-up operations to consolidate power and assert authority. These wars will let people’s own interpretations and imaginations weaponize messaging against their fellow citizens until they destroy themselves from the inside. A fictionalized account of this type of hybrid warfare, mixing misinformation campaigns, cyberattacks on infrastructure, and conventional military force, was the plot of the recent movie Leave the World Behind.

Bot wars are not the fiction of the distant future, however. They are here today and they are improving just as rapidly as the quality of artificial intelligence. Long gone are the days of blurry photos of Nessie and shaky video of Bigfoot. Misinformation created by generative AI was a key component in the Iran-Israel conflict of 2024-2025 (EDMO) and has been central to Russia’s online propaganda campaigns (NATO).

Today’s generated images and videos are hyperrealistic and can only be determined to be fake by 1) knowing the context or content to be untrue, or 2) having access to metadata which has not been tampered with. How do we combat this onslaught of misinformation? What role do semantic professionals, including taxonomists and ontologists, have in the war for truth?

Evolution of Bot Wars

Today’s artificial intelligence wars are mostly fought by people generating content. Easy access to cheaper, faster, and better artificial intelligence tools allows any user to generate new images and text rapidly with little to no skill in video or content editing necessary. Already existing content creation and social media sharing platforms have expedited and expanded the range and audience for user-generated content, real or not. Most of these platforms can’t keep up with content review and provide no mechanism for viewing the content source, including the metadata which may reveal whether the content is real or generated using AI tools. The democratization of content generation tools has meant an explosion of content (hence the term “content creators” as, seemingly, a professional occupational title). These tools have been praised for their ability to allow users to document, in real time, true events unfolding around them. These same tools allow users to document, in real time, unreal events manufactured by them with the same ease as documenting reality. Science fiction will just be fiction, the only science involved being the technical tools used to create the fiction.

I believe the next step in the misinformation wars will be an advancement in bot-on-bot directed counter-misinformation campaigns. In fact, these wars may already be happening with the number of fabricated online personas generating content responding to other comments which, in turn, may also be the product of fake online personas. Whenever one bot posts generated content, another bot will respond, countering and confusing the messaging. There may be truth in some of the counter-messaging, posting real content in direct response to fictional content. But, really, why bother with the truth at all? One bot can simply respond with equally outrageous content rebutting or retaliating against the first. Since artificial intelligence can generate content so quickly, why not take it a step further and do what any good marketer would do, segmenting and personalizing content to audiences based on their previous social interactions, including posts, likes, and network relationships? Not only can misinformation be generated quickly, it can be tailored to segmented audiences to trigger the most resonating and visceral reactions: fear, rage, mistrust, joy. Eventually, without any direct human intervention at all, people’s confidence in truth erodes and already held beliefs and biases are reinforced. We already talk about echo chambers; the next echo chambers will be bots talking to bots with segmented human audiences receiving the exact messaging they would like to hear. Even as I talk about “will”, these trends are emerging on social media platforms today.

Recursion

I think “recursion”, “a computer programming technique involving the use of a procedure, subroutine, function, or algorithm that calls itself one or more times until a specified condition is met at which time the rest of each repetition is processed from the last one called to the first” (Merriam-Webster) is a great way to describe the more general content feedback loop we currently, and will increasingly, find ourselves in.
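For readers less familiar with the programming sense of the term, here is a minimal Python sketch of the behavior the Merriam-Webster definition describes. The function name unwind_order and its log parameter are invented for illustration only; the point is simply that the function calls itself until a stopping condition is met, and then the remainder of each call finishes from the last one called back to the first.

```python
def unwind_order(depth, log=None):
    """Call itself until depth reaches 0, then record each call as it finishes."""
    if log is None:
        log = []
    if depth == 0:
        # The specified condition that stops the recursion.
        return log
    # The function calls itself with a smaller problem.
    unwind_order(depth - 1, log)
    # The rest of this call only runs after the inner call has returned,
    # so the most recent call is recorded first.
    log.append(depth)
    return log


order = unwind_order(3)
print(order)
```

The call made last (depth 1) is the first to finish, and the original call (depth 3) is the last, which is the "last one called to the first" ordering in the definition.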

The referencing of original sources into various new content forms is happening increasingly in media as authoritative, unbiased news sources are replaced by opinionated, subjective, and polarized “news” platforms. Algorithms on popular social media platforms weight toward content which has more interactions, positive or negative, and this content drowns out everything else. The number of memes and video clips I see repeated—or, rather, regurgitated—in my social media feeds gives a false impression that only a narrow range of topics is being covered. The breadth is shaved off at the long tails and only the highest middle of the bell curve is spit out into our feeds. Of course, these feeds are shaped by the content with which we interact, creating an echo chamber of reinforced, narrowly focused subject areas. Even as the overall amount of content expands exponentially, our exposure is limited to what we already think…or, rather, believe. Because belief is replacing authoritative fact. Our friends and feeds reinforce these notions: that unpleasant or dissonant facts are a matter of belief rather than any measurable objective truth.

The recursive, or regurgitative, nature of our content sources is going to have long-term effects on the bot wars. As AI bots create more and more content, they will seek out public sources of information and, eventually, feed their own previously created content into the self-guided learning models. Endless loops of self-referencing, recursive, regurgitated, manufactured information will act as the source of truth for new information; an endless entanglement of un-cited, untraceable, unverifiable information. As the bots play out their battles, the information will become so convoluted and unprovable that the only thing left will be belief. Even without the bot wars, we are finding ourselves here today. Belief over science or fact, individual belief over public sentiment, personal fictions over established facts.

The Battle for Semantics

From the early days of my career in an academic thesaurus to the present, the overwhelming mission of establishing “Truth” when so many concepts are only contextually true has haunted me. Fundamental, existential questions of being are, of course, at the heart of semantic modeling. Ontology is “the philosophical study of being” (Wikipedia) after all. As we watch truth and untruth blend into a bizarre miasma of half-truth in real-time, I wonder if other people in the semantic field feel the way I do. I have seen the frustration from scientists as they are dismissed as fraudsters somehow tricking the public into believing humans landed on the moon, vaccines can prevent disease, and fluoride is good for your teeth. Are taxonomy and ontology practitioners feeling the same level of dispirited frustration when they face the daunting task of asserting truth in a postmodern, truthless world? Will the AI bots win?

In the spirit of never giving up in the face of seemingly insurmountable odds, I offer up the following calls to action for semantic professionals which will, at least partially, address the coming AI bot wars:

  1. Lobby for increased use of semantic practices and technologies (taxonomies, ontologies, graph databases) in your organization. The use cases for semantics are real and can be clearly defined. The real work comes in convincing the C-suite that a rather insignificant financial investment in graph databases and taxonomy and ontology management software can indeed provide a large ROI.
  2. Taxonomists and ontologists need to engage directly with subject matter experts to ensure that semantic models accurately reflect the domain(s) they cover. Ongoing data ownership, quality assurance, and SME relationships should be an integrated part of the semantic model governance process.
  3. Similarly, semantic experts need to seek out and be involved with AI and machine learning activities in the organization. As foundational source-of-truth data for machine learning training sets, ensuring semantic models are accurate and are appropriately used in AI projects will help these projects be more successful with less risk to the organization.
  4. Target the most sensitive use cases. Semantic truth is the most convincing in areas in which the organization experiences risk. Find legal use cases tied to public content or product statements. Understand what risks threaten the company and which practical use cases semantic models can address.
  5. Design transparency into semantic models, including read-only access to taxonomies and ontologies in a variety of visualizations, so end users can understand and utilize them better. A significant part of any taxonomist’s job is helping users understand what taxonomies and ontologies deliver. Allowing end users to explore for themselves is a part of this work.
  6. Fight for the same transparency in content UIs in which metadata can be viewed by end users to understand the origin of the content, including whether it was generated by AI.
  7. If politically inclined, lobby for AI regulation and policies at the national and international level. Establishing regulations guiding the use, and particularly the transparency, of AI for all users will help to ensure that there are consistent best practices in how we implement and interact with AI and its generated content. In 2024, the European Union passed the AI Act, and more national governments and international organizations should follow suit.
  8. End users need to understand how AI, as a new technology, works at least at the fundamental level. There need to be more programs aimed at providing media literacy for the general public so they can learn how to identify and distinguish truth from untruth, especially when it comes to AI-generated content.
  9. In support of media literacy and metadata transparency, publicly available AI-generated media detection tools need to be more common and easily usable by a general audience. These tools should have the ability to flag and identify misinformation for others.

The fight for truth will be partisan, political, frustrating, and even violent. We live in a postmodern world, but the death of truth will benefit those who create the most convincing and appealing misinformation the fastest. Counteracting these misinformation campaigns may very well be the last bastion of retaining democracy.

Taxonomy & Agile

https://pixabay.com/photos/washington-everett-rugby-game-80382/

“The purpose of a scrum is to restart play with a contest for possession after a minor infringement or stoppage.” – World Rugby, Law 19

In 2010, I worked as a taxonomist (and content and records management specialist) on a project in which I was embedded in a software development team dedicated to using the Agile methodology. At the time, I wasn’t familiar with Agile. I wasn’t a software developer and had never had the opportunity to work in an Agile shop. I learned a lot about Agile on that project and saw the advantages of developing taxonomies in tandem with software development using the Agile methodology. I presented on this topic at the 2011 Taxonomy Boot Camp Conference as “Agile for Managing Taxonomy Projects”.

Years later, Agile is still a widely practiced software development methodology and, at least in my work experience, taxonomists are still embedded in technology teams and work with both the business and developers in iterative sprints. Since Agile is really a set of practices which can be adopted in part or in whole, let me focus on a few aspects and how they work well with taxonomy development.

The Scrum

The origin of the term scrum comes from rugby, with the idea being that a single, cross-functional team is working together to accomplish a goal. If you watch rugby, you might see it, as this American does, as a chaotic scramble of pushing and shoving out of which a strangely shaped ball eventually is released, hopefully carried by a player. Both the idealized goal of the team and the actuality of the scrum sound a lot like a typical workplace, so I believe the idea fits. Scrum “prescribes for teams to break work into goals to be completed within time-boxed iterations, called sprints. Each sprint is no longer than one month and commonly lasts two weeks. The scrum team assesses progress in time-boxed, stand-up meetings of up to 15 minutes, called daily scrums. At the end of the sprint, the team holds two further meetings: one sprint review to demonstrate the work for stakeholders and solicit feedback, and one internal sprint retrospective. A person in charge of a scrum team is typically called a scrum master” (Wikipedia).

There are several scrum concepts which fit neatly for both software and taxonomy development:

  • Determining the scope of work,
  • Understanding the amount of time in which that work must be completed, and
  • Working iteratively to accomplish the work and deliver a finished product.

The working taxonomist will note that taxonomy development is not nearly so neatly defined as to follow prescribed, time-boxed sprints. Especially true is that it is difficult to deliver independent, discrete taxonomy hierarchies (or full taxonomy schemes) due to the interconnected nature of multiple taxonomies bound by a single ontology. However, while restricting a portion of taxonomy development to a single sprint may be impractical, the overall guidelines of the functionality to be delivered in each sprint may be aligned with the supporting taxonomies and iterative development can still be accomplished and delivered in step with technical functionality.

In my experience, one of the most valuable features of the Scrum methodology is the daily scrums (or daily stand-ups). In these meetings, understanding what technical development is happening, and especially what changes may be occurring to deliver the right functionality within the sprint, is critical for taxonomy delivery. Separating taxonomy development from software development is one thing, but taxonomy delivery can’t be separated from functionality. If components of the taxonomy and/or ontology are essential for delivery, then it must be clear which components are needed and how they are to be delivered within the software delivery. Will only taxonomy labels be delivered? What about alternative labels, definitions, or other properties? Will relationships be part of the ontology delivery and how will this be done? Does the functionality support hierarchy and, if so, is there a restriction around how many hierarchical levels? All of these are technical requirements, and, as is often the case in Agile software development, the requirements and delivered work may change in the course of the sprint. Likewise, a taxonomist may determine that a new Boolean property needs to be added to the model to support the business use case. If this wasn’t scoped as a technical requirement, it needs to be considered in the functional development.

Epics, Features, Stories

The way that work is expressed in Agile is through epics, features, and stories. Epics are large pieces of work spanning multiple sprints. In an organization based on a quarterly planning cycle, an epic could feasibly span one or more quarters. For example, a taxonomy epic may be something like “Develop a Product Ontology”. Depending on the organization’s industry, number of products, and how many properties and relationships are involved, an epic such as this might be completed in a few quarters.

Features are the next sized work item, and these typically are significant pieces of work made up of many tasks. Following our epic example, a taxonomy feature may be “Engage with product marketers to define products and attributes”. In this example, there are clearly a lot of steps and individual tasks which need to be accomplished to realize this goal. Again, a feature may span several sprints, but it should not take longer than a quarter.

The most granular pieces of work are user stories (or just stories). Stories define the daily work and should be able to be accomplished during a sprint. Taxonomists don’t always follow sprint timelines, but they can be a good guide for defining whether a story is the right size. Again, in our example, a story could be, “Add preferred labels for 50 top-selling products”. The work is discrete, measurable, and includes some idea of how long this activity may take.

Some of the benefits of using epics, features, and stories in taxonomy development, just as in the parallel software development, are

  • Scoping the size of the work appropriately,
  • Defining all of the tasks necessary to complete the work,
  • Stating measurable, realizable pieces of work, and
  • Defining who the work is for and what it will accomplish.

A template for a user story might be something like “As a [role who wants to accomplish something], I want to [what they want to accomplish], so that [why they want to accomplish that thing]” (Agile Alliance). Following on with our example above, a user story may have a simple title like “Add preferred labels for 50 top-selling products” but then include a definition or business goal field with the statement, “As a product marketer, I want to make sure the top selling products are represented in taxonomy, so that product search and landing pages are optimized using taxonomy values for user searches.” Using a template like this makes it clear what the work is, who it is for, and what it is intended to do. Although a good taxonomy should include definitions, scope notes, and editorial notes, the larger context around the intent of the taxonomy development can be captured where work is documented and tracked against software development work and organizational planning and goals. Using epics, features, and stories in taxonomy work is also a part of taxonomy governance, documenting why decisions were made and for what purpose.
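As a rough illustration of how the template's three slots map onto a tracked work item, here is a small Python sketch. The UserStory class and its field names are hypothetical and do not come from any particular Agile tool; they only mirror the role, goal, and benefit parts of the template quoted above.

```python
from dataclasses import dataclass


@dataclass
class UserStory:
    """Hypothetical work item holding a short title plus the template's three slots."""
    title: str
    role: str      # who wants to accomplish something
    goal: str      # what they want to accomplish
    benefit: str   # why they want to accomplish it

    def statement(self) -> str:
        # Fill the "As a ..., I want to ..., so that ..." template.
        return f"As a {self.role}, I want to {self.goal}, so that {self.benefit}."


story = UserStory(
    title="Add preferred labels for 50 top-selling products",
    role="product marketer",
    goal="make sure the top selling products are represented in taxonomy",
    benefit=(
        "product search and landing pages are optimized "
        "using taxonomy values for user searches"
    ),
)
print(story.title)
print(story.statement())
```

Keeping the title and the full statement as separate fields reflects the split described above: a short title for the board, and a business goal field that records who the work is for and why.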

Kanban

A Kanban board is a feature typically found in software supporting the Agile software development methodology. “Kanban boards, designed for the context in which they are used, vary considerably and may show work item types (“features” and “user stories”), columns delineating workflow activities, explicit policies, and swimlanes (rows crossing several columns, used for grouping user stories by feature). The aim is to make the general workflow and the progress of individual items clear to participants and stakeholders” (Wikipedia). Just as in software development, taxonomists can define epics, features, and stories at the appropriate level of work. Since Kanban boards tend to be less restrictive because they don’t necessarily need to adhere to sprint schedules, they may be more appropriate for defining taxonomy workflow items and how they fit within larger organizational bodies of work. “Work items are visualized to give participants a view of progress and process, from start to finish—usually via a kanban board. Work is pulled as capacity permits, rather than work being pushed into the process when requested. In knowledge work and in software development, the aim is to provide a visual process management system which aids decision-making about what, when, and how much to produce” (Wikipedia).

Especially in complex organizational environments involving many teams (software engineering, the business, the taxonomy team) and systems, tracking taxonomy work in a Kanban board within Agile development software can link non-technical taxonomy and ontology work directly to the technical work supporting its sandbox and production development and delivery for real-world taxonomy-based applications.

Summary

I’ve found that taxonomy development in the context of Agile works best when developing new software functionality. However, loosening some of the tenets of Agile to accommodate taxonomy development is possible. While the nature of software development and taxonomy and ontology development can be very different, taxonomists working within a technical team using the Agile methodology can reap significant benefits.

One, taxonomy work can too easily be abstracted and esoteric, losing its direct connection to business applications, business goals, and importance within complex, metadata-driven environments. Rooting taxonomy and ontology development work firmly in structured, documented Agile processes links taxonomy and ontology work directly to the technical functionality supporting its use in a host of applications. The Agile methodology helps make taxonomy work “real” and understandable.

Two, in organizations in which the Agile methodology is inseparable from quarterly planning and tangible software functionality delivery, taxonomy and ontology work is made visible, and, most importantly, measurable. How often, especially when workforce reduction (layoffs) is imminent, has a taxonomist been asked to prove the return on investment (ROI) for taxonomy work? Using Agile and documenting the work, its intent, and how the work was measured can provide a firm basis for developing metrics and use case examples of taxonomy and ontology work in action.

Finally, taxonomy and ontology development, release, and maintenance can be “squishy” work. Aligning taxonomy work with Agile provides a framework for how to develop and document work, how to measure work in progress and its completion, and how to map that work against technical development. Grounding taxonomy work in this way helps taxonomists stay focused on the immediate deliverables (please add this concept so I can make it live on the website) while remembering the greater strategic goals (develop this ontology so we can develop practical applications with real ROI). Taxonomy alignment with the greater goals and strategy of the organization helps to make the case for taxonomy’s importance in the organization as well as exactly what applications taxonomy supports.

In sum, 15 years after my first foray into Agile, its application and alignment with taxonomy and ontology work is still alive and well.

Truth & Consequences

https://pixabay.com/illustrations/ai-generated-painting-abstract-art-8963439/

“We live in a world where there is more and more information, and less and less meaning.” – Jean Baudrillard, Simulacra and Simulation

The art and science of taxonomy, developed by Carl Linnaeus, is a product of the Age of Enlightenment. From its outset, taxonomy has sought to neatly classify the world into named categories typically represented in hierarchical relation to each other. There is an essential human need to establish order in a chaotic universe, and the rooting of the world into scientific categories and nomenclature acts as a filing system superimposed on reality.

These foundations of taxonomic thinking prove to be both its promise and its challenge in a world which is now arguably Postmodern and dialectically opposed to the Enlightenment views of linear progress. Modern taxonomists and ontologists are still using Enlightenment tactics outside the scientific realm to provide meaning and order to information even as meaning and order collapse in practice. We have seen the flattening of truth as social media has provided platforms for democratized information, much of the content intentionally and unintentionally blurring truth and contextualized facts, beliefs, and baseless conspiracy theories. The fight to maintain (or establish) truth and order is at the heart of taxonomy, ontology, and semantic data work which has become increasingly in demand while also more challenging to establish in the shifting sands of truth.

Context

Ideally, taxonomies define a single, preferred concept in its one best hierarchical location. Such a concept is identified by its unique IRI (Internationalized Resource Identifier) allowing the concept to change preferred and alternative labels as necessary and add or subtract attribute properties and relationships. The foundational structure allows for both continuity and permanence and flexible change. A concept gains as much, if not more, meaning from its contextual structure as it does from its label. 

Technology functionality permitting, a concept can then be used in different contexts for different use cases without necessarily requiring the display of parenthetical qualifiers. For instance, in a navigational context, “mercury” might be displayed under “planets” and “metals” with the only difference being capitalization. On the back end, the seemingly single concept is unambiguously identified as separate concepts which could be represented as “Mercury (planet)” and “mercury (metal)”. Semantic standards like SKOS and RDF provide frameworks for representation for both human and machine understanding. These frameworks are the underpinnings for the Semantic Web.
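To make the "mercury" example concrete, here is a minimal Python sketch of the idea. It is not SKOS or RDF itself, only a plain data structure loosely modeled on SKOS notions such as preferred label and broader concept. The example.org namespace, the IRIs, the Concept class, and the find_by_label helper are all assumptions made up for illustration.

```python
from dataclasses import dataclass

BASE = "https://example.org/taxonomy/"  # hypothetical namespace, not a real scheme


@dataclass
class Concept:
    """A simplified concept: a stable IRI plus labels and context that can change."""
    iri: str
    pref_label: str
    qualifier: str   # back-end disambiguation, e.g. "planet" or "metal"
    broader: str     # IRI of the parent concept that supplies context

    def qualified_label(self) -> str:
        return f"{self.pref_label} ({self.qualifier})"


PLANET_IRI = BASE + "c001"
METAL_IRI = BASE + "c002"

concepts = {
    PLANET_IRI: Concept(PLANET_IRI, "Mercury", "planet", broader=BASE + "planets"),
    METAL_IRI: Concept(METAL_IRI, "mercury", "metal", broader=BASE + "metals"),
}


def find_by_label(store, label):
    """Return every concept whose preferred label matches, ignoring case."""
    return [c for c in store.values() if c.pref_label.lower() == label.lower()]


# The same display string resolves to two distinct concepts on the back end.
matches = find_by_label(concepts, "mercury")
qualified = sorted(c.qualified_label() for c in matches)
print(qualified)

# The label can change while the IRI, and therefore the concept's identity, stays put.
concepts[METAL_IRI].pref_label = "quicksilver"
print(concepts[METAL_IRI].iri, concepts[METAL_IRI].qualified_label())
```

A label search on its own returns both concepts; it is the IRI and the broader relationship that tell them apart, which is the sense in which a concept gains meaning from its structure as well as its label.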

Despite decades of work trying to establish the Semantic Web as the norm, how the vast amount of information on the Internet is used in practice for purposes like building large language models (LLMs) does not necessarily have to retain these semantic practices. Hence, both humans and machines can potentially misunderstand labels in different contexts if those concepts are divorced from their structures. Removing context can remove meaning.

On a larger scale, removing data and ideas from their context has the same result. While it may be easy for reasonable, educated people to dismiss nonsense, cleverly constructed conspiracy theories can be built out of decontextualized information blending facts, believable or established fictions, and belief. Conspiracy theories are promulgated by bad actors attempting to spread misinformation. The rapid growth and exponential improvement of artificial intelligence has made this even easier because factual gaps can be filled with generated text, images, and video. While information scientists may work to apply metadata to such content, this metadata is not typically visible, or perhaps even believable, for the average user. Truth as intended becomes reused but revised, a cut up pastiche claiming to mirror the original but actually undermining it.

We could presumably trace the death of source of truth documents to the transition from printed to electronic documents, but there are too many examples of charlatan print works aimed to deceive, either maliciously or for entertainment. No, the shift away from truth isn’t a shift in medium, it is a shift in paradigm in which truth is derived from the context, or lack of context, in which it is presented.

Belief & Complexity

Postmodernism (or, rather, the loose and varied set of practices identified as falling under the Postmodern umbrella) has flattened our perspectives of hierarchical truth power structures and destroyed the notion of objectivity. Everything is subjective, everything is belief. No longer can we argue science versus religion, fact versus fiction, right versus wrong. In some ways, the allowance for multiple perspectives has democratized a globalized world; in other ways, it has made it nearly impossible to declare semantic truths in a world absent of absolutes.

It is no longer enough to provide evidentiary truth in opposition to supposition and unfounded belief. Conspiracy theories in particular are too interesting, too elaborate, too fascinating to crumble under the hard light of truth. At the heart of much of belief is, ironically, complexity. Belief often stems from the need to simplify an overcomplicated world operating on sometimes unknown (at least to the believer) principles beyond explanation. Good and evil, right and wrong, a New World Order, to paraphrase George W. Bush. Dialectic oppositions like us and them, right and wrong, Heaven and Hell, black and white, and so on, exist to simplify and understand a world full of grey, somewhere between dialectical opposites. Again, ironically, these easily adopted dialectics are also easily supported and reified by the adopting mind by a concoction of contextual complexity aimed at creating new truths.

Perhaps, then, the popular rise of misinformation is mirrored in the increasingly complex models used by business organizations to represent their domains. When I started in this field, most organizations outside of complex domains like biopharmaceuticals and the like were content with hierarchical taxonomies. Now, more and more of these corporations require complex taxonomies and ontologies, especially to support machine learning use cases. The complexity of semantic models mirrors the complexity of the world, and, therefore, can easily mirror the complexity of truths and untruths.

Semantics

These paradigms seem to spell doom for those who aspire to create truth using semantic models and technologies. If we are in the midst of a Postmodern paradigm in which truth with a capital “T” does not exist, can semantics continue to exist as a practice? The phrase “paradigm shift” exists for a reason…or, perhaps more apropos, “this too shall pass”. As people become increasingly unmoored from meaning and the sands continue to shift under their feet, I believe that eventually they will seek some rope to pull themselves out of the quicksand. We, as taxonomists and ontologists, are here to weave rope.

As semantic professionals, we must go back to our sources, cite them, and be sure they are visible and referenceable by those who adopt the semantic models we create. We must continue to argue for the use of carefully curated semantic models as sources of truth for machine learning training data. We must continue to hire talented researchers adept at seeking and modeling truth based on semantic rather than causal relationships. We must aim to create the most truthful semantic models we can, domain by domain, regardless of whether they are reused or adopted by other companies or consumers.

If we give up on truths, regardless of whether or not they are capital “T” truths, we give in to the bad actors and malevolent forces who expect us to swallow whole the truths as they manufacture them.

Taxonomy Blues

https://pixabay.com/photos/background-blue-close-up-craft-3628553/

“I got to keep movin’, I got to keep movin’/Blues fallin’ down like hail, blues fallin’ down like hail/Hmm-mmm, blues fallin’ down like hail, blues fallin’ down like hail.” – Robert Johnson, Hellhound on My Trail

I was laid off recently from a taxonomy position I very much enjoyed. Rather than wait for the smart to wear off, the day after, I sat with a not insignificant IBU IPA and doubled down on the mixed emotions rattling through me to expound on some common issues I see in the work world of taxonomists. Many of these challenges I’ve dealt with in my blogs in one form or another in the past, but the day after a layoff hits a little differently when there’s a bruised ego and a true feeling of loss at play.

Like emails sent in the heat of the moment, posting a blog to social media hot on the heels of a personally emotional layoff is probably not the best idea. I’ve had some time to let reality set in and revisit this writing. Surprisingly, I didn’t have to alter much of the content.

It’s Just Business

Layoffs happen. If you think your company loves you, you may have never been laid off. If it has never happened to you, then I am genuinely happy that you’ve not gone through the experience. Personally, I think it’s ok to believe in your company, drink the Kool-Aid (actually, it was grape Flavor Aid), and embrace their mission, goals, and strategy. I also think it’s natural and in the interest of self-preservation to recognize that any company will unceremoniously dispose of you when necessary. Their commitment to you will never match your commitment to their goals. Short of creating your own company and working for yourself building something you strongly believe in, this is always going to be the case. I enjoyed my company, believed in their mission, but also wasn’t one bit surprised when I got laid off.

From a detached, objective position, layoffs are as much a part of doing business as hiring when times are good. Layoffs can be triggered by downturns in which an organization’s revenue drops enough to merit reducing headcount. They can be triggered by well-meaning attempts at reducing redundancy and bloat. They can also be triggered by poor strategic decisions. Whatever spurs the layoffs, they are not always conducted in a strategic and thoughtful manner. Or, perhaps, there is a thoughtful strategy, but not one that will clearly bring about success.

No matter the impetus for a layoff, in my experience they disproportionately affect contractors and, because of what I can see immediately around me, taxonomists. There’s often an overlap between the two groups. When staff is augmented with consultant, freelance, or contract taxonomists, expect those people to be higher on the list when it comes to reducing headcount. The business likely doesn’t understand the role a taxonomist plays or minimizes the skill set as something anyone can do. As a seasoned taxonomist with years of consulting engagements behind me, I can tell you not everyone can just be a taxonomist. Like any proficient role, taxonomists bring unique organizational and research skills to bear. Shifting this work to the technology organization or a business domain in the enterprise is a misappropriation of work.

You’re a Taxo-what-now?

From years in the industry and still having not yet quite perfected my elevator pitch explaining my job, I can tell you taxonomy is not well understood. Cue the taxidermy jokes, financial tax questions, and, if you had the “ontologist” title, interrogations about whether you know anything about cancer. Speaking metaphorically, I do, and that is the metastasizing misunderstanding of what taxonomy, ontology, and semantic technologies bring to the table. Hence, taxonomists are not just laid off singly, but en masse, eradicating entire capabilities from organizations which likely had a long and painful path to establishing an enterprise taxonomy capability in the first place. With one swift slice of the oncologist’s scalpel, an entire function is excised with no idea of how to replace the missing connective tissue.

I’m sure many people think their job is extremely important, if for no other reason than it keeps one motivated to show up every morning. As part of justifying your work to yourself, and, more importantly, to the chain of command above you, there need to be definitive and clear expressions of why the work is too critical to eliminate. Expressing the necessity of taxonomy work is essential precisely because it is misunderstood. Taxonomy work is often seen as simply gathering terms and putting them in lists or hierarchies, but the deep work of information science is frequently unseen. An inexperienced, self-nominated taxonomist is going to be at a loss when confronted by a dedicated, commercial off-the-shelf taxonomy and ontology system backed by a standards-compliant RDF triple store. That leaves the amateur with two options: do “taxonomy” in a simpler, alternative tool or establish a taxonomy capability in the organization starting with hiring a trained taxonomist to build the taxonomies and lead the effort to evaluate and purchase a taxonomy and ontology management system.

Having worked in many organizations which have gone from zero to taxonomy capability, I can say it is no small task and can take anywhere from months to years. Taxonomists can start as contractors building taxonomies in spreadsheets and then either transition to a full-time role themselves or lead the effort to hire a full-time taxonomist and bring in a tool. Regardless, the effort it takes to convince upper management of the need for a taxonomy program and the long journey to making it an essential part of the business involves a tremendous amount of time and resources. Establishing such a program and then cutting it demonstrates a lack of understanding, an irresponsible waste of company resources, and a phenomenal strategic error, especially in the rising tide of machine learning and generative AI.

Foundations

It is widely understood and communicated that clear, accurate foundational data is essential for a business to create meaningful analytics, support strategic decisions, and train machine learning models properly. Despite the monumental efforts corporations put into building and maintaining clean data, it’s not all that common to see it done well. Typically, the issue is years of legacy data, all created with good intentions but frequently in conflict across disparate systems, inaccurate due to the passage of time, or made obsolete by shifts in strategic direction. To use a worn expression, there is no magic bullet to solve this problem. Migrating all that data to a data lake is time-consuming and still results in redundant, conflicting data. Creating taxonomies, ontologies, and tying data together with semantic layers and knowledge graphs also assists with creating and building foundational data, but these methods too can result in disparities.

There is no simple plug-and-play solution, but pursuing multiple strategies and bringing them together is not impossible. Data lakes serve a purpose, just as taxonomies and ontologies do. They are not either/or solutions, but AND solutions: structured, relational data working with structured semantic data and both describing semi- and unstructured content across the organization. You can have one strategy without the other…but why? There are realistic barriers to pursuing multiple data strategies, including budgets, resources, data ownership, and governance. Barriers do not preclude building strong capabilities to address these different aspects. Completely eliminating one or more of these pillars makes it more difficult for the business to execute a clear and effective data strategy, only to return to rebuild that capability at a later date at a not insignificant effort.

I’ll Be Back

There is a bitter vengeance tale in my head that goes something like this: you laid me off and now I’m back as a consultant making money from you to fix what you broke. Cyborg, guns blazing, blasting all your crappy taxonomies straight to hell. Well, not very likely, but I always win this one in my head (aside: does anyone ever lose the self-righteous conversations in their head?). Yeah, ok, so maybe I’m not back and may never be. Maybe the story runs more like someone either “discovers” that taxonomy is useful or stumbles on the skeletal, ancient remains of a discarded taxonomy management system half-buried in the earth, sorely out of date, and filled with the eggs of xenomorphs. Some face-hugger plants the seed of understanding in the astronaut and they decide the organization needs to do taxonomy stuff. Maybe someone listens and they hire dedicated roles to do taxonomy stuff. Taxonomy stuff takes off, becomes seemingly essential, and then taxonomy stuff gets stuffed in a round of layoffs. Maybe taxonomists are all just Cylons.

Anyway, bitter, wounded emotions aside, eliminating an enterprise taxonomy capability does your organization a disservice. Taxonomy is foundational to well-structured, semantic, governed data. Taxonomy data should be feeding your website navigation and search, applied as metadata to content, data, and digital assets, and providing a semantic layer to your products to power personalized experiences and recommendation engines. I’ll say it bluntly: if you don’t get taxonomy, you’ll never get machine learning. Or, rather, your machine learning models will never be optimized. If you think your generative AI proof of concept will run without taxonomy, it will…at first. Then, when scaling is the next step, expect it to fall on its face. Large language models without the context of your domain, your organization, what makes you you—that is, what is modeled in taxonomies and ontologies—will give you the bland, contextless results that can only be delivered by models that don’t get who you are. Taxonomists get who you are, but, well, they’re gone.

I said bitter, wounded emotions aside. Scrap that. I am very passionate and emotional about quality data. Nerd or no nerd, this is true. And your organizational truth is going to suffer without the expertise a semantic expert can deliver. You might say this is shameless self-promotion in search of the next gig, and you might be right. What is also right is that whether I’m the taxonomist hero who saves your disintegrating semantics or it’s another capable taxonomist, I’ll applaud the result. Because truth in data. Because waste and redundancy. Because efficiency. Because user experience. Because…

I’m not going to skewer the company that laid me off and dissect what I see as their poor strategic decisions. And, honestly, I don’t know what their strategy is or will be going forward. But, I will say this: I think it was a mistake. Not for me, not for the team I really admired and enjoyed working with, but for the greater strategy of the organization. Revenge tales aside, the company is going to feel the lack of governed, semantic data built by seasoned, professional taxonomists. There are people who remain in the organization who will carry the torch, but they’ve been hobbled by indiscriminate layoffs subjected to, unfortunately, a misguided data strategy. Maybe there are other options the company will pursue to fill the gap. Maybe taxonomy will come back someday when the stock prices are more favorable, but, in the meantime, no decent data strategy is complete without semantics.

I’ll close with a call to action to taxonomists. You already know how difficult it can be to build and maintain a taxonomy capability in your organization. Once established, make it essential. Make it foundational to data work wherever it happens. Integrate the taxonomy system into important, enterprise-wide data systems and strategies. No job is impervious to layoffs, but cementing the capability will, hopefully, help you avoid the taxonomy blues.

Taxonomies, the Eternal, and the Ephemeral

https://pixabay.com/illustrations/man-time-temporariness-sky-surreal-8768564/

“Have you sped through fleeting customs, popularities?” – Walt Whitman, As I Sat Alone by Blue Ontario’s Shores

Taxonomists and ontologists are, quite reasonably, obsessed with the is-ness of things. We are, after all, classifiers, and what we classify must be able to conform to one or more categories. A significant factor in categorizing things is time. What is was is not necessarily what is is now.

Time-based categories impact taxonomy concepts in a number of ways, including defining is-ness and the maintenance and governance of is-ness as an ongoing practice.

Is-ness Terms, Business Terms

The subjectivity of is-ness is contextual. When navigating a website, categories are not always strictly semantic in their aboutness because our minds can fill in the blanks. If I navigate to Men’s > Shoes > Basketball, I know not to expect to find basketballs. I also know that basketballs aren’t shoes any more than shoes are men. My mind fills in any missing words, which might be “Men’s Shoes” or “Shoes for Men”; “Basketball Shoes” or “Shoes for Playing Basketball”. In navigation, we don’t need to be so specific because, in this context, we are less concerned with is-ness than we are with navigational findability.

Even as we describe what things are, our terms may be subjective. What does it mean for a shoe to be a “Lifestyle” shoe? What lifestyle? Whose lifestyle? Similarly, what does it mean to be “Retro”? It depends on the product, the year, and the history of the item. These examples, most importantly, are commonly understood by people based on the immediate context; that is, they are time-based. Loosely speaking, “retro” tends to span 20-30 years…within the lifetime of the consumer. I wouldn’t expect to look for a “Retro” shoe and get a Roman solea (sandal) made of leather and woven papyrus leaves. Retro, for sure, but not what we commonly agree to being retro in the consumer product space.

What people know is slippery, but, somehow, we can commonly understand the difference between concepts with longevity and those which are ephemeral and trending.

The Ephemeral and The Trending

We live in a rapid age, arguably driven by a vast online sales force of young people created by capitalist organizations. Why spend countless millions and human resources on sales and marketing when pre-teens and teens can hype and distribute your product on online social media sites and sales platforms? Some of us have seen in our lifetimes the death of a salesman (that is, a door-to-door sales person) and the rise of the young, entrepreneurial sales people receiving free products, monetary compensation, and social compensation quantified by likes and follows. It’s ingenious, really.

The fast follow to fast following is the vaporous ephemerality of what’s popular and trending. But, hasn’t this always been the case? Weren’t people obsessed with trends and topics in fashion and the public sphere which were very quickly dropped and fell out of awareness in short order? Of course, but the nature of the online race to be ahead of what’s next–to be that trendsetter who identified and pushed the next big thing–is easier in an online world and has immediate social and financial consequences.

Here’s a fairly recent example from Google Trends. “Barbie pink” was a hot topic for several months around the release of the Barbie movie. The movie premiered in Los Angeles on July 9, 2023 (which, incidentally, is my birthday, and, depending on the source, the birth date of Nikola Tesla) and opened in U.S. theaters on July 21, 2023. Look at how neatly the searches rise to meet the weeks following the movie’s release and how that trend falls off by Christmas. Likewise, Oppenheimer opened in U.S. theaters the same day, July 21, 2023, and the trend pattern is nearly identical. And, of course, their juxtaposition as “Barbenheimer” follows a similar popularity graph.

If you produce products in pink–any products in pink–you are going to want to jump on that trend and ride the wave until it disappears. When one considers what this means, the ramp-up and execution are significant. Identify all of your pink products and create landing pages so people can find all of those pink products. Make sure these pages can be found with the search term “Barbie pink” without using the word “Barbie”, because you are likely not licensed to do so. Ensure you have enough product to deliver on the increased popularity while also ensuring you are not stuck with a warehouse full of unsold product when the trend tapers off. Logistics of this nature require foresight, and, most importantly, the infrastructure to deliver on trends.

As for Oppenheimer, it seems there was no fire sale on atomic weapons.

Longevity

The concept of “literary warrant” is an important one in taxonomy creation and maintenance. Literary warrant is the justification for indexing or classifying based on the content of existing literature; literature, in the modern context, extended to electronic and physical writings of all kinds. When we use sources like Google Trends, we can say that this is user warrant: we see what concepts people are actually using and consider adding them to taxonomies.

Taxonomies are never finished. They constantly grow and are governed to maintain currency. Out-of-date concepts or phrases are deprecated or updated with newer terms. Terms of art may evolve, new areas of study may arise, or social trends may push terminology into or out of use. Taxonomists consider these factors when deciding whether a concept should be added to taxonomies. In general, the goal is to include terms that represent the domain but that also have some stability and aren’t changing rapidly.

Practically speaking, maintaining stability is important because taxonomies which are constantly in flux aren’t very useful. When terms change frequently, there is greater chance that the same or similar content will be tagged using different concepts. Additionally, frequently changing tags on content can be difficult to manage and result in sporadic and chaotic retrievability. From the end user perspective, not knowing which terms to search for or use can result in lower use of taxonomies for metadata application.

So, if trending, ephemeral concepts are useful and taxonomies with stable terms with longevity are also useful, how do we maximize the use of both?

The Ephemeral and The Eternal

For textual problems like trending concepts, machine learning (ML) models are a practical and effective means for identifying and routing terms. Using ML models against sources like user search terms on an organization’s properties, general user search terms across the Internet, user reviews, social media channels, and the like can generate terms which may be useful. Some of these terms are so fleeting, they could feasibly be tagged immediately to content for findability. As the trend wanes, the tag remains, but no longer is critical for retrieving the content. Other terms will have a longer lifespan and may be considered for inclusion in taxonomies.

The main questions to ask are

  • What is the term source?
  • How and why was it proposed by the ML model?
  • Did the ML model compare the term to existing concepts in the taxonomy?
  • Did the ML model use only exact match when comparing new concepts to those already in the taxonomy or did it also perform near match or other similarity vectors?
  • How is it reviewed for potential inclusion in taxonomies?
  • How does the ML model receive and process positive and negative review feedback to improve the model?
  • How does the taxonomist know whether to add the term to the taxonomy or not?

Developing a process in which ML models identify ephemeral and trending concepts quickly and can route and act upon these as metadata can speed an organization’s response to trends. Human-in-the-loop reviewers can include subject matter experts for product or content tagging. Importantly for taxonomists, including them as human-in-the-loop reviewers for potential candidate taxonomy concepts can help expand taxonomies and maintain their currency.

Maintaining the semantic integrity of taxonomies while also responding quickly to trending topics can improve an organization’s overall reaction to the market while also maintaining clean, quality data. Popular and timely.

Who Owns the Taxonomy?

https://pixabay.com/illustrations/ai-generated-home-ownership-house-9104189/

“Ownership is not a vice, not something to be ashamed of, but rather a commitment, and an instrument by which the general good can be served.” – Václav Havel

In my experience, when a business begins building a taxonomy program, two related questions arise: where does the taxonomy program live in an organization and who owns it?

There are at least two paths that lead to these questions. The first, and the most common from what I’ve seen, is that a taxonomy has arisen organically in the organization based on a real business need requiring a solution. An example of this might be the development of a marketing taxonomy used for planning or for tagging assets such as product copy or images. In this case, a part of the organization has covered a narrow domain of knowledge and there is a recognition that it needs to expand and grow to serve the greater needs of the business.

The second, and less common, is that an organization recognizes the importance of an enterprise taxonomy where none has existed before and makes a calculated decision to start one. From the scattered remains of glossaries and metadata schemas, consultants or a hired taxonomist builds a new enterprise taxonomy from the ground up and sets the foundation for a taxonomy program. Because an organization must requisition a consulting budget and new positions, a decision must be made as to where this position will sit in the organization and to whom this individual, or taxonomy team, will report.

Who Should Not Own Enterprise Taxonomy

Let me start by saying who I think should not own the taxonomy. Although I myself have worked in taxonomy in a technical group, I would advise against ownership by any group called Information Technology (IT) or some similar variant. In fact, I would actually be surprised if anyone in a technical organization disagreed with this assessment. Technology exists to serve the needs of the business and it is the business who should define those needs and requirements. Even when a technology organization leads the business in best practices for tooling, the business needs to define how and in what capacity the technology supports business processes and activities. While technologists such as information architects may be adept at building metadata models and schemas, including taxonomies, it is the business who must decide what values those metadata models include.

Now that I have stated the business should own the taxonomy, I’ll now go further and say that no one business domain within the organization should be the owner. Marketing should not dictate enterprise taxonomy needs, but should own marketing taxonomy needs. The same goes for any other specific domain within an organization, as any functioning company will be made up of multiple domains all working together to achieve common goals.

Where Does Taxonomy Live?

Following on the idea that no one business domain should own the enterprise taxonomy, so too should the taxonomy not live in a technology solution supporting one part of the business. While digital asset management (DAM) systems absolutely require metadata, the use case of applying taxonomy to describe assets is too narrow to act as a centralized repository for other business needs. Similarly, content management systems (CMS) are not the best place to store data that could also be described by taxonomy metadata. Using the business glossaries in data catalogs is valuable for describing the data living in or passing through that system, but is not the right tool to house business terms which should be applied in other repositories or, again, in a separate CMS. While any of these systems can house a taxonomy, none of them is purpose-built to provide enterprise taxonomy services.

As a former product manager for taxonomy and ontology management system (TMS) software, I can say there is truth in the positioning of these tools as centralized, agnostic, metadata repositories for many (but maybe not all) enterprise use cases. Centralizing taxonomies in a tool allows for building enterprise taxonomies that can serve multiple use cases and multiple systems. Because the tool stands alone, it is less subject to changing business directions and domain imperatives. On the flip side, making the case for purchasing a standalone system that “only” houses taxonomies and ontologies can be challenging. I have written about this in my former position in a blog called Running a Successful Taxonomy Campaign.

So, Who Owns the Taxonomy?

An independent, centralized, enterprise taxonomy team should ultimately own the enterprise taxonomy and the TMS it lives in. The taxonomy team owns the taxonomy and ontology models they build, but what they build is always in the service of use cases defined by the business. Having a centralized team allows them to be in a position in which they can serve any and all business domains and work with technology groups to fulfill use cases in enterprise and domain-specific technologies. I’ve seen taxonomy teams reporting up to enterprise knowledge management or learning organizations which serve the same enterprise-wide function.

Some of the business use cases are truly enterprise while others may be for specific domains which in turn serve the enterprise. For example, using values from the taxonomy in navigation and search typeahead on the company’s website is where the taxonomy ROI is realized. Tagging product images and copy in a DAM serving the front end is also an enterprise use case. The metadata from the taxonomies is used on assets which are likely going to live in multiple downstream systems and channels in which products are presented and sold.

Other use cases may be specific to a domain requiring metadata values which may or may not be shared with other domains and systems. However, centralizing these values also supports interoperability and business continuity should the domain decide to switch technology platforms. Rather than migrate metadata from the old to new system, the metadata can still be pulled from a centralized taxonomy management system using common GUIDs used across the enterprise.
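As a rough illustration of why shared GUIDs help, here is a minimal sketch in which downstream systems store only concept identifiers and resolve labels from a central store. The `TaxonomyStore` class, its methods, and the asset records are all hypothetical stand-ins I made up for this example; they are not the API of any actual taxonomy management system.

```python
import uuid
from dataclasses import dataclass


@dataclass
class Concept:
    guid: str
    pref_label: str


class TaxonomyStore:
    """Hypothetical in-memory stand-in for a centralized TMS."""

    def __init__(self) -> None:
        self._concepts: dict[str, Concept] = {}

    def add(self, pref_label: str) -> str:
        """Create a concept and return the GUID other systems will store."""
        guid = str(uuid.uuid4())
        self._concepts[guid] = Concept(guid, pref_label)
        return guid

    def rename(self, guid: str, new_label: str) -> None:
        """Change the preferred label; the GUID stays the same."""
        self._concepts[guid].pref_label = new_label

    def label_for(self, guid: str) -> str:
        return self._concepts[guid].pref_label


store = TaxonomyStore()
pink = store.add("Pink")

# The old system tags an asset with the GUID, not the label.
old_dam_asset = {"asset_id": "img-001", "concept_guids": [pink]}

# Switching platforms: only the GUID references move to the new system.
new_dam_asset = {"asset_id": "A-9001",
                 "concept_guids": list(old_dam_asset["concept_guids"])}

# A later label change in the central store reaches both systems.
store.rename(pink, "Hot Pink")
print(store.label_for(new_dam_asset["concept_guids"][0]))
```

Because neither system keeps its own copy of the label, there is nothing to reconcile after the platform switch or the rename.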

The real key here is finding the owners within the business who will be accountable for the concept values, properties, and relationships needing to be maintained in the taxonomy. Taxonomists are usually generalists who can build and maintain taxonomies across a variety of domains. They are taxonomy subject matter experts, not domain subject experts (though they may become so over time). In this way, the business SMEs who know the subject matter can be accountable for adding new concepts and identifying concepts which need to be deprecated over time. The business owners are essential to the ongoing governance and sustainability of the taxonomy and, of course, are the people who know the business needs the best.

The working relationship with technology groups is the same. There are product managers owning technology platforms serving the business. Each of these tools can be integrated via APIs to a centralized TMS to consume all or part of the taxonomy and ontology graph for the appropriate use case.

A standalone, independent, enterprise taxonomy program will allow for service to any and all business domains without bias…except for those shared business goals at the enterprise level. The ability for all business domains to have ownership and stake in the shared enterprise taxonomies also allows for cross-team collaboration and innovation with shared metadata use.

Taxonomies and Turnover, or Johnny Pneumonic

https://pixabay.com/vectors/bookshelves-books-reading-learning-1751334/

“In any given culture and at any given moment, there is always only one ‘episteme’ that defines the conditions of possibility of all knowledge, whether expressed in theory or silently invested in a practice.” – Michel Foucault, The Order of Things: An Archaeology of the Human Sciences

When the movie Johnny Mnemonic came out in 1995, I often heard people mispronounce the word “mnemonic” as “pneumonic”. I speculated at the time, rightly or wrongly, that more people knew the term “pneumonia” than “mnemonic” or maybe one was simply easier to say. Strangely enough, I’ve now associated the concepts of memory and sickness because of the similarity, and confusion, of those two concepts. Maybe a pneumonic device helps you remember something…or maybe it helps your lungs function properly. Are we concerned about another pneumonic plague or a pending mnemonic plague? Who can say?

The Mnemonic Plague

If we live in a knowledge zeitgeist, it may someday later be defined as the death of knowledge. Or, more accurately, the death of the belief in knowledge and expertise in favor of opinion and belief. The results of successful misinformation campaigns include a skepticism of expertise and knowledge, feelings that opinions are equal to or supersede knowable facts, or even the inability to know anything at all. Postmodern philosophy has concerned itself with the idea that there is no objective truth, either foreshadowing the imminent knowledge paradigm or driving it. Baudrillard has predicted, quite accurately, that simulations and simulacra will become so prevalent that all meaning will be meaningless.

Add to this another sociological trend, the current workplace generational turnover from a larger, knowledgeable, older generation to a younger generation buried in and swept up by these currents of skepticism. With reality changing so quickly, and so many people being distrustful of what reality is anyway, are we on the cusp of a mnemonic plague? A plague in which no one remembers anything? A plague in which we all doubt the ability to remember it correctly? A plague in which we are convinced by others that reality is not reality, facts are not facts, and that memory is susceptible to convincingly reality-sounding alternative histories?

Johnny Pneumonic

When we began the great shift from paper to electronic documentation, the speed at which we would be able to create, store, and access information grew exponentially. Imagine a library which could feasibly contain all knowledge from human history across all countries, languages, ethnicities, and religions. An electronic library that would make the Library of Alexandria, the Library of Congress, and the Bodleian Library look like corner news stands. One library to rule them all. That happened, sort of, with the expansion of the World Wide Web. Vast troves of information were digitized and made available online even as people continued to create informational web pages and create documents which were born digital, never seeing the pulpified remains of a tree. In parallel, all the other kinds of news, information, and entertainment were also being born digitally and, in the great democratization of the Internet, anyone anywhere could theoretically have access to the ability to view and create information assuming they had some kind of device and network or satellite infrastructure.

Imagine the possibilities! Imagine how knowledgeable we could all become! Imagine knowing anything, anywhere, at any time, at the moment you need to know it! And some of that is true. What is also true is that people create patently false information reflecting their personal beliefs and agendas. There are millions of mediums, channels, and formats available to anyone to tell us what magical healing herbs will make us live longer and with fewer wrinkles; entire video libraries dedicated to documenting the horrors of vaccines and Western medicine; travel logs with erroneous anecdotes and false facts with the only verification being the person who created the content; videos of lizard-people trafficking human children; videos of directors faking the moon landing. All of this available to anyone, anywhere who wants to be an audience.

Whether an individual or a state actor, whether the content creator believes their own content or not, whether we can decipher what is true and what is not, all of this information is there for us to consume and form our own opinions about. That is freedom. That is democracy. That is the democratization and platform of eight billion voices. We no longer have to be slaves to a single narrative; no, we can write our own narratives and gather our own followers, forming fragmented communities in a new-world Splinternet in which we can all, finally, show everyone else a picture of what we had for lunch.

The Codification of Memory

I could say things were better back in the good old days, whenever those were and for whomever they were good, but that’s also an opinion, not a fact. The genie is out of the bottle. The train has left the station. The die is cast. We have crossed the Rubicon. The horse has left the barn. We now live in a world of snowclones, memechés, and information of dubious provenance. Compounding the intersection of easily creatable and accessible information, skepticism of expertise, and generational turnover is the ever-improving use of AI tools to generate very impressive, lifelike, and believable images to support textual messages. Don’t believe we faked the moon landing? Here’s a photo! Didn’t that famous Hollywood actor appear in a porno? Here’s the video!

And now for something completely different. Or, rather, here is me now getting to the point for my target audience of information professionals: what do we do in the face of all this misinformation? Our jobs…the best we can.

I’ve been working on taxonomies for over 20 years. At various times–sometimes from one year to the next and other times one hour to the next–I have experienced that overwhelming sense of doom, frustration, or hopelessness in the face of shockingly ignorant misinformation and opinions. I have long since abandoned any hope for the Semantic Web’s promise that real, verified meaning could be captured in logical, formalized, human and machine-readable structures like taxonomies and ontologies. Despite the acknowledgement, both philosophical and practical, that there is no Truth with a capital “t” and that all facts are context-dependent, I still try to build semantic structures that codify the truth of my organization at the moment we are building it.

In the context of any moment, there are truths we can build and verify. We can create preferred concepts and connect them with human-readable, sensible relationships and build taxonomies into interconnected graphical knowledge bases applied to a variety of content vetted by subject matter experts. Verified and vetted content may no longer be true in a week or a year or a decade. The relationship between two concepts may quickly become outdated. A preferred concept may fall out of use or fashion. Despite all of this, codifying knowledge in taxonomies and ontologies is not an act of futility; it is an act of capturing truth and memory at a moment. We can document these changes over time and look back over the history to see what was true versus what is true now.

These semantic structures are a way we can document organizational knowledge and pass this knowledge on to the next employee, the next generation, the next iteration of a company. Creating good historical records in semantic models serves a knowledge management function in being one way we can enable knowledge handover from one person to the next, one team to the next, one project to the next.

The organizational book of knowledge as written in taxonomies is often edited and changed, but it is still a book to which we can refer with confidence that what we tried to build was accurate and true to the best of our abilities. Taxonomies are the new black. Ontologies are the mother of all semantic structures. In space, no one can hear you taxonomize…but your work is still valuable.

Modeling a Moving Target

https://pixabay.com/illustrations/ai-generated-pipes-industrial-9043429/

“Like a wave in the physical world, in the infinite ocean of the medium which pervades all, so in the world of organisms, in life, an impulse started proceeds onward, at times, may be, with the speed of light, at times, again, so slowly that for ages and ages it seems to stay, passing through processes of a complexity inconceivable to men, but in all its forms, in all its stages, its energy ever and ever integrally present.” – Nikola Tesla

Some of the most thorough, authoritative, and well-constructed controlled vocabularies have been built and curated over the course of decades. The NASA Thesaurus was first published in 1967. The Library of Congress Subject Headings can be traced back to its roots in 1898. The MLA Thesaurus was modernized in 1981 based on previous classification methods. The Getty Art & Architecture Thesaurus was started in the late 1970s. There are many more examples of well-established controlled vocabularies, some older and some newer. Maybe there’s something to the adage “with age comes wisdom” in the longevity and authority of these vocabularies.

Many of these longstanding vocabularies serve the academic system and do not necessarily “move with the speed of business”. In a business environment, the emphasis is typically on speed, agility, and efficiency. None of these qualities are in opposition to the typical development of good controlled vocabularies, but a fast-paced organizational environment can certainly be difficult to support with vocabulary development that can’t keep up.

In my experience, product vendors talk about how to speed up taxonomy and ontology development while semantic practitioners (librarians, taxonomists, ontologists, etc.) regularly have to strike a balance between maintaining quality semantic models and serving the business needs. The speed at which semantic models need to be developed in a business setting often leads to the extension of inappropriate taxonomy use cases in order to support and facilitate immediate business priorities. We need to step back and ask ourselves how we can as taxonomists effectively support the business while addressing what should be modeled in taxonomies and what should not.

In this blog, I’m going to focus on two closely related cases of data which I think are not appropriately modeled or maintained in a semantic model but are often brought up as examples of serving the business quickly.

Processes & Sequences

A business problem I’ve seen in my roles as a taxonomy management platform product manager for a vendor and as a taxonomy practitioner is the modeling of processes as taxonomies.

There are several difficulties with modeling processes as taxonomies, which is why I tend to discourage their inclusion as part of the overall enterprise taxonomy and ontology semantic framework. First, steps in a process aren’t usually semantically hierarchical. For example, say you are asked to support, in a centralized taxonomy, a step-by-step process for submitting a ticket in a request tool such as those offered by ServiceNow or Atlassian. An IT service ticket can be opened, assigned, reassigned, resolved, closed, and reopened. Building these steps as a hierarchical taxonomy doesn’t convey their true relationship to each other. You can build these steps as a flat list, which addresses that problem, but then you have a sorting issue of making sure the steps are displayed in the proper order in the consuming system. Since the actual ticket requests live in a consuming ticketing system while the controlled values naming each step in the process live in a taxonomy management system, this may not be a problem at all. The ticket is an object that is simply tagged with a new value at each stage in its lifecycle, regardless of the flat or hierarchical structure in the taxonomy. So, the first problem may be solvable, but it needs to be addressed.

The second issue is that processes are rarely progressive steps from beginning to end. The stages of a ticket may move up and down a flat list or hierarchy, often skipping steps in between and moving back to a previous step. Again, if the ticket is tagged from a flat list which is properly ordered for display, this may not be an issue. If the consuming system must follow the flat list or tree order, however, there may be challenges in changing the value to a previous step. Where this gets complicated is in processes in which the same steps are repeated by different groups in the organization. A contract, for example, can go through contract review by the vendor, by the business team representatives, by legal, and again by the vendor’s legal. Modeling a process like this taxonomically often means that these repeated steps are either appended with the team or entity conducting the process step (business review, legal review, etc.) or repeated as children under organizational group headings, creating a polyhierarchical nightmare.
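To make the ticket example concrete, here is a minimal sketch in Python. It assumes a flat list of status concepts pulled from a taxonomy and a hypothetical ticket history; the names STATUS_ORDER, ticket_history, and backward_moves are mine for illustration and do not come from any ticketing product.

```python
# Flat list of controlled values in the intended display order
# (an assumption: the taxonomy system can export this order as-is).
STATUS_ORDER = ["Opened", "Assigned", "Reassigned", "Resolved", "Closed", "Reopened"]

# Hypothetical history of one ticket: it skips steps and moves backwards.
ticket_history = [
    "Opened", "Assigned", "Resolved", "Reopened", "Reassigned", "Resolved", "Closed",
]


def backward_moves(history, order=STATUS_ORDER):
    """Return (from, to) pairs where the ticket moved to an earlier list position."""
    moves = []
    for previous, current in zip(history, history[1:]):
        if order.index(current) < order.index(previous):
            moves.append((previous, current))
    return moves


print(backward_moves(ticket_history))
```

The ticket is only ever tagged with one value at a time, so the flat list works as long as the consuming system does not insist that values advance in list order.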

The final and most difficult challenge with modeling processes is that they often change. In another example, modeling the customer journey from start to finish is much like modeling the stages of an IT ticketing process. The customer rarely moves neatly from one step to the next. More important, however, is that marketing processes are frequently overhauled with changing values any time there are changes in business strategy. Even when the values don’t change, their order in the process does, and these orders are often reflected as navigational structures represented as filters on the front end. Representing a customer journey based on filterable values is challenging because of the numerous ways a customer may enter the UX pipeline. In retail apparel, for example, they may start looking for products at Gender (Men’s, Women’s, Children), then apparel type, then size, then filtering by material or color. Or, the customer may start with color, then apparel type, then size. There is a process here, but one with multiple entry and end points. Trying to represent this as a taxonomy is incredibly difficult.

A similar problem I’ve come across in taxonomy modeling is using lists or hierarchies to indicate sequence. Most processes have steps that are in sequence, but in this case I’m talking about strictly fixed sequential order. Examples can include the order of books or films (by date or by narrative order), the list of U.S. Presidents in order, or events by date.

The fundamental issue for most taxonomy management systems, and for many systems relying on taxonomy data for that matter, is the default to alphabetical display in lists. For most taxonomy hierarchies, alphabetical is the preferred display order, with each cascading branch also alphabetized. On most front end websites, navigational taxonomies are ordered by use with the most prevalent ways to access information listed first. Front end website platforms are built for this type of information display because the hierarchies do not necessarily follow what I would call semantic or “is a” taxonomy practices; parent child relationships are based on filtered drilldowns, not by strict contextual meaning. Where this becomes an issue is when an organization, quite rightly, wants to consume centralized taxonomy concepts for the front end experience.
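A small Python sketch shows how quickly an alphabetical default destroys a fixed sequence. The list below is the first five U.S. Presidents in order of office; the variable names are only illustrative.

```python
# Concepts in their fixed sequential order (order of office).
presidents_in_sequence = [
    "George Washington",
    "John Adams",
    "Thomas Jefferson",
    "James Madison",
    "James Monroe",
]

# What a system defaulting to alphabetical display would show instead.
alphabetical_display = sorted(presidents_in_sequence)

for label in alphabetical_display:
    print(label)
```

Nothing in the labels themselves carries the sequence, so once a consuming system re-sorts the list, the order is gone unless it is stored somewhere else.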

Even when using the larger graph underlying taxonomy and ontology structures, conveying order can be challenging without the right functionality to support it or the ability to leverage the model in consuming systems that can only handle flat lists or shallow hierarchies.

Modeling Options

Modeling sequences isn’t out of the question in taxonomy management systems assuming that 1) the taxonomy and ontology management system includes functionality supporting modeling options, or 2) downstream systems can effectively handle or transform concepts received from a centralized taxonomy.

To model steps in a process or to capture sequences, there are a few options. If you are using a tool supporting RDF and core SKOS elements, you can consider using skos:OrderedCollection. An ordered collection is exactly what it sounds like: a collection of concepts put in a specified order. Using an ordered collection for a list of concepts in a branch of taxonomy allows listing those items in the order desired. There may be no other indicators of why those concepts are in that order if stripped from their contextual parent, but it will force sort a group of concepts. This assumes, of course, that the consuming systems don’t simply revert to alphabetical order once received.
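As a rough sketch of what an ordered collection looks like in RDF, the Python function below writes a skos:OrderedCollection in Turtle syntax, using skos:memberList to hold the ordered members. The example.org namespace, the concept identifiers, and the function name are assumptions made for illustration; a taxonomy management system would normally generate this for you.

```python
def ordered_collection_turtle(collection_id, label, member_ids,
                              base="http://example.org/taxonomy/"):
    """Build Turtle for a skos:OrderedCollection whose members keep list order.

    The base namespace and identifiers are hypothetical placeholders.
    """
    members = " ".join(f"ex:{member}" for member in member_ids)
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",
        "",
        f"ex:{collection_id} a skos:OrderedCollection ;",
        f'    skos:prefLabel "{label}"@en ;',
        f"    skos:memberList ( {members} ) .",
    ]
    return "\n".join(lines)


turtle = ordered_collection_turtle(
    "ticketLifecycle",
    "Ticket lifecycle",
    ["opened", "assigned", "resolved", "closed"],
)
print(turtle)
```

The parenthesized list is an RDF list, which is what preserves the order. Whether a downstream system reads skos:memberList or flattens it back to an alphabetical list is a separate question.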

A more flexible and sustainable way to model a process is to model semantically using “is a” rules and then leveraging a true ontological structure and a graph to map the journey. This means modeling concepts in their one best location in one or more taxonomies as part of a larger domain model. Modeling this way leans into the strengths of an ontology by using associative relationships between entities to make a graphical representation of the order while also connecting the entities to their owners. So, for example, books or films may have relationships like is followed by / is preceded by and then can be connected to their authors, directors, and actors as part of a greater graph.
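The sketch below shows the idea with plain Python tuples standing in for graph triples. The predicate names ("is followed by", "has director") and the helper sequence_from are hypothetical; a real implementation would define these relationships in the ontology and query them from the graph.

```python
# Triples as (subject, predicate, object). The ordering lives in the
# "is followed by" relationships; other relationships connect to people.
TRIPLES = [
    ("Casino Royale", "is followed by", "Quantum of Solace"),
    ("Quantum of Solace", "is followed by", "Skyfall"),
    ("Skyfall", "is followed by", "Spectre"),
    ("Casino Royale", "has director", "Martin Campbell"),
    ("Skyfall", "has director", "Sam Mendes"),
]


def sequence_from(start, triples, predicate="is followed by"):
    """Walk the ordering relationship from a starting concept."""
    next_of = {s: o for s, p, o in triples if p == predicate}
    order = [start]
    seen = {start}
    while order[-1] in next_of:
        following = next_of[order[-1]]
        if following in seen:  # guard against accidental cycles
            break
        order.append(following)
        seen.add(following)
    return order


print(sequence_from("Casino Royale", TRIPLES))
```

Because the order is carried by relationships rather than by position in a list, each film can still sit in its one best place in a taxonomy while the graph supplies the sequence.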

Another option is to include a property on each concept. The property could be a field which indicates a numerical value listing its placement in the list. While this metadata field could be useful in the taxonomy user experience, whether or not these property values could be used in consuming systems to order items is still problematic. Furthermore, it gets complicated if it’s necessary to order multiple sets, all of them including the same numbers as an ordering property.
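One way to picture the multiple-sets complication is a sketch like the following, in which each concept carries a separate ordering property per sequence. The property names publication_order and narrative_order are assumptions for illustration, not fields from any particular tool.

```python
# Each concept carries one numeric ordering property per sequence it belongs to.
CONCEPTS = {
    "The Lion, the Witch and the Wardrobe": {"publication_order": 1, "narrative_order": 2},
    "Prince Caspian": {"publication_order": 2, "narrative_order": 4},
    "The Magician's Nephew": {"publication_order": 6, "narrative_order": 1},
}


def ordered(concepts, order_property):
    """Return concept labels sorted by the chosen ordering property."""
    return sorted(concepts, key=lambda label: concepts[label][order_property])


print(ordered(CONCEPTS, "publication_order"))
print(ordered(CONCEPTS, "narrative_order"))
```

Naming the property after the sequence keeps two orderings from colliding, but it also means every new sequence adds another property to maintain, and the consuming system still has to know which one to sort on.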

In advanced graph-based taxonomy and ontology management systems, there may be an option to use reification or RDF* to support metadata on triples. In this way, the ordering is embedded on the edges themselves. For example, books and films could include relationships with a release date. This could look something like “James Bond film” has release date [20061117] “Casino Royale” in addition to a broader/narrower relationship between the two concepts. There are several modeling options to make use of associative relationships with added metadata on the relationship edge.
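The sketch below is a plain Python stand-in for metadata on an edge; it is not RDF-star syntax and does not use any graph library. The Edge class and narrower_by helper are hypothetical, and apart from the Casino Royale date given above, the release dates are illustrative values that should be checked before reuse.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A relationship between two concepts, with metadata stored on the edge."""
    subject: str
    predicate: str
    obj: str
    annotations: dict = field(default_factory=dict)


EDGES = [
    Edge("James Bond film", "narrower", "Skyfall", {"release_date": "20121109"}),
    Edge("James Bond film", "narrower", "Casino Royale", {"release_date": "20061117"}),
    Edge("James Bond film", "narrower", "Quantum of Solace", {"release_date": "20081114"}),
]


def narrower_by(edges, parent, annotation_key):
    """List narrower concepts of a parent, ordered by an annotation on the edge."""
    matching = [
        edge for edge in edges
        if edge.subject == parent
        and edge.predicate == "narrower"
        and annotation_key in edge.annotations
    ]
    matching.sort(key=lambda edge: edge.annotations[annotation_key])
    return [edge.obj for edge in matching]


print(narrower_by(EDGES, "James Bond film", "release_date"))
```

Dates written as YYYYMMDD strings sort correctly as text, which is why the sketch can compare them without parsing.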

In sum, it’s not impossible to model processes and sequences in taxonomies, but it requires thoughtful modeling in the context of other existing business taxonomies likely sharing the same overarching business ontology. Moreover, thoughtful modeling may not move with the speed at which an organization wants to move, but slowing down and getting it right the first time can save a lot of painful rework later.

Taxonomy Calling

https://pixabay.com/vectors/alien-greeting-hello-long-life-1292972/

“Hello, is it me you’re looking for? / ‘Cause I wonder where you are / And I wonder what you do / Are you somewhere feeling lonely?” – Lionel Richie, Hello

One of the most challenging activities in taxonomy work is communicating the value of taxonomy to potential business stakeholders. With so many shiny, promising technologies and methodologies, it can be daunting for the taxonomy strategist to win over converts to taxonomy use. Taxonomies and their applications are often misunderstood or are narrowly focused on a few common use cases like navigation. While business users can clearly articulate their needs, they may not be able to connect those needs to how taxonomies can be applied in the business.

The taxonomy strategist must be able to communicate the value of taxonomy while expressing the complexity of semantic structures like ontologies and their supporting technologies simply and succinctly to a variety of business stakeholders. 

Communicating the Value through Examples

To gain taxonomy users, it’s essential to communicate the value of taxonomy. One way to start is to seek out areas taxonomy can directly address, find examples of the current state problems, provide taxonomy-based solutions, and then communicate the findings to the business owners. This process can be initiated by the taxonomy strategist or by the business owners themselves, assuming, of course, they know to contact the taxonomy team in an effort to answer their need.

One simple, powerful example is to review a search-dependent organizational website–which could be an internal intranet or external, public-facing website–and collect examples of navigational and search barriers causing confusion, poor search results, or revenue-losing scenarios. For each example, provide an explanation of how taxonomy might help. For navigational issues, the taxonomy solution may be category restructuring or improved, facet-based results filtering aligned to the typical user journey. For search retrieval issues, taxonomy may be used for typeahead search keyword matching or to improve search relevance to include more accurate or additional results through content tagging or keywording. Navigation and search are often close to time-saving or profit-driving activities, improving the efficiency and bottom line of the organization. Search examples and their potential taxonomy solutions, therefore, are closer to the source of organizational revenue and make convincing use cases.
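For stakeholders who want to see how typeahead matching might draw on a taxonomy, a minimal sketch in Python follows. It assumes a small set of preferred labels with synonyms; the concepts and the typeahead function are hypothetical and much simpler than a production search integration.

```python
# Preferred labels mapped to their synonyms (alternative labels).
CONCEPTS = {
    "Sneakers": ["trainers", "tennis shoes", "athletic shoes"],
    "T-Shirts": ["tees", "tee shirts"],
    "Trousers": ["pants", "slacks"],
}


def typeahead(prefix, concepts=CONCEPTS, limit=5):
    """Suggest preferred labels whose preferred or alternative label starts with the prefix."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    matches = []
    for pref_label, alt_labels in concepts.items():
        labels = [pref_label, *alt_labels]
        if any(label.lower().startswith(needle) for label in labels):
            matches.append(pref_label)
    return sorted(matches)[:limit]


print(typeahead("pan"))  # a synonym leads the user to the preferred concept
print(typeahead("tr"))
```

The point to show stakeholders is that a user typing "pants" still lands on the organization's preferred concept, which is the same concept used to tag the content.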

As generative AI becomes more prevalent in organizations, finding examples of general or inaccurate results and how an enterprise, domain-specific taxonomy (and ontology) can act as foundational training data to improve those results can result in convincing proof of concept projects. Generative AI and machine learning models can seem like magic to the average user who may not know the amount of time and data it takes to train a model to produce accurate and useful results. Providing examples of poor machine learning model output can illuminate the need for clean, accurate foundational data. As an organizational source of truth, taxonomies can provide such semantic data.

To overcome user confusion about what taxonomy means or clarify what they think taxonomy means, try starting with the end result and work backwards. When assessing the value of a new bathroom faucet, someone will look at whether the fixtures look appealing and if hot and cold water comes out as expected. Initially, no one is interested in the pipes. Taxonomy, unflatteringly, is the pipeline infrastructure providing clean water to downstream consuming systems. First show excellent search results or machine learning outcomes and then explain how taxonomy is the basis for those results. If business stakeholders are interested in taxonomy, all the better for your work and evangelization. If they aren’t, let them be impressed by the final state and develop a process of working together to get to and maintain that final result.

Communicating the Value through Time and ROI

One potential stakeholder hesitation may be the time it takes to perform discovery, conduct the build, and put taxonomy values into production. This process can take time in the initial business stakeholder relationship. Once established, however, the speed at which business users can request concepts and see them live can move as quickly as your organizational systems can handle. People often believe they need to “move at the speed of business”, which, ironically, they think is fast but is more often cumbersome, manual, and slow. What they want is the magical now in which thought is converted to action faster than Captain Kirk can have his shirt ripped when first confronting an alien species.

Machine learning techniques, once perfected, can offer the kind of rapid response business owners are looking for, but only after a lot of training. Specifically, a lot of training on assets and data tagged with taxonomy. Too often, the “magic” of artificial intelligence business users are sold isn’t artificial at all: it is thousands of hours of tagging content and training models to get the desired results. If done properly, there’s nothing wrong with using machine learning models to quickly react to trending topics or generate text on the fly. However, the slower growth of a taxonomy, as I cover in my blog The Taxonomy Tortoise and the ML Hare, actually creates speed in other areas, saving time in responding to consumers’ direct search queries and tagging content to train and evolve machine learning models. Communicating the need for time investment up front to generate time-savings later can be compelling.

Communicating taxonomy ROI, which I covered a few years ago in my blog for Synaptica, Running a Successful Taxonomy Campaign, can be extremely difficult. How do you explain how words become money? Again, show the examples. Mining successful and failed search results and mapping these to taxonomy as metadata tagged to assets can show a direct line between creating taxonomy concepts, applying them to content, and successful search results that end in a product purchase. Going back to time, time is money: time employees spend manually creating, tagging, and manipulating content which drives sales; time spent training machine learning models; time spent seeking information which has not been tagged with metadata. Ramping up taxonomy processes to more quickly tag content and put words into production will result in quicker time to money and realized ROI. While starting taxonomies can be slow at first, the more success the taxonomy strategist has in engaging business users, the more quickly the taxonomy is built out and covers the breadth needed to tag assets and express important concepts users are seeking.

Communicating Complexity

Communicating the nuance and complexity of taxonomies and ontologies may be necessary as the details of a pending or ongoing project develop. Few business contacts need to know the difference between a flat list, taxonomy, thesaurus, or ontology. In fact, I find there are disagreements about the differences even among practitioners. That said, users can come to the discussion believing that taxonomies are only hierarchical lists of terms. For most practical discussions, I use the term “taxonomy” to include flat and hierarchical lists of terms, properties, and hierarchical and associative relationships. I rarely bother with ontology concepts like classes unless they are necessary to meet the project objectives.

If these terms do need clarification, however, I often clarify with simplicity. Taxonomies are concepts (preferred labels) that include synonyms (alternative labels) and other metadata attributes (properties) and these concepts can be related hierarchically and through custom relationships (associative relationships). When discussing ontology, I usually state that taxonomies are the words you want to use and the ontology includes the rules for the words you want to use. For example, how concepts are grouped (classes), how they can be related to each other (domain and range constrained by classes), and whether certain properties can be made available for a use case (properties constrained by classes). That’s often all a user needs to know.
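For a technical audience, the same explanation can be sketched in a few lines of Python. This is only an illustration of the words-versus-rules distinction, with hypothetical class names and a single made-up relationship rule; it is not how any particular taxonomy tool stores its model.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Taxonomy side: the words you want to use."""
    pref_label: str                                   # preferred label
    concept_class: str                                # ontology class the concept belongs to
    alt_labels: list = field(default_factory=list)    # synonyms
    properties: dict = field(default_factory=dict)    # other metadata attributes


# Ontology side: the rules. Each relationship names its (domain, range) classes.
RELATIONSHIP_RULES = {
    "has author": ("Book", "Person"),
}


def is_allowed(subject, predicate, obj, rules=RELATIONSHIP_RULES):
    """Check a proposed relationship against the domain and range rules."""
    if predicate not in rules:
        return False
    domain, range_ = rules[predicate]
    return subject.concept_class == domain and obj.concept_class == range_


novel = Concept("Casino Royale", "Book", alt_labels=["Casino Royale (novel)"])
author = Concept("Ian Fleming", "Person")

print(is_allowed(novel, "has author", author))  # Book -> Person fits the rule
print(is_allowed(author, "has author", novel))  # Person -> Book does not
```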

There are more advanced use cases, like machine learning, which are, in my experience, more a matter of mapping ideas than of education. Data scientists usually use all the same concepts as taxonomies and ontologies but may use different terms to express them. After one or two conversations, the mappings are understood and the complexity is simplified. It’s not often a data scientist needs convincing to leverage taxonomies, but getting on the same page with conceptual ideas is a good way to make taxonomy value clear.

In large organizations, there is usually information architecture complexity as well. Because of this, taxonomy can often become necessarily complex as values are consumed by and flow through various systems. Understanding this workflow is not always a prerequisite for understanding the value of taxonomy, however, and does not need to weigh down conversations with potential business stakeholders. If it does become necessary, simplify those information architecture diagrams into simple flowcharts between systems, showing at a high level how taxonomy concepts move from system to system and what they do in each.

Being a taxonomy strategist is challenging, but is a necessary part of the job for taxonomy to show and prove its value in the organization.

The Taxonomy Tortoise and the ML Hare

https://pixabay.com/illustrations/aesops-fable-tortoise-and-the-hare-6570775/

“I knew I shoulda’ taken that left turn at Albuquerque.” – Bugs Bunny

For better or worse, much of my childhood was informed by Looney Tunes, Monty Python, and a diet of science fiction ranging from the profound to the disjointedly camp. As such, I expect the absurd and am wildly skeptical of easy answers. Additionally, my foundation of science fiction books and films compels me to speculate that artificial intelligence will become a more realistic probability in our lives with actions ranging from locking us out of airlocks and starting global thermonuclear war to providing answers to our most pressing global problems.

The long-promised advantages of artificial intelligence seem finally to be reaching a point at which they can be utilized for enterprise purposes, including parsing, and even understanding, large amounts of text and data at rapid speed. The recent successes raise the question: if machine learning models can operate on data at high volume and velocity, why shouldn’t they be used to come up with answers on the fly based on large amounts of data internal or external to an organization? Well, in fact, they already are, and, in my opinion, they should be, but not without some acknowledgment of absurdity and a certain degree of skepticism.

I’m a firm believer in defining semantic models in the form of taxonomies and ontologies to be used as a foundational schema for an organization’s data. One of the arguments against investment in taxonomies is the time it takes to create them and the amount of maintenance required to sustain them. In a world in which what is trending changes frequently, user tastes are fickle, and the jargon associated with these trends passes quickly, the desire to avoid the tortoise-like pace of building taxonomies in favor of utilizing other, faster technologies is tempting. But, as the hare who lost the race to the tortoise laments, “I knew I shoulda’ taken that left turn at Albuquerque.” Or, let’s consider checking the map before we go racing off in the wrong direction.

Let’s talk semantics. Putting it simply, ontologies are semantic structures which define one or more domains. They describe the types of things in the domain (classes), how these things can relate to each other (relationships, predicates, or edges), what labeled fields are used to describe these things (properties), and the instances of things (subjects, objects, or, more plainly, taxonomy concepts). Ontologies describing the general domain and taxonomies including the specific instances within one or more domains can be created as a map of your organization. These semantic structures represent the organization in all of its complexity. They specify the concepts important to the company and how these concepts relate to each other, data, and content. Once data or content is added, we can call this entire structure a knowledge graph.

In short, ontologies, taxonomies, and content are the organization’s view of itself, the world, and where it lives in it.

Large language models (LLMs) have the ability to generate text, answer natural language questions, and classify content. Most publicly available LLMs, like ChatGPT, are trained on publicly available information. It is also possible to supply these LLMs with your own training sets of documents and language samples to develop answers more applicable to your own organization. Wisely, many organizations tightly control what information can be presented to these AI tools to avoid company information leaks or supplying competitors with proprietary information.

What’s lacking in using these hare-rapid models, however, is the organizational perspective. They are very good at answering general questions and making factual assertions from text, but they require tailored training content with specific use cases in mind to generate answers specific to an organization’s needs. There can be a temptation to feed one of these models a large quantity of organizational content to train them faster. However, the span of topics, language, jargon, and acronyms used in an organization can yield unsatisfying or unpredictable results. Imagine, if you will, the amount and variety of content in any one of your company’s content management systems. Now imagine asking a machine learning model to analyze and make sense of it all without guidance. You can index all of your own content, but without a framework, what sense does it make?

At this moment, the hare and the tortoise must strike a deal if they both want to win. To improve the performance of LLMs and other machine learning models, a domain topology specific to your organization defining the concepts, their synonyms and acronyms, and how they relate to each other, can be used as a schema input into the model. Semantic models are, after all, assertions in the form of triple statements (subject-predicate-object). Ontologies establish factual statements as determined by your organization’s use cases and, hence, provide patterns which can be used by machine learning models. Lexical proximity can be gathered from taxonomy hierarchies (these concepts are more closely related because they share a parent-child relationship) and associative relationships (these concepts, separated across several taxonomies, are actually very closely related because they have a direct associative relationship between them). Semantic models provide factual statements, built slowly over time based on business use cases, which can augment and improve LLMs.
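As a rough illustration of how proximity might be read off a semantic model, the Python sketch below treats triples as an undirected graph and counts hops between concepts. Hop count is only one crude proxy, and the concepts, predicates, and the distance function are assumptions made for this example.

```python
from collections import deque

# Triples as (subject, predicate, object) across two hypothetical taxonomies.
TRIPLES = [
    ("Running Shoes", "broader", "Footwear"),
    ("Sandals", "broader", "Footwear"),
    ("Footwear", "broader", "Apparel"),
    ("Marathon Training", "broader", "Activities"),
    ("Running Shoes", "related", "Marathon Training"),  # associative link across taxonomies
]


def distance(start, goal, triples=TRIPLES):
    """Count the fewest relationship hops between two concepts, or None if unconnected."""
    neighbors = {}
    for subject, _predicate, obj in triples:
        neighbors.setdefault(subject, set()).add(obj)
        neighbors.setdefault(obj, set()).add(subject)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for neighbor in neighbors.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


print(distance("Running Shoes", "Sandals"))            # siblings under Footwear
print(distance("Running Shoes", "Marathon Training"))  # direct associative relationship
```

Without the associative relationship, the two taxonomies in this sketch would not be connected at all, which is the case the paragraph above describes.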

Not only can we think of semantic models as a collection of factual statements according to your organizational domain and use cases, we can also think of them as a summary, requiring the LLM to ingest a lot less information to reach the same factual conclusion. For example, you can provide the model with a huge amount of training data stating that a particular SKU-level product is available in the color blue. If this is a factual assertion in your semantic models (Product name has color Blue), however, then this fact can be tagged to a single product representation in a database and in turn applied to thousands of real-world SKU instances. Semantic models distill thousands of instances of truths across an organization into a collection of ontology structural elements and taxonomic instances. To borrow a joke from Steven Wright, in which the comic tells us he has a map of the United States which is actual size, your organizational map can be represented at a much smaller scale.
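The summarizing effect can be sketched in a few lines of Python. The product name, SKU identifiers, and the facts_for_sku helper are all hypothetical; the point is only that one assertion at the product level reaches every SKU that points to it.

```python
# A single product-level assertion in the semantic model (hypothetical product).
PRODUCT_ASSERTIONS = [("Classic Tee", "has color", "Blue")]

# Hypothetical SKU instances, each pointing back to its product concept.
SKU_TO_PRODUCT = {f"TEE-{number:04d}": "Classic Tee" for number in range(1, 2001)}


def facts_for_sku(sku, sku_to_product=SKU_TO_PRODUCT, assertions=PRODUCT_ASSERTIONS):
    """Return the (predicate, object) facts a SKU inherits from its product concept."""
    product = sku_to_product[sku]
    return [(predicate, obj) for subject, predicate, obj in assertions if subject == product]


print(len(PRODUCT_ASSERTIONS), "assertion covers", len(SKU_TO_PRODUCT), "SKUs")
print(facts_for_sku("TEE-0001"))
```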

Yes, it’s certainly true that given large amounts of data, machine learning models or text analytics can identify all kinds of important concepts. These concepts (and fact assertions between concepts) can be a great pipeline to feed into taxonomy and ontology construction. I am skeptical of machine learning models generating taxonomies and ontologies based on organizational data and content unless there is heavy human-in-the-loop curation to reconcile those absurdities which I believe inevitably creep in. And, yes, it’s certainly true that this curation is potentially at a tortoise pace, but once these concepts and assertions are built into semantic models, the ongoing maintenance and governance demands less time and effort.

Those slow semantic model builds enable fast-moving machine learning models and LLMs to be grounded in organizational truths, allowing for expansion, augmentation, and question-answering at a much faster pace but backed with foundational truths as asserted by your organization.

Be the tortoise first and foremost and the hare will follow.