“Before I built a wall I’d ask to know/ What I was walling in or walling out,/ And to whom I was like to give offense.” – Robert Frost, Mending Wall
What data should be included in taxonomy models can be difficult to define. One can see this in practice at organizations in which similar data exists in multiple systems without a common source. Scoping what is included in taxonomies can be written out as broad policies, but these often don’t capture the nuances of data use and overlap. Two helpful ways to decide what data should be included in taxonomies are to look at the purpose and scope of software platforms and to clearly define data use cases.
Scoping by System Purpose
We can begin to define what data belongs where by demarcating borders between systems. We can start by looking at what systems already exist in the organization and what they were brought in to accomplish. While systems may have overlapping functionality, vendors, who are experts in their product offerings, know best what their systems are intended to do and market their products accordingly. While some systems have broad capabilities in an attempt to act as Swiss Army knives for data, dedicated platforms are frequently better for the job they were designed to do.
Starting from the taxonomy platform, taxonomy and ontology management systems (TMS) are meant to house taxonomy instances and their ontological structures. This can be confusing for people who believe the TMS is where content tagging happens, but these systems are not intended to house external data tables or content. That fact is an important stake in the ground for defining system boundaries and their intent: when it comes to the TMS, the data and content always live elsewhere. External data can be tagged with taxonomy values as metadata, or relational data can be joined to graph data, but the content is meant to stay in place in its home platform. Other dedicated systems in the organization are meant to handle and manage specific domains of data: human resources systems handle personnel data, product information systems handle product information, and enterprise data governance systems centralize data for application of access and sensitivity classifications.
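The boundary described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: the `Concept` and `ContentItem` classes, the identifiers, and the `labels_for` helper are all hypothetical. The point is that the content record stays in its home platform and carries only concept identifiers, while labels and synonyms are resolved from the TMS.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A taxonomy concept as it might be held in a TMS (hypothetical shape)."""
    concept_id: str
    pref_label: str
    alt_labels: tuple = ()


@dataclass
class ContentItem:
    """A record that lives in its home platform, e.g. a CMS or DAM."""
    item_id: str
    home_system: str
    title: str
    concept_ids: list = field(default_factory=list)  # tags by reference only


# The TMS holds concepts; it never holds the content itself.
tms = {
    "c-001": Concept("c-001", "Outdoor Furniture", ("Patio Furniture",)),
    "c-002": Concept("c-002", "Teak"),
}

# The article stays in the CMS and points at concepts by identifier.
article = ContentItem("cms-42", "CMS", "Caring for teak benches", ["c-001", "c-002"])


def labels_for(item, concepts):
    """Resolve an item's concept IDs to preferred labels from the TMS."""
    return [concepts[cid].pref_label for cid in item.concept_ids if cid in concepts]


print(labels_for(article, tms))
```

Because the article stores only identifiers, a label or synonym can change in the TMS without touching the content record in the CMS.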
Personally identifiable information (PII) like employee names, social security numbers, and salaries requires the kind of security policies offered by dedicated human resource management systems. Other information of this type includes enterprise financial information, legal and contractual information and documents, and any other sensitive content that is housed in systems with functionality particular to those data types. This kind of data is not meant to be centralized with the intent to distribute widely for consumption by other systems, so these platforms don’t offer those capabilities. Not only is PII better handled by dedicated systems, but managing this kind of information in taxonomy systems also carries high risk for low reward. The high maintenance overhead of managing employee names and associated attributes in taxonomies is usually not worth the trouble.
Digital asset management (DAM) systems include a variety of metadata fields with different data entry types, including dropdown or typeahead controlled vocabulary fields, because string-based descriptors are essential for fully describing digital assets for findability and classification. Metadata application is so fundamental that a DAM that relied on another integrated system to supply controlled metadata would not be a minimum viable product for managing assets. Quite often, there is significant overlap between the descriptive metadata used to tag assets and values that could be centralized and used across the organization for a variety of use cases. Nonetheless, these systems are not meant to be enterprise taxonomy management systems, and their taxonomy capabilities are often rudimentary.
Another domain with potentially overlapping data values is product information housed in product information management (PIM) systems. Brand and product names, colors, materials, dimensions, geographical availability, and a host of different product attributes are great candidates for modeling in graphs of taxonomy instances. Other attributes, like SKUs, prices, and lengthy textual product descriptions, however, are difficult to manage in taxonomies, and doing so offers little advantage.
When systems are designed to manage content, digital assets, and product information, the capabilities often get murkier because vendors must offer overlapping functionality to create a viable standalone system. The content management system (CMS) SharePoint, for example, can manage taxonomies in the Term Store for tagging content with metadata within the SharePoint ecosystem (and shared across other Microsoft platforms). The Term Store was never meant to be an enterprise taxonomy management system, and, in fact, is nearly impossible to use as a centralized metadata hub outside the closed Microsoft ecosystem.
Unsurprisingly, overlapping functionality occurs even in systems built for a specific purpose. Software platforms are built to solve one or more business needs by different vendors whose primary aim is to sell product, not to design systems for selfless interoperability with other platforms. It doesn’t serve vendors to leave out functionality that another platform provides in the hope that consumers will build a complete ecosystem of different, singularly focused platform modules. Overlapping functionality may, in part, contribute to the confusion over what data lives where and for what purpose.
Scoping by Use Case
In addition to considering platform intent, we can also consider data use cases. Defining how data is to be used may be the most important consideration in determining whether it should be included in taxonomies. The general use case for using a dedicated taxonomy and ontology management system is to centralize metadata values and their attributes. The intent of the system is clear, but the scope of the data included is not. Simply having a dedicated taxonomy system lets us define use cases for delivering taxonomy values as metadata, and those use cases allow us to change, or perhaps amend, the case for what data should live in the system.
As I noted, some data is simply inappropriate or difficult to manage in taxonomies. On the other hand, any data that is meant for use in one or more systems and that can be reasonably and securely managed in taxonomies is potentially fair game for centralized management. A large platform footprint or wide reuse across systems is another good indicator that the data should be centralized for publication to multiple systems, for publication to one system from which it is carried with assets or data to other systems, or both.
Starting with data that is reused across multiple systems is one way to separate platform purpose from data use within the platform. Descriptive metadata for geographical locations, product names and attributes, organizational activities and processes, and topics for rich and textual content are just a few of the use cases which likely see data of the same type being used in many scenarios for many purposes. Concepts used for search typeahead suggestions, search results filtering, navigation, and description of the assets displayed to users are frequently going to be the same. If they are the same, then providing them from common taxonomies not only fixes the concept labels and synonyms but also allows for more advanced use cases like querying knowledge graphs, similarity search, and product recommendations. Having to tie separate but similar concepts together after the fact is a lost opportunity to measure concept and content performance and derive a host of analytic insights from user actions and internal planning decisions.
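One rough way to apply this reuse test is to inventory which systems use which values and flag the values that appear in more than one place. The sketch below assumes a hypothetical inventory (the system names and values are invented for illustration) and a simple threshold of two systems. A real assessment would also need to reconcile synonyms and spelling variants, which this example does not attempt.

```python
from collections import defaultdict

# Hypothetical inventory: metadata values observed in each system.
system_values = {
    "DAM": {"teak", "outdoor furniture", "germany"},
    "PIM": {"teak", "outdoor furniture", "sku-10293"},
    "CMS": {"outdoor furniture", "germany", "care guide"},
}


def reuse_by_value(inventory):
    """Map each normalized value to the sorted list of systems that use it."""
    usage = defaultdict(set)
    for system, values in inventory.items():
        for value in values:
            usage[value.strip().lower()].add(system)
    return {value: sorted(systems) for value, systems in usage.items()}


def centralization_candidates(inventory, min_systems=2):
    """Return values used by at least `min_systems` systems."""
    usage = reuse_by_value(inventory)
    return {v: s for v, s in usage.items() if len(s) >= min_systems}


candidates = centralization_candidates(system_values)
for value in sorted(candidates):
    print(value, candidates[value])
```

Values that appear in only one system, such as the SKU here, fall out of the candidate list, which matches the intuition that single-system data is better left in its home platform.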
Another use case consideration is the speed at which the data changes. Rapidly changing data can include time-based or transactional data which is created for one-time use or reflects a moment in time. While it is possible to include rapidly moving data in the taxonomies themselves, data with high velocity can also have high volatility. Rapidly created and changing data can disrupt the purpose of controlled semantic structures by introducing inaccuracies or representing fleeting moments of truth. Using slower-moving taxonomic structures as a semantic layer over quickly changing data can help provide veracity from a consistent source. Taxonomies are still in play, but are not disrupted as a reliable, stable source of truth.
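As a sketch of the semantic-layer idea, the example below keeps a small, slow-changing set of concepts separate from a stream of fast-changing transactions, which reference concepts only by identifier. All names, identifiers, and figures are invented for illustration.

```python
from collections import Counter

# Slow-moving taxonomy: changes rarely and under governance.
concepts = {
    "mat-teak": "Teak",
    "mat-oak": "Oak",
}

# Fast-moving transactional data: created constantly, never stored in the taxonomy.
orders = [
    {"order_id": 1, "material": "mat-teak", "quantity": 2},
    {"order_id": 2, "material": "mat-oak", "quantity": 1},
    {"order_id": 3, "material": "mat-teak", "quantity": 5},
]


def quantity_by_concept(transactions, taxonomy):
    """Aggregate transaction quantities under stable taxonomy labels."""
    totals = Counter()
    for row in transactions:
        # Unknown identifiers are grouped rather than added to the taxonomy.
        label = taxonomy.get(row["material"], "Unclassified")
        totals[label] += row["quantity"]
    return dict(totals)


print(quantity_by_concept(orders, concepts))
```

New orders can arrive at any rate without changing the taxonomy, and reports stay consistent because every order rolls up to the same governed labels.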
Though not absolute, concepts which don’t hold any meaning when presented alone should also be questioned before living in taxonomies. Numbers representing widths, lengths, distances, or product types, for example, have meaning in context but are unclear when presented alone. Navigational taxonomies or taxonomy-driven search filters may have values that are important to the end-user experience but lack clear meaning when viewed within the larger graph. If using models for machine learning, these types of values may add noise and should either not be built into the taxonomies in the first place or excluded from subgraphs made available for model use.
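Excluding such values from a subgraph can be sketched as a simple filter. The example below assumes each concept carries a hypothetical `kind` attribute marking navigational or bare numeric values. Real taxonomy tools may express this differently, for example through separate concept schemes or custom properties.

```python
# Each concept: (id, label, kind, broader_id). The `kind` field is an assumed convention.
concepts = [
    ("c1", "Furniture", "topic", None),
    ("c2", "Tables", "topic", "c1"),
    ("c3", "120", "numeric", "c2"),  # a width with no meaning on its own
    ("c4", "Shop by Room", "navigation", None),
    ("c5", "Teak", "topic", None),
]

EXCLUDED_KINDS = {"numeric", "navigation"}


def model_subgraph(all_concepts, excluded_kinds=EXCLUDED_KINDS):
    """Keep concepts suitable for model use and drop links to removed parents."""
    kept = [c for c in all_concepts if c[2] not in excluded_kinds]
    kept_ids = {c[0] for c in kept}
    return [
        (cid, label, kind, broader if broader in kept_ids else None)
        for cid, label, kind, broader in kept
    ]


subgraph = model_subgraph(concepts)
print([label for _, label, _, _ in subgraph])
```

The full taxonomy is left untouched, so the navigational and numeric values remain available to the user-facing experiences that need them.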
Any time concepts are added to taxonomies, consider the overall model, the many use cases it will serve, and whether the data makes sense in the total system and data framework.