Tag Archives: metadata

Defining Taxonomy Borders

04/18/2024 11:17

https://pixabay.com/photos/trees-farm-fence-farmland-2900064/

“Before I built a wall I’d ask to know/ What I was walling in or walling out,/ And to whom I was like to give offense.” – Robert Frost, Mending Wall

What data should be included in taxonomy models can be difficult to define. One can see this in practice at organizations in which similar data exists in multiple systems without a common source. Scoping what is included in taxonomies can be written out as broad policies, but these often don’t capture the nuances of data use and overlap. Two helpful ways to decide what data should be included in taxonomies are to look at the purpose and scope of software platforms and to clearly define data use cases.

Scoping by System Purpose

We can begin to define what data belongs where by demarcating borders between systems. We can start by looking at what systems already exist in the organization and what they were brought in to accomplish. While systems may have overlapping functionality, vendors, who are experts in their product offerings, know best what their systems are intended to do and market their products accordingly. While some systems have broad capabilities in an attempt to act as Swiss Army knives for data, dedicated platforms are frequently better for the job they were designed to do.

Starting from the taxonomy platform, taxonomy and ontology management systems (TMS) are meant to house taxonomy instances and their ontological structures. This can be confusing for those who believe the TMS is where content tagging happens, but a TMS is not intended to house external data tables or content. That fact is an important stake in the ground towards defining system boundaries and their intent: when it comes to the TMS, the data and content always live elsewhere. External data can be tagged with taxonomy values as metadata, or relational data can be joined to graph data, but the content is meant to live in place in its home platform. Other dedicated systems in the organization are meant to handle and manage specific domains of data: human resources systems handle personnel data, product information systems handle product information, and enterprise data governance systems centralize data for application of access and sensitivity classifications.

Personally identifiable information (PII) like employee names, social security numbers, and salaries requires the kind of security policies offered by dedicated human resource management systems. Other information of this type includes enterprise financial information, legal and contractual information and documents, and any other sensitive content which is housed in systems with functionality particular to the data types. This kind of data is not meant to be centralized with the intent to distribute widely for consumption by other systems, so these platforms don’t offer those capabilities. Not only is PII better handled by dedicated systems, but the risk involved in managing this kind of information in taxonomy systems is high and the reward is low. The high maintenance overhead of managing employee names and associated attributes in taxonomies is usually not worth the trouble.

Digital asset management (DAM) systems include a variety of metadata fields with different data entry types, including dropdown or typeahead controlled vocabulary fields, because string-based descriptors are essential for fully describing digital assets for findability and classification. Metadata application is such a necessary requirement that a DAM which assumed another system would be integrated to supply controlled metadata would not be a minimum viable product for managing assets. Quite often, there is significant overlap between the descriptive metadata used to tag assets and values that could be centralized and used across the organization for a variety of use cases. Nonetheless, these systems are also not meant to be enterprise taxonomy management systems and their taxonomy capabilities are often rudimentary.

Another domain with potentially overlapping data values is product information housed in product information management (PIM) systems. Brand and product names, colors, materials, dimensions, geographical availability, and a host of different product attributes are great candidates for modeling in graphs of taxonomy instances. Other attributes, like SKUs, price, and lengthy textual product descriptions, however, are difficult to manage in taxonomies and don’t offer great advantages by doing so.

When systems are designed to manage content, digital assets, and product information, the capabilities often get murkier because vendors must offer overlapping functionality to create a viable standalone system. The content management system (CMS) SharePoint, for example, can manage taxonomies in the Term Store for tagging content with metadata within the SharePoint ecosystem (and shared across other Microsoft platforms). The Term Store was never meant to be an enterprise taxonomy management system, and, in fact, is nearly impossible to use as a centralized metadata hub outside the closed Microsoft ecosystem.

Unsurprisingly, overlapping functionality occurs in systems which are built for a specific purpose. Software platforms are built to solve one or more business needs by different vendors whose primary aims are to sell product, not to design systems with selfless interoperability with other platforms. It doesn’t serve them to leave out functionality provided by another platform on the assumption that consumers will build a complete ecosystem of different, singularly-focused platform modules. Overlapping functionality may, in part, contribute to the confusion of what data lives where and for what purpose.

Scoping by Use Case

In addition to considering platform intent, we can also consider data use cases. Defining how data is to be used may be the most important consideration in determining whether it should be included in taxonomies. The general use case for using a dedicated taxonomy and ontology management system is to centralize metadata values and their attributes. The intent of the system is clear, but the scope of the data included is not. Simply having a dedicated taxonomy system lets us define use cases for delivering taxonomy values as metadata, and those use cases in turn let us change, or at least amend, the case for what data should live in the system.

As I noted, some data is simply inappropriate or difficult to manage in taxonomies. Any data that is meant for use in one or more systems that can be reasonably and securely managed in taxonomies, on the other hand, is potentially fair game for centralized management. A large platform footprint or heavy reuse across systems is another good indicator that the data should be centralized, whether for publication to multiple systems, for use in one system from which it is carried with assets or data to other systems, or both.

Starting with data that is reused across multiple systems is one way to separate platform purpose from data use within the platform. Descriptive metadata for geographical locations, product names and attributes, organizational activities and processes, and topics for rich and textual content are just a few of the use cases which likely see data of the same type being used in many scenarios for many purposes. Concepts used for search typeahead suggestions, search results filtering, navigation, and to describe the assets displayed to users are frequently going to be the same. If they are the same, then providing them from common taxonomies not only fixes the concept label and synonyms, but allows for more advanced use cases like querying knowledge graphs, similarity search, and product recommendations. Tying separate but similar concepts together after the fact is a lost opportunity to measure concept and content performance and derive a host of analytic insights from user actions and internal planning decisions.
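As a minimal sketch of the idea above, the same concepts can drive both typeahead suggestions and results filtering when both read from one shared list. The concept identifiers, labels, and asset records below are hypothetical and are not drawn from any particular taxonomy platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str          # hypothetical identifier scheme
    pref_label: str          # the single agreed label shown to users
    alt_labels: tuple = ()   # synonyms that should resolve to the same concept


# Hypothetical shared taxonomy consumed by search and by the asset system.
CONCEPTS = [
    Concept("geo:001", "Oakland", ("Oakland, CA",)),
    Concept("geo:002", "Las Vegas", ("Vegas",)),
    Concept("prod:001", "Running Shoes", ("Trainers", "Sneakers")),
]

# Hypothetical assets tagged with concept identifiers, not free-text strings.
ASSETS = [
    {"title": "Spring catalog", "tags": {"prod:001", "geo:001"}},
    {"title": "Store opening video", "tags": {"geo:002"}},
]


def typeahead(prefix, concepts=CONCEPTS):
    """Suggest concepts whose preferred label or any synonym starts with prefix."""
    needle = prefix.casefold()
    hits = []
    for concept in concepts:
        labels = (concept.pref_label, *concept.alt_labels)
        if any(label.casefold().startswith(needle) for label in labels):
            hits.append((concept.concept_id, concept.pref_label))
    return hits


def filter_assets(assets, concept_id):
    """Return titles of assets tagged with the given concept identifier."""
    return [asset["title"] for asset in assets if concept_id in asset["tags"]]
```

Because a synonym such as "Vegas" resolves to the same identifier used for tagging, a suggestion picked in typeahead can be passed straight to the filter without any after-the-fact matching of labels.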

Another use case consideration is the speed at which the data changes. Rapidly changing data can include time-based or transactional data which is created for one-time use or reflects a moment in time. While it is possible to include rapidly moving data in the taxonomies themselves, data with high velocity can also have high volatility. Rapidly created and changing data can disrupt the purpose of controlled semantic structures by introducing inaccuracies or representing fleeting moments of truth. Using slower-moving taxonomic structures as a semantic layer over quickly changing data can help provide veracity from a consistent source. Taxonomies are still in play, but are not disrupted as a reliable, stable source of truth.
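One way to picture a slower-moving semantic layer over fast-moving data is a small mapping from volatile source codes to stable concepts. This is only a sketch: the SKU codes, concept identifiers, and transaction records are invented for illustration.

```python
from collections import Counter

# Stable, governed taxonomy concepts (hypothetical identifiers and labels).
CONCEPT_LABELS = {
    "prod:001": "Running Shoes",
    "prod:002": "Jerseys",
}

# Mapping from fast-changing source codes to stable concepts. The codes come
# and go in the source system; the concepts they point to stay put.
SKU_TO_CONCEPT = {
    "SKU-1001": "prod:001",
    "SKU-1002": "prod:001",
    "SKU-2001": "prod:002",
}


def units_by_concept(transactions):
    """Roll transactional rows up to taxonomy concepts via the mapping."""
    totals = Counter()
    for row in transactions:
        concept_id = SKU_TO_CONCEPT.get(row["sku"])
        if concept_id is None:
            continue  # unmapped codes are skipped rather than guessed at
        totals[CONCEPT_LABELS[concept_id]] += row["units"]
    return dict(totals)


# Hypothetical transactional data that never enters the taxonomy itself.
transactions = [
    {"sku": "SKU-1001", "units": 2},
    {"sku": "SKU-1002", "units": 1},
    {"sku": "SKU-2001", "units": 4},
    {"sku": "SKU-9999", "units": 7},
]
summary = units_by_concept(transactions)
```

The transactions stay in their home system and only the mapping needs maintenance, so the taxonomy is not churned every time a new code appears.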

Though not absolute, concepts which don’t hold any meaning when presented alone should also be questioned before living in taxonomies. Numbers representing widths, lengths, distances, or product types, for example, have meaning in context but are unclear when presented alone. Navigational taxonomies or taxonomy-driven search filters may have values important to the end-user experience but make little sense when viewed within the larger graph. If using models for machine learning, these types of values may add noise and should either not be built into the taxonomies in the first place or excluded from subgraphs made available for model use.
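Excluding such values from a subgraph handed to a model could look something like the sketch below. It assumes, purely for illustration, that context-dependent values can be recognized by an identifier prefix; a real graph might mark them by concept scheme or class instead.

```python
# Hypothetical triples as (subject, predicate, object) identifier strings.
TRIPLES = [
    ("prod:001", "broader", "prod:000"),
    ("prod:001", "hasMaterial", "mat:010"),
    ("prod:001", "hasWidth", "dim:42"),    # a bare dimension value
    ("nav:003", "broader", "nav:000"),     # navigation-only structure
]

# Assumption: dimension and navigation values are identifiable by prefix.
EXCLUDED_PREFIXES = ("dim:", "nav:")


def model_subgraph(triples, excluded=EXCLUDED_PREFIXES):
    """Keep only triples whose subject and object are outside excluded prefixes."""
    def keep(node):
        # str.startswith accepts a tuple and matches any of its members
        return not node.startswith(excluded)

    return [t for t in triples if keep(t[0]) and keep(t[2])]
```

The full graph is left intact for search filters and navigation; only the view exposed to the model is narrowed.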

Any time concepts are added to taxonomies, consider the overall model, the many use cases it will serve, and whether the data makes sense in the total system and data framework.

Semantics and Risk

02/22/2024 10:47

“There are no facts, only interpretations.” – Friedrich Nietzsche

Behind every creation–a work of art, musical composition, outstanding sporting performance–there is a creator. Behind a creator’s creation are copyright laws and license agreements detailing how that creation can be bought, sold, represented, and reused. In retail, the works and likenesses of artists and creators appear on merchandise which may be globally distributed. Whether it is the recognizable Jordan Jumpman icon or Bob Marley’s face on a t-shirt, there are contractual obligations and regulations which must be followed to avoid risk and exposure for both the creator and the contracting company. The use of names, images, and likenesses is captured in contractual language, defining the way an image or other work can be used and providing the legal foundations to make sure royalties are paid to the artist and their representative company.

In many organizations that have arrangements with artists or athletes, there are dedicated systems to manage contract documents specially designed to handle the content and also apply the appropriate type of metadata to describe these assets. The digital assets protected by these contracts, such as works of art, images, video, audio, and the like, can be stored in a digital asset management (DAM) system, which specializes in maintaining technical, administrative, structural, descriptive, preservation, usage, and rights metadata. In digital visual works, these rights and permissions may be captured as rights metadata accompanying the image file. Similarly, audio works can carry this information on physical media, and, more commonly, embedded in the individual audio files and the bound collection of an album. Rights and permissions are thus often associated to the artistic creation in both physical and electronic formats.

The metadata to populate values in these platforms can be created and maintained in a taxonomy management system to ensure the application of consistent values in one or more consuming systems. Taxonomies can contain centralized, standardized values describing asset usage and rights and can also provide descriptive metadata values; that is, the metadata that describes what the content is about or what is represented in it. In the world of semantic technologies, taxonomies and ontologies model domains to provide meaning which can otherwise be lost when flattening or decontextualizing metadata attributes. Since taxonomy and ontology management systems were not designed to be content or digital asset management systems, the taxonomy values and the ontology structures which define their use can potentially become detached from the content they describe. The semantic model lives in one system for application across one or more consuming applications while the objects themselves live in those systems, likely containing a mix of system-specific metadata, like technical descriptions of asset size and format and descriptive metadata coming from a taxonomy.

Semantic modeling is an act of establishing veracity. Taxonomies and ontologies model the domains of the organization, including describing the concepts in taxonomies and the relationships between them. Agreeing on what preferred label form to use, which concepts are synonyms, and establishing hierarchical and associative relationships between concepts are all actions to model the truth; or, more accurately, your truth based on the knowledge domain and how it will be used. For instance, modeling fixed relationships between an authority file of athlete names and associated taxonomy concepts could produce the following statements between named entities:

Athlete name has team Team name

Athlete name has product Product name

Team name has geography Geographical location

Each of these statements asserts a truth between one or more concepts at any given time. When that truth changes, but the model doesn’t change to reflect it, there is a drift between the current legal standing stated in contracts and how this is represented in the organization’s schema. When creative works or artist likenesses are surfaced on a front end user experience, such as linking a search term to a product or presenting audio or video content in a streaming service, the assumption is that the party responsible for the platform is presenting content for which it is legally liable.
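One hedged way to keep statements like these from drifting is to record when each one holds. The sketch below attaches validity dates to each assertion; the athlete, team, and product names and all dates are placeholders, and real contract terms would come from the contract system rather than being typed into code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the statement is still in effect

    def holds_on(self, day: date) -> bool:
        """True if the statement is in effect on the given day (inclusive)."""
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


# Placeholder statements following the pattern in the text above.
ASSERTIONS = [
    Assertion("Athlete A", "has team", "Team X", date(2019, 1, 1), date(2022, 12, 31)),
    Assertion("Athlete A", "has team", "Team Y", date(2023, 1, 1)),
    Assertion("Athlete A", "has product", "Product P", date(2020, 6, 1), date(2023, 5, 31)),
]


def current_objects(subject, predicate, day, assertions=ASSERTIONS):
    """Return the objects asserted for subject and predicate that hold on day."""
    return [
        a.obj
        for a in assertions
        if a.subject == subject and a.predicate == predicate and a.holds_on(day)
    ]
```

Asking for the "has team" values on a 2021 date returns Team X, while the same question on a 2024 date returns Team Y, so earlier truths stay on record without being presented as current.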

Ironically, the attempt to standardize taxonomy values and their relationships for application across different systems to mitigate the risk of inaccurate metadata values being applied to content and to drive functionality like search and personalized recommendations can in fact introduce new risks. When defining the organization’s domains, it is important to bear in mind how that schema is applied to the actual “things” in other systems and what modeling a full domain truth might mean for how those assets are discovered and used. Attempts to model accurately and truthfully, when applied to content, can inadvertently reveal or persist truths which should not be exposed or have changed meaning over time.

For example, selling athlete merchandise associated with a team is common practice, since fans support teams and the athletes who play for those teams. However, facts which seem immutable, like team names and rosters, can change over time. I live in Oakland, California, home of the Raiders. Well, former home of the Raiders, who are the Las Vegas Raiders who were the Oakland Raiders who were the Los Angeles Raiders who were the Oakland Raiders. How many Raiders players have come and gone over the years and have played for multiple teams throughout their careers? Keeping on top of this kind of high-velocity data changing at scale can be extremely difficult. The same can be said of company-sponsored athletes. Contracts change and products once associated to or endorsed by an athlete may no longer be associated to that athlete. Selling one of these products out of contract can have serious financial and reputational repercussions.

Taxonomy and ontology governance processes must be firmly established and followed in order to make it possible to represent the company domain accurately while mitigating risk. A tight working relationship between the taxonomy team and business representatives in legal and/or marketing is a first priority. Whenever a contract changes, a request must be submitted to reflect this change in the semantic models so tagging moving forward is accurate. Changes to semantic modeling to reflect the new truth don’t only happen in the taxonomy management system. The change must be propagated to consuming systems and content on a known schedule. If the change is immediate, the tagging must be changed immediately. If the change goes into effect on a given date or at the end of an established period, the change must be made and pushed to meet that date.

Changing available concepts and tagging practices from a fixed point moving forward is relatively easy compared to other requirements, which may include the untagging or retagging of content to ensure it is not retrieved by search or otherwise discovered by consumers in the user experience. If an association between a team and an athlete is no longer valid, the company may have the opportunity to sell product through an established period of time, allowing the liquidation of as much outdated product stock as possible. After that time, however, discovering a product and purchasing it with the contract no longer being in effect can open the organization to significant risk.
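The sell-through window described above reduces to simple date arithmetic. The following is a sketch under assumed field names (`contract_end`, `sell_through_days`) and made-up products; actual windows are set by the contract language, not by this code.

```python
from datetime import date, timedelta

# Hypothetical product records carrying the contract terms that govern them.
PRODUCTS = [
    {"id": "P-100", "athlete": "Athlete A",
     "contract_end": date(2024, 3, 31), "sell_through_days": 90},
    {"id": "P-200", "athlete": "Athlete B",
     "contract_end": date(2025, 12, 31), "sell_through_days": 0},
]


def last_sellable_day(product):
    """Contract end date plus the agreed sell-through period."""
    return product["contract_end"] + timedelta(days=product["sell_through_days"])


def products_to_untag(products, today):
    """Return ids of products that must no longer be discoverable as of today."""
    return [p["id"] for p in products if today > last_sellable_day(p)]
```

Running a check like this on a schedule gives the taxonomy team a concrete list of content to untag or retag before it can be surfaced out of contract.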

The same risk applies to content not yet ready for prime-time. Any new product release, if tagged and made available to the user experience before a release date, can damage a company’s product launch and severely impact the time and planning that went into organizing an effective campaign. Internally, the risk still exists. If artist or product information is pre-planned in the organization’s metadata, there must be processes in place to keep this information known only to those who need to know it. That means restricting access to planned artist and product concepts to those who need to know, to avoid information leaks ahead of a release date. While organizations have contracts stipulating how internal employees handle private company information, it is possible for a product release statement or other content to be discovered internally and intentionally or unintentionally shared to social media sources.
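Permissioned access to planned concepts can be sketched as a filter on release date and group membership. The concept records, group names, and dates here are hypothetical; a real taxonomy platform would enforce this through its own access controls.

```python
from datetime import date

# Hypothetical planned and released concepts with their access rules.
PLANNED_CONCEPTS = [
    {"label": "Product P", "release_date": date(2024, 1, 15),
     "allowed_groups": {"launch-team"}},
    {"label": "Product Q", "release_date": date(2024, 9, 1),
     "allowed_groups": {"launch-team", "legal"}},
]


def visible_concepts(concepts, user_groups, today):
    """Return labels a user may see: released concepts, or embargoed ones
    for which the user belongs to at least one allowed group."""
    visible = []
    for concept in concepts:
        released = concept["release_date"] <= today
        cleared = bool(set(user_groups) & concept["allowed_groups"])
        if released or cleared:
            visible.append(concept["label"])
    return visible
```

A user outside the allowed groups sees Product Q only once its release date has passed, which keeps pre-planned metadata from leaking through internal search or tagging interfaces.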

The examples I’ve provided here are in the domains of copyrighted work, but modeling against risk and exposure is even more critical in industries like medicine, health and safety, and manufacturing, just to name a few. While modeling truth is the ultimate goal of creating and curating semantic models, all concepts and their relationships which may be used to power a user experience, such as seeking prescription medicine or medical advice, safety procedures and equipment, and best practices in manufacturing, need to be carefully evaluated before being put into production in an application. While financial and reputational risks and penalties can be damaging, they pale in comparison to inappropriately modeled information people rely on to make potentially life and death decisions.

True to their name, semantic models are meant to be meaningful and act as a source of truth. Actively maintaining taxonomies and ontologies to be in line with truth as it changes requires a vital set of governance processes which must be carefully planned and executed to ensure the company is not exposed to risk. Unfortunately, there may be domains which can’t be modeled in taxonomies because the supporting processes don’t exist or can’t be maintained and executed efficiently enough to keep risk low. If these domain areas are important to the organization, a strong case must be made to establish and maintain the processes and resources necessary to exploit beneficial modeling before the company is put at risk in the first place. An organization might not view semantic models as a potential source of risk, so it is up to semantic architects, taxonomists, and ontologists to be conscious of modeling repercussions and actively pursue governance processes to ensure their work is an accurate and up-to-date source of truth.