Data Cleansing & Categorization

Use Artificial Intelligence in procurement

The smarter way to bundle your suppliers and materials

Consolidation and clustering combine master data records on suppliers, categories, materials, and services into meaningful groups. Behind the scenes, these methods analyze text information such as a supplier’s name or a material description, along with other attributes of the master data record. These attributes range from addresses and bank information to weight, measurements, and color.

The Orpheus algorithms for consolidating and clustering master data break down the individual fields of a master data record into text fragments and attribute values. Redundant, incomplete, or imprecise elements are replaced where possible and otherwise deleted.

Supplier names, for example, often contain different abbreviations for the same legal form. Although the standard abbreviation for a limited company in Germany is “GmbH”, some records use “mbH” or spell out “Gesellschaft mit beschr. Haftung” instead. Equivalence classes for these types of ambiguities map all of these terms and abbreviations back to the same root (in this example, “GmbH”).
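The equivalence-class idea can be illustrated with a minimal Python sketch. The variant list, the regular expressions, and the function name normalize_legal_form are assumptions made for this illustration; they are not the actual Orpheus rules.

```python
import re

# Hypothetical equivalence class: every variant maps back to the root "GmbH".
# Longer variants come first so that they win over shorter ones.
LEGAL_FORM_VARIANTS = {
    "GmbH": [
        r"gesellschaft\s+mit\s+beschr(?:\.|änkter)\s*haftung",
        r"g\.?m\.?b\.?h\.?",
        r"mbh",
    ],
}

# One compiled pattern per root; the lookarounds keep matches on word boundaries.
_COMPILED = [
    (re.compile(r"(?<!\w)(?:" + "|".join(variants) + r")(?!\w)", re.IGNORECASE), root)
    for root, variants in LEGAL_FORM_VARIANTS.items()
]


def normalize_legal_form(name: str) -> str:
    """Replace known legal-form variants in a supplier name with their root."""
    result = name
    for pattern, root in _COMPILED:
        result = pattern.sub(root, result)
    # Collapse whitespace left over from the replacement.
    return re.sub(r"\s+", " ", result).strip()


print(normalize_legal_form("Firma Utilities Gesellschaft mit beschr. Haftung"))
```

Further legal forms would be added as additional roots in the same table, so the matching logic itself stays unchanged.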

Once this preparation is finished, the cleansed master data record is automatically matched against the other master data records. If two data records are 100% identical, they are recognized as duplicates. If they are similar but not identical, a cluster algorithm decides whether both data records should be consolidated into a group (i.e. a cluster).
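As a rough illustration of grouping similar records, the sketch below clusters supplier names with a string-similarity threshold and a union-find structure. The similarity measure (Python's difflib), the 0.85 threshold, and the sample name “Acme Logistics AG” are assumptions for this example; the source does not specify which cluster algorithm is used.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 and 1.0 (1.0 = identical)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def cluster_names(names: List[str], threshold: float = 0.85) -> List[List[str]]:
    """Group names whose pairwise similarity reaches the (assumed) threshold."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                # Merge the two clusters; the lower index stays the root.
                parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters: Dict[int, List[str]] = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


records = [
    "Firma Utilities GmbH",
    "Firma Utilites GmbH",   # misspelled variant
    "Acme Logistics AG",     # unrelated, hypothetical supplier
    "Firma Utilities GmbH",  # exact duplicate
]
print(cluster_names(records))
```

In this sketch exact duplicates and near matches end up in the same cluster. A production system would compare more attributes than the name and would compare all pairs only within pre-filtered candidate groups.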

Real-world example: Supplier consolidation

Step 1: Automated matching

Each master data record is compared field by field with the contents of other relevant data records. Each matching field contributes to a score, which adds up to 100% if all fields are identical. For each supplier record that is checked, this method produces several suggestions with varying scores. Suggestions with a strong match (for example, 90% or higher) can be accepted automatically and linked to the reference data record. The remaining automatic suggestions (for example, those scoring from 80% to just under 90%) flow into step 2.
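The scoring and threshold logic of step 1 can be sketched as follows. The field names and weights are hypothetical; only the example thresholds of 90% and 80% come from the text above.

```python
from typing import Dict

# Hypothetical field weights in percentage points; they add up to 100.
FIELD_WEIGHTS = {
    "name": 40,
    "street": 20,
    "city": 15,
    "postal_code": 15,
    "country": 10,
}


def match_score(record: Dict[str, str], candidate: Dict[str, str]) -> int:
    """Add up the weights of all fields that are filled and identical."""
    score = 0
    for field_name, weight in FIELD_WEIGHTS.items():
        left = record.get(field_name, "").strip().casefold()
        right = candidate.get(field_name, "").strip().casefold()
        if left and left == right:
            score += weight
    return score


def triage(score: int, auto_accept: int = 90, review: int = 80) -> str:
    """Route a suggestion: accept automatically, send to review, or reject."""
    if score >= auto_accept:
        return "accept"
    if score >= review:
        return "review"
    return "reject"


reference = {"name": "Firma Utilities GmbH", "street": "Hauptstr. 1",
             "city": "Musterstadt", "postal_code": "12345", "country": "DE"}
candidate = {**reference, "postal_code": "54321"}
print(match_score(reference, candidate), triage(match_score(reference, candidate)))
```

Here a field counts only when it is exactly equal after trimming and case folding. A fuzzy per-field comparison could replace the equality test without changing the triage step.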

Step 2: Scanning and browsing

A data operator at Orpheus (or at the client) reviews the data records whose suggestions are of acceptable quality (e.g. a confidence rate of 80% to 90%) for possible matches. The data operator either marks the suggestion that seems most plausible or rejects the suggestions without making a match. This semi-automated matching significantly increases the hit rate from step 1 with relatively little work. We compare this to the 80-20 rule: a small amount of manual review yields a large share of the additional matches. Running this second step, in other words, is well worth the effort.

DataCategorizer - automated consolidation, discover pooling potentials

Consolidating and clustering with DataCategorizer

DataCategorizer supports this process by identifying duplicate master data records (e.g. suppliers, materials), consolidating similar ones into clusters, and organizing them in hierarchies.

The example below illustrates the basic functionality of DataCategorizer. The application has identified different spellings for the same supplier “Firma Utilities GmbH” and has grouped these “children” under a single “parent” or cluster. Future reports will now primarily use the name of the consolidated parent (i.e. “Firma Utilities GmbH”). Users can, however, analyze the assigned children (i.e. “cluster members”) if a consolidated view is not desired.

This consolidation is a two-step process. Step 1 compares two supplier data records to identify and group any duplicates. Step 2 attempts to assign subsidiaries (i.e. children) to the respective “parent” companies.

DataCategorizer can apply different comparison strategies to identify duplicates. Suppliers can, for example, be treated as the same if their names and addresses match. Alternatively, it can compare the records using the Data Universal Numbering System (DUNS). The figure on the right illustrates these two variants. Users can, however, base these comparisons on any of the available attributes.
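The two comparison variants can be pictured as interchangeable strategy functions, as in the following sketch. The record layout, the function names, and the sample values are assumptions for illustration and do not reflect the DataCategorizer interface.

```python
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, Optional[str]]
Strategy = Callable[[Record, Record], bool]


def _norm(value: Optional[str]) -> str:
    """Trim and case-fold a field value; missing values become ''."""
    return (value or "").strip().casefold()


def _digits(value: Optional[str]) -> str:
    """Keep only the digits, so '12-345-6789' equals '123456789'."""
    return "".join(ch for ch in (value or "") if ch.isdigit())


def same_name_and_address(a: Record, b: Record) -> bool:
    """Variant 1: name and address fields are all filled and identical."""
    fields = ("name", "street", "postal_code", "city")
    return all(_norm(a.get(f)) and _norm(a.get(f)) == _norm(b.get(f)) for f in fields)


def same_duns(a: Record, b: Record) -> bool:
    """Variant 2: both records carry the same nine-digit DUNS number."""
    duns_a = _digits(a.get("duns"))
    return len(duns_a) == 9 and duns_a == _digits(b.get("duns"))


def is_duplicate(a: Record, b: Record, strategies: Iterable[Strategy]) -> bool:
    """Two records count as duplicates if any selected strategy matches."""
    return any(strategy(a, b) for strategy in strategies)


first = {"name": "Firma Utilities GmbH", "street": "Hauptstr. 1",
         "postal_code": "12345", "city": "Musterstadt", "duns": "12-345-6789"}
second = {"name": "F.U. GmbH", "street": "Nebenweg 2",
          "postal_code": "54321", "city": "Beispielstadt", "duns": "123456789"}
print(is_duplicate(first, second, [same_name_and_address, same_duns]))
```

Because each strategy is just a function over two records, a comparison on any other available attribute can be added to the list without touching is_duplicate.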

Technology made for big data

The consolidation and cluster module is designed as a client-server application. The server compares thousands to millions of master data records (e.g. supplier data records) fully automatically and recommends potential parent-child matches as well as the best way to group them.

Users can view the server-based consolidation iterations and make manual changes from the client application. This is also where the fine-tuning of the supplier hierarchies takes place. If the server could not identify any matching data records, users can create new hierarchies and match data records with different names by hand. After all, it is not always possible to identify legal relationships among companies simply based on their names.

The DataCategorizer consolidation module clusters supplier and material data fully or partially automatically and also supports manual consolidation.

Fully automated clustering requires no user interaction. The consolidation process can be scheduled to run periodically. The user only has to check the results and, if necessary, correct them manually.

Creating master data hierarchies

This process consolidates different forms or spellings of a supplier or material into groups (clusters) and arranges them into multi-level, tree-like structures.

This process creates, for example, parent-child relationships among suppliers. A reliable match, however, is not always possible without further information from outside the system. Many companies that are subsidiaries of a corporate group have completely different names. ATRADA AG, for example, is a subsidiary of TELKOM AG. Neither the names nor the addresses of the two companies, however, provide any clues about this relationship. The only option here is to check the list of subsidiaries, for example by researching online or consulting an expert in this field.
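A multi-level supplier hierarchy of this kind can be modeled as a simple tree. In the sketch below, the class SupplierNode and its methods are hypothetical, and the spelling variant “Atrada A.G.” is invented for illustration. The link between ATRADA AG and TELKOM AG is set by hand, mirroring the manual assignment described above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)  # compare nodes by identity; avoids recursive equality checks
class SupplierNode:
    name: str
    parent: Optional["SupplierNode"] = field(default=None, repr=False)
    children: List["SupplierNode"] = field(default_factory=list, repr=False)

    def add_child(self, child: "SupplierNode") -> "SupplierNode":
        """Attach child below this node, detaching it from any previous parent."""
        node: Optional["SupplierNode"] = self
        while node is not None:
            if node is child:
                raise ValueError("a node cannot become its own ancestor")
            node = node.parent
        if child.parent is not None:
            child.parent.children.remove(child)
        child.parent = self
        self.children.append(child)
        return child

    def root(self) -> "SupplierNode":
        """Return the top-level parent, i.e. the name used in consolidated reports."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    def descendants(self) -> Iterator["SupplierNode"]:
        """Yield all cluster members below this node, depth first."""
        for child in self.children:
            yield child
            yield from child.descendants()


group = SupplierNode("TELKOM AG")
subsidiary = group.add_child(SupplierNode("ATRADA AG"))      # manual assignment
variant = subsidiary.add_child(SupplierNode("Atrada A.G."))  # hypothetical spelling variant

print(variant.root().name)
print([node.name for node in group.descendants()])
```

Reports would aggregate on root(), while descendants() supports drilling down into the individual cluster members.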

Parts and services can also be consolidated into equivalence classes. Product families of certain vendors include, for example, different models of computers, furniture, or vehicles. Arranging these parts into hierarchies supplements the classification based on a standard schema (e.g. eCl@ss or UNSPSC). The benefits of analyzing this parallel dimension, however, rarely justify the substantial time and effort involved.

Creating master data hierarchies and processing automatic suggestions manually

The consolidation and cluster algorithms of DataCategorizer are designed to automatically sort, consolidate, and cluster a maximum number of master data records. From the client application, users can validate the server’s suggestions, manually arrange individual data records in the hierarchy, and examine the data records that could not be matched automatically.
