Troubling Misclassifications - What is it about?

Welcome to this two-part series on misclassifications of categorical variables! Sounds exciting, does it not? Well, with humans putting labels on everything from food and spices to entire economies and cultures, we find classes everywhere. Classifications can help us make sense of the world, but there are also plenty of examples where they lead to oversimplification or are used for discriminatory purposes. Taking the position, however, that we are doing something good with classifications, it is important that the class labels we have collected are correct! While error-free data is obviously desirable, in this post we will examine what the consequences might be if we are not so fortunate. This is followed by a second post presenting an alternative statistical method that can sometimes overcome the problems induced by the misclassifications!

Before we get started, a short note on what we mean by classifications. In this post series, a classification is a set of labels (classes) that are mutually exclusive and exhaustive. That is, an entity must belong to exactly one of the classes. So not to two classes, nor to zero classes. This differs from a categorization such as music genres where a song is usually categorized in multiple genres, or from botanical categorizations where a banana is both a fruit and a berry (botanical species groupings are, however, an example of a classification!).
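The "mutually exclusive and exhaustive" requirement can be made concrete with a tiny validation sketch. Everything here (the class names, the `is_valid_classification` helper) is made up for illustration, not part of any real library:

```python
def is_valid_classification(labels, classes):
    """Check that every entity is assigned exactly one known class.

    labels:  dict mapping each entity to a list of assigned classes
    classes: the set of allowed classes
    """
    # Mutually exclusive and exhaustive: exactly one class per entity,
    # and that class must be one of the known classes.
    return all(len(cs) == 1 and cs[0] in classes for cs in labels.values())


# A song tagged with two genres violates the definition of a classification:
genres = {"jazz", "rock"}
print(is_valid_classification({"song A": ["jazz"]}, genres))          # valid
print(is_valid_classification({"song B": ["jazz", "rock"]}, genres))  # not valid
```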

For examples of classifications with important societal implications, consider education level or company classification. These are useful because they can help decision-makers track the progress of policies or better understand how the economy develops. For company classification, there is, within the EU, a standardized system called NACE codes. These NACE codes provide a “classification of economic activities” and will be used as the main example further down. Two examples of NACE codes are 96.2, which corresponds to “Hairdressing, beauty treatment, day spa and similar activities”, and 56.1, which corresponds to “Restaurants and mobile food service activities”.

NACE Codes and Total Revenue

So let’s dive into an actual example with some actual data. We will be looking at a dataset of company entries containing the yearly revenue and the NACE code for each company. Using this data, we are particularly interested (or will pretend to be ;)) in the total yearly revenue of each company class. I sadly cannot provide any details on the source of this data, to protect the integrity of the individual companies, but it works well for illustrative purposes and permission has been granted for this use case. What is important to know about this data is that its NACE codes have been audited, so we can assume that the codes given here are correctly classified. This matters because it means we can study the effect of misclassifications by artificially adding them while having access to the ground truth. This dataset will be referred to as just the dataset throughout the post.

With a new dataset at hand, what is the first thing you do? You visualize it! Below I have plotted a probability plot of the log10 revenue for four different NACE company classes: beauty treatment companies, restaurants, architects and shops specializing in selling bicycles and mopeds. The probability plot is made by applying kernel density estimation to each class and then scaling each density in proportion to the class’ number of occurrences in the dataset, such that the areas of all curves sum to one. If you have not heard of kernel density estimation, which is totally reasonable, you can think of it as a kind of smooth histogram! Looking at this plot we can for example see how beauty treatment companies tend to have revenues between 3 000 € and 300 000 €, while shops specializing in selling bicycles and mopeds tend to have revenues between 100 000 € and 3 000 000 €. We also see that the area of the beauty treatment curve is bigger than that of the bicycle and moped shops, which indicates that the former is more frequently represented in the dataset than the latter.

[Figure: probability plot of log10 yearly revenue per company classification: 96.02.2 Beauty treatment, 56.10.1 Restaurants, 71.11.1 Architect, 47.64.1 Selling bicycles and mopeds.]
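The scaled-KDE construction behind the probability plot can be sketched as follows. Since the real dataset cannot be shared, the revenue numbers below are synthetic draws standing in for two of the classes (only the NACE labels are real):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in data: log10 revenues roughly matching the ranges
# described in the text (the numbers are made up, the labels are real codes).
rng = np.random.default_rng(0)
log_revenues = {
    "96.02.2 Beauty treatment": rng.normal(4.0, 0.5, 300),              # ~3k-300k EUR
    "47.64.1 Selling bicycles and mopeds": rng.normal(5.5, 0.4, 100),   # ~100k-3M EUR
}
n_total = sum(len(v) for v in log_revenues.values())

grid = np.linspace(2, 8, 600)
scaled_densities = {}
for label, values in log_revenues.items():
    kde = gaussian_kde(values)  # the "smooth histogram" for this class
    # Scale each density by the class share so all curve areas sum to one.
    scaled_densities[label] = kde(grid) * len(values) / n_total
```

Plotting each entry of `scaled_densities` against `grid` (e.g. with matplotlib) reproduces the style of plot above: the larger class gets proportionally more area.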

So far so good. Because this data is audited, we could use it as-is to calculate, for example, the total revenue among restaurants (around 2 billion € in this dataset). But to explore what could happen in reality, without the auditing, we will artificially misclassify some of the samples. It turns out that if we misclassify about 25% of the samples evenly among the other classes (so in this case around 25/3 ≈ 8.3% of each class’ samples end up in each of the other three classes), we could end up with something like the misclassified plot below.

[Figure: probability plots of log10 yearly revenue for the true classification (left) and the misclassified data (right), for the same four company classes.]
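The misclassification scheme just described can be sketched like this: each sample keeps its true class with probability 1 − p, and otherwise moves to one of the other three classes with equal probability. The `misclassify` helper is my own illustration, not code from the post:

```python
import numpy as np

CLASSES = ["96.02.2", "56.10.1", "71.11.1", "47.64.1"]

def misclassify(true_labels, p_error=0.25, seed=None):
    """Flip each label with probability p_error, uniformly to another class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(true_labels, dtype=object).copy()
    flip = rng.random(len(labels)) < p_error
    for i in np.where(flip)[0]:
        # Choose uniformly among the OTHER classes, so a flipped sample
        # never keeps its true label (p_error/3 per wrong class here).
        others = [c for c in CLASSES if c != labels[i]]
        labels[i] = rng.choice(others)
    return labels
```

Applied to the audited labels, roughly 25% of the samples end up carrying a wrong code, which is exactly the perturbation shown in the right-hand panel.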

Now, it is not strange that, if this happens, a statistic such as the total revenue per class changes. This is exactly what we see in the plot below, where the total restaurant revenue drops by almost 0.4 billion € and the total revenue for beauty treatment companies goes up by around 0.3 billion €. Interestingly, the effect of misclassifying shops selling bicycles and mopeds as other companies, and that of misclassifying other companies as shops selling bicycles and mopeds, happen to cancel each other out in this scenario, and we see only a small change in that class’ total revenue.

[Figure: total yearly revenue per company class under the true classification versus the misclassified data.]
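The naive estimator is nothing more than a group-and-sum on the observed labels. A toy example with made-up numbers shows how revenue silently migrates between classes when labels are wrong:

```python
import pandas as pd

# Made-up toy data: five companies, two of which carry a wrong label.
df = pd.DataFrame({
    "true_class":     ["A", "A", "B", "B", "B"],
    "observed_class": ["A", "B", "B", "B", "A"],  # rows 1 and 4 misclassified
    "revenue":        [100.0, 200.0, 50.0, 70.0, 80.0],
})

true_totals  = df.groupby("true_class")["revenue"].sum()
naive_totals = df.groupby("observed_class")["revenue"].sum()

print(true_totals)   # A: 300, B: 200
print(naive_totals)  # A: 180, B: 320
```

Note that the grand total is unchanged: misclassification only moves revenue between classes, which is also why the effects for one class can happen to cancel out, as they did for the bicycle and moped shops.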

The above plot showed what happened for one specific outcome of misclassifications. In reality, any other outcome could have occurred, so it is more representative to look at the spread of outcomes. Such a spread can be visualized with a violin plot, which is what we see below. The dots inside the violins are the means of the outcomes, and we see that in general these do not align with the estimands (estimand is the word for what we want to estimate; in this case, the total revenue of each class). If we refer to the procedure of simply using the potentially misclassified classes as the naive estimators, then this difference shows the bias of the naive estimators (plural, because it is one estimator per class). Seeing such bias raises the follow-up question: can we construct unbiased estimators instead, and do these perform better?

[Figure: violin plot of the total yearly revenue per company class under the true classification versus the spread of misclassification outcomes, with dots marking the mean of each estimator.]
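The Monte Carlo procedure behind such a violin plot can be sketched in a few lines: repeat the random misclassification many times and record the naive per-class totals each time. The data below is synthetic; only the procedure mirrors the post:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 2000, 4
true_labels = rng.integers(0, k, size=n)
revenue = 10 ** rng.normal(4.5, 0.6, size=n)  # hypothetical revenues in EUR

def naive_totals(labels):
    """Group-and-sum the revenues by the given (possibly wrong) labels."""
    return np.array([revenue[labels == c].sum() for c in range(k)])

estimand = naive_totals(true_labels)  # the per-class totals we want to estimate

outcomes = []
for _ in range(200):  # one misclassification outcome per repetition
    noisy = true_labels.copy()
    flip = rng.random(n) < 0.25
    # Shift each flipped label by 1..k-1 (mod k): always lands on another class.
    noisy[flip] = (noisy[flip] + rng.integers(1, k, size=flip.sum())) % k
    outcomes.append(naive_totals(noisy))
outcomes = np.array(outcomes)

bias = outcomes.mean(axis=0) - estimand  # per-class bias of the naive estimators
```

The distribution of each column of `outcomes` is what a violin shows, and `bias` is the gap between the violin's mean dot and the estimand.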

To spoil the answer: sometimes we can! Exactly how, I will cover in the following post, but as a teaser, the plot below shows how our new method, EM Bootstrap, seems to yield unbiased, albeit higher-variance, estimates compared to the naive estimators. See you then!

[Figure: total yearly revenue estimates per company class for the naive and EM Bootstrap estimators, with their means compared against the true values.]

Key takeaways

  • Misclassifications of categorical data mean that some of the observed classes are wrongly labeled.
  • NACE is the standard company classification system in the EU.
  • Using an audited dataset, where we are certain of the classifications, we can artificially apply misclassifications to study the impact on various statistics.
  • Misclassifying 25% of the samples in our company dataset had a substantial impact on the total revenue per class.