Misclassifications Part 2 - Statistical Model

Welcome back to post two of the misclassification series! I am glad you made it here! In the last post, we presented how misclassifications cause statistics such as class totals to become biased. In this post, we will model the problem mathematically and present an algorithm that yields unbiased estimates. As a heads up, this post is quite math-heavy and, for the sake of conciseness, I can't describe every step. I will, however, try to add links for background reading where relevant. Other than that, an intuitive understanding of the problem from the last post will help in understanding the modeling choices!

What is modeling? What is our model?

So, briefly, what do we mean by modeling? Modeling is the process of taking some, usually real-world, process and representing it with mathematical definitions and relations. This is done both to describe the problem precisely and to gain access to a plethora of existing theory and algorithms that can help solve it. In our problem, we have classifications, misclassifications and a continuous variable of interest, so these are the quantities we need to represent, together with their relations.

Classes and misclassifications

In the last post, we introduced NACE codes that classify companies into various economic activities. Here we will treat the class (NACE code) of an entity (company) as an outcome of a random variable $Z$. $Z$ specifies one of $K$ classes, $\mathrm{Range}({Z}) = \{1, \dots, K\}$, and while it might seem odd to treat $Z$ as random, doing so gives us the flexibility to talk about the uncertainty of the class.

With $Z$ being the true class, we let $\hat{Z}$ be the potentially misclassified categorical random variable that we actually observe (also taking on values $1,\dots, K$). We assume in the modeling that the misclassification depends only on the true class $Z$, and thus we can define a misclassification matrix $P$ whose entries $P_{a_1,a_2}$ describe the probability of a true class $Z=a_1$ being misclassified as $\hat{Z}=a_2$. That is, we have $P$ as

\[ P:=\begin{bmatrix} P_{1,1} & P_{1,2} & \dots & P_{1,K} \\ P_{2,1} & P_{2,2} & \dots & P_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ P_{K,1} & P_{K,2} & \dots & P_{K,K} \end{bmatrix} \in \mathbb{R}^{K\times K}_{\geq 0}, \quad \quad P_{a_1,a_2} := p(\hat{Z} = a_2 | Z=a_1) \]

where $\mathbb{R}^{K\times K}_{\geq 0}$ denotes $K\times K$ matrices with non-negative entries. Furthermore, because each row specifies a conditional distribution over the whole outcome space, $a_2\in \{1,\dots,K\}$, the rows of $P$ sum to 1. This turns out to be an important constraint if you want to derive the algorithm correctly (it is used in the appendix of this post)!
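To make this concrete, here is a hypothetical misclassification matrix for $K=3$ classes (the numbers are made up purely for illustration); note how every row sums to 1:

$$ P = \begin{bmatrix} 0.90 & 0.06 & 0.04 \\ 0.10 & 0.85 & 0.05 \\ 0.02 & 0.08 & 0.90 \end{bmatrix} $$

Reading the first row: a company whose true class is 1 keeps its label with probability $0.90$ and is misclassified as class 2 or 3 with probability $0.06$ and $0.04$, respectively.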

With the $K$ classes we also need to specify the probability that a company is in each class if you know nothing else about it. We denote this quantity by $\alpha_k$, defined as

$$ \alpha_k := p(Z=k) $$

with the constraint that $\sum_{k=1}^K \alpha_k =1$.

A continuous companion

While classifications such as NACE codes are helpful for grouping entities into categories, their practical use is often intertwined with other variables. Here, we introduce $Y$, a practically continuous random variable that might represent a company’s yearly revenue or any other quantity of interest. (By practically continuous we are referring to variables such as currency that actually are discrete, 1 cent being the smallest unit for euros and dollars, but that practically can be considered continuous without inducing significant error.) In our model, $Y$ is associated with the true class $Z$, so its distribution varies depending on the class.

We assume that for a given true class $Z = k$, the conditional density $p(Y = y \mid Z = k)$ follows a Gaussian Mixture Model (GMM). This choice lets us model almost arbitrary data distributions for each class thanks to the flexibility of GMMs (formally, in the sense that they are dense in the $L_2$ function space). Writing the normal density with mean $\mu_{k,j}$ and standard deviation $\sigma_{k,j}$ as $\phi(y; \mu_{k,j}, \sigma_{k,j})$, we can write the conditional density as:

$$ p(Y = y \mid Z = k) = \sum_{j=1}^{q_k} \xi_{k,j} \phi(y; \mu_{k,j}, \sigma_{k,j}), $$

where $\xi_{k,j}$ is the weight of Gaussian mixture component $j$ for class $k$, and $q_k$ is the number of Gaussian components used for class $k$. The weights satisfy $\sum_{j=1}^{q_k} \xi_{k,j} = 1$ and $q_k$ can vary between classes (allowing two classes to have very different distributions). A higher $q_k$ allows a more complex and flexible representation while a smaller $q_k$ corresponds to a simpler model. In this work, we choose $q_k$ based on the Bayesian Information Criterion (BIC), which picks the model complexity that hopefully generalizes best in the sense of not overfitting the observed data.
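As a concrete illustration of how the $q_k$ could be chosen in practice, here is a minimal sketch using scikit-learn's GaussianMixture and its BIC score. The function name, the candidate range for $q_k$, and the usage line are my own assumptions, not fixed by the model:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmm(y_class, max_components=5, seed=0):
    """Fit GMMs with 1..max_components components to the 1-D data of one
    class and keep the fit with the lowest BIC."""
    y_class = np.asarray(y_class).reshape(-1, 1)  # sklearn expects 2-D input
    best_gmm, best_bic = None, np.inf
    for q in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=q, random_state=seed).fit(y_class)
        bic = gmm.bic(y_class)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    return best_gmm  # weights_, means_, covariances_ correspond to xi, mu, sigma^2

# Hypothetical usage: y holds log10 revenues, z_hat the observed class labels.
# gmms = {k: fit_class_gmm(y[z_hat == k]) for k in np.unique(z_hat)}
```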

Within GMMs, a sample can be seen as being an outcome from one of the mixture components. To represent which component an entity’s $Y$ was sampled from, we introduce a random variable $M$. For a given true class $Z = k$, $M$ specifies the Gaussian component $j$ (where $j \in \{1, \dots, q_k\}$) from which the observed $Y$ was drawn. The distribution of $M$ for class $k$ is determined by the weights $\xi_{k,j}$, such that:

$$P(M = j \mid Z = k) = \xi_{k,j}.$$

Given $M = j$ and $Z = k$, the distribution of $Y$ then becomes:

$$p(Y = y \mid Z = k, M = j) = \phi(y; \mu_{k,j}, \sigma_{k,j}).$$

Model summary

This was the statistical model! Good job getting this far 💪😤. Two last things: when working in a context of $N$ data points we denote the random variables with a subindex $i$, like $Y_i$, $Z_i$ and $\hat{Z}_i$. Secondly, all the parameters mentioned will sometimes be referred to collectively as $\theta$ to make the notation more compact.

As a summary, and to specify some more assumption details, we can note the following (a small simulation sketch of the model follows the list):

  • The random variables are $Y$, $Z$, $\hat{Z}$ and $M$. Or, when working in a context of $N$ data points they are $\{Y_i\}_{i=1}^N$, $\{Z_i\}_{i=1}^N$, $\{\hat{Z}_i\}_{i=1}^N$, and $\{M_i\}_{i=1}^N$.
  • The parameters are $P$, $(\alpha_1, \dots, \alpha_K)$, $(\sigma_{1,1}, \dots, \sigma_{1, q_1}, \sigma_{2,1}, \dots, \sigma_{2,q_2},\dots,$$\sigma_{K,1}, \dots, \sigma_{K, q_K})$, $\{\xi_{a,b}\}$ and $\{\mu_{a,b}\}$, where $\{\xi_{a,b}\}$ and $\{\mu_{a,b}\}$ are indexed in the same way as $\{\sigma_{a,b}\}$. These are collectively referred to as $\theta$.
  • The dependencies between the random variables are specified by:
    • $p(Y=y | Z=k, M=m; \theta) = \phi(y; \mu_{k,m}, \sigma_{k,m})$
    • $p(M=b | Z=a) = \xi_{a,b}$
    • $p(\hat{Z} = a_2 | Z=a_1) = P_{a_1,a_2}$
    • $p(Z=a) = \alpha_a$
    • Given $Z$, $\hat{Z}$ is independent of $Y$ and $M$.
    • The tuples $(Y_i, Z_i, \hat{Z}_i, M_i)$ are assumed to be independent across different $i$.
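To tie the pieces together, here is the promised small simulation sketch of the generative model. All parameter values are arbitrary toy numbers chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters for K = 2 classes (arbitrary values, purely illustrative).
alpha = np.array([0.7, 0.3])                      # p(Z = k)
P = np.array([[0.9, 0.1],                         # p(Z_hat = a2 | Z = a1)
              [0.2, 0.8]])
xi = [np.array([1.0]), np.array([0.5, 0.5])]      # mixture weights per class
mu = [np.array([4.0]), np.array([5.0, 6.5])]      # component means
sigma = [np.array([0.3]), np.array([0.4, 0.2])]   # component standard deviations

def sample_one():
    z = rng.choice(2, p=alpha)             # true class
    z_hat = rng.choice(2, p=P[z])          # observed, possibly misclassified, class
    m = rng.choice(len(xi[z]), p=xi[z])    # mixture component within class z
    y = rng.normal(mu[z][m], sigma[z][m])  # continuous variable, e.g. log10 revenue
    return y, z, z_hat

data = [sample_one() for _ in range(5)]
```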

The reward of doing all of this is that we now have something concrete and well-defined to work with. With this new notation, we will do a few things:

  1. We will look back on the last post and define relevant estimators properly
  2. We will argue what would be needed to get an unbiased estimator
    1. Spoiler: it is $p(z_i|y_i,\hat{z}_i; \theta)$
  3. And finally, we will present an algorithm that yields unbiased estimates

Looking back on the last post - The estimators

From the last post, we had a dataset with $\log_{10}$ yearly revenues and the NACE code classification for a range of companies. We denote these yearly revenues by $y_i$ and the true NACE code by $z_i$ for each company $i$. These can be seen as the real-world realizations of the $Y_i$ and $Z_i$ variables.

If we allow ourselves to use the true classes $z_i$, then we can easily calculate the total yearly revenue, $R_k$, for class $k$ by the sum $R_k =\sum_{i=1}^N \mathbb{1}(z_i=k)\, 10^{y_i}$, where $\mathbb{1}(\cdot)$ denotes the indicator function. Outside of a case study with manually audited samples, however, the $z_i$ are not available and instead we only have access to $\hat{z}_i$. Using $\hat{z}_i$ we can define the naive estimator as $\hat{R}_k =\sum_{i=1}^N \mathbb{1}(\hat{z}_i=k)\, 10^{y_i}$, which arguably is the easiest estimator we can construct. In the last post, we empirically saw how this estimator $\hat{R}_k$ is biased when estimating $R_k$, and thus the question arises whether we can do better… Let’s call this hopefully improved estimator $R^*$. So, any ideas?
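In code, the true total and the naive estimator amount to something like the sketch below. The array names are mine, with y holding the $\log_{10}$ revenues and z, z_hat the true and observed class labels:

```python
import numpy as np

def class_total(y_log10, labels, k):
    """Total yearly revenue for class k: sum of 10**y over rows labelled k."""
    y_log10 = np.asarray(y_log10)
    labels = np.asarray(labels)
    return np.sum(10.0 ** y_log10[labels == k])

# R_k    = class_total(y, z, k)      # needs the (usually unknown) true classes
# Rhat_k = class_total(y, z_hat, k)  # the naive estimator from observed classes
```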

As potentially spoiled before, when not having access to $z_i$, the second best thing we can have are the probabilities of $z_i$ given what we have observed of $y_i$ and $\hat{z}_i$. In notation, that is $p(z_i|y_i,\hat{z}_i; \theta)$. $\hat{R}_k$ was biased because it relied directly on $\hat{z}_i$, so we can imagine that if we instead work with the probabilities of $z_i$, we can come up with an unbiased analytical or sampling approach for our $R^*$ estimator.

Working backwards from the goal of calculating $p(z_i|y_i, \hat{z}_i; \theta)$, what is unknown here are the parameters $\theta$ that make up the model. The parameters include: the misclassification probabilities between classes, the proportions of each class, and the details of each class’ probability distribution. A common approach when you have data and want to estimate parameters is maximum likelihood estimation, which essentially chooses the parameters that make the observed outcome the most probable. The problem with this approach is that it is usually intractable in the presence of latent (hidden) variables. The latent variables here are the $Z_i$ that we never observe, which makes plain maximum likelihood inapplicable. Out of luck!

Thankfully, this is a reasonably common scenario, and if you have ever had to work with latent variables you have probably also been in contact with the great expectation–maximization (EM) algorithm, which will be today’s savior!

Expectation and Maximization

While the expectation–maximization algorithm is a wonderful algorithm, we cannot delve too deeply into it here. What should be known about it, however, is that it is an iterative algorithm that helps us find local optima of the likelihood function with respect to the parameters. Essentially it tries to do the same as the maximum likelihood approach, with the advantage of being applicable to problems with latent variables, but with the tradeoff that it only finds a local optimum and not necessarily the global one.

Using the EM algorithm we can start from an initial guess of the parameters and iteratively move towards a new set of parameters that better describes the observed data. The whole methodology so far is summarised in the image below. In the top row of the image we see how there is a true state, but because of ‘reality’ we only end up observing the upper-right, misclassified state. Using the observed state we fit a model to it, to be used as an initial guess for the EM algorithm (the modeling arrow). Lastly, we run the EM algorithm to hopefully end up with a state that is similar to the true distribution (the EM arrow and the assumption approximation).

[Figure: probability distributions of $\log_{10}$ yearly revenue for the company classifications 96.02.2 Beauty treatment, 56.10.1 Restaurants, 71.11.1 Architect and 47.64.1 Selling bicycles and mopeds, shown in four panels: the true classification, the misclassified data (Reality arrow), the initial statistical model fitted to the observed data (Modeling arrow), and the final statistical model after running EM (EM and Assumption arrows).]

Too good to be true - The need for audit samples

While this explains how the algorithm could work, it turns out that in my testing the EM algorithm often has difficulty converging to the desired parameters and thus ends up predicting an incorrect underlying distribution. To aid the EM algorithm we will have to make one further addition to the model: we need to provide true classes for some of the samples. This means that we actually know $z_i$ for, let’s say, $5\%$ of the samples. We will call these samples, for which the algorithm knows the true classification, the audit samples. Note that in the last post we also talked about audited samples, but then it was in the context that, for our specific dataset, we were confident that the given classes were the true classes.

I observed that audit samples were not necessary when the underlying distributions of the classes were well separated, but they were needed for more involved scenarios. As for the practical complications this causes, I would think they differ significantly between domains. In the case of NACE codes for companies it is, as I have understood it, common for national statistics institutes to manually go through x% of the data to assess its quality. If that is the case, no additional work would be needed, and it is just good that our algorithm uses all of the provided information. For stakeholders that normally wouldn’t audit part of their data, however, this could be a large additional economic expense limiting the usability of the algorithm. While still untested, I can imagine that if we add more auxiliary data variables than $y$, the need for manually audited samples is reduced.

Getting unbiased estimates

Okay, so assuming we have our audit samples and have applied the EM algorithm, what do we do with the newfound $\theta$ parameters? Conceptually, we now have “perfect knowledge” of the problem in some sense, since all aspects of the generating distribution are known. This is true at least under the assumption that our mixture of Gaussians can fit the underlying distribution well enough and that the parameters have converged to the global optimum (of which there is no guarantee, but the audit samples help) or a good enough local one.

With “perfect knowledge” one would think that there is some analytical expression for an unbiased estimator of the statistic in question. And maybe there is… if you find one, tell me! Instead of going down that route, what I have done is take a computational scheme laid out by Li et al.1 (the article is behind a paywall, I’m afraid) that, at least for some statistics, has been proven to yield unbiased estimates. The methodology in these posts is actually based on the work presented by Li et al., except that I have generalized it to multiple classes while Li et al. showed its use for 2 classes.

Li et al. called the algorithm the “EM Bootstrap Algorithm” and you can see it below (with slight modifications from the original regarding its presentation):

The EM Bootstrap Algorithm
1: Input: Observations $(y_i, \hat{z}_i)$ for $i = 1, \dots, N$, loop iterations $S_1$, $S_2$, and statistic function $\zeta$
2: Estimate model parameters $\theta$ with the EM algorithm, using the observations and the audit samples
3: Calculate $p(z_i \mid y_i, \hat{z}_i; \theta)$ for every unit $i$ in the dataset
4: for $s_1 = 1, \dots, S_1$ do
5:   Sample $\tilde{z}_i$ from $p(z_i \mid y_i, \hat{z}_i; \theta)$ for every unit $i$ in the dataset
6:   Calculate the corresponding $\tilde{\zeta}_{s_1}$ based on $\tilde{z}$
7:   for $s_2 = 1, \dots, S_2$ do
8:     Generate $\tilde{z}^*_i$ by $P$ (part of $\theta$), from $\tilde{z}_i$, for every unit $i$ in the dataset
9:     Calculate the corresponding $\tilde{\zeta}^*_{s_1,s_2}$ based on $\tilde{z}^*$
10:   end for
11: end for
12: Calculate $\textbf{Bias}_{\textbf{estimate}}$ and $\textbf{Var}_{\textbf{estimate}}$ based on the $\tilde{\zeta}$ and $\tilde{\zeta}^*$ simulations, using the standard sample mean and variance
13: Output: $\textbf{Bias}_{\textbf{estimate}}$, $\textbf{Var}_{\textbf{estimate}}$

Some comments to help digest this (a code sketch of the loops follows the list):

  1. $S_1$ and $S_2$ control the sampling sizes. In the proof of unbiasedness you let these tend to infinity, but in practice doing more than 100 iterations made little difference for me.
  2. The algorithm starts after the EM algorithm, so in total we are chaining two algorithms.
  3. $\zeta$ represents an arbitrary statistic. In our case, it is like $\hat{R}$ but uses whatever classes $z$ it is provided rather than the observed classes from the dataset.
  4. The outer loop over $S_1$ can be seen as simulating possible real world classes $z$ by repeatedly sampling from $p(z_i|y_i, \hat{z}_i; \theta)$. This is then used to calculate the $\tilde{\zeta}$ which represents that simulated world’s true statistic.
  5. Given the outer simulated state of the world, the inner loop then simulates the possible misclassifications that could have happened using the misclassification matrix $P$. With these misclassifications, the naive estimator $\tilde{\zeta}^*$ is calculated.
  6. To get the bias of the naive estimator the $\tilde{\zeta}^*$ is averaged over possible misclassifications and its difference to $\tilde{\zeta}$ is averaged over possible $z$ classes.
  7. The algorithm does not output $R^*_k$; instead, it outputs the bias (and its variance) of $\hat{R}_k$. But with the bias we can form $R^*_k$ as $R^*_k=\hat{R}_k - \textbf{Bias}_{\textbf{estimate}}$. The algorithm outputs the bias because its magnitude is often of interest, but if we only wanted an unbiased estimator we actually wouldn’t need the inner loop and could have averaged just over $\tilde{\zeta}$.
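To make the loop structure concrete, here is a minimal sketch of steps 3–12, assuming the parameters are already fitted so that the posteriors $p(z_i \mid y_i, \hat{z}_i; \theta)$ are available as an $(N, K)$ array. The names (posterior, em_bootstrap_bias) and the class-total statistic are my own choices, and the variance here is simply the sample variance of the simulated differences:

```python
import numpy as np

def em_bootstrap_bias(posterior, P, y_log10, k, S1=100, S2=100, seed=0):
    """Estimate the bias (and its variance) of the naive class-k total.

    posterior : (N, K) array with p(z_i = a | y_i, z_hat_i; theta)
    P         : (K, K) misclassification matrix, rows summing to 1
    y_log10   : (N,) log10 revenues
    """
    rng = np.random.default_rng(seed)
    y_log10 = np.asarray(y_log10)
    N, K = posterior.shape
    diffs = []
    for _ in range(S1):
        # Outer loop: simulate a possible true labelling z ~ p(z | y, z_hat; theta)
        z_sim = np.array([rng.choice(K, p=posterior[i]) for i in range(N)])
        zeta_true = np.sum(10.0 ** y_log10[z_sim == k])
        for _ in range(S2):
            # Inner loop: simulate the misclassification of that labelling via P
            z_hat_sim = np.array([rng.choice(K, p=P[z]) for z in z_sim])
            zeta_naive = np.sum(10.0 ** y_log10[z_hat_sim == k])
            diffs.append(zeta_naive - zeta_true)
    diffs = np.asarray(diffs)
    return diffs.mean(), diffs.var(ddof=1)

# Bias-corrected estimate: R_star_k = Rhat_k - bias_estimate
```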

Okay, so now we have an $R^*_k$ that is proven to be unbiased for some basic statistics (such as totals for each class and class proportions) and hopefully also unbiased for non-linear statistics such as our $R_k$ computation, which requires exponentiation. This ‘hopefully’ is not particularly reassuring, so let us look at some empirical data!

Empirical evaluation

Using the dataset with NACE codes and yearly revenues of companies, we now want to evaluate the effectiveness of our new algorithm. Doing an empirical evaluation in our case mainly concerns choosing parameters and looking at different statistics. In bullet points, I would lay out the considerations as:

  • Which parameters do we have?
    • The misclassification probabilities $P$ (about $K^2/2$ parameters)
    • The parameters $\alpha_a$, $\sigma_{a,b}$, $\xi_{a,b}$ and $\mu_{a,b}$, which are all implicitly determined by our choice of dataset
  • How should we vary the parameters?
    • For simplicity, we can parameterize the diagonal of $P$, call it $P_D$, and set all off-diagonal values such that all rows sum to one (as they must); see the formula after this list. It would have been interesting to also study non-symmetric misclassification probabilities, since our algorithm can handle such cases, but I leave that for someone else to do (such a kind gesture).
    • For the other parameters we would have to look at different datasets or potentially take subsamples from our existing one
  • What statistics are we gonna look at?
    • Because the total yearly revenue has been our staple example and is a quantity of realistic interest we will of course look at $R_k$.
    • The exponentiation in $R_k$ comes from us working with $\log_{10}$ data. This might not always be the case, so it also makes sense to look at the plain-domain total $T_k=\sum_{i=1}^N \mathbb{1}(z_i=k)\, y_i$, even if this quantity does not have an intuitive interpretation (it is the sum of log revenues in a class, or equivalently, the log of the product of all revenues in a class).
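For reference, one natural way to write this parameterization of $P$ (assuming the off-diagonal mass is spread evenly, which is what I do here) is

$$ P = P_D\, I_K + \frac{1 - P_D}{K - 1}\left(\mathbf{1}\mathbf{1}^\top - I_K\right), $$

so every diagonal entry equals $P_D$, every off-diagonal entry equals $(1-P_D)/(K-1)$, and each row sums to $P_D + (K-1)\frac{1-P_D}{K-1} = 1$.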

Results 📈!

At the end of the last post I showed you this picture:

[Figure: total yearly revenue per company classification (96.02.2 Beauty treatment, 56.10.1 Restaurants, 71.11.1 Architect, 47.64.1 Selling bicycles and mopeds), comparing the naive and EM Bootstrap point values and means against the true value; revenues range from 0 to 2 billion euros.]

And having now gone through all of this, we might better appreciate that our algorithm, labeled “EM Bootstrap”, seems to be doing its job! To make the plot a bit less cluttered and to give space to vary $P_D$, which was set to $75\%$ in the above plot, we can make a plot like the one below:

[Figure: violin plots and means of the total yearly revenue per class (rows: Beauty, Restaurants, Architect, Mopeds/Bikes) for $P_D = 0.6$, $0.75$ and $0.9$ (columns), comparing the true value with the EM Bootstrap and naive estimates.]

In the above plot, we vary the misclassification probability over the columns and look at the different classes over the rows. Thus, the previous compact plot contained the information that now lives in the middle column of this plot. See, for example, how the restaurants (previously light green) are now on the second row, and how their true revenue of $2$ billion euros (shown as a red square) is estimated as 1.6 billion by the naive estimate (in blue) but is estimated more correctly by our EM Bootstrap algorithm (in orange). The spread of the violin plots represents the spread in outcomes the algorithm produces depending on which exact misclassifications happened to be drawn from the probabilities in $P$.

While we see how the orange dots (which represent the means of the orange violin plots) tend to align well with the red squares, we also see some troubling increased spread of our algorithm compared to the naive approach. This is especially prominent for the classes of architects and shops selling mopeds and bikes. These classes also happen to be the smallest classes in terms of representation in the dataset. The percentage of data points belonging to each class was: Beauty treatment 57.13%, Restaurants 25.59%, Architect 9.66% and Selling bicycles and mopeds 7.62%. I suspect the increased spread comes from the EM algorithm not converging properly with the sparser data.

Lastly, a clear and expected trend is that the naive method performs worse for higher misclassification probabilities (remember that the diagonal of $P$ represents the probability of a sample not being misclassified), while it converges to the true value if few samples are misclassified. This emphasizes that we reap the biggest benefits of our model in scenarios where we suspect major misclassification rates.

To inspect the algorithm’s performance for $T_k$, we look at a similar plot for this statistic (apologies for the change of plot theme):

[Figure: violin plots and means of the sum of $\log_{10}$ revenue ($T_k$) per class (Beauty, Restaurants, Architect, Mopeds/Bikes) for $P_D = 0.6$, $0.75$ and $0.9$, comparing the true value with the EM Bootstrap and naive estimates.]

We see many of the same trends here for $T_k$ as we did for $R_k$, so I won’t reiterate them. What is noteworthy, however, is that the spread of our algorithm is narrower relative to the naive approach than it was in the case of $R_k$. Thus, it seems like the non-linearity of $R_k$ blows up part of the variance, and our algorithm seems to be more reliable for less non-linear statistics.

What did we learn?

We’ve come a long way from simply noticing how misclassifications disturb our statistics (first blog post) to building a full-on model that actually fixes them. With the help of some mathematical heavy lifting, the EM algorithm, and the EM Bootstrap sampling, we’ve figured out how to turn biased estimates into unbiased ones — a bit like magic considering we can’t directly observe the true classes!

What’s exciting is seeing how the EM Bootstrap Algorithm performs in practice. It’s not perfect (the variance is too large for my taste), but it can be a huge improvement over sticking with naive estimates when you suspect large misclassification rates and have access to a small audited subsample.

If I were to work further with this method, I would want to revisit the modeling choice of using GMMs. GMMs are a flexible tool for modeling arbitrary distributions, but to me it feels like the wrong approach when we have no reason to believe the underlying distribution resembles a GMM. Actually, the reason we worked with $\log_{10}$ revenue data as the $y$ variable all along is that this made the GMM fitting more reliable. Preferably, I would want a non-parametric approach where we can relax the model assumptions and only include $P$ and $\alpha_k$ as parameters. But when doing this we cannot use the regular EM algorithm, so that part would have to be reworked as well. If anyone knows of this being done, then tell me!

Thanks for sticking through all this! I hope you have gained some new insights into how misclassifications can be tackled in ways other than how machine learning approaches would do it (it would be interesting to compare this approach to directly using the classes of an ML clustering algorithm). And even if you never use this approach in your life, I hope that the journey was the destination and that you will show up for some of my later posts!


Appendix - EM Derivations

For those interested, I will here briefly present derivations of the EM algorithm for our model. If you are interested in more specifics, you can contact me.

The log-likelihood

In EM we iteratively maximize the expected complete-data log-likelihood, so we first need an expression for the complete-data likelihood

$$ p(z_i, m_i, \hat{z}_i, y_i) = p(m_i, y_i, \hat{z}_i | z_i) p(z_i) = \alpha_{z_i} p(y_i, \hat{z}_i | m_i , z_i) p( m_i | z_i) = \alpha_{z_i} \xi_{z_i, m_i} p(y_i, \hat{z}_i | m_i , z_i) $$

Looking at this last density, and using that the classification error is independent of $y$ given the class $z_i$, we can write:

$$ p(y_i, \hat{z}_i | m_i , z_i) = p(y_i | m_i, z_i) p(\hat{z}_i | m_i, z_i) = \varphi(y_i, \mu_{z_i, m_i}, \sigma^2_{z_i,m_i}) P_{z_i, \hat{z}_i} $$

So for the complete likelihood, we end up with:

$$ p(z_i, m_i, \hat{z}_i, y_i) = \alpha_{z_i} \xi_{z_i, m_i} P_{z_i, \hat{z}_i} \varphi(y_i, \mu_{z_i, m_i}, \sigma^2_{z_i,m_i}) = \omega_{z_i, m_i, i} $$

Where we have defined:

$$ \omega_{k,j,i} \triangleq P_{k,\hat{z}_i} \xi_{k,j} \alpha_k \varphi(y_i; \mu_{k, j}, \sigma^2_{k, j}) = P_{k,\hat{z}_i} \xi_{k,j} \alpha_k \frac{1}{\sqrt{2\pi \sigma_{k,j}^2}} \exp\left\{-\frac{(y_i-\mu_{k,j})^2}{2 \sigma_{k,j}^2}\right\} $$

Writing out the log-likelihood, we have:

$$ \log p(z_i, m_i, \hat{z}_i, y_i) = \log \alpha_{z_i} + \log \xi_{z_i, m_i} + \log P_{z_i, \hat{z}_i} - \log \sigma_{z_i,m_i} - \frac{(y_i - \mu_{z_i, m_i})^2}{2 \sigma_{z_i, m_i}^2} + \mathrm{const.} $$

The E step

We are looking for the distribution of the latent variables given the observed data and our current parameter iterate $\theta^{(t)}$. For us, the latent variables are $(z_i, m_i)$ and the observed data are $(\hat{z}_i, y_i)$ for $i=1,\dots,N$. There are $K$ classes, so $z_i$ and $\hat{z}_i$ take values in $\{1,\dots, K\}$ and $m_i$ takes values in $\{1,\dots, q_{z_i}\}$, where the $q_k$ are considered fixed.

$$ p(z_i, m_i | \hat{\mathbf{z}}, \mathbf{y} ) = p(z_i, m_i| \hat{z}_i, y_i) = p(\hat{z}_i, y_i | z_i, m_i) \frac{p(z_i, m_i)}{p(\hat{z}_i, y_i)} = p(\hat{z}_i, y_i | z_i, m_i) \frac{p(m_i|z_i) p(z_i)}{\sum_{k=1}^K p(\hat{z}_i, y_i | z_i=k) p(z_i=k)} $$

Now with the classification error being independent of the variable $y$, we can write the above as:

$$ p(z_i, m_i | \hat{\mathbf{z}}, \mathbf{y} ) = p(\hat{z}_i | z_i, m_i) p( y_i | z_i, m_i) \frac{p(m_i|z_i) p(z_i)}{\sum_{k=1}^K p(\hat{z}_i | z_i=k) p(y_i | z_i=k) p(z_i=k)} $$

With $\hat{z}_i$ and $m_i$ assumed to be conditionally independent given $z_i$, we have $p(\hat{z}_i | z_i, m_i)= p(\hat{z}_i | z_i)=P_{z_i,\hat{z}_i}$. This yields then:

$$ p(z_i, m_i | \hat{\mathbf{z}}, \mathbf{y} ) = P_{z_i,\hat{z}_i} \xi_{z_i, m_i} \alpha_{z_i} \frac{\varphi(y_i; \mu_{z_i, m_i}, \sigma^2_{z_i, m_i})}{\sum_{k=1}^K P_{k, \hat{z}_i} \alpha_k p(y_i | z_i=k) } $$

Note that we have:

$$ p( y_i | z_i = k ) = \sum_{j=1}^{q_k} p( y_i | m_i=j, z_i = k)p( m_i=j | z_i=k ) = \sum_{j=1}^{q_k} \xi_{k,j} \varphi(y_i, \mu_{k,j}, \sigma^{2}_{k,j}) $$

so:

$$ p(z_i, m_i | \hat{\mathbf{z}}, \mathbf{y} ) = P_{z_i,\hat{z}_i} \xi_{z_i, m_i} \alpha_{z_i} \frac{\varphi(y_i; \mu_{z_i, m_i}, \sigma^2_{z_i, m_i})}{\sum_{k=1}^K P_{k, \hat{z}_i} \alpha_k \sum_{j=1}^{q_k} \xi_{k,j} \varphi(y_i, \mu_{k,j}, \sigma^{2}_{k,j}) } $$

Or using the definition of $\omega$ before, we can write it as:

$$ p(z_i, m_i | \hat{\mathbf{z}}, \mathbf{y} ) = \frac{\omega_{z_i, m_i, i}}{\sum_{k=1}^K \sum_{j=1}^{q_k} \omega_{k, j, i} } $$

Working with iterations of time for the parameters, we can write:

$$ A_{k,j,i}^{(t)} = p(z_i=k, m_i=j | \hat{\mathbf{z}}, \mathbf{y} ; \theta^{(t)}) = \frac{\omega_{k, j, i}^{(t)}}{\sum_{a=1}^K \sum_{b=1}^{q_a} \omega_{a, b, i}^{(t)} } $$

Note that $A_{k,j,i}^{(t)}$ has the property:

$$ \sum_{k=1}^K \sum_{j=1}^{q_k} A_{k,j,i}^{(t)} = 1, \quad \forall i \in \{1, \dots, N\} $$

since it is summing over the whole range of the random variables.
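As an illustration, a vectorized sketch of this E step could look like the following. For simplicity it assumes every class uses the same number of mixture components $q$, and all array and function names are my own:

```python
import numpy as np
from scipy.stats import norm

def e_step(y, z_hat, P, alpha, xi, mu, sigma):
    """Compute A[k, j, i] = p(z_i = k, m_i = j | z_hat_i, y_i; theta).

    Shapes (common number of components q for every class):
    y : (N,) floats, z_hat : (N,) integer labels in 0..K-1,
    P : (K, K), alpha : (K,), xi, mu, sigma : (K, q)
    """
    # phi[k, j, i] = Normal density of y_i under component (k, j)
    phi = norm.pdf(y[None, None, :], loc=mu[:, :, None], scale=sigma[:, :, None])
    # omega[k, j, i] = P[k, z_hat_i] * xi[k, j] * alpha[k] * phi[k, j, i]
    omega = P[:, z_hat][:, None, :] * xi[:, :, None] * alpha[:, None, None] * phi
    return omega / omega.sum(axis=(0, 1), keepdims=True)  # normalize over (k, j)
```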

The M step

With the distribution of the latent variables specified, we can perform the M step. We have

$$ Q(\theta, \theta^{(t)}) = \mathbb{E}_{\mathbf{z}, \mathbf{m} | \hat{\mathbf{z}}, \mathbf{y}; \theta^{(t)}} \left[\sum_{i=1}^N \log p(z_i, m_i, \hat{z}_i, y_i; \theta)\right] = \sum_{i=1}^N \mathbb{E}_{z_i, m_i | \hat{z}_i, y_i; \theta^{(t)}} \left[ \log p(z_i, m_i, \hat{z}_i, y_i; \theta)\right] = $$$$ = \sum_{i=1}^N \sum_{k=1}^K \sum_{j=1}^{q_k} A_{k,j,i}^{(t)} \log p(k, j, \hat{z}_i, y_i; \theta) $$

Using the expression of the log-likelihood from before we get

$$ Q(\theta, \theta^{(t)}) = \sum_{i=1}^N \sum_{k=1}^K \sum_{j=1}^{q_k} A_{k,j,i}^{(t)} \log \omega_{k, j, i; \theta} = \sum_{i=1}^N \sum_{k=1}^K \sum_{j=1}^{q_k} A_{k,j,i}^{(t)} \left\{ \log \alpha_{k} + \log \xi_{k, j} + \log P_{k, \hat{z}_i} - \log \sigma_{k,j} - \frac{(y_i - \mu_{k, j})^2}{2 \sigma_{k, j}^2}\right\} + \mathrm{const.} $$

We want to maximize this with respect to the $\theta$ parameters while respecting the constraints on $P$, $\xi$ and $\alpha$. The maximizing parameter choice is what we take as the new parameters, $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^{(t)})$. We show a more explicit calculation for $\alpha$ and shorter ones for the other parameters. To solve the constrained optimization problem, we optimize the Lagrangian. We calculate

$$ \partial_{\alpha_{a}} \left( Q(\theta, \theta^{(t)}) + \lambda(\sum_{k=1}^K \alpha_k - 1) \right) = \frac{1}{\alpha_a} \sum_{i=1}^N \sum_{j=1}^{q_a} A_{a,j,i}^{(t)}+ \lambda = 0 \implies $$

Let $C_a = \sum_{i=1}^N \sum_{j=1}^{q_a} A_{a,j,i}^{(t)}$ then

$$ \alpha_a = - \frac{C_a}{\lambda} \implies \sum_{a=1}^K \alpha_a = 1 = - \frac{1}{\lambda} \sum_{a=1}^K C_a = - \frac{1}{\lambda} \sum_{i=1}^N \sum_{a=1}^K \sum_{j=1}^{q_a} A_{a,j,i}^{(t)} = \{\text{two inner sums summing to 1}\} = $$$$ =-\frac{N}{\lambda} \implies \lambda = - N \implies $$$$ \alpha_a^{(t+1)} = \frac{\sum_{i=1}^N \sum_{j=1}^{q_a} A_{a,j,i}^{(t)}}{N} $$

Doing similar steps for $P_{a_1,a_2}$, we get

$$ \partial_{P_{a_1,a_2}} \left( Q(\theta, \theta^{(t)}) + \lambda(\sum_{k=1}^K P_{a_1,k} - 1) \right) = \frac{1}{P_{a_1,a_2}} \sum_{i=1}^N \mathbb{1}(\hat{z}_i=a_2) \sum_{j=1}^{q_{a_1}} A_{a_1,j,i}^{(t)}+ \lambda = 0 \implies $$$$ P_{a_1,a_2} = - \frac{\sum_{i=1}^N \mathbb{1}(\hat{z}_i=a_2) \sum_{j=1}^{q_{a_1}} A_{a_1,j,i}^{(t)}}{\lambda} \implies \sum_{a_2=1}^K P_{a_1,a_2} = 1 = - \frac{1}{\lambda} \sum_{a_2=1}^K \sum_{i=1}^N \mathbb{1}(\hat{z}_i=a_2) \sum_{j=1}^{q_{a_1}} A_{a_1,j,i}^{(t)} = - \frac{1}{\lambda} \sum_{i=1}^N \sum_{j=1}^{q_{a_1}} A_{a_1,j,i}^{(t)} \implies $$$$ P_{a_1,a_2}^{(t+1)} = \frac{\sum_{i=1}^N \mathbb{1}(\hat{z}_i=a_2) \sum_{j=1}^{q_{a_1}} A_{a_1,j,i}^{(t)}}{\sum_{i=1}^N \sum_{j=1}^{q_{a_1}} A_{a_1,j,i}^{(t)}} $$

And for $\xi_{a,b}$:

$$ \partial_{\xi_{a,b}} \left( Q(\theta, \theta^{(t)}) + \lambda( \sum_{j=1}^{q_a} \xi_{a,j} -1) \right) = \frac{1}{\xi_{a,b}} \sum_{i=1}^N A_{a,b,i}^{(t)} + \lambda = 0 \implies $$$$ \xi_{a,b} = -\frac{\sum_{i=1}^N A_{a,b,i}^{(t)}}{\lambda} \implies \sum_{j=1}^{q_a} \xi_{a,j} = 1 = - \frac{1}{\lambda} \sum_{i=1}^N \sum_{j=1}^{q_a} A_{a,j,i}^{(t)} \implies $$$$ \xi_{a,b}^{(t+1)} = \frac{\sum_{i=1}^N A_{a,b,i}^{(t)}}{\sum_{i=1}^N \sum_{j=1}^{q_a} A_{a,j,i}^{(t)}} $$

And for $\mu_{a,b}$:

$$ \partial_{\mu_{a,b}} Q(\theta, \theta^{(t)}) = \sum_{i=1}^N A_{a, b, i} \frac{1}{2 \sigma^2_{a,b}} \partial_{\mu_{a,b}} (2y_i \mu_{a,b} - \mu_{a,b}^2) = \sum_{i=1}^N A_{a, b, i} \frac{1}{2 \sigma^2_{a,b}} (2y_i - 2 \mu_{a,b}) = \frac{1}{\sigma_{a,b}^2} \sum_{i=1}^N A_{a, b, i} (y_i - \mu_{a,b}) =0 $$$$ \implies \sum_{i=1}^N A_{a, b, i} y_i = \mu_{a,b} \sum_{i=1}^N A_{a, b, i} \implies $$$$ \mu_{a,b}^{(t+1)} = \frac{\sum_{i=1}^N A_{a, b, i}^{(t)} y_i }{\sum_{i=1}^N A_{a, b, i}^{(t)}} $$

And lastly for $\sigma$:

$$ \partial_{\sigma_{a,b}} Q(\theta, \theta^{(t)}) = \sum_{i=1}^N A_{a,b,i}^{(t)} \left\{ - \frac{1}{\sigma_{a,b}} + \frac{(y_i-\mu_{a,b})^2}{\sigma_{a,b}^3} \right\}= \frac{1}{\sigma_{a,b}^3}\sum_{i=1}^N A_{a,b,i}^{(t)} (y_i-\mu_{a,b})^2 - \frac{1}{\sigma_{a,b}} \sum_{i=1}^N A_{a,b,i}^{(t)} = 0 \implies $$$$ \frac{1}{\sigma_{a,b}^2} = \frac{\sum_{i=1}^N A_{a,b,i}^{(t)}}{\sum_{i=1}^N A_{a,b,i}^{(t)} (y_i-\mu_{a,b})^2} \implies $$$$ \sigma_{a,b}^{(t+1)} =\sqrt{ \frac{\sum_{i=1}^N A_{a,b,i}^{(t)} (y_i-\mu^{(t+1)}_{a,b})^2}{\sum_{i=1}^N A_{a,b,i}^{(t)}}} $$
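And a matching sketch of the M-step updates, reusing the responsibilities A from the E-step sketch above (same simplifying assumption of a common $q$ across classes; this is just an illustration of the closed-form updates, not production code):

```python
import numpy as np

def m_step(A, y, z_hat, K):
    """Closed-form parameter updates from responsibilities A of shape (K, q, N)."""
    N = y.shape[0]
    alpha = A.sum(axis=(1, 2)) / N                      # alpha_a update
    # P[a1, a2] proportional to sum_i 1(z_hat_i = a2) sum_j A[a1, j, i]
    A_class = A.sum(axis=1)                             # (K, N), summed over components
    P = np.stack([A_class[:, z_hat == a2].sum(axis=1) for a2 in range(K)], axis=1)
    P = P / P.sum(axis=1, keepdims=True)                # normalize rows to sum to 1
    xi = A.sum(axis=2) / A.sum(axis=(1, 2))[:, None]    # xi_{a,b} update
    mu = (A * y).sum(axis=2) / A.sum(axis=2)            # mu_{a,b} update
    var = (A * (y - mu[:, :, None]) ** 2).sum(axis=2) / A.sum(axis=2)
    return alpha, P, xi, mu, np.sqrt(var)               # sigma_{a,b} update
```

Iterating e_step and m_step until the parameters stop changing gives the EM fit used in the main text.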

  1. Li, Yanzhe, Sander Scholtus, and Arnout van Delden. “A method for estimating the effect of classification errors on statistics for two domains.” Survey Methodology 49.2 (2023): 569-606 ↩︎