Mathematics ∙ Computer Science ∙ Ideas

Advent of Code 2021, Day 7: Minimising MSE + AAD

The problem at hand

It is the 7th day of the 2021 Advent of Code programming challenge! For the second challenge, we are given a dataset of $n$ points $X = (x_1, x_2, …, x_n) \in \mathbb{Z}^n$. We are then asked to find $$\gamma = \text{argmin}_{x\in\mathbb{R}} \sum_{i=1}^n \text{dist}'(x,\, x_i),$$ or, in other words, the value of $x \in \mathbb{R}$ that minimises the total sum of a distance function from it to all datapoints $$\text{dist}'(x,\, x_i) = 1 + 2 + … + |x-x_i| = \frac{1}{2}|x-x_i|\left(|x-x_i|+1\right).$$

Substituting, we see that what we wish to find is actually1 $$\gamma = \text{argmin}_{x\in\mathbb{R}} \frac{1}{2}\sum_{i=1}^n |x-x_i|\left(|x-x_i|+1\right).$$

Initial manipulation

Note that, to simplify things, we can equivalently view the problem as wanting to find the global minimum to a function $f : \mathbb{R} \rightarrow \mathbb{R}$ defined as $$f(x) = \sum |x-x_i|\left(|x-x_i|+1\right) = \sum (x-x_i)^2 + \sum |x-x_i|.$$

This is further equivalent to the global minimum of $$g(x) = \frac{1}{n}f(x) = \frac{1}{n}\sum (x-x_i)^2 + \frac{1}{n}\sum |x-x_i| = MSE + AAD,$$

since $\frac{1}{n}$ is just a constant factor! MSE is the mean squared error usually found in linear regressions, and AAD is the less commonly known average absolute deviation.

So really, the problem we are given is no different than that of finding the value that minimises the sum of the MSE and the AAD for a given dataset!

Solution

It is a well-known fact that the value that minimises the MSE of a dataset is the mean, and that the value that minimises the AAD for a dataset is the median. But what about their sum? Is it the mean? The median? Can it even be efficiently computed?

To answer this, first let $\mu_x = \frac{1}{n}{\sum_{i=1}^n x_i}$ be the mean of the given dataset. Also remember that $\frac{d}{dx}|x| = \text{sgn}(x)$, where $$\text{sgn}(x) = \begin{cases} -1, & x < 0 \\
0, & x = 0\\
1, & x > 0 \end{cases}$$ is the sign of $x \in \mathbb{R}$.


$\text{Claim.}$ The value $\gamma$ that minimises MSE + AAD must satisfy $\gamma \in [\mu_x - 0.5, \mu_x + 0.5]$.

$\text{Proof}.$ To find the global minimum of $g \geq 0$, we first take its derivative: $$ \begin{align*} g'(x) &= \left(\frac{1}{n}\sum_{i=1}^n(x-x_i)^2 + \frac{1}{n}\sum_{i=1}^n|x-x_i|\right)' \\
&= \frac{2}{n}\sum_{i=1}^n (x-x_i)+\frac{1}{n}\sum_{i=1}^n\text{ sgn}(x-x_i) \\
&= 2x-\frac{2}{n}\sum_{i=1}^n x_i+\frac{1}{n}\sum_{i=1}^n\text{ sgn}(x-x_i) \\
&= 2x - 2\mu_x + \frac{1}{n}\sum_{i=1}^n \text{ sgn}(x-x_i). \end{align*} $$ Therefore, we know that the global minimum $\gamma$ must satisfy $$g'(\gamma) = 0 = 2\gamma - 2\mu_x + \frac{1}{n}\sum_{i=1}^n \text{sgn}(\gamma-x_i),$$ or, in other words, $$\begin{equation} \tag{1} \gamma = \mu_x - \frac{1}{2}\cdot\frac{1}{n}\sum_{i=1}^n\text{ sgn}(\gamma - x_i). \end{equation}$$

Note however that $\forall i : -1 \leq \text{sgn}(\gamma - x_i) \leq 1$, and so $$-n \leq \sum_{i=1}^n\text{sgn}(\gamma - x_i) \leq n.$$ Plugging this in $(1)$, we finally obtain the relation $$\mu_x - 0.5 \leq \gamma \leq \mu_x + 0.5.$$ $$\tag*{$\blacksquare$}$$


Therefore, the value in $\mathbb{R}$ that minimises MSE + AAD is always within $0.5$ of the dataset mean! To find the value in $\mathbb{Z}$ that minimises MSE + AAD (which is really what the problem was asking in the first place), it should be enough to only consider two values: $\lfloor\mu_x\rfloor$, and $\lceil \mu_x \rceil$.

Compare the $AAD + MSE$ for both of the two, and choose the one that minimises it!


  1. What the problem really asks us to find is the minimum over $\mathbb{Z}$ (which is why $\text{dist}'$ is well-defined in the first place), but for analytical purposes it is useful to consider this minimum over $\mathbb{R}$. ↩︎

#maths #programming #proofs #statistics