Category Archives: probability theory

The sum of two normally distributed samples

Let $X_1$ and $X_2$ be drawn independently and identically from the distribution $\mathcal{N}(0, \sigma^2)$ with mean $0$ and (unknown) variance $\sigma^2$. We estimate the mean by the sample mean
$$
\overline{X} = \frac{1}{2} (X_1 + X_2),
$$
and the variance by the sample variance
$$
S^2 = \frac{1}{n - 1} \sum (X_i - \overline{X})^2 = (X_1 - \overline{X})^2 + (X_2 - \overline{X})^2.
$$
The random variable
$$
\frac{\overline{X}}{\sigma / \sqrt{2}}
$$
is normally distributed with mean $0$ and standard deviation $1$. The “approximated” random variable
$$
\frac{\overline{X}}{S / \sqrt{2}}
$$
has a Student t-distribution with 1 degree of freedom. Be aware that a random variable with a Student t-distribution with 1 degree of freedom has neither a mean nor a standard deviation. The “true” random variable and the “approximated” random variable therefore behave very differently with respect to the mean and standard deviation.

This result is easily verified numerically

import numpy as np
rng = np.random.RandomState(37)

sigma = 0.42  # Some (unknown) standard deviation
x1 = rng.normal(0, sigma, size=10000)
x2 = rng.normal(0, sigma, size=10000)

xm = (x1 + x2)/2                   # sample means
s2 = (x1 - xm)**2 + (x2 - xm)**2   # sample variances
y = xm/(np.sqrt(s2)/np.sqrt(2))    # the "approximated" random variable

print(y.mean())  # Result: roughly 1.98, definitely not 0
print(y.std())   # Result: roughly 182, not even close to 1

# The "approximation" is not even close to the "true" random variable.
y_true = xm/(sigma/np.sqrt(2))
print(np.sqrt(((y - y_true)**2).mean()))  # Result: roughly 182

This should be a warning when trying to estimate the mean and variance from only 2 samples of a normal distribution. For larger sample sizes this behavior disappears: the Student t-distribution with $n - 1$ degrees of freedom has a mean for $n > 2$ and also a variance for $n > 3$.
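
For contrast, here is a minimal sketch (same setup and seed as above; the choice of $n = 10$ samples is arbitrary) showing that with more samples the t-statistic, now with $9$ degrees of freedom, behaves as expected:

import numpy as np

rng = np.random.RandomState(37)
sigma, n, trials = 0.42, 10, 10000

# Draw `trials` groups of n samples each and form the t-statistic per group.
x = rng.normal(0, sigma, size=(trials, n))
xm = x.mean(axis=1)            # sample means
s = x.std(axis=1, ddof=1)      # sample standard deviations (n - 1 in the denominator)
t = xm / (s / np.sqrt(n))      # t-statistic with n - 1 = 9 degrees of freedom

print(t.mean())  # should be close to 0
print(t.std())   # should be close to sqrt(9/7), roughly 1.13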

The secretary problem

You have a single position to fill for a secretary. There are $n$ applicants for the position. After each interview you can immediately rank the applicant from worst to best against the previously interviewed applicants. However, you must decide immediately after the interview whether to hire the applicant; you only get one chance, directly after the interview. The question is: how many applicants should you interview and reject before you commit to one?

If you make your decision too early you might not hire the best applicant. But if you wait too long you might already have rejected the best applicant. More realistically, this kind of problem happens quite regularly in real life as well:

  • If you want to buy a house, how many potential houses must you visit before buying your favorite house? If you reject a house, someone else will buy it.
  • How often should you change jobs in the hope of ending up with the best job you can get? If you resign from a job, you (usually) can’t return to it.

There is a mathematical optimum for this problem. Suppose we have $n$ applicants. Our strategy is as follows: we reject the first $r - 1$ applicants. Then we continue with the remaining $n - r + 1$ applicants and hire the first one who is better than all of the first $r - 1$ applicants. The goal is to find the optimal $r$ such that the probability that we choose the best applicant is maximal. Let $P(r)$ be the probability that we select the best applicant after rejecting the first $r - 1$ applicants; then
$$
\begin{align*}
P(r) &= \sum_{i = r}^{n} P(\text{$i$ is the best applicant} \cap \text{we choose $i$}) \\
&= \sum_{i = r}^{n} P(\text{$i$ is the best applicant}) P(\text{we choose $i$} | \text{$i$ is the best applicant}).
\end{align*}
$$
We have $P(\text{$i$ is the best applicant}) = \frac{1}{n}$. The second factor requires a bit more work. Given that $i$ is the best applicant, what is the probability that we choose $i$? We choose $i$ only if none of the applicants $r, r+1, …, i-1$ is better than all of the first $r - 1$ applicants; in other words, the best of the first $i - 1$ applicants must be among the first $r - 1$ applicants. The probability of this is $\frac{r - 1}{i - 1}$. We now have
$$
\begin{align*}
P(r) &= \sum_{i = r}^{n} \frac{1}{n} \frac{r - 1}{i - 1} = \frac{r - 1}{n} \sum_{i = r}^{n} \frac{1}{i - 1}.
\end{align*}
$$
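
As a sanity check, we can evaluate this sum directly for a modest number of applicants (a quick sketch; the choice $n = 20$ is only for illustration):

import numpy as np

n = 20
r_values = range(2, n + 1)  # reject the first r - 1 applicants
# P(r) = (r - 1)/n * sum_{i = r}^{n} 1/(i - 1)
P = [(r - 1) / n * sum(1.0 / (i - 1) for i in range(r, n + 1)) for r in r_values]

best_r = 2 + int(np.argmax(P))  # r_values starts at r = 2
print(best_r, max(P))  # for n = 20 the optimum should be r = 8, i.e. reject the first 7 applicants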

We could compute the sum for each $r$ and find the optimal $P(r)$, as in the sketch above, but there is another way: we approximate the sum by a continuous function. Take $x = \frac{r - 1}{n}$; for large $n$ we have
$$
P(r) = \frac{r - 1}{n} \sum_{i = r}^{n} \frac{1}{i - 1} \approx x \int_{x}^{1} \frac{dt}{t} = -x \log(x) = \tilde{P}(x).
$$
$\tilde{P}(x)$ attains its optimum when its derivative is zero, i.e.
$$
\begin{align*}
0 = \tilde{P}'(x) &= -\log(x) - 1 \\
\log(x) &= -1, \qquad x = \frac{1}{e}.
\end{align*}
$$
Because $x = \frac{r - 1}{n}$ we find that $r = \frac{n}{e} + 1$, so the optimal strategy is to reject the first $\left[\frac{n}{e}\right]$ (round down) applicants.
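
We can also check the resulting strategy by simulation (a rough sketch, not part of the derivation): for $n = 100$ applicants we reject the first $\left[\frac{100}{e}\right] = 36$ and then hire the first applicant who beats all of them, counting how often this is indeed the overall best applicant.

import numpy as np

rng = np.random.RandomState(37)
n, trials = 100, 20000
reject = int(n / np.e)  # reject the first floor(n/e) = 36 applicants

hits = 0
for _ in range(trials):
    quality = rng.permutation(n)            # quality of applicant i; n - 1 is the best
    threshold = quality[:reject].max()      # best of the rejected applicants
    rest = quality[reject:]
    better = np.where(rest > threshold)[0]  # applicants beating all rejected ones
    if len(better) > 0 and rest[better[0]] == n - 1:
        hits += 1                           # we hired the overall best applicant

print(hits / trials)  # should be close to 1/e, roughly 0.37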

So what are the mathematically optimal solutions to the real-life questions asked above? It differs per person, but you should first figure out how many options (applicants) you are willing to consider at maximum. Once you have fixed that maximum, the secretary problem tells you how many options you have to reject by default. Then, you should select the first option that’s better than the options you have rejected before.

  • When should you bid on a house that you visited? It depends on how much you like visiting houses, but you probably don’t want to visit more than 10 houses. This means that you should reject the first $\left[\frac{10}{e}\right] = 3$ houses outright. Afterwards, you should try to buy the first house you visit that’s better than those first 3.
  • When should you change jobs? First estimate how many jobs you could have at most in your working life. You work for around 45 years, and you should probably work at least 2.5 years for the same employer before you can tell whether you like the job or not. This means you can have around 18 jobs at most in your working life, so you should switch jobs after 2.5 years for each of the first $\left[\frac{18}{e}\right] = 6$ jobs. Afterwards you should keep switching jobs until you find a job that’s better than all of the previous ones. Another way to look at this problem is to say that during the first $\left[\frac{45}{e}\right] = 16$ years of your working life you can switch jobs freely. Afterwards you should become more selective and stick to any job that’s better than all of the previous jobs; you will probably not get anything better.

What is the main takeaway from the secretary problem? We probably make decisions too soon, or we stick too long with our current situation. Mathematically it’s probably better to reject offers more often, and only after the first $\left[\frac{n}{e}\right]$ options should we become more picky. On the other hand, we can’t keep rejecting all our options in the hope that something better will eventually appear: there is a clearly defined threshold that tells you when to stop looking for other options.

The Two Girl Paradox

A family has two children. You know that one of the two children is a girl. What is the probability that the other child is a girl as well? A classical problem for a probability course exam. Unfortunately, the problem is underspecified, because it can be interpreted in multiple ways, which lead to different results.

Let $G$ be the event that at least one of the children is a girl. Let $GG$ be the event that both children in the family are girls. By Bayes’ rule
$$
P(GG|G) = \frac{P(G|GG) P(GG)}{P(G)} = \frac{1 \times 1/4}{3/4} = \frac{1}{3}.
$$
Here $P(G|GG) = 1$, $P(GG) = 1/4$, and $P(G) = 3/4$, assuming that boys and girls are equally likely and that the sexes of the two children are independent.

Let’s go back to the child of which we know that she’s a girl. She must have a name (we assume that every child has a name). Let her name be $N$ and let $G_N$ be the event that the family has a girl named $N$. The probability that any particular girl is called $N$ is $p$, which is close to zero but not equal to zero, because there are many girl names and none of them is used too often; to first order in $p$ we then also have $P(G_N) \approx p$. Note that if a family has two girls, the chance that at least one of the girls is called $N$ is
$$
P(G_N|GG) = 1 - P(\textrm{not }G_N|GG) = 1 - (1 - p)(1 - p) = 2p - p^2.
$$
By Bayes’ rule
$$
P(GG|G_N) = \frac{P(G_N|GG) P(GG)}{P(G_N)} \approx \frac{(2p - p^2) \times 1/4}{p} = \frac{1}{4} (2 - p).
$$
Because $p$ is positive but close to zero, we find that $P(GG|G_N) \approx \frac{1}{2}$: a different result than the one in the previous paragraph.
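
Both interpretations are easy to check by simulation (a small sketch; the name frequency $p = 0.01$ is an arbitrary choice):

import numpy as np

rng = np.random.RandomState(37)
families, p = 1000000, 0.01  # p: probability that a given girl is named N

girl = rng.randint(0, 2, size=(families, 2)) == 1   # True = girl
named_n = girl & (rng.rand(families, 2) < p)        # girl named N

at_least_one_girl = girl.any(axis=1)
has_girl_named_n = named_n.any(axis=1)
both_girls = girl.all(axis=1)

print(both_girls[at_least_one_girl].mean())  # should be close to 1/3
print(both_girls[has_girl_named_n].mean())   # should be close to 1/2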

The estimator for the variance of $m$ normal distributions

Fix a variance $\sigma^2 > 0$ and let $n > 0$ and $\mu_1, \mu_2, …, \mu_n \in \mathbb{R}$. Given $m = m_1 + m_2 + … + m_n$ samples
$$
\begin{split}
X_{11}, X_{12}, …, X_{1m_1} &\sim N(\mu_1, \sigma^2), \\
X_{21}, X_{22}, …, X_{2m_2} &\sim N(\mu_2, \sigma^2), \\
&\vdots \\
X_{n1}, X_{n2}, …, X_{nm_n} &\sim N(\mu_n, \sigma^2), \\
\end{split}
$$
where $m_1, m_2, …, m_n > 1$. What is the best estimator for the variance $\sigma^2$? In other words, we draw samples from $n$ normal distributions with the same variance but with not necessarily equal means. What is the best way to estimate the common variance of these $n$ normal distributions?

Main result

Surprisingly, there is an answer. It is surprising because we are looking for the best estimator for the variance, and beforehand there is no guarantee that a best estimator actually exists. Moreover, the answer is simple, though a bit harder to prove. Define $X = (X_{ij})$ to be the vector of all samples $\{X_{ij}\}$. The unique uniformly minimum-variance unbiased estimator (UMVUE) for $\sigma^2$ is given by
$$
S(X) = \frac{1}{m - n} \sum_{i = 1}^n \sum_{j = 1}^{m_i} (X_{ij} - \overline{X}_{i})^2,
$$
where $\overline{X}_i = \frac{1}{m_i} \sum_{j = 1}^{m_i} X_{ij}$. In other words, the best estimator for $\sigma^2$ is $S(X)$.

An example

Fix $\sigma = 0.35$. Suppose
$$
\begin{split}
X_1 &= (1.69, 1.35, 1.75, 1.45, 1.77) \sim N(1.5, \sigma^2), \\
X_2 &= (2.27, 1.73, 1.46) \sim N(2.0, \sigma^2), \\
X_3 &= (-0.28, 0.32, 0.69, 0.14) \sim N(0.1, \sigma^2)
\end{split}
$$
We compute
$$
\begin{split}
m &= |X_1| + |X_2| + |X_3| = 5 + 3 + 4 = 11, n = 3, \\
\overline{X}_1 &\approx 1.60, \overline{X}_2 \approx 1.82, \overline{X}_3 \approx 0.21.
\end{split}
$$
According to the main result we approximate the variance $\sigma^2$ as follows
$$
\begin{split}
\frac{1}{m - n} \sum_{i = 1}^n \sum_{j = 1}^{m_i} (X_{ij} - \overline{X}_{i})^2
&= \frac{1}{11 - 3} ( (1.69 - 1.6)^2 + (1.35 - 1.6)^2 + (1.75 - 1.6)^2 \\
&\qquad + (1.45 - 1.6)^2 + (1.77 - 1.6)^2 + (2.27 - 1.82)^2 \\
&\qquad + (1.73 - 1.82)^2 + (1.46 - 1.82)^2 + (-0.28 - 0.21)^2 \\
&\qquad + (0.32 - 0.21)^2 + (0.69 - 0.21)^2 + (0.14 - 0.21)^2) \\
&\approx 0.12
\end{split}
$$
Then the standard deviation is estimated as $\sigma \approx \sqrt{0.12} \approx 0.35$. Not bad!
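
The hand computation above is easy to reproduce in code; below is a minimal sketch of the estimator $S(X)$ applied to the example data (the helper name pooled_variance is just for illustration):

import numpy as np

def pooled_variance(groups):
    # S(X): pooled within-group sum of squares divided by m - n.
    m = sum(len(g) for g in groups)
    n = len(groups)
    ss = sum(((np.asarray(g) - np.mean(g)) ** 2).sum() for g in groups)
    return ss / (m - n)

x1 = [1.69, 1.35, 1.75, 1.45, 1.77]
x2 = [2.27, 1.73, 1.46]
x3 = [-0.28, 0.32, 0.69, 0.14]

s2 = pooled_variance([x1, x2, x3])
print(s2, np.sqrt(s2))  # roughly 0.12 and 0.35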

The proof

Before we can start with the proof of the main result, we have to introduce some definitions and theorems from statistics.

Sufficient statistic

A statistic $T(X)$ is sufficient for the underlying parameter $\theta$ if and only if the conditional probability distribution of the data $X$, given the statistic $T(X)$, doesn’t depend on the parameter $\theta$. We have the following equivalent characterization of sufficiency of a statistic $T(X)$.

Theorem (Fisher-Neyman factorization)

Let $f_{\theta}(x)$ be the probability density function of $X$. $T(X)$ is sufficient for the underlying parameter $\theta$ if and only if there exist non-negative functions $g$ and $h$ such that
$$
f_{\theta}(x) = h(x) g_{\theta}(T(x)),
$$
where $h(x)$ doesn’t depend on $\theta$, and $g_{\theta}$ depends on the data $x$ only through $T(x)$.

Complete statistic

A statistic $T(X)$ is complete if and only if for every measurable function $g$
$$
\forall \theta: E_{\theta}(g(T)) = 0 \rightarrow \forall \theta: P_{\theta}(g(T) = 0) = 1.
$$
In other words, if the expected value of $g(T)$ is zero for every $\theta$, then $g(T) = 0$ almost surely for every $\theta$.

Theorem (Lehmann-Scheffe)

The Lehmann-Scheffe theorem provides us with the tool to turn a sufficient and complete statistic for $\theta$ into a UMVUE. Let $X = (X_1, X_2, …, X_n)$ be a random sample from a distribution with probability density $f_{\theta}(x)$ and suppose $T$ is a sufficient and complete statistic for $\theta$. If $\phi(T(X))$ is an unbiased estimator of $\theta$, i.e. $E(\phi(T(X))) = \theta$, then $\phi(T(X))$ is the unique uniformly minimum-variance unbiased estimator (UMVUE) for $\theta$.

Lemma 1

Given independent samples $X_1 \sim N(\mu_1, \sigma^2)$ and $X_2 \sim N(\mu_2, \sigma^2)$, the statistics $S(X_1, X_2) = X_1 X_2$ and $T(X_1, X_2) = X_1 + X_2$ are complete.

Proof

We prove the completeness of $T$; the completeness of $S$ is left as an exercise. Write $\theta = (\mu_1, \mu_2, \sigma^2)$. Let $g$ be a measurable function such that for every $\theta$
$$
E_{\theta}(g(T(X))) = E_{\theta}(g(X_1 + X_2)) = 0.
$$
The probability density of $T(X_1, X_2)$ is
$$
f_{\mu_1 + \mu_2, 2\sigma^2}(x) = (4\pi\sigma^2)^{-\frac{1}{2}} \mathrm{exp} \left( -\frac{(x - \mu_1 - \mu_2)^2}{4 \sigma^2} \right).
$$
Therefore, for every value of $\mu_1 + \mu_2 \in \mathbb{R}$,
$$
0 = E_{\theta}(g(T(X))) = (4\pi\sigma^2)^{-\frac{1}{2}} \int g(t)\, \mathrm{exp} \left(-\frac{(t - \mu_1 - \mu_2)^2}{4 \sigma^2}\right) dt.
$$
In other words, the convolution of $g$ with a Gaussian kernel vanishes for every shift $\mu_1 + \mu_2$. Since the Fourier transform of a Gaussian never vanishes, this forces $g = 0$ almost everywhere, and therefore $P_{\theta}(g(T) = 0) = 1$.

Lemma 2

Given independent samples $X_1 \sim N(\mu_1, \sigma^2)$, $X_2 \sim N(\mu_2, \sigma^2)$, …, $X_n \sim N(\mu_n, \sigma^2)$, the statistics $\sum_i X_i$ and $\sum_i X_i^2$ are complete.

Proof

Both statements follow by induction on $n$, applying Lemma 1 in the induction step.

Proof of the main result

Let $X_{ij}$ be as defined in the main result. Define $X = (X_{ij})$ to be the vector of all samples $\{X_{ij}\}$ with probability density
$$
\begin{split}
f(x) &= f_{\mu_1, \mu_2, …, \mu_n, \sigma^2}(x) \\
&= (2\pi)^{-\frac{m}{2}} \sigma^{-m} \mathrm{exp} \left(-\frac{1}{2\sigma^2} \sum_{i,j} (x_{ij} - \mu_i)^2 \right) \\
&= (2\pi)^{-\frac{m}{2}} \sigma^{-m} \mathrm{exp} \left(-\frac{1}{2\sigma^2} \sum_i m_i \mu_i^2\right) \\
&\qquad \times \mathrm{exp} \left(-\frac{1}{2\sigma^2} \sum_{i,j} x_{ij}^2\right) \mathrm{exp} \left(\sum_i \frac{\mu_i}{\sigma^2} \sum_j x_{ij}\right).
\end{split}
$$
Notice that the equation above satisfies the requirements of the Fisher-Neyman factorization for
$$
T(X) = (\sum_{j} X_{1j}, \sum_{j} X_{2j}, …, \sum_{j} X_{nj}, \sum_{i,j} X_{ij}^2).
$$
Therefore $T(X)$ is a sufficient statistic for $(\mu_1, \mu_2, …, \mu_n, \sigma^2)$. From Lemmas 1 and 2 it follows that $T(X)$ is also a complete statistic. Define the statistic $S$ in terms of $T$ as
$$
\begin{split}
S(X) &= \frac{1}{m - n} \left( \sum_{i,j} X_{ij}^2 - \sum_i \frac{1}{m_i} \Big( \sum_j X_{ij} \Big)^2 \right) \\
&= \frac{1}{m - n} \sum_i \left( \sum_j X_{ij}^2 - m_i \overline{X}_i^2 \right) \\
&= \frac{1}{m - n} \sum_i \left( \sum_j (X_{ij} - \mu_i)^2 - m_i (\overline{X}_i - \mu_i)^2 \right).
\end{split}
$$
Because $S$ is a function of the complete and sufficient statistic $T$, it remains to check that $S$ is unbiased. Indeed,
$$
\begin{split}
E(S) &= \frac{1}{m - n} \sum_i \left( \sum_j E((X_{ij} - \mu_i)^2) - m_i E((\overline{X}_i - \mu_i)^2) \right) \\
&= \frac{1}{m - n} \sum_i \left( \sum_j \sigma^2 - m_i \frac{\sigma^2}{m_i} \right) \\
&= \frac{1}{m - n} \sum_i (m_i - 1) \sigma^2 = \sigma^2,
\end{split}
$$
because
$$
\begin{split}
E((X_{ij} - \mu_i)^2) &= \mathrm{Var}(X_{ij}) = \sigma^2, \\
E((\overline{X}_{i} - \mu_i)^2) &= \mathrm{Var}(\overline{X}_{i}) \\
&= \frac{1}{m_i^2} \left(\sum_j \mathrm{Var}(X_{ij}) + \sum_{j \neq k} \mathrm{Cov}(X_{ij}, X_{ik}) \right) \\
&= \frac{1}{m_i^2} \sum_j \sigma^2 = \frac{\sigma^2}{m_i}.
\end{split}
$$
Therefore $S$ is an unbiased estimator for $\sigma^2$ that is a function of the complete and sufficient statistic $T$. From the Lehmann-Scheffe theorem it follows that $S$ is the UMVUE for $\sigma^2$. QED
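
A quick simulation illustrates both properties (a sketch, not part of the proof): $S$ is unbiased for $\sigma^2$, and its spread is smaller than that of another unbiased estimator, for example the sample variance of the first group alone.

import numpy as np

rng = np.random.RandomState(37)
sigma, mus, sizes = 0.35, [1.5, 2.0, 0.1], [5, 3, 4]
m, n = sum(sizes), len(sizes)

pooled, single = [], []
for _ in range(20000):
    groups = [rng.normal(mu, sigma, size=k) for mu, k in zip(mus, sizes)]
    ss = sum(((g - g.mean()) ** 2).sum() for g in groups)
    pooled.append(ss / (m - n))            # the estimator S(X)
    single.append(groups[0].var(ddof=1))   # unbiased, but uses only the first group

pooled, single = np.array(pooled), np.array(single)
print(pooled.mean(), single.mean())  # both should be close to sigma**2 = 0.1225
print(pooled.std(), single.std())    # S(X) should have the smaller spread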

The product of two uniform distributions

Recently someone asked me the following question: what is the distribution of the product of two uniformly distributed random variables? Not knowing the answer immediately, we discussed the problem and concluded that it is not a trivial question. Below I show you my solution to this problem.

Numerical approach

We first want to build some intuition about what to expect. We do this by generating 10000 pairs of random uniform samples between 0 and 1. Then we multiply each pair and plot a histogram of the results. We can do this with the following piece of Python code.

import numpy as np
import matplotlib.pyplot as plt

rs = np.random.RandomState(37)

# Generate 10000 pairs of uniform samples on [0, 1] and multiply each pair.
X = rs.rand(10000)
Y = rs.rand(10000)
Z = X*Y

plt.hist(Z, bins='auto')
plt.show()

The results are quite surprising

The histogram suggests that the density of the product of two uniform distributions blows up near zero, roughly like a logarithm. Now the question is: can we find the explicit density function, and can we prove that it is correct?

The distribution function of the product of two uniform distributions

Let $X$ and $Y$ be independent and uniformly distributed on $[0, 1]$. Define $Z = XY$. We write $f_X$ and $f_Y$ for the densities of $X$ and $Y$, respectively. Let $0 \leq T \leq 1$; then
$$
\begin{split}
P(Z < T) &= \int_0^1 P(xY < T) f_X(x) dx = \int_0^1 P(Y < T/x) f_X(x) dx
\end{split}
$$
If $x < T$, then $T/x > 1$ and hence $P(Y < T/x) = 1$, so we split the integral
$$
\begin{split}
P(Z < T) &= \int_0^T f_X(x) dx + \int_T^1 P(Y < T/x) f_X(x) dx \\
&= \int_0^T f_X(x) dx + \int_T^1 \frac{T}{x} f_X(x) dx \\
&= \int_0^T 1 dx + \int_T^1 \frac{T}{x} dx \\
&= T - T \log(T).
\end{split}
$$
Hence
$$
f_Z(x) = \frac{d}{dx} (x - x\log(x)) = 1 - (1 + \log(x)) = -\log(x).
$$
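
We can overlay this density on a histogram of simulated products to check the result (a small sketch; it regenerates the samples with the same seed as the code above):

import numpy as np
import matplotlib.pyplot as plt

rs = np.random.RandomState(37)
Z = rs.rand(10000) * rs.rand(10000)

x = np.linspace(0.01, 1, 200)
plt.hist(Z, bins=50, density=True, label='histogram of Z')
plt.plot(x, -np.log(x), label='f_Z(x) = -log(x)')
plt.legend()
plt.show()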

The distribution function of the product of an arbitrary number of uniform distributions

We can do better. Let $X_1$, $X_2$, …, $X_n$ be independent and uniformly distributed on $[0, 1]$. What is the density function of $Z_n = X_1 X_2 \ldots X_n$? We will show, in general, that
$$
f_{Z_n}(x) = \frac{(-1)^{n-1}}{(n-1)!} \log(x)^{n-1}, \quad 0 \leq x \leq 1.
$$
For $n = 2$ we already showed that $f_{Z_2}(x) = -\log(x)$. Assume the formula holds for $n$; we will show that it then holds for $n+1$. By induction this proves that the formula holds for all $n$. We use the well-known identities (see https://en.wikipedia.org/wiki/List_of_integrals_of_logarithmic_functions)
$$
\begin{split}
\int \log(x)^n dx &= x \sum_{k=0}^n (-1)^{n-k} \frac{n!}{k!} (\log(x))^k, \\
\int \frac{\log(x)^n}{x} dx &= \frac{\log(x)^{n+1}}{n+1}.
\end{split}
$$
Using
$$
\begin{split}
P(Z_n < T) &= \int_0^1 P(x X_n < T) f_{Z_{n-1}}(x) dx \\
&= \int_0^1 P(X_n < T/x) f_{Z_{n-1}}(x) dx \\
&= \int_0^T f_{Z_{n-1}}(x) dx + \int_T^1 \frac{T}{x} f_{Z_{n-1}}(x) dx,
\end{split}
$$
where $Z_{n-1} = X_1 X_2 \ldots X_{n-1}$, we have
$$
\begin{split}
P(Z_{n+1} < T) &= \int_0^T f_{Z_{n}}(x) dx + \int_T^1 \frac{T}{x} f_{Z_{n}}(x) dx \\
&= \int_0^T \frac{(-1)^{n-1}}{(n-1)!} \log(x)^{n-1}dx
+ \int_T^1 \frac{T}{x} \frac{(-1)^{n-1}}{(n-1)!} \log(x)^{n-1}dx \\
&= \frac{(-1)^{n-1}}{(n-1)!} \left(
\int_0^T \log(x)^{n-1}dx + \int_T^1 \frac{T}{x} \log(x)^{n-1} dx
\right) \\
&= \frac{(-1)^{n-1}}{(n-1)!} \left(
T \sum_{k=0}^{n-1} (-1)^{n-1-k} \frac{(n-1)!}{k!} (\log(T))^k - \frac{T}{n} \log(T)^n
\right).
\end{split}
$$
The derivative of $P(Z_{n+1} < T)$ involves a telescoping sum: using
\begin{equation*}
\frac{d}{dx} x\log(x)^k = k\log(x)^{k-1} + \log(x)^k,
\end{equation*}
we have
$$
\begin{split}
\frac{d}{dx} \left(
x \sum_{k = 0}^{n-1} (-1)^{n-1-k} \frac{(n-1)!}{k!} \log(x)^k
\right) &= \sum_{k = 0}^{n-1} (-1)^{n-1-k} \frac{(n-1)!}{k!} (k\log(x)^{k-1} + \log(x)^k) \\
&= \sum_{k = 0}^{n-2} (-1)^{n-k} \frac{(n-1)!}{k!} \log(x)^k
+ \sum_{k = 0}^{n-1} (-1)^{n-1-k} \frac{(n-1)!}{k!} \log(x)^k \\
&= \log(x)^{n-1}.
\end{split}
$$
Therefore, differentiating the first term with the telescoping identity above and the term $-\frac{T}{n} \log(T)^n$ directly, we get
$$
\begin{split}
f_{Z_{n+1}}(x) &= \frac{d}{dx} P(Z_{n+1} < x) \\
&= \frac{(-1)^{n-1}}{(n-1)!} \left(
\log(x)^{n-1}
- \log(x)^{n-1}
– \frac{1}{n} \log(x)^{n}
\right) = \frac{(-1)^n}{n!} \log(x)^n.
\end{split}
$$
Hence, by induction, the formula for $f_{Z_n}(x)$ holds for all $n$.
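
As a final sanity check, here is a small numerical sketch (the choice $n = 4$ and the bin count are arbitrary) comparing a histogram of products of four independent uniform samples with the derived density:

import numpy as np
import matplotlib.pyplot as plt
from math import factorial

rs = np.random.RandomState(37)
n = 4
Z = rs.rand(100000, n).prod(axis=1)  # products of n independent uniform samples

x = np.linspace(0.001, 1, 500)
f = (-1) ** (n - 1) / factorial(n - 1) * np.log(x) ** (n - 1)

plt.hist(Z, bins=100, density=True, label='histogram of Z_n')
plt.plot(x, f, label='derived density f_{Z_n}')
plt.legend()
plt.show()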