# Illustrate Bayesian inference.
GOAL: Illustrate Bayesian inference.
Probability is the codification of logical thought in the presence of
uncertainty. It is the ONLY consistent system for logical inference
under uncertainty (TODO: Cox's Theorem, Fisher information, Cramér-Rao
bound, KL coordinate-free invariant information geometry).
When interpreting the world you get observations $O$ and you want to
decide what model of truth $T$ best explains those observations. This
inference of model from data sometimes allows you to understand and
predict things about the world (The Scientific Method).
To do this inference, you run experiments to determine what you observe
given a fixed, known truth, $P(O|T)$: the likelihood (viewed as a
function of $T$) or the density (viewed as a function of $O$). Bayes
rule then turns that around to compute $P(T|O)$, the posterior
(after-data) probability of the true hypothesis given what you observe,
using your prior $P(T)$ about what might be true before seeing any data:
$$P(T|O) = P(O|T)*P(T) / P(O)$$
This is inspired by MacKay's book, in which he shows the simple,
intuitive beauty of Bayes rule: http://www.inference.org.uk/mackay/itprnn/ps/47.59.pdf
For example, say I want to predict the species for a given tree based
on its height.
First I construct some probability distributions. I go out and look at
only maple trees and record their heights. Then I look at only dogwood
trees and record their heights. I capture my observations in the
distributions $P(height|maple)$ and $P(height|dogwood)$. Maples are
generally taller (around 70 feet) than dogwoods (around 30
feet). Now I have a tree that is 69 feet; what
species is it? I want to compute the posterior probability of it being
each species of tree: $P(maple | 69 feet), P(dogwood | 69 feet)$. I
might also have a prior that I might see more maples because I'm
farther north. I look at which posterior has greater probability and
decide on that species. Skipping all the details, this procedure
would likely call maple because 69 feet is much closer to the 70-foot
maple average than to the 30-foot dogwood average. The details make
this all precise and quantitative.
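As a rough sketch of what those details might look like, here is a hypothetical version in R: the height models are assumed to be Gaussian (means of 70 and 30 feet, with made-up standard deviations), and the prior favoring maples is likewise invented for illustration.
```
# Hypothetical sketch of the tree example: assumed Gaussian height models
# and an assumed prior favoring maples (we are farther north).
height = 69
like = c(maple   = dnorm(height, mean=70, sd=15),  # P(height|maple), sd assumed
         dogwood = dnorm(height, mean=30, sd=10))  # P(height|dogwood), sd assumed
prior = c(maple=0.6, dogwood=0.4)                  # assumed prior
posterior = like*prior / sum(like*prior)           # Bayes rule
posterior                                          # nearly all the mass is on maple
```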
Probability distributions force you to place rational bets across all
possibilities. This lets you precisely weigh how often different
events happen. Bayesian probability means you write down the
probability of everything. After that it is _mechanical_ to infer
different quantities.
================================
# Simple Binomial Example
Let's work through a simple example: https://en.wikipedia.org/wiki/Binomial_distribution
Flip a coin $N$ times where the probability of heads is $p$. How often
do you see $k$ heads, where $k$ ranges over $0, 1, \ldots, N$? The binomial
distribution tells you this:
$$ P(k|N,p) = choose(N, k) p^k (1-p)^{(N-k)} $$
The three terms are: 1) the probability of heads raised to the power of the
number of observed heads, $p^k$; 2) the probability of tails raised to the
power of the number of observed tails, $(1-p)^{(N-k)}$; 3) the number of
ways to choose the observed heads out of the total flips, the binomial
coefficient $choose(N,k) = N!/(k! (N-k)!)$. These distributions
are carefully constructed to sum to 1.0. The powers of probability arise
simply because probabilities multiply for independent events.
This is mathematical truth. If you independently flip coins, the binomial
distribution describes the outcomes exactly.
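As a quick sanity check (a sketch, not part of the derivation), the closed-form term can be compared with R's built-in binomial density "dbinom", which is used throughout below.
```
# The binomial formula written out by hand vs. R's dbinom.
N = 10; k = 3; p = 0.3
choose(N, k) * p^k * (1-p)^(N-k)  # direct formula
dbinom(k, size=N, prob=p)         # same value, about 0.2668
```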
================================
# Generate Some Data
Let's generate some data from the model using R. Binomial data is
generated using "rbinom".
Set the model parameters to $N=10$ flips, $p=0.3$ or 30% for
heads. Let's run 100 experimental trials.
```
N = 10        # flips per trial
p = 0.3       # probability of heads
trials = 100  # number of experimental trials
# number of heads observed in each of the 100 trials
obs = rbinom(trials, size=N, prob=p)
png("tableobs.png")
plot(table(obs))
title("counts of observed values")
dev.off()
```
The count of about 30 at observed=3 says that out of the 100
experiments you ran, ~30 of those experiments showed 3 heads and 7
tails when you flipped the coin ten times.
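As a rough check (a sketch), the empirical frequencies from these 100 trials should track the theoretical binomial probabilities, up to sampling noise.
```
# Compare empirical frequencies to the theoretical binomial probabilities.
empirical = as.vector(table(factor(obs, levels=0:N))) / trials
theoretical = dbinom(0:N, size=N, prob=p)
round(rbind(empirical, theoretical), 3)
```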
================================
# Inference
OK, we have some data. Let's forget that we know the probability of
heads that generated it (0.3) and infer its value from the data
itself, knowing only that we flipped a coin 10 times in each of 100 trials.
To keep things simple, let's assume the true probability is quantized
to multiples of 0.1. There are nine different
hypotheses to test:
```
hyp = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
```
We want to know the posterior probability of each hypothesis given the
data: after seeing the data, what is the probability of each
hypothesis? You can use this to, for example, take the maximum as your
best bet for the correct hypothesis.
How do we do that? Write down Bayes rule:
$$P(T|O) = P(O|T)*P(T) / P(O)$$
Here $T$ or truth is over the possible hypotheses and $O$ is the
observed data. Let's fill out the terms.
We assume a model. Here it is binomial, but it could be any model that
assigns a probability to a data point given some parameters of the
model. (Note: here the model "reflects" reality. What happens when the
model or class of models is limited and doesn't exactly reflect
reality? In machine learning terms, is your hypothesis concept class
capable of representing the problem?) The R binomial probability
function is "dbinom".
$$P(O|T) = P(k|N,p) = choose(N,k) p^k (1-p)^{(N-k)} = dbinom(k,size=N,prob=p)$$
The observed data is $k$ the number of heads and the truth is the
setting of the model parameters $N,p$. When you sum over all
possibilities of the observed data (0:N) you get 1.0, or that you've
placed rational bets on all possibilities.
$P(T)$ is the prior. This is your a-priori belief that each of the
hypotheses is true. In our case we don't have any real information and
therefore assign uniform probability over the nine hypotheses.
$P(O)$ is a normalizing factor to make sure the posterior sums to 1.0
over the different hypotheses. It is $\sum_i P(O|T_i)*P(T_i)$.
We have all the ingredients!
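Before walking through the full calculation, here is a compact preview (a sketch) of the whole recipe for a single observation of 3 heads out of 10 under a uniform prior; the same numbers reappear in the sections below.
```
# Compact preview: posterior over the nine hypotheses for a single
# observation of 3 heads out of 10, with a uniform prior.
lik = dbinom(3, size=10, prob=hyp)    # P(O|T) for each hypothesis
prior = rep(1/9, 9)                   # uniform P(T)
post = lik*prior / sum(lik*prior)     # Bayes rule, normalized by P(O)
post                                  # peaks at the 0.3 hypothesis
```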
================================
# Inference Calculation: Likelihoods
Compute some numbers!
Both the likelihood and the posterior are 2d arrays indexed by
hypothesis and observation, $[T,O]$. In this case, $O$ can take on 11
possible values (0:10, the number of heads out of 10 flips) and $T$ can
take on 9 values (0.1:0.9, the 9 hypothesis values that we decided to
consider). Let's use R to compute these things.
```
# storage: rows are hypotheses, columns are observations
likelihood = matrix(nrow=9, ncol=11)
# all possible outcomes of a single trial: the number of observed heads
obspossible = seq(0,N)
# step through all hypotheses
for (hypii in 1:length(hyp)){
# step through all possible observations
for (obsii in 1:length(obspossible)){
# compute the likelihood for this observed, hypothesis combination
likelihood[hypii,obsii] = dbinom(obspossible[obsii], size=N, prob=hyp[hypii])
}
}
write.table( likelihood, file="likelihood.tsv", sep="\t", col.names=F,row.names=F)
write.table( hyp, file="likelihood_rowval.tsv", sep="\t", col.names=F,row.names=F)
write.table( obspossible, file="likelihood_colval.tsv", sep="\t", col.names=F,row.names=F)
```
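The nested loops make the mechanics explicit; the same matrix can also be built in one vectorized call (an equivalent sketch).
```
# Equivalent vectorized construction of the likelihood matrix.
likelihood2 = outer(hyp, obspossible,
                    function(p, k) dbinom(k, size=N, prob=p))
all.equal(likelihood, likelihood2)  # should be TRUE
```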
OK, now we have the likelihood for each possible hypothesis over every
possible observed outcome using a simple mechanical computation.
Let's make sure that our distributions properly sum to 1.0. The sum
across the columns should be 1.0 for each of the 9 hypotheses:
```
apply(likelihood, 1, sum)
[1] 1 1 1 1 1 1 1 1 1
```
GOOD! Each hypothesis places a distribution over all observed
possibilities that sums to 1.0.
Let's look at all the hypotheses:
```
for (hh in 1:length(hyp)){
fn = paste("hypoth.",hh,".png",sep="")
png(fn)
plot(obspossible, likelihood[hh,],type="b")
title(paste("hypothesis p=",hyp[hh]))
dev.off()
}
mycol=rainbow(9)
png("hypoth.all.png")
plot(obspossible, likelihood[1,],col=mycol[1],type="b",ylim=c(0,0.4))
# overlay the remaining hypotheses on the same axes
for (hh in 2:length(hyp)){
points(obspossible, likelihood[hh,],col=mycol[hh],type="b")
}
dev.off()
```
These are plots of the likelihood over the observations 0:10 given the
hypothesis probability.
[slideshow of the per-hypothesis plots hypoth.1.png ... hypoth.9.png and hypoth.all.png]
Each plot shows, for a particular hypothesis probability of heads (say
0.1) and 10 coin flips, the probability distribution over the outcomes
of 0 heads, 1 head, ..., 10 heads under the model. For the binomial
with p=0.1 there is about a 40% probability of seeing 1 head out of
the 10 flips.
================================
# Likelihood Dual / Posterior
Let's look at the dual or along the other axis of the matrix we
constructed. This is the value of the likelihood at each
hypothesis=(0.1,...0.9) given a single observation of 0:11. First note
that they do not sum to 1.0:
```
apply(likelihood, 2, sum)
[1] 0.4914342 0.8287079 0.9061899 0.9099927 0.9091388 0.9090730 0.9091388
[8] 0.9099927 0.9061899 0.8287079 0.4914342
```
```
for (oo in 1:length(obspossible)){
fn = paste("obs.",oo,".png",sep="")
png(fn)
plot(hyp, likelihood[,oo],type="b")
title(paste("obs=",obspossible[oo]))
dev.off()
}
mycol=rainbow(11)
png("obs.all.png")
plot(hyp, likelihood[,1],col=mycol[1],type="b",ylim=c(0,0.4))
# overlay the remaining observations on the same axes
for (oo in 2:length(obspossible)){
points(hyp, likelihood[,oo],col=mycol[oo],type="b")
}
dev.off()
```
These are the un-normalized distributions of the different hypotheses
given a single observation of obs:
[slideshow of the per-observation plots obs.1.png ... obs.11.png and obs.all.png]
Note that these are very similar to the likelihood plots from
before. There is a dual symmetry. The conjugate to the binomial
distribution is the beta distribution. As functions of $p$, the two
are EXACTLY the same except for the normalizing factor
and the interpretation! (Which might be hard to tell from the equations
at first glance.)
Each plot shows, for a particular observation from my experiment (say
0 heads out of 10 flips), the likelihood values over the different
hypotheses (0.1, 0.2, ..., 0.9). This is the dual of the likelihood
plots shown first: we are taking slices through the probability
distributions along the other axis.
Note these "dual" slices don't sum to 1.0! They are still
values between 0 and 1 and can be compared against each other (for
example, by taking the maximum) because we've taken care to make sure
our models return proper probability distributions. However, we can't
interpret them as probability _distributions_ yet.
***This is very important!:***
How can we interpret these numbers as probability distributions?
Bayes rule says to simply multiply by prior and normalize.
- Going one way, we can propose multiple hypothesis models to make
predictions about how the outcome of an experiment will turn out (the
likelihood)
- We can also quantify how much we believe each model is likely to be
true (the prior).
- We can then run experiments and let Nature tell us the results (the
observations).
- We can then mechanically plug these resulting observations into our
models and make _inferences_ about how likely each of the
different models is to be ***true*** (the posterior).
- This ***is*** the scientific method and codifies logical thought
under uncertainty.
Note that for a uniform prior, this multiplying by the prior and
normalizing is the same as dividing by the sum: the prior cancels out
and does not change the shape. For a non-uniform prior, more
probability is put on hypotheses with higher a-priori probability. We
do this below in the Informative Prior section.
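Here is a quick numerical check of that equivalence (a sketch), using the likelihood column for an observation of 3 that the next section works with.
```
# With a uniform prior, multiplying by the prior and normalizing is the
# same as simply rescaling the likelihood column to sum to 1.
lik3 = likelihood[,4]                              # P(obs=3 | each hypothesis)
post_uniform = (rep(1/9,9)*lik3) / sum(rep(1/9,9)*lik3)
post_rescaled = lik3 / sum(lik3)
all.equal(post_uniform, post_rescaled)             # TRUE: same shape, same values
```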
Rather than cuts parallel to the two axes, you can also plot the
likelihood as a 2d image:
```
png("like.2d.png")
image(likelihood)
dev.off()
```
Or in 3d: three.html
================================
# Uniform Prior
Say we see a single observation of 3 out of 10.
First let's pull out the probability of observing a single 3 (out of
10) under the different hypotheses (obspossible[4] = 3 because
obspossible starts at 0):
```
Pobs3 = likelihood[,4]
Pobs3
[1] 0.057395628 0.201326592 0.266827932 0.214990848 0.117187500 0.042467328
[7] 0.009001692 0.000786432 0.000008748
sum(Pobs3)
[1] 0.9099927
which.max(Pobs3)
3
png("Pobs3.png")
plot(hyp, Pobs3,type="b")
dev.off()
```
Good! If you observe a single observation of 3 (out of 10), then the
hypothesis with the highest likelihood is p=0.3, at 26.7%. This is the
maximum likelihood choice. The next most likely is p=0.4 at 21.5%.
Here we can compare values through operations like the maximum even
though they do not sum to one and are not a distribution.
Let's look at the true posterior with uniform prior properly
normalized:
```
sum(Pobs3)
[1] 0.9099927
# doesn't sum to 1 because the sum runs over hypotheses, not observations
# uniform prior and normalize
prior = rep(1/9,9)
posterior = prior*Pobs3
posterior = posterior/sum(posterior)
posterior
[1] 6.307262e-02 2.212398e-01 2.932199e-01 2.362556e-01 1.287785e-01
[6] 4.666777e-02 9.892049e-03 8.642179e-04 9.613264e-06
png("Pobs3post.png")
plot(hyp, posterior,type="b")
dev.off()
```
As you can see, with the uniform prior the posterior has changed only
in scale; the shape remains unchanged. But now summing over all
hypotheses gives 1.0, so it is a proper distribution. Getting the
"partition function" normalization correct so everything sums to 1.0
is key if you want probabilities. With more complicated models this
can sometimes be tricky, so much so that fields like Bayes decision
theory construct functions of ratios so that these normalizations
cancel out.
With a real probability distribution, you can also talk about the
maximum a posteriori (MAP) choice, which for a uniform prior is the
maximum-likelihood choice. There is also the posterior mean estimate
as an alternative to the maximum likelihood estimate:
```
sum(hyp*posterior)
[1] 0.3330378
```
The posterior mean estimate is 0.333, which is close to the maximum
likelihood estimate. Note it doesn't have to lie exactly on one of the
hypotheses (0.3, 0.4).
This is the Bayesian way. You keep around the distributions of
everything carrying full information and then possibly take reductions
like max or mean if you want a single estimate.
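A minimal sketch of those reductions applied to the posterior just computed:
```
# Single-number summaries of the posterior over hypotheses.
hyp[which.max(posterior)]  # MAP estimate (here also maximum likelihood): 0.3
sum(hyp*posterior)         # posterior mean: ~0.333
```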
================================
# Informative Prior
Let's say we believe a-priori that the coin is biased and will come up
heads 30% of the time. Let's put a prior probability of 76% on the 0.3
hypothesis and spread the remainder uniformly over the others.
```
hypPrior= c( 0.03, 0.03, 0.76, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03 )
posterior = likelihood # set up storage
for (hypii in 1:length(hyp)){
for (obsii in 1:length(obspossible)){
posterior[hypii,obsii] = likelihood[hypii,obsii]*hypPrior[hypii]
}
}
for (obsii in 1:length(obspossible)){
# normalize
posterior[,obsii] = posterior[,obsii]/sum(posterior[,obsii])
}
for (oo in 1:length(obspossible)){
fn = paste("post3.",oo,".png",sep="")
png(fn)
plot(hyp, posterior[,oo],type="b")
title(paste("prior=3 76% obs=",obspossible[oo]))
dev.off()
}
apply(posterior,2,sum)
[1] 1 1 1 1 1 1 1 1 1 1 1
```
[slideshow of the posterior plots post3.1.png ... post3.11.png]
Each column now sums to 1 over the hypotheses. You can see how the
prior really favors the 0.3 hypothesis. Only when you see an
observation of 7 or greater does the likelihood overcome the prior and
change the most likely call (a quick check of this is sketched below).
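A quick way to see this crossover (a sketch) is to pull out the MAP hypothesis for every possible observation under this prior.
```
# MAP hypothesis for each possible observed count of heads (0..10).
# The call should stay at 0.3 until the observed count reaches 7 or more.
hyp[apply(posterior, 2, which.max)]
```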
================================
# Multiple Observations Of Same Outcome
Before, we had the probability given a single observation: $P(obs |
hyp)$, or likelihood[hyp,obs].
For multiple observations, if you assume independence, you can
multiply them together:
$$P(\{obs\} | hyp ) = \prod_n P(obs_n | hyp)$$
(Multiplying makes probabilities smaller because each factor is less
than or equal to 1.0. With a large number of observations these
probabilities can get very small (think 10^-100 for 100 observations,
each at 10% probability). Are they still a distribution when they are
so small? Yes, because the input space is much larger for multiple
observations: you have to sum over all possibilities of _all_
observations. See the appendix.)
Bayes rule applies in the same way to multiple observations.
Let's say you see 5 observations of 3. What is the likelihood of these
multiple observations under the 9 different hypotheses? It is just the
probability of the single observation of 3 under each hypothesis
(Pobs3 = $P(3|H_i)$) raised to the 5th power, because the observations
are independent. For example, for just one hypothesis:
$$ P( (3,3,3,3,3) |H) = P(3|H) P(3|H) P(3|H) P(3|H) P(3|H) = (P(3|H))^5 $$
```
Pobs3^5
[1] 6.228652e-07 3.307545e-04 1.352560e-03 4.593036e-04 2.210072e-05
[6] 1.381258e-07 5.910453e-11 3.008194e-16 5.123230e-26
# With uniform prior, shape is UNCHANGED
prior = rep(1/9,9)
posterior = prior*Pobs3^5
posterior = posterior/sum(posterior)
posterior
[1] 2.876338e-04 1.527396e-01 6.246005e-01 2.121025e-01 1.020592e-02
[6] 6.378530e-05 2.729397e-08 1.389158e-13 2.365864e-23
png("Pobs3pow5.png")
plot(hyp, posterior,type="b")
points(hyp, Pobs3/sum(Pobs3) ,type="b",col="red")
title("Black: posterior 5 observations of 3/10.\nRed: posterior of single 3/10 observation")
dev.off()
```
The maximum likelihood choice is again p=0.3, with likelihood
1.35e-03. But you can see that the likelihood is peaking up on 0.3 and
falling off rapidly on the other hypotheses because of the
multiplication. The posterior is 62.5% at the 0.3 hypothesis.
With a large number of 3-observations, say 100, almost all the
probability falls on 0.3. Note that with such small probabilities you
must be careful of floating point underflow. In this case computing
the posterior puts essentially 100% probability on 0.3, with "rounding
error" at the other hypotheses. As shown later, you usually solve this
problem by storing log(probability), where multiplication becomes
addition; adding probabilities then requires moving in and out of log
space with exp and log (slower than multiplying probabilities). A
log-space version of this computation is sketched after the following
code block.
```
Pobs3^100
[1] 7.724693e-125 2.455354e-70 4.198837e-58 1.745811e-67 7.729100e-94
[6] 6.389894e-138 2.706543e-205 3.682287e-311 0.000000e+00
prior = rep(1/9,9)
posterior100 = prior*Pobs3^100
posterior100 = posterior100/sum(posterior100)
posterior100
[1] 1.839722e-67 5.847701e-13 1.000000e+00 4.157844e-10 1.840772e-36
[6] 1.521825e-80 6.445934e-148 8.769779e-254 0.000000e+00
sum(posterior)
[1] 1
png("Pobs3pow5c100.png")
plot(hyp, posterior100,type="b",col="blue")
points(hyp, Pobs3/sum(Pobs3) ,type="b",col="red")
points(hyp, posterior ,type="b")
title("Black: posterior 5 observations of 3/10.\nRed: posterior of single 3/10 observation\nBlue: 100 observations")
dev.off()
```
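Here is a minimal sketch of that log-space trick for the same 100-observation posterior, using a log-sum-exp style normalization.
```
# Same posterior computed in log space to avoid underflow.
loglike = 100*dbinom(3, size=N, prob=hyp, log=TRUE)  # log of Pobs3^100
logpost = loglike + log(rep(1/9,9))                  # add the log prior
logpost = logpost - max(logpost)                     # stabilize before exponentiating
posterior_log = exp(logpost)/sum(exp(logpost))
posterior_log                                        # matches posterior100 up to rounding
```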
In general, the likelihood under these multinomial-style models is
each outcome's probability raised to the number of times that outcome
is observed (up to a multinomial coefficient that does not depend on
the hypothesis and cancels in the normalization):
$$P( obs | hyp) = \prod_i (hypP_i) ^ {obs_i}$$
================================
# Multiple observations Of Different Outcomes
We've seen if you observe the same value several times then the
likelihood peaks up on the more probable hypotheses looking at single
observation. But what if you observe multiple observations at
different points, where does the likelihood peak up? Say you observed
the following observations after running 102 different experiments:
```
myobs = c(3,11,24,27,20,10,4,2,1,0,0)
sum(myobs)
[1] 102
cbind(obspossible,myobs)
obspossible myobs
[1,] 0 3
[2,] 1 11
[3,] 2 24
[4,] 3 27
[5,] 4 20
[6,] 5 10
[7,] 6 4
[8,] 7 2
[9,] 8 1
[10,] 9 0
[11,] 10 0
```
What is the likelihood of this ensemble of data under the different
hypotheses? For each hypothesis, take the product over outcomes of the
probability that hypothesis assigns to each outcome, raised to the
number of times that outcome is seen in the data.
```
mylike = sapply(1:9, function(hh){ prod(likelihood[hh,]^myobs)})
mylike
[1] 1.447318e-156 1.882205e-97 1.239293e-82 2.775077e-90 2.104950e-115
[6] 1.317852e-158 2.080566e-225 0.000000e+00 0.000000e+00
mylike/sum(mylike)
[1] 1.167858e-74 1.518773e-15 1.000000e+00 2.239241e-08 1.698509e-33
[6] 1.063389e-76 1.678832e-143 0.000000e+00 0.000000e+00
-log2(mylike)
[1] 517.6874 321.3146 272.0886 297.5010 380.9479 524.4665 746.3768 Inf
[9] Inf
```
The maximum likelihood is p=0.3, which is nearly 8 orders of magnitude
greater than p=0.4 given the data (the posterior gets essentially 100%
at p=0.3, with the rest lost to underflow). Because probabilities get
so small with a larger number of observations, taking -log2 not only
makes the numerics easier but also lets you interpret the values as
bits. Even here, you run into underflow issues that make normalization
tricky. Here p=0.3 has (297.5010 - 272.0886) = 25.4 bits in support of
it over the p=0.4 hypothesis. Roughly reading this as a log odds
ratio, the probability of making a classification error is
$2^{-25.4} \approx 2.2\times10^{-8}$. Thinking in bits is useful!
When there are two hypotheses, examining the log probability ratio
$LR = \log_2 (P_1/P_2)$ has nice properties in Bayes decision
theory. You decide $P_1$ is the winner if LR is positive, otherwise
$P_2$. $LR$ gives the bits in support of $P_1$ over the alternative.
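For the data above, a sketch of this two-hypothesis comparison between p=0.3 and p=0.4 (indices 3 and 4 of mylike):
```
# log2 likelihood ratio: bits in support of p=0.3 over p=0.4.
log2(mylike[3]/mylike[4])  # ~25.4 bits, matching the -log2 difference above
```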
================================
# Maximum Likelihood and KL
From before we saw the likelihood from multiple independent
observations:
$$P( obs | hyp) = \prod_i (hypP_i) ^ {obs_i}$$
Say the hypothesis $hypP$ ranges over a set $\{hypA, hypB, \ldots\}$;
call a generic member $hypX$. Maximum likelihood is:
$$\max_X \prod_i (hypX_i) ^ {obs_i}$$
Take the negative log; since log is monotonic and negation flips max to min, we now minimize:
$$\min_X - \sum_i obs_i \log (hypX_i)$$
Divide by total number of observations $N$, a constant
$$\min_X - \sum_i (obs_i/N) \log (hypX_i)$$
Introduce the Kullback-Leibler divergence between distributions $P$ and $Q$:
$$KL(P|Q) = \sum_i P_i * \log( P_i/Q_i )$$
$$KL(P|Q) = \sum_i P_i \log( P_i) - \sum_i P_i \log( Q_i )$$
Fix $P$ to be the truth and find the hypothesis $hypX$ with minimum KL from the truth:
$$min_X KL(P|hypX) = \sum_i P_i \log( P_i) - \sum_i P_i \log( hypX_i )$$
Note that the first term, $\sum_i P_i \log(P_i)$ (the negative entropy of $P$), is constant, leaving:
$$min_X - \sum_i P_i \log( hypX_i )$$
This is exactly like maximum likelihood except $obs_i/N$ is replaced
by $P_i$. As $N$ gets big, $obs_i/N \rightarrow P_i$ by the law of
large numbers. Finding the maximum likelihood hypothesis given the
observed set of data is the same as finding the hypothesis whose
distribution has minimum KL divergence from the distribution of the
observed data.
The nice thing about thinking in KL minimization is that you can apply
a bound from Hoeffding related to Bayes decision theory:
$$ P(\mbox{Err}) = P( \log \frac {p_1(O)}{p_2(O)} > 0 | O \sim p_2 ) < \exp(-2N KL(p_2, p_1)^2) $$
This says that the probability of making a misclassification error
by choosing the maximum posterior through the log posterior ratio
goes down exponentially fast with the number of data points $N$ and
the squared KL divergence between the competing hypotheses $p_1, p_2$.
So with large data and hypotheses that don't have the same shape you
will make very few errors.
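As an illustration only, here is a sketch of plugging numbers into the bound as stated (KL taken in nats), for the p=0.3 versus p=0.4 hypothesis distributions and the 102 experiments used above.
```
# Plug-in of the stated bound for two competing binomial hypotheses.
p2 = dbinom(0:10, size=10, prob=0.3)  # distribution the data came from
p1 = dbinom(0:10, size=10, prob=0.4)  # competing hypothesis
kl21 = sum(p2*log(p2/p1))             # KL(p2, p1) in nats
exp(-2*102*kl21^2)                    # upper bound on P(Err)
```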
So taking your observations, estimating a distribution from them, and
comparing that distribution through KL to a set of competing
distributions is the same as maximum likelihood! KL is looking for the
distribution with the same shape.
Let's compute KL for the data in the previous section:
```
kl = function(pp, qq){
sum(pp*log2(pp/qq))
}
# estimate distribution from observed data with 0.01 Laplace pseudo count
myobsP = (myobs+0.01)/sum(myobs+0.01)
# compute KL from observed to hypothesis distributions
sapply(1:length(hyp), function(hh){ kl(myobsP, likelihood[hh,])})
[1] 2.43565192 0.50803132 0.02383006 0.27166534 1.08857409 2.49441972
[7] 4.66870578 8.10767875 14.48080813
png("kl.png")
plot(obspossible,likelihood[2,],type="b",col="red")
points(obspossible,likelihood[3,],type="b",col="green")
points(obspossible,likelihood[4,],type="b",col="blue")
points(obspossible,myobsP,type="b",lwd=2)
title("data = black. hypothesis=0.3=green 0.4=blue 0.2=red\nKL(black,green)=0.0238 ,blue)=0.27 ,red)=0.5")
dev.off()
```
Yes, the minimum KL is the maximum likelihood answer! You can see that
green and black have almost exactly the same shape. So with multiple
observations over several values, the likelihood peaks up on the
hypothesis with the most similar shape as measured by KL!
TODO: what if the data has nothing to do with your models and the
"fit" is horrible?
KL (in base 2) can be read as the expected number of extra bits per
observation needed when coding one distribution with the other, i.e. a
measure of how different the two distributions are. This example has
102 data points. Let's compute the number of bits between the top two
hypotheses:
```
102*(0.27166534-0.02383006)
[1] 25.2792
```
The top hypothesis has 25.2792 bits in support of it over the 2nd
alternative as measured by KL. This is very close to the maximum
likelihood difference in bits (25.4)! Min KL is max likelihood. TODO:
why not exactly the same (hint: the 0.01 Laplace pseudo-count makes
the empirical distribution differ slightly from obs/N).
================================
# Appendix: Multiplying Independent Distributions is Still a Distribution
Let's consider multiple Bernoulli trials of flipping a biased coin:
$P(H)=90\%$, $P(T)=10\%$.
For a single flip of a single coin it is obviously a distribution:
$\sum_{c_1} P(c_1) = 0.9 + 0.1 = 1.0$
For two independent flips we want the joint distribution, a 2d matrix
covering the possibilities [[HH, HT],[TH,TT]], where we multiply the
two Bernoullis together:
$$P(c_1,c_2) = P(c_1) P(c_2) = [[0.9 \times 0.9, 0.9 \times 0.1], [0.1 \times 0.9, 0.1 \times 0.1]]$$
The probabilities get smaller because of the multiplying, but we have a
larger number of terms in the sum to account for the different coins.
It is a distribution: $\sum_{c_1,c_2} P(c_1,c_2) = 0.81 + 0.09 + 0.09 + 0.01 = 1.0$
For 3 coins, the joint is a 3d array. For $N$ coins, the joint is an
N-d array, which can be very large: for 100 coins there would be
2^100 terms in the sum to evaluate! However, note that for Bernoulli
flips the probabilities depend only on the number of heads and tails:
$$\sum_{c_1,c_2} P(c_1,c_2) = 0.9^2 0.1^0 + 0.9^1 0.1^1 + 0.9^1 0.1^1 + 0.9^0 0.1^2 $$
$$= choose(2,2)\, 0.9^2 0.1^0 + choose(2,1)\, (0.9 \times 0.1) + choose(2,0)\, 0.9^0 0.1^2$$
This is exactly the binomial distribution! It sums to 1.0:
$$ 1 = 1^N = (p + 1-p)^N = \sum_{k=0}^N choose(N, k) p^k (1-p)^{(N-k)} $$
This is the motivation for why multiplying independent distributions
is still a distribution that sums to 1 over all possibilities.
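A one-line numerical check of this identity (a sketch) for the 100-coin case mentioned above:
```
# The 2^100-term sum collapses to the binomial and sums to 1.
sum(dbinom(0:100, size=100, prob=0.9))
```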