Probability (Appendix C)

This post summarizes my notes from studying Simon J.D. Prince's Deep Learning textbook.

https://udlbook.github.io/udlbook/

1. Random variables and probability distributions

A random variable $x$ is a quantity that is uncertain: it takes some value, but we do not know in advance which one. It may be discrete or continuous. If we observe several instances of a random variable $x$, they will generally take different values, and the relative propensity to take different values is described by a probability distribution $Pr(x)$.

For a discrete variable, the distribution associates a probability $Pr(x=k) \in [0,1]$ with each possible outcome $k$, and these probabilities sum to one. For a continuous variable, there is a non-negative probability density $Pr(x=a) \ge 0$ associated with each value $a$ in the domain of $x$, and the integral of this probability density function (PDF) over the domain must be one. Note that the density at a point $a$ can be greater than one.

This chapter assumes that all random variables are continuous; the results carry over to the discrete case by replacing the integrals with sums.
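The remark that a density may exceed one while still integrating to one can be checked numerically. The sketch below is only an illustration: the uniform distribution on $[0, 0.5]$ and the helper names `uniform_pdf` and `integrate` are assumptions made for this example, not something taken from the book.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high] (illustrative choice)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


density_at_point = uniform_pdf(0.25)            # the density exceeds one here
total_mass = integrate(uniform_pdf, -1.0, 1.0)  # area under the PDF
print(density_at_point, total_mass)
```

The density is 2 everywhere on the interval, yet the area under the curve is still (approximately) one.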

Joint Probability

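Consider two random variables $x$ and $y$. The joint probability $Pr(x,y)$ describes the propensity of $x$ and $y$ to take particular combinations of values.

In other words, the joint probability is a distribution over which combinations of $x$ and $y$ occur more often. It extends to more variables, as in $Pr(x, y, z)$, and multiple random variables can be collected into a vector $\bold{x}$ and written $Pr(\bold{x})$. Since a value of $\bold{x}$ can be treated as a single point, laying these points out along one axis lets us draw the distribution in a 2D plot. Going one step further, we can write $Pr(\bold{x, y})$ for the joint distribution of all the elements of two random vectors.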

Marginalization

If we know the joint distribution $Pr(x, y)$, we can recover the marginal distributions $Pr(x)$ and $Pr(y)$ by integrating over the other variable.

\int Pr(x,y)\cdot dx=Pr(y)\\\int Pr(x,y)\cdot dy=Pr(x)

This operation is called marginalization. It can be interpreted as computing the distribution of one variable regardless of the value the other takes. The idea extends to $Pr(x,y,z)$: integrating over $y$ recovers the joint distribution $Pr(x,z)$.
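In the discrete case the integrals become sums, so marginalization amounts to summing a table along one axis. A minimal sketch, assuming a small made-up joint table `JOINT` (its values are hypothetical and chosen only so that they sum to one):

```python
# Hypothetical joint table Pr(x, y): rows index values of x, columns index values of y.
JOINT = [
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
]


def marginal_x(joint):
    """Sum over y (across each row) to obtain Pr(x)."""
    return [sum(row) for row in joint]


def marginal_y(joint):
    """Sum over x (down each column) to obtain Pr(y)."""
    return [sum(column) for column in zip(*joint)]


pr_x = marginal_x(JOINT)
pr_y = marginal_y(JOINT)
print(pr_x, pr_y)
```

Both marginals sum to one because the joint table does.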

Conditional probability and likelihood

The conditional probability $Pr(x|y)$ is the probability of the random variable $x$ taking a certain value, assuming we know the value of $y$. It can be found by taking a slice through the joint distribution $Pr(x,y)$ at the fixed value of $y$. The slice is then divided by the probability of that $y$ occurring (the integral of the slice, which is $Pr(y)$) so that the resulting distribution integrates to one.

The conditional probabilities can be computed as follows.

Pr(x|y)=\frac{Pr(x,y)}{Pr(y)}, Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}

Viewed as a function of $x$, $Pr(x|y)$ must sum to one. Viewed as a function of $y$, the same quantity is termed the likelihood of $x$ given $y$, and it does not have to sum to one.
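The slice-and-normalize recipe, and the difference between reading $Pr(x|y)$ as a function of $x$ versus $y$, can be sketched on a discrete table. The table `JOINT` and the function name `conditional_x_given_y` are hypothetical and only serve this illustration.

```python
# Hypothetical joint table Pr(x, y): rows index values of x, columns index values of y.
JOINT = [
    [0.10, 0.20, 0.10],
    [0.30, 0.05, 0.25],
]


def conditional_x_given_y(joint, y_index):
    """Slice the joint table at a fixed y, then divide by Pr(y) to normalize."""
    joint_slice = [row[y_index] for row in joint]
    pr_y = sum(joint_slice)  # the "integral" of the slice
    return [value / pr_y for value in joint_slice]


# As a function of x (y fixed to its first value): a distribution that sums to one.
distribution_over_x = conditional_x_given_y(JOINT, 0)

# As a function of y (x fixed to its first value): a likelihood, which need not sum to one.
likelihood_over_y = [conditional_x_given_y(JOINT, j)[0] for j in range(3)]

print(sum(distribution_over_x), sum(likelihood_over_y))
```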

Bayes’ theorem

Rearranging the conditional probability formulas gives two expressions for the joint probability $Pr(x,y)$:

Pr(x,y)=Pr(x|y)Pr(y)=Pr(y|x)Pr(x)

Rearranging once more gives:

Pr(x|y)=\frac{Pr(y|x)Pr(x)}{Pr(y)}

This expression relates $Pr(x|y)$ to $Pr(y|x)$ and is known as Bayes’ theorem.

Here $Pr(y|x)$ is the likelihood, $Pr(x)$ is the prior probability, $Pr(y)$ is the evidence, and $Pr(x|y)$ is the posterior probability (of $x$ after observing $y$).
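A small numeric sketch may help connect the four terms. The prior and likelihood values below are invented for illustration (a rare binary state $x$ and one observed outcome $y$), and the function name `posterior` is an assumption of this example.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem for a discrete x and one observed value of y.

    prior[i]      = Pr(x = i)
    likelihood[i] = Pr(y = observed | x = i)
    Returns (posterior over x, evidence Pr(y = observed)).
    """
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized)  # Pr(y), obtained by marginalizing over x
    return [u / evidence for u in unnormalized], evidence


prior = [0.99, 0.01]       # hypothetical Pr(x)
likelihood = [0.05, 0.90]  # hypothetical Pr(y = observed | x)
post, evidence = posterior(prior, likelihood)
print(post, evidence)
```

Even with a strong likelihood for the second state, the small prior keeps its posterior probability well below one half in this example.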

Independence

If the value of the random variable $y$ tells us nothing about $x$, and vice versa, we say that $x$ and $y$ are independent. In that case we can write:

Pr(x|y)=Pr(x), Pr(y|x)=Pr(y)

It follows that the joint distribution factorizes as:

Pr(x,y)=Pr(x)Pr(y)

As in panels b) and c) of the figure, the conditional distributions are then all identical.
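For a discrete table, the factorization $Pr(x,y)=Pr(x)Pr(y)$ gives a direct test for independence. The following is a sketch under that assumption; the function name `is_independent` and the example tables are hypothetical.

```python
def is_independent(joint, tol=1e-12):
    """Check whether every entry of the joint table equals Pr(x) * Pr(y)."""
    pr_x = [sum(row) for row in joint]
    pr_y = [sum(column) for column in zip(*joint)]
    return all(
        abs(joint[i][j] - pr_x[i] * pr_y[j]) <= tol
        for i in range(len(pr_x))
        for j in range(len(pr_y))
    )


# A joint table built as an outer product of two marginals is independent by construction.
PR_X = [0.2, 0.8]
PR_Y = [0.5, 0.3, 0.2]
independent_joint = [[px * py for py in PR_Y] for px in PR_X]

# A table where x and y always agree is strongly dependent.
dependent_joint = [[0.5, 0.0], [0.0, 0.5]]

print(is_independent(independent_joint), is_independent(dependent_joint))
```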

2. Expectation

Consider a function $f[x]$ and a probability distribution $Pr(x)$. The expected value of $f[x]$ with respect to $Pr(x)$ is defined as:

\Bbb{E}_x[f[x]] = \int f[x]Pr(x)dx.

This extends to multiple variables as follows:

\Bbb{E}_{x,y}[f[x,y]]=\int \int f[x,y]Pr(x,y)dxdy.
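The defining integral can be approximated numerically. The sketch below assumes a standard normal density for $Pr(x)$ and a simple midpoint rule over a truncated range; both are choices made for this illustration only, and the names `normal_pdf` and `expectation` are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)


def expectation(f, pdf, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f(x) * pdf(x) over [a, b]."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width
        total += f(x) * pdf(x)
    return total * width


mean = expectation(lambda x: x, normal_pdf, -10.0, 10.0)               # close to 0
second_moment = expectation(lambda x: x * x, normal_pdf, -10.0, 10.0)  # close to 1
print(mean, second_moment)
```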

Rules for manipulating expectations

\Bbb{E}[k]=k\\\Bbb{E}[k\cdot f[x]]=k\cdot \Bbb{E}[f[x]]\\\Bbb{E}[f[x]+g[x]]=\Bbb{E}[f[x]]+\Bbb{E}[g[x]]\\ \Bbb{E}_{x,y}[f[x]\cdot g[y]]=\Bbb{E}_x[f[x]]\cdot \Bbb{E}_y[g[y]], \text{ if } x, y \text{ independent.}

These rules generalize directly to vector form.

\Bbb{E}[\bold{A}]=\bold{A}\\ \Bbb{E}[\bold{A}\cdot \bold{f}[\bold{x}]]=\bold{A}\Bbb{E}[\bold{f}[\bold{x}]]\\ \Bbb{E}[\bold{f}[\bold{x}]+\bold{g}[\bold{x}]]=\Bbb{E}[\bold{f}[\bold{x}]]+\Bbb{E}[\bold{g}[\bold{x}]]\\ \Bbb{E}_{\bold{x},\bold{y}}[\bold{f}[\bold{x}]^T\bold{g}[\bold{y}]]=\Bbb{E}_{\bold{x}}[\bold{f}[\bold{x}]]^T\Bbb{E}_{\bold{y}}[\bold{g}[\bold{y}]], \text{ if } \bold{x}, \bold{y} \text{ independent.}

Here $\bold{A}$ is a constant matrix, $\bold{f[x]}$ is a function of the vector $\bold{x}$ that returns a vector, and $\bold{g[y]}$ is likewise a vector-valued function of the vector $\bold{y}$.

The expectation of a vector, $\Bbb{E}[\bold{x}]$, collects the expectations of the individual elements: the $i$-th element of $\Bbb{E}[\bold{x}]$ is the expectation of the $i$-th element $x_i$.
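The scalar rules can be sanity-checked on small discrete distributions, where expectations are finite sums. Everything in this sketch (the value lists, probabilities, and the functions `f` and `g`) is made up for illustration.

```python
# Hypothetical discrete distributions for two independent variables.
XS, PX = [0, 1, 2], [0.2, 0.5, 0.3]
YS, PY = [-1, 1], [0.4, 0.6]


def expect(f, values, probs):
    """Expectation of f under a discrete distribution."""
    return sum(f(v) * p for v, p in zip(values, probs))


def f(x):
    return x * x


def g(y):
    return 3 * y + 1


# Linearity: E[2 f(x) + 5] = 2 E[f(x)] + 5.
linear_lhs = expect(lambda x: 2 * f(x) + 5, XS, PX)
linear_rhs = 2 * expect(f, XS, PX) + 5

# Product rule under independence, where Pr(x, y) = Pr(x) Pr(y).
product_lhs = sum(f(x) * g(y) * p * q for x, p in zip(XS, PX) for y, q in zip(YS, PY))
product_rhs = expect(f, XS, PX) * expect(g, YS, PY)

print(linear_lhs, linear_rhs, product_lhs, product_rhs)
```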

Mean, Variance and Covariance

The expectations of some particular functions $f[\bullet]$ have special names. They are commonly used to summarize the properties of complex probability distributions.

If $f[x]=x$, the expectation $\Bbb{E}[x] = \mu$ is called the mean. It is a measure of the center of the distribution. If $f[x]=(x-\mu)^2$, the expectation $\Bbb{E}[(x-\mu)^2]$ is called the variance, $\sigma^2$. It measures how spread out the distribution is. The standard deviation $\sigma$ is the square root of the variance and is likewise a measure of spread.

As the name suggests, the covariance $\Bbb{E}[(x-\mu_x)(y-\mu_y)]$ measures the degree to which $x$ and $y$ co-vary. It is large when the variances of both variables are large and $y$ tends to increase as $x$ increases. If $x$ and $y$ are independent, the covariance is zero. The converse does not hold: a covariance of zero does not guarantee that $x$ and $y$ are independent.

For multiple random variables stored in a vector $\bold{x} \in \R^D$, the covariances are collected in the $D\times D$ covariance matrix $\Bbb{E}[(\bold{x}-\bold{\mu_x})(\bold{x}-\bold{\mu_x})^T]$, where $\bold{\mu_x}=\Bbb{E}[\bold{x}]$. The element at position $(i,j)$ of this matrix is the covariance between the variables $x_i$ and $x_j$.
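A standard counterexample shows why zero covariance does not imply independence: let $x$ be uniform on $\{-1, 0, 1\}$ and set $y = x^2$. The sketch below works with exact fractions; the example itself is an assumption added here, not taken from the book.

```python
from fractions import Fraction

# Joint distribution of (x, y) with x uniform on {-1, 0, 1} and y = x**2.
POINTS = {
    (-1, 1): Fraction(1, 3),
    (0, 0): Fraction(1, 3),
    (1, 1): Fraction(1, 3),
}


def expect(f):
    """Expectation of f(x, y) under the joint distribution above."""
    return sum(f(x, y) * p for (x, y), p in POINTS.items())


mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
cov_xy = expect(lambda x, y: (x - mu_x) * (y - mu_y))

# Independence would require Pr(x=0, y=0) = Pr(x=0) * Pr(y=0).
pr_x0 = sum(p for (x, _), p in POINTS.items() if x == 0)
pr_y0 = sum(p for (_, y), p in POINTS.items() if y == 0)
pr_x0_y0 = POINTS[(0, 0)]

print(cov_xy, pr_x0_y0, pr_x0 * pr_y0)
```

The covariance is exactly zero, yet the joint probability of $(0, 0)$ differs from the product of the marginals, so the variables are dependent.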

Variance identity

\Bbb{E}[(x-\mu)^2]=\Bbb{E}[x^2]-\Bbb{E}[x]^2.

(Proof omitted.)
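For completeness, the identity can be checked in a few lines using only the rules for manipulating expectations, with $\mu=\Bbb{E}[x]$ treated as a constant:

\begin{align} \Bbb{E}[(x-\mu)^2] &= \Bbb{E}[x^2-2\mu x+\mu^2] \\&= \Bbb{E}[x^2]-2\mu\Bbb{E}[x]+\mu^2 \\&= \Bbb{E}[x^2]-2\mu^2+\mu^2 \\&= \Bbb{E}[x^2]-\Bbb{E}[x]^2 \end{align}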

Standardization

z = \frac{x-\mu}{\sigma}

In vector form, standardization becomes:

\bold{z}=\bold{\Sigma}^{-\frac{1}{2}}(\bold{x}-\bold{\mu})
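In the scalar case, standardization shifts a variable to zero mean and rescales it to unit variance. A minimal sketch on a list of numbers, assuming the empirical mean and the population (divide-by-$n$) standard deviation stand in for $\mu$ and $\sigma$; the function name and data are hypothetical.

```python
import math


def standardize(values):
    """Return z = (x - mu) / sigma using the empirical mean and population std."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample: mean 5, std 2
z = standardize(data)
z_mean = sum(z) / len(z)
z_var = sum((v - z_mean) ** 2 for v in z) / len(z)
print(z, z_mean, z_var)
```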

3. Normal probability distribution

The textbook uses several probability distributions, including the Bernoulli, categorical, Poisson, and von Mises distributions and mixtures of Gaussians, but the most common one is the normal (Gaussian) distribution.

Univariate normal distribution

The univariate normal distribution over a scalar random variable $x$ has two parameters, the mean $\mu$ and the variance $\sigma^2$.

Pr(x)=Norm_x[\mu, \sigma^2]=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.

The normal distribution with $\mu=0$ and $\sigma^2=1$ is called the standard normal distribution.
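The density formula translates directly into code. The sketch below implements it by hand and, as a cross-check, compares against `statistics.NormalDist` from the Python standard library (which is parameterized by the standard deviation rather than the variance). The function name `norm_pdf` is an assumption of this example.

```python
import math
from statistics import NormalDist


def norm_pdf(x, mu, sigma_sq):
    """Univariate normal density with mean mu and variance sigma_sq."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma_sq)) / math.sqrt(2 * math.pi * sigma_sq)


peak_of_standard_normal = norm_pdf(0.0, 0.0, 1.0)  # equals 1 / sqrt(2 * pi)
hand_written = norm_pdf(0.3, 1.0, 4.0)
library_value = NormalDist(mu=1.0, sigma=2.0).pdf(0.3)  # sigma = sqrt(4.0)
print(peak_of_standard_normal, hand_written, library_value)
```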

Multivariate normal distribution

The multivariate normal distribution generalizes the normal distribution to a vector $\bold{x} \in \R^D$. It is defined by a mean vector $\bold{\mu} \in \R^{D\times 1}$ and a symmetric positive definite $D\times D$ covariance matrix $\bold{\Sigma}$.

Pr(\bold{x})=Norm_{\bold{x}}[\bold{\mu}, \bold{\Sigma}]=\frac{1}{(2\pi)^{D/2}|\bold{\Sigma}|^{1/2}}e^{-\frac{(\bold{x}-\bold{\mu})^T\bold{\Sigma}^{-1}(\bold{x}-\bold{\mu})}{2}}.

The quadratic form $-\frac{(\bold{x}-\bold{\mu})^T\bold{\Sigma}^{-1}(\bold{x}-\bold{\mu})}{2}$ in the exponent returns a scalar that decreases as $\bold{x}$ moves further from the mean $\bold{\mu}$, at a rate that depends on the matrix $\bold{\Sigma}$.

The covariance matrix $\bold{\Sigma}$ can take three forms: spherical, diagonal, and full.

\bold{\Sigma}_{spher}=\begin{bmatrix}\sigma^2 & 0\\0 & \sigma^2\end{bmatrix}, \bold{\Sigma}_{diag}=\begin{bmatrix}\sigma_1^2&0\\0&\sigma_2^2\end{bmatrix},\bold{\Sigma}_{full}=\begin{bmatrix}\sigma^2_{11}&\sigma^2_{12}\\\sigma^2_{21}&\sigma^2_{22}\end{bmatrix}

In panels a-b), the diagonal elements of the covariance matrix are all equal and the off-diagonal elements are zero. The isocontours (contour lines) are circles, and such a matrix is called a spherical covariance. In panels c-d), the diagonal elements differ from one another while the off-diagonal elements are still zero. The isocontours are then axis-aligned ellipses, and the matrix is called a diagonal covariance. If every position may hold a non-zero value, it is called a full covariance. `[❗️The covariance matrix is symmetric.❗️]`

If the covariance matrix is spherical or diagonal, then in the bivariate case the two variables $x_1$ and $x_2$ are independent.
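The independence claim for diagonal covariances can be checked numerically: the bivariate density should factorize into the product of two univariate densities. This is a sketch for the $D=2$ case only, with the $2\times 2$ inverse written out by hand; the function names and the numbers are assumptions of this example.

```python
import math


def norm_pdf(x, mu, sigma_sq):
    """Univariate normal density with mean mu and variance sigma_sq."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma_sq)) / math.sqrt(2 * math.pi * sigma_sq)


def bivariate_norm_pdf(x, mu, cov):
    """Bivariate normal density; cov is a 2x2 symmetric positive definite matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2x2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))  # (2*pi)^(D/2) with D = 2


x, mu = (0.5, -1.0), (0.0, 1.0)
diag_cov = [[2.0, 0.0], [0.0, 0.5]]
full_cov = [[2.0, 0.6], [0.6, 0.5]]

product_of_marginals = norm_pdf(x[0], mu[0], 2.0) * norm_pdf(x[1], mu[1], 0.5)
diag_density = bivariate_norm_pdf(x, mu, diag_cov)  # matches the product
full_density = bivariate_norm_pdf(x, mu, full_cov)  # does not match
print(product_of_marginals, diag_density, full_density)
```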

Product of two normal distributions

(Explanation omitted.)

Change of variable

(Omitted.)

4. Sampling

To sample from a univariate distribution $Pr(x)$, we first compute the cumulative distribution function (CDF), $F[x]=\int_{-\infty}^{x} Pr(x')dx'$. We then draw a sample $z^*$ from the uniform distribution on $[0,1]$ and evaluate it against the inverse of the CDF, $F^{-1}$. The sample $x^*$ is therefore given by:

x^*=F^{-1}[z^*]
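As a concrete instance of inverse-CDF sampling, consider an exponential distribution with rate $\lambda$, chosen here only because its CDF $F[x]=1-e^{-\lambda x}$ inverts in closed form to $F^{-1}[z]=-\log[1-z]/\lambda$. The function name and the seed are assumptions of this sketch.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one sample by pushing a uniform draw through the inverse CDF."""
    z = rng.random()                   # uniform on [0, 1)
    return -math.log(1.0 - z) / rate   # F^{-1}[z] for the exponential distribution


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # should be near 1 / rate = 0.5
print(sample_mean)
```

Using `1.0 - z` keeps the argument of the logarithm strictly positive, since `rng.random()` never returns exactly one.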

Sampling from normal distributions

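Sampling from a normal distribution is particularly simple. In both cases below, $z$ (or each element of $\bold{z}$) is drawn from a standard normal distribution.

Sampling from univariate normal distribution
$x = \mu+\sigma z.$

Sampling from multi-variate normal distribution
$\bold{x}=\bold{\mu}+\bold{\Sigma}^{1/2}\bold{z}$, where $\bold{\Sigma}^{1/2}$ is a matrix square root satisfying $\bold{\Sigma}^{1/2}(\bold{\Sigma}^{1/2})^T = \bold{\Sigma}$.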
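A sketch of the multivariate recipe for $D=2$ follows. It assumes the Cholesky factor $\bold{L}$ (with $\bold{L}\bold{L}^T=\bold{\Sigma}$) is used as the matrix square root, which is one common choice among several; the hand-written `cholesky_2x2`, the example covariance, and the seed are all hypothetical.

```python
import math
import random


def cholesky_2x2(cov):
    """Lower-triangular L with L @ L.T == cov, for a 2x2 positive definite matrix."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def sample_bivariate_normal(mu, cov, rng):
    """Draw x = mu + L z, where z has independent standard normal elements."""
    L = cholesky_2x2(cov)
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return (mu[0] + L[0][0] * z[0], mu[1] + L[1][0] * z[0] + L[1][1] * z[1])


MU = (1.0, -2.0)
COV = [[2.0, 0.6], [0.6, 0.5]]
rng = random.Random(0)
samples = [sample_bivariate_normal(MU, COV, rng) for _ in range(50_000)]

# Empirical mean and covariance should be close to MU and COV.
n = len(samples)
mean = [sum(s[k] for s in samples) / n for k in range(2)]
emp_cov_01 = sum((s[0] - mean[0]) * (s[1] - mean[1]) for s in samples) / n
emp_var_0 = sum((s[0] - mean[0]) ** 2 for s in samples) / n
print(mean, emp_var_0, emp_cov_01)
```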

Ancestral Sampling

If the joint distribution factorizes into a product of conditional probabilities, we do not need to sample from the joint distribution directly; we can use ancestral sampling instead.

The basic idea is to sample the root variable(s) first and then sample from each subsequent conditional distribution given the values already drawn. Suppose the joint distribution factorizes as:

Pr(x,y,z)=Pr(x)Pr(y|x)Pr(z|y)

To sample from $Pr(x,y,z)$, we first draw $x^*$ from $Pr(x)$, then draw $y^*$ from $Pr(y|x^*)$, and finally draw $z^*$ from $Pr(z|y^*)$.
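The three-step procedure can be sketched for binary variables. The probability tables below are invented for illustration, and the helper names `draw` and `ancestral_sample` are assumptions of this example.

```python
import random

# Hypothetical tables for Pr(x, y, z) = Pr(x) Pr(y|x) Pr(z|y), all variables binary.
PR_X = [0.7, 0.3]
PR_Y_GIVEN_X = [[0.9, 0.1], [0.2, 0.8]]  # row index is the value of x
PR_Z_GIVEN_Y = [[0.6, 0.4], [0.1, 0.9]]  # row index is the value of y


def draw(probs, rng):
    """Draw an index according to the given discrete probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def ancestral_sample(rng):
    """Sample the root first, then each child given its sampled parent."""
    x = draw(PR_X, rng)
    y = draw(PR_Y_GIVEN_X[x], rng)
    z = draw(PR_Z_GIVEN_Y[y], rng)
    return x, y, z


rng = random.Random(0)
samples = [ancestral_sample(rng) for _ in range(100_000)]
freq_000 = samples.count((0, 0, 0)) / len(samples)
print(freq_000, PR_X[0] * PR_Y_GIVEN_X[0][0] * PR_Z_GIVEN_Y[0][0])
```

The empirical frequency of a configuration should approach the product of the corresponding factors as the number of samples grows.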

Distance between probability distributions

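Supervised learning is usually framed as minimizing the distance between the discrete probability distribution implied by the samples and the probability distribution implied by the model.

Unsupervised learning is framed as minimizing the distance between the probability distribution of real examples and the distribution of samples drawn from the model.

In both cases we need a distance between two probability distributions. This section looks at the properties of several such measures.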

Kullback-Leibler divergence

The most common measure of the distance between probability distributions $p(x)$ and $q(x)$ is the Kullback-Leibler divergence (KL divergence), defined as:

D_{KL}[p(x)||q(x)]=\int p(x)\log\bigg[\frac{p(x)}{q(x)}\bigg]dx.

This distance is always greater than or equal to zero, which follows from the inequality $-\log[y] \ge 1-y$:

\begin{align} D_{KL}[p(x)||q(x)] &= \int p(x)\log\bigg[\frac{p(x)}{q(x)}\bigg]dx \\&= -\int p(x)\log\bigg[\frac{q(x)}{p(x)}\bigg]dx \\&\ge \int p(x)\bigg(1-\frac{q(x)}{p(x)} \bigg)dx \\&= \int \big(p(x) - q(x)\big)dx \\&=1-1 = 0 \end{align}

The KL divergence becomes infinite wherever $q(x)$ is zero but $p(x)$ is non-zero, so care is needed to avoid numerical problems.
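For discrete distributions the integral becomes a sum, which makes the properties above easy to observe. A minimal sketch, assuming the usual convention that terms with $p(x)=0$ contribute zero; the function name and example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL[p || q] in nats."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # convention: 0 * log(0 / q) = 0
        if q_i == 0.0:
            return math.inf   # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


P = [0.5, 0.5]
Q = [0.9, 0.1]
kl_pq = kl_divergence(P, Q)
kl_qp = kl_divergence(Q, P)
kl_pp = kl_divergence(P, P)
kl_infinite = kl_divergence(P, [1.0, 0.0])
print(kl_pq, kl_qp, kl_pp, kl_infinite)
```

The two directions give different values, and the divergence of a distribution from itself is zero.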

Jensen-Shannon divergence

The KL divergence is not symmetric (i.e., in general $D_{KL}[p(x)||q(x)] \ne D_{KL}[q(x)||p(x)]$). The Jensen-Shannon divergence is defined as below so that the resulting measure is symmetric.

💡 In other words, the Jensen-Shannon divergence is a symmetrized version of the KL divergence.

D_{JS}[p(x)||q(x)] = \frac{1}{2}D_{KL}\bigg[ p(x)||\frac{p(x)+q(x)}{2}\bigg] + \frac{1}{2}D_{KL}\bigg[ q(x)||\frac{p(x)+q(x)}{2}\bigg]

It is the mean of the divergences of $p(x)$ and $q(x)$ from the average of the two distributions.
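The definition can be sketched for discrete distributions by reusing a KL helper. Because the averaged distribution is non-zero wherever either input is non-zero, the result stays finite even when the two distributions do not overlap. Function names and inputs are assumptions of this example.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D_KL[p || q] in nats (terms with p_i = 0 contribute 0)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


def js_divergence(p, q):
    """Mean of the KL divergences of p and q from their average."""
    m = [(p_i + q_i) / 2.0 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


P = [0.5, 0.5]
Q = [0.9, 0.1]
js_pq = js_divergence(P, Q)
js_qp = js_divergence(Q, P)
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])  # finite, equals log 2
print(js_pq, js_qp, js_disjoint)
```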

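Fréchet distance

The Fréchet distance $D_{Fr}$ between two distributions $p(x)$ and $q(y)$ is defined as:

D_{Fr}[p(x)||q(y)]=\sqrt{\min_{\pi(x, y)}\bigg[\int \int \pi(x,y)|x-y|^2dxdy \bigg]}

Here $y$ is simply the variable over which the second distribution is defined, and $\pi(x,y)$ ranges over the joint distributions whose marginals are $p(x)$ and $q(y)$.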

Distances between normal distributions

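We often need to compute the distance between two multivariate normal distributions with means $\bold{\mu}_1, \bold{\mu}_2$ and covariances $\bold{\Sigma}_1, \bold{\Sigma}_2$.

The KL divergence can be computed in closed form as:

D_{KL}\big[Norm[\bold{\mu}_1,\bold{\Sigma}_1]||Norm[\bold{\mu}_2,\bold{\Sigma}_2]\big]=\frac{1}{2}\bigg(\log\bigg[\frac{|\bold{\Sigma}_2|}{|\bold{\Sigma}_1|}\bigg]-D+tr\big[\bold{\Sigma}_2^{-1}\bold{\Sigma}_1\big]+(\bold{\mu}_2-\bold{\mu}_1)^T\bold{\Sigma}_2^{-1}(\bold{\mu}_2-\bold{\mu}_1)\bigg)

The Fréchet and Wasserstein distances are both given by:

D^2_{Fr/W}\big[Norm[\bold{\mu}_1,\bold{\Sigma}_1]||Norm[\bold{\mu}_2,\bold{\Sigma}_2]\big]=|\bold{\mu}_1-\bold{\mu}_2|^2+tr\big[\bold{\Sigma}_1+\bold{\Sigma}_2-2(\bold{\Sigma}_1\bold{\Sigma}_2)^{1/2}\big]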
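In the univariate case ($D=1$) the standard closed forms reduce to $D_{KL}=\log[\sigma_2/\sigma_1]+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$ and $D^2_{Fr}=(\mu_1-\mu_2)^2+(\sigma_1-\sigma_2)^2$. The sketch below implements these scalar special cases and cross-checks the KL value against a direct numerical integration of the definition; the function names and test values are assumptions of this example.

```python
import math


def kl_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form KL divergence between two univariate normal distributions."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
        - 0.5
    )


def frechet_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form Frechet (2-Wasserstein) distance between two univariate normals."""
    return math.sqrt((mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2)


def log_norm_pdf(x, mu, sigma):
    """Log-density of a univariate normal, used to avoid underflow in the tails."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def kl_numeric(mu1, sigma1, mu2, sigma2, a=-20.0, b=20.0, n=40_000):
    """Midpoint-rule approximation of the integral of p * log(p / q)."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width
        log_p = log_norm_pdf(x, mu1, sigma1)
        log_q = log_norm_pdf(x, mu2, sigma2)
        total += math.exp(log_p) * (log_p - log_q)
    return total * width


closed_form = kl_normal(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
frechet = frechet_normal(0.0, 1.0, 1.0, 2.0)
print(closed_form, numeric, frechet)
```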
