This post is a set of notes written while studying Simon J.D. Prince's Deep Learning textbook.

Notation

Scalars, Vectors, Matrices and Tensors

Scalars are written as lowercase or uppercase letters, $a, A, \alpha$. Column vectors are written in lowercase bold, $\bold{a}, \bold{\phi}$, and row vectors as their transposes, $\bold{a}^T, \bold{\phi}^T$. Matrices and tensors are written in uppercase bold, $\bold{B}, \bold{\Phi}$.

Variables and parameters, Sets

Omitted. Nothing new here; this is familiar material.

Functions

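A function is written as its name followed by square brackets, <function name>[], for example log[x]. If a function returns a vector, its name starts with a lowercase bold letter; if it returns a matrix or tensor, its name starts with an uppercase bold letter, as in $\bold{y} = \bold{mlp}[\bold{x},\bold{\phi}]$ and $\bold{Y}=\bold{Sa}[\bold{X}, \bold{\phi}]$.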

Minimizing and Maximizing

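$\min_{x}[f[x]]$ returns the minimum value of $f[x]$ over all values of the variable $x$. $\max_{x}[f[x]]$ returns the maximum instead.

$\operatorname{argmin}_{x}[f[x]]$ returns the value of $x$ that minimizes $f[x]$. So if $y = \operatorname{argmin}_{x}[f[x]]$, then $\min_{x}[f[x]] = f[y]$.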
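As a small illustration of the difference between min and argmin, the sketch below searches a finite grid of candidate values. The function `f` and the grid are assumptions made up for this example; the notation itself applies equally to continuous $x$.

```python
def f(x):
    # Hypothetical example function whose minimum is at x = 2.
    return (x - 2.0) ** 2 + 1.0


# Assumed search grid: -5.0, -4.5, ..., 5.0.
candidates = [i * 0.5 for i in range(-10, 11)]

# min_x[f[x]]: the smallest value that f takes on the grid.
min_value = min(f(x) for x in candidates)

# argmin_x[f[x]]: the grid point at which that smallest value is attained.
arg_min = min(candidates, key=f)

# If y = argmin_x[f[x]], then min_x[f[x]] = f[y].
assert f(arg_min) == min_value
```

Note that `min` returns a function value while `arg_min` returns an input, which is exactly the distinction between the two notations.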

Probability Distributions, Asymptotic notation

Omitted.

Mathematics

Functions

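A function is a mapping from a set $\mathcal{X}$ to a set $\mathcal{Y}$.

An injection maps every element of $\mathcal{X}$ to a distinct element of $\mathcal{Y}$, so no two inputs share an output. Some elements of $\mathcal{Y}$ may not be mapped to at all.

A surjection is the converse case: every element of $\mathcal{Y}$ has at least one element of $\mathcal{X}$ that maps to it. Several elements of $\mathcal{X}$ may map to the same element of $\mathcal{Y}$.

A bijection (or bijective mapping), that is, a one-to-one correspondence, is a mapping that is both injective and surjective.

A diffeomorphism is a special case of a bijection in which both the forward and the reverse mapping are differentiable.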
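For finite sets, these definitions can be checked directly. The sketch below represents a mapping as a Python dict; the helper names and the example mappings are hypothetical and only serve to illustrate the definitions above.

```python
def is_injective(mapping):
    # Injective: no two inputs share the same output.
    outputs = list(mapping.values())
    return len(outputs) == len(set(outputs))


def is_surjective(mapping, codomain):
    # Surjective: every element of the codomain is the image of some input.
    return set(codomain) <= set(mapping.values())


def is_bijective(mapping, codomain):
    # Bijective: injective and surjective at the same time.
    return is_injective(mapping) and is_surjective(mapping, codomain)


# Injective but not surjective: "c" is never reached.
only_injective = {1: "a", 2: "b"}
# Surjective but not injective: 1 and 2 both map to "a".
only_surjective = {1: "a", 2: "a", 3: "b"}
# Both: a one-to-one correspondence.
both = {1: "a", 2: "b"}
```

For example, `is_bijective(both, ["a", "b"])` holds, while `only_injective` fails the surjectivity check against the codomain `["a", "b", "c"]`.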

Lipschitz Constant

Def: A function $f[z]$ is Lipschitz continuous if, for all $z_1, z_2$, it satisfies the following inequality.

||f[z_1]-f[z_2]|| \le \beta||z_1-z_2||

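Here $\beta$ is called the Lipschitz constant. It bounds the maximum gradient of the function (how fast the output can change) with respect to the distance metric. If the Lipschitz constant is less than 1, the function is a contraction mapping, and Banach’s theorem can be used to find the inverse for any point. Composing two functions with Lipschitz constants $\beta_1$ and $\beta_2$ gives a function whose Lipschitz constant is at most $\beta_1\beta_2$, and adding them gives a function whose Lipschitz constant is at most $\beta_1+\beta_2$.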

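With respect to the Euclidean distance, the Lipschitz constant of a linear transformation $\bold{f}[\bold{z}]=\bold{Az}+\bold{b}$ is the largest singular value of the matrix $\bold{A}$, which equals the largest absolute eigenvalue when $\bold{A}$ is symmetric.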
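The Lipschitz inequality can be checked numerically for a small linear map. The sketch below assumes a hypothetical symmetric 2x2 matrix `A` with eigenvalues 3 and 1 and an arbitrary offset `b`, and verifies on random pairs of points that the output distance never exceeds `beta = 3` times the input distance. Sampling only gives evidence, not a proof.

```python
import math
import random


def apply_linear(A, b, z):
    # f[z] = A z + b for a 2x2 matrix A and 2-vectors b and z.
    return [A[0][0] * z[0] + A[0][1] * z[1] + b[0],
            A[1][0] * z[0] + A[1][1] * z[1] + b[1]]


def l2_distance(u, v):
    # Euclidean distance between two vectors.
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


A = [[2.0, 1.0], [1.0, 2.0]]  # assumed example: symmetric, eigenvalues 3 and 1
b = [0.5, -1.0]               # assumed offset; it cancels in f[z1] - f[z2]
beta = 3.0                    # largest eigenvalue of this symmetric A

random.seed(0)
max_ratio = 0.0
for _ in range(10000):
    z1 = [random.uniform(-1, 1), random.uniform(-1, 1)]
    z2 = [random.uniform(-1, 1), random.uniform(-1, 1)]
    d_in = l2_distance(z1, z2)
    if d_in == 0.0:
        continue
    d_out = l2_distance(apply_linear(A, b, z1), apply_linear(A, b, z2))
    # Largest observed value of ||f[z1] - f[z2]|| / ||z1 - z2||.
    max_ratio = max(max_ratio, d_out / d_in)
```

The largest observed ratio approaches `beta` from below, since the bound is attained only when `z1 - z2` points along the eigenvector with eigenvalue 3.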

Convexity

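A function is convex if a straight line can be drawn between any two points on the function and that line always lies on or above the function. A function is concave if such a line always lies on or below the function. By definition, a convex function has at most one minimum value and a concave function has at most one maximum value.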

For any convex function, gradient descent (with a suitably small step size) is guaranteed to find the global minimum.
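A minimal sketch of this behaviour, assuming the hypothetical convex function $f[x] = (x-3)^2 + 1$ and a fixed step size of 0.1 (both chosen only for illustration): gradient descent reaches the same minimum from very different starting points.

```python
def gradient_descent(grad, x0, step_size=0.1, num_steps=200):
    # Repeatedly step against the gradient, starting from x0.
    x = x0
    for _ in range(num_steps):
        x = x - step_size * grad(x)
    return x


def grad_f(x):
    # Derivative of the assumed convex function f[x] = (x - 3)^2 + 1.
    return 2.0 * (x - 3.0)


# Two very different starting points end up at the same global minimum x = 3.
x_from_left = gradient_descent(grad_f, x0=-10.0)
x_from_right = gradient_descent(grad_f, x0=25.0)
```

With this step size each iteration shrinks the distance to the minimum by a factor of 0.8, so 200 steps are far more than needed here.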

Special Functions

The exponential function $e^x$ is a mapping $\Bbb{R} \rightarrow \Bbb{R}^+$, and the logarithm function $\log[x]$ is a mapping $\Bbb{R}^+ \rightarrow \Bbb{R}$.

The gamma function extends the factorial function to continuous values. For positive integers $x$, $\Gamma[x] = (x-1)!$. It is defined as follows.

\Gamma[x] = \int_0^{\infin}t^{x-1}e^{-t}dt.
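The relation between the integral and the factorial can be checked numerically. The sketch below compares Python's `math.gamma` with a simple midpoint-rule approximation of the integral; the truncation of the upper limit to 60 and the number of steps are assumptions that are adequate for small $x$ only.

```python
import math


def gamma_by_integration(x, upper=60.0, num_steps=60000):
    # Midpoint-rule approximation of the integral of t^(x-1) * e^(-t)
    # over [0, upper]; the tail beyond `upper` is assumed negligible.
    h = upper / num_steps
    total = 0.0
    for i in range(num_steps):
        t = (i + 0.5) * h
        total += t ** (x - 1) * math.exp(-t)
    return total * h


gamma_5_library = math.gamma(5)            # library value of Gamma[5]
gamma_5_numeric = gamma_by_integration(5)  # numerical value of the integral
factorial_4 = math.factorial(4)            # (5 - 1)! = 24
```

All three quantities agree up to numerical error, which matches $\Gamma[5] = 4!$.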

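The Dirac delta function $\delta[\bold{z}]$ has a total area of $1$, all of which is located at the point $\bold{z}=\bold{0}$. A dataset with $N$ elements can be viewed as a sum of $N$ delta functions, one centered at each data point and each scaled by $1/N$. Delta functions are usually drawn as arrows and have the following property.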

\int f[\bold{x}]\delta[\bold{x-x_0}]d\bold{x} = f[\bold{x_0}]

Stirling’s formula

Stirling’s formula approximates the factorial function as follows.

x! \approx \sqrt{2\pi x}\left(\frac{x}{e}\right)^x
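A quick numerical check of the approximation, using the form $\sqrt{2\pi x}\,(x/e)^x$. The helper names are hypothetical; the point is only that the relative error shrinks as $x$ grows.

```python
import math


def stirling(x):
    # Stirling's approximation: sqrt(2 * pi * x) * (x / e)^x.
    return math.sqrt(2.0 * math.pi * x) * (x / math.e) ** x


def relative_error(x):
    # Relative difference between the approximation and the exact factorial.
    exact = math.factorial(x)
    return abs(stirling(x) - exact) / exact


error_at_10 = relative_error(10)
error_at_50 = relative_error(50)
```

The relative error is already below one percent at $x = 10$ and is smaller again at $x = 50$.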

Binomial coefficients

A binomial coefficient is written as $\dbinom{n}{k}$ and read as “n choose k”. The formula is omitted here.

Autocorrelation

The autocorrelation $r[\tau]$ of a continuous function $f[t]$ is defined as follows.

r[\tau]=\int_{-\infin}^{\infin}f[t+\tau]f[t]dt

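Here $\tau$ is the time lag (or offset). The autocorrelation is sometimes normalized by $r[0]$ so that its value at lag zero is 1. It measures the correlation between a function and a copy of itself shifted by the offset $\tau$. If a function changes slowly and predictably, the autocorrelation decreases slowly as $\tau$ grows; if it changes quickly and unpredictably, the autocorrelation falls quickly toward 0 as $\tau$ grows.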
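The behaviour described above can be seen on sampled signals. The sketch below uses a discrete sum over the overlapping samples as a stand-in for the integral and divides by the lag-zero value; the two test signals (a slow sine wave and seeded uniform noise) are assumptions chosen for illustration.

```python
import math
import random


def autocorrelation(signal, lag):
    # Discrete analogue of r[tau]: sum of f[t + tau] * f[t] over the overlap.
    return sum(signal[t + lag] * signal[t] for t in range(len(signal) - lag))


def normalized_autocorrelation(signal, lag):
    # Divide by r[0] so that the value at lag zero is exactly 1.
    return autocorrelation(signal, lag) / autocorrelation(signal, 0)


# Slowly varying, predictable signal: a sine wave with a period of 100 samples.
slow = [math.sin(2.0 * math.pi * t / 100.0) for t in range(200)]

# Quickly varying, unpredictable signal: independent uniform noise.
random.seed(0)
fast = [random.uniform(-1.0, 1.0) for _ in range(200)]

slow_at_lag_5 = normalized_autocorrelation(slow, 5)
fast_at_lag_5 = normalized_autocorrelation(fast, 5)
```

At a lag of 5 samples the sine wave is still strongly correlated with itself, while the noise signal is already close to uncorrelated.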

Vector, Matrices, Tensors and Transpose

Omitted.

Vector and matrix norms

For a vector $\bold{z}$, the $\ell_p$ norm is defined as follows.

||\bold{z}||_p = \bigg( \displaystyle\sum_{d=1}^D|z_d|^p \bigg)^{1/p}

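When $p=2$, this is the familiar Euclidean norm ($\ell_2$ norm), and the subscript $p$ is usually omitted in this case. When $p=\infin$, the norm returns the maximum absolute value among the elements of the vector.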
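A direct transcription of the definition, as a sketch: the function name `lp_norm` and the example vector are assumptions, and $p=\infty$ is handled as a separate case because the power formula cannot be evaluated there.

```python
def lp_norm(z, p):
    # l_p norm of a vector given as a list of numbers.
    if p == float("inf"):
        # Limit p -> infinity: the largest absolute element.
        return max(abs(z_d) for z_d in z)
    return sum(abs(z_d) ** p for z_d in z) ** (1.0 / p)


z = [3.0, -4.0]
l1 = lp_norm(z, 1)               # |3| + |-4|
l2 = lp_norm(z, 2)               # Euclidean length
l_inf = lp_norm(z, float("inf")) # largest absolute element
```

For this vector the three norms are 7, 5, and 4 respectively, which shows how the norm shrinks toward the largest absolute element as $p$ increases.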

Norms can be computed in a similar way for matrices. For example, the $\ell_2$ norm of a matrix $\bold{Z}$ (also known as the Frobenius norm) is computed as follows.

||\bold{Z}||_F=\bigg( \displaystyle\sum^{I}_{i=1}\displaystyle\sum^{J}_{j=1}|z_{ij}|^2 \bigg)^{1/2}
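The Frobenius norm is the Euclidean norm of the matrix entries treated as one long vector. A minimal sketch, with a hypothetical function name and example matrix:

```python
import math


def frobenius_norm(Z):
    # Square root of the sum of squared absolute values of all entries.
    return math.sqrt(sum(abs(z_ij) ** 2 for row in Z for z_ij in row))


Z = [[1.0, 2.0],
     [3.0, 4.0]]
norm_Z = frobenius_norm(Z)  # sqrt(1 + 4 + 9 + 16) = sqrt(30)
```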

Product of matrices

C_{ij}=\displaystyle\sum_{d=1}^{D_2}A_{id}B_{dj}

where $\bold{A} \in \R^{D_1 \times D_2}$, $\bold{B} \in \R^{D_2 \times D_3}$, and $\bold{C} = \bold{AB} \in \R^{D_1 \times D_3}$.
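The summation can be written out as nested loops over the indices $i$, $j$, and $d$. The sketch below uses plain lists of lists; the function name `matmul` and the example matrices are assumptions for illustration.

```python
def matmul(A, B):
    # C[i][j] = sum over d of A[i][d] * B[d][j].
    D1, D2, D3 = len(A), len(B), len(B[0])
    # The inner dimensions must agree: A is D1 x D2 and B is D2 x D3.
    assert all(len(row) == D2 for row in A)
    return [[sum(A[i][d] * B[d][j] for d in range(D2)) for j in range(D3)]
            for i in range(D1)]


A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]       # 3 x 2
C = matmul(A, B)     # 2 x 2
```

Here `C` has shape $D_1 \times D_3 = 2 \times 2$, with the shared dimension $D_2 = 3$ summed out.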

Dot product of vectors

\bold{a}^T\bold{b} = \bold{b}^T\bold{a}=\displaystyle\sum_{d=1}^{D}a_db_d.

\bold{a}^T\bold{b} = ||\bold{a}|| \, ||\bold{b}|| \cos[\theta]

where $\theta$ is the angle between $\bold{a}$ and $\bold{b}$.
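Both forms of the dot product can be combined to recover the angle between two vectors. A small sketch with hypothetical helper names and example vectors:

```python
import math


def dot(a, b):
    # Sum of elementwise products a_d * b_d.
    return sum(a_d * b_d for a_d, b_d in zip(a, b))


def angle_between(a, b):
    # Solve a^T b = ||a|| ||b|| cos[theta] for theta (in radians).
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return math.acos(dot(a, b) / (norm_a * norm_b))


a = [1.0, 0.0]
b = [1.0, 1.0]
theta = angle_between(a, b)  # 45 degrees, i.e. pi / 4 radians
```

Orthogonal vectors such as `[1.0, 0.0]` and `[0.0, 1.0]` have a dot product of zero, which corresponds to $\cos[\theta] = 0$.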

Inverse

Omitted.

Subspace

Consider a matrix $\bold{A} \in \R^{D_1 \times D_2}$. The product $\bold{Ax} \in \R^{D_1}$ cannot always reach every point in the $D_1$-dimensional space. For example, if $\bold{A} = \begin{bmatrix} 1 & 1\\ 1 & 1\end{bmatrix}$ and $\bold{x} = \begin{bmatrix} a\\ b\end{bmatrix}$, then $\bold{Ax} = \begin{bmatrix}a+b\\a+b\end{bmatrix}$, so only points on the line $y=x$ can be reached. For instance, $\begin{bmatrix}3\\3\end{bmatrix}$ can be produced, but $\begin{bmatrix}3\\4\end{bmatrix}$ cannot. In other words, $\bold{Ax}$ cannot reach all of $\R^2$.

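The space that $\bold{Ax}$ can reach is a subspace, and specifically it is called the column space of $\bold{A}$. ($\bold{x}^T\bold{A}$ spans the row space, and the set of $\bold{x}$ satisfying $\bold{Ax}=\bold{0}$ is the null space.)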
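The example matrix can be explored numerically. The sketch below multiplies $\bold{A}$ by a grid of integer inputs (the grid is an assumption for illustration) and confirms that every output has two equal components, so it lies on the line $y=x$.

```python
def matvec(A, x):
    # Matrix-vector product A x for a list-of-lists matrix.
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


A = [[1.0, 1.0],
     [1.0, 1.0]]

# Outputs A x for every x = [a, b] with a, b in {-3, ..., 3}.
outputs = [matvec(A, [a, b]) for a in range(-3, 4) for b in range(-3, 4)]

# Every reachable point has equal components (column space = line y = x).
all_on_line = all(y[0] == y[1] for y in outputs)

reachable = matvec(A, [1, 2])      # [3.0, 3.0] lies in the column space
in_null_space = matvec(A, [1, -1]) # [0.0, 0.0], so [1, -1] is in the null space
```

No input in the grid produces `[3.0, 4.0]`, consistent with that point lying outside the column space.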
