
2. Optimization Algorithms

by 사향낭 2022. 7. 14.

Mini-batch Gradient Descent

 

 

Batch Gradient Descent - train on all m training examples at once, so each pass over the training set gives a single gradient step

 

Mini-batch Gradient Descent - split the m examples into t mini-batches of size m / t and take one gradient step per mini-batch, so t steps per epoch

 

 

Understanding Mini-batch Gradient Descent

 

 

Batch Gradient Descent - mini-batch size is m

 

Stochastic Gradient Descent - mini-batch size is 1

 

In practice, the mini-batch size is chosen somewhere between 1 and m.
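A minimal NumPy sketch of one epoch of mini-batch gradient descent, assuming the course's column-wise data layout (X is (n_x, m), Y is (1, m)); `grad_fn` and `update_fn` are hypothetical placeholders for backprop and the parameter update:

```python
import numpy as np

def minibatch_epoch(X, Y, params, grad_fn, update_fn, batch_size=64, seed=0):
    """Run one epoch of mini-batch gradient descent.

    grad_fn(params, X_batch, Y_batch) -> grads and update_fn(params, grads) -> params
    are placeholders for the forward/backward pass and the parameter update rule.
    """
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)                  # shuffle the examples once per epoch
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    for start in range(0, m, batch_size):      # roughly m / batch_size steps per epoch
        X_batch = X_shuf[:, start:start + batch_size]
        Y_batch = Y_shuf[:, start:start + batch_size]
        grads = grad_fn(params, X_batch, Y_batch)
        params = update_fn(params, grads)      # one gradient step per mini-batch
    return params
```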

 

 

Exponentially Weighted Averages

 

 

\( v_t = \beta v_{t - 1} + (1 - \beta) \theta_t \)

 

\( v_t \) is approximately an average of the temperature over the previous \( \frac{1}{1 - \beta} \) days.
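As a rough illustration (not lecture code), the running average can be computed like this; the temperature values are made up:

```python
import numpy as np

def ewa(theta, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    averages = []
    for x in theta:
        v = beta * v + (1 - beta) * x
        averages.append(v)
    return np.array(averages)

temps = [4.0, 9.0, 6.0, 5.0, 3.0, 8.0, 10.0, 12.0]   # made-up daily temperatures
print(ewa(temps, beta=0.9))   # behaves roughly like a 1 / (1 - 0.9) = 10-day average
```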

 

 

Understanding Exponentially Weighted Averages

 

 

With \( \epsilon = 1 - \beta \), \( (1 - \epsilon)^{\frac{1}{\epsilon}} \approx \frac{1}{e} \), which is why \( v_t \) acts like an average over roughly the last \( \frac{1}{1 - \beta} \) values: older terms have decayed by more than a factor of \( e \).
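For example, with \( \beta = 0.9 \) (so \( \epsilon = 0.1 \)): \( 0.9^{10} \approx 0.35 \approx \frac{1}{e} \), so a temperature from about 10 days ago contributes only around a third of its original weight.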

 

 

Bias Correction in Exponentially Weighted Averages

 

 

\( v_t = \beta v_{t - 1} + (1 - \beta) \theta_t \), then use the bias-corrected estimate \( \frac{v_t}{1 - \beta^t} \) instead of \( v_t \). The correction mainly matters for small \( t \), where \( v_t \) is biased toward 0 because \( v_0 = 0 \).
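A small sketch comparing the raw and bias-corrected averages on a constant sequence (my own toy example):

```python
import numpy as np

def ewa_bias_corrected(theta, beta=0.9):
    """Return (raw v_t, bias-corrected v_t / (1 - beta^t)) for each step t."""
    v = 0.0
    raw, corrected = [], []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        raw.append(v)
        corrected.append(v / (1 - beta ** t))  # undoes the bias toward 0 at small t
    return np.array(raw), np.array(corrected)

raw, corrected = ewa_bias_corrected([10.0, 10.0, 10.0, 10.0])
print(raw)        # starts far below 10 because v_0 = 0
print(corrected)  # stays at 10 from the very first step
```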

 

 

Gradient Descent with Momentum

 

 

Momentum: For each iteration t,

 

\( v_{dw} = \beta v_{dw} + (1 - \beta) dw \), \( v_{db} = \beta v_{db} + (1 - \beta) db \)

 

\( w = w - \alpha v_{dw} \), \( b = b - \alpha v_{db} \)

 

People usually don't use bias correction.
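A sketch of the momentum update above for one (w, b) pair; the gradients dw, db and the shapes are hypothetical stand-ins for whatever backprop produces:

```python
import numpy as np

def momentum_step(w, b, dw, db, v_dw, v_db, alpha=0.01, beta=0.9):
    """One gradient-descent-with-momentum step (no bias correction, as noted above)."""
    v_dw = beta * v_dw + (1 - beta) * dw
    v_db = beta * v_db + (1 - beta) * db
    w = w - alpha * v_dw
    b = b - alpha * v_db
    return w, b, v_dw, v_db

# Hypothetical usage: v_dw and v_db start at zero with the same shapes as w and b.
w, b = np.zeros((3, 2)), np.zeros((3, 1))
v_dw, v_db = np.zeros_like(w), np.zeros_like(b)
dw, db = np.ones_like(w), np.ones_like(b)        # stand-in gradients
w, b, v_dw, v_db = momentum_step(w, b, dw, db, v_dw, v_db)
```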

 

 

RMSprop (Root Mean Square Propagation)

 

 

On iteration t:

 

Compute dw, db on current mini-batch

 

\( S_{dw} = \beta S_{dw} + (1 - \beta) \, dw^2 \) (element-wise square) ← relatively small

 

\( S_{db} = \beta S_{db} + (1 - \beta) \, db^2 \) ← relatively large

 

\( w := w - \alpha \frac{dw}{\sqrt{S_{dw}} + \epsilon} \), \( b := b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon} \), where a small \( \epsilon \) keeps the division numerically stable. Dividing by \( \sqrt{S} \) damps the updates in the direction with large, oscillating gradients (b in the lecture's example) while leaving the small-gradient direction (w) mostly unchanged.
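A sketch of one RMSprop step; the alpha and beta defaults here are typical values I'm assuming rather than numbers from the slide:

```python
import numpy as np

def rmsprop_step(w, b, dw, db, s_dw, s_db, alpha=0.001, beta=0.999, eps=1e-8):
    """One RMSprop step; eps keeps the division numerically stable."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2    # element-wise square
    s_db = beta * s_db + (1 - beta) * db ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return w, b, s_dw, s_db
```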

 

 

Adam Optimization Algorithm

 

 

Momentum + RMSprop

 

\( \alpha \): needs to be tuned

 

\( \beta_1 \): 0.9

 

\( \beta_2 \): 0.999

 

\( \epsilon \): \( 10^{-8} \)

 

Adam: Adaptive moment estimation
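A sketch of one Adam step for a single parameter tensor, combining the momentum-style and RMSprop-style moving averages with bias correction; the hyperparameter defaults match the values listed above:

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for parameter tensor w (apply the same update to b); t starts at 1."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw          # momentum-style first moment
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2     # RMSprop-style second moment
    v_corr = v_dw / (1 - beta1 ** t)                # bias correction
    s_corr = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v_dw, s_dw
```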

 

 

Learning Rate Decay

 

 

\( \alpha = \frac{1}{1 + \text{decay\_rate} \cdot \text{epoch\_num}} \, \alpha_0 \)

 

\( \alpha = 0.95^{\text{epoch\_num}} \, \alpha_0 \)

 

\( \alpha = \frac{k}{\sqrt{\text{epoch\_num}}} \, \alpha_0 \)
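A small helper sketching the three decay schedules above (the scheme names are my own labels; epoch_num starts at 1 for the square-root schedule):

```python
def decayed_lr(alpha0, epoch_num, decay_rate=1.0, k=1.0, scheme="inverse"):
    """Learning-rate decay schedules from the notes."""
    if scheme == "inverse":        # alpha = alpha0 / (1 + decay_rate * epoch_num)
        return alpha0 / (1 + decay_rate * epoch_num)
    if scheme == "exponential":    # alpha = 0.95 ** epoch_num * alpha0
        return 0.95 ** epoch_num * alpha0
    if scheme == "sqrt":           # alpha = k / sqrt(epoch_num) * alpha0
        return k / epoch_num ** 0.5 * alpha0
    raise ValueError(f"unknown scheme: {scheme}")
```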
