Transformers
Transformer Network Intuition
Intuition: the Transformer combines attention-based representations with CNN-style parallel processing of the whole sequence. Its core building blocks are Self-Attention and Multi-Head Attention.
Self-Attention
\( A(q, K, V) \) = attention-based vector representation of a word
= \( \sum_i \frac{\exp(q \cdot k^{\langle i \rangle})}{\sum_j \exp(q \cdot k^{\langle j \rangle})} v^{\langle i \rangle} \)
= \( \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \) (vectorized form, with \( Q, K, V \) stacking all queries, keys, and values)
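As a minimal NumPy sketch of the vectorized formula (the function name and shapes are my own choices, not from the lecture):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    Returns (n, d_v) attention-based representations.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n, m) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                          # weighted sum of values
```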
Multi-Head Attention
\( \text{head}_i = \text{Attention}(W_i^Q Q, W_i^K K, W_i^V V) \) for \( i = 1, \ldots, h \) (\( h \): number of heads)
\( \text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) W_o \)
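Continuing the sketch above (reusing `scaled_dot_product_attention`; the row-vector convention \( Q W_i^Q \) used here is the transpose of the lecture's \( W_i^Q Q \), and the weight-list layout is an assumption for illustration):

```python
def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """Concatenate h attention heads and project with WO.

    WQ, WK, WV: lists of h per-head projection matrices, shape (d_model, d_k) / (d_model, d_v).
    WO: (h * d_v, d_model) output projection.
    """
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO
```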
Transformer Network
A Transformer model is assembled from multi-head attention, positional encoding, residual connections, masking, and related components, as sketched below.
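The following sketch (helper names `positional_encoding`, `encoder_block`, and the parameter layout are illustrative assumptions, not the course code) shows how these pieces wire together in one encoder block; a decoder block would additionally apply a look-ahead mask inside its self-attention.

```python
def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_block(x, attn_params, ffn_params):
    """One encoder block: multi-head self-attention, then a position-wise feed-forward
    network, each wrapped in a residual connection followed by layer normalization."""
    WQ, WK, WV, WO = attn_params          # per-head projections + output projection
    W1, W2 = ffn_params                   # feed-forward weights
    x = layer_norm(x + multi_head_attention(x, x, x, WQ, WK, WV, WO))  # residual + norm
    ffn = np.maximum(0, x @ W1) @ W2                                    # ReLU feed-forward
    return layer_norm(x + ffn)                                          # residual + norm
```

Before the first block, positional encodings are added to the input embeddings, e.g. `x = embeddings + positional_encoding(seq_len, d_model)`, so that the otherwise order-agnostic attention layers see word positions.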