Prerequisites
Transformer
Positional Encoding vs Positional Embedding
How to compute similarity between words
Embedding (representation learning): mapping a discrete type to a point in a vector space (a dense, real-valued vector). When the discrete types are words, the dense vector representation is called a “word embedding”.
- tokenization: splitting raw text into discrete tokens (words, subwords, or characters)
- encoding: mapping each discrete token to an integer ID
- embedding: mapping each integer ID to a real-valued point in a dense vector space (a minimal sketch of this pipeline follows the list below)
- Distributional representation (Count-based representation, Distributional semantics): the meaning of a word depends on the words that surround it (its context), and words that appear in similar contexts tend to be related to each other. These vectors are not learned by a predictive model but are constructed heuristically from corpus counts.
- e.g.: TF-IDF, BOW, Co-occurrence matrix (SVD, LSA, HAL, HPCA)
- curse of dimensionality: count vectors are high-dimensional and sparse, so dimensionality reduction (SVD, PCA) is applied; see the co-occurrence sketch after this list
- 🆚 Denotation: The denotation of a word is the dictionary meaning, or its precise, literal meaning, devoid of emotion, attitude and colour.
- Distributed representation (predictive, word embeddings): dense, real-valued vectors in which the meaning and other properties of a word are distributed across the different dimensions of the vector. This captures similarity between related words. These representations are learning-based (prediction-based).
- e.g.: Word2Vec (CBOW, Skip-gram), FastText, GloVe are distributed representations that scale to large vocabularies (Static Word Embedding)
- 🆚 Local representation (Symbol-based representation, Discrete symbolic representation): one-to-one correspondence between concepts and representation elements; fixed vocabulary size, no built-in notion of similarity between them. e.g.: one-hot encoding (contrasted with dense embeddings in the first sketch below)
- A word representation such as Word2Vec belongs to both categories: it is distributional (trained from co-occurrence contexts) and distributed (a dense, learned vector).
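To make the tokenization, encoding, and embedding steps above concrete, here is a minimal sketch in plain NumPy. The toy corpus, the vocabulary, and the randomly initialized embedding table are illustrative assumptions; in practice the embedding matrix is learned (e.g., by Word2Vec or as an embedding layer of a Transformer). The cosine function also shows the usual way of computing similarity between words, and the one-hot helper illustrates why local representations carry no notion of similarity.

```python
import numpy as np

# Toy corpus (illustrative assumption).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

# 1) Tokenization: split raw text into discrete tokens.
tokenized = [sentence.split() for sentence in corpus]

# 2) Encoding: map each discrete token to an integer ID.
vocab = {tok: i for i, tok in enumerate(sorted({t for s in tokenized for t in s}))}
encoded = [[vocab[tok] for tok in sentence] for sentence in tokenized]

# Local (one-hot) representation: |V|-dimensional, sparse, no built-in similarity.
def one_hot(token_id: int, vocab_size: int) -> np.ndarray:
    vec = np.zeros(vocab_size)
    vec[token_id] = 1.0
    return vec

# 3) Embedding: map each integer ID to a dense real-valued vector.
#    Randomly initialized here; a real model would learn these weights.
rng = np.random.default_rng(0)
embedding_dim = 8
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(token_id: int) -> np.ndarray:
    return embedding_table[token_id]

# Similarity between words: cosine similarity of their vectors.
def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

cat_id, dog_id = vocab["cat"], vocab["dog"]
print(cosine(one_hot(cat_id, len(vocab)), one_hot(dog_id, len(vocab))))  # always 0.0
print(cosine(embed(cat_id), embed(dog_id)))  # meaningful only once the table is trained
```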
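The count-based (distributional) route can be sketched in the same spirit: build a word-by-word co-occurrence matrix from a corpus and reduce its dimensionality with a truncated SVD, as in LSA. The window size and the toy corpus below are assumptions for illustration only.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokenized = [s.split() for s in corpus]
vocab = sorted({t for s in tokenized for t in s})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a symmetric context window (window size is an assumption).
window = 2
cooc = np.zeros((len(vocab), len(vocab)))
for sent in tokenized:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                cooc[index[w], index[sent[j]]] += 1.0

# Curse of dimensionality: the raw matrix is |V| x |V| and sparse.
# Truncated SVD (as in LSA) keeps only the top-k singular directions.
k = 3
U, S, Vt = np.linalg.svd(cooc)
word_vectors = U[:, :k] * S[:k]  # k-dimensional count-based word vectors

print(word_vectors[index["cat"]])
print(word_vectors[index["dog"]])
```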
Distributional contextual representation: captures both the order of words (syntactic/contextual structure) and their distributional positioning (semantic meaning), i.e. the syntactic (grammar) and semantic (meaning) aspects of language, typically learned through a Cloze task. (In contrast, Word2Vec and GloVe learn semantic representations solely from the distributional properties of large amounts of text.)
- Word2Vec: one dense vector per word type (a single, context-independent meaning)
- ELMo, GPT, BERT (contextual embedding): a different dense vector for each occurrence of a word, depending on its context (multiple meanings); see the sketch after this list
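One way to see the static vs. contextual difference is to compare the hidden states a pretrained BERT assigns to the same word in two different sentences. This is a minimal sketch assuming the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint (downloaded on first use); a static Word2Vec-style table would return one and the same vector for both occurrences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def contextual_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the last-layer hidden state of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_size)
    position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return hidden[position]

v_river = contextual_vector("He sat on the bank of the river.", "bank")
v_money = contextual_vector("She deposited money at the bank.", "bank")

# A static embedding gives "bank" one fixed vector regardless of context;
# a contextual model gives each occurrence its own vector, so the similarity is below 1.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```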
If the model can already see which word it is supposed to predict, it has no reason to learn contextual information. Conversely, if the target word is hidden (masked) during training, the model can only succeed by inferring it from the surrounding context; only by predicting such hidden words does the model acquire the ability to express the features of sentences.
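This is exactly the Cloze / masked language modeling setup: the target word is replaced by a mask token, so the model can only recover it from the surrounding context. A minimal sketch with the Hugging Face fill-mask pipeline (the model choice and the example sentence are illustrative):

```python
from transformers import pipeline

# Masked language modeling (Cloze task): [MASK] hides the word to be predicted,
# forcing the model to rely on contextual information alone.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The cat sat on the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```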