Image Via Unsplash
01/10
Tokenization splits a larger string into smaller pieces called tokens. A token can be a sentence, a word, a part of a word, or a punctuation mark.
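A minimal sketch of word-level tokenization using Python's standard `re` module (real tokenizers such as NLTK's or spaCy's handle contractions, URLs, and many more cases):

```python
import re

def tokenize(text):
    # Match runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
```

The pattern keeps punctuation as separate tokens instead of discarding it.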
02/10
Normalization is the process in NLP of putting all text into a consistent form, for example converting all text to the same case or converting numbers to words.
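A minimal normalization sketch covering the two examples above, lowercasing and single-digit number-to-word conversion (the `DIGIT_WORDS` mapping is an illustrative stand-in for a fuller number-to-word converter):

```python
import re

# Hypothetical mapping for standalone single digits.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six", "7": "seven",
               "8": "eight", "9": "nine"}

def normalize(text):
    text = text.lower()  # put all text in the same case
    # Replace standalone digits with their word form.
    return re.sub(r"\b\d\b", lambda m: DIGIT_WORDS[m.group()], text)

print(normalize("I saw 2 Cats"))
```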
03/10
Stemming reduces a word to its stem by removing affixes. For example, the stem of eating, eats, and eaten is eat.
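A naive suffix-stripping stemmer as a sketch of the idea; real stemmers such as the Porter stemmer apply ordered rewrite rules rather than plain truncation:

```python
def stem(word):
    # Strip a common suffix, but only if a reasonable stem remains.
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ("eating", "eats", "eaten")])
```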
04/10
Lemmatization is similar to stemming, but it returns a dictionary word (the lemma) instead of a stem. For example, if we pass studies to a stemmer it gives studi, while a lemmatizer gives study.
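A lemmatizer needs a vocabulary to look words up in. The tiny table below is a hypothetical stand-in for the dictionary a real lemmatizer (e.g. NLTK's `WordNetLemmatizer`) consults:

```python
# Hypothetical lemma dictionary; real ones cover the whole vocabulary.
LEMMAS = {"studies": "study", "eaten": "eat", "better": "good"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return LEMMAS.get(word, word)

print(lemmatize("studies"))
```

Note the contrast with the stemmer: lemmatization maps studies to the real word study, not the truncated stem studi.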
05/10
A corpus is a collection of text in NLP. A corpus can be in one language or in multiple languages.
06/10
In NLP, each unit of text (such as a sentence) is called a document. When multiple documents are merged together, the result is called a corpus.
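The document/corpus relationship can be sketched as a simple list, where each string is one document and the list as a whole is the corpus (the sentences are illustrative):

```python
# Each string is a document; the list of documents is the corpus.
corpus = [
    "Tokenization splits text into tokens.",
    "Stop words carry little meaning.",
    "A corpus is a collection of documents.",
]

print(len(corpus))  # number of documents in the corpus
```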
07/10
Words that do not contribute to the understanding of the content are called stop words. For example, 'a', 'and', 'the', etc. are stop words in the English language.
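A sketch of stop-word removal using a small illustrative set (real stop-word lists, such as NLTK's, contain a few hundred entries):

```python
# Small illustrative stop-word set.
STOP_WORDS = {"a", "an", "and", "the", "is", "of"}

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "and", "the", "dog"]))
```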
08/10
Bag of words is a representation model used to simplify text. A bag of words counts the occurrences of each word in the text while disregarding word order.
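A bag of words is naturally expressed with the standard library's `collections.Counter`:

```python
from collections import Counter

def bag_of_words(tokens):
    # Map each word to its occurrence count; order is discarded.
    return Counter(tokens)

print(bag_of_words(["the", "cat", "saw", "the", "dog"]))
```

Note that `["the", "cat"]` and `["cat", "the"]` produce the same bag, which is exactly the information the model throws away.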
09/10
N-grams are another representation model for simplifying text. This model preserves contiguous sequences of N items from the text, e.g. 2-grams (bigrams), 3-grams (trigrams), etc.
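Extracting N-grams from a token list is a one-line sliding window:

```python
def ngrams(tokens, n):
    # All contiguous sequences of n tokens.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["I", "love", "natural", "language"], 2))
```

Unlike a bag of words, bigrams and trigrams keep some local word order.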
10/10
A regular expression, or regex, is itself a special text string that describes a search pattern to match against text.
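A small regex example with Python's `re` module; the pattern below matches standalone four-digit numbers (the sample sentence is illustrative):

```python
import re

# \b marks word boundaries, \d{4} matches exactly four digits.
pattern = r"\b\d{4}\b"
matches = re.findall(pattern, "First released in 2001, updated in 2015.")
print(matches)
```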