Image Via Unsplash
01/10
Tokenization splits a larger string into smaller pieces called tokens. A token can be a sentence, a word, a part of a word, or a punctuation mark.
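A minimal sketch of word-level tokenization using Python's standard `re` module (real tokenizers such as NLTK's or spaCy's handle contractions, URLs, and many more cases):

```python
import re

def tokenize(text):
    # Match runs of word characters, or single non-space punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
```

The pattern keeps punctuation as separate tokens instead of discarding it.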
02/10
Normalization is the process in NLP of putting all text into a consistent form, for example converting all text to the same case or converting numbers to words.
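A minimal normalization sketch covering the two examples above, lowercasing and single-digit number-to-word conversion (the `DIGIT_WORDS` mapping is an illustrative stand-in for a fuller number-to-word converter):

```python
import re

# Hypothetical mapping for standalone single digits.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six", "7": "seven",
               "8": "eight", "9": "nine"}

def normalize(text):
    text = text.lower()  # put all text in the same case
    # Replace standalone digits with their word form.
    return re.sub(r"\b\d\b", lambda m: DIGIT_WORDS[m.group()], text)

print(normalize("I saw 2 Cats"))
```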
03/10
Stemming reduces a word to its stem by removing affixes. For example, the stem of eating, eats, and eaten is eat.
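A naive suffix-stripping stemmer as a sketch of the idea; real stemmers such as the Porter stemmer apply ordered rewrite rules rather than plain truncation:

```python
def stem(word):
    # Strip a common suffix, but only if a reasonable stem remains.
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ("eating", "eats", "eaten")])
```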
04/10
Lemmatization is similar to stemming, but it returns a dictionary word (the lemma) instead of a stem. For example, if we pass studies to a stemmer it gives studi, while a lemmatizer gives study.
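A lemmatizer needs a vocabulary to look words up in. The tiny table below is a hypothetical stand-in for the dictionary a real lemmatizer (e.g. NLTK's `WordNetLemmatizer`) consults:

```python
# Hypothetical lemma dictionary; real ones cover the whole vocabulary.
LEMMAS = {"studies": "study", "eaten": "eat", "better": "good"}

def lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return LEMMAS.get(word, word)

print(lemmatize("studies"))
```

Note the contrast with the stemmer: lemmatization maps studies to the real word study, not the truncated stem studi.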
05/10
A corpus is a collection of text in NLP. A corpus can be in one language or in multiple languages.
06/10
In NLP, each unit of text (such as a sentence) is called a document. When multiple documents are merged together, the result is called a corpus.
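The document/corpus relationship can be sketched as a simple list, where each string is one document and the list as a whole is the corpus (the sentences are illustrative):

```python
# Each string is a document; the list of documents is the corpus.
corpus = [
    "Tokenization splits text into tokens.",
    "Stop words carry little meaning.",
    "A corpus is a collection of documents.",
]

print(len(corpus))  # number of documents in the corpus
```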
07/10
Words that do not contribute to the understanding of the content are called stop words. For example, 'a', 'and', 'the', etc. are stop words in the English language.
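A sketch of stop-word removal using a small illustrative set (real stop-word lists, such as NLTK's, contain a few hundred entries):

```python
# Small illustrative stop-word set.
STOP_WORDS = {"a", "an", "and", "the", "is", "of"}

def remove_stop_words(tokens):
    # Keep only tokens that are not stop words (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "and", "the", "dog"]))
```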
08/10
Bag of words is a representation model used to simplify text. A bag of words counts the occurrences of each word in the text while disregarding word order.
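A bag of words is naturally expressed with the standard library's `collections.Counter`:

```python
from collections import Counter

def bag_of_words(tokens):
    # Map each word to its occurrence count; order is discarded.
    return Counter(tokens)

print(bag_of_words(["the", "cat", "saw", "the", "dog"]))
```

Note that `["the", "cat"]` and `["cat", "the"]` produce the same bag, which is exactly the information the model throws away.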
09/10
N-grams are another representation model for simplifying text. This model preserves contiguous sequences of N items from the text, e.g. 2-grams (bigrams), 3-grams (trigrams), etc.
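Extracting N-grams from a token list is a one-line sliding window:

```python
def ngrams(tokens, n):
    # All contiguous sequences of n tokens.
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["I", "love", "natural", "language"], 2))
```

Unlike a bag of words, bigrams and trigrams keep some local word order.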
10/10
A regular expression, or regex, is itself a special text string that describes a search pattern to match against text.
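A small regex example with Python's `re` module; the pattern below matches standalone four-digit numbers (the sample sentence is illustrative):

```python
import re

# \b marks word boundaries, \d{4} matches exactly four digits.
pattern = r"\b\d{4}\b"
matches = re.findall(pattern, "First released in 2001, updated in 2015.")
print(matches)
```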