
Index


Tokenization

Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens.
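As a concrete illustration, here is a minimal sketch of a word-level tokenizer using Python's `re` module; the function name and regex pattern are illustrative choices, not a prescribed implementation (real pipelines often use library tokenizers such as NLTK or spaCy):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase the text, then pull out runs of word characters;
    # punctuation and whitespace are dropped.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Tokenization breaks text into meaningful elements.")
print(tokens)
# ['tokenization', 'breaks', 'text', 'into', 'meaningful', 'elements']
```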

Why do we need tokenization?

Tokenization is the first step in any NLP pipeline. A tokenizer breaks unstructured natural-language text into chunks of information that can be treated as discrete elements. The token occurrences in a document can then be used directly as a vector representing that document.

This immediately turns an unstructured string (a text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or serve as features in a machine learning pipeline that drive more complex decisions or behavior.
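To make the token-counts-as-a-vector idea concrete, here is a small sketch of a bag-of-words encoding; the toy documents, vocabulary construction, and helper name are invented for illustration:

```python
from collections import Counter

# Two toy documents; the vocabulary is the set of all tokens seen.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({tok for doc in docs for tok in doc.split()})

def to_vector(doc: str) -> list[int]:
    # Count how often each vocabulary token occurs in the document.
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

for doc in docs:
    print(doc, "->", to_vector(doc))
# vocab order: ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# "the cat sat on the mat" -> [1, 0, 1, 1, 1, 2]
# "the dog sat"            -> [0, 1, 0, 0, 1, 1]
```

Each document becomes a fixed-length vector of counts, which is exactly the numerical structure a downstream machine learning model can consume.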