
Index


Tokenization

Tokenization is the process of breaking a stream of textual data into words, terms, sentences, symbols, or other meaningful elements called tokens.
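As a concrete illustration, here is a minimal sketch of a word-level tokenizer using Python's `re` module; the function name and regex pattern are illustrative choices, not a prescribed implementation (real pipelines often use library tokenizers such as NLTK or spaCy):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase the text, then pull out runs of word characters;
    # punctuation and whitespace are dropped.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Tokenization breaks text into meaningful elements.")
print(tokens)
# ['tokenization', 'breaks', 'text', 'into', 'meaningful', 'elements']
```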

Why do we need tokenization?

Tokenization is the first step in any NLP pipeline. A tokenizer breaks unstructured natural-language text into chunks of information that can be treated as discrete elements. The token occurrences in a document can then be used directly as a vector representing that document.

This immediately turns an unstructured string (a text document) into a numerical data structure suitable for machine learning. Tokens can also be used directly by a computer to trigger useful actions and responses, or serve as features in a machine learning pipeline that drive more complex decisions or behavior.
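To make the token-counts-as-a-vector idea concrete, here is a small sketch of a bag-of-words encoding; the toy documents, vocabulary construction, and helper name are invented for illustration:

```python
from collections import Counter

# Two toy documents; the vocabulary is the set of all tokens seen.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({tok for doc in docs for tok in doc.split()})

def to_vector(doc: str) -> list[int]:
    # Count how often each vocabulary token occurs in the document.
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

for doc in docs:
    print(doc, "->", to_vector(doc))
# vocab order: ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# "the cat sat on the mat" -> [1, 0, 1, 1, 1, 2]
# "the dog sat"            -> [0, 1, 0, 0, 1, 1]
```

Each document becomes a fixed-length vector of counts, which is exactly the numerical structure a downstream machine learning model can consume.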