Text Preprocessing
Text preprocessing is the first step in text analysis: it cleans and standardizes raw text so that later stages work with consistent input. The common steps involved are:
- Tokenization:
  - Breaking down text into individual units, such as words or phrases, known as tokens.
  - Example: The sentence "Data science is fascinating!" is tokenized into ["Data", "science", "is", "fascinating", "!"].
- Stop Words Removal:
  - Removing common words that usually do not contribute significant meaning, such as "and", "the", "is".
  - Shrinks the vocabulary, which reduces the dimensionality of representations such as bag-of-words.
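- Stemming:
  - Reducing words to a stem by stripping suffixes with heuristic rules; the result is not always a dictionary word.
  - Example: a typical rule-based stemmer (e.g. Porter) maps "running" and "runs" to "run"; irregular forms such as "ran" are left unchanged.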
- Lemmatization:
  - Reducing words to their base or dictionary form (lemma), using vocabulary and part of speech rather than suffix rules alone.
  - Example: "better" (as an adjective) becomes "good", and "ran" becomes "run".
- Lowercasing:
  - Converting all characters in the text to lowercase to ensure uniformity.
  - Example: "Data Science" becomes "data science".
- Removing Punctuation and Special Characters:
  - Eliminating punctuation marks and other non-alphanumeric characters.
  - Example: "Hello, World!" becomes "Hello World".
- Text Normalization:
  - Converting text to a standard format, such as expanding contractions ("can't" to "cannot").
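
The steps above can be sketched as a small pipeline using only Python's standard library. This is a minimal illustration, not a production implementation: the word lists (`CONTRACTIONS`, `STOP_WORDS`, `LEMMA_TABLE`), the function names, and the order of the steps are all assumptions made for this example. The `stem` function is a crude suffix stripper, far simpler than a real stemmer such as Porter, and `lemmatize` is a plain table lookup standing in for a dictionary-based lemmatizer.

```python
import re

# Tiny illustrative resources (assumptions for this sketch, not standard lists).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = {"and", "the", "is", "a", "an", "of", "to"}
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}

_CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions(text):
    """Text normalization: replace known contractions (case-insensitive)."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def tokenize(text):
    """Split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def remove_punctuation(tokens):
    """Keep only tokens that contain at least one alphanumeric character."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


def remove_stop_words(tokens):
    """Drop tokens found in the (illustrative) stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping; the result need not be a dictionary word."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            base = token[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), except for l, s, z.
            if suffix != "s" and base[-1] == base[-2] and base[-1] not in "aeioulsz":
                base = base[:-1]
            return base
    return token


def lemmatize(token):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(token, token)


def preprocess(text):
    """Normalize, tokenize, lowercase, strip punctuation, drop stop words."""
    text = expand_contractions(text)
    tokens = [t.lower() for t in tokenize(text)]
    tokens = remove_punctuation(tokens)
    return remove_stop_words(tokens)


print(preprocess("Data science is fascinating!"))
# ['data', 'science', 'fascinating']
print([stem(w) for w in ["running", "runs", "ran"]])
# ['run', 'run', 'ran']
print(lemmatize("better"), lemmatize("ran"))
# good run
```

Contractions are expanded before tokenization so that "can't" is not split at the apostrophe, and lowercasing happens before stop word removal so that "The" and "the" are treated alike. Stemming and lemmatization are kept out of `preprocess` because they are usually alternatives: a pipeline applies one or the other to the remaining tokens, depending on whether speed or dictionary-valid output matters more.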