Preprocessing Text

Tokenisation & Lowercasing

Partly covered under tokenisation: it’s the process of taking sentences and splitting them up into words or word parts (tokens). Lowercasing then folds every token to lower case, so you lose the nuance of Apple (company) v. apple (fruit).
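
A minimal sketch of both steps, assuming a simple regex is good enough for tokenising (real tokenisers handle punctuation, hyphens and sub-word pieces far more carefully). The function name `tokenise` is just for illustration.

```python
import re

def tokenise(text: str) -> list[str]:
    # Lowercase first, then keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenise("Apple sells more than one apple."))
# ['apple', 'sells', 'more', 'than', 'one', 'apple']
```

Both "Apple" and "apple" end up as the same token, which is exactly the nuance that gets lost.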

Stopping:

Removing very common words (stop words) like “the”, “at”, etc.
- Shrinks the index and speeds up processing
- Can destroy meaning (e.g. “to be or not to be”)
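
A rough sketch of stop-word filtering, assuming a tiny hand-picked list (`STOP_WORDS` here is made up for the example; real lists are much longer and vary by library).

```python
# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = {"the", "at", "a", "an", "of", "to", "be", "or", "not", "is", "in"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Keep only the tokens that are not in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "sat", "at", "the", "door"]))
# ['cat', 'sat', 'door']

print(remove_stop_words("to be or not to be".split()))
# []
```

The second call shows the downside: with this list, the whole phrase disappears.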

Lemmatising:

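Chopping off word endings to group words by their roots. (Strictly speaking this describes stemming; lemmatising maps each word to its dictionary form instead, but the trade-off is similar.)
- Increases recall (more relevant documents even if the user didn’t type the exact word)
- Decreases precision (user gets stuff on “river bank” if they searched “banking”)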
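
A toy suffix-stripper to show the idea, assuming three hard-coded endings. This is not a real stemming algorithm (such as Porter's), just enough to see the recall/precision trade-off; `crude_stem` and `SUFFIXES` are illustrative names.

```python
# Illustrative suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word: str) -> str:
    # Strip the first matching suffix, as long as at least 3 letters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["banking", "banks", "banked", "bank"]])
# ['bank', 'bank', 'bank', 'bank']
```

Everything collapses to "bank", so a search for "banking" now also matches documents about a river bank.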

Removing URLs or special characters:

Removes noise, but it’s unhelpful if you’re trying to search by URL.
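
A small sketch of this kind of cleaning, assuming a deliberately loose URL pattern and that "special characters" means anything that isn't a letter, digit or whitespace. `strip_noise` and `URL_PATTERN` are names made up for the example.

```python
import re

# Loose pattern: anything starting with http(s):// or www. up to the next space.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_noise(text: str) -> str:
    # Drop URLs first, so their punctuation doesn't leave fragments behind.
    text = URL_PATTERN.sub(" ", text)
    # Replace anything that isn't a letter, digit or whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the leftover runs of whitespace.
    return " ".join(text.split())

print(strip_noise("Read this!! https://example.com/post?id=1 #nlp"))
# Read this nlp
```

Once the URL is gone there is nothing left to match a URL search against, which is the cost mentioned above.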
