Tokenisation & Lowercasing
Tokenisation is the process of taking sentences and splitting them up into words or word parts (tokens). Lowercasing then folds every token to lower case, so you lose the nuance of Apple (company) v. apple (fruit).
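A minimal sketch of the two steps, assuming a simple regex tokeniser. The function name `tokenise` and the token pattern are illustrative choices, not a standard.

```python
import re

def tokenise(text: str) -> list[str]:
    # Lowercase first, then keep runs of letters, digits and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenise("Apple sells an apple.")
print(tokens)
```

After lowercasing, both "Apple" and "apple" end up as the same token, which is exactly the lost nuance described above.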
Stopping:
Removing very common words (stop words) like “the”, “at”, etc.
- Shrinks index and speeds up processing
- Can destroy meaning (e.g. “to be or not to be” is made up entirely of stop words)
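A small sketch of stop-word removal, assuming a hand-picked stop list. `STOP_WORDS` here is a tiny illustrative set, not a list taken from any library.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "at", "a", "an", "of", "to", "be", "or", "not", "is", "in"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    # Keep only the tokens that are not in the stop list.
    return [t for t in tokens if t not in STOP_WORDS]

kept = remove_stop_words(["the", "cat", "sat", "at", "the", "door"])
lost = remove_stop_words("to be or not to be".split())
print(kept, lost)
```

With this stop list the first call keeps three of six tokens, while the second returns an empty list, so the whole phrase disappears from the index.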
Stemming / lemmatising:
- Stemming chops off word endings; lemmatising maps each word to its dictionary form. Both group words by their roots.
- Increases recall (more relevant documents even if user didn’t type exact word)
- Decreases precision (user gets stuff on “river bank” if they searched “banking”)
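The recall/precision trade-off can be seen with a toy suffix stripper. This is a deliberately naive sketch, not the Porter algorithm or a real lemmatiser; `SUFFIXES` and `naive_stem` are made-up names for illustration.

```python
# Hypothetical suffix list; real stemmers use far more careful rules.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word: str) -> str:
    # Strip the first matching suffix, as long as at least 3 characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

query_root = naive_stem("banking")
doc_roots = [naive_stem(w) for w in ["river", "bank"]]
print(query_root, doc_roots)
```

Because "banking" and "bank" reduce to the same root, a query for "banking" now matches a document about a river bank: more matches (recall), but not all of them wanted (precision).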
Removing URLs or special characters:
- Removes noise, but is unhelpful if you’re trying to search by URL.
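A rough sketch of this clean-up step using regular expressions. The patterns are assumptions chosen for illustration: the URL pattern only catches `http://` and `https://` links, and "special characters" is taken to mean anything other than lowercase letters, digits and whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+")       # assumed: only http(s) links count as URLs
SPECIAL_RE = re.compile(r"[^a-z0-9\s]")    # assumed definition of "special characters"

def strip_noise(text: str) -> str:
    # Lowercase, drop URLs first (before their punctuation is removed), then drop specials.
    text = URL_RE.sub(" ", text.lower())
    text = SPECIAL_RE.sub(" ", text)
    # Collapse the leftover runs of whitespace.
    return " ".join(text.split())

cleaned = strip_noise("See https://example.com/docs for details!!")
print(cleaned)
```

The URL is gone from the cleaned text, so a later search for that URL can no longer match this document.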