Tokenisation & Lowercasing

Tokenisation is the process of taking sentences and splitting them up into words or word parts (tokens). Lowercasing then folds every token to lower case, so you lose the nuance of Apple (company) v. apple (fruit).
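
As a rough illustration of the step above, here is a minimal sketch in Python. The function name `tokenise` and the choice of a simple `\w+` regular expression are assumptions made for this example; real tokenisers handle apostrophes, hyphens and sub-word pieces with far more care.

```python
import re

def tokenise(text: str) -> list[str]:
    """Lowercase the text, then split it into word tokens."""
    # \w+ matches runs of letters, digits and underscores,
    # so punctuation and whitespace act as separators.
    return re.findall(r"\w+", text.lower())

print(tokenise("Apple sells more than one apple."))
# ['apple', 'sells', 'more', 'than', 'one', 'apple']
```

Both "Apple" and "apple" come out as the same token, which is exactly the loss of nuance mentioned above.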

Stopping:

Removing very common words (stop words) like “the”, “at”, etc.
- Shrinks index and speeds up processing
- Can destroy meaning (e.g. “to be or not to be”)
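
The trade-off above can be sketched in a few lines of Python. The `STOP_WORDS` set here is a tiny hand-picked list assumed for illustration only; it is not a standard list from any library.

```python
# Hypothetical, deliberately small stop-word list for this example.
STOP_WORDS = {"the", "at", "a", "an", "of", "to", "be", "or", "not", "is"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "sat", "at", "home"]))
# ['cat', 'sat', 'home']

print(remove_stop_words("to be or not to be".split()))
# []
```

The second call shows the downside: with this list the whole phrase disappears, so a search for it has nothing left to match.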

Stemming / Lemmatising:

  • Stemming chops off word endings to group words by their roots; lemmatising reaches the same goal by mapping each word to its dictionary form.
  • Increases recall (more relevant documents are returned even if the user didn’t type the exact word)
  • Decreases precision (user gets stuff on “river bank” if they searched “banking”)
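
To make the recall and precision effect concrete, here is a toy sketch of chopping word endings (usually called stemming). The suffix list, the `naive_stem` and `search` helpers, and the two sample documents are all assumptions invented for this example; production systems typically use an established algorithm such as the Porter stemmer instead.

```python
# Hypothetical suffix list, checked in order; far cruder than a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Two made-up documents: one about finance, one about rivers.
docs = {1: "online banking fees", 2: "walk along the river bank"}

def search(query: str) -> list[int]:
    """Return ids of documents containing a word with the same stem as the query."""
    stem = naive_stem(query.lower())
    return [
        doc_id
        for doc_id, text in docs.items()
        if stem in {naive_stem(w) for w in text.lower().split()}
    ]

print(naive_stem("banking"))  # bank
print(search("banking"))      # [1, 2]
```

Document 1 is found even though the index stores "bank" rather than "banking" (better recall), but document 2 about the river bank is returned too (worse precision).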

Removing URLs or special characters:

  • Removes noise, but is unhelpful if you’re trying to search by URL.
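
A minimal sketch of this cleaning step, using Python's standard `re` module. The URL pattern is a simplification assumed for this example (it only catches strings starting with `http://`, `https://` or `www.`), and the function name `strip_noise` is hypothetical.

```python
import re

# Simplified URL pattern: a scheme or "www." followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_noise(text: str) -> str:
    """Remove URLs, then any character that is not a word character or whitespace."""
    no_urls = URL_PATTERN.sub(" ", text)
    no_specials = re.sub(r"[^\w\s]", " ", no_urls)
    # Collapse the runs of spaces left behind by the substitutions.
    return " ".join(no_specials.split())

print(strip_noise("Read the docs at https://example.com/guide?id=1 now!!!"))
# Read the docs at now
```

Once the URL is gone it cannot be searched for, so a system that needs URL lookup would have to keep the original text in a separate field.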