An example: words and tokens
The following table shows an example of how words are annotated in the BNC (if you don't remember what the BNC is, go back to unit 2 in section 1).
| word      | pos  | lemma     |
|-----------|------|-----------|
| I         | PP   | i         |
| left      | VBD  | leave     |
| my        | PNP  | my        |
| pack      | NN   | pack      |
| behind    | RB   | behind    |
| .         | PUN  | .         |
| Of course | RB   | of course |
| ,         | PUN  | ,         |
| you       | PP   | you       |
| ca        | MD   | can       |
| n't       | XNOT | not       |
| just      | RB   | just      |
| go        | VB   | go        |
Each row in the table corresponds to a "word", while the columns correspond, respectively, to:
- word forms;
- part of speech codes (e.g. VBD for verbs in the past tense, or RB for adveRBs);
- lemmas.
Are all words and lemmas exactly in the form you expected?
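To make the annotation concrete, here is a minimal sketch in Python (not an official BNC tool or format, just a plain list of triples) that stores a few rows of the table as (word, pos, lemma) tuples and prints every word form whose lemma differs from the form itself:

```python
# A handful of rows from the table above, stored as (word, pos, lemma) triples.
tokens = [
    ("I", "PP", "i"),
    ("left", "VBD", "leave"),
    ("my", "PNP", "my"),
    ("pack", "NN", "pack"),
    ("behind", "RB", "behind"),
    (".", "PUN", "."),
]

# Print every word form whose lemma is not simply its lowercased form.
for word, pos, lemma in tokens:
    if word.lower() != lemma:
        print(f"{word} ({pos}) -> lemma {lemma}")
```

Run on these six rows, the sketch flags only left, whose lemma is leave; all the other lemmas coincide with the (lowercased) word forms.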
The technical term for the smallest meaningful item in a corpus is token, and the process of explicitly delimiting word tokens during annotation is known as tokenisation. A token can be a word, a punctuation mark or a word combination (such as of course). Each row in this table corresponds to a token in the BNC. The size of a corpus is also frequently quoted in tokens: for instance, the BNC contains 112 million tokens, which translates into about 100 million "real" words.
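As an illustration of what tokenisation involves, here is a rough Python sketch of a tokeniser. It is far simpler than the procedure actually used for the BNC: for instance, it splits can't into ca and n't, but it does not recognise multiword tokens such as of course, which the table above treats as a single token.

```python
import re

def tokenise(text):
    """A deliberately naive tokeniser: split off the clitic n't and punctuation."""
    text = re.sub(r"n't\b", " n't", text)      # "can't" -> "ca n't"
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokens = tokenise("I left my pack behind. Of course, you can't just go")
print(tokens)
print(len(tokens), "tokens")

# Counting only tokens that contain a letter gives a rough idea of how a
# token count can exceed the number of "real" words (punctuation is excluded).
words = [t for t in tokens if any(c.isalpha() for c in t)]
print(len(words), "word-like tokens")
```

On this sentence the sketch produces 14 tokens (it splits of course into two tokens, unlike the BNC annotation), of which only 12 contain letters; this mirrors, on a tiny scale, why the BNC's token count is higher than its count of "real" words.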