An example: words and tokens

The following table shows an example of how words are annotated in the BNC (if you don't remember what the BNC is, go back to unit 2 in section 1).
| word | pos | lemma |
|------|-----|-------|
| I | PP | i |
| left | VBD | leave |
| my | PNP | my |
| pack | NN | pack |
| behind | RB | behind |
| . | PUN | . |
| Of course | RB | of course |
| , | PUN | , |
| you | PP | you |
| ca | MD | can |
| n't | XNOT | not |
| just | RB | just |
| go | VB | go |
Each row in the table corresponds to a "word", while the columns contain, respectively:
- word forms;
- part-of-speech codes (e.g. VBD for verbs in the past tense, or RB for adverbs);
- lemmas.
Are all words and lemmas exactly in the form you expected?
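
To make the row-and-column structure more concrete, the sketch below shows one way the annotated tokens above could be held in a program, as (word form, POS code, lemma) triples. This is only an illustration under assumed names; the BNC itself is distributed as annotated XML, not as Python data.

```python
from collections import Counter

# Each annotated token from the table as a (word form, POS code, lemma) triple.
tokens = [
    ("I",         "PP",   "i"),
    ("left",      "VBD",  "leave"),
    ("my",        "PNP",  "my"),
    ("pack",      "NN",   "pack"),
    (".",         "PUN",  "."),
    ("Of course", "RB",   "of course"),
    (",",         "PUN",  ","),
    ("you",       "PP",   "you"),
    ("ca",        "MD",   "can"),
    ("n't",       "XNOT", "not"),
    ("just",      "RB",   "just"),
    ("go",        "VB",   "go"),
]

# Tally how often each POS code occurs in this small sample.
pos_counts = Counter(pos for _, pos, _ in tokens)
print(pos_counts.most_common())
```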

The technical term for the smallest meaningful item in a corpus is token, and the process of explicitly delimiting word tokens during annotation is known as tokenisation. A token can be a word, a punctuation mark, or a word combination (such as of course). Each row in the table above corresponds to one token in the BNC. The size of a corpus is also frequently quoted in tokens: for instance, the BNC contains 112 million tokens, which translates into about 100 million "real" words.
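
To see roughly what tokenisation involves, here is a minimal sketch of a rule-based tokeniser that reproduces the segmentation in the table above, splitting off punctuation and the clitic "n't" (so that "can't" becomes the two tokens "ca" and "n't"). The regular expression and function name are assumptions made purely for illustration; the BNC was actually tokenised and tagged automatically with the CLAWS tagger, and this sketch does not handle multiword tokens such as "of course".

```python
import re

# Match, in order of preference:
#   1. the word part preceding a clitic "n't" (e.g. "ca" in "can't"),
#   2. the clitic "n't" itself,
#   3. any other run of word characters,
#   4. any single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")

def tokenise(text):
    """Return a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("I left my pack behind. Of course, you can't just go"))
# ['I', 'left', 'my', 'pack', 'behind', '.', 'Of', 'course', ',',
#  'you', 'ca', "n't", 'just', 'go']
```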