An example: words and tokens
Words, lemmas and spaces
The following table shows an example of how words are annotated in the BNC (if you don't remember what the BNC is, go back to unit 2 in section 1).
| word | pos | lemma |
|------|-----|-------|
| I | PP | i |
| left | VBD | leave |
| my | PNP | my |
| pack | NN | pack |
| behind | RB | behind |
| . | PUN | . |
| Of course | RB | of course |
| , | PUN | , |
| you | PP | you |
| ca | MD | can |
| n't | XNOT | not |
| just | RB | just |
| go | VB | go |
Each row in the table corresponds to a "word", while the columns correspond, respectively, to the following (see the short sketch after this list):
- word forms;
- part of speech codes (e.g. VBD for verbs in the past tense, or RB for adveRBs);
- lemmas.
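
To give a concrete idea of how such annotated rows can be handled in practice, here is a minimal Python sketch. It assumes a hypothetical tab-separated "vertical" layout (one token per line, with word, POS code and lemma in separate columns); this layout is used purely for illustration and is not the format in which the BNC is actually distributed.

```python
# A minimal sketch, assuming a hypothetical tab-separated vertical format:
# word <TAB> pos <TAB> lemma, one token per line.
annotated = """I\tPP\ti
left\tVBD\tleave
my\tPNP\tmy
pack\tNN\tpack
behind\tRB\tbehind
.\tPUN\t."""

for line in annotated.splitlines():
    word, pos, lemma = line.split("\t")
    print(f"{word:10}{pos:6}{lemma}")
```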
Are all words and lemmas exactly in the form you expected?
The technical term for the smallest meaningful item in a corpus is token, and the process of explicitly delimiting word tokens during annotation is known as tokenisation. A token can be a word, a punctuation mark or a word combination (such as of course). Each row in the table above corresponds to a token in the BNC. The size of a corpus is also frequently quoted in tokens: for instance, the BNC contains 112 million tokens, which translates into about 100 million "real" words.
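
To make tokenisation more concrete, here is a toy Python sketch of the kind of decisions a tokeniser has to make: it splits off punctuation marks and the negative clitic n't, so that can't becomes the two tokens ca and n't. This is only an illustration under those simplifying assumptions, not the procedure actually used to tokenise the BNC, and it does not group multiword tokens such as of course.

```python
import re

def tokenise(text):
    """A toy tokeniser: split off the clitic "n't" and punctuation marks.
    It does NOT reproduce the BNC's actual tokenisation, and it does not
    handle multiword tokens such as "of course"."""
    # Detach "n't" from the preceding verb form ("can't" -> "ca n't").
    text = re.sub(r"n't\b", " n't", text)
    # A token is either "n't", a run of word characters, or a single
    # punctuation character.
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokens = tokenise("I left my pack behind. Of course, you can't just go.")
print(tokens)                 # ['I', 'left', ..., 'ca', "n't", 'just', 'go', '.']
print(len(tokens), "tokens")  # 15 tokens
```

Note how even this toy output already differs from the annotation in the table: of course comes out as two separate tokens here, which is exactly the kind of decision a real tokeniser has to make explicit.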