An example: words and tokens

Words, lemmas and spaces

The following table shows an example of how words are annotated in the BNC (if you don't remember what the BNC is, go back to unit 2 in section 1).

word         pos     lemma
I            PP      i
left         VBD     leave
my           PNP     my
pack         NN      pack
behind       RB      behind
.            PUN     .
Of course    RB      of course
,            PUN     ,
you          PP      you
ca           MD      can
n't          XNOT    not
just         RB      just
go           VB      go

Each row in the table corresponds to a "word", while the columns correspond, respectively, to:

  1. word forms;
  2. part of speech codes (e.g. VBD for verbs in the past tense, or RB for adveRBs);
  3. lemmas.
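
The three columns can be sketched as a simple record type. This is only an illustration: the names Token, TOKENS and differing are invented here and are not part of the BNC or of any annotation tool. The rows are copied from the table, and the helper lists the tokens whose lemma is not identical to the word form.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # word form as it appears in the text
    pos: str    # part-of-speech code
    lemma: str  # base form assigned to the word form


# Rows copied from the table in this unit.
TOKENS = [
    Token("I", "PP", "i"),
    Token("left", "VBD", "leave"),
    Token("my", "PNP", "my"),
    Token("pack", "NN", "pack"),
    Token("behind", "RB", "behind"),
    Token(".", "PUN", "."),
    Token("Of course", "RB", "of course"),
    Token(",", "PUN", ","),
    Token("you", "PP", "you"),
    Token("ca", "MD", "can"),
    Token("n't", "XNOT", "not"),
    Token("just", "RB", "just"),
    Token("go", "VB", "go"),
]


def differing(tokens):
    """Return the tokens whose lemma differs from the word form."""
    return [t for t in tokens if t.word != t.lemma]


for t in differing(TOKENS):
    print(f"{t.word!r} -> {t.lemma!r} ({t.pos})")
```

Running it lists the rows worth a second look, for example the lower-cased lemmas and the two halves of "can't".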

Are all words and lemmas exactly in the form you expected?

The technical term for the smallest meaningful item in a corpus is "token", and the process of explicitly delimiting word tokens during annotation is known as tokenisation. A token can be a word, a punctuation mark or a word combination (such as "of course"); it can also be part of a word, as with "ca" and "n't" in the table. Each row in the table corresponds to a token in the BNC. The size of a corpus is also frequently quoted in tokens. For instance, the BNC contains 112 million tokens, which translates into about 100 million "real" words.
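
To make the idea of tokenisation more concrete, here is a minimal sketch of a rule-based tokeniser in Python. It is not the procedure used to build the BNC: the list MULTIWORD_UNITS, the regular expression and the function names are assumptions made for this example. It only reproduces the three behaviours visible in the table, namely keeping "of course" together, splitting "n't" from its verb, and treating punctuation marks as separate tokens.

```python
import re

# Hypothetical list of word combinations to keep as one token.
MULTIWORD_UNITS = ["of course"]

# Alternatives are tried from left to right at each position.
TOKEN_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, MULTIWORD_UNITS)) + r")\b"  # word combinations
    r"|n't"            # the negative clitic as its own token
    r"|\w+?(?=n't)"    # the part of a word that precedes n't, e.g. "ca"
    r"|\w+"            # any other word
    r"|[^\w\s]",       # a single punctuation mark
    re.IGNORECASE,
)


def tokenise(text):
    """Split text into a list of token strings."""
    return TOKEN_RE.findall(text)


def is_punctuation(token):
    """True if the token contains no letter, digit or underscore."""
    return re.fullmatch(r"[^\w\s]+", token) is not None


tokens = tokenise("I left my pack behind. Of course, you can't just go")
print(tokens)
print(len(tokens), "tokens,",
      sum(not is_punctuation(t) for t in tokens), "of them not punctuation")
```

On the example sentence this yields the same 13 tokens as the table. Counting only the tokens that are not punctuation is one rough way to see why a token count is higher than a word count; it is an illustration, not the rule behind the BNC figures.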