An example: words and tokens
Words, lemmas and spaces
The following table shows an example of how words are annotated in the BNC (if you don't remember what the BNC is, go back to unit 2 in section 1).
| word | pos | lemma |
|------|-----|-------|
| I | PP | i |
| left | VBD | leave |
| my | PNP | my |
| pack | NN | pack |
| behind | RB | behind |
| . | PUN | . |
| Of course | RB | of course |
| , | PUN | , |
| you | PP | you |
| ca | MD | can |
| n't | XNOT | not |
| just | RB | just |
| go | VB | go |
Each row in the table corresponds to a "word", while the columns correspond, respectively, to the following (see the short sketch after this list):
- word forms;
- part of speech codes (e.g. VBD for verbs in the past tense, or RB for adveRBs);
- lemmas.
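
To give a concrete idea of how such annotated rows can be handled in practice, here is a minimal Python sketch. It assumes a hypothetical tab-separated "vertical" layout (one token per line, with word, POS code and lemma in separate columns); this layout is used purely for illustration and is not the format in which the BNC is actually distributed.

```python
# A minimal sketch, assuming a hypothetical tab-separated vertical format:
# word <TAB> pos <TAB> lemma, one token per line.
annotated = """I\tPP\ti
left\tVBD\tleave
my\tPNP\tmy
pack\tNN\tpack
behind\tRB\tbehind
.\tPUN\t."""

for line in annotated.splitlines():
    word, pos, lemma = line.split("\t")
    print(f"{word:10}{pos:6}{lemma}")
```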
Are all words and lemmas exactly in the form you expected?
The technical term for the smallest meaningful item in a corpus is token, and the process of explicitly delimiting word tokens during annotation is known as tokenisation. A token can be a word, a punctuation mark or a word combination (such as of course). Each row in the table above corresponds to a token in the BNC. The size of a corpus is also frequently quoted in tokens: for instance, the BNC contains 112 million tokens, which translates into about 100 million "real" words.
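
To make tokenisation more concrete, here is a toy Python sketch of the kind of decisions a tokeniser has to make: it splits off punctuation marks and the negative clitic n't, so that can't becomes the two tokens ca and n't. This is only an illustration under those simplifying assumptions, not the procedure actually used to tokenise the BNC, and it does not group multiword tokens such as of course.

```python
import re

def tokenise(text):
    """A toy tokeniser: split off the clitic "n't" and punctuation marks.
    It does NOT reproduce the BNC's actual tokenisation, and it does not
    handle multiword tokens such as "of course"."""
    # Detach "n't" from the preceding verb form ("can't" -> "ca n't").
    text = re.sub(r"n't\b", " n't", text)
    # A token is either "n't", a run of word characters, or a single
    # punctuation character.
    return re.findall(r"n't|\w+|[^\w\s]", text)

tokens = tokenise("I left my pack behind. Of course, you can't just go.")
print(tokens)                 # ['I', 'left', ..., 'ca', "n't", 'just', 'go', '.']
print(len(tokens), "tokens")  # 15 tokens
```

Note how even this toy output already differs from the annotation in the table: of course comes out as two separate tokens here, which is exactly the kind of decision a real tokeniser has to make explicit.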