Sentiment Symposium Tutorial: Stemming

  1. Overview
  2. Porter stemmer
  3. Lancaster stemmer
  4. WordNet stemmer
  5. Assessment
    1. Classification accuracy
    2. Speed
  6. Summary of conclusions

Overview

Stemming is a method for collapsing morphologically related word forms into a single base form (stem). This reduces the vocabulary size, which can sharpen one's results, especially for small data sets.

This section reviews three common stemming algorithms in the context of sentiment: the Porter stemmer, the Lancaster stemmer, and the WordNet stemmer.

My overall conclusion is that the Porter and Lancaster stemmers destroy too many sentiment distinctions. The WordNet stemmer does not have this problem nearly so severely, but it doesn't do enough collapsing to be worth the resources necessary to run it.


Porter stemmer

The Porter stemmer is one of the earliest and best-known stemming algorithms. It works by heuristically identifying word suffixes (endings) and stripping them off, with some regularization of the endings.

The Porter stemmer often collapses sentiment distinctions by mapping two words with different sentiment to the same stemmed form. Table tab:porter provides examples of such collapsing relative to the disjoint Positiv/Negativ classes of the Harvard General Inquirer, a large gold-standard semantic resource containing extensive sentiment information.

Table tab:porter
Porter stemming. 36 instances in which a Harvard Inquirer Positiv/Negativ distinction is destroyed by the algorithm.
Positiv Negativ Stemmed
captivation captive captiv
common commoner common
defend defendant defend
defense defensive defens
dependability dependent depend
dependable dependent depend
desirable desire desir
dominance dominate domin
dominance domination domin
extravagance extravagant extravag
home homely home
pass passe pass
patron patronize patron
prosecute prosecution prosecut
affection affectation affect
capitalize capital capit
closeness close close
commitment commit commit
competence compete compet
competency compete compet
competent compete compet
conviction convict convict
defender defendant defend
desirous desire desir
impetus impetuous impetu
indulgence indulge indulg
objective object object
objective objection object
rational ration ration
subsidize subside subsid
temperance temper temper
temperate temper temper
tolerance tolerable toler
tolerant tolerable toler
tolerate tolerable toler
toleration tolerable toler
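
A quick way to spot-check these collapses is to run a few of the table's pairs through an off-the-shelf Porter implementation. The sketch below uses NLTK's PorterStemmer; this is an assumption about the implementation in use, and exact outputs can vary slightly across Porter variants, but per Table tab:porter each pair should reduce to a single stem.

from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Positiv/Negativ pairs taken from Table tab:porter.
pairs = [("desirable", "desire"),     # expected stem: desir
         ("objective", "objection"),  # expected stem: object
         ("tolerant", "tolerable")]   # expected stem: toler

for positiv, negativ in pairs:
    print(positiv, "->", porter.stem(positiv), "|",
          negativ, "->", porter.stem(negativ))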

Lancaster stemmer

The Lancaster stemmer is another widely used stemming algorithm. For sentiment analysis, however, it is arguably even more problematic than the Porter stemmer, since it collapses even more words of differing sentiment. Table tab:lancaster illustrates this with a random selection of such collapses, again using the Harvard General Inquirer's Positiv/Negativ distinction as a gold standard.

Table tab:lancaster
Lancaster stemming. 50 randomly selected instances in which a Harvard Inquirer Positiv/Negativ distinction is destroyed by the algorithm.
Positiv Negativ Stemmed
apprehend apprehensive apprehend
arbitrate arbitrary arbit
arbitration arbitrary arbit
audible audacious aud
call callous cal
capitalize capital capit
captivation capture capt
captivation captive capt
comical commiseration com
comely commiseration com
comic commiseration com
commitment commit commit
competency compete compet
compliment complicate comply
compliment complication comply
consummate consumptive consum
content conceal cont
contentment conceal cont
conviction convict convict
credentials credulous cred
credibility credulous cred
cute cut cut
deference defeat def
defender defensive defend
defend defensive defend
defend defendant defend
dependability dependent depend
desirous desire desir
dominance dominate domin
famous famished fam
fill filth fil
flourish floor flo
meaningful mean mean
notoriety notorious not
notable notorious not
passionate passe pass
pass passe pass
patronage patronize patron
rational ration rat
refuge refugee refug
repentance repeal rep
repent repeal rep
ripe rip rip
savings savage sav
simplify simplistic simpl
simplicity simplistic simpl
suffice sufferer suff
temperate temper temp
tolerant tolerable tol
truth truant tru
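
The same spot-check can be run with NLTK's LancasterStemmer (again, an assumption about the exact implementation of the Paice/Husk algorithm). Per Table tab:lancaster, each pair below should collapse to a single, very short stem.

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()

# Positiv/Negativ pairs taken from Table tab:lancaster.
pairs = [("arbitrate", "arbitrary"),    # expected stem: arbit
         ("credibility", "credulous"),  # expected stem: cred
         ("savings", "savage")]         # expected stem: sav

for positiv, negativ in pairs:
    print(positiv, "->", lancaster.stem(positiv), "|",
          negativ, "->", lancaster.stem(negativ))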

WordNet stemmer

WordNet (Fellbaum 1998) has high-precision stemming functionality, but it is probably of limited use for sentiment analysis. To effect real change, it requires (word, part-of-speech tag) pairs, where the part-of-speech tag is a (adjective), n (noun), r (adverb), or v (verb). When given such pairs, it collapses tense, aspect, and number marking.

The only danger I know of for sentiment analysis is that it collapses base, comparative, and superlative adjective forms. Table tab:wordnet provides some illustrations.

Table tab:wordnet
WordNet stemming. Representative examples of what the stemmer does and doesn't do. Collapsing adjectival forms is the only worrisome behavior when it comes to sentiment.
Word Stemmed
(exclaims, v) exclaim
(exclaimed, v) exclaim
(exclaiming, v) exclaim
(exclamation, n) exclamation
(proved, v) prove
(proven, v) prove
(proven, a) proven
(happy, a) happy
(happier, a) happy
(happiest, a) happy
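
These behaviors are easy to reproduce. The sketch below uses NLTK's WordNetLemmatizer as the WordNet stemmer, which is an assumption about the specific interface; it also requires the WordNet data (e.g., via nltk.download('wordnet')). Expected outputs follow Table tab:wordnet.

from nltk.stem import WordNetLemmatizer

wn = WordNetLemmatizer()

print(wn.lemmatize("exclaimed", pos="v"))    # exclaim
print(wn.lemmatize("exclamation", pos="n"))  # exclamation (left alone)
print(wn.lemmatize("proven", pos="a"))       # proven (left alone)
print(wn.lemmatize("happiest", pos="a"))     # happy -- the risky adjective collapse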

Assessment

Classification accuracy

To assess the impact of the stemming algorithms, I use the same experimental set-up that I used when assessing tokenizers. Here, though, I compare just the plain sentiment-aware tokenizer with the same tokenizer plus Porter or Lancaster stemming applied to its output. (Since the WordNet stemmer requires part-of-speech tagged data, and since its changes are minimal, I don't assess it here.)
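
A minimal sketch of how this comparison might be wired up is given below. It assumes a simple bag-of-words classifier, and sentiment_tokenize is only a placeholder standing in for the sentiment-aware tokenizer from the tokenizing section; the actual experimental design may differ in its details.

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

porter = PorterStemmer()

def sentiment_tokenize(text):
    # Placeholder: a real sentiment-aware tokenizer preserves emoticons,
    # markup, capitalization cues, and so on.
    return text.lower().split()

def stemmed_tokenize(text):
    # Stemming applied to the output of sentiment tokenization.
    return [porter.stem(tok) for tok in sentiment_tokenize(text)]

def build_classifier(tokenizer):
    # Bag-of-words features over the given tokenizer, fed to Naive Bayes.
    vectorizer = CountVectorizer(tokenizer=tokenizer, token_pattern=None)
    return make_pipeline(vectorizer, MultinomialNB())

# plain   = build_classifier(sentiment_tokenize)
# stemmed = build_classifier(stemmed_tokenize)
# Each pipeline is then fit and scored on the same train/test splits.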

The results of the experiment are given in figure fig:stemmer_accuracy. The stemmers appear to benefit somewhat from their reduced vocabulary size when the amount of training data is small, though not enough to improve on the sentiment-aware tokenizer, and they are quickly outpaced as the training data grows.

Figure fig:stemmer_accuracy
Assessing stemming algorithms via classification. (Details on the experimental design.)
figures/stemmer-accuracy.png

Speed

Table tab:tokenizer_speed extends the tokenizer speed assessment given earlier with comparable numbers for the stemming algorithms.

Table tab:tokenizer_speed
Tokenizer speed for 12,000 OpenTable reviews. The numbers are averages for 100 rounds. The average review length is about 50 words.
Tokenizer Total time (secs) Average secs/text
Whitespace 1.305 0.0001
Treebank 9.085 0.001
Sentiment 29.915 0.002
Sentiment + Porter stemming 49.471 0.004
Sentiment + Lancaster stemming 62.938 0.005
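
For reference, a rough sketch of how per-text timings like these might be collected is given below; reviews (the list of texts) and sentiment_tokenize are assumed to exist, and the actual measurements may have been taken differently.

import time

def time_tokenizer(tokenize, texts, rounds=100):
    # Returns (average total seconds per round, average seconds per text).
    start = time.time()
    for _ in range(rounds):
        for text in texts:
            tokenize(text)
    total = (time.time() - start) / rounds
    return total, total / len(texts)

# from nltk.stem import PorterStemmer
# porter = PorterStemmer()
# total_secs, per_text = time_tokenizer(
#     lambda text: [porter.stem(w) for w in sentiment_tokenize(text)], reviews)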

Summary of conclusions

  1. My central conclusion is that, for sentiment analysis, running a stemmer is costly both in computational resources and in classification accuracy.
  2. I can imagine running the WordNet stemmer if matching against a restricted vocabulary became important, but in that case it would be better to run the algorithm in reverse to expand the word list.
  3. I don't mean to suggest that stemming could never help with sentiment analysis, but rather only that these off-the-shelf stemming algorithms can weaken sentiment systems.