Stemming is a method for collapsing distinct word forms into a single form. This reduces the vocabulary size, which might in turn sharpen one's results, especially for small data sets.
This section reviews three common stemming algorithms in the context of sentiment: the Porter stemmer, the Lancaster stemmer, and the WordNet stemmer.
My overall conclusion is that the Porter and Lancaster stemmers destroy too many sentiment distinctions. The WordNet stemmer does not have this problem nearly so severely, but it doesn't do enough collapsing to be worth the resources necessary to run it.
The Porter stemmer is one of the earliest and best-known stemming algorithms. It works by heuristically identifying word suffixes (endings) and stripping them off, with some regularization of the endings.
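To make the suffix-stripping idea concrete, here is a minimal sketch. It is emphatically not the Porter algorithm: the rule list, the `MIN_STEM_LENGTH` guard, and the name `toy_stem` are all made up for illustration, and the real stemmer applies many more rules in several ordered phases.

```python
# Toy suffix stripper: NOT the Porter algorithm, only the general idea.
# Each rule is (suffix, replacement). Rules are tried in order, and the
# first rule that matches and leaves a long enough stem is applied.
RULES = [
    ("ization", "ize"),  # regularize the ending rather than just dropping it
    ("ations", "ate"),
    ("ation", "ate"),
    ("ness", ""),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]

MIN_STEM_LENGTH = 3  # hypothetical guard against over-stripping short words


def toy_stem(word):
    """Strip or regularize one suffix; return the word unchanged if no rule applies."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= MIN_STEM_LENGTH:
                return stem + replacement
    return word


print(toy_stem("closeness"))   # close
print(toy_stem("exclaimed"))   # exclaim
print(toy_stem("domination"))  # dominate
```

Even this toy version maps "domination" and "dominate" to the same string, which is the kind of merging that matters for sentiment.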
The Porter stemmer often collapses sentiment distinctions by mapping two words with different sentiment into the same stemmed form. The table below provides examples of such collapsing relative to the disjoint Positiv/Negativ classes of the Harvard General Inquirer, a large gold-standard semantic resource containing extensive sentiment information.
Positiv | Negativ | Stemmed |
---|---|---|
captivation | captive | captiv |
common | commoner | common |
defend | defendant | defend |
defense | defensive | defens |
dependability | dependent | depend |
dependable | dependent | depend |
desirable | desire | desir |
dominance | dominate | domin |
dominance | domination | domin |
extravagance | extravagant | extravag |
home | homely | home |
pass | passe | pass |
patron | patronize | patron |
prosecute | prosecution | prosecut |
affection | affectation | affect |
capitalize | capital | capit |
closeness | close | close |
commitment | commit | commit |
competence | compete | compet |
competency | compete | compet |
competent | compete | compet |
conviction | convict | convict |
defender | defendant | defend |
desirous | desire | desir |
impetus | impetuous | impetu |
indulgence | indulge | indulg |
objective | object | object |
objective | objection | object |
rational | ration | ration |
subsidize | subside | subsid |
temperance | temper | temper |
temperate | temper | temper |
tolerance | tolerable | toler |
tolerant | tolerable | toler |
tolerate | tolerable | toler |
toleration | tolerable | toler |
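Collisions like those in the table above can be found mechanically: stem every word in the two classes and keep the stems that occur on both sides. The sketch below assumes the lexicon is available as two plain word lists and that the stemmer is any function from strings to strings. The name `sentiment_collisions` is hypothetical, and the five-character truncation used in the example is only a stand-in for a real stemmer (for instance, the `stem` method of NLTK's `PorterStemmer` could be passed instead).

```python
from collections import defaultdict


def sentiment_collisions(positive, negative, stem):
    """Return {stem: (positive_words, negative_words)} for stems found in both classes."""
    by_stem = defaultdict(lambda: (set(), set()))
    for word in positive:
        by_stem[stem(word)][0].add(word)
    for word in negative:
        by_stem[stem(word)][1].add(word)
    # Keep only the stems that have at least one word from each class.
    return {
        s: (sorted(pos), sorted(neg))
        for s, (pos, neg) in by_stem.items()
        if pos and neg
    }


# Tiny illustrative word lists; a real run would use the full lexicon.
positive = ["tolerance", "tolerant", "defend", "home"]
negative = ["tolerable", "defendant", "homely", "awful"]

# Stand-in stemmer (an assumption for this sketch): keep the first five letters.
collisions = sentiment_collisions(positive, negative, stem=lambda w: w[:5])
for stemmed, (pos_words, neg_words) in collisions.items():
    print(stemmed, pos_words, neg_words)
```

With these inputs the function reports two shared stems, one for the "toler" words and one for "defend"/"defendant". "home" and "homely" stay apart only because the stand-in truncation is cruder than a real stemmer.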
The Lancaster stemmer is another widely used stemming algorithm. However, for sentiment analysis, it is arguably even more problematic than the Porter stemmer, since it collapses even more words of differing sentiment. The table below illustrates with a randomly chosen selection of such collapses, again using the Harvard General Inquirer's Positiv/Negativ distinction as a gold standard.
Positiv | Negativ | Stemmed |
---|---|---|
apprehend | apprehensive | apprehend |
arbitrate | arbitrary | arbit |
arbitration | arbitrary | arbit |
audible | audacious | aud |
call | callous | cal |
capitalize | capital | capit |
captivation | capture | capt |
captivation | captive | capt |
comical | commiseration | com |
comely | commiseration | com |
comic | commiseration | com |
commitment | commit | commit |
competency | compete | compet |
compliment | complicate | comply |
compliment | complication | comply |
consummate | consumptive | consum |
content | conceal | cont |
contentment | conceal | cont |
conviction | convict | convict |
credentials | credulous | cred |
credibility | credulous | cred |
cute | cut | cut |
deference | defeat | def |
defender | defensive | defend |
defend | defensive | defend |
defend | defendant | defend |
dependability | dependent | depend |
desirous | desire | desir |
dominance | dominate | domin |
famous | famished | fam |
fill | filth | fil |
flourish | floor | flo |
meaningful | mean | mean |
notoriety | notorious | not |
notable | notorious | not |
passionate | passe | pass |
pass | passe | pass |
patronage | patronize | patron |
rational | ration | rat |
refuge | refugee | refug |
repentance | repeal | rep |
repent | repeal | rep |
ripe | rip | rip |
savings | savage | sav |
simplify | simplistic | simpl |
simplicity | simplistic | simpl |
suffice | sufferer | suff |
temperate | temper | temp |
tolerant | tolerable | tol |
truth | truant | tru |
WordNet (Fellbaum 1998) has high-precision stemming functionality, but it is probably of limited use for sentiment analysis. To effect real change, it requires (word, part-of-speech tag) pairs, where the part-of-speech tag is a (adjective), n (noun), r (adverb), or v (verb). When given such pairs, it collapses tense, aspect, and number marking.
The only danger I know of for sentiment analysis is that it collapses base, comparative, and superlative adjective forms. The table below provides some illustrations.
Word | Stemmed |
---|---|
(exclaims, v) | exclaim |
(exclaimed, v) | exclaim |
(exclaiming, v) | exclaim |
(exclamation, n) | exclamation |
(proved, v) | prove |
(proven, v) | prove |
(proven, a) | proven |
(happy, a) | happy |
(happier, a) | happy |
(happiest, a) | happy |
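The (word, tag) pairs in the table use WordNet's one-letter categories, whereas part-of-speech taggers usually emit a finer-grained tag set. As a sketch, and assuming the tagger produces Penn Treebank tags (the text above does not say which tagger is used), a mapping along the following lines could build the pairs. The helper names are hypothetical.

```python
# Assumption: input tags follow the Penn Treebank convention, where
# adjective tags start with JJ, noun tags with NN, adverb tags with RB,
# and verb tags with VB.
PREFIX_TO_WORDNET = (("JJ", "a"), ("NN", "n"), ("RB", "r"), ("VB", "v"))


def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a, n, r, or v; return None for anything else."""
    for prefix, wordnet_pos in PREFIX_TO_WORDNET:
        if tag.startswith(prefix):
            return wordnet_pos
    return None


def to_wordnet_pairs(tagged_tokens):
    """Convert (word, treebank_tag) pairs to (word, wordnet_pos_or_None) pairs."""
    return [(word, treebank_to_wordnet(tag)) for word, tag in tagged_tokens]


tagged = [("They", "PRP"), ("exclaimed", "VBD"), ("happier", "JJR")]
print(to_wordnet_pairs(tagged))
# [('They', None), ('exclaimed', 'v'), ('happier', 'a')]
```

Tokens that map to `None` would simply be passed through unstemmed. The remaining pairs are in the form a WordNet-based lemmatizer expects, for example the `lemmatize(word, pos)` method of NLTK's `WordNetLemmatizer`.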
To assess the impact of the stemming algorithms, I use the same experimental set-up that I used when assessing tokenizers. Here, though, I compare just the plain sentiment tokenizer with the sentiment tokenizer plus the Porter and Lancaster stemmers, applied to the output of sentiment tokenization. (Since the WordNet stemmer requires part-of-speech tagged data, and since its changes are minimal, I don't assess it here.)
The results of the experiment are given in figure fig:stemmer_accuracy. It looks like the stemmers benefit somewhat from their reduced vocabulary size when the amount of training data is small, though not enough to improve on the sentiment-aware tokenizer, and they are quickly out-paced as the training data grows.
The table below extends the tokenizer speed assessment given earlier with comparable numbers for the stemming algorithms.
Tokenizer | Total time (secs) | Average secs/text |
---|---|---|
Whitespace | 1.305 | 0.0001 |
Treebank | 9.085 | 0.001 |
Sentiment | 29.915 | 0.002 |
Sentiment + Porter stemming | 49.471 | 0.004 |
Sentiment + Lancaster stemming | 62.938 | 0.005 |
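Numbers of this kind can be gathered with a simple wall-clock harness. The sketch below is not the script behind the table. It assumes only that each pipeline is a function from a text to a list of tokens; `time_pipeline` is a hypothetical name, and whitespace splitting serves as the example pipeline.

```python
import time


def time_pipeline(texts, pipeline):
    """Return (total_seconds, average_seconds_per_text) for running pipeline on texts."""
    if not texts:
        raise ValueError("need at least one text to time")
    start = time.perf_counter()
    for text in texts:
        pipeline(text)  # the output is discarded; only the elapsed time matters
    total = time.perf_counter() - start
    return total, total / len(texts)


# Example: time plain whitespace tokenization on a few made-up texts.
sample_texts = ["I loved it!", "Not good at all.", "Meh :-/"] * 1000
total_secs, avg_secs = time_pipeline(sample_texts, str.split)
print(f"Whitespace | {total_secs:.3f} | {avg_secs:.6f}")
```

A stemming pipeline can be timed the same way by passing a function that tokenizes and then stems each token, so that the extra cost of stemming shows up as the difference between the two rows.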