So far, the only structure we've imposed on our texts is to (carefully) turn them into lists of tokens. The present section explores practical methods for building on that basic structure by identifying semantic groupings and relationships that are relevant for sentiment.
Sentiment words behave very differently when under the semantic scope of negation. The goal of this section is to present and support a method for approximating this semantic influence.
The rules of thumb for how negation interacts with sentiment words are roughly as follows:
These observations suggest that it would be difficult to have a general a priori rule for how to handle negation. It doesn't just turn good to bad and bad to good. Its effects depend on the words being negated.
An additional challenge for negation is that its expression is lexically diverse and its influences are far-reaching (syntactically speaking). In all of the following, for example, we have something like neg-enjoy:
Paying attention to bigrams like not enjoy would partially address the challenge, but it would leave out a lot of the effects of negation.
We could capture the effects of negation fairly exactly if we had accurate semantic parses and were willing to devote extensive time and energy to implementing the relevant principles of semantic scope-taking. Though I myself love such projects, I suspect that it would not be worth the effort. The approximate method I develop next does extremely well.
The method I favor for approximating the effects of negation is due to Das and Chen 2001 and Pang, Lee, and Vaithyanathan 2002. It works as as follows:
Negation and clause-level punctuation are defined as follows:
(?: ^(?:never|no|nothing|nowhere|noone|none|not| havent|hasnt|hadnt|cant|couldnt|shouldnt| wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint )$ ) | n't
The sentiment tokenization strategy we've been using makes this straightforward, since it isolates the clausal punctation from the word-internal punctuation.
The algorithm captures simple negations:
No one enjoys it.
no one_NEG enjoys_NEG it_NEG .
It also handles long-distance effects:
I don't think I will enjoy it: it might be too spicy.
i don't think_NEG i_NEG will_NEG enjoy_NEG it_NEG : it might be too spicy .
The algorithm has literally turned enjoy into two tokens: enjoy outside of the scope of negation, enjoy_NEG inside the scope of negation. Thus, we needn't stipulate a relationship between the negated and un-negated forms; the sentiment analysis system can learn how the two behave.
The above regular expression for clause boundaries does not include commas, which are often used for sub-clausal boundaries. However, some applications might benefit from the more conservative behavior that comes from treating the comma as a boundary. This would prevent neg-marking of but I might in the following example:
I don't think I will enjoy it, but I might.
Figure fig:tokenizer_accuracy_neg extends the tokenizer assessment we did earlier with data on the sentiment-aware tokenizer plus negation marking. As you can see, negation marking results in substantial gains at all amounts of training data.
Negation marking is very difficult to implement for the other tokenizer strategies, since they either fail to reliably separate punctuation (Whitespace) or they separate too much of it (Treebank). Thus, I've not tried to do negation marking on their output.
I should add also that negation marking is most important when the texts are short. For them, a single never very good might be the only indication of negativity, whereas longer texts tend to contain more sentiment cues and are thus less dependent on any single one.
Negation marking also helps substantially with out-of-domain testing, as we see in Figure fig:tokenizer_accuracy_neg_xtrain. Whereas the sentiment tokenizer alone only barely improves on the simpler methods, it leaps ahead of them when negation marking is added.
Scope marking is also effective for marking quotation and the effects of attitude reports like say, claim, etc., which are often used to create distance between the speaker's commitments and those of others:
For quotation, the strategy is to turn it on and off at quotation marks. To account for nesting, one can keep a counter (though nesting is rare). For attitude verbs, the strategy is the same as the one for negation: _REPORT marking between the relevant predicates and clause-level punctuation.
The only concern is that small data sets might not support the very large increase in vocabulary size that this marking will produce.
There are many cases in which a sentiment contrast exists between words that have the same string representation but different parts of speech. Table tab:pos provides a sample, using the Harvard General Inquirer's disjoint Positiv/Negativ classes to assess the polarity values.
This suggests that it might be worth the effort to run a part-of-speech tagger on your sentiment data and then use the resulting word–tag pairs as features or components of features.
I look to the Stanford Log-Linear Part-of-Speech Tagger (Toutanova, Klein, Manning, and Singer) for all my tagging needs. It is extremely well documented, it comes with sample Java code to facilitate incorporating it into other Java programs, it has a flexible command-line interface in case your existing system is in another programming language, and it is impressively fast (15000 words/second on a standard 2008 desktop). It is free for academic research and has very reasonably priced, ready-to-sign commercial licenses.
The only challenge is that the default tokenization is Treebank style. We saw in the tokenizing unit that this is a sub-optimal tokenizing strategy for sentiment applications.
However, it is straightforward to write new tokenizers to feed the POS-tagger. If you do this, there will be tokens that it handles incorrectly, but these errors might not affect your sentiment analysis.
To illustrate, let's work a bit with the following file, which I'll call foo.txt. The format is one sentence per line:
It is fine if you fine me, but don't rub it in! The play was a hit until someone hit the lead actor. :-( I found it to be booooring!
Suppose we tokenize this using the sentiment tokenizer and then store it on a single line, space-separated, in a file called foo.tok. The output:
It is fine if you fine me , but don't rub it in ! The play was a hit until someone hit the lead actor . :-( I found it to be boooring !
Then the command-line call we want is as follows (I assume that you are in the same directory as the Part-of-Speech Tagger):
java -mx3000m -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger
-tagSeparator / -tokenize false -textFile foo.txt
The output (in the actual output, each example is on a single line with no blank lines between):
It/PRP is/VBZ fine/JJ if/IN you/PRP fine/VBP me/PRP ,/, but/CC don't/NN rub/VBP it/PRP in/IN !/. The/DT play/NN was/VBD a/DT hit/NN until/IN someone/NN hit/VBD the/DT lead/JJ actor/NN ./. :-(/NN I/PRP found/VBD it/PRP to/TO be/VB boooring/VBG !/.
Because the Treebank-style turns contractions into two tokens, the tagger gets the tag for can't wrong (it wants to see ca and n't). This can have detrimental effects on the surrounding tags, so it is perhaps worth pursuing a blend of the sentiment and Treebank styles if POS-tagging is going to be a large part of your system. However, the errors might not be problematic, since they seem to be very consistent.
For more on using the tagger efficiently from the command-line, see Matt Jockers' tutorial.
Figure fig:pos_accuracy uses our familiar classifier assessment to see what POS-tagging contributes. The performance is not all that different from the regular sentiment-aware tokenization, but we shouldn't conclude from this that POS-tagging is not valuable. Wider testing might reveal that there are situations in which it helps substantially.
Dependency parsing transforms a sentence into a quasi-semantic structure that can be extremely useful for extracting sentiment information, particularly where the goal is to relativize the sentiment information to particular entities or topics.
The Stanford Parser (Klein and Manning 2003a, b) can map raw strings, tokenized strings, or POS-tagged strings, to dependency structures (and others). I focus here on the Stanford Dependencies (de Marneffe, Manning, and MacCartney 2006).
There isn't space here to review dependency parsing in detail, but a few simple examples probably suffice to convey how this tool can be used in sentiment analysis.
I put the output of our tagging sample above into a file called foo.tagged and then ran the following command from inside the parser's distribution directory:
java -mx3000m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-tokenized -tagSeparator /
The output is a list of sequences of dependecy-graph edges, with indices keeping track of linear order:
nsubj(fine-3, It-1) cop(fine-3, is-2) mark(fine-6, if-4) nsubj(fine-6, you-5) advcl(fine-3, fine-6) dobj(fine-6, me-7) nsubj(rub-11, don't-10) conj_but(fine-3, rub-11) dobj(rub-11, it-12) prep(rub-11, in-13) det(play-2, The-1) nsubj(hit-5, play-2) cop(hit-5, was-3) det(hit-5, a-4) mark(hit-8, until-6) nsubj(hit-8, someone-7) advcl(hit-5, hit-8) det(actor-11, the-9) amod(actor-11, lead-10) dobj(hit-8, actor-11) nsubj(found-2, I-1) nsubj(boooring-6, it-3) aux(boooring-6, to-4) aux(boooring-6, be-5) xcomp(found-2, boooring-6)
Figure fig:dep provides the graphical structure for these examples:
Such dependency can isolate not only what the sentiment of a text is but also where that sentiment is coming from and whom it is directed at.
Like the POS-tagger, the Parser is free for academic use and has reasonably priced, ready-to-sign commercial licenses.