Tokenizing (splitting a string into its desired constituent parts) is fundamental to all NLP tasks.
There is no single right way to do tokenization. The right algorithm depends on the application.
I suspect that tokenization is even more important in sentiment analysis than it is in other areas of NLP, because sentiment information is often sparsely and unusually represented — a single cluster of punctuation like >:-( might tell the whole story.
The next few subsections define and illustrate some prominent tokenization strategies. To get a feel for how they work, I illustrate with the following invented tweet-like text:
@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D http://sentimentsymposium.com/.
I assume throughout that the text has gone through the following preprocessing steps:
These steps should be taken in order, so that one distinguishes the token <sarcasm>, which is frequently written out as part of a text, from true HTML mark-up (which is not seen directly but which can play a role in tokenization, as discussed below).
These preprocessing steps affect just the emoticon in the sample text:
@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D http://sentimentsymposium.com/.
The whitespace tokenizer simply downcases the string and splits it on any sequence of space, tab, or newline characters:
@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D http://sentimentsymposium.com/.
@sentimentsymp: can't wait for the nov 9 #sentiment talks! yaaaaaay!!! >:-d http://sentimentsymposium.com/.
Observations:

- Lowercasing turns the emoticon >:-D into the non-canonical form >:-d.
- Punctuation stays attached to the preceding word, so talks! and yaaaaaay!!! end up as different tokens from talks and yaaaaaay.
- The colon stays glued to the username (@sentimentsymp:), and the final period stays glued to the URL.
- On the plus side, the username, hashtag, emoticon, and URL all survive as single (if slightly noisy) tokens.
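For concreteness, here is a minimal sketch of a whitespace tokenizer of this kind in Python (my own illustration, not any particular library's implementation):

    def whitespace_tokenize(text):
        # Downcase, then split on any run of space, tab, or newline characters.
        return text.lower().split()

    whitespace_tokenize("@SentimentSymp: can't wait for the Nov 9 #Sentiment talks!")
    # ['@sentimentsymp:', "can't", 'wait', 'for', 'the', 'nov', '9', '#sentiment', 'talks!']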
Treebank-style tokenization is the scheme used by the Penn Treebank and many other important large-scale corpora for NLP, which makes it a de facto standard. This alone makes it worth considering, since it can facilitate the use of other tools.
@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D http://sentimentsymposium.com/.
@ SentimentSymp : ca n't wait for the Nov 9 # Sentiment talks ! YAAAAAAY ! ! ! > : -D http : //sentimentsymposium.com/ .
Some drawbacks to the Treebank style for sentiment:

- The @ and # are split off of the username and hashtag, destroying that mark-up.
- The emoticon >:-D is broken into pieces that mingle with ordinary punctuation.
- The URL is mangled (http : //sentimentsymposium.com/).
I now review some of the major aspects of a sentiment-aware tokenizer. You are likely to want to tailor these suggestions to your own data and applications.
Emoticons are extremely common in many forms of social media, and they are reliable carriers of sentiment.
The following regular expression captures 96% of the emoticon tokens occurring on Twitter, as estimated by the InfoChimps Smileys Census. (It captures just 36% of the emoticon types, but most are extremely rare and highly confusable with other chunks of text, so I've not tried to capture them.)
    [<>]?                        # optional hat/brow
    [:;=8]                       # eyes
    [\-o\*\']?                   # optional nose
    [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
    |                            #### reverse orientation
    [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
    [\-o\*\']?                   # optional nose
    [:;=8]                       # eyes
    [<>]?                        # optional hat/brow
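In Python, the pattern above can be compiled directly with re.VERBOSE, which permits the comments and layout. A sketch (the variable name emoticon_re is just my label):

    import re

    # The emoticon grammar above, transcribed as a verbose regular expression.
    emoticon_re = re.compile(r"""
        (?:
          [<>]?                        # optional hat/brow
          [:;=8]                       # eyes
          [\-o\*\']?                   # optional nose
          [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
        |                              # reverse orientation
          [\)\]\(\[dDpP/\:\}\{@\|\\]   # mouth
          [\-o\*\']?                   # optional nose
          [:;=8]                       # eyes
          [<>]?                        # optional hat/brow
        )""", re.VERBOSE)

    emoticon_re.findall("can't wait! >:-D :) 8-| :(")
    # ['>:-D', ':)', '8-|', ':(']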
Twitter includes topic and user mark-up that is useful for advanced sentiment modeling. Your tokenizer should capture this mark-up if you are processing Twitter data.
Usernames:
@+[\w_]+
Hashtags (topics):
\#+[\w_]+[\w\'_\-]*[\w_]+
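A quick sketch of how these two patterns behave on the sample text (plain re calls; nothing here beyond the patterns already given):

    import re

    username_re = re.compile(r"@+[\w_]+")
    hashtag_re  = re.compile(r"\#+[\w_]+[\w\'_\-]*[\w_]+")

    text = "@SentimentSymp: can't wait for the Nov 9 #Sentiment talks!"
    username_re.findall(text)   # ['@SentimentSymp']
    hashtag_re.findall(text)    # ['#Sentiment']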
Basic HTML mark-up like the strong, b, em, and i tags can be indicators of sentiment. Their opening or closing elements can be treated as individual tokens. (Don't count them twice.)

Where sparseness is not an issue, informative tags can be seen as annotating all the words they contain. For strong, b, em, and i, I often capitalize the words they contain, to collapse them with words written in all caps for emphasis:

<strong>really bad idea</strong>

becomes

REALLY BAD IDEA
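A minimal sketch of this capitalization step, assuming the mark-up has been preserved up to this point (the function name emphasize and the exact tag list are my own choices):

    import re

    def emphasize(html):
        # Uppercase the material inside <strong>, <b>, <em>, and <i> elements,
        # collapsing it with words written in all caps for emphasis.
        return re.sub(r"<(strong|b|em|i)>(.*?)</\1>",
                      lambda m: m.group(2).upper(),
                      html,
                      flags=re.I | re.S)

    emphasize("<strong>really bad idea</strong>")   # 'REALLY BAD IDEA'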
For domain-specific applications using Web data, it can be fruitful to study the mark-up:
Some websites change curses to sequences of asterisks, perhaps with letters at the edges (****, s***t). Similarly, some writers use random sequences of non-letter characters in place of swears ($#!@). Thus, there can be value in treating such sequences as tokens. (I tend to split apart sequences of exclamation points and question marks, though.)
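One hedged way to capture such masked curses is a pattern like the following. The symbol inventory is my own guess and should be tuned to the data; note that, as written, it would also swallow bare runs of ! and ?, so it needs to be ordered carefully relative to the punctuation rules below.

    import re

    # Hypothetical pattern: a run of two or more masking symbols,
    # optionally flanked by letters, kept as a single token.
    masked_curse_re = re.compile(r"[A-Za-z]*[*$%&#@!]{2,}[A-Za-z]*")

    masked_curse_re.findall("what a s***t show, total ****")
    # ['s***t', '****']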
Punctuation should be kept at the tokenization stage. We will shortly use it to identify further structure in the tokenized string. Thus, the goal for tokenizing is to properly distinguish various senses for the individual punctuation marks.
My basic strategy for handling punctuation is to try to identify all the word-internal marks first, so that any others can be tokenized as separate elements. Some considerations:

- Punctuation inside emoticons (>:-D), URLs, email addresses, dates, and phone numbers is word-internal, so those patterns should be matched before any general punctuation rules apply.
- Apostrophes in contractions like can't and hyphens inside words are also word-internal.
The remaining punctuation can be kept as separate words. By and large, this means question marks, exclamation points, and dollar signs without following digits. I find that it works well to tokenize sequences like !!! into three separate exclamation marks, and similarly for !?!? and the like, since the progression from ! to !! is somewhat additive.
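A small sketch of this policy (my own illustration): keep word-internal apostrophes and hyphens attached, and break everything else, including runs of ! and ?, into individual marks:

    import re

    # Words may contain internal apostrophes and hyphens; any other mark
    # becomes its own token. (Digits, URLs, etc. are ignored here for brevity.)
    punct_aware_re = re.compile(r"[a-z]+(?:['\-][a-z]+)*|[^\w\s]", re.I)

    punct_aware_re.findall("can't wait!!! really?!")
    # ["can't", 'wait', '!', '!', '!', 'really', '?', '!']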
At later stages, you might want to filter some punctuation, because its very high frequency can cause problems for some models. I advise not doing this filtering at the tokenization stage, though, as it can be used to efficiently identify further structure.
Preserving capitalization across all words can result in unnecessary sparseness. Words written in all caps are generally worth preserving, though, as they tend to be acronyms or words people intended to emphasize, which correlates with sentiment information.
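A sketch of that policy in Python (my own helper, not from any particular toolkit): downcase everything except tokens written entirely in caps:

    def normalize_case(token):
        # Preserve all-caps tokens (acronyms, emphasis); downcase the rest.
        if token.isupper() and len(token) > 1:
            return token
        return token.lower()

    [normalize_case(t) for t in ["YAAAY", "Nov", "talks", "USA"]]
    # ['YAAAY', 'nov', 'talks', 'USA']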
Lengthening by character repetition is a reliable indicator of heightened emotion. In English, sequences of three or more identical letters in a row are basically unattested in the standard lexicon, so such sequences are very likely to be lengthening.
The amount of lengthening is not predictable, and small differences are unlikely to be meaningful. Thus, it is effective to map sequences of length 3 or greater to sequences of length 3:
Rewrite:
(.)\1{2,}
as
\1\1\1
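In Python, this is a single re.sub call; note that the output matches the YAAAY form in the sample output below:

    import re

    def shorten_lengthening(text):
        # Map runs of three or more identical characters down to exactly three.
        return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

    shorten_lengthening("YAAAAAAY!!!")   # 'YAAAY!!!'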
Even in English, whitespace is only a rough approximation of token-hood in the relevant sense: some tokens, like the date November 9, span multiple whitespace-delimited words. The basic strategy is to tokenize these greedily, first, and then proceed to substrings, so that, for example, November 9 is treated as a single token, whereas an isolated occurrence of November is tokenized on its own.
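A toy illustration of the greedy ordering (the date pattern here is a deliberately tiny stand-in, not a real date grammar): because the multi-word branch comes first in the alternation, November 9 comes out as one token while a bare November falls through to the word branch:

    import re

    greedy_re = re.compile(r"(?:November\s+\d{1,2})|(?:[A-Za-z]+)")

    greedy_re.findall("the November 9 talks, next November")
    # ['the', 'November 9', 'talks', 'next', 'November']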
If one starts including n-grams like really good as tokens, it is hard to know where to stop. For large enough collections, bigram or even trigram features might be included (in which case you can tokenize without paying attention to these phrases). For smaller collections, some of the mark-up strategies discussed later on can approximate such information (and often prove more powerful).
The tokenizer that I use for sentiment seeks to isolate as much sentiment information as possible, and it also identifies and normalizes dates, URLs, phone numbers, and various kinds of digital address. These steps help to keep the vocabulary as small as possible, and they provide chances to identify sentiment in areas that would be overlooked by simpler tokenization strategies (July 4th, September 11).
Here's the output of my tokenizer on our sample text:
@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D http://sentimentsymposium.com/.
@sentimentsymp : can't wait for the Nov_09 #sentiment talks ! YAAAY ! ! ! >:-D http://sentimentsymposium.com/ .
The social-media mark-up is all left intact, the date is normalized, and YAAAAAAY has been put into a canonical elongated form.
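To tie the pieces together, here is a scaled-down sketch of a tokenizer in this spirit; the real thing also normalizes dates, phone numbers, and other addresses, but the core idea is simply an ordered alternation in which the most specific patterns are tried first:

    import re

    TOKEN_RE = re.compile(r"""
        (?:https?://\S+)                                       # URLs
      | (?:@[\w_]+)                                            # usernames
      | (?:\#+[\w_]+[\w\'_\-]*[\w_]+)                          # hashtags
      | (?:[<>]?[:;=8][\-o\*\']?[\)\]\(\[dDpP/\:\}\{@\|\\])    # emoticons
      | (?:[a-z]+(?:['\-_][a-z]+)*)                            # words, with internal ' - _
      | (?:\S)                                                 # anything else, one mark at a time
        """, re.VERBOSE | re.IGNORECASE)

    TOKEN_RE.findall("@SentimentSymp: can't wait! >:-D http://sentimentsymposium.com/")
    # ['@SentimentSymp', ':', "can't", 'wait', '!', '>:-D', 'http://sentimentsymposium.com/']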
How important is careful tokenization for sentiment? Is it worth the extra resources? I now address these questions with some experimental data concerning classifier accuracy and tokenization speed.
My first classifier accuracy experimental set-up is as follows:
Figure fig:tokenizer_accuracy reports the results of these experiments.
In addition, I ran a version of the above experiment where the testing data were drawn, not from the same source as the training data, but rather from another corpus with the same kind of star-rating mark-up — in this case, 6000 user-supplied IMDB reviews. Such out-of-domain testing provides insight into how portable the classifier model is.
Figure fig:tokenizer_accuracy_xtrain summarizes the results from this experiment. Overall, performance is worse and more volatile, but sentiment-aware tokenization is still the best option.
My take-away message is that careful tokenization pays off, especially where there is relatively little training data available. Where there is a lot of training data, tokenization matters less, since there is enough data for the model to learn that, e.g., happy and happy," are basically the same token, and it becomes less important to capture the effects of any particular word, emoticon, etc.
As tokenizers get more complicated, they of necessity become less efficient. Table tab:tokenizer_speed illustrates. For many applications, this is not a problem, but it can be a pressing issue if real-time results are needed.
The Catch-22 is that the really fast tokenizers require a lot more data to perform well, whereas the slow tokenizers perform well with limited data.
Tokenization is easily parallelized, so the effects of the slow-down can be mitigated by good infrastructure.
Tokenizer | Total time (secs) | Average secs/text |
---|---|---|
Whitespace | 1.305 | 0.0001 |
Treebank | 9.085 | 0.001 |
Sentiment | 29.915 | 0.002 |