Sentiment Symposium Tutorial: Tokenizing

  1. Overview
  2. Basic text normalization
  3. Whitespace tokenizer
  4. Treebank-style
  5. Sentiment-aware tokenizer
    1. Emoticons
    2. Twitter mark-up
    3. Informative HTML tags
    4. Masked curses
    5. Additional punctuation
    6. Capitalization
    7. Lengthening
    8. Multi-word expressions
    9. Putting the pieces together
  6. Evaluation
    1. Classification accuracy
    2. Speed
  7. Summary of conclusions


Tokenizing (splitting a string into its desired constituent parts) is fundamental to all NLP tasks.

There is no single right way to do tokenization. The right algorithm depends on the application.

I suspect that tokenization is even more important in sentiment analysis than it is in other areas of NLP, because sentiment information is often sparsely and unusually represented — a single cluster of punctuation like >:-( might tell the whole story.

The next few subsections define and illustrate some prominent tokenization strategies. To get a feel for how they work, I illustrate with the following invented tweet-like text:

@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D

Demo Experiment with the tokenizers discussed here using your own text or a live Twitter stream:
Implementation Code for a basic but extensible sentiment tokenizer:

Basic text normalization

I assume throughout that the text has gone through the following preprocessing steps:

  1. All HTML and XML mark-up has been identified and isolated.
  2. HTML character entities like &lt; and &#60; have been mapped to their Unicode counterparts (< for both of my examples).

These steps should be taken in order, so that one distinguishes the token <sarcasm>, which is frequently written out as part of a text, from true HTML mark-up (which is not seen directly but which can play a role in tokenization, as discussed below).
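The ordering of the two steps can be sketched in Python as follows. This is only an illustration, not the tutorial's own code: `TAG` and `normalize` are hypothetical names, and the tag pattern is a deliberately crude stand-in for a real HTML parser. Mark-up is isolated first, and entities are decoded afterwards with the standard library's `html.unescape`, so a written-out `&lt;sarcasm&gt;` ends up as text rather than being mistaken for a tag.

```python
import html
import re

# Crude tag pattern (an assumption for this sketch): "<", anything
# other than angle brackets, then ">".
TAG = re.compile(r"(<[^<>]+>)")

def normalize(raw):
    """Step 1: isolate mark-up. Step 2: decode entities in the text only."""
    pieces = []
    for chunk in TAG.split(raw):
        if not chunk:
            continue
        if TAG.fullmatch(chunk):
            pieces.append(("markup", chunk))
        else:
            pieces.append(("text", html.unescape(chunk)))
    return pieces

print(normalize("<b>so</b> happy &gt;:-D &lt;sarcasm&gt;"))
```

Because decoding happens after the split, the decoded `<sarcasm>` stays inside a text piece.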

These preprocessing steps affect just the emoticon in the sample text, where the entity &gt; in the raw source is mapped to >:

@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D

Whitespace tokenizer

The whitespace tokenizer simply downcases the string and splits the text on any sequence of space, tab, or newline characters:

@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D

@sentimentsymp: can't wait for the nov 9 #sentiment talks! yaaaaaay!!! >:-d
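A whitespace tokenizer of this kind takes a single line of Python. The function name `whitespace_tokenize` is just a label chosen for this sketch; `str.split()` with no arguments splits on any run of whitespace.

```python
def whitespace_tokenize(text):
    """Downcase, then split on any run of spaces, tabs, or newlines."""
    return text.lower().split()

sample = "@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D"
print(" ".join(whitespace_tokenize(sample)))
```

Note that punctuation stays glued to its neighbors (talks!, yaaaaaay!!!) and that downcasing also changes the emoticon.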



Treebank-style

The Treebank-style tokenizer is the one used by the Penn Treebank and many other important large-scale corpora for NLP. Thus, it is a de facto standard. This alone makes it worth considering, since it can facilitate the use of other tools.

@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D

@ SentimentSymp : ca n't wait for the Nov 9 # Sentiment talks ! YAAAAAAY ! ! ! &gt; ; : -D http : // .

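Some drawbacks to the Treebank style for sentiment are visible in the output above: the @ and # mark-up is split off from the username and hashtag, and the emoticon is broken into separate pieces.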

Sentiment-aware tokenizer

I now review some of the major aspects of a sentiment-aware tokenizer. You are likely to want to tailor these suggestions to your own data and applications.



Emoticons

Emoticons are extremely common in many forms of social media, and they are reliable carriers of sentiment.

The following regular expression captures 96% of the emoticon tokens occurring on Twitter, as estimated by the InfoChimps Smileys Census. (It captures just 36% of the emoticon types, but most are extremely rare and highly confusable with other chunks of text, so I've not tried to capture them.)

[<>]?                       # optional hat/brow
[:;=8]                      # eyes
[\-o\*\']?                  # optional nose
[\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth      
|                           #### reverse orientation
[\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
[\-o\*\']?                  # optional nose
[:;=8]                      # eyes
[<>]?                       # optional hat/brow
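One way to try this pattern out is to compile it with Python's `re.VERBOSE` flag, which ignores the layout whitespace and the `#` comments. The sketch below adds two things that are not part of the pattern above, both assumptions of mine: the name `EMOTICON`, and a pair of lookarounds requiring the match to be delimited by whitespace. Without them, a plain left-to-right search finds `p:` inside `@SentimentSymp:`.

```python
import re

EMOTICON = re.compile(r"""
    (?<!\S)                          # assumption: starts at a token boundary
    (?:
      [<>]?                          # optional hat/brow
      [:;=8]                         # eyes
      [\-o\*\']?                     # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
      |                              # reverse orientation
      [\)\]\(\[dDpP/\:\}\{@\|\\]     # mouth
      [\-o\*\']?                     # optional nose
      [:;=8]                         # eyes
      [<>]?                          # optional hat/brow
    )
    (?!\S)                           # assumption: ends at a token boundary
""", re.VERBOSE)

sample = "@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D"
print(EMOTICON.findall(sample))
print(EMOTICON.findall("great :-) but (-: meh"))
```

In a full tokenizer the boundary checks may be unnecessary, since earlier alternatives (such as a username pattern) can consume the surrounding characters first.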

Twitter mark-up

Twitter includes topic and user mark-up that is useful for advanced sentiment modeling. Your tokenizer should capture this mark-up if you are processing Twitter data.



Usernames begin with @ (as in @SentimentSymp) and hashtags (topics) begin with # (as in #Sentiment).
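As a rough sketch, both kinds of Twitter mark-up can be captured with short regular expressions. The patterns below are my own simplified assumptions rather than the ones used in the tutorial's tokenizer: a marker character followed by word characters, with a lookbehind so that the @ in an e-mail address is not mistaken for a username.

```python
import re

# Assumed patterns: "@" or "#" followed by letters, digits, or underscores,
# and not preceded by a word character.
USERNAME = re.compile(r"(?<!\w)@\w+")
HASHTAG = re.compile(r"(?<!\w)#\w+")

sample = "@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D"
print(USERNAME.findall(sample))
print(HASHTAG.findall(sample))
```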


Informative HTML tags

Basic HTML mark-up like the strong, b, em, and i tags can be indicators of sentiment. Their opening or closing elements can be treated as individual tokens. (Don't count them twice.)

Where sparseness is not an issue, informative tags can be seen as annotating all the words they contain. For strong, b, em, and i, I often capitalize the words inside the tags, to collapse them with words written in all caps for emphasis.

<strong>really bad idea</strong>
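The capitalization strategy can be sketched with a single substitution. The names `EMPHASIS` and `capitalize_emphasis` are hypothetical, and the pattern assumes well-formed, non-nested tags without attributes. The tags themselves are dropped, so the emphasis is counted only once.

```python
import re

# Matches <strong>, <b>, <em>, or <i> with its matching closing tag.
EMPHASIS = re.compile(r"<(strong|b|em|i)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

def capitalize_emphasis(text):
    """Replace each emphasized span with its upper-cased contents."""
    return EMPHASIS.sub(lambda m: m.group(2).upper(), text)

print(capitalize_emphasis("a <strong>really bad idea</strong>"))
```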



For domain-specific applications using Web data, it can be fruitful to study the mark-up:

  1. Sentiment information implicit in metadata tags like <span class="rating">2 of 5</span>
  2. Semantic mark-up like <span class="title">The Good, the Bad, and the Ugly</span>

Masked curses

Some websites change curses to sequences of asterisks, perhaps with letters at the edges (****, s***t). Similarly, some writers use random sequences of non-letter characters in place of swears ($#!@). Thus, there can be value in treating such sequences as tokens. (I tend to split apart sequences of exclamation points and question marks, though.)
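A pattern along the following lines could pick out both kinds of masked curse. It is a sketch under my own assumptions (the name `MASKED_CURSE`, the particular symbol set, and the minimum lengths are all illustrative). The lookahead requires at least one symbol other than an exclamation point, so a run like !!! is left alone to be split up later.

```python
import re

MASKED_CURSE = re.compile(
    r"[A-Za-z]*\*{2,}[A-Za-z]*"        # asterisk masking: ****, s***t
    r"|(?=!*[$#@%&])[$#@%&!]{3,}"      # symbol swearing: $#!@, but not !!!
)

print(MASKED_CURSE.findall("what the $#!@ is this s***t, **** it!!!"))
```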

Additional punctuation

Punctuation should be kept at the tokenization stage. We will shortly use it to identify further structure in the tokenized string. Thus, the goal for tokenizing is to properly distinguish various senses for the individual punctuation marks.

My basic strategy for handling punctuation is to try to identify all the word-internal marks first, so that any others can be tokenized as separate elements. Some considerations:

  1. We already tokenized a variety of things that involve word-internal punctuation: emoticons, Twitter and HTML mark-up, and masked curses.
  2. In general, sequences mixing only letters, numbers, apostrophes, single dashes (hyphens), and underscores are words.
  3. Sequences consisting entirely of digits, commas, and periods are likely to be numbers and so can be tokenized as words. Optional leading monetary signs and closing percentage signs are good to allow as well.
  4. Sequences of two or more periods are likely to be ellipsis dots and can be collapsed to ...

The remaining punctuation can be kept as separate words. By and large, this means question marks, exclamation points, and dollar signs without following digits. I find that it works well to tokenize sequences like !!! into three separate exclamation marks, and similarly for !?!? and the like, since the progression from ! to !! is somewhat additive.
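The considerations above can be approximated by one ordered alternation: numbers first, then ellipsis dots, then words with internal apostrophes or hyphens, and finally any leftover character as its own token. This is a minimal sketch, not the tutorial's implementation; `TOKEN` and `tokenize` are hypothetical names, and the number pattern only allows a leading dollar sign.

```python
import re

TOKEN = re.compile(r"""
    [$]?\d+(?:[.,]\d+)*%?(?!\w)   # numbers, with optional $ and % signs
  | \.{2,}                        # ellipsis dots
  | \w+(?:['\-]\w+)*              # words with internal apostrophes or hyphens
  | \S                            # any other character is its own token
""", re.VERBOSE)

def tokenize(text):
    """Tokenize, collapsing any run of two or more periods to '...'."""
    tokens = TOKEN.findall(text)
    return ["..." if len(tok) > 1 and set(tok) == {"."} else tok
            for tok in tokens]

print(tokenize("Wow.... it costs $1,000.50, a 20% mark-up!!! Isn't that over-priced?!"))
```

The order of the alternatives matters: trying the number pattern before the word pattern keeps $1,000.50 together, while the final catch-all splits !!! into three tokens.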

At later stages, you might want to filter some punctuation, because its very high frequency can cause problems for some models. I advise not doing this filtering at the tokenization stage, though, since punctuation can be used to efficiently identify further structure.


Capitalization

Preserving capitalization across all words can result in unnecessary sparseness. Words written in all caps are generally worth preserving, though, as they tend to be acronyms or words people intended to emphasize, which correlates with sentiment information.
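A simple version of this policy: keep a token as written if it is entirely upper case, and downcase it otherwise. The helper name `normalize_case` is hypothetical, and treating single letters such as I as ordinary words (so they are downcased) is an assumption of this sketch.

```python
def normalize_case(token):
    """Keep all-caps tokens of length > 1; lowercase everything else."""
    if len(token) > 1 and token.isupper():
        return token
    return token.lower()

print([normalize_case(t) for t in ["Can't", "Nov", "YAAAY", "NLP", "I"]])
```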


Lengthening

Lengthening by character repetition is a reliable indicator of heightened emotion. In English, sequences of three or more identical letters in a row are basically unattested in the standard lexicon, so such sequences are very likely to be lengthening.

The amount of lengthening is not predictable, and small differences are unlikely to be meaningful. Thus, it is effective to map sequences of length 3 or greater to sequences of length 3:
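This mapping can be written as a single substitution with a backreference. The sketch below restricts it to ASCII letters, which is my assumption; the names `LENGTHENING` and `shorten` are likewise only illustrative.

```python
import re

# A letter followed by two or more copies of itself.
LENGTHENING = re.compile(r"([A-Za-z])\1{2,}")

def shorten(text):
    """Map any run of 3+ identical letters to exactly 3."""
    return LENGTHENING.sub(r"\1\1\1", text)

print(shorten("YAAAAAAY"))
print(shorten("sooo goooood"))
```

Words with ordinary double letters, such as cool, are untouched because the pattern needs at least three identical letters in a row.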





Multi-word expressions

Even in English, whitespace is only a rough approximation of token-hood in the relevant sense:

  1. Named entities
  2. Phone numbers
  3. Dates
  4. Idioms like out of this world
  5. Multi-word expressions like absolutely amazing

The basic strategy is to tokenize these greedily, first, and then proceed to substrings, so that, for example, November 9 is treated as a single token, whereas an isolated occurrence of November is tokenized on its own.
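The greedy strategy can be sketched as a longest-match-first pass over an already tokenized text. Everything here is illustrative: `merge_expressions` is a hypothetical helper, the expressions are supplied as a small hand-made set of word tuples, and real dates would more likely be found with a pattern than with a lexicon.

```python
def merge_expressions(tokens, expressions, joiner="_"):
    """Greedily merge the longest known multi-word expression at each position."""
    max_len = max((len(e) for e in expressions), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in expressions:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            merged.append(tokens[i])   # no expression starts here
            i += 1
    return merged

EXPRESSIONS = {("out", "of", "this", "world"), ("November", "9")}
print(merge_expressions("the food was out of this world on November 9".split(), EXPRESSIONS))
print(merge_expressions("see you in November".split(), EXPRESSIONS))
```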

If one starts including n-grams like really good as tokens, it is hard to know where to stop. For large enough collections, bigram or even trigram features might be included (in which case you can tokenize without paying attention to these phrases). For smaller collections, some of the mark-up strategies discussed later on can approximate such information (and often prove more powerful).

Putting the pieces together

The tokenizer that I use for sentiment seeks to isolate as much sentiment information as possible, and it also identifies and normalizes dates, URLs, phone numbers, and various kinds of digital address. These steps help to keep the vocabulary as small as possible, and they provide chances to identify sentiment in areas that would be overlooked by simpler tokenization strategies (July 4th, September 11).

Here's the output of my tokenizer on our sample text:

@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D

@sentimentsymp : can't wait for the Nov_09 #sentiment talks ! YAAAY ! ! ! >:-D .

The social-media mark-up is all left intact, the date is normalized, and YAAAAAAY has been put into a canonical elongated form.
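To show how the pieces might fit together, here is a compact sketch that combines the emoticon pattern given earlier with simplified patterns for Twitter mark-up, numbers, and words. It is not the tokenizer that produced the output above: all names and all patterns other than the emoticon one are my assumptions, and date, URL, and phone-number normalization are left out, so Nov 9 comes out as two tokens. Because the emoticon pattern is tried first and unchanged, it can also misfire on strings it was not meant for (for example, the :/ in a URL).

```python
import re

TOKEN = re.compile(r"""
    (?P<emoticon>
        [<>]?[:;=8][\-o\*\']?[\)\]\(\[dDpP/\:\}\{@\|\\]    # eyes-first orientation
      | [\)\]\(\[dDpP/\:\}\{@\|\\][\-o\*\']?[:;=8][<>]?    # mouth-first orientation
    )
  | @\w+                           # Twitter username
  | \#\w+                          # hashtag
  | [$]?\d+(?:[.,]\d+)*%?(?!\w)    # number
  | \w+(?:['\-]\w+)*               # word
  | \S                             # leftover punctuation, one mark per token
""", re.VERBOSE)

LENGTHENING = re.compile(r"([A-Za-z])\1{2,}")

def tokenize(text):
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group()
        if match.lastgroup != "emoticon":       # emoticons keep their exact form
            token = LENGTHENING.sub(r"\1\1\1", token)
            if not (len(token) > 1 and token.isupper()):
                token = token.lower()           # preserve only all-caps words
        tokens.append(token)
    return tokens

sample = "@SentimentSymp: can't wait for the Nov 9 #Sentiment talks! YAAAAAAY!!! >:-D"
print(" ".join(tokenize(sample)))
```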



Evaluation

How important is careful tokenization for sentiment? Is it worth the extra resources? I now address these questions with some experimental data concerning classifier accuracy and tokenization speed.

Classification accuracy

The set-up for my first classifier accuracy experiment is as follows:

  1. I randomly selected 12,000 OpenTable reviews. The set was balanced in the sense that 6000 were positive (4-5 stars) and 6000 were negative (1-2 stars).
  2. The classifier was a maximum entropy model. The features were all the word-level features determined by the tokenizing function in question.
  3. The amount of training data is likely to be a major factor in performance, so I tested at training sizes from 250 texts to 6000 texts, in increments of 250.
  4. At each training-set size N, I performed 10 rounds of cross-validation with random splits: in each of the 10 runs, the data were randomly split into a training set of size N and a testing set of size 6000. The accuracy results from these 10 runs were averaged.
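The sampling procedure in steps 3 and 4 can be sketched as a small harness. This is an assumption-laden outline rather than the code behind the experiments: `learning_curve` is a hypothetical name, and `train_and_score` stands in for whatever function trains the classifier on the training texts (tokenized with the tokenizer under test) and returns its accuracy on the test texts.

```python
import random
from statistics import mean

def learning_curve(data, train_and_score, sizes, test_size, runs=10, seed=0):
    """Average accuracy over `runs` random train/test splits per training size."""
    rng = random.Random(seed)
    results = {}
    for n in sizes:
        scores = []
        for _ in range(runs):
            shuffled = rng.sample(data, len(data))    # shuffled copy
            train = shuffled[:n]
            test = shuffled[n:n + test_size]          # disjoint from train
            scores.append(train_and_score(train, test))
        results[n] = mean(scores)
    return results

# Training sizes from 250 to 6000 in steps of 250, as in the set-up above.
sizes = list(range(250, 6001, 250))
```

With 12,000 reviews, the largest setting (N = 6000 with a test set of 6000) uses every review exactly once per run.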

Figure fig:tokenizer_accuracy reports the results of these experiments.

Figure fig:tokenizer_accuracy
Assessing tokenization algorithms via classification.

In addition, I ran a version of the above experiment where the testing data were drawn, not from the same source as the training data, but rather from another corpus with the same kind of star-rating mark-up — in this case, 6000 user-supplied IMDB reviews. Such out-of-domain testing provides insight into how portable the classifier model is.

Figure fig:tokenizer_accuracy_xtrain summarizes the results from this experiment. Overall, the performance is less good and more volatile, but sentiment-aware tokenization is still the best option.

Figure fig:tokenizer_accuracy_xtrain
Assessing tokenization algorithms via classification: out-of-domain testing on IMDB reviews.

My take-away message is that careful tokenization pays off, especially where there is relatively little training data available. Where there is a lot of training data, tokenization matters less, since there is enough data for the model to learn that, e.g., happy and happy," are basically the same token, and it becomes less important to capture the effects of any particular word, emoticon, etc.


Speed

As tokenizers get more complicated, they of necessity become less efficient. Table tab:tokenizer_speed illustrates. For many applications, this is not a problem, but it can be a pressing issue if real-time results are needed.

The Catch-22 is that the really fast tokenizers require a lot more data to perform well, whereas the slow tokenizers perform well with limited data.

Tokenization is easily parallelized, so the effects of the slow-down can be mitigated by good infrastructure.

Table tab:tokenizer_speed
Tokenizer speed for 12,000 OpenTable reviews. The numbers are averages for 100 rounds. The average review length is about 50 words.
Tokenizer     Total time (secs)    Average secs/text
Whitespace    1.305                0.0001
Treebank      9.085                0.001
Sentiment     29.915               0.002
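Timings of this kind can be collected with a small harness like the one below. It is a sketch, not the code used to produce the table: `time_tokenizer` is a hypothetical helper that runs a tokenizing function over all texts for a number of rounds and reports the average total time per round and the average time per text.

```python
import time

def time_tokenizer(tokenize, texts, rounds=100):
    """Return (average total seconds per round, average seconds per text)."""
    start = time.perf_counter()
    for _ in range(rounds):
        for text in texts:
            tokenize(text)
    total = (time.perf_counter() - start) / rounds
    return total, total / len(texts)

# Example with the simplest possible tokenizer and a toy corpus.
total, per_text = time_tokenizer(lambda s: s.lower().split(), ["so good !!!"] * 1000, rounds=5)
print(f"{total:.4f} secs total, {per_text:.6f} secs/text")
```

Absolute numbers will vary with hardware and implementation, so only the relative ordering of tokenizers is likely to carry over.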

Summary of conclusions

  1. Good tokenizer design will pay off, especially where the amount of training data is limited.
    1. Increased classifier effectiveness
    2. Increased model portability
  2. Where there is a large amount of data, careful tokenizing is less important.
  3. Good tokenizer design might be especially important for sentiment analysis, where a lot of information is encoded in punctuation and non-standard words.
  4. Good tokenizing takes time, which might be an issue for real-time interactive systems. However, the times involved are unlikely to be prohibitive for off-line systems: tokenization is easily parallelized and can be optimized based on known properties of the text.