Sentiment Symposium Tutorial: Context

  1. Overview
  2. Understanding the metadata
    1. Star rating imbalances
    2. Category relationships
    3. Metadata interpretation
    4. Perspective
  3. Factors correlated with overall assessment
    1. The more reviews the better
    2. Helpfulness ratings
    3. Text length
  4. Users
    1. Intra-author consistency
    2. Inter-author diversity
    3. User style
  5. Topic-relative analysis
  6. Long-suffering fans and thwarted expectations
  7. Additional predictive information
  8. Summary of conclusions

Overview

Sentiment is highly variable and context-dependent. The goal of this section is to highlight some ways in which you can improve system performance by embracing this fact.

Understanding the metadata

There can be enormous benefits to devoting time and resources to understanding your data and associated annotations before you start to build a sentiment model.

Star rating imbalances

Figure fig:imbalances summarizes one of the major challenges to sentiment systems trained on naturalistic annotations like star-ratings: it is almost invariably the case that some categories are vastly over-represented. (The lexicons section is largely about getting around this problem when building lexicons.)

Figure fig:imbalances
Imbalances in the distribution of texts relative to categories. For the ratings corpora (panels 1-4), positive dominates. For Experience Project, sympathy and solidarity dominate.
figures/rating-imbalances.png
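If you are training a classifier on such labels, one lightweight mitigation is to reweight the classes inversely to their frequency. The sketch below is a minimal illustration using scikit-learn; the `texts` and `ratings` inputs are assumed to be loaded elsewhere, and the specific model choice is incidental.

```python
# Minimal sketch: compensate for star-rating imbalance with class weights.
# `texts` and `ratings` are assumed to be parallel lists loaded elsewhere.
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def train_balanced(texts, ratings):
    """Fit a unigram classifier that reweights under-represented ratings."""
    print(Counter(ratings))  # inspect the imbalance before modeling
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)
    # class_weight="balanced" scales each class inversely to its frequency.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X, ratings)
    return vectorizer, clf
```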

Category relationships

For many applications, you will want to reduce the dimensionality of your label set.

If you have hundreds of labels, then consider building a word × label matrix using the techniques in the vectors section and then seeing if you can effectively combine columns.
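As a rough sketch of what that looks like, assuming you have raw texts paired with labels, the following builds a word × label count matrix with pandas and flags label pairs whose columns are nearly collinear and hence candidates for merging. The tokenizer and the correlation threshold are illustrative choices, not fixed recommendations.

```python
# Sketch: build a word-by-label count matrix and look for label columns
# that behave so similarly that they could be collapsed into one.
import re
from collections import Counter

import pandas as pd

def word_label_matrix(texts, labels):
    """Rows are words, columns are labels, cells are token counts."""
    counts = {}
    for text, label in zip(texts, labels):
        for word in re.findall(r"[a-z']+", text.lower()):
            counts.setdefault(word, Counter())[label] += 1
    return pd.DataFrame(counts).T.fillna(0)

def correlated_labels(matrix, threshold=0.95):
    """Label pairs whose word distributions are nearly collinear."""
    corr = matrix.corr()  # column-by-column Pearson correlations
    return [(a, b) for a in corr.columns for b in corr.columns
            if a < b and corr.loc[a, b] >= threshold]
```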

For smaller category sets like star ratings, you might want to measure the distances between the categories directly, to see whether natural divisions emerge. Figure fig:dist uses two probabilistic methods to make such comparisons for a five-star rating system. Strikingly, a natural division appears to be a three-way one that groups the entire middle of the scale against its edges.

Figure fig:dist
Category distances. The top panel gives the distances according to their chi-squared statistics (Kilgarriff and Rose 1998; Manning and Schütze 1999:171), and the bottom panel gives their KL-divergences. The orderings are the same in both cases, and they suggest that we might do well to treat the ratings scale as having three parts: the lowest of the low (1 star), the mushy middle (2-4 stars), and the best of the best (5 stars).
figures/rating-category-pair-distances.png
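As a rough sketch of how such distances can be computed, the code below estimates a symmetrized KL-divergence between the unigram distributions of two rating categories, using add-one smoothing over a shared vocabulary. The exact smoothing and weighting behind figure fig:dist may differ; treat this as an illustration of the idea.

```python
# Sketch: symmetric KL-divergence between the unigram distributions of
# two rating categories, with add-one smoothing over a shared vocabulary.
import math
from collections import Counter

def unigram_dist(tokens, vocab):
    counts = Counter(tokens)
    total = sum(counts.values()) + len(vocab)  # add-one smoothing
    return {w: (counts[w] + 1) / total for w in vocab}

def kl(p, q):
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def category_distance(tokens_a, tokens_b):
    """Symmetrized KL-divergence between two token lists."""
    vocab = set(tokens_a) | set(tokens_b)
    p, q = unigram_dist(tokens_a, vocab), unigram_dist(tokens_b, vocab)
    return 0.5 * (kl(p, q) + kl(q, p))
```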

Metadata interpretation

Although star-rating systems are by now widely understood, we might still worry that some users are confused. For example, in many parts of Europe, 1 is the best mark one can get in school, and this seems to lead some users to pick 1-star when giving very positive reviews. Figure fig:textrating is reassuring, though: it shows a high correlation between intuitive natural language phrases and star ratings.

Figure fig:textrating
Users who both answered the question on the x-axis and gave a rating (y-axis) provide a window into how people conceptualize the rating scale. The dark horizontal lines indicate the median ratings, with boxes surrounding 50% of the ratings. The outliers might trace back to people who treat the stars as a ranking system in which 1 is the best.
figures/ta-ratings-by-recommendation.png

Perspective

The star rating is primarily an indicator of the speaker/author vantage point. However, there is good reason to believe that readers/hearers are able to accurately recover the chosen star rating based on their reading of the text. Potts (2011) reports on an experiment conducted with Amazon's Mechanical Turk in which subjects were presented with 130-character reviews from OpenTable and asked to guess which rating the author of the text assigned (1-5 stars). Figure fig:mturk summarizes the results: the author's actual rating is on the x-axis, and the participants' guesses are on the y-axis. The responses have been jittered so that they don't lie atop each other. The plot also includes median responses (the black horizontal lines) and boxes surrounding 50% of the responses. The figure reveals that participants were able to guess with high accuracy which rating the author assigned: the median value is always the actual value, and nearly all subjects guessed within one star rating.

Figure fig:mturk
Experimental results suggesting that texts reliably convey to readers which star rating the author assigned.
figures/mturk-reader-ratings.png

Factors correlated with overall assessment

The environment in which the text was produced can impact sentiment in complex ways.

The more reviews the better

The observational fact is that the more reviews a product gets, the more consistent the ratings, and the more positive those ratings are. Figure fig:ratingproduct summarizes this pattern for a few different corpora.

There is an intuitive explanation for this: a few bad reviews are enough to stop people from buying, which in turn means fewer new reviews. Conversely, positive reviews stimulate people to buy, which in turn increases the reviewer pool. (Movies and video games are somewhat exceptional in this regard. Both are products that many people buy immediately, sight unseen.)

Figure fig:ratingproduct
Star ratings and product review counts. High ratings and high review counts go hand-in-hand (top panels), and this results in a narrowing of opinions (bottom panels). This is presumably because products with a few negative reviews are purchased less and hence reviewed less.
figures/ratingstats-by-product.png

Helpfulness ratings

Helpfulness ratings are predictors of sentiment, as seen in figure fig:helpful. I am not sure why this is, exactly, but it seems to be a very robust effect (Ghose, Ipeirotis, and Sundararajan 2007; Danescu-Niculescu-Mizil, Kossinets, Kleinberg, and Lee 2009). One factor that likely contributes is that online retailers tend to promote very positive reviews, which means they are read more than others.

Figure fig:helpful
Many sites have meta-data saying "X of Y people found this review helpful", where Y is not the number of views, but rather the total number of people who selected "helpful" or "unhelpful". These plots show that there is a correlation between star ratings and helpfulness ratings: the higher the star rating, the more helpful people find the review.
figures/ratings-by-helpfulness.png

Text length

Text length is another useful predictor (figure fig:lengths). Short texts tend to be highly emotive (positive or negative), whereas texts that weigh multiple perspectives tend to be longer, which is why middling ratings are associated with the longest reviews.

Figure fig:lengths
Mean review length for three corpora. Middle-of-the-road reviews tend to be longer, presumably because they are likely to balance evidence rather than simply broadcast a strongly held opinion.
figures/words-per-review.png

Users

Authorship follows a Zipfian pattern (figure fig:auth_fs): most authors contribute one or two texts, and a handful contribute huge numbers of texts. Where you have repeat authors, user-level modeling can sharpen the overall picture.

Figure fig:auth_fs
Authorship frequency spectrum.
figures/authorship-frequency-spectra.png

Intra-author consistency

Individual reviewers tend to give the same ratings repeatedly (figure fig:rev_sd).

Figure fig:rev_sd
Reviewer rating standard deviations
figures/rating-sd-by-author.png

This pattern probably arises because reviewers simply use different parts of the scale: some very kind reviewers never dip below three stars no matter how much they dislike the product, whereas others give out stars much more grudgingly. If this is the case, then it might pay off to z-score normalize the ratings of individual authors before you use them:

Definition: z-scores
The z-score for a score x is (x - μ)/σ, where μ is the mean of the population for x and σ is the standard deviation of the population for x.

The population for a score could be the set of a reviewer's scores, or the scores for a product or product class, or the entire corpus.
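A minimal sketch of per-author normalization, assuming the ratings live in a pandas DataFrame with `author` and `rating` columns (both names are placeholders):

```python
# Sketch: z-score each author's ratings against that author's own mean
# and standard deviation, so that "kind" and "harsh" raters are comparable.
import pandas as pd

def zscore_by_author(df, author_col="author", rating_col="rating"):
    """Add a column of per-author z-scored ratings to a pandas DataFrame."""
    grouped = df.groupby(author_col)[rating_col]
    mu = grouped.transform("mean")
    # Guard against one-review authors (std is NaN) and one-note raters (std is 0).
    sigma = grouped.transform("std").fillna(1).replace(0, 1)
    df = df.copy()
    df["rating_z"] = (df[rating_col] - mu) / sigma
    return df
```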

Inter-author diversity

Though individual raters tend to confine themselves to a small part of the scale, raters differ from each other quite considerably. Figure fig:rev_cmp supports this claim in a rather controlled way, by comparing nine users who all rated the same seven books on Amazon.

Figure fig:rev_cmp
Reviewer comparisons for nine reviewers who rated the same seven books on Amazon. They tend to agree on which half of the scale is the right one, but there is a lot of variation within that space.
figures/reviewer-comparisons.png

User style

Reviewers also use language differently from each other, so the more you can model these differences, the more accurate your analysis will be. Figure fig:damn illustrates with a rather intuitive example concerning how people use curses.

Figure fig:damn
A portrait of individual variation: damn as used by 16 different users in a collection of reviews from IMDB. The panels depict the estimates from a fitted multi-level model in which the intercept and all the predictors are allowed to vary by user. Some users swear only when happy, others only when sad, others at either extreme, and still others just whenever.
figures/multilevel-damn.png
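For readers who want to experiment with this kind of model, the sketch below fits a comparable (though not identical) multi-level model with statsmodels, letting both the intercept and the rating effect on "damn" usage vary by user. The DataFrame columns (`damn`, `rating`, `user`) are assumptions for illustration, and the linear specification is an approximation; the model behind figure fig:damn may use a different link function.

```python
# Sketch: a multi-level model in which the intercept and the effect of
# star rating on "damn" usage are both allowed to vary by user.
# Assumes a DataFrame `df` with columns: damn (count or 0/1 per review),
# rating (1-5), and user (author id).
import statsmodels.formula.api as smf

def fit_damn_model(df):
    model = smf.mixedlm("damn ~ rating", df, groups=df["user"],
                        re_formula="~rating")
    return model.fit()
```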

Topic-relative analysis

What you're talking about has a profound effect on the language you use. Thus, if you have topical information (forum name, product, product class, etc.), it should be included in your model. Figure fig:genre shows some quick, intuitive examples of variation by movie genre in IMDB reviews, using the techniques described in the review lexicon section.

Figure fig:genre
Genre effects from IMDB reviews.
figures/genre-laugh.png figures/genre-depressing.png figures/genre-sandler.png
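One lightweight way to let a model capture such topic effects is to conjoin each unigram with the topic label, so that, for example, "laugh" in a comedy review and "laugh" in a horror review receive separate weights. The feature scheme below is an illustration under that assumption, not the tutorial's own method.

```python
# Sketch: conjoin unigrams with a topic label (genre, forum, product class)
# so the model can learn topic-specific sentiment associations.
import re

def topic_features(text, topic):
    """Return plain unigram features plus topic-conjoined copies."""
    feats = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        feats[word] = feats.get(word, 0) + 1
        key = f"{topic}|{word}"
        feats[key] = feats.get(key, 0) + 1
    return feats

# Example: topic_features("what a laugh riot", "comedy") yields both a
# "laugh" feature and a "comedy|laugh" feature, which can get different weights.
```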

Similarly, confessions at the Experience Project are tagged with group information, and these provide valuable clues as to how to understand the language (figure fig:groups).

Figure fig:groups
Group effects in Experience Project confessions.
figures/groups-weird.png figures/groups-drunk.png

Some sites have aspect-level metadata which can facilitate learning specific topic-level associations. Unfortunately, in my experience, the aspect-level ratings tend to be highly correlated with each other and with the overall rating, which detracts from their usefulness. The situation on OpenTable, summarized in figure fig:aspect and figure fig:diff, is typical.

Figure fig:aspect
OpenTable rating distributions. Positive reviews dominate in all categories. Noise is fundamentally different, since it doesn't have a standard preference ordering.
figures/opentable-ratings.png
Figure fig:diff
Comparisons with Overall. In each panel, the overall rating value is subtracted from the other rating value. Thus, a value of 0 indicates agreement between the two ratings for the review in question.
figures/opentable-ratingdiff.png
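If you have aspect-level ratings of this kind, a quick way to gauge how much information they add is to look at their differences from the overall rating, as in figure fig:diff. Below is a small pandas sketch; the column names are assumptions based on the OpenTable categories.

```python
# Sketch: how much do aspect ratings diverge from the overall rating?
# Assumes a DataFrame with columns: overall, food, service, ambiance, noise.
import pandas as pd

def aspect_agreement(df, overall_col="overall"):
    """Fraction of reviews where each aspect rating equals the overall rating."""
    aspects = [c for c in df.columns if c != overall_col]
    diffs = df[aspects].sub(df[overall_col], axis=0)
    # A column that is almost always 0 adds little beyond the overall rating.
    return diffs.apply(lambda col: (col == 0).mean())
```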

Long-suffering fans and thwarted expectations

The ratio of positive to negative words, as defined by a quality sentiment lexicon, can be a misleading indicator of a text's true sentiment: extreme values are often instances of thwarted expectations, in which the author builds up positive expectations only to report that they were dashed (table tab:thwarted).

Table tab:thwarted
An example of thwarted expectations. This is a negative review. Inquirer positive terms are in blue, and Inquirer negative terms are in red. There are 20 positive terms and six negative ones, for a Pos:Neg ratio of 3.33.
i had been looking forward to this film since i heard about it early last year , when matthew perry had just signed on . i'm big fan of perry's subtle sense of humor , and in addition , i think chris farley's on-edge , extreme acting was a riot . so naturally , when the trailer for " almost heroes " hit theaters , i almost jumped up and down . a soda in hand , the lights dimming , i was ready to be blown away by farley's final starring role and what was supposed to be matthew perry's big breakthrough . i was ready to be just amazed ; for this to be among farley's best , in spite of david spade's absence . i was ready to be laughing my head off the minute the credits ran . sadly , none of this came to pass . the humor is spotty at best , with good moments and laughable one-liners few and far between . perry and farley have no chemistry ; the role that perry was cast in seems obviously written for spade , for it's his type of humor , and not at all what perry is associated with . and the movie tries to be smart , a subject best left alone when it's a farley flick . the movie is a major dissapointment , with only a few scenes worth a first look , let alone a second . perry delivers not one humorous line the whole movie , and not surprisingly ; the only reason the movie made the top ten grossing list opening week was because it was advertised with farley . and farley's classic humor is widespread , too . almost heroes almost works , but misses the wagon-train by quite a longshot . guys , let's leave the exploring to lewis and clark , huh ? stick to " tommy boy " , and we'll all be " friends " .

Suggestion: create a real-valued feature that is the Pos:Neg ratio if that ratio is below 1 (the lower quartile for the whole Pang-Lee data set) or above 1.76 (the upper quartile), and 1.31 (the median) otherwise. The goal is to single out "imbalanced" reviews as potentially untrustworthy at the level of their unigrams. (For a similar idea, see Pang, Lee, and Vaithyanathan 2002.)
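A minimal sketch of that feature, assuming `positive_terms` and `negative_terms` are sets drawn from a lexicon such as the Harvard General Inquirer; the quartile and median defaults are the Pang-Lee values quoted above and would need to be recomputed for other corpora.

```python
# Sketch: a Pos:Neg ratio feature that keeps only "imbalanced" values.
# Reviews whose ratio falls inside the interquartile range are mapped to
# the median, so only unusually skewed reviews stand out to the model.
import re

def thwarted_ratio(text, positive_terms, negative_terms,
                   lower=1.0, upper=1.76, median=1.31):
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in positive_terms)
    neg = sum(1 for w in words if w in negative_terms)
    ratio = pos / max(neg, 1)  # avoid division by zero
    return ratio if ratio < lower or ratio > upper else median
```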

Additional predictive information

The above is just a sample. Any contextual information you can get your hands on will be valuable. Here are some other sources that you're likely to be able to bring into a real-world system with relative ease.

  1. Length of time on the market
  2. The number of posts and the distribution of posting times
  3. Price relative to competitors/comparables
  4. Demographics (age, location, education and other carriers of social meaning)
  5. The product's depiction in advertising and official descriptions

Summary of conclusions

  1. Devote some time to understanding the properties of your annotations. This is always important for naturalistic data. But even if it's your own annotation scheme, it might harbor some significant latent structure.
  2. General meta-properties like the number of reviews for a product and the length of time the product has been on the market are likely to be sentiment indicators, due to social effects.
  3. Basic properties of the text (length, vocab size, etc.) can be significant predictors of sentiment.
  4. Bring in usernames as features. They are likely to carry highly predictive information.
  5. In a similar vein, if your data come from multiple corpora, that high-level source information should be included too.
  6. In a similar vein, any demographic information you have about the author will be valuable.
  7. Bring in topic/aspect level features (genre, topical group, etc.), as this is likely to profoundly affect sentiment information.