This section introduces some basic techniques from the study of vector-space models. In the present setting, the guiding idea behind such models is that important aspects of a word's meanings are latent in its distribution, that is, in its patterns of co-occurrence with other words.
I focus on word–word matrices, rather than the more familiar and widely used word–document matrices of information retrieval and much of computational semantics. The reason for this is simply that, on the data I was experimenting with, this design seemed to be delivering better results with fewer resources.
Turney and Pantel 2010 is a highly readable introduction to vector-space models, covering a wide variety of design choices and applications. I highly recommend it to anyone aiming to implement and deploy models like this.
A word–word matrix for a vocabulary V of size n is an n × n matrix in which both the rows and the columns represent the words in V. The value in cell (i, j) is the number of times that word i and word j appear together in the same context.
Table tab:wordword depicts a fragment of a word–word matrix derived from 12,000 OpenTable reviews, using the sentiment + negation tokenization scheme motivated earlier. Here, the count for (i, j) is the number of times that word i and word j appear together in the same review. With this matrix design, the row and column labels are identical, though they could be very different under other designs.
 | a | a_neg | able | able_neg | about | about_neg | above |
---|---|---|---|---|---|---|---|
a | 32620 | 4101 | 340 | 110 | 2066 | 551 | 122 |
a_neg | 4101 | 3286 | 31 | 41 | 312 | 142 | 21 |
able | 340 | 31 | 163 | 2 | 16 | 4 | 0 |
able_neg | 110 | 41 | 2 | 63 | 6 | 7 | 2 |
about | 2066 | 312 | 16 | 6 | 1177 | 37 | 9 |
about_neg | 551 | 142 | 4 | 7 | 37 | 365 | 1 |
above | 122 | 21 | 0 | 2 | 9 | 1 | 88 |
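To make this construction concrete, here is a minimal sketch of how such a matrix might be built from already-tokenized reviews. The function and variable names (`word_word_matrix`, `reviews`, `vocab`, `M`) are mine, and the counting convention used here is just one reasonable choice.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np

def word_word_matrix(reviews):
    """Build a word-word co-occurrence matrix from tokenized reviews
    (each review is a list of strings). Returns a sorted vocabulary
    list and an n x n NumPy array of counts. A pair of tokens is
    counted once per review-level co-occurrence; other counting
    conventions are equally reasonable."""
    counts = defaultdict(lambda: defaultdict(int))
    for review in reviews:
        for w1, w2 in combinations(review, 2):
            counts[w1][w2] += 1
            if w1 != w2:
                counts[w2][w1] += 1
    vocab = sorted(counts)
    M = np.array([[counts[w1][w2] for w2 in vocab] for w1 in vocab],
                 dtype=float)
    return vocab, M

# Toy usage with two tokenized "reviews":
vocab, M = word_word_matrix([["the", "food", "was", "excellent"],
                             ["the", "service", "was", "terrible"]])
print(M[vocab.index("the"), vocab.index("was")])   # 2.0
```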
We now want to measure how similar the row vectors are to one another. If we compare the raw count vectors (say, by using Euclidean distance), then the dominating factor will be overall corpus frequency, which is unlikely to be sentiment-relevant. Cosine similarity addresses this by comparing length-normalized vectors.
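For row vectors $u$ and $v$ of dimension $n$, the standard definition is

$$
\textbf{cosine}(u, v) \;=\; \frac{\sum_{i=1}^{n} u_i \, v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \; \sqrt{\sum_{i=1}^{n} v_i^{2}}}
$$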
In table tab:cosine, I picked two sentiment-rich target words, excellent and terrible, and compared them for cosine similarity against the whole vocabulary. The table gives the 20 closest words by this similarity measure. Though the lists contain more function words than we might like, they also contain rich sentiment words, some of them specific to the restaurant domain, which suggests that we might use this method to increase our lexicon size in a context-dependent manner.
excellent | terrible |
---|---|
excellent | terrible |
service | . |
food | was |
and | bad |
attentive | service |
. | food |
friendly | poor |
atmosphere | over |
delicious | there |
prepared | completely |
definitely | disappointed |
ambiance | just |
comfortable | about |
with | eating |
the | only |
choice | horrible |
well | almost |
recommended | seemed |
experience | even |
outstanding | probably |
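A minimal sketch of this kind of neighbor lookup, assuming `vocab` and `M` as returned by the `word_word_matrix` sketch above (both names are mine):

```python
import numpy as np

def cosine_neighbors(word, vocab, M, k=20):
    """Return the k words whose count vectors are closest to the row
    vector for `word` under cosine similarity."""
    i = vocab.index(word)
    target = M[i]
    norms = np.linalg.norm(M, axis=1) * np.linalg.norm(target)
    sims = M.dot(target) / np.maximum(norms, 1e-12)
    ranked = np.argsort(-sims)
    return [(vocab[j], round(float(sims[j]), 3)) for j in ranked[:k]]

# With the full OpenTable matrix, cosine_neighbors("excellent", vocab, M)
# would reproduce (up to ties) the left column of table tab:cosine.
```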
Singular value decomposition (SVD; see also the Latent Semantic Analysis of Deerwester, Dumais, Furnas, Landauer, and Harshman 1990) is a dimensionality reduction technique that has been shown to uncover a lot of underlying semantic structure. Table tab:svd shows the result of length normalizing the word–word matrix and then applying SVD to the result.
 | a | a_neg | able | able_neg | about | about_neg | above |
---|---|---|---|---|---|---|---|
a | -0.00012 | -0.00028 | -0.00109 | -0.00139 | -0.00279 | 0.00115 | 0.00078 |
a_neg | -0.00186 | 0.00077 | 0.00846 | 0.00823 | 0.00457 | 0.00099 | 0.00147 |
able | 0.01790 | -0.01113 | 0.01530 | -0.00409 | -0.01084 | -0.01161 | -0.01337 |
able_neg | 0.00560 | 0.00871 | 0.00003 | 0.00596 | 0.01057 | -0.01821 | -0.00289 |
about | 0.02486 | 0.01737 | -0.02056 | -0.01453 | 0.01422 | 0.01922 | 0.05323 |
about_neg | 0.00514 | 0.00212 | 0.00252 | 0.00847 | 0.01624 | 0.01034 | 0.00013 |
above | 0.00663 | -0.02498 | 0.01634 | 0.00680 | -0.00898 | 0.00609 | -0.00667 |
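A sketch of this step with NumPy, again assuming `vocab` and `M` from the earlier sketches; the number of retained dimensions `k` is a free parameter of my sketch:

```python
import numpy as np

def lsa_vectors(M, k=100):
    """Length-normalize the rows of the count matrix and apply a
    truncated SVD, returning k-dimensional row vectors (scaled here
    by the singular values; leaving them unscaled is also common)."""
    norms = np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    X = M / norms
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k]

# The reduced vectors can be compared exactly as before, e.g. by
# passing lsa_vectors(M) in place of M to cosine_neighbors.
```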
If we compare the new vectors, the results are somewhat different. To my eye, they look messier from a sentiment perspective. Going forward, I'll stick with the simpler method of just using cosine similarity on the raw count vectors.
excellent | terrible |
---|---|
excellent | terrible |
said | fish |
about | friend |
some | had_NEG |
potatoes | left_NEG |
ambiance | so-so |
... | or_NEG |
bar | dessert |
i'm | especially |
what | curry |
you_NEG | saturday_NEG |
little | again_NEG |
got | pre |
his | expectations |
be_NEG | of |
enough_NEG | inattentive |
are | attentive |
later_NEG | when |
snapper | patrons |
tables | you've |
There are many, many design choices available for vector-space models. We can vary not only the matrix itself but also the kinds of dimensionality reduction we perform (if any) and the similarity measures we employ. I refer to Turney and Pantel 2010 for a thorough overview. Here, I confine myself to mentioning a few other matrix designs that might be valuable for sentiment analysis:
Turney and Littman (2003) generalize the informal procedure used above into a general lexicon-learning method. Their approach:
Turney and Littman explore a number of different weighting schemes and similarity measures, paying special attention to how well they do on various corpus sizes. They conjecture that their approach can be generalized to a wide variety of sentiment contrasts.
I tried out a simple version of their approach on the OpenTable subset used above. The matrix was an unweighted word–word matrix, and the vector similarity measure was cosine similarity. The seed sets I used (table tab:so_seeds) are designed to target positive and negative words for describing food and restaurant experiences. The top 20 positive and negative words are given in table tab:so. The results look extremely promising to me.
positive words | good, excellent, delicious, tasty, pleasant |
negative words | bad, spoiled, awful, rude, gross |
Positive | Negative |
---|---|
excellent | rude |
delicious | awful |
good | bad |
tasty | worst |
pleasant | acted |
very | apologies_NEG |
service | attitude |
great | unprofessional |
attentive | desserts_NEG |
friendly | complained |
restaurant | NEVER |
all | recommending_NEG |
my | customer_NEG |
menu | terrible |
recommend | HORRIBLE |
experience | rude_NEG |
nice | questions_NEG |
atmosphere | ridiculous |
wine | taking_NEG |
well | watery |
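The scoring just described can be sketched as follows, once more assuming `vocab` and `M` from the earlier sketches: each word's score is its summed cosine similarity to the positive seeds minus its summed similarity to the negative seeds, in the spirit of Turney and Littman's semantic-orientation measure.

```python
import numpy as np

POSITIVE_SEEDS = ["good", "excellent", "delicious", "tasty", "pleasant"]
NEGATIVE_SEEDS = ["bad", "spoiled", "awful", "rude", "gross"]

def semantic_orientation(vocab, M, pos=POSITIVE_SEEDS, neg=NEGATIVE_SEEDS):
    """Score every word by its total cosine similarity to the positive
    seeds minus its total similarity to the negative seeds."""
    norms = np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    X = M / norms                                  # length-normalized rows
    pos_idx = [vocab.index(w) for w in pos]
    neg_idx = [vocab.index(w) for w in neg]
    scores = X.dot(X[pos_idx].T).sum(axis=1) - X.dot(X[neg_idx].T).sum(axis=1)
    return sorted(zip(vocab, scores), key=lambda pair: -pair[1])

# The top of the returned list corresponds to the "Positive" column of
# table tab:so; the bottom of the list to the "Negative" column.
```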
The method of Velikovich, Blair-Goldensohn, Hannan, and McDonald (2010) is closely related to that of Turney and Littman 2003, except that it extends the idea with a sophisticated and somewhat mind-bending propagation component aimed at learning truly massive sentiment lexicons.
The algorithm is defined in figure fig:webprop_algorithm.
I use the simple example in table tab:webprop_ex to illustrate how the algorithm works.
Corpus |
---|
superb amazing |
superb movie |
superb movie |
superb movie |
superb movie |
superb movie |
amazing movie |
amazing movie |
cool superb |
 | amazing | cool | movie | superb |
---|---|---|---|---|
amazing | 0 | 0 | 2 | 1 |
cool | 0 | 0 | 0 | 1 |
movie | 2 | 0 | 0 | 5 |
superb | 1 | 1 | 5 | 0 |
 | amazing | cool | movie | superb |
---|---|---|---|---|
amazing | 1.0 | 0.45 | 0.42 | 0.86 |
cool | 0.45 | 1.0 | 0.93 | 0.0 |
movie | 0.42 | 0.93 | 1.0 | 0.07 |
superb | 0.86 | 0.0 | 0.07 | 1.0 |
Figure fig:webprop_example continues the above example by showing how propagation works.
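Here is a rough sketch of the propagation step, under the simplifying assumptions that a node's association with a seed is the best product of edge weights over paths of bounded length, that associations are summed over each seed set, and that the weighting factor `beta` is left as a free parameter. The graph encoding and all names are mine.

```python
def best_path_scores(graph, seed, max_steps):
    """For a single seed, compute, for every reachable node, the maximum
    product of edge weights over paths of length <= max_steps from the
    seed. `graph` maps each node to a dict of {neighbor: weight}."""
    best = {seed: 1.0}
    frontier = {seed: 1.0}
    for _ in range(max_steps):
        new_frontier = {}
        for node, score in frontier.items():
            for nbr, weight in graph.get(node, {}).items():
                candidate = score * weight
                if candidate > best.get(nbr, 0.0):
                    best[nbr] = candidate
                    new_frontier[nbr] = candidate
        frontier = new_frontier
        if not frontier:
            break
    return best

def propagate(graph, pos_seeds, neg_seeds, max_steps=3, beta=1.0):
    """Polarity of each node: its summed best-path association with the
    positive seeds minus beta times its summed association with the
    negative seeds (a simplification of the published procedure)."""
    pos, neg = {}, {}
    for seeds, scores in ((pos_seeds, pos), (neg_seeds, neg)):
        for seed in seeds:
            for node, value in best_path_scores(graph, seed, max_steps).items():
                scores[node] = scores.get(node, 0.0) + value
    nodes = set(pos) | set(neg)
    return {node: pos.get(node, 0.0) - beta * neg.get(node, 0.0)
            for node in nodes}

# Toy graph built from the cosine values in table tab:webprop_ex
# (zero-weight edges omitted):
graph = {
    "amazing": {"cool": 0.45, "movie": 0.42, "superb": 0.86},
    "cool":    {"amazing": 0.45, "movie": 0.93},
    "movie":   {"amazing": 0.42, "cool": 0.93, "superb": 0.07},
    "superb":  {"amazing": 0.86, "movie": 0.07},
}
print(propagate(graph, pos_seeds=["superb"], neg_seeds=[]))
# "cool" ends up with a higher score via "amazing" (0.86 * 0.45) than
# it would get from its direct (zero-weight) link to "superb".
```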
Velikovich et al. build a truly massive graph:
For this study, we used an English graph where the node set $V$ was based on all n-grams up to length 10 extracted from 4 billion web pages. This list was filtered to 20 million candidate phrases using a number of heuristics including frequency and mutual information of word boundaries. A context vector for each candidate phrase was then constructed based on a window of size six aggregated over all mentions of the phrase in the 4 billion documents. The edge set $E$ was constructed by first, for each potential edge $(v_i, v_j)$, computing the cosine similarity value between context vectors. All edges $(v_i, v_j)$ were then discarded if they were not one of the 25 highest weighted edges adjacent to either node $v_i$ or $v_j$.
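The edge-pruning step in that description can be sketched roughly as follows; the dict-of-edge-weights representation and the function name are mine, and `k=25` mirrors the threshold they report.

```python
def prune_edges(weights, k=25):
    """Given candidate undirected edges as a dict {(i, j): weight}, keep
    only those edges that rank among the k highest-weighted edges
    adjacent to at least one of their endpoints."""
    adjacent = {}
    for (i, j), w in weights.items():
        adjacent.setdefault(i, []).append((w, j))
        adjacent.setdefault(j, []).append((w, i))
    keep = set()
    for node, edges in adjacent.items():
        for w, other in sorted(edges, reverse=True)[:k]:
            keep.add(frozenset((node, other)))
    return {edge: w for edge, w in weights.items()
            if frozenset(edge) in keep}
```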
Some highlights of their lexicon are given in figure fig:webprop_lexicon.
I think this approach is extremely promising, but I have not yet had the chance to try it out on sub-Google-sized corpora. However, I do have a small Python implementation for reference:
Once you have a distributional matrix, there are lots of things you can do with it:
The central challenge for using these approaches in sentiment analysis is that they all tend to favor content-level associations over sentiment associations. That is, if you feed them the full vocabulary of your corpus, or the top n words from it, then you are unlikely to get sentiment-like groupings back. Some methods for addressing this challenge: