Painting a Novel

I have always wondered whether a book could be shrunk into a picture. I love to read, but there are times when I start a novel and discover by page 209 that I do not really like the content, the author’s views or simply the plot. An image that said something about the book in a timeline-like manner would be pretty useful. Having said that, this touches on some of the most difficult challenges in NLP. NLP is not my field and my knowledge of it is pretty naive. Still, I made an attempt this weekend to create images from two novels and compare them to see how informative they are. The goal was to see how emotions evolve over the course of a novel. These could be called basic infographics of novels, but I have tried to keep the aesthetics in mind so that the images do not look too technical… whatever that means.

Preliminaries

Hundreds of categories could be used to describe a novel, and comparing novels with any accuracy would probably need many of them. I have focused on two broad aspects: sentiment analysis and nature phenomena. Sentiment analysis is a pure NLP problem. The goal is to quantify sentiments (positive, negative, arousal, sadness etc.) in a sentence or a paragraph by matching its words against an existing database of sentiment/emotion words. Usually these databases attach a score to each word that shows the strength of its positive or negative emotion. I was hoping a global sentiment analysis might tell me how, and what kind of, emotions show up along the timeline of the novel. The reason for looking at nature phenomena is quite personal. I am one of those people who love to read descriptions of nature in a novel. They help me visualize the environment, and in many cases I feel more attached to the story.

Data Collection

A nice resource page for sentiment analysis is [1]. I selected a free database that is available immediately (i.e. you don’t have to request it and wait for ages to get it). It’s called the AFINN word list [2]. It has a collection of 2477 words that are collected from Twitter. Each word has been given a score from -5 to +5 (-5 for extremely negative and +5 for extremely positive). The scores were assigned manually by the list’s author, Finn Årup Nielsen, not by a learning algorithm. However, I was not entirely sure whether a list of words collected from Twitter feeds could capture the strength of emotions in a novel, especially one written a century ago (for obvious reasons)! So I found another list of emotion words [3] that seemed quite helpful category-wise. I manually copied and pasted each category of words into one of two text files as seemed appropriate, one for positive emotions and the other for negative emotions. I also found a list of ‘Nature’ related words online and decided to go with it for my experiments.

How it works

I have kept it pretty simple. My idea is not only to see how much emotion information I can accurately extract, but also how I could produce (sort of) nice images from a book. That’s not something a science guy should say, but I have a thing for nice looking abstract patterns. So another goal is to take out the discrete structure of the final image and replace it with a smoothed out version.

Using the AFINN list or the lists I manually compiled, there is a simple way of constructing scores for a sentence or a paragraph. Let’s look at a few examples.

1. “I hate the way he talks, he is disgusting.” The lists usually contain the emotion words, so ‘hate’ and ‘disgusting’ would be the two words we are likely to find in the compiled lists. The AFINN word list has both of these words associated with -4. Adding up, the sentence would get a score of -8.

2. “I like her, but she is quite an idiot.” The word ‘like’ gets +2 and ‘idiot’ gets -4 from AFINN. The net score could be the sum of the two, or we could take the score with the largest magnitude and keep its sign. Summing up, this sentence gets a score of -2.

3. “I love her and she is the one in my life.” AFINN has ‘love’ with +3. It doesn’t have a score for the other words, and there are no negative words in this sentence, so it gets +3 overall.

4. “The city was shrouded by black smoke. Elliot suddenly understood that its destruction was a matter of time.” According to AFINN, this sentence gets -4. No positive words detected.

It would not be hard to come up with a better heuristic than net summation. However, I did not have much time to experiment with different schemes, so I had to be satisfied with this one and hope to see some observable patterns.
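The two scoring options from example 2 can be sketched in a few lines of Mathematica. This is only an illustration: `toyRules`, `wordScores`, `netSum` and `maxMagnitude` are names made up here, and the word values are the ones quoted in the examples above, not a lookup in the actual AFINN file.

```mathematica
(* Toy lexicon: values as quoted in the examples above, not read from AFINN-111.txt *)
toyRules = {"hate" -> -4, "disgusting" -> -4, "like" -> 2, "idiot" -> -4, "love" -> 3};

(* Scores of the lexicon words found in a sentence; all other words are dropped *)
wordScores[sentence_String] :=
  Cases[StringCases[ToLowerCase[sentence], LetterCharacter ..] /. toyRules, _Integer];

(* Option 1: net summation *)
netSum[sentence_String] := Total[wordScores[sentence]];

(* Option 2: the single score with the largest magnitude, sign preserved *)
maxMagnitude[sentence_String] := With[{s = wordScores[sentence]},
   If[s === {}, 0, First[SortBy[s, -Abs[#] &]]]];

netSum["I hate the way he talks, he is disgusting."]     (* -8 *)
netSum["I like her, but she is quite an idiot."]         (* -2 *)
maxMagnitude["I like her, but she is quite an idiot."]   (* -4 *)
```

Unlike the functions in the Code section, this toy version splits the sentence into whole words first, so it only serves to make the arithmetic of the examples concrete.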

For the second word list, positive emotions get +1 and negative emotions -1. Based on the count of positive or negative words in each sentence or paragraph, I multiply the count with the respective sign.

Each sentence or paragraph will be allocated a pixel in the final image, and the pixel will be colored according to the intensity of emotion, i.e. the score obtained from the net summation of the emotion words.

Code

1. To compare a list of sentences or paragraphs against the AFINN list and assign a score, we treat the document as an n-dimensional vector, where each sentence or paragraph (depending on what we are investigating) is assigned one element, so n is the number of sentences or paragraphs. The i-th element is updated whenever a word from the emotion list is found in the i-th sentence or paragraph. At the end, the vector is smoothed with an exponential moving average filter and reshaped into a matrix for easy viewing and plotting. I have chosen Mathematica because its many built-in functions make these things easy.

compareAFINNLists[dat_, elist_] := Module[
  {tmp, ntmp, ptmp, psum, nsum, ppar, npar, i, j},
  Monitor[
   nsum = Table[0, {i, 1, Length[dat]}];
   psum = Table[0, {i, 1, Length[dat]}];
   For[j = 1, j <= Length[elist], j++,
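    (* ___ on both sides lets the match span the whole string, so tmp[[i]] is 1
       if the word occurs anywhere in dat[[i]] (even inside a longer word), else 0 *)
    tmp = StringCount[dat, ___ ~~ elist[[j, 1]] ~~ ___];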
    ntmp = Table[
      If[tmp[[i]] == 0,
       0,
       If[elist[[j, 2]] < 0,
        elist[[j, 2]]*tmp[[i]],
        0]
       ]
      ,
      {i, 1, Length[tmp]}];
    ptmp = Table[
      If[tmp[[i]] == 0,
       0,
       If[elist[[j, 2]] > 0,
        elist[[j, 2]]*tmp[[i]],
        0]
       ]
      ,
      {i, 1, Length[tmp]}];
    psum = psum + ptmp;
    nsum = nsum + ntmp;
    ]
   ,
   ProgressIndicator[j, {1, Length[elist]}]
   ];
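  (* Smooth the score vectors (smoothing constant 0.03), then cut them into rows of
     about Sqrt[n] entries; Partition drops leftover entries that do not fill a row *)
  ppar = Partition[ExponentialMovingAverage[psum, 0.03],
    Round[Sqrt[Length[psum]]]];
  npar = Partition[ExponentialMovingAverage[nsum, 0.03],
    Round[Sqrt[Length[nsum]]]];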
  Return[{ppar, npar, psum, nsum}]
  ]
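One thing worth knowing about the function above: the pattern `___ ~~ word ~~ ___` stretches over the whole string, so `StringCount` returns 1 when the word appears anywhere in a paragraph (even inside a longer word) and 0 otherwise. If true per-word counts are wanted, a variant along the following lines might work. It is a sketch and not what produced the figures in this post; `countWholeWords` is a hypothetical helper name.

```mathematica
(* Hypothetical helper: number of whole-word occurrences of word in each string of dat *)
countWholeWords[dat_List, word_String] :=
  StringCount[dat, WordBoundary ~~ word ~~ WordBoundary];

paragraphs = {"i hate hate this", "he is unlikely to come"};

StringCount[paragraphs, ___ ~~ "hate" ~~ ___]   (* pattern used above: {1, 0} *)
countWholeWords[paragraphs, "hate"]             (* {2, 0} *)
StringCount[paragraphs, ___ ~~ "like" ~~ ___]   (* {0, 1}, triggered by "unlikely" *)
countWholeWords[paragraphs, "like"]             (* {0, 0} *)
```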

2. For the other word lists, we follow a similar algorithm. This time, the sign (+1 or -1) is also input as an argument so that this factor can be multiplied with the net score.

compareLists[dat_, elist_, sign_] := Module[
  {tmp, tmp2, sumt, spar, i, j},
  Monitor[
   sumt = Table[0, {i, 1, Length[dat]}];
   For[j = 1, j <= Length[elist], j++,
    (* 1 if the word occurs anywhere in dat[[i]] (substring match), 0 otherwise *)
    tmp = StringCount[dat, ___ ~~ elist[[j]] ~~ ___];
    tmp2 = Table[
      If[tmp[[i]] == 0,
       0,
       sign*tmp[[i]]
       ]
      ,
      {i, 1, Length[tmp]}];
    sumt = sumt + tmp2;
    ]
   ,
   ProgressIndicator[j, {1, Length[elist]}]
   ];
  spar = Partition[ExponentialMovingAverage[sumt, 0.03],
    Round[Sqrt[Length[sumt]]]];
  Return[{spar, sumt}]
  ]

3. Loading the text files and parsing to extract the sentences and/or paragraphs is pretty straightforward.

SetDirectory[NotebookDirectory[]];
data=ToLowerCase[Import["montezuma.txt","Plaintext"]];
data=StringSplit[data,"\n\n"];

(* AFINN-111.txt: one tab-separated "word, score" pair per line *)
emot=StringSplit[StringSplit[Import["AFINN-111.txt"],{"\n"}],"\t"];
emot=Table[{emot[[i,1]],ToExpression[emot[[i,2]]]},{i,1,Length[emot]}];

pemot=Select[StringSplit[StringTrim[Import["positive-emotions.txt"]],{" ",","}],#!=""&];
nemot=Select[StringSplit[StringTrim[Import["negative-emotions.txt"]],{" ",","}],#!=""&];
nature=ToLowerCase[Select[StringSplit[StringTrim[Import["nature.txt"]],{" ",","}],#!=""&]];

res=compareAFINNLists[data,emot];
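The snippet above splits the text on blank lines, i.e. into paragraphs. For sentence-based analysis the split has to happen at sentence ends instead. A rough sketch, assuming a sentence ends with '.', '!' or '?' followed by whitespace (`splitSentences` is a hypothetical name, and abbreviations such as 'Mr.' will be split wrongly):

```mathematica
(* Split after sentence-ending punctuation; the whitespace that follows it is discarded *)
splitSentences[text_String] :=
  StringTrim /@ StringSplit[text, RegularExpression["(?<=[.!?])\\s+"]];

splitSentences["The city burned. Elliot ran! Where to?\nNobody knew."]
(* {"The city burned.", "Elliot ran!", "Where to?", "Nobody knew."} *)
```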

The resulting output matrix is considered a 2D scalar density field and plotted using the ListDensityPlot command in Mathematica.
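A minimal sketch of that plotting step, using made-up scores instead of a real novel so that it runs on its own. The color scheme, the interpolation order and the row reversal are assumptions for illustration, not necessarily the settings behind the figures below.

```mathematica
SeedRandom[1];
scores = RandomInteger[{0, 5}, 400];        (* stand-in for psum: one score per paragraph *)
smoothed = ExponentialMovingAverage[scores, 0.03];
grid = Partition[smoothed, Round[Sqrt[Length[smoothed]]]];   (* 20 x 20 matrix *)

(* ListDensityPlot draws the first row at the bottom; Reverse moves it to the top
   so that the image reads like a page of text *)
plot = ListDensityPlot[Reverse[grid], ColorFunction -> "DeepSeaColors",
   InterpolationOrder -> 2, Frame -> False]
```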

Archangel – W.C. Halbrooks

Time to do some experiments and see how the program performs. I chose two novels that I had immediately at hand. The first one is Archangel, written by my freshman year roommate Carter (W.C. Halbrooks) when he was in high school. I had a copy on my computer, so naturally it became the subject of my first few experiments.

Sentence based analysis: Following are some images produced for sentence based sentiment analysis.

      

Figure 1. (Left) Positive emotions, (Right) Negative emotions based on the AFINN list. The associated color map is shown below them.

    

Figure 2. (Left) Figure 1 images masked over each other with an alpha value of 0.4, (Right) Sum of positive emotions and abs(negative emotions) matrices.

      

Figure 3. (Left) Histogram of scores for positive emotions, (Right) histogram of scores for negative emotions.

The images are to be read left to right, top to bottom, just as one would read English text. Each one is a timeline showing how emotions evolve as we read through the sentences. Figure 1 shows such images for the AFINN word list. Figure 2 shows two ways of combining the positive and negative timelines. From the histograms of figure 3, we see that the average scores hover around 2 and -1.5.

    

Figure 4. (Left) Positive emotions based on DeRose emotion dictionary, (Right) Negative emotions based on the same dictionary. The color map is shown below them.

     

Figure 5. (Left) Nature timeline based on my nature words list, (Right) Histogram of scores from Nature words category.

Figure 4 shows the positive and negative emotion timelines based on the DeRose emotion dictionary, and figure 5 shows the performance of the Nature word list I found online. It is definitely a poor word list (see the histogram): only a few words from the list were found in the novel. The other explanation could be that the novel does not have a lot of descriptions of nature, but I would have a hard time believing that.

Paragraph based analysis: Often it is a good idea to look at the net score of a paragraph and see a timeline based on emotions in each paragraph.

    

Figure 6. Paragraph based positive emotions timeline (left), negative emotions timeline (right). Note the prominence of negative sentiments in the paragraphs in the later stages of the novel.

   

Figure 7. DeRose dictionary based positive emotions timeline. From the score histogram, it seems that quite a lot of words were common between the list and the novel.

Montezuma’s Daughter – Henry Rider Haggard

I recently read this novel. Project Gutenberg [4] offers a free text for all. From the images, I could roughly relate a few events (wars, love and marriage between the protagonists, conspiracy against the empire etc) in the novel.

Paragraph based analysis: From Archangel, it seemed to me that paragraph based analysis is better. For one thing, we get less cluttered images!

    

Figure 8. Positive emotions timeline (left), negative emotions timeline (right).

This, in contrast to Archangel, says a lot about the kind of language used in novels a century ago. Note the prominence of positive emotions throughout the novel. This makes the novel easier to analyze, because the negative emotions are quite visible when there are extreme events. There are approximately six brown shades in the negative sentiments timeline (right). Having read the novel, I can approximately relate its tragic events to those six lines. Note the dominance of blue at the very beginning of the positive timeline (left), and the dominance of brown at the very beginning of the negative timeline. The novel starts with a lot of lamentation over the murder of the protagonist’s mother, so it is not surprising to see that small patch of brown at the beginning of the negative timeline (or the blue patch at the beginning of the positive timeline).

    

Figure 9. Nature description propagation in Montezuma’s Daughter. For this larger corpus, the nature words list worked out well (to some extent), as seen from the score histogram. So we can sort of rely on this timeline picture and say that there are quite a lot of nature descriptions from around the middle of the novel onward, which is not quite wrong. Anahuac (currently Mexico) in the 16th century is quite well described when the protagonist becomes the king of the tribes there, which happens at around the middle of the novel.

Conclusion

This was just a glimpse of what data could be visualized about novels to give readers some notion of the emotional experience as they read along. Much other useful information about novels could be encoded in these timeline-like pictures. The work here does not do justice to the title, I agree, but hey, this was just me spending some spare weekend time off research and other duties to explore what sort of patterns and pictures emerge from the novels I read!

The deciding factors here are (a) a comprehensive list of emotion/sentiment words and (b) a nice heuristic to score sentences or paragraphs. Let’s be honest, the net summation scheme sucks for many reasons. For one thing, it leaves out small and detailed sentiment strengths in paragraphs or sentences. Nevertheless, I saw some patterns that I expected to see, so it did the job for now. A better scheme could be a Taylor-series-like summation: as more words from the emotion database are found in a sentence or paragraph, the squared, cubic etc. terms of their values would be added to the overall sentiment strength.
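The Taylor-like idea is only described loosely above, so here is one possible reading of it and nothing more than that. The sketch assumes that the matched scores of a paragraph are averaged and divided by 5 (the largest AFINN magnitude), and that a paragraph with m matched words gets the terms up to order m. The name `taylorScore` and the 1/k! weights are assumptions for illustration.

```mathematica
(* One possible reading, not a tested heuristic: m matched words unlock terms up to order m *)
taylorScore[{}] := 0;
taylorScore[scores_List] := Module[{x = Mean[scores]/5, m = Length[scores]},
   Sign[x]*Sum[Abs[x]^k/k!, {k, 1, m}]];

N[taylorScore[{-4}]]           (* one strong word: -0.8 *)
N[taylorScore[{-4, -4, -4}]]   (* three strong words: about -1.21 *)
```

With this reading, a paragraph packed with strong words scores higher in magnitude than one containing a single strong word, but the growth is damped by the factorials instead of being linear as in net summation.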

The information visualization and art aspects of such images can’t be ignored. From my Google search I have not found anything about such visualization, but it’s quite hard to believe no infovis researchers attempted such work. I am interested to see what sort of work has been done so far.

With a carefully chosen color map, such patterns can be quite artsy from the reader’s or writer’s perspective. The amount of information that can be embedded in a 2D image is limited though; the “no free lunch” idea applies here. An image based on the emotions and sentiments in a novel seemed logical to me, but there can be other aspects equally important to the reader. The experience of reading a novel is quite personal, and different readers value different factors.

[1] http://neuro.imm.dtu.dk/wiki/Text_sentiment_analysis

[2] http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010

[3] http://www.derose.net/steve/resources/emotionwords/ewords.html

[4] http://www.gutenberg.org/cache/epub/1848/pg1848.txt
