Analyzing Tagore’s Literature (Part 2)

In part 1, I employed the Bose-Einstein distribution to find out how the “temperature” of Tagore’s writing varies across his novels. In part 2, I turn to Zipf’s power law, and to similarity metrics for comparing high-dimensional vectors, in order to analyze the lexical wealth of, and the similarity among, the novels and short stories written by the legend.

Zipf’s Law

In fractal theory, Zipf’s power law is a tried and accepted heuristic for comparing large texts [1]. Power-law statistics of this kind, derived from the behavior of certain kinds of fractals, are used in many other disciplines too. In simple terms, Zipf’s law is stated as: $N = A k^{-\phi}$.

Taking logs on both sides: $\log(N) = \log(A) - \phi \log(k)$

we get a linear equation. Here, N is the total number of words in a text, and k is the ratio of the number of distinct words n to N. A is a constant amplitude and $\phi$ is an exponent that is characteristic of a given author. Using simple linear regression on the (log(k), log(N)) points, it is possible to estimate this characteristic $\phi$ for any author. The law states a simple fact: as the text size N increases, the fraction of distinct words k decreases. The rate at which this happens reflects the writer’s skill in maintaining variety in words and sentence structures over the course of his novels.
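The regression described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact procedure behind the figures: the helper names (`tokenize`, `lexical_stats`, `fit_line`, `fit_zipf`) are hypothetical, words are assumed to be separated by whitespace, and punctuation (including the Bengali danda) is simply stripped from the ends of each token.

```python
import math

def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (assumed tokenizer)."""
    strip_chars = ".,;:!?\"'()[]{}-।‘’“”"
    words = (w.strip(strip_chars) for w in text.split())
    return [w for w in words if w]

def lexical_stats(text):
    """Return (n, N, k): distinct words, total words, and their ratio k = n / N."""
    words = tokenize(text)
    N = len(words)
    n = len(set(words))
    return n, N, n / N

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_zipf(stats):
    """Fit log(N) = log(A) - phi * log(k) over a list of (n, N, k) tuples.

    Returns (A, phi). Needs at least two texts with different k values.
    """
    xs = [math.log(k) for _, _, k in stats]
    ys = [math.log(N) for _, N, _ in stats]
    intercept, slope = fit_line(xs, ys)
    return math.exp(intercept), -slope
```

Calling `lexical_stats` on each novel and passing the resulting list to `fit_zipf` would give one (A, $\phi$) pair per collection. A real pipeline may need a more careful tokenizer than the whitespace split assumed here.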

Demonstration

The following table shows n, N and k values for the same set of novels in part 1.

The following table shows the same data for a collection of short stories.

Note the higher values of k for the short stories. This is probably due mainly to the smaller size of these texts.

Figure 1. (Left) Data points in the Log(k)-Log(N) plane, and a linear fit equation showing the characteristic gradient $\phi$. (Right) Same experiment done on the short stories.

Figure 2. The linear fit equations for novels and stories on the same plot (red – stories, blue – novels). The plot shows that Tagore’s lexical wealth k falls at a higher rate for novels, although this could be due to the difference in text size.

Heaps’ Law

Heaps’ law is similar to Zipf’s law. It is a power law that describes how the number of unique elements in a set of randomly chosen elements grows as the size of the set increases. In our case, we would expect the number of unique words n to increase as the size of the text N increases.
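A log-log fit of this kind can be sketched as follows. The power-law form $n = K N^{\beta}$ is the usual statement of the law; the symbols K and $\beta$ and the helper names `fit_line` and `fit_heaps` are my own labels for this sketch and do not come from the figures.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

def fit_heaps(pairs):
    """Fit log(n) = log(K) + beta * log(N) over (n, N) pairs, one pair per text.

    Returns (K, beta). Needs at least two texts with different sizes N.
    """
    xs = [math.log(N) for _, N in pairs]
    ys = [math.log(n) for n, _ in pairs]
    intercept, slope = fit_line(xs, ys)
    return math.exp(intercept), slope
```

If both power laws held exactly, the two exponents would be tied together: substituting k = n/N into $N = A k^{-\phi}$ gives $n = A^{1/\phi} N^{1 - 1/\phi}$, so $\beta = 1 - 1/\phi$. With noisy data the two fits need not agree this neatly.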

Figure 3. (Left) Heaps’ law demonstrated for novels (log(n) vs. log(N) plot), (right) for short stories.

Figure 4. The two linear fit equations on the same plot (red – short stories, blue – novels). Although the short stories have more distinct words at small text sizes, the novels eventually overtake them as the text size increases. This may indicate a greater effort on Tagore’s part to polish and revise his novels to amplify their lexical wealth, whereas, statistically, this may be less true for his short stories.
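The text size at which one fitted line overtakes the other follows directly from equating the two linear fits. The sketch below assumes each fit has the form log(n) = a + b log(N) with natural logarithms; the function name and the coefficients in the usage lines are hypothetical and are not the values from the figure.

```python
import math

def crossover_size(a1, b1, a2, b2):
    """Text size N at which log(n) = a1 + b1*log(N) meets log(n) = a2 + b2*log(N)."""
    if b1 == b2:
        raise ValueError("parallel fits never cross")
    # Setting a1 + b1*x = a2 + b2*x and solving for x = log(N).
    return math.exp((a2 - a1) / (b1 - b2))

# Hypothetical coefficients, for illustration only.
stories_fit = (1.0, 0.5)    # (intercept, slope)
novels_fit = (0.0, 0.75)
n_star = crossover_size(*stories_fit, *novels_fit)
print(f"fits cross near N = {n_star:.0f} words")
```

Beyond the crossing point, the fit with the larger slope has the larger predicted number of distinct words.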

Similarity Measure

The variability of distinct words across a set of novels or short stories can be captured by feature vectors, which are essentially rows of numbers in a document-term matrix. Comparing these high-dimensional vectors may be a useful way to infer the similarity among Tagore’s short stories and among his novels. Here, I use two schemes to compare the vectors. One is the cosine of the angle between two vectors, and the other is the L2-norm of the difference between two vectors. Both schemes reduce a pair of high-dimensional vectors to a scalar value that can be easily compared. Histograms over all possible pairs are then produced to analyze how similar or different the span of words used in the short stories or novels is.
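The two schemes can be sketched as below. This is a minimal illustration with hypothetical helper names. It assumes each document is already a non-empty list of tokens and uses raw word counts as the vector entries; whether the original histograms used raw counts or normalized frequencies is not stated, and the L2 distance in particular is sensitive to that choice.

```python
import math
from collections import Counter
from itertools import combinations

def term_vectors(docs):
    """Build one count vector per document over the shared, sorted vocabulary."""
    vocab = sorted(set(w for doc in docs for w in doc))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for word, count in Counter(doc).items():
            vec[index[word]] = float(count)
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_distance(u, v):
    """L2-norm of the difference u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise(vectors, measure):
    """Apply a measure to every unordered pair of vectors."""
    return [measure(u, v) for u, v in combinations(vectors, 2)]

def histogram(values, bins, lo, hi):
    """Count values into equal-width bins on [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in values:
        i = int((x - lo) / width)
        counts[max(0, min(i, bins - 1))] += 1
    return counts
```

For example, `histogram(pairwise(term_vectors(docs), cosine), 20, 0.0, 1.0)` would give the bin counts for the cosine scheme over one collection, where `docs` holds the tokenized texts.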

Figure 5. (Left) Histograms from the L2-norm difference scheme, (right) from the cosine scheme. Red – short stories, blue – novels. Note the bimodal distribution for both novels and stories, except under the cosine scheme for short stories. There seem to be two principal modes of similarity among the novels and stories, although this could be just a statistical property of texts that I am not aware of.

Note the width of the novels’ histograms in both cases; they are wider than those of the stories. For the cosine scheme, the novels’ histogram has a mode that is closer to 1.0, whereas the average peak of the stories’ histogram is farther from 1.0. These two observations suggest that similar words and sentence structures recur across the novels more than across the short stories. This is consistent with the inferences drawn in part 1 and from Zipf’s and Heaps’ laws for Tagore’s work.

Comparing Upendrakishor Raychowdhury’s Work

One last thing I try here is to see how these measures can be used to compare different authors’ works. My aim was to compare Kazi Najrul Islam with Tagore, but unfortunately I could not find any of his work in text form. Instead, I found Upendrakishor Raychowdhury’s short story collection and decided to compare the lexical wealth of the two authors’ story collections. It should be noted that lexical wealth is only one of the (measurable) heuristics for comparing authors. Most comparisons in the field of literature are qualitative and depend on the taste of readers and critics. Nonetheless, lexical wealth does say a lot about an author’s skill in not being monotonous.

The following table shows UR’s short stories that I have collected, along with their k values.

Figure 6. (Left) Zipf’s law linear fit for UR’s short stories. (Right) Zipf’s law linear fits for both UR’s (red) and Tagore’s (blue) short stories. Although it seems that UR has the upper hand over Tagore (smaller falloff rate as we increase the lexical wealth k), it would be dubious to claim that UR is better at not being monotonous. It is quite risky to draw conclusions from such a small margin, and the lack of adequate data is another issue. I could say something more definite if I had a collection of hundreds of stories from both writers. 🙂

Conclusion

In part 1, I found that a characteristic falloff of lexical wealth may exist for Tagore’s writings. The experiments here in part 2 restate a celebrated fact in linguistics: every author has a natural limit beyond which his writing becomes monotonous, in the sense of repeating words and sentence structures. Rabindranath Tagore was not so different from his contemporaries in this respect. It will be interesting to see how his works compare with other contemporary works when/if I get enough data. 🙂

[1] L. L. Goncalves, L. B. Goncalves, Fractal power law in literary English, Physica A 360 (2006) 557 – 575.