This Thanksgiving break, I spent a lot of my time sitting in my room, crunching numbers on my laptop and on a cluster. That’s my usual job, but I haven’t been working much with data in the last couple of months (I have been concentrating on developing some sensor electronics for my thesis). This is the time of the semester when I usually take a break for a day or two, forget about everything else, and just relax. This break, however, will be a memorable one. I am very excited, and I am writing this update late at night because I simply had to share the experience with the world.
Computation is very powerful. I previously wrote that computation is not always comprehension. But when it does yield comprehension, it can be the most beautiful thing ever. It can tell you stories that no one has ever told before. And they are the best stories of all: beautiful truth disguised in a beautiful narrative of numbers.
For the last four years, I have been working (on and off) on a computational and algorithmic framework for analyzing religious texts and the relationships among the scholars and historians associated with them. A lot of that time has been spent crawling, mining, and cleaning the data into the right form. Recently, our team (a few humanities professors from UC Davis and Stanford, and me) finished crawling and cleaning a few million hadiths (different versions of the teachings of the Islamic prophet), over 75,000 scholar biographies, and a few hundred thousand correct narration sequences. Building on some scripts I had written earlier, I worked during this break to reduce a complicated, hair-ball-like mess of network relationships (between scholars) to a handful of cleaner, more important sub-networks.
Hair ball of network relationships among a few thousand historians.
Eigen-analysis reveals much cleaner, tighter, and more important communities, and discards the unimportant relationships.
A few eigen-analysis tricks were enough to pull this off, but the path through a mathematical project like this is rarely straight and usually rather convoluted. Some simple mathematical intuition can take you a long way in these things, and I hope to detail this journey later in the blog.
There are a few exciting results so far, and they will be detailed in publications and talks. Here I will list the key experiences of the last two days, which are representative of what it is like to live the life of a data detective (when you have collected good data).
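To give a flavor of what an eigen-analysis of a network can look like, here is a minimal sketch of one textbook technique, spectral bisection with the Fiedler vector. It is not the actual pipeline used on the scholar network (that is left for later posts); the function names `spectral_bisection` and `two_cliques` and the toy graph are assumptions made purely for illustration.

```python
import numpy as np

def spectral_bisection(adjacency):
    """Split an undirected, connected graph into two groups.

    Uses the sign of the Fiedler vector, i.e. the eigenvector of the
    graph Laplacian that belongs to the second-smallest eigenvalue.
    """
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    laplacian = np.diag(degrees) - A
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return fiedler >= 0

def two_cliques(n):
    """Hypothetical toy graph: two n-node cliques joined by a single edge."""
    A = np.zeros((2 * n, 2 * n))
    A[:n, :n] = 1.0
    A[n:, n:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[n - 1, n] = A[n, n - 1] = 1.0  # the one "bridge" relationship
    return A

labels = spectral_bisection(two_cliques(4))
print(labels)  # the first four nodes get one label, the last four the other
```

The sign of an eigenvector is arbitrary, so which clique ends up labeled `True` can differ between runs or platforms; only the grouping itself is meaningful.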
Time seemed to stop.
I saw a thousand years' worth of history unfold in front of me, on my computer, within two days. I saw the trust, betrayal, self-interest, selflessness, and overall honest efforts of over 75,000 scholars who lived in different cities and countries at different times, their lives condensed into network diagrams and eigenvectors. Each step from the hair ball mess to the cleaner networks inspired me immensely; I kept solving one piece of the puzzle after another without noticing that the day had already turned into evening. When I was getting close to making major inferences, the work did not feel like work anymore. I had to push myself to go to bed, arguing against the fact that the results were knocking on my door. The satisfaction of doing this type of work is very high.
We need better visualization tools.
More than distributions and tables, I think we need better visualization tools to understand data, because they let us use our cognitive abilities to the fullest. Visualizing networks and eigenvectors in clever ways at different stages of the analysis saved me a few weeks' worth of effort. More on this in later posts.
When in doubt, compute.
Hypotheses and intuition can take us quite far, but when the data is complex, we have no choice but to fall back on standard mathematical tools. We are usually told not to compute blindly, but gut feelings and a few semi-blind computations can yield accidental but good results.
Simple mathematical tools are usually good enough.
SVD, PCA, and basic graph manipulation techniques provide results that are easy to understand. Some unsupervised learning methods often get on my nerves because of the many parameters and tuning options they expose. Using them is like trying to get 98% on an exam instead of being satisfied with 92%. Sometimes the effort is not worth the time and agony, and the results are often confusing.
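As a small illustration of how little machinery these tools need, here is a sketch of PCA computed directly from an SVD with NumPy. The function name `pca_via_svd` and the synthetic data are assumptions for the example and have nothing to do with the hadith dataset. The only knob is `k`, the number of components to keep.

```python
import numpy as np

def pca_via_svd(X, k):
    """Return the top-k principal directions, the projected data,
    and the fraction of variance each kept direction explains."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centered columns
    # Rows of Vt are the principal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:k]
    scores = centered @ components.T
    explained = (s ** 2) / np.sum(s ** 2)
    return components, scores, explained[:k]

# Synthetic data: 200 points lying almost exactly along the direction (1, 2, 0).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.outer(t, [1.0, 2.0, 0.0]) + 0.01 * rng.normal(size=(200, 3))

components, scores, explained = pca_via_svd(X, k=1)
print(components[0])  # roughly parallel to (1, 2, 0) / sqrt(5), up to sign
print(explained[0])   # close to 1: one direction carries almost all the variance
```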
Collect data that matters; impact is a byproduct.
One billion Muslims and their cultures are directly or indirectly affected by these hadiths and narration trees. The people of other faiths who live around them are affected by these teachings too. Hadiths can unite, divide, inspire, or confuse people. Yet no one seems to care about looking at this system through a computational lens, which could be an immensely useful tool for checking the reliability, sanity, and soundness of this knowledge. It is also a novel way to see how knowledge diffused through the Middle Eastern countries over the centuries.
Our team’s initial aim was to collect this data in the correct form from reliable sources. My hope is that impact will come as a byproduct of the computational analysis.