Art&Linguistics: Word Flowers
Application of Informatics
What does classical literature such as Shakespeare's Hamlet have to do with Informatics? In the last few years, partly through the efforts of companies like Google, text, documents, and books (in fact almost anything paper-based) have been digitized. Publishers are starting to sell electronic books, and most academic publications are available as digital documents. Considerable effort has also gone into digitizing literary works (see Project Gutenberg). This not only gives us better access to historical and notable literature, but also gives us the opportunity to analyze the content of those works further by computational means.
A research area called computational linguistics has been growing rapidly in recent years.
This discipline attempts to make sense of text by analyzing its content statistically or quantitatively.
Informatics in Action
The following example displays a simple statistical analysis of several classical texts from Project Gutenberg:
- Persuasion, by Jane Austen
- Sense and Sensibility, by Jane Austen
- Emma, by Jane Austen
- Bible King James Version
- Poems of William Blake
- Songs of Innocence and Experience, by William Blake
- The Man Who Was Thursday, by G. K. Chesterton
- The Ball and The Cross, by G. K. Chesterton
- The Wisdom of Father Brown, by G. K. Chesterton
- Paradise Lost, by John Milton
- The Tragedy of Julius Caesar, by William Shakespeare
- Macbeth, by William Shakespeare
- Hamlet, by William Shakespeare
- Leaves of Grass, by Walt Whitman
Using a Python program, we first count the occurrences of the word "love" and its related forms ("loves", "love", "loved", "loveth", "lov", "loue", "loues"). Next, we do the same with the word "kill" ("kill", "killed", "kills", "murder", "murdered", "killeth", "killes"). The results are displayed in the tables below.
[Tables: occurrences of "love" and "kill" in each text]
View data in XML format.
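The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used to produce the results: the function name `count_variants`, the tokenizing rule (runs of the letters a to z, case ignored), and the sample sentence are all assumptions made here. Only the two variant lists come from the text.

```python
import re
from collections import Counter

# Variant lists copied from the text above.
LOVE_VARIANTS = {"loves", "love", "loved", "loveth", "lov", "loue", "loues"}
KILL_VARIANTS = {"kill", "killed", "kills", "murder", "murdered", "killeth", "killes"}


def count_variants(text, variants):
    """Return (matches, total_words) for a text, ignoring case.

    Hypothetical helper: a word is taken to be any run of the letters a-z,
    so punctuation and apostrophes act as separators.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    matches = sum(counts[v] for v in variants)
    return matches, len(words)


# Invented sample sentence, used only to exercise the function.
sample = "She loved him, and he loues her; who loveth not?"
matches, total = count_variants(sample, LOVE_VARIANTS)
print(matches, total)
```

With this tokenizing rule an elided spelling such as "lov'd" splits into "lov" and "d", which is one possible reason for a form like "lov" to appear in the list, although the original program may have handled such words differently.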
Now we present those results in a more engaging form (see the visualizations below; drag the mouse to rotate the tree). The title of a book with a higher percentage of a given word is displayed in a bigger font and in darker red. What we show here is the count of a particular word as a proportion of the book's total word count. Thus, although the Bible has the highest absolute count of the word "kill", its proportion is not as high as that of Chesterton's The Wisdom of Father Brown (displayed in bold and dark red below).
Visualization of Word "Kill" (bigger and darker title has higher percentage)
Visualization of Word "Love" (bigger and darker title has higher percentage)
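The difference between absolute counts and proportions, and the mapping from proportion to font size and colour, can be sketched as follows. This is only an illustration under stated assumptions: the function names, the point sizes, the linear scaling, and the two shades of red are hypothetical choices, and the counts are invented numbers rather than results from the texts listed above.

```python
def percentage(matches, total_words):
    """Share of a text's words that match, as a percentage."""
    return 100.0 * matches / total_words if total_words else 0.0


def scale_styles(percentages, min_pt=10, max_pt=32):
    """Map each title's percentage to (font size, hex colour).

    Hypothetical linear scaling: the lowest percentage gets the smallest
    font and the brightest red, the highest gets the largest font and the
    darkest red.
    """
    lo, hi = min(percentages.values()), max(percentages.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    styles = {}
    for title, pct in percentages.items():
        t = (pct - lo) / span  # 0.0 for the lowest, 1.0 for the highest
        size = round(min_pt + t * (max_pt - min_pt))
        red = round(255 - t * (255 - 139))  # red channel falls from 255 to 139
        styles[title] = (size, f"#{red:02x}0000")
    return styles


# Invented counts: (matches, total words). The long text has more matches
# in absolute terms, but a smaller share of its words match.
counts = {"Long text": (200, 800000), "Short text": (50, 80000)}
percentages = {title: percentage(m, n) for title, (m, n) in counts.items()}
styles = scale_styles(percentages)
for title, (size, colour) in styles.items():
    print(title, f"{percentages[title]:.4f}%", size, colour)
```

In this invented example the short text ends up with the larger, darker title even though the long text contains four times as many matches, which is the same effect described above for the Bible and The Wisdom of Father Brown.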
Resources
Data
Code
- View the Python source code used to generate the XML dataset
- Download the Python source code [Right click and Save As]
Other Resources
- The visualization above was made using v3ga's Tentaclez code and Processing, which was initially developed at the MIT Media Lab.
