In Part 1 of the language analytics series we analyzed Croatian dictionary words, i.e. the building blocks of the language, but now we would like to know how all these words are actually used, which is what really matters in the end. In theory it is a simple idea, but in practice things get complicated very quickly, because there’s no way we can record everyday face-to-face interactions, phone calls, mail correspondence etc.… well, actually there is, but I’m not working for the NSA, so we’ll have to scratch that option. Luckily, there is a “proxy” for all this in the form of language corpora, which are basically collections of complete texts or extracts from a whole bunch of different sources. Most modern corpora are at least 1 million words in size; however, when it comes to sample size the more the merrier (well, most of the time), so we’ve decided to analyze a corpus consisting of more than 100 million words, contained within around 15,000 Croatian-language literary works, newspaper articles, magazines etc.

The “sport” we are talking about is called “corpus linguistics” – the study of language as expressed in the aforementioned corpora, bodies of “real world” text. It is a bottom-up approach that derives the set of abstract rules governing a natural language from texts in that language, and explores how that language relates to other languages.

In the first post we mentioned how Eskimos (Inuit) have more than 50 different words for snow. It is an extreme language phenomenon that reflects the extreme way of life of the Inuit people. However, despite extreme examples like that one, there are also a lot of similarities between world languages. For instance, did you know that, with a few minor exceptions, there are really only two ways to say “tea” in the world? One way is the aforementioned “tea” and the other one is “chay”, with both versions originating from China many centuries ago.
How they spread around the world offers a clear picture of how globalization worked before “globalization” was a term anybody used. The words that sound like “chay” spread across land, along the Silk Road, while the “tea”-like version spread over water, with Dutch traders bringing the novel leaves to Europe from the coastal province of Fujian, where the identical Chinese character for “tea” was pronounced differently.
However, it’s not only about the words we share or anecdotal evidence like the above; there are all sorts of interdependencies and similarities in the way we use our languages. Recently, by studying linguistic corpora of four phylogenetically widely spaced Indo-European languages – Greek (Hellenic), Russian (Slavic), Spanish (Romance) and English (Germanic) – scientists found that, despite being separated by thousands of years of linguistic evolution, the average inter-correlation among the four languages in the frequency with which they used a list of common words was 0.85. Like we said, languages define us, but despite all the perceived differences we are often more similar than we would like to admit.
Anyway, back to the topic… the aim of this analysis is to provide a basis for understanding and further use of the Croatian language from an NLP (Natural Language Processing) perspective. We briefly mentioned NLP in Part 1, but what is NLP really? Very broadly, NLP is a discipline at the intersection of computational linguistics and artificial intelligence, primarily focused on enabling computers to understand and process human language. You see, computers are exceptional at working with structured data like spreadsheets or database tables; they can execute mind-boggling calculations at crazy speed, but they suck when it comes to processing human language, something that comes very naturally to us. As computers become more and more accessible, interfaces that are effective and user-friendly (regardless of user expertise) become ever more important. Since natural language usually provides effortless and effective communication in human-human interaction, its potential in human-computer interaction is huge. On top of this, the majority of information in the world is actually unstructured, mostly in the form of raw text (messages, mails, social media, documents etc.), so the big question is: how can we get a computer to understand unstructured text and extract data from it? Well, researchers in the fields of corpus linguistics and NLP have developed an array of methods and techniques for studying both the linguistic form and the content of texts.
Usually, the first step in the process is to break the text apart into separate sentences/words – so-called sentence segmentation and word tokenization. After that, some sort of stemming or lemmatization is usually performed. The aim of both processes is the same: reducing the inflectional forms of each word to a common base or root. While stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes (e.g. waited, waits, waiting –> wait), lemmatization takes into consideration the morphological analysis of the words. Lemmatization does not simply chop off inflections; instead it relies on detailed dictionaries which the algorithm can look through to obtain the correct base forms of words (e.g. am, are, is –> be, or better –> good). The next important step is to look at each token and try to guess its part of speech – whether the token is a noun, a verb, an adjective and so on. Knowing the role of each word in the sentence helps immensely when trying to figure out what the sentence is talking about, having in mind the predicate=verb and subject=noun relation (obviously it’s more complicated than that). If you remember our previous post, you can see how the dictionary we obtained can help us with lemmatization and POS tagging, and why we started this whole journey with dictionary analysis.
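To make these first steps concrete, here is a minimal Python sketch of segmentation, tokenization, suffix-stripping stemming and dictionary-based lemmatization. The tokenizer rules, suffix list and mini-dictionary are toy assumptions for illustration (using the English examples from the text), not the real Croatian pipeline:

```python
import re

# Naive sentence segmentation: split after ., ! or ? followed by whitespace.
def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Naive word tokenization: lowercase, keep only alphabetic runs.
def tokenize(sentence):
    return re.findall(r"[a-zA-Z']+", sentence.lower())

# Toy suffix-stripping "stemmer" (checks longer suffixes first).
SUFFIXES = ("ing", "ed", "s")
def stem(token):
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= 3:
            return token[: -len(suf)]
    return token

# Lemmatization via a (tiny, hypothetical) dictionary lookup instead.
LEMMAS = {"am": "be", "are": "be", "is": "be", "better": "good"}
def lemmatize(token):
    return LEMMAS.get(token, token)

text = "She waited. He waits. They are waiting!"
tokens = [t for s in sentences(text) for t in tokenize(s)]
print([stem(t) for t in tokens])       # waited/waits/waiting all stem to "wait"
print([lemmatize(t) for t in tokens])  # "are" is lemmatized to "be"
```

Real stemmers and lemmatizers are far more careful than this (that is exactly where the dictionary from Part 1 comes in), but the division of labor is the same: stemming chops, lemmatization looks up.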
Different types of frequency analysis can be performed in the following steps (bag of words, n-grams, TF-IDF etc.). Performing these counts can help us filter out stop words and detect key (high-frequency) words and phrases, thus getting a very basic idea about the content/topic of the text. There are also more powerful information extraction approaches like dependency parsing and named entity recognition, to name a few. With dependency parsing we are trying to figure out how all the words in a sentence relate to each other. This is done by building a tree-like structure that assigns a single parent word to each word in the sentence, with the main verb as the root. Named entity recognition (NER), on the other hand, is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as names of persons, organizations, locations, or any other concept connected to a particular subject domain (finance, law, technology etc.). Annotations like POS tags and named entities provide a semantic link that connects the mentioned word (concept) to the knowledge behind that word (e.g. Zagreb –> capital of Croatia). These are some of the typical steps within an NLP pipeline, but there are many more approaches, techniques and algorithms, depending on the particular problem and field of study. I hope this short introduction gives you at least a high-level picture of what is being done and why we are doing these exercises.
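The bag-of-words and n-gram counts above can be sketched in a few lines of Python with `collections.Counter`; the stop-word list and sample text here are made up purely for illustration:

```python
from collections import Counter

# Toy stop-word list and text, purely illustrative.
STOP_WORDS = {"the", "a", "of", "and", "in"}
text = "the cat sat in the hat and the cat ate the hat"
tokens = text.split()

# Bag of words: raw frequency counts, then stop words filtered out
# to surface the content-bearing words.
bag = Counter(tokens)
content = Counter(t for t in tokens if t not in STOP_WORDS)
print(content.most_common(2))  # -> [('cat', 2), ('hat', 2)]

# Bigrams (n-grams with n=2) over the raw token stream.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(1))
```

Notice how the unfiltered counts are dominated by “the”, while the stop-word-filtered counts immediately hint at the topic; this is the same effect we’ll see below with Croatian “glue” words.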
However, the NLP exercises we plan to do, even in their simpler forms, are a totally different beast compared to the dictionary analysis… while the dictionary had a couple of hundred thousand entries, our corpus has more than 100 million words. 100 million words to tokenize, stem/lemmatize and POS tag (which is what we’ll cover in this post). Since NLTK/OpenNLP/[insert here] does not support Croatian, I decided to do it from scratch and, well, 100 million tokens kind of exposed my weakness of going with quick, dirty and barely optimized code. However, thanks to modern computing power and the genius of Gaius Julius (divide and conquer), I’ve managed to overcome the problem relatively unscathed.
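In that divide-and-conquer spirit, here is a minimal sketch of how a huge token stream can be split into chunks, counted chunk by chunk, and the partial counts merged at the end (each chunk could just as well be handed to a separate worker process); the token list and chunk size are toy values:

```python
from collections import Counter
from itertools import islice

def chunked(tokens, size):
    """Yield successive lists of at most `size` tokens."""
    it = iter(tokens)
    while chunk := list(islice(it, size)):
        yield chunk

def count_corpus(tokens, chunk_size=1_000_000):
    """Merge per-chunk Counters; because chunks are independent,
    the per-chunk counting parallelizes trivially."""
    total = Counter()
    for chunk in chunked(tokens, chunk_size):
        total.update(Counter(chunk))
    return total

tokens = ["i", "biti", "i", "u", "biti", "i"]
print(count_corpus(tokens, chunk_size=2).most_common(1))  # -> [('i', 3)]
```

The key property is that counting is associative: counting two halves and adding the results gives the same answer as counting the whole, which is what makes the 100-million-token job tractable.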
In this analysis we’ll “repeat” many exercises from Part I (and do some new ones) to see how the building blocks of the language are really used in everyday life, so let’s begin. Now that we have tokenized and lemmatized our corpus, we can answer a question I’ve always wondered about (but couldn’t find the answer to): which are the most common words in the Croatian language? Depending on whether we are counting before or after lemmatization, the top 10 is as follows:
So there you go folks: depending on how you look at it, the conjunction “i” (= “and” in English) and the verb “biti” (= “be”) are the most common words in the Croatian language. For comparison, the word “and” is ranked #5 in the OEC (Oxford English Corpus) and #3 in COCA (Corpus of Contemporary American English), while “be” is #2 in both. If you are wondering, the article “the” is #1, but in Croatian we don’t have anything similar.
As we can see, there are no nouns and just one verb in the Top 10, so you might be wondering what the most common nouns and verbs are. Well, the Top 5 in the noun and verb categories (with their overall rank in the first column) are listed in the following tables. The selection/ranking of verbs makes sense (be, can, say, want, have), and the nouns as well (Croatia, year, Zagreb, day, percentage), once we remember that we have newspaper articles in our corpus, which I would say strongly influenced the noun rankings.
Remember from Part I how the part-of-speech “rest” category (function words like pronouns, prepositions, conjunctions, etc.) accounted for only 0.2% of the dictionary? Well, as we can see from the top 10, all these “glue” words dominate when it comes to usage frequency (apart from the verb “be”). The overall POS distribution in the corpus is as follows:
This makes sense if we consider the standard subject (noun) + predicate (verb) + object (noun) form of the sentence, along with adjectives and “glue” (function) words. When it comes to the most common words, we saw the domination of shorter words, which is also evident if we take a look at the Top 5 in the 1-2-3-4-5 letter word categories, along with their overall ranks:
If you remember Part I, we analyzed the dictionary and concluded that the mean length of the words is a whopping 9.22 letters. Well, it turns out that in practice we prefer shorter words, so the mean word length in use (weighted by relative frequency, i.e. usage probability) is much shorter and stands at around 5.16.
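The frequency-weighted mean is just the expected word length under the corpus frequencies. A quick sketch of the difference between the two means, with made-up numbers rather than the real Croatian counts:

```python
# Hypothetical (word, corpus frequency) pairs, NOT the real Croatian counts.
freqs = {"i": 500_000, "biti": 350_000, "novine": 40_000, "predsjednik": 10_000}

total = sum(freqs.values())
# Unweighted mean over distinct words (the "dictionary" view)...
dict_mean = sum(len(w) for w in freqs) / len(freqs)
# ...vs. mean weighted by usage probability (the "corpus" view).
corpus_mean = sum(len(w) * f for w, f in freqs.items()) / total
print(round(dict_mean, 2), round(corpus_mean, 2))  # -> 5.5 2.5
```

Even in this four-word toy example, the short high-frequency words drag the weighted mean far below the dictionary mean, which is exactly the 9.22 vs. 5.16 effect in the real data.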
Basically, most function words (pronouns, conjunctions, prepositions etc.) are short; they consist of only 2-3 letters, and obviously all those high-frequency function/“glue” words played a big part in lowering the mean. In contrast, almost all content words (nouns, verbs, adjectives) are long (longer than the average word length). That’s why they have the opposite influence: an increase in the frequency of these words lengthens the average word length, and vice versa. As we previously saw, content words completely dominate the dictionary (~99%), but function words strike back in practice. If we plot the corpus word length histogram, we see a huge difference in comparison to the equivalent dictionary histogram (refer to Part I):
There’s one other very interesting thing related to word frequency we should mention. Did you know that in every language the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third most frequent word, etc.? This phenomenon, called Zipf’s law (named after the American linguist George Kingsley Zipf), is more than a century old; however, scientists are still unsure about its underlying “mechanics”. Formally, Zipf’s law states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table: it’s a form of power-law distribution with an inverse relation between rank and frequency. The same rank-size rule describes a remarkable regularity in many phenomena, including the distribution of city sizes, the sizes of businesses, the sizes of particles (such as sand), the lengths of rivers, the frequencies of word usage, and wealth among individuals. All are real-world observations that follow power laws, such as Zipf’s law, Benford’s law, or the Pareto distribution. In the “About” section of the blog I talked briefly about mental models; well, the power law is an extremely useful concept to grasp. We are so used to the Gaussian view of the world that we often “model” reality (in our minds) with the wrong distribution (i.e. normal, Gaussian-like), when in fact we should very often be thinking in terms of power laws. In power-law distributions, extreme events are not treated as outliers; in fact, they determine the shape of the curve. So remember, the world is not normal. No pun intended. 🙂
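Zipf’s prediction is easy to write down: if the top word has frequency f1, the word at rank r should occur roughly f1/r times. A minimal sketch checking a frequency list against that prediction (the numbers are hypothetical, chosen to be roughly Zipfian):

```python
# Hypothetical frequency list, already sorted by rank (made-up numbers).
observed = [7_000_000, 3_500_000, 2_330_000, 1_760_000, 1_400_000]

f1 = observed[0]
# Zipf's law: frequency at rank r is approximately f1 / r.
predicted = [f1 / r for r in range(1, len(observed) + 1)]

for rank, (obs, pred) in enumerate(zip(observed, predicted), start=1):
    print(f"rank {rank}: observed/predicted = {obs / pred:.2f}")
```

A ratio close to 1.0 at every rank means the list follows Zipf’s law; running the same check on real corpus counts is exactly the comparison plotted below.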
Anyway, back to the subject… in the Brown Corpus of American English text, the word “the” is the most frequently occurring word and by itself accounts for nearly 7% of all word occurrences, while the second-place word “of” accounts for slightly over 3.5% of words. What about Croatian? The most common word, “biti”, accounts for 6.9%, while the second most frequent word, “i”, accounts for 3.4%. Coincidence?
If you google images for the term “Zipf’s law” you’ll get the following graph as one of the first results. The line is what Zipf’s law predicts, and the dots depict the actual word frequencies in the text.
So let’s plot our frequency/rank graph to compare.
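Such a graph is just a log-log scatter of frequency against rank (a power law shows up as a straight line on log-log axes). A minimal matplotlib sketch, again with made-up, roughly Zipfian data in place of the real corpus counts:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, save to file
import matplotlib.pyplot as plt

# Hypothetical rank/frequency data (made-up, perfectly Zipfian).
ranks = list(range(1, 101))
freqs = [7_000_000 / r for r in ranks]

# On a log-log scale a power law appears as a straight line.
plt.figure(figsize=(6, 4))
plt.loglog(ranks, freqs, marker=".", linestyle="none")
plt.xlabel("Word rank (log scale)")
plt.ylabel("Word frequency (log scale)")
plt.title("Rank vs. frequency")
plt.tight_layout()
plt.savefig("zipf_rank_frequency.png")
```

With real counts the dots deviate slightly from the straight line, and those deviations (the little humps) are exactly what we compare between the textbook graph and ours.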
Pretty remarkable. On both graphs there’s the most common word, followed by a closely packed duo in second and third place (at about half the top word’s frequency), and then a 3-word cluster at a third of the frequency. The rest of the graph also looks remarkably the same, with the same small humps in the same places. Crazy shit, huh?