Language. Quite probably the greatest invention in the history of mankind, the fire that Prometheus stole from the gods and that set us free. Forget the internet, forget AI, forget… whatever… it was language that steepened the human race’s learning curve like nothing else in history, before or after its invention. It is a tool that lets us store and transfer acquired knowledge, create marvelous works of art or write boring analytics blogs that no one reads, among many other things. On this blog we talk a lot about models, and in its essence language was one of our first models of the world, or better to say, of our perception of the world. It became an intermediary between our mind and reality, a symbolic model that creates the picture of reality we perceive, helping us make better sense of it and communicate it more effectively. Some 100-200 thousand years ago, humans advanced to the point where they could understand that the sound “bone” was not itself a bone, but could be used as a symbol, a representation, of a bone. At that point language was invented, first as spoken words and only about 5 thousand years ago in its written form.
When you think about it, language is very hierarchical in its structure, which means that elements at one level are combined to construct elements at the next level up. To put it simply, several sounds/letters make a word, several words make a phrase, several phrases make a sentence and, ultimately, sentences convey ideas. It is possible to study and compare languages from an analytics perspective on multiple levels, and this is exactly what has been happening in recent years, with the whole field of computational linguistics gaining more and more prominence. In the current era of social media a huge effort is dedicated to the analysis of language, with large IT companies like Google, Facebook, IBM and Microsoft (among many others) investing zillions into the creation of more refined NLP technology.
In this post we’ll go bottom-up, so before we delve into NLP let’s first start with a statistical analysis of the Croatian language, which will come in very handy afterwards. To non-Croatian-speaking readers this might be boring due to lack of knowledge of the subject, but hey, give it a go, you might learn a thing or two and who knows, when you come here on vacation it might come in handy. To my knowledge this type of analysis has never been done before for the Croatian language, and as you will see, it opens many more questions, so I hope there are others out there who might see a use for this approach and, who knows, potential for joint projects. And I don’t mean only linguists, but cryptographers, NLP practitioners and basically whoever finds this interesting and can contribute.
You might be thinking: but what’s there to analyze? Because language comes so naturally to us, we use it without thinking much about its underlying “mechanics”, but believe me, there’s so much there. To be honest, I didn’t see it either until I started this exercise, but that’s another great thing about data science: it forces you to learn a lot about different domains, and as a generalist it’s something I really love doing. Anyway, before we delve deeper, let’s first do a round of trivia… Did you know that Eskimos (Inuit) are said to have more than 50 different words for snow? That the Maori language has only 15 phonemes (sounds)? That native Chinese speakers reportedly don’t recognize “r” and “l” as two completely different sounds? Or that some Aboriginal nations use cardinal-direction terms (north, south, east and west) instead of left, right, back and forward when referring to location, so it’s perfectly normal to hear “there is a spider on your northeast leg”, but good luck figuring out which leg to shake. As you can see, languages define us. A language is a tool that we shape and that also shapes us; it shapes the way we communicate but also the way we think. Every language is like a Lego set, and even though some pieces are shared, every language uses those pieces in a different way. In any language, each letter or word has its own “personality”. The most obvious trait that letters or words can have is the frequency with which they appear in a language, but there are also many other characteristics of a given language, such as the relative frequency of bigrams and trigrams, the first and last letters of words, the ratio of vowels to consonants, the average word length, the frequencies of “small” words and a lot more. The behavior of letters and words reflects the way a people uses its own language and characterizes that language in a unique way.
Before we proceed there’s an important distinction to be made. Every language has its own vocabulary, i.e. the unique pieces of the language Lego set. In everyday usage, however, some pieces/words are more popular and used more often, so there are two different data sets to analyze (language vocabulary vs. language corpus), or we can say two different analyses altogether. Today we will start from the very beginning and analyze the building blocks of the Croatian language, i.e. the dictionary (vocabulary). Finding a data set for the analysis was much harder than I thought; most dictionaries are not free, and the ones which “are” can only be queried online, so after an hour of googling for a free, digitized version of a Croatian dictionary I was pretty much stuck. However, one of the Google hits took me to the page of university math professor Goran Igaly, who built a comprehensive Croatian (and Croatian-English) dictionary in his free time. So, I take this opportunity to thank prof. Igaly, who compiled this dictionary and generously made it available for free.
After a long introduction, it’s time to get nerdy (lol). Our data set is a Croatian dictionary with more than 600,000 entries in the form of a tab-delimited .txt file. As you will see, the Croatian language is very complicated due to a lot of inflection. By inflection I don’t only mean conjugation, but inflection of the nastiest kind: declension, something native English speakers will probably have a hard time comprehending. As you can see from the picture, in blue we have the root word and in yellow its different variants. For the purpose of this analysis we’ll analyze only root words, because the results of a full dictionary analysis would otherwise be heavily skewed. (Also, in the second part of this post we’ll analyze Croatian in real-world use, so all the word variations and their actual usage will be captured in that analysis.)
As you may have noticed, I’m not a zealot of any programming language but prefer to choose the language based on the task at hand, and in this case, as there will be a lot of parsing and data wrangling, Python appears to be the perfect choice. (Don’t get me wrong, it’s not just data wrangling; Python is great for many, many different applications, and for data science in general.) Anyway, a few lines of code and the picture above becomes the picture below, where we have the original dictionary row/record, the root word and the respective word class. After some data enhancement and filtering (removing variants and leaving only root words), we are left with about 150,000 records in the following format, which is good enough to start asking questions, so let’s start brainstorming and getting creative.
Every word belongs to one of the word classes (or parts of speech), so let’s check what their distribution is in the Croatian vocabulary. As we can see, nouns and adjectives are the core building blocks in our language Lego set, taking 82% of the whole vocabulary, with verbs and adverbs making up an additional 17% and the rest (pronouns, prepositions, etc.) only 0.2%.
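For readers who want to reproduce this kind of breakdown, here is a minimal sketch of how the shares could be computed. The record layout (a list of root word / word class pairs), the class labels and the sample words are my assumptions for illustration; the actual dictionary file may be structured differently.

```python
from collections import Counter

# Hypothetical (root word, word class) records, for illustration only.
records = [
    ("kuća", "noun"),
    ("more", "noun"),
    ("lijep", "adjective"),
    ("brz", "adjective"),
    ("trčati", "verb"),
    ("brzo", "adverb"),
    ("ja", "pronoun"),
    ("na", "preposition"),
]

def word_class_shares(records):
    """Return {word_class: share of the vocabulary in percent}, largest first."""
    counts = Counter(word_class for _, word_class in records)
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.most_common()}

shares = word_class_shares(records)
for cls, pct in shares.items():
    print(f"{cls:12s} {pct:5.1f}%")
```

On the real data the same function would simply be fed the full list of root-word records.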
When it comes to word length within word classes, it appears Croatian has very big Lego blocks. Adjectives are the longest with almost 10 letters on average, with adverbs, verbs and nouns ranging from 9.7 down to 8.7 letters on average.
If we plot the word length histogram we get the following distribution, which can be reasonably well fitted with a shifted Poisson distribution with a mean of 9.22. If we compare this with the English dictionary, which has a mean word length of 6.94, the average word in the Croatian vocabulary is thus a whopping 33% longer… which I would say makes a pretty complicated language even harder. However, we will see in Part 2 of this analysis that the mean word length in use (weighted by relative frequency of use) is much shorter, i.e. we prefer shorter words in everyday use… but enough with the spoilers.
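The two numbers above can be turned into a small sketch. The means 9.22 and 6.94 come from the text; the size of the shift is not stated, so a shift of 1 (every word has at least one letter) is my assumption, and the function name is made up for illustration.

```python
import math

def shifted_poisson_pmf(k, mean, shift=1):
    """P(word length = k) for a Poisson distribution shifted right by `shift`.

    The shift of 1 is an assumption; the Poisson rate is chosen so that
    the mean of the shifted distribution equals `mean`.
    """
    lam = mean - shift
    if k < shift:
        return 0.0
    n = k - shift
    return math.exp(-lam) * lam ** n / math.factorial(n)

croatian_mean = 9.22  # mean dictionary word length, from the text
english_mean = 6.94   # mean English dictionary word length, from the text

# Expected share of words of each length under the fitted model
for length in range(1, 21):
    print(length, round(shifted_poisson_pmf(length, croatian_mean), 4))

# How much longer is the average Croatian dictionary word?
relative_increase = croatian_mean / english_mean - 1
print(f"{relative_increase:.0%}")
```

The last line is just the arithmetic behind the "33% longer" claim: 9.22 / 6.94 is roughly 1.33.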
What we can also do as part of the vocabulary/dictionary analysis is go down a level and see what’s happening on the “microscopic” level of letters. The great Samuel Morse did a similar analysis some 200 years ago, but instead of a dictionary he counted the number of letters in sets of printers’ type. Why? He needed to know letter frequencies so he could give the simplest codes to the most frequently used letters, in order to speed up communication over the telegraph. So can you figure out which letters were most frequent in Morse’s “data set” based on the figure above?
We will do an even more thorough letter-wise analysis on the Croatian language corpus in Part 2, but let’s first check the Croatian dictionary, so word game lovers better pay attention. Without further ado, in alphabetical order…
…or in descending order of frequency:
As we can see, “i” and “a” are far ahead, followed by two other vowel brothers, “o” and “e”, and only then by “n”, “r” and “t”. You are probably wondering what the hell “#”, “$” and “&” are? From the first picture Croatian speakers will easily deduce: & is “dž”, $ is “lj” and # is “nj”. Yeah, we have digraphs that count as individual letters, so I had to replace them with single characters in order to optimize and speed up the code, but now I’m too lazy to change the xlabel ticks… and I’m also protesting this two-letters-is-one-letter craziness. (I told you Croatian is complicated.) For the sake of comparison, an analysis of entries in the Concise Oxford dictionary, ignoring frequency of word use, gives an order of “EARIOTNSLCUDPMHGBFYWKVXZJQ”. We can see similarities: 4 vowels among the top 5 letters in both languages, and the same 3 most frequent consonants, which in Croatian are “n” (5th), “r” (6th) and “t” (7th), while in English they are “r” (3rd), “t” (6th) and “n” (7th).
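The digraph trick can be sketched roughly like this. The placeholder symbols are the ones shown in the charts; the function names and the sample words are hypothetical, and this is a reconstruction of the idea rather than the code actually used for the post.

```python
from collections import Counter

# Placeholder symbols as used in the charts: & = dž, $ = lj, # = nj
DIGRAPHS = {"dž": "&", "lj": "$", "nj": "#"}

def encode(word):
    """Replace each Croatian digraph with its single-character placeholder."""
    word = word.lower()
    for digraph, symbol in DIGRAPHS.items():
        word = word.replace(digraph, symbol)
    return word

def letter_frequencies(words):
    """Return {letter: share in percent}, most frequent first."""
    counts = Counter()
    for word in words:
        counts.update(encode(word))
    total = sum(counts.values())
    return {letter: 100 * n / total for letter, n in counts.most_common()}

# Hypothetical mini-vocabulary
words = ["ljubav", "njuška", "džep", "konj", "polje"]
freqs = letter_frequencies(words)
print(freqs)
```

One caveat of this naive replacement: it also merges letter pairs that only happen to sit next to each other, such as “n” + “j” in “injekcija” or “d” + “ž” in “nadživjeti”. A more careful version would need a list of such exceptions.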
In the picture above we can see that vowels are the most popular letters, which makes sense as they are very common and there are only 5 of them in the Croatian language. But have you ever wondered what the ratio of consonants to vowels is? Well, now you’ll finally be able to sleep in peace: it’s 1.38 consonants to 1 vowel. I suspect that more “melodic” languages, e.g. Italian, have lower consonant-to-vowel ratios than coarser-to-the-ear ones, let’s say German. I would likewise hypothesize that the Dalmatian vocabulary has a lower consonant-to-vowel ratio than, let’s say, the vocabulary of Hrvatsko Zagorje, so if someone has access to dictionaries of different dialects, let’s test this hypothesis. It also makes sense that the letter “i” is #1, considering that most verbs in the infinitive and adverbs in root form end with “i”. Let’s check if this is true.
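A consonant-to-vowel ratio like the one quoted above could be computed along these lines. This is a sketch under a few assumptions of mine: the input contains letters only, the digraphs “dž”, “lj” and “nj” each count as one consonant, and “r” is always counted as a consonant even where it acts as a syllable nucleus.

```python
VOWELS = set("aeiou")
DIGRAPHS = ("dž", "lj", "nj")

def consonant_vowel_ratio(words):
    """Return the number of consonants per vowel across all words."""
    vowels = consonants = 0
    for word in words:
        word = word.lower()
        for digraph in DIGRAPHS:
            # One placeholder character stands for one consonant
            word = word.replace(digraph, "#")
        for ch in word:
            if ch in VOWELS:
                vowels += 1
            else:
                consonants += 1
    return consonants / vowels

# Hypothetical sample: 8 consonants and 4 vowels in total
ratio = consonant_vowel_ratio(["ljubav", "more", "krv"])
print(ratio)
```

Running the same function over dictionaries of different dialects would be one way to test the Dalmatia vs. Zagorje hypothesis.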
As presumed, “i” is by far the most popular word-ending letter in the dictionary, with more than 35% of root words ending with it. When it comes to first letters, it’s consonants that are leading the pack, with “p” taking the lead.
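Counting first and last letters is the easy part. Here is a minimal sketch with made-up sample words; it assumes digraphs have already been replaced by single placeholder characters, so that a word ending in “nj” is counted under one letter.

```python
from collections import Counter

def first_last_shares(words):
    """Return two dicts with the percentage share of each first and last letter."""
    words = [w.lower() for w in words if w]
    first = Counter(w[0] for w in words)
    last = Counter(w[-1] for w in words)
    n = len(words)

    def to_percent(counter):
        return {letter: 100 * c / n for letter, c in counter.most_common()}

    return to_percent(first), to_percent(last)

# Hypothetical sample of root words
first, last = first_last_shares(["pisati", "piti", "plav", "raditi", "kuća"])
print(first)
print(last)
```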
However, the first and last letter distribution is only part of letter “behavior”. What about the letter distribution within the word, from beginning to end? If you can’t visualize what I mean, I’m talking about something like the following picture. We mentioned the letter “p” and how it very often sits at the beginning of a word, and in this picture we can see that and much more. The letter “p” usually finds itself at the beginning of the word and its occurrences rapidly fall off, with almost no presence at the end of words.
Initially this looked like a fun idea to check, but in practice I almost regretted starting this exercise. You know the type of programming problem that forces you to take pen and paper and sketch things? Well, this one was like that. It looks almost trivial, but in practice it was a pain to solve. I decided to divide words into 10 parts/buckets (0%-10%-20% etc.) and count occurrences of letters in each bucket. Well, what about words that have 4 letters? Or words that have 17 letters? Actually, any word that has more or fewer than 10 letters, for that matter. Basically, you end up with something like this.
The word “što” (meaning “what”) has three letters: basically a beginning, a middle and an end. However, in our 10-bucket word each letter stretches over 10/3 ≈ 3.33 buckets, so “š” contributes 1 to each of the first, second and third buckets and 0.33 to the fourth bucket. “t” contributes 0.66 to the fourth bucket, 1 to each of the fifth and sixth buckets and again 0.66 to the seventh bucket. At this point you probably get the gist. There are also words longer than 10 letters, where a letter’s contribution to a single bucket is less than 1. Even after this explanation the exercise probably looks straightforward, but I dare you to give it a go. Also, if you have a better idea for how to calculate letter distributions, give me a call.
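One way to express the bucketing described above is as an overlap between intervals: stretch the word over the range 0 to 10, so that letter number i covers the interval from i·10/len to (i+1)·10/len, and give each bucket the length of its overlap with that interval. The sketch below is my reconstruction of that idea, not the code actually used for the charts, and the function name is made up.

```python
from collections import defaultdict

def letter_bucket_contributions(word, n_buckets=10):
    """Spread each letter of `word` proportionally over `n_buckets` position buckets.

    Returns {letter: [contribution to bucket 0, ..., bucket n_buckets - 1]}.
    Across all letters, the contributions to any single bucket sum to 1.
    """
    width = n_buckets / len(word)          # how many buckets one letter spans
    contributions = defaultdict(lambda: [0.0] * n_buckets)
    for i, letter in enumerate(word):
        start, end = i * width, (i + 1) * width
        for b in range(n_buckets):
            # Length of the overlap between [start, end] and bucket [b, b + 1]
            overlap = min(end, b + 1) - max(start, b)
            if overlap > 0:
                contributions[letter][b] += overlap
    return dict(contributions)

sto = letter_bucket_contributions("što")
print([round(x, 2) for x in sto["š"]])
print([round(x, 2) for x in sto["t"]])
```

For “što” this reproduces the numbers from the example (1, 1, 1, 0.33 for “š” and 0.67, 1, 1, 0.67 for “t”, up to rounding). Summing these per-word vectors over the whole dictionary gives the per-letter position profile. Digraphs would again need to be replaced by single characters first.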
Now, back to the topic. The following graphic represents the letter distribution in Croatian dictionary words. Both columns basically communicate the same data, but in slightly different ways. I wasn’t sure which representation to use so I left both. On the left there’s a more intuitive continuous area plot where color indicates letter frequency, i.e. the brighter the red, the more frequent the letter. In the right column we have something like a histogram, showing letter “counts” for all 10 buckets. After getting the results, I started going through the pictures and I was like “damn, there are more rules and structure and organization in all this than I previously thought”. I had many “aha” moments; I already mentioned “i” as the favorite word ending, but look how “t” precedes it and has a peak right before the end (again, verbs in the infinitive with the ending “-iti”). Look how “č” peaks before “ć” right near the end in words with “-čući”, “-čeći” etc., similar to how “š” appears in words with “-vši” endings. We can see that vowels are more evenly distributed, while consonants range widely in their “behavior”: there are consonants which prefer the beginning of the word, like “f”, “g”, “h” etc., and then there are consonants which fill the middle or near-end of words, like “l”, “r”, “s” and “c”, “n”, “v”, “t” etc. The letter “n” is often found in word endings due to the typical adjective ending “-an” (e.g. “bezidejan”). The letter “ć” is very hard to find at the beginning of a word but it’s very common in word endings, while “p” is the total opposite in its behavior. In general, words usually end with vowels (though interestingly not so much with “u”), but they typically start with consonants, and almost all the vowels rise in frequency between buckets 1 and 2 as they usually complement the starting consonant. It’s interesting how in this regard “r” behaves like a vowel (in Croatian grammar “r” is sometimes even referred to as the 6th vowel, as it takes on the role of a vowel in words that lack one).
These are just a few interesting things I’ve noticed, but analyzing everything would take many more pages in an already very long post, so I leave it to you… and to folks who know letters and words much better than me.
Phew, it’s really time to close this post. I hope you learned something new, but don’t forget, there’s much more to come in Part 2. Part 1 was just(?) an analysis of the Croatian dictionary, but what happens in practice, in everyday usage? Well, that’s a whole other story, so pay us a visit in a few weeks’ time. Analysis of a whole language corpus is a totally different beast; it’s a much more complicated endeavor with a significantly larger data set, but I’m pretty sure it will give us more interesting results and tell us how these building blocks of language are actually used in everyday life, which is what really matters in the end.