First post… shit… a bit under pressure, gotta make this interesting and cool, so why not open the blog with the most uninteresting part of the analytics process – data collection and preparation. It's a boring and painful ordeal most of the time, far removed from the spotlight but basically the foundation for everything else, because – no data, no party. However, this time it was elegant, it was cool and even easier than coding a web crawler in Python… but before we delve into technicalities, let's introduce our subject of interest.
This post is the first episode in the “Movie Marathon” trilogy (?), i.e. an analysis of the movie industry. However, my obsession with movies started some 25 years ago when I was a kid. Those were the dark ages before the internet, before streaming services, YouTube and HD – just good old-fashioned VCRs and video cassettes. In Croatia it was also a time of war and a lot of blood on TV, so there were rules: movies were OK, but standard channels (we had only 3 of them, lol) and the news programme were not permitted. My father’s accounting company (looks like a love for number crunching is a hereditary trait) did the books for the local theater & video store, so I basically had an endless supply of movies… well, not really endless – at some point I had seen everything there was to see, but I guess I had a short memory and no problem re-watching the classics… and when I say classics I mean classics: Rambo, American Ninja (deeply underrated! 😀 ), Delta Force, Kickboxer etc. It was all about school, football and movies – not necessarily in that order though.
So when I decided to start a blog and write about my free-time analytics “projects”, a logical question emerged – what to pick for the first post? Well, why not scrape the whole IMDB database and try to do some movie analytics… and statistically prove that American Ninja and Michael Dudikoff were Oscar-worthy? Well, why not!
But how to parse the whole IMDB and, most importantly, what would be my weapon of choice? This time it wasn’t one of the usual suspects (Python/Java frameworks, freeware tools or browser add-ons); I decided to check out all this robot hype going on lately and opted for UiPath (this is not product placement – not a paid one, unfortunately). UiPath is a free Robotic Process Automation (RPA) tool for automating web, desktop or virtual environment applications. It’s built on the .NET framework, so all of you who have worked in Microsoft Visual Studio will find it familiar. You’ll also be able to use all the methods and functions that come with .NET (with IntelliSense code completion), invoke and execute code etc… but most of the time automation modelling is done by building workflows through user-friendly drag-and-drop of activities, without scripting or coding, much like in some BPM tools. People with no IT/programming background can also achieve (some) results really quickly with the built-in recorders/wizards that record and replay user actions, but I definitely recommend the training UiPath provides for a better understanding of the tool. Training and certification are free, as is the community version of the tool, which is a huge plus compared with the competition. (I initially wanted to check out Blue Prism but didn’t get far.) There is also a large community for support, so all things considered, starting the RPA journey with UiPath is really made easy. Anyway, there is a lot you can do with UiPath (check it out), but this time I decided to test its web scraping capabilities.
The plan of attack was the following: go to the IMDB Advanced Title Search, search for English-language feature films released in 2017 and scrape it all – all 6,114 titles across all pages, and every useful element/piece of info that can be scraped: title, year, content rating, length, genre, IMDB score, Metascore, number of votes, director, actors, financial data etc. I won’t go into too much detail, but with the data scraping functionality UiPath Studio provides, this thing was set and ready to go in minutes… OK, a bit of trial and error while the data was structured to my liking, but it was almost effortless – even the pagination problem was elegantly solved. Basically, all it took to get the first results were 3 activities – think of them as the building blocks of automation workflows – and after a little bit of customization Mr. Robot v1.0 was ready to go.
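For those more at home with code than with drag-and-drop activities, the pagination the robot has to walk through can be sketched in a few lines of Python. This is just an illustration of the idea, not UiPath’s internals – and the URL parameters below (`year`, `start`, 50 results per page) are assumptions modelled on IMDB’s classic advanced search, not guaranteed to match the current site:

```python
# Sketch of the pagination logic behind scraping all 6,114 titles.
# Assumes 50 results per page and a 1-based "start" offset in the URL;
# both are illustrative, not confirmed against IMDB's current scheme.

def page_offsets(total_titles, per_page=50):
    """Return the 1-based 'start' offset for every results page."""
    return list(range(1, total_titles + 1, per_page))

BASE = ("https://www.imdb.com/search/title"
        "?title_type=feature&year=2017,2017&languages=en")

offsets = page_offsets(6114)                     # the 6,114 titles from 2017
urls = [f"{BASE}&start={o}" for o in offsets]

print(len(urls))     # 123 pages to visit
print(urls[0])       # ...&start=1
print(urls[-1])      # ...&start=6101
```

In UiPath the same thing is handled by the Data Scraping wizard’s “next page” selector rather than by computing offsets, but the arithmetic above is what the wizard is doing for you under the hood.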
A minute after pushing play – shazam! An Excel file with the first 300 results was there. Test passed. Well done, Mr. Robot v1.0, but now it’s time to do the proper heavy lifting – let’s iterate from 2017 back to 1993 and scrape the hell out of this webpage. This required a few more tweaks – I had to insert a loop for “time travel” and do a little bit of string parsing/manipulation on the URL and selectors (they store the attributes of GUI elements in the shape of an XML fragment) so Mr. Robot v2.0 wouldn’t get confused when he visited a new page. Anyway, very soon Mr. Robot v2.0 was ready and raring to go, so it was time to let him loose and go to bed.
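The “time travel” tweak is essentially one loop over the release years plus a bit of string formatting of the search URL. A rough Python equivalent of what the workflow loop does (parameter names again illustrative, not IMDB-verified) might look like this:

```python
# Rough equivalent of the "time travel" loop: one search URL per release
# year, from 2017 back to 1993. The "year" parameter is an assumption
# modelled on IMDB's classic advanced search URL, not the current one.

TEMPLATE = ("https://www.imdb.com/search/title"
            "?title_type=feature&year={y},{y}&languages=en")

year_urls = {year: TEMPLATE.format(y=year) for year in range(2017, 1992, -1)}

for year, url in year_urls.items():
    # in the real workflow this is where the robot scrapes the page
    # and writes one Excel file per year
    print(year, url)
```

The selector manipulation mentioned above is the UiPath-specific half of the trick: the XML fragment identifying the results table contains page-specific attributes, so parts of it have to be wildcarded or rebuilt per iteration, much like the URL is here.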
I had a weird dream that machines took over the world, but when I woke up in the morning everything was fine and Mr. Robot v2.0 had done a splendid job. 25 Excel files, 25 years of movie data, in total 70 thousand movie titles and related data – a f***** movie bonanza. Now, a bit more tweaking so the robot consolidates all these Excel files, and that’s it. Writing this with some delay and looking back, I could probably have done the whole thing even more elegantly, but I have a thing for quick, dirty and (barely) functioning code. In the end it wasn’t that ugly: I split my workflow into smaller processes and invoked them from the Main workflow… so it was all by the book, good old modular-design principles.
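The consolidation step is conceptually simple: read each per-year file and stack the rows into one table, stamping every row with its source year. Here is a minimal stdlib-only Python sketch of that idea – the sample data is made up for illustration, and in practice you would load the 25 Excel files with something like pandas’ `read_excel` and stack them with `concat`:

```python
# Minimal sketch of consolidating per-year result sets into one table.
# The "files" here are plain lists of dicts; in the real pipeline each
# list would come from one of the 25 per-year Excel files.

def consolidate(per_year):
    """Flatten {year: [rows]} into one list, stamping each row's year."""
    combined = []
    for year in sorted(per_year):
        for row in per_year[year]:
            combined.append({**row, "year": year})
    return combined

# hypothetical sample standing in for two of the scraped files
sample = {
    2017: [{"title": "Logan"}, {"title": "Dunkirk"}],
    1993: [{"title": "Jurassic Park"}],
}
rows = consolidate(sample)
print(len(rows))     # 3
print(rows[0])       # {'title': 'Jurassic Park', 'year': 1993}
```

In the actual workflow this whole function collapses into a loop of Read Range / Append Range activities, which is exactly the kind of small invoked sub-process mentioned above.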
But what matters most after this whole exercise is that we acquired the data with a minimal amount of effort and nerves spent… and rest assured, this is an anomaly, because according to rough estimates, data scientists spend somewhere between 50 and 80 percent of their time collecting and preparing data before any useful nugget of insight can be extracted. True, this data set is still not fully prepared and there will probably be some more data collecting (that’s why I decided to extract the URL of each particular movie) to augment the existing data set, but all in all, this was really painless. When people talk about analytics projects they very often skip the data collection/preparation steps; I know, it’s mundane and people would rather talk about their fancy models and big insights (me too!), but having quality data is as important as having sound models, so I decided not to fake it and to start from the beginning, even if it might not be so sexy. Another reason I wanted to talk about data collection in this case is that I tested UiPath as part of this exercise and it held up its end of the bargain, so hopefully some of you will find this post useful from a technology perspective. Anyway, now it’s time to analyze the data, so stay tuned – Movie Marathon Episodes 2 & 3 are in the making (this is a lie, I didn’t really think this far ahead 😀 ). That’s all folks, hope you enjoyed it!