Jonas' personal website

13 June 2015

A BEGINNER'S GUIDE TO DATA ANALYSIS WITH UNIX UTILITIES (PART 3)

Part 3 - Use cases

You have read part 1 and part 2 of this series and wonder whether you can actually apply all those new and complicated commands to anything practical? Wonder no more: in this third and last post, I have compiled a list of real-world use cases to spark your inspiration.

Automation of continuous reports

Problem: A revenue report for the team needs to be generated and emailed every week.

Solution: Use mysql, mail and a cron job

$ vim revenue_cron.sh # create a bash script that the cron job can call
#!/bin/bash
report="$HOME/$(date +%y%m%d)_rev_report.tsv" # name the tsv current_date_rev_report.tsv
echo "select * from RevenueStats_TeamFoo;" | mysql some_database -B > "$report" # pull data from mysql
mv "$report" "$HOME/jvdh/Google Drive/rev_report/" # move report into the Google Drive folder, where it will be uploaded automatically
echo "Hi Team,
The latest revenue reporting tsv is ready for download at
Google Drive.
Greetings,
jvdh" | mail -s "marketview report is ready" team@foobar.com
# save the file and exit vim
$ chmod +x revenue_cron.sh # make the script executable so cron can run it
$ crontab -e
# every Thursday at 3am, assuming you saved the script as ~/code/revenue_cron.sh
0 3 * * 4 ~/code/revenue_cron.sh
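
One weakness of a script like this is that the email goes out even if the data pull fails. The following is a minimal sketch of one way to guard against that, not the script from this post: fetch_report is a hypothetical stand-in for the mysql call, and a temporary folder replaces the Google Drive folder so the sketch runs anywhere.

```shell
# fetch_report is a hypothetical stand-in for the mysql query;
# it prints a tiny tab-separated table with made-up values.
fetch_report() {
    printf 'team\trevenue\nfoo\t100\n'
}

# Build the dated file name the same way the report script does,
# e.g. 150613_rev_report.tsv for 13 June 2015.
report_name="$(date +%y%m%d)_rev_report.tsv"

# Assumption: write to a temporary folder instead of Google Drive.
out_dir="$(mktemp -d)"

# Only report success if the pull worked and the file is not empty.
if fetch_report > "$out_dir/$report_name" && [ -s "$out_dir/$report_name" ]; then
    status="report ready: $report_name"
else
    status="report failed"
fi
echo "$status"
```

In a real script, the mail command would go inside the first branch of the if statement, so the team is only notified when a non-empty report exists.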

A BEGINNER'S GUIDE TO DATA ANALYSIS WITH UNIX UTILITIES (PART 2)

Part 2 - Useful tools

While you learned how to survive in a command line environment in the last post, the second part of this series will get you up and running with the tools you need for data analysis. After reading this post, you will not only be able to view and manipulate big data on the command line, but also know how to automate repetitive tasks such as generating reports or sending notification emails!

LESS

Less was developed in the 1980s as an improved version of more, an earlier file viewer. Let's get the obvious out of the way first:

less > more
less is more, more or less
cat is a simple file reader, no more no less

$ wget https://www.gutenberg.org/files/11/11-h/11-h.htm && html2text 11-h.htm > alice.txt && less alice.txt # download Alice in Wonderland, convert it from HTML to text and display the content

/Alice -> searches for Alice in viewed file

n -> searches forward for next occurrence of Alice

N -> searches backward for previous occurrence of Alice
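
The same search can also be run without opening a viewer, which helps when the matches should feed into another command. A minimal sketch, assuming a small made-up sample file in place of alice.txt: grep -n prints each matching line with its line number, roughly the positions that n and N would jump between in less.

```shell
# Create a tiny stand-in for alice.txt (made-up sample lines, not the book text)
sample="$(mktemp)"
printf '%s\n' \
    'Alice follows the rabbit' \
    'The rabbit checks its watch' \
    'Alice falls down the hole' > "$sample"

# List every line containing "Alice", prefixed with its line number
matches="$(grep -n 'Alice' "$sample")"
echo "$matches"

# Count the lines that contain "Alice"
count="$(grep -c 'Alice' "$sample")"
echo "$count"
```

To jump straight to the first match inside the viewer instead, less can be started with a search, as in less +/Alice alice.txt.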

A BEGINNER'S GUIDE TO DATA ANALYSIS WITH UNIX UTILITIES (PART 1)

Part 1 - Unix Basics

Imagine your goal is to view or manipulate large quantities of data. Google Spreadsheets is crashing, and so is Excel. What do you do? A demon from an ancient world might offer itself as your sword to slash the Gordian knot of big data: Unix utilities. They are versatile and handle large files much faster than Excel. The best thing? They are not only free but, if you are running Linux or OS X, already installed on your computer.

However, the price you pay is the time spent learning how to use these initially unintuitive and badly documented tools. The first part of this blog post series aims to help you take the first step on a journey through the Unix utilities (which you may never want to end). Afterwards, we are going to explore powerful data manipulation tools in part 2 and a few use cases in part 3. If you already know your way around the command line and only want to check out the data science applications, I suggest you skip to part 2 or part 3 right away.

Let's start with a quick history:

Unix was first developed by Ken Thompson at Bell Labs in 1969, in part because he wanted an operating system that allowed him to play "Space Travel" on a cast-off PDP-7. It has since evolved and now underlies most operating systems, the prominent exception being Windows. Both Mac OS and Linux follow the Unix philosophy and include many Unix utilities.

Now, let's move on to something practical: