by Rahul Kumar

At the beginning of this year, I set a goal to learn more about Linux and how to become better at using the terminal. I then stumbled on a blog post from Netflix on analyzing Linux performance and decided to try something similar.

What is my goal?

Given some data, what are quick stats that I can put together, using only the terminal, without the help of SQL or the Python Pandas library? I decided to get hold of some Washington DC Metro Area Capital Bikeshare data and find out what insights we could extract using terminal commands.

Here’s what I’m trying to answer:

  1. What are the busiest days, months, and bike share docking stations?
  2. What are the longest ride durations?
  3. What are the usage patterns among members and casual riders?

Data Preparation

To begin, I downloaded all the data from Capital Bikeshare’s system data repository in ZIP format. I unzipped them and put them in a single folder where I will run all the terminal commands. Before we can do any analysis, we have some preparation to do with the data.

Our starting data is a list of CSV files with one for each month.

ls -ltr

Using the head command, we can see what the header of each CSV looks like.

head -1 filename.csv

For preparation, we will do the following:

  1. Concatenate all 12 CSV files into one big file (with just one header row)
  2. Remove the quotation marks (") from the values (we don't really need them)

Concatenating the Data

Running two awk commands and piping the second one's output through the stream editor (sed) does our cleaning work, creating one file which we will use for our analysis. Supplying the condition FNR > 1 to the first awk command skips the header line of every CSV file as they are appended. The BEGIN block of the second awk command then prints a single header row back at the top of the output file.

awk 'FNR > 1' *.csv > consolidated.csv && awk -F, 'BEGIN { print "Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type" } { print $0 }' OFS=, consolidated.csv | sed 's/"//g' > clean.csv
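To see why FNR > 1 strips every header, here is a minimal sketch on two hypothetical toy files (jan.csv and feb.csv are made-up names): FNR resets to 1 at the start of each input file, while NR keeps counting across all files.

```shell
# Two tiny CSVs in a temp dir, each with its own header row
tmp=$(mktemp -d)
printf 'Duration,Member type\n100,Member\n' > "$tmp/jan.csv"
printf 'Duration,Member type\n200,Casual\n' > "$tmp/feb.csv"

# FNR resets to 1 for each input file, so 'FNR > 1' skips the
# first (header) line of every file, not just the first file
awk 'FNR > 1' "$tmp"/*.csv
```

Only the two data rows come out; both header lines are dropped.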

Let’s ensure our new clean.csv has the header that we want, and it does! If you want to verify that it captured all the rows, you can run wc -l on each monthly file, sum the counts, and check that the total matches the wc -l of the final file (remembering that each monthly file contributes one header line that was dropped, and that clean.csv has one header added back).

head -2 clean.csv
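One way to run that wc -l sanity check, sketched on two hypothetical monthly files (the file names and rows are made up): the grand total of the monthly line counts, minus one header per file, plus the single header we re-added, should equal the line count of clean.csv.

```shell
tmp=$(mktemp -d)
printf 'Duration,Member type\n100,Member\n300,Member\n' > "$tmp/2018-01.csv"
printf 'Duration,Member type\n200,Casual\n'             > "$tmp/2018-02.csv"

# Equivalent two-step preparation as in the article, on the toy files
awk 'FNR > 1' "$tmp"/2018-*.csv > "$tmp/consolidated.csv"
{ printf 'Duration,Member type\n'; cat "$tmp/consolidated.csv"; } > "$tmp/clean.csv"

total=$(wc -l "$tmp"/2018-*.csv | awk 'END { print $1 }')  # grand total, headers included
nfiles=$(ls "$tmp"/2018-*.csv | wc -l)

# data rows + the one header we re-added; should match wc -l of clean.csv
echo $(( total - nfiles + 1 ))
wc -l < "$tmp/clean.csv"
```

The two printed numbers should be identical.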

Working with the Data

Now that our data is prepped, let’s begin answering some questions.

Who rode more (hours)? Members vs Casual

There are two kinds of riders: Members and Casual. We will use our clean CSV file and the awk program to aggregate the total hours ridden by each membership type.

awk -F, '$9 == "Casual" { duration = duration + $1; } END {print duration/3600,"hrs";}' clean.csv

awk -F, '$9 == "Member" { duration = duration + $1; } END {print duration/3600,"hrs";}' clean.csv
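The two commands above scan the file twice; an awk associative array can do both aggregations in a single pass, keyed on the Member type column. A minimal sketch on an inline sample with the same nine-column layout (the rows are made up):

```shell
sample=$(mktemp)
printf '%s\n' \
  'Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type' \
  '3600,2018-01-01 09:00:00,2018-01-01 10:00:00,1,A,2,B,W001,Member' \
  '7200,2018-01-02 09:00:00,2018-01-02 11:00:00,1,A,2,B,W002,Member' \
  '1800,2018-01-03 09:00:00,2018-01-03 09:30:00,2,B,1,A,W003,Casual' > "$sample"

# One pass: accumulate duration (field 1) into a bucket keyed by member type (field 9)
awk -F, 'NR > 1 { secs[$9] += $1 } END { for (t in secs) print t, secs[t]/3600, "hrs" }' "$sample"
```

On the real clean.csv, the same one-liner replaces both of the commands above.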

Members seem to have ridden many more hours than casual riders.

Which month had the most number of rides?

Since each row in each monthly CSV is one ride, we can run the wc program on all of the CSV files and sort the row counts to find the winner. The awk at the end filters out the grand total line and the combined files we created earlier, leaving only the monthly files.

wc -l *.csv | sort -r -n | awk '$2 !~ /total|consolidated|clean/ { print }'
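If the file names can’t be relied on, the same ranking can be computed from clean.csv itself by splitting the month out of the Start date column (field 2, assumed to be in YYYY-MM-DD HH:MM:SS form). A sketch on an inline sample (rows made up):

```shell
sample=$(mktemp)
printf '%s\n' \
  'Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type' \
  '600,2018-07-01 10:00:00,2018-07-01 10:10:00,1,A,2,B,W001,Member' \
  '700,2018-07-15 10:00:00,2018-07-15 10:12:00,1,A,2,B,W002,Casual' \
  '800,2018-01-05 10:00:00,2018-01-05 10:13:00,2,B,1,A,W003,Member' > "$sample"

# split($2, d, "-") breaks "2018-07-01 ..." on "-", so d[2] is the month
awk -F, 'NR > 1 { split($2, d, "-"); rides[d[2]]++ } END { for (m in rides) print rides[m], m }' "$sample" | sort -n -r
```

Here July ("07") comes out on top with 2 rides.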

July and August are the clear winners, while January and December trail. Warmer weather clearly plays a role here: ridership falls as we head into the winter months.

What was the longest duration for which a bike remained in use?

Since we are looking to answer this question regardless of month or membership type, we can use the clean CSV file and sort on the duration column to get the top row. We will use echo together with command substitution to print a sentence: the backticks substitute the command’s result into the echo statement, giving us a readable statement on the terminal.

echo "The Longest duration for which a bike was in use is `sort -k1 -n -r clean.csv | head -1 | awk '{ print $1/3600,"hrs"}'`"
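Durations in this dataset are in seconds, so dividing by 3600 gives a decimal number of hours. If a friendlier hh:mm:ss form is wanted, shell arithmetic can do the conversion (86290 here is a made-up example value, not the actual winner):

```shell
secs=86290  # hypothetical duration in seconds
printf '%02d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
# 23:58:10
```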

Looks like someone used the bike for an entire day. We can take a closer look at the top row to find out where it was used.

sort -k1 -n -r clean.csv | head -1

The bike was taken out at the Taft and E Gude Dr and docked back at the Rockville Metro West by a Member on Jan 8th 2018.

Which are the top 3 busiest stations from where bikes get taken out the most?

Running the sort and uniq commands on the 5th column (Start station name) gets us the answer. When passed the -c flag, the uniq command prefixes each distinct value (Start station name in our case) with a count of its occurrences. Using the head command, we can then grab the top 3 rows.

awk -F, '{ print $5 }' clean.csv | sort | uniq -c | sort -k1 -r -n | head -3

The National Mall area clearly seems to be the most popular place to start a ride.

Which are the top 3 busiest stations from where bikes get returned to the most?

To get the answer, we can run the same sequence of commands as in the question above, but on the End station name column instead.

awk -F, '{ print $7 }' clean.csv | sort | uniq -c | sort -k1 -r -n | head -3

The very 3 stations where the most bikes get taken out are also the ones where the most bikes get docked back.

What are the top 10 busiest days (by number of rides) for 2018?

The Start date column is a date-time column, so we have to extract just the date part and then find the dates with the most rows in our clean CSV file. Using awk, we can print the Start date column, split off the time portion, and pipe the dates through sort and uniq -c to count them.

awk -F, 'NR > 1{ print $2 }' clean.csv | awk '{ print $1}' | sort | uniq -c | sort -k1 -r -n | head -10
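The date extraction can also be folded into a single awk pass, in the spirit of the month count earlier, using split on the space between date and time. A sketch on an inline sample (rows made up):

```shell
sample=$(mktemp)
printf '%s\n' \
  'Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type' \
  '600,2018-04-14 10:00:00,2018-04-14 10:10:00,1,A,2,B,W001,Member' \
  '700,2018-04-14 11:00:00,2018-04-14 11:12:00,1,A,2,B,W002,Casual' \
  '800,2018-12-03 10:00:00,2018-12-03 10:13:00,2,B,1,A,W003,Member' > "$sample"

# split($2, dt, " ") separates the date from the time; dt[1] is the date
awk -F, 'NR > 1 { split($2, dt, " "); days[dt[1]]++ } END { for (d in days) print days[d], d }' "$sample" | sort -n -r | head -10
```

In this toy data, 2018-04-14 comes out on top with 2 rides.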

Although July and August were the busiest months, April 14th was the busiest day with just over 19000 rides recorded in a single day. December 3rd, surprisingly, takes the 3rd spot in that list.

Where am I in this data?

I am a bike share member, and while customers are anonymized in the bike share data, I was still curious to track the ride activity on the route I take to work in the morning. A string match on the Start station and End station name columns using the awk command will show the ride history between those stations for 2018. I am somewhere in there.

awk -F, '$5 == "Henry Bacon Dr & Lincoln Memorial Circle NW" && $7 == "Market Square / King St & Royal St" { print $0 }' clean.csv
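To get just the number of rides on the route rather than every row, awk can count the matches itself. A sketch on an inline sample with two matching rides (the rows are made up):

```shell
sample=$(mktemp)
printf '%s\n' \
  'Duration,Start date,End date,Start station number,Start station,End station number,End station,Bike number,Member type' \
  '600,2018-06-01 08:00:00,2018-06-01 08:40:00,1,Henry Bacon Dr & Lincoln Memorial Circle NW,2,Market Square / King St & Royal St,W001,Member' \
  '700,2018-06-02 08:00:00,2018-06-02 08:45:00,1,Henry Bacon Dr & Lincoln Memorial Circle NW,2,Market Square / King St & Royal St,W002,Member' \
  '800,2018-06-03 08:00:00,2018-06-03 08:20:00,3,Gravelly Point,2,Market Square / King St & Royal St,W003,Casual' > "$sample"

# n++ counts matching rows; n + 0 prints 0 instead of a blank when nothing matches
awk -F, '$5 == "Henry Bacon Dr & Lincoln Memorial Circle NW" && $7 == "Market Square / King St & Royal St" { n++ } END { print n + 0 }' "$sample"
# 2
```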

A new station (Gravelly Point) was added in December 2018 on that route so I will be curious to track its usage in 2019 when more data becomes available for that station.

Bike share station names are available on the system map to do more analysis between stations if required.

Lastly, all the commands used in the examples above can be accessed from the GitHub repo.

Final thoughts

The examples above illustrate some of the ways we can do quick, basic data analysis and generate useful insights simply by using UNIX programs. For further reading, there is an excellent book by Jeroen Janssens (Data Science at the Command Line, O’Reilly) which covers more terminal utilities and their application to data science; it is highly recommended for gaining more experience with the subject.