(For Beginners) Make Data Analysis simple and quick with just Python’s Pandas

Seungjun (Josh) Kim
5 min read · Jan 21, 2020


We spend so much time performing analysis, building complicated models and tuning parameters for neural networks. But oftentimes, many of the questions we want to answer can be tackled with simple queries in SQL or pandas, without such complicated models. For simple insights, Python’s pandas package is enough. I use the Golden Globe Awards data as an example data set to illustrate my points.

The data set is from Kaggle, one of the most popular data science community and competition platforms for data analysts and scientists. I definitely recommend immersing yourself in this platform, as there are so many useful resources and notebooks available. The data set is available here.

The Golden Globe Awards data set records all nominations from 1944 to 2020 and whether each of them won the award.

You first import all the necessary libraries and read in the data.

import numpy as np
import pandas as pd
df = pd.read_csv("/kaggle/input/golden-globe-awards/golden_globe_awards.csv")

Next, you look at the first 5 rows of the data just to see what it looks like.

df.head()

There are two things you need to check before you perform any data analysis: data types and missing values.

df.dtypes

The .dtypes attribute in pandas shows the data type of each of the features (columns). It is an attribute rather than a function, so you write it without parentheses.
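If you do not have the Kaggle file at hand, here is a minimal sketch of the same check on a tiny made-up DataFrame. The rows are invented for illustration; only the column names (year_award, nominee, win) follow the ones used in this post.

```python
import pandas as pd

# Made-up rows for illustration only (not the real Kaggle data).
toy = pd.DataFrame({
    "year_award": [1944, 1944, 2020],
    "nominee": ["A", "B", "C"],
    "win": [True, False, True],
})

# dtypes is an attribute, so there are no parentheses after it.
print(toy.dtypes)

# You can also check the type of a single column.
print(toy["win"].dtype)
```

Checking types early matters because a column that looks numeric may actually be stored as text, and many operations behave differently on the two.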

df.isnull().any()
Yes or No on whether there are missing values in each column

.isnull().any() shows whether there are any missing values in each of the columns. If there are, it returns True for that column. You can fill in missing values in various ways, but let’s just fill in the missing film titles with the word “Unknown”.

df['film'] = df['film'].fillna('Unknown')
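As a rough sketch of the same idea on made-up data, you can also count how many values are missing per column with .isnull().sum() before filling them. The toy rows below are invented. Assigning the filled column back, instead of relying on inplace=True, is a choice I am assuming here because it behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; two film titles are missing.
toy = pd.DataFrame({
    "nominee": ["A", "B", "C", "D"],
    "film": ["Film One", np.nan, "Film Two", np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Fill the missing film titles and assign the result back to the column.
toy["film"] = toy["film"].fillna("Unknown")
print(toy)
```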

Now, on to the data analysis. There are some first-level questions that we can answer with quick analysis. For example, which actor or actress has won the most awards over the past 60 years? How about for a certain category (e.g. best performance in a leading role)? We don’t need convoluted neural networks or machine learning for these questions. This is where pandas operations like “group by” come in really handy.

Group by is an operation in pandas (the groupby method) that allows you to calculate or perform some operations on a per-group basis.

How many awards were given out each year?

win_num_by_year = df[df.win==True].groupby('year_award').win.count().to_frame()

You are grouping by “the year” and counting how many wins (i.e. number of awards given) there were for each year.
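To see what the group-by step is doing, here is a small sketch on invented rows. It also shows a variant I am adding for comparison: summing the boolean win column per group gives the same counts without filtering first.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "year_award": [2019, 2019, 2019, 2020, 2020],
    "nominee": ["A", "B", "C", "D", "E"],
    "win": [True, False, True, True, False],
})

# Keep only the winning rows, then count rows per year.
winners = toy[toy["win"]]
wins_per_year = winners.groupby("year_award")["win"].count()

# Alternative: True counts as 1 and False as 0, so sum() counts wins.
wins_per_year_alt = toy.groupby("year_award")["win"].sum()

print(wins_per_year)
print(wins_per_year_alt)
```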

Who are the top 3 actors/actresses who won the most golden globes?

df[df.win==True].groupby('nominee').count().sort_values('win', ascending=False).head(3)

In this case, you are grouping by “nominee” names and counting how many of each of those nominees won the awards and finally displaying the top 3 of them after sorting the list in decreasing order.
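For a "top N" question like this one, value_counts is a handy shortcut, since it counts and sorts in decreasing order in one step. The sketch below uses invented rows and is an alternative to the group-by version, not what the notebook does.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "nominee": ["A", "B", "A", "C", "A", "B", "C", "B"],
    "win": [True, True, True, True, True, True, False, False],
})

# Select the nominee names of winning rows, count each name,
# and keep the two most frequent.
top = toy.loc[toy["win"], "nominee"].value_counts().head(2)
print(top)
```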

Which categories have the highest probability of winning golden globes once you get nominated?

df.groupby('category').win.apply(lambda x: sum(x==True)*100/x.count()).to_frame().sort_values('win',ascending=False).head(20)

The .apply method is really helpful if you want to “apply” a certain formula to a column or several columns. We used a “lambda function”, which lets us define a small custom function inline and pass it to “apply”. In this case, we wanted to calculate the percentage of nominees who actually received the award (i.e. win == True), grouped by “category”.
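Here is a small sketch of that calculation on invented rows. It also compares the lambda with a shorter variant that I am adding as an assumption about style: because True counts as 1 and False as 0, the mean of the win column is already the win rate.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "category": ["Picture", "Picture", "Picture", "Picture", "Score", "Score"],
    "win": [True, False, False, False, True, False],
})

# Same formula as in the post: wins * 100 / nominations, per category.
win_rate_apply = toy.groupby("category")["win"].apply(
    lambda x: sum(x == True) * 100 / x.count()
)

# Shorter variant: the mean of a boolean column is the share of True values.
win_rate_mean = toy.groupby("category")["win"].mean() * 100

print(win_rate_apply)
print(win_rate_mean)
```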

There are some categories where you win the award for sure once you get nominated (e.g. Hollywood citizens award, New Foreign Star Of The Year — Actor etc.). Categories such as Actor / Actress In A Leading Role, Picture and Cinematography have pretty high probability of winning once you get nominated (> 70%).

Which film earned the most awards?

df[df.win==True].groupby('film').win.count().to_frame().sort_values('win').tail(10)

Who won the Supporting Role in any Motion Picture awards the most?

Oftentimes, the importance of supporting roles is overlooked.

# sum() adds up the wins per nominee; count() would count every nomination, won or not
df[df.category=='Best Performance by an Actress in a Supporting Role in any Motion Picture'].groupby('nominee').\
win.sum().to_frame().sort_values('win').tail(10)
df[df.category=='Best Performance by an Actor in a Supporting Role in any Motion Picture'].groupby('nominee').\
win.sum().to_frame().sort_values('win').tail(10)

A small tip about the code above: a backslash “\” at the end of a line lets you continue a long statement on the next line. If you leave out the “\” and just go to the next line, that long line of code will cause an error.
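Another common way to split a long chain across lines is to wrap the whole expression in parentheses, which needs no backslashes. The sketch below, on invented rows and with a made-up category label, shows that both styles give the same result.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "category": ["Supporting", "Supporting", "Supporting", "Leading"],
    "nominee": ["A", "B", "A", "C"],
    "win": [True, False, True, True],
})

# Style 1: backslash at the end of the line.
with_backslash = toy[toy.category == "Supporting"].groupby("nominee").\
    win.sum()

# Style 2: parentheses around the whole expression.
with_parentheses = (
    toy[toy.category == "Supporting"]
    .groupby("nominee")
    .win.sum()
)

print(with_backslash)
print(with_parentheses)
```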

Is there any correlation between length of title of film and its probability of winning awards?

Just out of curiosity (but expecting correlation to be very weak)

# Get the word count of each film title
df['film_word_count'] = df.film.str.split(" ").str.len()

# Replace True/False with 1/0 and assign the result back to the column
df['win'] = df.win.replace({True: 1, False: 0}).astype(int)
df[['win','film_word_count']].corr().iloc[0,1]

Anything after a hash sign “#” on a line of code is not run; it acts as a “comment”. Using the hash sign, you can write down explanations of the code or notes to yourself without having to worry about that text being run and causing errors.

df.column_name.replace allows you to replace certain values in the column with other specified values. Here we pass a dictionary that maps each old value to its new value. In this case, we are replacing the boolean values (True or False) with the integers 1 and 0 (so that we can compute a correlation).

df.corr() is the method that calculates pairwise correlation (Pearson by default) between numerical columns. In this case, I selected only the columns “win” and “film_word_count” to calculate the correlation between those two columns. The correlation, as we expected, is very weak (0.009864149948747625).
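The whole word-count and correlation step can be sketched end to end on a few invented film titles. The titles and outcomes below are made up, so the resulting number says nothing about the real data. The sketch only shows the mechanics, plus Series.corr as a second way to get the same value.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "film": ["Up", "The Long Road Home", "Big Night", "A Very Long Film Title Indeed"],
    "win": [True, False, True, False],
})

# Word count of each title: split on spaces, then take the list length.
toy["film_word_count"] = toy["film"].str.split(" ").str.len()

# Turn True/False into 1/0 so the column is numeric.
toy["win"] = toy["win"].astype(int)

# Correlation from the 2x2 correlation matrix ...
corr_from_matrix = toy[["win", "film_word_count"]].corr().iloc[0, 1]

# ... and the same value computed directly between two columns.
corr_direct = toy["win"].corr(toy["film_word_count"])

print(corr_from_matrix, corr_direct)
```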

The full notebook can be found here.

More useful pandas tips and techniques can be found in this wonderful post by “python10pm”, a Kaggler.

Please consider clicking the clap button if you found this post useful! Thank you!

Seungjun (Josh) Kim

LinkedIn | Github | Website
