First Data Science Project — The Tools

Htoo Latt
5 min read · Sep 19, 2020

In this blog post, I will be talking about the first data science project I completed. I will discuss and explain some of the libraries I used and the functions, methods, and visualization tools I found most useful. I will also explain the reasoning behind some of the decisions I made.

The Project and the Data

The purpose of the project is to analyze several film industry datasets so that a brand-new film production company can decide what sort of movie to produce.

For this project, I was given several datasets from IMDB, RottenTomatoes, and Box Office Mojo. I chose to work only with the given data and not do any web scraping or API calls. After looking through and investigating all the given datasets, it became apparent that there were some limitations to the analysis I could do.

The Datasets' Problems and Limitations

  • The IMDB datasets (the main database I will be using) only go back to 2010
  • Across the data from different sources, there are no unique identifiers, so movie titles had to be used to join the different tables; therefore, movies with duplicate titles had to be removed
  • The datasets from RottenTomatoes do not even include movie titles, so they cannot be used in tandem with data from other sources.

The Tools

For this project, I utilized the pandas library to manipulate the data, the numpy library to do some calculations, and the seaborn library to visualize the data.

In the initial data exploration, the methods that I found the most useful were .isna().sum() and .drop().

.isna().sum() allowed me to find out how many entries are missing from the columns I needed to use, and when used in tandem with the .shape attribute, I could find out whether it was feasible to drop all the rows with missing entries.
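As a minimal sketch of that check, using a small made-up dataframe standing in for the IMDB ratings table:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the IMDB ratings table
df = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "averagerating": [7.1, np.nan, 6.3, np.nan],
    "runtime_minutes": [110, 95, np.nan, 102],
})

print(df.isna().sum())   # missing entries per column
print(df.shape)          # (rows, columns) -- note: shape is an attribute, not a method
```

If the missing count is small relative to the row count from `.shape`, dropping those rows costs little data.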

imdbjoinratings.drop(index=imdbjoinratings[imdbjoinratings['averagerating'].isna()].index, inplace=True)

This allowed me to avoid specifying an axis and to use a conditional to drop rows in one line of code.

Furthermore, I converted currency values given as strings containing dollar signs and commas into float values by using the .apply(), .strip(), .replace(), and .astype(float) methods together with a lambda function.

budgetdf['worldwide_gross'] = budgetdf['worldwide_gross'].apply(lambda x: x.strip('$')).apply(lambda x: x.replace(',', '')).astype(float)

.apply() is a pandas method used on a single dataframe column or Series; it applies the given function to every entry in the Series. In the example above, it is used in conjunction with a lambda function, which is an unnamed function that can be defined on the fly.

In the above example, I used .apply() and a lambda function twice: the first time to remove the dollar signs from the currency string using the .strip() method, and the second time to remove the commas with the .replace() method. The reason I did not use .strip() both times is that .strip() only removes characters from the ends of a string, and the commas sit in the middle.

Finally, I used .astype(float) to convert the digits-only string into a float, which can then be manipulated mathematically.
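The same conversion can also be written without lambdas, using pandas' vectorized `.str` accessor. A minimal sketch with made-up figures:

```python
import pandas as pd

# Hypothetical budget dataframe with currency strings
budgetdf = pd.DataFrame({"worldwide_gross": ["$1,234,567", "$89,000"]})

# .str.strip removes the leading dollar sign, .str.replace drops the
# interior commas, and .astype(float) finishes the conversion.
budgetdf["worldwide_gross"] = (
    budgetdf["worldwide_gross"]
    .str.strip("$")
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(budgetdf["worldwide_gross"].tolist())  # [1234567.0, 89000.0]
```

Both versions produce the same result; the `.str` chain just avoids calling a Python lambda per row.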

Scatter Matrix, lmplot, and the correlation coefficient

pd.plotting.scatter_matrix(imdbjoinratings[['averagerating', 'production_budget', 'worldwide_gross', 'runtime_minutes']], alpha=0.2, figsize=(10,10));

For the initial data exploration, I utilized pandas' scatter matrix to visualize the relationships between different columns. I found this to be an extremely useful tool that let me see which connections deserved a closer look. From the graph above, we can see that there might be a linear relationship between budget and gross.

In order to further explore these relationships, I made use of the numpy corrcoef function and seaborn's lmplot.

lmplot is a visualization tool from seaborn that adds a best-fit straight line to a scatterplot of the data. This allows the audience to easily see whether there is a strong linear relationship between the two factors. To find out how strong the relationship is, I also calculated the correlation coefficient.
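A minimal sketch of an lmplot call, using synthetic budget/gross data (the column names mirror the project's, but the numbers are made up):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with a roughly linear budget-to-gross relationship
rng = np.random.default_rng(0)
budget = rng.uniform(1e6, 2e8, size=100)
gross = budget * 2.5 + rng.normal(0, 5e7, size=100)
df = pd.DataFrame({"production_budget": budget, "worldwide_gross": gross})

# lmplot draws the scatterplot and fits a regression line in one call
g = sns.lmplot(data=df, x="production_budget", y="worldwide_gross")
```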

The numpy corrcoef function allowed me to easily find Pearson's correlation coefficient without having to work through the covariance calculations by hand.

np.corrcoef(budgetdf['production_budget'], budgetdf['worldwide_gross'])[1,0]

0.7483059765694755

In the figure shown above, I placed two different graphs side by side so they can be easily compared. This was done using matplotlib's subplots function:

fig, ax1 = plt.subplots(1, 2, figsize=(22,10))
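Building on that one-liner, here is a fuller sketch (with made-up numbers) of drawing two related graphs on one figure:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical illustrative data
budget = np.array([10e6, 50e6, 100e6, 150e6])
gross = np.array([30e6, 120e6, 260e6, 400e6])
rating = np.array([6.2, 6.8, 7.0, 7.4])

# plt.subplots(1, 2) returns one figure and an array of two axes,
# so two related plots sit side by side for easy comparison
fig, axes = plt.subplots(1, 2, figsize=(22, 10))
axes[0].scatter(budget, gross)
axes[0].set(title="Budget vs Gross", xlabel="budget", ylabel="gross")
axes[1].scatter(budget, rating)
axes[1].set(title="Budget vs Rating", xlabel="budget", ylabel="rating")
```

Each element of `axes` is a regular Axes object, so any matplotlib plotting call works on it.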

Groupby

Out of all the tools I used for this project, the one I believe was the most useful is the pandas .groupby method. It is very similar to SQL's GROUP BY in that we can pull information based on groups. The official pandas documentation describes groupby as a process of splitting the data based on given criteria, applying a function to each split group separately, and finally combining the results into a data structure.

In my project, I wanted to find the average gross grouped by genre. Without groupby, I would have needed to write the code below twenty-two times, once for each genre:

filt1 = df['genres'] == 'Animation'
animationGrp = df.loc[filt1]

Instead, I used the code below, which gave me a new dataframe containing the average of all the columns, grouped by genre.

genres_mean = genres_gross_df.groupby('genres').mean().reset_index()

An alternative to the pandas groupby() method would have been to use a library that allows SQL-style operations on a pandas dataframe, such as Python's built-in sqlite3 module.
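As a sketch of that alternative (with a hypothetical two-genre dataframe), the data can be loaded into an in-memory SQLite database and grouped with plain SQL:

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the genres/gross dataframe
genres_gross_df = pd.DataFrame({
    "genres": ["Animation", "Drama", "Animation", "Drama"],
    "worldwide_gross": [300.0, 50.0, 500.0, 150.0],
})

# Load the dataframe into an in-memory SQLite table, then
# compute the average gross per genre with a SQL GROUP BY
conn = sqlite3.connect(":memory:")
genres_gross_df.to_sql("movies", conn, index=False)
result = pd.read_sql(
    "SELECT genres, AVG(worldwide_gross) AS avg_gross "
    "FROM movies GROUP BY genres",
    conn,
)
print(result)
```

This is equivalent to the one-line `.groupby('genres').mean()` above, just expressed in SQL.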

Bar Graphs and Box Plots

After grouping the data based on various categories including genres, actors, writers, and directors, the next step was to graph the information in a way that easily conveys it to the audience. I chose bar graphs to show the median and mean because they can be used to easily compare between groups.

I also considered a pie chart, but after graphing it I saw that comparing between groups was almost impossible without labels showing the percentages.

For plotting a smaller dataset of individual entries, I thought the best choice was either a scatter plot or a box plot. I decided the box plot was more appropriate because it makes it easy to compare the medians across categories while also pinpointing the outliers at the same time.

With a scatterplot, by contrast, the audience would have to search through each individual mark in order to compare, and the median cannot be read off at all.
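A minimal sketch of the box-plot comparison, using a hypothetical two-genre dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import pandas as pd
import seaborn as sns

# Hypothetical per-movie gross figures for two genres
df = pd.DataFrame({
    "genres": ["Animation"] * 5 + ["Drama"] * 5,
    "worldwide_gross": [100, 120, 130, 115, 400, 40, 55, 60, 52, 300],
})

# One box per genre: the box shows the median and spread,
# and points beyond the whiskers are flagged as outliers
ax = sns.boxplot(data=df, x="genres", y="worldwide_gross")
```

In this toy data, the 400 and 300 entries would show up as outlier points above each box, which is exactly the kind of signal a scatter plot makes harder to spot.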
