Plotting Beautiful Bar Charts with Matplotlib
How to create beautiful visualizations using Python (Matplotlib), annotate your plots, make them more beautiful and add images.
Visualizing data is a crucial aspect of the data science process. Creating appealing visuals that effectively communicate the intended numbers to your audience is a valuable skill. In this tutorial, you'll learn how to create beautiful plots (with a focus on bar charts) using Python (Matplotlib), how to annotate your plots, make them more beautiful and how to add images to your visualizations with Matplotlib.
Prerequisites?
A good knowledge of Python, pandas and a basic knowledge of Matplotlib should be enough to understand what will be done in this tutorial.
So, let's dive in!
What plot are we making?
In this tutorial, we'll create a horizontal bar chart that compares the goals per 90 minutes of the top goalscorers in the English Premier League (EPL) for the 2022/23 season.
Why this plot?
Bar charts are perfect for data comparison, making them an ideal choice for this task. The plot can be used to settle football arguments and answer questions like "Who was the best goalscorer in the 2022/23 EPL season?". However, keep in mind that other statistics should be considered for a more comprehensive analysis.
If the plot you're trying to make isn't a bar chart, don't worry: the ideas in this tutorial extend to other plots you'll make with Matplotlib.
Install the necessary libraries
To make the horizontal bar chart, you'll need five packages:
pandas, matplotlib, pillow, highlight_text and urllib
Pandas: data handling and data cleaning.
Matplotlib: creating the visualizations.
Pillow (PIL; Python Imaging Library): for handling images.
Highlight_text: adding headings and subheadings to the plots.
Urllib: making requests to websites (it is part of Python's standard library, so it doesn't appear in the install command).
Install the other libraries by running the line below in your terminal or in a code cell (if you are using Google Colab).
pip install pandas matplotlib pillow highlight_text
Where to get the data from?
One of the first questions in most data science problems is where the data will come from. Without data you can do almost nothing; there's a reason "data" is in "data science".
Football stats are widely published, so a simple Google search for "EPL top scorer standings 2022/23" should turn up more than enough data sources. The BBC EPL top scorer standings page is one of the top results, and it's the one used in this tutorial. The page has the stats needed: player names and goals per 90.
Instead of manually typing the contents of the table into a CSV file (which would be tedious and time-consuming), you can read the table directly from the webpage using pandas' read_html() function. Note that read_html() relies on an HTML parser such as lxml being installed.
#import the pandas library
import pandas as pd
#read the table directly from the web
data=pd.read_html("https://www.bbc.com/sport/football/premier-league/top-scorers")[0]
The pd.read_html() function returns a list of dataframes, one per table found on the page. The webpage has only one table, so pd.read_html(link)[0] selects the first (and only) one.
From the dataframe, you can see that the name column isn't clean: the club names have been joined to the player names, and you only need the player names for the plot.
One lesson to learn early in your data science journey is that data often won't arrive in the shape you want. You'll have to clean and reshape it to make it suitable for your purpose!
How to resolve this issue?
Looking at the values in the name column, each one starts with the player's first name (capitalized), followed by the last name (also capitalized), then the last name repeated once more, then the club name, for example Erling HaalandHaalandManchester CityMan City. One approach is to count the capital letters in the string and, as soon as the count goes past 2 (the 1st for the first name and the 2nd for the last name), cut the string just before the 3rd capital letter (the first letter of the repeated last name).
The code to do that is below and should provide more understanding.
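#writing a function to remove the repeated surname and club name
def clean_name(name):
    no_capital_letters=0
    for i,letter in enumerate(name):
        if letter.isupper():
            no_capital_letters+=1
        if no_capital_letters==3:
            #cut the string just before the 3rd capital letter
            return name[:i]
    #fewer than 3 capital letters: return the name unchanged instead of None
    return name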
#applying the function to the name column
data["Name"]=data["Name"].apply(clean_name)
data.head()
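To see how the capital-letter counting behaves, here is a small self-contained sketch that runs the same idea on two strings. The first is the example from above; the second is a made-up string in the same pattern for a player with a multi-part surname, which is an assumption about how such a name would appear in the table.

```python
def clean_name(name):
    """Cut the string just before its third uppercase letter."""
    capitals_seen = 0
    for i, letter in enumerate(name):
        if letter.isupper():
            capitals_seen += 1
        if capitals_seen == 3:
            return name[:i]
    # Fewer than three capitals: nothing to trim
    return name

typical = clean_name("Erling HaalandHaalandManchester CityMan City")
# Hypothetical input: a surname with two capitalized parts
multi_part = clean_name("Kevin De BruyneDe BruyneManchester CityMan City")

print(repr(typical))     # 'Erling Haaland'
print(repr(multi_part))  # 'Kevin De '
```

The second result shows a limitation of the heuristic: names with more than two capitalized parts get cut short, so it's worth checking the cleaned column by eye before plotting.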
The dataframe has about 10 columns and you'll need only 2 (Name and Goals per 90), so create a new dataframe that has only those 2 columns. Then sort the dataframe in ascending order of Goals per 90 and keep the top 15 players; this makes the plot neater and easier to read.
Code below:
#required columns are Name and Goals per 90
required_data=data[["Name","Goals per 90"]]
#sort values in ascending order
required_data=required_data.sort_values(by="Goals per 90",ascending=True)
#drop=True discards the old index instead of keeping it as an extra column
required_data=required_data.reset_index(drop=True)
#top 15 players only (from position 10 downward, since it's sorted in ascending order)
required_data=required_data[10:]
required_data
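The slice above depends on how many rows the table has. As a hedged alternative, the sketch below picks the top N rows with tail(), which works whatever the table length. The dataframe here is made up purely to show the mechanics; top_n is a name introduced for illustration.

```python
import pandas as pd

# Made-up numbers, only to show the mechanics
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Goals per 90": [0.31, 0.95, 0.52, 0.77, 0.48],
})

top_n = 3
top = (
    toy.sort_values(by="Goals per 90", ascending=True)
       .tail(top_n)              # the last rows hold the largest values
       .reset_index(drop=True)
)
print(top["Name"].tolist())  # ['C', 'D', 'B']
```

Keeping the ascending order matters for the plot: a horizontal bar chart draws the first row at the bottom, so the highest value ends up at the top.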
Making the plot.
Three fundamental parts of a visualization in Matplotlib are the Figure, the Axes and the Axis. They are easy to mix up (Figure and Axes are close in meaning, Axes and Axis are close in spelling), so make sure you understand the differences before making your plot.
A Figure can be thought of as a canvas or a blank sheet on which plots are created. It is the overall image or layout of the visualization.
An Axes, on the other hand, is a specific region where data is plotted. Axes are created within a figure and can be thought of as the actual plotting area. A single figure can hold multiple Axes, which makes it easy to create subplots or arrange plots in a grid. So the bar plot will live inside the axes, and the axes will live inside the figure.
An Axis (note the difference from Axes) is the numerical scale for one dimension of a plot (typically the x-axis or y-axis). It provides the reference system that maps data points to positions on the plot.
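A short sketch can make the three terms concrete. It creates one figure holding two axes and then looks at the x-axis object of one of them. The toy data and the variable names are made up for illustration, and the "Agg" backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt

# One Figure (the canvas) holding two Axes (the plotting areas)
fig, (left_ax, right_ax) = plt.subplots(1, 2, figsize=(6, 3))
left_ax.barh(["a", "b"], [1, 2])
right_ax.plot([0, 1], [0, 1])

print(type(fig).__name__)            # Figure
print(len(fig.axes))                 # 2
# Each Axes owns an x Axis and a y Axis (the scales)
print(type(left_ax.xaxis).__name__)  # XAxis
```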
So, create a simple figure, an axes on the figure and a horizontal bar chart on the axes by running the code below.
#Add these extra imports
import matplotlib.pyplot as plt
#creating a figure
fig = plt.figure(facecolor = "#fff3e0",figsize=(6,9), dpi=300)
#creating an axes
ax = plt.subplot(111,facecolor = "#fff3e0")
# specify the height of the bars
height= 0.6
# Make a horizontal barplot on the Axes
ax.barh(
    required_data["Name"],
    required_data["Goals per 90"],
    height=height,
    color="#b52f43"
)
The colors (for example "#b52f43") are written in the hexadecimal color system. When working with colors, a color picker is a handy tool: it lets you select a color and get its code in different color systems (hexadecimal, RGB, HSV and so on). The figure and the axes are given the same background color. In your own plot, you can play around with different color combinations and pick the one you prefer.
The figsize (in inches) and dpi (dots per inch) parameters of the plt.figure() function together determine the pixel size of your image: with figsize=(6,9) and dpi=300, the figure is 6*300 by 9*300 pixels (1800 by 2700). The 111 passed to the plt.subplot() function is shorthand for 1,1,1: it tells matplotlib to create a grid with 1 row and 1 column, and that this axes is the first and only plot in the figure.
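The pixel arithmetic can be checked directly from the figure object. This is a minimal sketch that uses the same figsize and dpi as the code above; the "Agg" backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 9), dpi=300)

# size in inches multiplied by dots per inch gives the size in pixels
width_px, height_px = fig.get_size_inches() * fig.dpi
print(round(width_px), round(height_px))  # 1800 2700
```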
The resulting visualization:
Improving the Bar chart
A question you might have at this point is: what are spines and ticks?
The lines that form the rectangle around the bars are called the spines; there are four of them (top, bottom, left and right). Ticks are the short lines that extend from the bottom and left spines: the ticks on the left spine point at the player names, and the ticks on the bottom spine point at the goals per 90 values.
This image from the matplotlib documentation should give a more visual explanation of what spines, ticks and other parts of a figure are.
Remove the spines and ticks of the plot to make it look better by adding the below code.
# Remove spines
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["right"].set_visible(False)
#remove ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
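The same clean-up can be written more compactly with a loop. The sketch below is one possible variant on toy data, not a replacement for the code above: it hides all four spines and uses tick_params(length=0) to hide the tick marks while keeping the tick labels.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.barh(["a", "b"], [1, 2])  # toy data for illustration

# Hide all four spines in one loop
for side in ("top", "bottom", "left", "right"):
    ax.spines[side].set_visible(False)

# Zero-length tick marks: the marks disappear, the labels stay
ax.tick_params(axis="both", length=0)
```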
The next step is to annotate the bars. You can do that by looping through the players in the dataframe and writing each player's goals per 90 value beside their bar. Matplotlib's Axes.annotate() method lets you do that.
#Annotate the bars
for index,gp90 in enumerate(required_data["Goals per 90"]):
    ax.annotate(
        xy = (gp90 , index),
        text = f"{gp90}",
        xytext = (20, 0),
        size = 13,
        textcoords = "offset points",
        color = "#000712",
        ha = "center",
        va = "center",
        weight = "bold"
    )
The Axes.annotate() method takes many parameters. The main ones used here are: xy, a tuple (x position, y position) giving the point you want to annotate; text, the text of the annotation; xytext, a tuple giving the position of the text relative to the annotated point; and textcoords, the coordinate system in which xytext is expressed ("offset points" here, meaning an offset measured in points from xy). Other parameters and their meanings can be found in the matplotlib documentation.
The plot is almost good enough, but the X-axis tick labels are a little too dense (currently in multiples of 0.2). You can space them in multiples of 0.4 so there are fewer of them.
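As an aside, newer Matplotlib versions (3.4 and later) also offer Axes.bar_label(), which places a label at the end of each bar without an explicit loop. The sketch below is an alternative shown on made-up values, not the approach used in this tutorial; the fmt and padding values are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# barh returns a container holding the drawn bars
bars = ax.barh(["a", "b", "c"], [0.5, 0.75, 1.0], height=0.6)

# One label per bar; padding is the gap (in points) between bar end and text
labels = ax.bar_label(bars, fmt="%.2f", padding=5, size=13, weight="bold")
print([label.get_text() for label in labels])  # ['0.50', '0.75', '1.00']
```

The explicit annotate() loop gives finer control over each label's position, which is why it remains useful when labels need individual nudging.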
The code to do that:
#add this import
import matplotlib.ticker as ticker
#change X axis tick labels interval
ax.xaxis.set_major_locator(ticker.MultipleLocator(0.4))
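To confirm what the locator does, you can read the tick positions back from the axes. This is a small sketch on made-up values; it checks only the spacing between neighbouring ticks, since the exact list of ticks depends on the axis limits.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

fig, ax = plt.subplots()
ax.barh(["a", "b"], [0.9, 1.3])  # toy data for illustration
ax.xaxis.set_major_locator(ticker.MultipleLocator(0.4))

ticks = ax.get_xticks()
spacing = np.diff(ticks)  # gaps between consecutive tick positions
print(ticks)
print(np.allclose(spacing, 0.4))
```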
Adding a subheading and heading to the plots
The highlight_text package lets you add and customize text on matplotlib figures and axes.
You can add a heading (the title of the plot) and a subheading (a short explanation of the plot) to the figure. The highlight_text package provides a fig_text() function for this.
The code to do that is shown below:
#add this import
from highlight_text import fig_text
#Heading
fig_text(
    x=-0.1,y=0.93,
    s="ENGLISH PREMIER LEAGUE 2022/23 season",
    size=16,
    color="black",
    weight="bold",
    annotationbbox_kw={"xycoords": "figure fraction"})
#subheading
fig_text(
    x = -0.1, y = 0.9,
    s = "<GOALS PER 90 MINUTES> | PREMIER LEAGUE MATCHES | <TOP 15 PLAYERS ONLY>",
    color = "black",
    size = 10,
    highlight_textprops = [
        {"color": "#5c191f"},
        {"color":"#5c191f"}
    ],
    annotationbbox_kw={"xycoords": "figure fraction"}
)
The x and y parameters give the position where the text starts, and the annotationbbox_kw={"xycoords": "figure fraction"} line says that these positions are expressed in the figure fraction coordinate system. In this system, x and y are fractions of the figure's width and height: the extreme left of the figure has an x value of 0, the extreme right a value of 1, the bottom has a y value of 0 and the top a value of 1.
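To get a feel for figure fraction coordinates, you can convert them to pixels with the figure's own transform. The sketch below uses matplotlib only (not highlight_text) and the same figsize and dpi as the tutorial's figure.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 9), dpi=300)

# transFigure maps figure fractions (0..1) to pixel positions
centre_px = fig.transFigure.transform((0.5, 0.5))
top_right_px = fig.transFigure.transform((1, 1))
left_of_figure_px = fig.transFigure.transform((-0.1, 0.9))

print(centre_px)          # the middle of the 1800 x 2700 canvas
print(top_right_px)       # the top right corner
print(left_of_figure_px)  # a negative x lands left of the canvas edge
```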
The highlight_textprops parameter lets you customize parts of the text (change colors, font weight and a lot more). In the "s" parameter (the text itself), each part to customize is wrapped in substring delimiters (<>). highlight_textprops accepts a list of Python dictionaries holding the customization information (color, weight, font family and so on), and the length of the list must equal the number of delimiter pairs in "s". In the code above, the first dictionary in the list applies to the "<GOALS PER 90 MINUTES>" part and the second one to the "<TOP 15 PLAYERS ONLY>" part of the text.
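Because the two counts must match, a quick sanity check before calling fig_text() can save a confusing error. The helper logic below is plain Python written for illustration; it is not part of highlight_text, and the regular expression is a simplification that assumes the delimiters are not nested.

```python
import re

s = "<GOALS PER 90 MINUTES> | PREMIER LEAGUE MATCHES | <TOP 15 PLAYERS ONLY>"
highlight_textprops = [{"color": "#5c191f"}, {"color": "#5c191f"}]

# Collect every substring wrapped in <...>
highlighted = re.findall(r"<([^<>]+)>", s)

if len(highlighted) != len(highlight_textprops):
    raise ValueError(
        f"{len(highlighted)} highlighted parts but "
        f"{len(highlight_textprops)} style dictionaries"
    )

# Show which dictionary styles which part of the text
for part, props in zip(highlighted, highlight_textprops):
    print(f"{part!r} -> {props}")
```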
Why use an x value of -0.1 as used in the code?
Well, an x value of 0 didn't place the heading and subheading in the right position. After trying different values by hand and comparing the results, -0.1 gave the best-looking layout.
Add Images to the figure
To add images to your figure, the steps to use are as follows:
Get and open the image.
Create a new axes on the figure where the image is going to be (in this case, the top right corner).
Put the image on the axes.
Finally switch off the axis of the new axes.
Where do you get the image from?
Any PNG image can be used. PNG images are preferred because they support transparency, so they blend with the background color of the figure or axes when you add them to your visualizations.
For this plot, you'll use the English Premier League logo, since that's the league the plot focuses on. You can fetch the image with the urllib package from the fotmob website, which hosts many player and league images in PNG format (although you have to find their links manually).
Open the image using PIL (the Python Imaging Library, provided by the Pillow package), then create a new axes (width 0.125, height 0.125) near the top right corner of the figure (at position 0.825, 0.85 in the figure fraction coordinate system) and draw the image on that axes.
Finally, switch off the axis of the newly created Axes.
#add the following imports
#(import urllib.request explicitly; a bare "import urllib" does not guarantee the submodule is loaded)
import urllib.request
from PIL import Image
#image url
epl_logo_url = "https://images.fotmob.com/image_resources/logo/leaguelogo/47.png"
#Get and Open the league logo
league_logo = Image.open(urllib.request.urlopen(epl_logo_url))
#create new axes at top right corner [left_position,bottom_position,width,height]
logo_ax = fig.add_axes([0.825, 0.85, 0.125, 0.125])
#draw the logo on the axes
logo_ax.imshow(league_logo)
#switch off the axis of the new axes
logo_ax.axis("off")
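If you want to try the four steps without a network connection, the sketch below swaps the downloaded logo for a small image generated with Pillow. The placeholder image is an assumption made purely for illustration; the axes position is the same as in the code above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt
from PIL import Image

fig = plt.figure(figsize=(6, 9), facecolor="#fff3e0")

# Step 1: get an image. Stand-in for the logo: a transparent 64x64 canvas
# with an opaque square in the middle
placeholder = Image.new("RGBA", (64, 64), (0, 0, 0, 0))
placeholder.paste((181, 47, 67, 255), (16, 16, 48, 48))

# Step 2: new axes at [left, bottom, width, height] in figure fractions
logo_ax = fig.add_axes([0.825, 0.85, 0.125, 0.125])

# Step 3: draw the image on the axes
logo_ax.imshow(placeholder)

# Step 4: hide the axis lines, ticks and labels of the new axes
logo_ax.axis("off")
```

Because the placeholder's background is transparent, the figure's face color shows through around the square, which is the same effect you get with a transparent PNG logo.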
This is the final visualization you'll get:
To save the figure, add the below code:
#save a plot in matplotlib
plt.savefig('goals_per_90.png',bbox_inches='tight')
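One thing to keep in mind is that bbox_inches='tight' trims or extends the saved image to fit what was drawn, so the file's pixel size can differ from figsize multiplied by dpi. The sketch below saves a small toy figure to an in-memory buffer, once without and once with the option, so the two sizes can be compared; the figure size and data are made up for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt
from PIL import Image

fig = plt.figure(figsize=(2, 3), dpi=100)
ax = fig.add_subplot(111)
ax.barh(["a", "b"], [1, 2])  # toy data for illustration

def saved_pixel_size(figure, **savefig_kwargs):
    """Save the figure to memory and return the (width, height) of the PNG."""
    buffer = io.BytesIO()
    figure.savefig(buffer, format="png", dpi="figure", **savefig_kwargs)
    buffer.seek(0)
    return Image.open(buffer).size

plain_size = saved_pixel_size(fig)
tight_size = saved_pixel_size(fig, bbox_inches="tight")

print(plain_size)  # (200, 300), i.e. figsize times dpi
print(tight_size)  # depends on the drawn content
```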
The full code for the horizontal bar chart:
#assumes required_data has already been prepared as shown earlier
import urllib.request
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from highlight_text import fig_text
fig = plt.figure(facecolor = "#fff3e0",figsize=(6,9), dpi=300)
ax = plt.subplot(111,facecolor = "#fff3e0")
height= 0.6
ax.barh(
    required_data["Name"],
    required_data["Goals per 90"],
    height=height,
    color="#b52f43"
)
ax.xaxis.set_major_locator(ticker.MultipleLocator(0.4))
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
for index,gp90 in enumerate(required_data["Goals per 90"]):
    ax.annotate(
        xy = (gp90 +0.07 , index-0.3),
        text = f"{gp90}",
        xytext = (0, 7),
        size = 13,
        textcoords = "offset points",
        color = "#000712",
        ha = "center",
        va = "center",
        weight = "bold"
    )
#Heading
fig_text(
    x=-0.1,y=0.93,
    s="ENGLISH PREMIER LEAGUE 2022/23 season",
    size=16,
    color="black",
    weight="bold",
    annotationbbox_kw={"xycoords": "figure fraction"})
#subheading
fig_text(
    x = -0.1, y = 0.9,
    s = "<GOALS PER 90 MINUTES> | PREMIER LEAGUE MATCHES | <TOP 15 PLAYERS ONLY>",
    color = "black",
    size = 10,
    highlight_textprops = [
        {"color": "#5c191f"},
        {"color":"#5c191f"}
    ],
    annotationbbox_kw={"xycoords": "figure fraction"}
)
epl_logo_url = "https://images.fotmob.com/image_resources/logo/leaguelogo/47.png"
logo_ax = fig.add_axes([0.825, 0.85, 0.125, 0.125])
league_logo = Image.open(urllib.request.urlopen(epl_logo_url))
logo_ax.imshow(league_logo)
logo_ax.axis("off")
plt.savefig('goals_per_90.png',bbox_inches='tight')
You can also find the full code here.
I hope this tutorial has taught you the difference between an axes and a figure, what spines and ticks are, and how to annotate your plots and add headings, subheadings and images.
A final note: many of the positions (of texts, axes, images), font sizes and colors in the code weren't right the first time. They came from trial and error, adjusted until the visualization looked as intended.
Remember, visualization is a powerful tool in data science, and continuous improvement and experimentation are essential for creating impressive and informative visualizations.
Thanks for reading! Happy coding!