Practical Techniques for Polished Visuals with Plotnine

Nicola Rennie

PyData Global 2024

About Me


Academic background in statistics and operational research

Experience in data science consultancy

Lecturer in Health Data Science at Lancaster Medical School

Interests: data visualisation, reproducible research, …

What’s this talk about?

How I made this:

Area plot of coal production with coloured annotation



What is plotnine?

A brief introduction

What is plotnine?

  • A data visualisation library

  • Brings the Grammar of Graphics to Python

  • Implementation of ggplot2 from R

Plotnine logo

How does plotnine work?

  • Build plots layer by layer, adding components like data points, lines, and labels to customise your visualisations.

  • Map columns of DataFrames to different components and properties e.g. colours.

  • High-level, declarative syntax, where you specify what you want to plot, rather than how to draw it.



Creating customised plots

Building an annotated area chart

Data

  • Carbon Majors is a database of historical production data from 122 of the world’s largest oil, gas, coal, and cement producers. The Carbon Majors dataset is available for download and for non-commercial use, subject to InfluenceMap’s Terms and Conditions.

  • The dataset has 12,551 rows and 7 columns with the following variables: year, parent_entity, parent_type, commodity, production_value, production_unit, and total_emissions_MtCO2e.

  • For this visualisation, we’ll look at the total amount of coal produced per year since 1900, broken down by type of coal.

Data

After some data wrangling, we have this data:

plot_data.head()


year       commodity          n
1900      Anthracite   1.379115
1900      Bituminous  11.312304
1900         Lignite   3.856433
1900   Metallurgical   1.573317
1900  Sub-Bituminous   2.110480


See nrennie.rbind.io/blog/plotnine-annotated-area-chart for data wrangling code.
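The real wrangling code is in the blog post above. As a rough sketch of the kind of aggregation involved, assuming a few made-up rows with the `year`, `commodity`, and `production_value` columns, it might look something like this. The names `raw`, `coal_types`, and `toy_plot_data` are hypothetical, and the commodity labels are simply the ones shown in the output above.

```python
import pandas as pd

# Made-up rows standing in for the full dataset
raw = pd.DataFrame({
    'year': [1900, 1900, 1900, 1901, 1901],
    'commodity': ['Anthracite', 'Anthracite', 'Lignite', 'Anthracite', 'Oil'],
    'production_value': [1.0, 0.5, 3.0, 2.0, 9.0],
})

# Assumed list of coal categories (labels as they appear in plot_data)
coal_types = ['Anthracite', 'Bituminous', 'Lignite',
              'Metallurgical', 'Sub-Bituminous', 'Thermal']

# Keep coal only, then total production per year and coal type
coal = raw[raw['commodity'].isin(coal_types)]
toy_plot_data = (
    coal.groupby(['year', 'commodity'], as_index=False)['production_value']
        .sum()
        .rename(columns={'production_value': 'n'})
)
```

The result has one row per year and coal type, which is the shape `geom_area()` needs for a stacked area chart.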

Additional data

import pandas as pd

# Values for annotations
exceeds100 = plot_data.groupby('year')['n'].sum()
exceeds100 = exceeds100[exceeds100 > 100].index.min()


# Create data for x-axis labels
segment_data = pd.DataFrame({
    'year': list(range(1900, 2021, 20))
})


# y-axis labels
y_axis_data = pd.DataFrame({
    'value': [0, 2000, 4000, 6000, 8000],
    'label': ['0', '2,000', '4,000', '6,000', '8,000\nmillion\ntonnes']
})
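To see what the `exceeds100` calculation does, here is the same pattern on a small made-up DataFrame (the `toy` values are invented for illustration): sum `n` within each year, keep the years whose total is above the threshold, and take the earliest one.

```python
import pandas as pd

# Invented values: two coal types per year
toy = pd.DataFrame({
    'year': [1900, 1900, 1901, 1901, 1902, 1902],
    'n': [40.0, 50.0, 60.0, 45.0, 70.0, 55.0],
})

# Total per year: 1900 -> 90, 1901 -> 105, 1902 -> 125
totals = toy.groupby('year')['n'].sum()

# Earliest year whose total is above 100
first_year = totals[totals > 100].index.min()
```

Here `first_year` is 1901. In the talk's code the same value is used both as the x position of the annotation line and as its label.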

Variables for colours and fonts

Colours:

bg_col='#FFFFFA'
text_col='#0D5C63'
col_palette=['#E58606', '#5D69B1', '#52BCA3', '#99C945', '#CC61B0', '#24796C']



Fonts:

body_font='sans'

Variables for text

import textwrap

title_text='Coal production since 1900'

st='Carbon Majors is a database of historical production data from 122 of the world’s largest oil, gas, coal, and cement producers. This data is used to quantify the direct operational emissions and emissions from the combustion of marketed products that can be attributed to these entities.'

wrapped_subtitle='\n'.join(textwrap.wrap(st, width=50))
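A quick sketch of what the wrapping step produces, using the first sentence of the subtitle: `textwrap.wrap()` returns a list of lines no longer than `width`, and joining them with `'\n'` gives a single string with the line breaks in place. The variable names below are for illustration only.

```python
import textwrap

text = ('Carbon Majors is a database of historical production data from 122 '
        'of the world’s largest oil, gas, coal, and cement producers.')

# A list of strings, each at most 50 characters long
lines = textwrap.wrap(text, width=50)

# One string with explicit line breaks, ready to use as a plot label
wrapped = '\n'.join(lines)
```

Wrapping by hand like this keeps control of the line length, which matters here because the subtitle is placed with `annotate()` rather than as a themed plot subtitle.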

Basic plot

Import plotnine:

import plotnine as gg


Basic plot:

p = (gg.ggplot(plot_data, gg.aes(x='year', y='n')))

Blank plot

Adding geoms

p = (p +
     # Axis lines
     gg.geom_segment(data=segment_data,
                     mapping=gg.aes(x='year', xend='year', y=0, yend=-1700),
                     linetype='dashed', alpha=0.4, color=text_col) +
     # Axis labels
     gg.geom_text(data=segment_data,
                  mapping=gg.aes(x='year', y=-1900, label='year'),
                  color=text_col, size=8, family=body_font, ha='left') +
     gg.geom_text(data=y_axis_data,
                  mapping=gg.aes(x=2023, y='value', label='label'),
                  color=text_col, size=8, family=body_font, ha='left', va='top'))

Blank plot with custom gridlines and axis labels

Adding annotations

p = (p +
     # Vertical line at the first year production exceeds 100 million tonnes
     gg.annotate(
         'segment', x=exceeds100, xend=exceeds100, y=0, yend=5000,
         size=1, color=text_col) +
     # Year label and description next to that line
     gg.annotate(
         'text', x=exceeds100 + 2, y=5000, label=exceeds100,
         size=10, color=text_col, family=body_font,
         ha='left', va='top', fontweight='bold') +
     gg.annotate(
         'text', x=exceeds100 + 2, y=5000 - 600,
         label='Total coal production first\nexceeds 100 million tonnes\nper year.',
         size=9, color=text_col, family=body_font, ha='left', va='top') +
     # Vertical line and heading for the coal types annotation
     gg.annotate(
         'segment', x=1975, xend=1975, y=0, yend=10000,
         size=1, color=text_col) +
     gg.annotate(
         'text', x=1975 + 2, y=10000, label='Coal types',
         size=10, color=text_col, family=body_font,
         ha='left', va='top', fontweight='bold'))

Plot with gridlines and annotations

A few more features…

p = (p +
    # Add area plot
    gg.geom_area(gg.aes(fill='commodity')) +
    # Colour, x-axis, and y-axis scales
    gg.scale_fill_manual(values=col_palette) +
    gg.scale_x_continuous(limits=(1896, 2034)) +
    gg.scale_y_continuous(limits=(-3300, 12000)))

…and another few!

p = (p +
     # Text for title and subtitle
     gg.annotate(
         'text', x=1900, y=11400,
         label=title_text, color=text_col, family=body_font,
         ha='left', va='top',
         size=13, fontweight='bold') +
     gg.annotate(
         'text', x=1900, y=10500,
         label=wrapped_subtitle, color=text_col, family=body_font,
         ha='left', va='top', size=9.5))

Area plot of coal production

Editing the theme

p = (p +
     gg.coord_cartesian(expand=False) +
     gg.theme_void(base_size=8) +
     gg.theme(
         legend_position='none',
         plot_background=gg.element_rect(fill=bg_col, color=bg_col),
         panel_background=gg.element_rect(fill=bg_col, color=bg_col)))

Area plot of coal production with annotations



Exploiting matplotlib

Using other libraries with plotnine

The HighlightText library

  • A library to make annotations easier in matplotlib.

  • A way to specify individual font properties for substrings of text.

  • Different colours, shading backgrounds, different font size, weights, or styles.

  • Documentation: pypi.org/project/highlight-text

Writing HighlightText annotations

txt='Normal text then <coloured text::{"color": "red"}>'


# annotation labels
coal_types_label='Total coal production includes\nproduction of <Bituminous::{"color": "#E58606"}>,\n<Sub-bituminous::{"color": "#5D69B1"}>, <Metallurgical::{"color": "#52BCA3"}>,\n<Lignite::{"color": "#99C945"}>, <Anthracite::{"color": "#CC61B0"}>, and <Thermal::{"color": "#24796C"}>\ncoal. Bituminous accounts\nfor around half.'

# caption
cap='<Data::{"fontweight": "bold"}>: Carbon Majors\n<Graphic::{"fontweight": "bold"}>: Nicola Rennie (@nrennie)'
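Writing the markup by hand works, but it is easy to mistype a hex code. One possible alternative, sketched here, is to generate the tags from a name-to-colour dictionary. The `highlight()` helper and the `coal_colours` dictionary are hypothetical (they are not part of HighlightText); the colour pairs are copied from the label above.

```python
import json

# Colour pairs copied from coal_types_label
coal_colours = {
    'Bituminous': '#E58606',
    'Sub-bituminous': '#5D69B1',
    'Metallurgical': '#52BCA3',
    'Lignite': '#99C945',
    'Anthracite': '#CC61B0',
    'Thermal': '#24796C',
}


def highlight(text, **props):
    """Hypothetical helper: wrap text in <text::{...}> markup."""
    return f'<{text}::{json.dumps(props)}>'


# One tagged substring per coal type
tagged = {name: highlight(name, color=col) for name, col in coal_colours.items()}

# Assemble a sentence from the tagged names (line breaks differ from the talk's label)
names = list(tagged.values())
toy_label = ('Total coal production includes\nproduction of '
             + ', '.join(names[:-1]) + ', and ' + names[-1] + '\ncoal.')
```

The same helper covers other properties too, for example `highlight('Data', fontweight='bold')` for the caption.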

Convert to matplotlib object

import matplotlib.pyplot as plt
fig = p.draw()
fig.set_size_inches(8, 6, forward=True)
fig.set_dpi(300)
ax = plt.gca()
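`plt.gca()` returns the axes of whichever figure is current, which is presumably the one just drawn. If other figures might be created in between, a slightly more explicit option is to take the axes from the figure object itself. The sketch below uses a plain matplotlib figure as a stand-in for the one returned by `p.draw()`, and assumes a single-panel plot so that the first axes is the one we want.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Stand-in for the figure returned by p.draw()
fig, _ = plt.subplots()

# Some other figure is created afterwards and becomes the current one
plt.figure()

# Take the axes from the figure we care about, not from global state
ax = fig.axes[0]
```

Annotations added through `ax` then land on the intended figure regardless of which figure is current.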

Adding HighlightText annotations

import highlight_text as ht

ht.ax_text(
    x=1977, y=9400, s=coal_types_label,
    vsep=3, color=text_col,
    fontname=body_font,
    fontsize=9, va='top')
    
ht.ax_text(
    x=1900, y=-2300, s=cap, color=text_col,
    fontname=body_font, fontsize=7.5, va='top')
    
plt.show()

Area plot of coal production with coloured annotation

Summary

  • Plotnine allows you to build plots one layer at a time.

  • The syntax is fairly intuitive.

  • Customising plots takes a bit of work but it’s worth it.

  • Use libraries that work with matplotlib to gain extra features.

Resources