Creating your data science portfolio

R-Ladies Gaborone & Botswana R Users

👩 Nicola Rennie

A lollipop chart with each point showing a different stage of the author's career.

Data science portfolios

Why build a data science portfolio?

  • Extend your CV

  • Personal projects provide public evidence of skills you developed in confidential projects

  • Showcase projects you have designed and enjoy

  • For your own reference

What could a data science portfolio include?



It depends.

  • Highlight 3-5 projects

  • Show the process

  • Share code and outputs


Where could I keep my portfolio?

  • Public git repository (e.g. GitHub)

  • Website

    • Built with:
      • Quarto, Hugo, …
    • Hosted on:
      • GitHub Pages, Quarto Pub, Netlify…


How do I create a data science project?


  • Get some data
  • Do something to it
    • Data wrangling
    • Visualisation
    • Modelling
  • Write about it
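
The three steps above can be sketched end to end with a dataset that ships with R. This is only an illustrative outline and not part of the talk: the choice of the built-in `airquality` data, the monthly summary, and the linear model are all assumptions made for the example.

```r
# Get some data: `airquality` ships with base R (chosen only for illustration)
data(airquality)

# Data wrangling: mean daily temperature for each month
monthly_temp <- aggregate(Temp ~ Month, data = airquality, FUN = mean)

# Visualisation: a quick bar chart of the monthly means
barplot(monthly_temp$Temp,
        names.arg = month.abb[monthly_temp$Month],
        ylab = "Mean temperature (degrees F)")

# Modelling: does temperature help explain ozone levels?
# (rows with missing ozone values are dropped by lm() by default)
fit <- lm(Ozone ~ Temp, data = airquality)

# Write about it: pull out a number you can describe in prose
temp_effect <- unname(coef(fit)["Temp"])
```

A portfolio piece would then wrap these steps in a short write-up explaining what was done and why.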

Where do I find data?


Getting started with #TidyTuesday




How did I start my portfolio?

A #TidyTuesday Example

Making a GitHub repository


  • Create a new repository

  • Fill in project details

  • Add a README file

GitHub new repository screenshot

  • In RStudio: File -> New Project

  • Create project from Version control

  • Clone a project from a Git repository

GitHub clone process in RStudio

Git help: happygitwithr.com

Live Coding!

Create a file

  • .R file

  • .Rmd file

  • .qmd file


---
title: "#TidyTuesday"
author: Nicola Rennie
format: html
---
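
If you prefer to create the file from the console, a skeleton like the header above could be written out with base R. This is a rough sketch rather than the approach used in the talk: the section heading and chunk contents are placeholders, and the file goes to a temporary directory so that nothing is overwritten. The name `big_tech.qmd` matches the file used in the publishing command later on.

```r
# Three backticks, built programmatically so this example can itself sit in a code block
fence <- strrep("`", 3)

# Lines of a minimal Quarto document: YAML header, a heading, and one R chunk
qmd_lines <- c(
  "---",
  "title: \"#TidyTuesday\"",
  "author: Nicola Rennie",
  "format: html",
  "---",
  "",
  "## Load the data",
  "",
  paste0(fence, "{r}"),
  "library(tidyverse)",
  fence
)

# Written to a temporary directory; change the path to your project folder in practice
qmd_path <- file.path(tempdir(), "big_tech.qmd")
writeLines(qmd_lines, qmd_path)
```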

Load the data

Data: github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-07/readme.md

big_tech_stock_prices <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-07/big_tech_stock_prices.csv")

or

tuesdata <- tidytuesdayR::tt_load("2023-02-07")
big_tech_stock_prices <- tuesdata$big_tech_stock_prices

or

tuesdata <- tidytuesdayR::tt_load(2023, week = 6)
big_tech_stock_prices <- tuesdata$big_tech_stock_prices
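
Each option above downloads the data every time the code runs. One possible refinement, sketched here with a hypothetical helper named `read_cached_csv()` (not part of {readr} or {tidytuesdayR}), is to keep a local copy and only download it when the file is missing. Note that base `read.csv()` leaves the `date` column as text, so you may want to swap in `readr::read_csv()` to keep the column types used in the rest of this example.

```r
# Hypothetical helper: download a CSV once and reuse the local copy afterwards
read_cached_csv <- function(url, path) {
  if (!file.exists(path)) {
    # mode = "wb" avoids line-ending changes on Windows
    download.file(url, destfile = path, mode = "wb")
  }
  read.csv(path)
}

# Usage (needs an internet connection the first time it runs):
# big_tech_stock_prices <- read_cached_csv(
#   "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-07/big_tech_stock_prices.csv",
#   "big_tech_stock_prices.csv"
# )
```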

Initial exploration

View the column names:

colnames(big_tech_stock_prices)
[1] "stock_symbol" "date"         "open"         "high"         "low"         
[6] "close"        "adj_close"    "volume"      


Read the data dictionary: github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-02-07/readme.md
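
Beyond the column names, a few base R calls give a quick feel for the data. The sketch below uses a tiny made-up data frame with a subset of the real column names so that it runs on its own; the values are invented, and in practice you would run the same calls on `big_tech_stock_prices`.

```r
# Invented rows using a subset of the real column names (values are not real prices)
toy_prices <- data.frame(
  stock_symbol = c("AAPL", "AAPL", "ADBE", "ADBE"),
  date = as.Date(c("2010-01-04", "2010-01-05", "2010-01-04", "2010-01-05")),
  open = c(7.6, NA, 36.7, 37.0),
  volume = c(490000, 600000, 4700, 2900)
)

# Column types at a glance
str(toy_prices)

# Number of rows for each company
rows_per_company <- table(toy_prices$stock_symbol)

# First and last dates covered
date_range <- range(toy_prices$date)

# Count of missing values in each column
missing_per_column <- colSums(is.na(toy_prices))
```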

Choosing an approach

  • Which aspects of the data do you want to show?
    • How have values changed over time?
    • How do different companies compare?
  • What things do you want to learn?
    • {ggsankey}
    • Working with icons in fonts

Data wrangling

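library(tidyverse)
library(lubridate)
# Average opening price for each company in each year
plot_data <- big_tech_stock_prices |> 
  mutate(year = year(date)) |> 
  group_by(stock_symbol, year) |> 
  summarise(open = mean(open, na.rm = TRUE)) |> 
  ungroup() |> 
  # drop any rows after 2022
  filter(year <= 2022)
# Show the first six rows
head(plot_data)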
# A tibble: 6 × 3
  stock_symbol  year  open
  <chr>        <dbl> <dbl>
1 AAPL          2010  9.28
2 AAPL          2011 13.0 
3 AAPL          2012 20.6 
4 AAPL          2013 16.9 
5 AAPL          2014 23.1 
6 AAPL          2015 30.0 
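
To see what the wrangling step computes, it can help to run it on a handful of rows where the answer is easy to check by hand. The sketch below uses invented prices, not the real data, and shows one optional variation: passing `.groups = "drop"` to `summarise()` returns an ungrouped tibble, so the separate `ungroup()` call is not needed.

```r
library(dplyr)
library(lubridate)

# Invented prices for two companies across two years (not the real data)
toy_prices <- tibble(
  stock_symbol = c("AAPL", "AAPL", "AAPL", "ADBE", "ADBE", "ADBE"),
  date = as.Date(c("2010-06-01", "2010-06-02", "2011-06-01",
                   "2010-06-01", "2010-06-02", "2011-06-01")),
  open = c(8, 10, 13, 30, 34, 31)
)

toy_summary <- toy_prices |>
  mutate(year = year(date)) |>
  group_by(stock_symbol, year) |>
  # .groups = "drop" returns an ungrouped tibble, so ungroup() is not needed
  summarise(open = mean(open, na.rm = TRUE), .groups = "drop")

# Sanity check: exactly one row for each company-year combination
stopifnot(!any(duplicated(toy_summary[c("stock_symbol", "year")])))
```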

Initial plots

ggplot(plot_data,
       aes(x = year,
           y = open)) +
  geom_col()

Initial plots

ggplot(plot_data,
       aes(x = year,
           y = open,
           fill = stock_symbol)) +
  geom_col()

Initial plots

ggplot(plot_data,
       aes(x = year,
           y = open,
           colour = stock_symbol)) +
  geom_line() +
  geom_point()

Choosing an idea

# install from GitHub if needed: remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey)
ggplot(plot_data,
       aes(x = year,
           value = open,
           node = stock_symbol,
           fill = stock_symbol)) +
  geom_sankey_bump()

Refining your plot

ggplot(plot_data,
       aes(x = year,
           value = open,
           node = stock_symbol,
           fill = stock_symbol)) +
  geom_sankey_bump(space = 1,
                   colour = "transparent",
                   smooth = 6,
                   alpha = 0.8)

Refining your plot

ggplot(plot_data,
       aes(x = year,
           value = open,
           node = stock_symbol,
           fill = (stock_symbol == "ADBE"))) +
  geom_sankey_bump(space = 1,
                   colour = "transparent",
                   smooth = 6,
                   alpha = 0.8) +
  scale_fill_manual(
    values = c("grey", "#fb0f01")
    )
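
The highlighting above works because `stock_symbol == "ADBE"` is a logical value, and the two colours are matched to `FALSE` and `TRUE` in that order. A possible variation, sketched below on invented values with a plain `geom_col()` standing in for the sankey layer, is to name the colours so the mapping does not depend on their order.

```r
library(ggplot2)

# Invented values for three companies (for illustration only)
toy_data <- data.frame(
  stock_symbol = c("AAPL", "ADBE", "NFLX"),
  open = c(150, 400, 300)
)

p <- ggplot(toy_data,
            aes(x = stock_symbol,
                y = open,
                fill = (stock_symbol == "ADBE"))) +
  geom_col() +
  # Naming the values ties each colour to FALSE/TRUE explicitly,
  # rather than relying on the order of the two levels
  scale_fill_manual(
    values = c("FALSE" = "grey", "TRUE" = "#fb0f01")
  )

# Inspect the colour actually assigned to each bar
bar_fills <- layer_data(p)$fill
```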

Refining your plot

g <- ggplot(plot_data,
       aes(x = year,
           value = open,
           node = stock_symbol,
           fill = (stock_symbol == "ADBE"))) +
  geom_sankey_bump(space = 1,
                   colour = "transparent",
                   smooth = 6,
                   alpha = 0.8) +
  scale_fill_manual(
    values = c("grey", "#fb0f01")
    ) +
  scale_x_continuous(
    breaks = seq(2010, 2022, 2)
    ) +
  theme_minimal()
g

Refining your plot

st <- "In 2022, of 14 tech companies considered, 
Adobe Inc. had the highest average daily stock 
price when the markets opened, after overtaking 
Netflix in 2021. Data: Yahoo Finance"
g <- g +
  labs(title = "The Rise of Adobe Inc.",
       subtitle = str_wrap(st, 80)) 
g
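
The subtitle string is typed across several lines, so it already contains line breaks wherever the source code wraps. As a small check of what `str_wrap()` does with it, the sketch below wraps the same text and measures the resulting lines; the `line_widths` variable is introduced only for this illustration.

```r
library(stringr)

# A string typed across several lines contains "\n" wherever the source wraps
st <- "In 2022, of 14 tech companies considered, 
Adobe Inc. had the highest average daily stock 
price when the markets opened, after overtaking 
Netflix in 2021. Data: Yahoo Finance"

# str_wrap() normalises the existing whitespace, then inserts new line breaks
# so that no line is wider than `width` characters
wrapped <- str_wrap(st, width = 80)

# Width of each wrapped line
line_widths <- nchar(strsplit(wrapped, "\n")[[1]])
```

Changing `width` is then an easy way to adjust the subtitle to the final size of the plot.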

Refining your plot

g +
  theme(text = element_text(colour = "#546666"),
        plot.margin = margin(
          10, 10, 10, 20
          ),
        # title and subtitle
        plot.title = element_text(
          size = 20,
          colour = "#2F4F4F"
          ),
        plot.subtitle = element_text(
          size = 16,
          lineheight = 0.4,
          hjust = 0
          ),
        plot.title.position = "plot",
        # axes 
        axis.text.y = element_blank(),
        axis.text.x = element_text(
          size = 16,
          vjust = 2
          ),
        axis.title = element_blank(),
        axis.ticks = element_blank(),
        # other elements
        legend.position = "none",
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank())

The final plot

A sankey bump chart showing the changing rank of 14 tech companies between 2010 and 2022, with rank defined by average opening stock price. 13 of the companies are shown in shades of grey, whilst Adobe is shown in red, with Adobe coming into the top spot in 2021.

The rest of the process…

  • {showtext} for Google Fonts and Icons
  • {ggtext} to add markdown
  • More colours
  • Segments for lines

Full code: github.com/nrennie/tidytuesday/tree/main/2023/2023-02-07

Sharing your work

Push your code

Screenshot of the Git interface in the RStudio GUI

Publish your work


  • README.md file in the git repository
  • GitHub Pages, Quarto Pub, …
  • Twitter
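
For the README route, one possible workflow is to save the chart as an image and reference it from `README.md`, so that it appears on the repository's front page. The sketch below is an assumption about how this could be done, not the code from the talk: it uses a stand-in plot of `mtcars` so that it runs on its own, hypothetical file names, and a temporary directory in place of your project folder.

```r
library(ggplot2)

# Stand-in plot so the sketch runs on its own; in practice, save your final chart
g <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Temporary directory used here; in practice this would be your project folder
out_dir <- tempdir()
png_path <- file.path(out_dir, "big_tech.png")
readme_path <- file.path(out_dir, "README.md")

# Save the chart at a fixed size so it renders consistently
ggsave(png_path, plot = g, width = 8, height = 5, dpi = 300)

# A short README that describes the project and embeds the image with alt text
writeLines(
  c("# #TidyTuesday: big tech stock prices",
    "",
    "A short description of the data, the approach, and what the chart shows.",
    "",
    "![Describe the chart here for readers using screen readers](big_tech.png)"),
  readme_path
)
```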

Quarto publishing:

quarto publish big_tech.qmd

Key points


  • Data science portfolios should highlight projects you enjoy

  • Include textual descriptions of what you did

  • #TidyTuesday is a beginner-friendly way to get started

Slides

Questions?