Introduction to Statistics

Nicola Rennie

Descriptive statistics

Descriptive statistics provide a summary that quantitatively describes a sample of data.

Population

Population refers to the entire group of individuals that we want to draw conclusions about.

Sample

Sample refers to the (usually smaller) group of individuals that we have collected data on.

Generate sample data

For the examples later, let’s create a population of data in Python:
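One way to do this, as a minimal sketch: the distribution (normal, mean 50, standard deviation 10), the population size, and the seed are all illustrative assumptions rather than values taken from the slides.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed is arbitrary)
rng = np.random.default_rng(seed=42)

# Hypothetical population: 10,000 values from a normal distribution
# with mean 50 and standard deviation 10 (illustrative choices)
population = rng.normal(loc=50, scale=10, size=10_000)

print(population.shape)  # (10000,)
```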

Generate sample data

… and draw a sample from it:
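A possible sketch of the sampling step, assuming the same hypothetical population as before (regenerated here so the block runs on its own) and an arbitrary sample size of 100:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population (illustrative parameters)
population = rng.normal(loc=50, scale=10, size=10_000)

# Draw 100 values at random, without replacement
sample = rng.choice(population, size=100, replace=False)

print(len(sample))  # 100
```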


What do the values look like?

Mean

The mean, often simply called the average, is defined as the sum of all values divided by the number of values. It’s a measure of central tendency that tells us what’s happening near the middle of the data.

\(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\)


In Python, we use the mean() function from numpy:
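A minimal sketch, using a small made-up sample (the values are illustrative, not the data from these slides). It also checks the result against the formula directly:

```python
import numpy as np

# Small illustrative sample (hypothetical values)
sample = np.array([2, 4, 4, 5, 7, 9, 11])

sample_mean = np.mean(sample)
print(sample_mean)  # 6.0

# Same calculation written out: sum of values / number of values
by_hand = sample.sum() / len(sample)
print(by_hand)  # 6.0
```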

Median

The median of a dataset is the middle value when the data is arranged in ascending order, or the average of the two middle values if the dataset has an even number of observations.


In Python, we use the median() function from numpy:
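A minimal sketch with small made-up samples (hypothetical values), covering both the odd and the even case:

```python
import numpy as np

# Odd number of observations: the middle value is returned
odd_sample = np.array([2, 4, 4, 5, 7, 9, 11])
odd_median = np.median(odd_sample)
print(odd_median)  # 5.0

# Even number of observations: the two middle values (4 and 5) are averaged
even_sample = np.array([2, 4, 4, 5, 7, 9])
even_median = np.median(even_sample)
print(even_median)  # 4.5
```

Note that np.median() sorts the values internally, so the input does not need to be in ascending order.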

Mode

The mode is the value that appears most frequently in a dataset.


In Python, we use the mode() function from statistics:
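A minimal sketch with a small made-up sample (hypothetical values). If several values are tied for most frequent, statistics.mode() returns the first one it encounters (Python 3.8 and later), while statistics.multimode() returns all of them:

```python
import statistics

# Small illustrative sample (hypothetical values); 4 appears twice
sample = [2, 4, 4, 5, 7, 9, 11]

sample_mode = statistics.mode(sample)
print(sample_mode)  # 4

# With a tie, multimode() lists every most-frequent value
tied = statistics.multimode([1, 1, 2, 2, 3])
print(tied)  # [1, 2]
```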

Range

The range is the difference between the maximum and minimum values in a dataset.


In Python, we can use the max() and min() functions and subtract the values:
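A minimal sketch with a small made-up sample (hypothetical values):

```python
# Small illustrative sample (hypothetical values)
sample = [2, 4, 4, 5, 7, 9, 11]

# Range = largest value minus smallest value
sample_range = max(sample) - min(sample)
print(sample_range)  # 9
```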

Or, we can use the ptp() function from numpy:
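A minimal sketch with the same kind of small made-up sample (hypothetical values); ptp stands for "peak to peak":

```python
import numpy as np

# Small illustrative sample (hypothetical values)
sample = np.array([2, 4, 4, 5, 7, 9, 11])

# np.ptp() returns maximum - minimum in one call
sample_range = np.ptp(sample)
print(sample_range)  # 9
```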

Sample variance

The sample variance tells us about how spread out the data is. A lower variance indicates that values tend to be close to the mean, and a higher variance indicates that the values are spread out over a wider range.

\(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\)


In Python, we use the var() function from numpy, setting ddof=1 so that the divisor is n - 1:
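A minimal sketch with a small made-up sample (hypothetical values). By default np.var() divides by n; passing ddof=1 gives the n - 1 divisor used in the sample variance formula:

```python
import numpy as np

# Small illustrative sample (hypothetical values); mean is 6,
# and the squared deviations from the mean sum to 60
sample = np.array([2, 4, 4, 5, 7, 9, 11])

# Sample variance: divide by n - 1 = 6
sample_var = np.var(sample, ddof=1)
print(sample_var)  # 10.0

# Default (ddof=0) divides by n = 7, giving 60 / 7 instead
default_var = np.var(sample)
```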

Sample standard deviation

The sample standard deviation is the square root of the sample variance. It also tells us how spread out the data is, but in the same units as the data.

\(s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\)


In Python, we use the std() function from numpy, setting ddof=1 to match the formula above:
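A minimal sketch with a small made-up sample (hypothetical values). As with np.var(), np.std() divides by n unless ddof=1 is passed:

```python
import numpy as np

# Small illustrative sample (hypothetical values); its sample variance is 10
sample = np.array([2, 4, 4, 5, 7, 9, 11])

# Sample standard deviation: square root of the sample variance
sample_sd = np.std(sample, ddof=1)
print(round(sample_sd, 3))  # 3.162

# Check against the square root of the sample variance
same_thing = np.sqrt(np.var(sample, ddof=1))
```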

Descriptive statistics

Descriptive statistics provide a summary that quantitatively describes a sample of data.

  • Mean: The sum of the values divided by the number of values.
  • Median: The middle value of the data when it’s sorted.
  • Mode: The value that appears most frequently.
  • Range: The difference between the maximum and minimum values.
  • Variance: The sum of the squared differences from the mean, divided by n - 1 for a sample.
  • Standard deviation: The square root of the variance.

Exercise

In Python:

  • Import numpy and statistics.
  • Import data from pydataset using from pydataset import data
  • Load the housing data set using housing = data('Housing')
  • Calculate the mean, median, mode, range, variance, and standard deviation of house prices.

Remember: you can extract a column in Python using dataset['column_name'].
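For instance, a minimal sketch assuming the dataset is a pandas data frame; the data frame below is a made-up stand-in, not the real housing data:

```python
import pandas as pd

# Hypothetical data frame standing in for a loaded dataset
df = pd.DataFrame({
    "price": [42000.0, 38500.0, 49500.0],
    "bedrooms": [3, 2, 3],
})

# Square brackets with the column name return that column
prices = df["price"]
print(prices.tolist())  # [42000.0, 38500.0, 49500.0]
```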

Exercise solutions

import numpy as np
import statistics
from pydataset import data

# load data
housing = data('Housing')

Exercise solutions

# summary statistics
np.mean(housing['price'])
68121.59706959708
np.median(housing['price'])
62000.0
statistics.mode(housing['price'])
60000.0
max(housing['price']) - min(housing['price'])
165000.0
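# np.var() and np.std() divide by n by default (ddof=0);
# pass ddof=1 to get the sample versions that divide by n - 1
np.var(housing['price'])
711726713.9951562
np.std(housing['price'])
26678.206723750307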

Questions?