from datascience import *
import numpy as np
Table.interactive_plots()
Our first dataset today comes from Basketball Reference. It contains per-game averages of players in the 2019-2020 NBA season.
Run the cell below to load it in, select the relevant columns, and do some data cleaning.
Note: Most of the interesting data comes from the "better" players in the league, so we will only look at players who averaged more than 10 points per game in the season. This filter isn't perfect, since plenty of good players averaged 10 points per game or fewer.
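# Load the raw data, keep the columns we need, and drop players with no three-point attempts.
nba = Table.read_table('data/nba-2020.csv') \
    .select('Player', 'Pos', 'Tm', 'PTS', 'TRB', 'AST', '3PA', '3P%') \
    .where('3PA', are.not_equal_to(0))

def remove_code(name):
    # Keep only the text before the backslash in the raw 'Player' value.
    return name[:name.index('\\')]

def get_court(pos):
    # Any position containing 'G' counts as a guard; everything else is a forward.
    if 'G' in pos:
        return 'Guard'
    else:
        return 'Forward'

# Clean the names and positions, then keep players averaging more than 10 points.
nba = nba.with_columns('Player', nba.apply(remove_code, 'Player'),
                       'Pos', nba.apply(get_court, 'Pos')) \
         .where('PTS', are.above(10))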
nba
Player | Pos | Tm | PTS | TRB | AST | 3PA | 3P% |
---|---|---|---|---|---|---|---|
Bam Adebayo | Forward | MIA | 15.9 | 10.2 | 5.1 | 0.2 | 0.143 |
LaMarcus Aldridge | Forward | SAS | 18.9 | 7.4 | 2.4 | 3 | 0.389 |
Jarrett Allen | Forward | BRK | 11.1 | 9.6 | 1.6 | 0.1 | 0 |
Giannis Antetokounmpo | Forward | MIL | 29.5 | 13.6 | 5.6 | 4.7 | 0.304 |
Carmelo Anthony | Forward | POR | 15.4 | 6.3 | 1.5 | 3.9 | 0.385 |
OG Anunoby | Forward | TOR | 10.6 | 5.3 | 1.6 | 3.3 | 0.39 |
D.J. Augustin | Guard | ORL | 10.5 | 2.1 | 4.6 | 3.5 | 0.348 |
Deandre Ayton | Forward | PHO | 18.2 | 11.5 | 1.9 | 0.3 | 0.231 |
Marvin Bagley III | Forward | SAC | 14.2 | 7.5 | 0.8 | 1.7 | 0.182 |
Lonzo Ball | Guard | NOP | 11.8 | 6.1 | 7 | 6.3 | 0.375 |
... (163 rows omitted)
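The cleaning steps in the cell above can be traced in plain Python. This is a minimal sketch on made-up rows: the player names, codes, and raw position codes are hypothetical, and it assumes each raw 'Player' value is a name, a backslash, and a code. It only mirrors the filtering logic and is not how the `datascience` library works internally.

```python
# Hypothetical raw rows; names, codes, and values are made up for illustration.
raw_rows = [
    {'Player': 'Player A\\playera01', 'Pos': 'PF', 'PTS': 15.0, '3PA': 0.2},
    {'Player': 'Player B\\playerb01', 'Pos': 'SG', 'PTS': 10.0, '3PA': 2.0},
    {'Player': 'Player C\\playerc01', 'Pos': 'C', 'PTS': 12.0, '3PA': 0.0},
    {'Player': 'Player D\\playerd01', 'Pos': 'PG', 'PTS': 11.8, '3PA': 6.3},
]

def remove_code(name):
    # Keep only the text before the backslash.
    return name[:name.index('\\')]

def get_court(pos):
    # Any position containing 'G' is a guard; everything else is a forward.
    return 'Guard' if 'G' in pos else 'Forward'

cleaned = []
for row in raw_rows:
    if row['3PA'] == 0:        # mirrors are.not_equal_to(0)
        continue
    if not row['PTS'] > 10:    # mirrors are.above(10), a strict comparison
        continue
    cleaned.append({**row,
                    'Player': remove_code(row['Player']),
                    'Pos': get_court(row['Pos'])})

names = [row['Player'] for row in cleaned]
positions = [row['Pos'] for row in cleaned]
print(names, positions)
```

Player B is dropped because `are.above(10)` excludes exactly 10 points, and Player C is dropped for having no three-point attempts. Centers ('C') fall into the 'Forward' group because their code contains no 'G'.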
A description of each column:
'Player': name
'Pos': general position (either Forward or Guard)
'Tm': abbreviated team
'PTS': average number of points scored per game
'TRB': average number of rebounds per game (a player receives a rebound when they grab the ball after someone misses)
'AST': average number of assists per game (a player receives an assist when they pass the ball to someone who then scores)
'3PA': average number of three-point shots attempted per game (a three-point shot is one taken from behind a line that is between 22 and 24 feet from the basket)
'3P%': proportion of three-point shots that go in
nba.group('Pos', np.mean).select('Pos', 'PTS mean', 'TRB mean', 'AST mean')
Pos | PTS mean | TRB mean | AST mean |
---|---|---|---|
Forward | 15.6297 | 6.68901 | 2.41099 |
Guard | 16.7463 | 4.00244 | 4.45244 |
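Grouping by 'Pos' with `np.mean` splits the rows by position and averages each numeric column within each group. A rough sketch of that idea in plain Python, using only four of the displayed rows (two forwards, two guards), so the means will not match the full table:

```python
from collections import defaultdict
from statistics import mean

# (position, points per game) for four rows shown in the table above.
rows = [
    ('Forward', 15.9),  # Bam Adebayo
    ('Forward', 18.9),  # LaMarcus Aldridge
    ('Guard', 10.5),    # D.J. Augustin
    ('Guard', 11.8),    # Lonzo Ball
]

# Collect the values for each position, then average each collection.
by_pos = defaultdict(list)
for pos, pts in rows:
    by_pos[pos].append(pts)

pts_mean = {pos: mean(values) for pos, values in by_pos.items()}
print(pts_mean)
```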
nba.group('Pos', np.mean).select('Pos', 'PTS mean', 'TRB mean', 'AST mean').barh('Pos')
nba.hist('PTS', density = False, bins = np.arange(10, 40, 2.5),
width = 400, height = 600)
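With `density = False`, the histogram shows how many players fall into each bin, and `np.arange(10, 40, 2.5)` sets bin edges at 10, 12.5, ..., 37.5. The counting can be sketched in plain Python using the ten 'PTS' values displayed earlier. The sketch treats every bin as including its left edge and excluding its right edge; NumPy's own histogram additionally includes the right edge in the last bin.

```python
# PTS values from the ten rows displayed in the table above.
pts = [15.9, 18.9, 11.1, 29.5, 15.4, 10.6, 10.5, 18.2, 14.2, 11.8]

start, width, n_bins = 10, 2.5, 11  # same edges as np.arange(10, 40, 2.5)
edges = [start + width * i for i in range(n_bins + 1)]

counts = [0] * n_bins
for value in pts:
    index = int((value - start) // width)  # which bin the value falls in
    if 0 <= index < n_bins:                # ignore values outside the edges
        counts[index] += 1

for left, right, count in zip(edges, edges[1:], counts):
    print(f'[{left}, {right}): {count}')
```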
nba.hist('TRB', density = False, group = 'Pos', bins = np.arange(17),
xaxis_title = 'Rebounds',
title = 'Distribution of Rebounds')
example_data = Table().with_columns(
'x', np.array([1, 4, 4, 3, 6]),
'y', np.array([-1, 2, 8, 0, 1])
)
example_data
x | y |
---|---|
1 | -1 |
4 | 2 |
4 | 8 |
3 | 0 |
6 | 1 |
example_data.scatter('x', 'y', s = 50, width = 500, height = 500)
nba
Player | Pos | Tm | PTS | TRB | AST | 3PA | 3P% |
---|---|---|---|---|---|---|---|
Bam Adebayo | Forward | MIA | 15.9 | 10.2 | 5.1 | 0.2 | 0.143 |
LaMarcus Aldridge | Forward | SAS | 18.9 | 7.4 | 2.4 | 3 | 0.389 |
Jarrett Allen | Forward | BRK | 11.1 | 9.6 | 1.6 | 0.1 | 0 |
Giannis Antetokounmpo | Forward | MIL | 29.5 | 13.6 | 5.6 | 4.7 | 0.304 |
Carmelo Anthony | Forward | POR | 15.4 | 6.3 | 1.5 | 3.9 | 0.385 |
OG Anunoby | Forward | TOR | 10.6 | 5.3 | 1.6 | 3.3 | 0.39 |
D.J. Augustin | Guard | ORL | 10.5 | 2.1 | 4.6 | 3.5 | 0.348 |
Deandre Ayton | Forward | PHO | 18.2 | 11.5 | 1.9 | 0.3 | 0.231 |
Marvin Bagley III | Forward | SAC | 14.2 | 7.5 | 0.8 | 1.7 | 0.182 |
Lonzo Ball | Guard | NOP | 11.8 | 6.1 | 7 | 6.3 | 0.375 |
... (163 rows omitted)
nba.scatter('PTS', 'AST')
Observation: On average, as the number of points a player averages increases, the number of assists they average also increases.
nba.where('Pos', 'Forward') \
.scatter('TRB', '3PA',
xaxis_title = 'Rebounds Per Game (TRB)',
yaxis_title = 'Three-Point Attempts Per Game (3PA)',
title = '3PA vs. TRB for Forwards',
width = 800,
height = 500)
Observation: On average, as the number of rebounds a forward averages per game increases, the number of three-point attempts they average per game decreases.
nba.where('3PA', are.above(2)) \
    .scatter('PTS', '3P%',
             xaxis_title = 'Points Per Game (PTS)', yaxis_title = 'Three-Point Percentage (3P%)',
             title = '3P% vs. PTS for Players with more than 2 3PA',
             width = 700, height = 500)
Observation: On average, as the number of points a player averages per game increases, three-point percentage neither increases nor decreases. In other words, PTS and 3P% appear to be uncorrelated.
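One way to make "uncorrelated" concrete is the correlation coefficient r, which lies between -1 and 1 and is near 0 when there is no linear trend. The notebook itself does not compute r; the sketch below is an illustration on the small `example_data` columns from earlier, and the `correlation` helper is a hypothetical name, not part of the `datascience` library.

```python
from math import sqrt

def correlation(xs, ys):
    # Pearson correlation: covariance divided by the product of the spreads.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(spread_x * spread_y)

# The 'x' and 'y' columns of example_data.
x = [1, 4, 4, 3, 6]
y = [-1, 2, 8, 0, 1]

r = correlation(x, y)
print(round(r, 3))
```

Applying the same function to the 'PTS' and '3P%' columns would give a number to compare against the visual impression from the scatter plot.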
nba.scatter('PTS', '3P%', s = 40)
nba.scatter('PTS', '3P%', s = 40, sizes = '3PA')
nba.scatter('TRB', '3PA', group = 'Pos', s = 30)
Observation: Guards tend to have fewer rebounds and more three-point attempts than forwards, who tend to have more rebounds and fewer three-point attempts.
nba.where('PTS', are.above(25)) \
.scatter('PTS', 'AST',
labels = 'Player',
s = 30,
width = 500,
height = 500)
nba.where('PTS', are.above(20)) \
    .scatter('PTS', 'AST',
             labels = 'Player',
             s = 30,
             sizes = '3PA',
             title = 'Players Averaging more than 20 PTS')
nba_yearly = Table.read_table('data/nba-league-averages.csv') \
.select('Season', 'PTS', 'FGA', '3PA', '3P%', 'Pace')
nba_yearly = nba_yearly.with_columns('Season', np.arange(2021, 1979, -1))
nba_yearly
Season | PTS | FGA | 3PA | 3P% | Pace |
---|---|---|---|---|---|
2021 | 111.7 | 88.3 | 34.7 | 0.367 | 99.2 |
2020 | 111.8 | 88.8 | 34.1 | 0.358 | 100.3 |
2019 | 111.2 | 89.2 | 32 | 0.355 | 100 |
2018 | 106.3 | 86.1 | 29 | 0.362 | 97.3 |
2017 | 105.6 | 85.4 | 27 | 0.358 | 96.4 |
2016 | 102.7 | 84.6 | 24.1 | 0.354 | 95.8 |
2015 | 100 | 83.6 | 22.4 | 0.35 | 93.9 |
2014 | 101 | 83 | 21.5 | 0.36 | 93.9 |
2013 | 98.1 | 82 | 20 | 0.359 | 92 |
2012 | 96.3 | 81.4 | 18.4 | 0.349 | 91.3 |
... (32 rows omitted)
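The 'Season' column is overwritten with `np.arange(2021, 1979, -1)`, which counts down from 2021 and stops before reaching 1979. Python's built-in `range` follows the same start, stop, step rule, so the labels can be checked without NumPy. Note that relabeling this way presumably relies on the file's rows already being ordered from newest to oldest, which is an assumption here.

```python
# Same start/stop/step as np.arange(2021, 1979, -1); the stop value is excluded.
seasons = list(range(2021, 1979, -1))

print(len(seasons))   # number of season labels
print(seasons[:3])    # most recent seasons first
print(seasons[-1])    # earliest season
```

The count of 42 matches the table above: 10 displayed rows plus 32 omitted.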
Our second dataset also comes from Basketball Reference. This dataset contains team-based average statistics for each year.
A little bit about our new dataset:
'Season': the second calendar year for each season (e.g. 2018 refers to the 2017-18 season)
'FGA': the average number of field goal attempts (shot attempts) per game
'Pace': the average number of times a team had possession of the ball per game
nba_yearly.plot('Season', 'Pace')
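The 'Season' convention, where one number stands for a season spanning two calendar years, can be spelled out with a small helper. `season_label` is a hypothetical function written for this illustration and does not appear in the dataset or the `datascience` library.

```python
def season_label(season):
    # 'season' is the second calendar year, e.g. 2018 for the 2017-18 season.
    return f'{season - 1}-{season % 100:02d}'

print(season_label(2018))
print(season_label(2000))
```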
Observation: The league slowed down in the late 90s and early 2000s, but is speeding back up.
nba_yearly.plot('Season', '3PA',
yaxis_title = 'Three-Point Attempts (3PA)',
title = 'Three-Point Attempts Per Season',
width = 700)
Observation: The three-point shot has rapidly increased in popularity over the past decade.
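To put a number on "rapidly increased", a quick sketch of the relative change in league-average 3PA between the 2012 and 2021 seasons, using the two values shown in the `nba_yearly` table:

```python
# League-average three-point attempts per game, from the nba_yearly table.
three_pa = {2012: 18.4, 2021: 34.7}

# Relative change from the 2012 season to the 2021 season.
change = (three_pa[2021] - three_pa[2012]) / three_pa[2012]
print(f'{change:.1%}')
```

That is an increase of roughly 89 percent over nine seasons.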
nba_yearly.select('Season', 'FGA', '3PA')
Season | FGA | 3PA |
---|---|---|
2021 | 88.3 | 34.7 |
2020 | 88.8 | 34.1 |
2019 | 89.2 | 32 |
2018 | 86.1 | 29 |
2017 | 85.4 | 27 |
2016 | 84.6 | 24.1 |
2015 | 83.6 | 22.4 |
2014 | 83 | 21.5 |
2013 | 82 | 20 |
2012 | 81.4 | 18.4 |
... (32 rows omitted)
# Notice how we only supplied `plot` with a single argument
nba_yearly.select('Season', 'FGA', '3PA').plot('Season')
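Plotting FGA and 3PA on the same axes invites a comparison between the two lines. One possible follow-up, sketched here in plain Python from the ten displayed rows, is the share of all shot attempts that are three-pointers. The `share` dictionary is introduced for this illustration and is not a column in the dataset.

```python
# (season, FGA, 3PA) for the ten rows displayed above.
rows = [
    (2021, 88.3, 34.7), (2020, 88.8, 34.1), (2019, 89.2, 32.0),
    (2018, 86.1, 29.0), (2017, 85.4, 27.0), (2016, 84.6, 24.1),
    (2015, 83.6, 22.4), (2014, 83.0, 21.5), (2013, 82.0, 20.0),
    (2012, 81.4, 18.4),
]

# Fraction of field goal attempts that were three-point attempts.
share = {season: three_pa / fga for season, fga, three_pa in rows}

for season in sorted(share):
    print(season, round(share[season], 3))
```

In these ten seasons the share rises every year, from about 0.23 in 2012 to about 0.39 in 2021.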
champ = nba_yearly.take(np.arange(1, 7)).select('Season', 'PTS', '3PA', 'Pace').with_columns(
'Champion', np.array(['LAL', 'TOR', 'GSW', 'GSW', 'CLE', 'GSW'])
).select(0, -1, 1, 2, 3)
champ
Season | Champion | PTS | 3PA | Pace |
---|---|---|---|---|
2020 | LAL | 111.8 | 34.1 | 100.3 |
2019 | TOR | 111.2 | 32 | 100 |
2018 | GSW | 106.3 | 29 | 97.3 |
2017 | GSW | 105.6 | 27 | 96.4 |
2016 | CLE | 102.7 | 24.1 | 95.8 |
2015 | GSW | 100 | 22.4 | 93.9 |
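The `champ` table is built by picking rows by index with `take(np.arange(1, 7))` and then reordering columns by index with `select(0, -1, 1, 2, 3)`. A sketch of the same index arithmetic on plain Python lists, assuming the season labels assigned earlier:

```python
# Season labels in table order (newest first), as assigned to nba_yearly.
seasons = list(range(2021, 1979, -1))

# Row indices 1 through 6 skip row 0 (the 2021 season).
picked = [seasons[i] for i in range(1, 7)]

# Column order after with_columns adds 'Champion' at the end.
columns = ['Season', 'PTS', '3PA', 'Pace', 'Champion']

# Index -1 refers to the last column, which moves 'Champion' next to 'Season'.
reordered = [columns[i] for i in (0, -1, 1, 2, 3)]

print(picked)
print(reordered)
```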
champ.select('Season', 'Pace').barh('Season')
champ.scatter('PTS', 'Pace', s = 100, labels = 'Champion')
champ.scatter('PTS', 'Pace', s = 100, labels = 'Season')
champ.select('Season', 'PTS', 'Pace').plot('Season')