EDA of 2019 Cycling Data
Intro
Strava is the platform many endurance athletes use to log their training, and I record every run and ride I do there. Using David Yang's excellent Medium post and Mark Koester's epic GitHub repo qs_ledger as inspiration, I am going to download my 2019 Strava cycling data with stravalib, a Python client for Strava's API, and conduct an exploratory data analysis.
Importing dependencies
import pandas as pd
import altair as alt
import numpy as np
from stravalib import unithelper
from vega_datasets import data
Downloading data from the Strava API using stravalib
from stravalib import Client
client_id = 'YOUR_CLIENT_ID'          # get from your Strava settings
client_secret = 'YOUR_CLIENT_SECRET'  # get from your Strava settings
client = Client()
url = client.authorization_url(
    client_id=client_id,
    redirect_uri='http://localhost/'
)
print(url)
https://www.strava.com/oauth/authorize?client_id=24067&redirect_uri=http%3A%2F%2Flocalhost%2F&approval_prompt=auto&response_type=code&scope=read%2Cactivity%3Aread
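After you open that URL and approve access, Strava redirects the browser to `http://localhost/` with a one-time `code` query parameter. A minimal sketch of pulling it out of the redirect URL (the URL below is a made-up example):

```python
from urllib.parse import urlparse, parse_qs

# hypothetical redirect URL copied from the browser's address bar
redirect = 'http://localhost/?state=&code=abc123&scope=read,activity:read'

# the one-time authorization code lives in the query string
code = parse_qs(urlparse(redirect).query)['code'][0]
print(code)  # abc123
```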
code = 'd03200d9e3b9ff3e16b9dc11ce27d6a55d3281cd'  # change this to the code you see in the redirect URL
token_response = client.exchange_code_for_token(
    client_id=client_id,
    client_secret=client_secret,
    code=code
)
# the response also contains a refresh_token and an expires_at timestamp
client = Client(access_token=token_response['access_token'])
# Test the connection
athlete = client.get_athlete()
print(f'Hello, {athlete.firstname}, I know you.')
Hello, Chase, I know you.
rows = []
for activity in client.get_activities(
        after="2018-12-31T00:00:00Z",
        before="2020-01-01T00:00:00Z"):
    if activity.type == "Ride":
        rows.append(
            {
                'date': activity.start_date_local.date(),
                'activity_id': activity.id,
                'moving_time': activity.moving_time,
                'name': activity.name,
                'distance': round(float(unithelper.miles(activity.distance)), 2),
                'elevation gain': float(unithelper.feet(activity.total_elevation_gain)),
                'type': activity.type,
                'trainer': activity.trainer,
                'average_speed': float(unithelper.miles_per_hour(activity.average_speed)),
                'average_watts': activity.average_watts,
                'suffer_score': activity.suffer_score,
                'average_heartrate': activity.average_heartrate,
                'average_cadence': activity.average_cadence,
                'kilojoules': activity.kilojoules,
                'gear_id': activity.gear_id,
                'average_temp': activity.average_temp,
                'start_longitude': activity.start_longitude,
                'start_latitude': activity.start_latitude
            }
        )
# build the DataFrame once at the end; appending to a DataFrame row-by-row
# is slow, and DataFrame.append was removed in pandas 2.0
rides = pd.DataFrame(rows)
rides.head()
date | moving_time | activity_id | name | distance | elevation gain | type | trainer | average_speed | average_watts | suffer_score | average_heartrate | average_cadence | kilojoules | gear_id | average_temp | start_longitude | start_latitude | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-12-31 | 00:20:03 | 2971700880 | FTP Test | 7.48 | 0.000000 | Ride | True | 22.385021 | 225.5 | 55 | 172.5 | 94.1 | 271.3 | b5499491 | None | None | None |
1 | 2019-12-31 | 00:09:58 | 2971670149 | 10 min FTP Warm Up Ride with Matt Wilpers | 2.98 | 0.000000 | Ride | True | 17.940229 | 132.9 | 7 | 141.4 | 91.6 | 79.4 | b4933861 | None | None | None |
2 | 2019-12-29 | 00:22:59 | 2966736434 | Coronado Island Perkins Ride part 2 | 3.34 | 26.902887 | Ride | False | 8.706156 | 55.3 | 5 | 123 | None | 76.2 | b1315248 | None | -117.18 | 32.68 |
3 | 2019-12-29 | 00:51:56 | 2966569823 | Coronado Island Perkins Ride part 1 | 8.29 | 38.713911 | Ride | False | 9.571850 | 63.7 | 13 | 122.5 | None | 198.6 | b1315248 | None | -117.17 | 32.7 |
4 | 2019-12-24 | 00:20:14 | 2954036814 | 20 min 16 sec Just Ride | 6.03 | 0.000000 | Ride | True | 17.882069 | 129 | 13 | 139.4 | 86 | 156.6 | b5499491 | None | None | None |
Data Pre-Processing
rides['date'] = pd.to_datetime(rides['date'])
rides['moving_time'] = rides['moving_time'].apply(lambda x: x/np.timedelta64(1,'m'))
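Dividing by `np.timedelta64(1, 'm')` converts the timedelta column straight to float minutes. A quick sketch on made-up values:

```python
import numpy as np
import pandas as pd

# two hypothetical moving times, as pandas Timedeltas
mt = pd.Series([pd.Timedelta(minutes=20, seconds=3),
                pd.Timedelta(minutes=9, seconds=58)])

# dividing by a one-minute timedelta yields float minutes
minutes = mt / np.timedelta64(1, 'm')
print(minutes.round(2).tolist())  # [20.05, 9.97]
```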
# date additions of time_added
rides['year'] = rides['date'].dt.year
rides['month'] = rides['date'].dt.month
rides['mnth_yr'] = rides['date'].dt.strftime('%Y-%m')
rides['day'] = rides['date'].dt.day
rides['dow'] = rides['date'].dt.day_name()  # dt.weekday_name was removed in newer pandas
rides['week_number'] = rides['date'].dt.isocalendar().week  # dt.week is deprecated
rides['hour'] = rides['date'].dt.hour  # always 0 here, since only the date (not start time) was kept
rides['date'] = rides['date'].dt.strftime('%Y-%m-%d')
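One detail worth sanity-checking on a known date: 2019-12-29 was a Sunday in ISO week 52, while 2019-12-31 already falls in ISO week 1 of 2020, which is why the week_number column wraps around to 1 at year end in the table below.

```python
import pandas as pd

d = pd.Series(pd.to_datetime(['2019-12-29', '2019-12-31']))
print(d.dt.day_name().tolist())          # ['Sunday', 'Tuesday']
print(d.dt.isocalendar().week.tolist())  # [52, 1]
```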
rides.head()
date | moving_time | activity_id | name | distance | elevation gain | type | trainer | average_speed | average_watts | ... | average_temp | start_longitude | start_latitude | year | month | mnth_yr | day | dow | week_number | hour | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-12-31 | 20.050000 | 2971700880 | FTP Test | 7.48 | 0.000000 | Ride | True | 22.385021 | 225.5 | ... | None | None | None | 2019 | 12 | 2019-12 | 31 | Tuesday | 1 | 0 |
1 | 2019-12-31 | 9.966667 | 2971670149 | 10 min FTP Warm Up Ride with Matt Wilpers | 2.98 | 0.000000 | Ride | True | 17.940229 | 132.9 | ... | None | None | None | 2019 | 12 | 2019-12 | 31 | Tuesday | 1 | 0 |
2 | 2019-12-29 | 22.983333 | 2966736434 | Coronado Island Perkins Ride part 2 | 3.34 | 26.902887 | Ride | False | 8.706156 | 55.3 | ... | None | -117.18 | 32.68 | 2019 | 12 | 2019-12 | 29 | Sunday | 52 | 0 |
3 | 2019-12-29 | 51.933333 | 2966569823 | Coronado Island Perkins Ride part 1 | 8.29 | 38.713911 | Ride | False | 9.571850 | 63.7 | ... | None | -117.17 | 32.7 | 2019 | 12 | 2019-12 | 29 | Sunday | 52 | 0 |
4 | 2019-12-24 | 20.233333 | 2954036814 | 20 min 16 sec Just Ride | 6.03 | 0.000000 | Ride | True | 17.882069 | 129 | ... | None | None | None | 2019 | 12 | 2019-12 | 24 | Tuesday | 52 | 0 |
5 rows × 25 columns
Exploratory Data Analysis
# Generate the range of dates from the first ride to the last ride
first_date = rides['date'].tail(1).values[0]
last_date = rides['date'].head(1).values[0]
all_dates = pd.date_range(start=first_date, end=last_date)
all_dates = pd.DataFrame(all_dates, columns=['date'])
# Total activities / days in the date range
perc_activities = round(len(rides) / len(all_dates), 2)
perc_activities
0.43
rides['date'] = pd.to_datetime(rides['date'])
cycling_distance_per_date = pd.merge(left=all_dates, right=rides, left_on="date", right_on="date", how="outer")
cycling_distance_per_date['distance'].fillna(0, inplace=True)
cycling_distance_per_date['RollingMeanMi'] = cycling_distance_per_date['distance'].rolling(window=10, center=True).mean()
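The outer merge plus `fillna(0)` puts a zero-mile row on every rest day, so the rolling mean runs over calendar days rather than only ride days; with `center=True` each day is averaged with the days on either side, leaving NaNs at both ends of the series. A toy version of the same pipeline (dates and distances below are made up):

```python
import pandas as pd

# hypothetical week: rides logged on two of seven days
rides_toy = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-02', '2019-01-05']),
    'distance': [10.0, 20.0],
})
all_days = pd.DataFrame({'date': pd.date_range('2019-01-01', '2019-01-07')})

daily = all_days.merge(rides_toy, on='date', how='outer')
daily['distance'] = daily['distance'].fillna(0)  # rest days count as zero miles

# 3-day centered rolling mean; the end points stay NaN
daily['rolling'] = daily['distance'].rolling(window=3, center=True).mean()
print(daily['rolling'].tolist())
```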
chart = alt.Chart(cycling_distance_per_date).mark_line().encode(
x='date',
y='RollingMeanMi:Q'
)
chart.properties(
width=800,
height=250,
title='Ten Day Rolling Miles Ridden in 2019'
)
cycling_month = rides.groupby(['mnth_yr'])['distance'].sum().reset_index(name='distance')
alt.Chart(cycling_month).mark_bar(size=25).encode(
x=alt.X('mnth_yr:T', axis=alt.Axis(title='date')),
y='distance:Q'
).properties(
width=800,
height=200,
title='Miles Ridden per Month in 2019'
)
cycling_month_time = rides.groupby(['mnth_yr'])['moving_time'].sum().reset_index(name='moving_time')
cycling_month_time['hours'] = cycling_month_time['moving_time'] / 60
alt.Chart(cycling_month_time).mark_bar(size=25).encode(
x=alt.X('mnth_yr:T', axis=alt.Axis(title='date')),
y=alt.Y('hours:Q', axis=alt.Axis(title='Hours'))
).properties(
width=800,
title='Hours Moving per Month in 2019'
)
rides['Hours'] = rides['moving_time'] / 60
alt.Chart(rides).mark_bar(size=4).encode(
x="date:T",
y="Hours:Q",
color='Hours:Q'
).properties(
width=800,
title='Hours Moving per activity in 2019'
)
rides['Hours'] = rides['moving_time'] / 60
bars = alt.Chart(rides).mark_bar(size=4).encode(
x="date:T",
y="Hours:Q",
color='Hours:Q'
).properties(
width=800,
title='Hours Moving per activity in 2019'
)
events = pd.DataFrame([
{
"start": "2019-09-02",
"end": "2019-09-10",
"event": "Cycling Trip with my Father"
}
])
end = alt.Chart(events).mark_rule(
color="#FFA500",
strokeWidth=2
).encode(
x='end:T'
).transform_filter(alt.datum.event == "Cycling Trip with my Father")
start = alt.Chart(events).mark_rule(
color="#FFA500",
strokeWidth=2
).encode(
x='start:T'
).transform_filter(alt.datum.event == "Cycling Trip with my Father")
text = alt.Chart(events).mark_text(
align='left',
baseline='middle',
dx=7,
dy=-135,
size=15
).encode(
x='start:T',
x2='end:T',
text='event',
color=alt.value('#FFA500')
)
(bars + text + start + end).properties(width=600)
alt.Chart(rides).mark_bar().encode(
alt.X("Hours:Q", bin=True),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
color='trainer:N',
).properties(
width=500,
title='Number of Activities by Time'
)
alt.Chart(rides).mark_bar().encode(
alt.X("distance:Q", bin=True),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
color='trainer:N',
).properties(
width=500,
title='Number of Activities by Distance'
)
alt.Chart(rides).mark_bar().encode(
alt.X("average_watts:Q",bin=alt.Bin(step=20)),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
color='trainer:N',
).properties(
width=500,
title='Average Watts Histogram'
)
alt.Chart(rides).mark_bar().encode(
alt.X("average_heartrate:Q",bin=alt.Bin(step=10)),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
color='trainer:N',
).properties(
width=500,
title='Average Heartrate Histogram'
)
alt.Chart(rides).mark_bar().encode(
alt.X("kilojoules:Q",bin=alt.Bin(step=50)),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
).properties(
width=500,
title='Kilojoules Histogram'
)
alt.Chart(rides).mark_bar(color='red').encode(
alt.X("suffer_score:Q",bin=alt.Bin(step=10)),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
).properties(
width=500,
title='Suffer Score Histogram'
)
alt.Chart(rides).mark_bar(color='green').encode(
alt.X("elevation gain:Q",bin=alt.Bin(step=500)),
y=alt.Y('count(activity_id):Q', axis=alt.Axis(title='# of rides')),
).properties(
width=500,
title='Total Elevation Gain Histogram'
)
categoryNames = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
bar = alt.Chart(rides).mark_bar(size=50).encode(
x=alt.X('dow', sort=categoryNames, axis=alt.Axis(title='Day of the Week')),
y='mean(Hours)',
color='mean(Hours):Q'
)
rule = alt.Chart(rides).mark_rule(color='red').encode(
y='mean(Hours):Q'
)
(bar + rule).properties(
width=500,
title='Average Hours Riding by Day of the Week'
)
alt.Chart(rides).mark_rect().encode(
x='week_number:O',
y='dow:O',
color='Hours:Q'
)
alt.Chart(rides).mark_circle().encode(
alt.X(alt.repeat("column"), type='quantitative'),
alt.Y(alt.repeat("row"), type='quantitative'),
color='trainer:N',
).properties(
width=150,
height=150
).repeat(
row=['kilojoules', 'average_watts', 'suffer_score','average_heartrate'],
column=['kilojoules', 'average_watts', 'suffer_score','average_heartrate']
).interactive()
states = alt.topo_feature(data.us_10m.url, feature='states')
# US states background
background = alt.Chart(states).mark_geoshape(
fill='lightgray',
stroke='white'
).properties(
width=800,
height=500
).project('albersUsa')
# ride start positions on the background map
points = alt.Chart(rides).mark_circle(
size=20,
color='steelblue'
).encode(
longitude='start_longitude:Q',
latitude='start_latitude:Q',
tooltip=['name', 'date', 'distance', 'elevation gain']
)
background + points.properties(
title='2019 Rides plotted Across America'
)
Yearly Summary
# Set Year and Workout Type:
target_year = 2019
# Examples: 'Ride', 'Run', 'Walk', 'Swim', 'WeightTraining', 'Hike'
target_type = 'Ride'
def yearly_summary(year, workout_type):
    # Data Setup
    year_data = rides[(rides['year'] == year) & (rides.type == workout_type)].copy()
    year_data["moving_time2"] = year_data["moving_time"].astype(str)
    year_data["date"] = year_data["date"].astype(str)
    print('====== {} {} Summary ====== '.format(year, workout_type))
    print('Total Number of {} Workouts: {:,}'.format(workout_type, len(year_data)))
    print('Total {} Distance: {:,} miles'.format(workout_type, round(year_data['distance'].sum(), 2)))
    print('Total {} Elevation Gain: {:,} feet'.format(workout_type, round(year_data['elevation gain'].sum(), 2)))
    print(' ')
    average_cycling_distance = round(year_data['distance'].mean(), 1)
    print('Average {} Distance: {:,} miles'.format(workout_type, average_cycling_distance))
    average_cycling_time = round(year_data['moving_time'].mean(), 2)
    print('Average {} Time: {:,} minutes'.format(workout_type, average_cycling_time))
    average_cycling_speed = round(year_data['average_speed'].mean(), 2)
    print('Average {} Speed: {:,} mph'.format(workout_type, average_cycling_speed))
    average_cycling_hr = int(round(year_data['average_heartrate'].mean(), 2))
    print('Average {} Heartrate: {:,} bpm'.format(workout_type, average_cycling_hr))
    average_cycling_ascent = int(round(year_data['elevation gain'].mean(), 2))
    print('Average {} Total Ascent: {:,} ft'.format(workout_type, average_cycling_ascent))
    average_cycling_power = int(round(year_data['average_watts'].mean(), 2))
    print('Average {} Power: {:,} watts'.format(workout_type, average_cycling_power))
    print(' ')
    print('{}s with the highest power:'.format(workout_type))
    for index, row in year_data.sort_values(by=['average_watts'], ascending=False).head(5).iterrows():
        print(str(row["average_watts"]) + " watts for " + row["moving_time2"] + " minutes: '" + row["name"] + "' on " + row["date"])
    print(' ')
    print('Longest {}s:'.format(workout_type))
    for index, row in year_data.sort_values(by=['distance'], ascending=False).head(5).iterrows():
        print(str(row["distance"]) + " mi: " + row["name"] + " on " + row["date"])
yearly_summary(year=target_year, workout_type=target_type)
====== 2019 Ride Summary ======
Total Number of Ride Workouts: 156
Total Ride Distance: 2,211.89 miles
Total Ride Elevation Gain: 53,716.21 feet
Average Ride Distance: 14.2 miles
Average Ride Time: 54.47 minutes
Average Ride Speed: 16.64 mph
Average Ride Heartrate: 146 bpm
Average Ride Total Ascent: 344 ft
Average Ride Power: 134 watts
Rides with the highest power:
225.5 watts for 20.05 minutes: 'FTP Test' on 2019-12-31
217.4 watts for 19.983333333333334 minutes: '20 min FTP Test Ride with Matt Wilpers' on 2019-09-24
202.0 watts for 29.966666666666665 minutes: '30 min Tabata Ride with Robin Arzon' on 2019-02-27
199.1 watts for 19.983333333333334 minutes: '20 min FTP Test Ride with Matt Wilpers' on 2019-01-16
187.6 watts for 29.966666666666665 minutes: '30 min Pop Ride with Cody Rigsby' on 2019-06-25
Longest Rides:
41.96 mi: TCH Gravel Camp 2019: Seeley to Ovando Abonded on 2019-09-04
41.94 mi: Red Rocks and Back. Slow & steady gets it done. on 2019-06-16
39.46 mi: TCH Gravel Camp 2019: Morrell Mountain Lookout on 2019-09-05
37.28 mi: Lookout N Back on 2019-05-12
37.04 mi: TCH Gravel 2019: Seeley to Holland Lake Lodge on 2019-09-06