A Python project working around the Strava API rate limitation in order to extract and analyse full segment effort power data from multiple ride activities.
The daily mood
It's been such a hard week of work and I needed kind of a distracting weekend. First I helped my son build the Lego ISS (consisting of 864 pieces). Then we had neighbourhood discussions about the recent announcement of a huge construction project which will start next year, take place just in front of our apartment doors, get incredibly noisy and finally reduce our green area to an urban "wasteland".
Besides that, I was able to join an outdoor training of my running club for the first time in three months of pandemic. A good transition to the little Python programming initiative around riding power analytics that I started last week. In my previous post, we were able to collect Strava segment power data from a single ride and visualise it.
Still, there is potential for improvement, and I am now looking forward to the following tasks:
- Filter activity requests using input arguments
- Work around the API usage rate limit
- Retrieve more segments and more activities
- Preview multiple activities inside one graph (for comparison)
1. Filter activities
In part 1 we retrieved the last activity only. The stravalib get_activities method offers the following parameters:
- before (datetime.datetime or str or None) – Result will start with activities whose start date is before specified date. (UTC)
- after (datetime.datetime or str or None) – Result will start with activities whose start date is after specified value. (UTC)
- limit (int or None) – How many maximum activities to return.
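Both date filters accept either a datetime object or a plain "YYYY-MM-DD" string. A minimal call, assuming the authenticated client from part 1 (the dates below are just placeholders), could look like this:
import datetime as dt
# both date forms are accepted by stravalib (placeholder values)
activities = client.get_activities(after="2020-06-01", before=dt.datetime(2020, 7, 1), limit=5)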
To start with, new input arguments are added to the program:
parser.add_argument("--batch_size", "-b", help="Limit activity batch size (default: 10)")
parser.add_argument("--filter_before", "-s", help="Filter activity before YYYY-MM-DD (default: None)")
parser.add_argument("--filter_after", "-e", help="Filter activity after YYYY-MM-DD (default: None)")
args = parser.parse_args()
# verify input argument formats
if args.filter_before:
dt.datetime.strptime(args.filter_before, "%Y-%m-%d")
if args.filter_after:
dt.datetime.strptime(args.filter_after, "%Y-%m-%d")
# set default argument value (see --help)
if not args.batch_size:
args.batch_size = 10
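Assuming the script is saved as strava_efforts.py (the file name is just an example), the new arguments can then be exercised like this:
python3 strava_efforts.py --batch_size 5 --filter_after 2020-06-01 --filter_before 2020-06-30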
Then the activity request is extended with the new parameters:
try:
    activities = client.get_activities(
        limit=args.batch_size,
        after=args.filter_after,
        before=args.filter_before
    )
except exc.RateLimitExceeded:
    # simple handling from part 1 (improved by the retry loop in section 2)
    sys.exit("API rate limit exceeded.")
Then we loop across the activity list:
for act_summary in activities:
    # https://developers.strava.com/docs/reference/#api-models-SummaryActivity
    if act_summary.type == 'Ride' and act_summary.name != 'Gravel':  # filter road rides only
        # process the activity (collect segment efforts, as in part 1)
        ...
2. Work around the rate limit
In part 1 we introduced some error handling for the API usage rate limit. Strava allows 100 requests per quarter of an hour and 1000 total requests (including those failing against the rate limit) per UTC day. Let us ignore the second hard limit for now and create a retry method addressing the first one:
# handle process retry while informing the user at smaller intervals
# my credits to https://stackoverflow.com/questions/13071384/ceil-a-datetime-to-next-quarter-of-an-hour
import datetime, time, math

def retry_loop_next_quarter():
    dt = datetime.datetime.now()
    # how many secs have passed this hour
    nsecs = dt.minute*60 + dt.second + dt.microsecond*1e-6
    # number of seconds to next quarter hour mark
    delta = math.ceil(nsecs / 900) * 900 - nsecs
    # time + number of seconds to quarter hour mark
    dt_next_quarter = dt + datetime.timedelta(seconds=delta)
    delta_sec = 1
    while delta_sec > 0:
        time.sleep(30)
        dt_diff = dt_next_quarter - datetime.datetime.now()
        delta_sec = dt_diff.days * 86400 + dt_diff.seconds
        mins, secs = divmod(delta_sec, 60)
        print("Auto-retry in " + str(mins) + " min " + str(secs) + " sec...")
    return
Python offers comprehensive support for a retry loop:
from stravalib import exc

while True:
    try:
        pass  # some stravalib request
    except exc.RateLimitExceeded:
        print("API rate limit exceeded.")
        retry_loop_next_quarter()
        continue
    break
Now the program no longer exits as soon as the API rate limit is exceeded, but retries at the next natural quarter of an hour, as long as the token is still valid and the daily rate limit is not reached (e.g. if the limit trips at 10:07, the next attempt starts at 10:15). Because of this, collecting data might take some time, roughly no more than 3 rides per hour, but we do not care too much about that.
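Putting both pieces together, the activity request from section 1 can be wrapped into the retry loop like this (a sketch; client, args and exc as defined above):
while True:
    try:
        activities = client.get_activities(
            limit=args.batch_size,
            after=args.filter_after,
            before=args.filter_before
        )
        # the stravalib result iterator is lazy, so materialize it here
        # to trigger a potential RateLimitExceeded inside the try block
        activities = list(activities)
    except exc.RateLimitExceeded:
        print("API rate limit exceeded.")
        retry_loop_next_quarter()
        continue
    break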
3. More data
Thanks to the filter and workaround described above, we are not only able to retrieve a complete collection of segment efforts per activity and store it in a separate pickle file, but also to fetch and filter a whole batch of activities.
# loop over activity list
for act_summary in activities:
    # https://developers.strava.com/docs/reference/#api-models-SummaryActivity
    if act_summary.type == 'Ride' and act_summary.name != 'Gravel':  # filter road rides only
        # collect segment efforts and store them (see the sketch below)
        ...
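For reference, here is a minimal sketch of the storage step inside that loop; 'efforts' is a hypothetical list of (avg_grade, avg_power) tuples collected per segment effort as in part 1, and the metadata layout (date at line 2, word 2) mirrors what the read side below expects:
import pandas as pd
# build one dataframe row per segment effort
df = pd.DataFrame(efforts, columns=['avg_grade', 'avg_power'])
# serialize the dataframe next to a small metadata text file
basename = str(act_summary.id)
df.to_pickle(basename + '.pkl')
with open(basename + '.txt', 'w') as f:
    f.write(str(act_summary.name) + '\n')
    f.write('Date: ' + act_summary.start_date_local.strftime('%Y-%m-%d') + '\n')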
4. One single graph
In part 1 we analysed only a subset of one activity: we drew a bar chart of average power by elevation grade range. Now that we are able to collect the records of multiple complete activity tracks, we'll use Seaborn to plot and compare multiple activities with each other. In a nutshell, Seaborn is a high-level library based on Matplotlib; it handles Pandas dataframes instead of NumPy arrays and simplifies complex data representations like those involving grouping and aggregation (e.g. heat maps).
First we extend our program arguments to allow for multiple input files (activities):
# setup program usage and parse arguments
import sys, argparse
parser = argparse.ArgumentParser()
parser.add_argument("--pickle_files", "-f", help="Comma separated list of pickle file(s) ex. --pickle_files=file1.pkl,file2.pkl")
args = parser.parse_args()
# terminate if mandatory argument is not set
if args.pickle_files is None:
    sys.exit("Please specify some pickle file(s) via --pickle_files (--help for Usage)")
And then we iterate over those files, read and plot:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read and filter serialized data
for filename in args.pickle_files.split(','):
    # read metadata file
    with open(filename.split('.')[0] + '.txt') as f:
        metadata = f.readlines()
    metadata = [x.strip() for x in metadata]
    # ride date is at line 2 word 2
    activity_date = metadata[1].split(' ')[1]
    # read pickle file
    df = pd.read_pickle(filename)
    # filter wrong or non relevant elevation grades
    df = df[(df.avg_grade > -3) & (df.avg_grade < 24)]
    # add graph item
    sns.regplot(x='avg_grade', y='avg_power', data=df[['avg_grade','avg_power']], label=activity_date, fit_reg=True)

plt.xlabel('Grade (%)')
plt.ylabel('Power (W)')
plt.title('Activity efforts')
plt.legend(loc='lower right')
plt.show()
Here I am using the linear regression feature of Seaborn.
You probably remember from school maths the exercise of figuring out which line crosses 2 given points? Well, linear regression is somehow the extended exercise, where you need to figure out which line best describes a whole collection of points, that is to say the line with the shortest cumulated distance to all points.
The method is typically used for classification purposes, meaning determining which category new data belongs to, based on categories defined from a relevant set of reference data. This statistical approach is also one of the simplest forms of Machine Learning (ML).
In our case, we could use the best-fit line to determine whether new performances land above or below expectation, in order to get some training indicator. We can also "misuse" the concept for drawing a trend line of power for each training, and compare the lines with each other.
Indeed, given a first-degree function "y = ax + b", the regression factor "a" kind of reflects the intensity by grade (e.g. focus on climbs or not) while "b" reflects the overall performance level.
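To make the comparison concrete, both coefficients can be extracted per activity with NumPy, for instance inside the plotting loop above (a small sketch reusing df and activity_date):
import numpy as np
# fit y = a*x + b over the filtered efforts of one activity
a, b = np.polyfit(df['avg_grade'], df['avg_power'], deg=1)
print(activity_date + ": grade intensity a=" + str(round(a, 1)) + " W/%, base level b=" + str(round(b)) + " W")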
On the above graph, we can see a steady rise in performance over a single month. The lines don't cross in the observed range because all 3 rides are well distributed in terms of segments (x) and efforts (y), although they are effectively different training routes and distances. Obviously, the linear coefficient varies a lot more if we compare a short time trial on a flat route with a long, hilly ride. Therefore this method requires the user to take care to analyse comparable workouts, or otherwise find another type of "best fit".
Here is an example of an anti-pattern:
In this case it is not obvious whether the delivered effort (and consequently, the performance measure and training effect) is actually better at the beginning or at the end of the month.
Next
I thought about the following remaining tasks, in case I find time and motivation for a last part of this series:
- Look at data representation (i.e. filtering, chart) and description (i.e. fit function) alternatives
- Store input parameters in a .config file instead of environment variables
- Place segment efforts on a time, distance or Strava index frame
- Dedup/re-calculate overlapping segment efforts given the finest level of granularity
- Re-inject power data into Strava activity