A Python project working around the Strava API rate limitation in order to extract and analyse full segment effort power data from multiple ride activities.
The daily mood
It's been such a hard week of work and I needed kind of a distracting weekend. First I helped my son build the Lego ISS (consisting of 864 pieces). Then we had neighbourhood discussions about the recent announcement of a huge construction project which will start next year, take place just in front of our apartment doors, get incredibly noisy and finally reduce our green area to an urban "wasteland".
Besides that, I was able to join an outdoor training of my running club for the first time in three months of pandemic. A good transition to the little Python programming initiative around riding power analytics that I started last week. In my previous post, we were able to collect Strava segment power data from a single ride and visualise it.
Still, there is potential for improvement, and I am now looking forward to the following tasks:
- Filter activity requests using input arguments
- Work around the API usage rate limit
- Retrieve more segments and more activities
- Preview multiple activities inside one graph (for comparison)
1. Filter activities
In part 1 we retrieved the last activity only. The stravalib get_activities method offers the following parameters:
- before (datetime.datetime or str or None) – Result will start with activities whose start date is before specified date. (UTC)
- after (datetime.datetime or str or None) – Result will start with activities whose start date is after specified value. (UTC)
- limit (int or None) – How many maximum activities to return.
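Both date filters accept either a datetime object or a plain "YYYY-MM-DD" string. A minimal call, assuming the authenticated client from part 1 (the dates below are just placeholders), could look like this:
import datetime as dt
# both date forms are accepted by stravalib (placeholder values)
activities = client.get_activities(after="2020-06-01", before=dt.datetime(2020, 7, 1), limit=5)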
To start with, new input arguments are added to the program:
parser.add_argument("--batch_size", "-b", help="Limit activity batch size (default: 10)")
parser.add_argument("--filter_before", "-s", help="Filter activity before YYYY-MM-DD (default: None)")
parser.add_argument("--filter_after", "-e", help="Filter activity after YYYY-MM-DD (default: None)")
args = parser.parse_args()
# verify input argument formats
if args.filter_before:
dt.datetime.strptime(args.filter_before, "%Y-%m-%d")
if args.filter_after:
dt.datetime.strptime(args.filter_after, "%Y-%m-%d")
# set default argument value (see --help)
if not args.batch_size:
args.batch_size = 10
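Assuming the script is saved as strava_efforts.py (the file name is just an example), the new arguments can then be exercised like this:
python3 strava_efforts.py --batch_size 5 --filter_after 2020-06-01 --filter_before 2020-06-30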
Then the activity request is extended with the new parameters:
try:
    activities = client.get_activities(
        limit=args.batch_size,
        after=args.filter_after,
        before=args.filter_before
    )
except exc.RateLimitExceeded:
    # simple handling from part 1 (improved by the retry loop in section 2)
    sys.exit("API rate limit exceeded.")
Then we loop across the activity list:
for act_summary in activities:
    # https://developers.strava.com/docs/reference/#api-models-SummaryActivity
    if act_summary.type == 'Ride' and act_summary.name != 'Gravel':  # filter road rides only
        # process the activity (collect segment efforts, as in part 1)
        ...
2. Work around the rate limit
In part 1 we introduced some error handling for the API usage rate limit. Strava allows 100 requests per quarter of an hour and 1000 total requests (including those failing against the rate limit) per UTC day. Let us ignore the second hard limit for now and create a retry method addressing the first one:
# handle process retry while informing the user at smaller intervals
# my credits to https://stackoverflow.com/questions/13071384/ceil-a-datetime-to-next-quarter-of-an-hour
import datetime, time, math

def retry_loop_next_quarter():
    dt = datetime.datetime.now()
    # how many secs have passed this hour
    nsecs = dt.minute*60 + dt.second + dt.microsecond*1e-6
    # number of seconds to next quarter hour mark
    delta = math.ceil(nsecs / 900) * 900 - nsecs
    # time + number of seconds to quarter hour mark
    dt_next_quarter = dt + datetime.timedelta(seconds=delta)
    delta_sec = 1
    while delta_sec > 0:
        time.sleep(30)
        dt_diff = dt_next_quarter - datetime.datetime.now()
        delta_sec = dt_diff.days * 86400 + dt_diff.seconds
        mins, secs = divmod(delta_sec, 60)
        print("Auto-retry in " + str(mins) + " min " + str(secs) + " sec...")
    return
Python offers comprehensive support for a retry loop:
from stravalib import exc

while True:
    try:
        pass  # some stravalib request
    except exc.RateLimitExceeded:
        print("API rate limit exceeded.")
        retry_loop_next_quarter()
        continue
    break
Now the program no longer exits as soon as the API rate limit is exceeded, but retries at the next natural quarter of an hour, as long as the token is still valid and the daily rate limit is not reached (e.g. if the limit trips at 10:07, the next attempt starts at 10:15). Because of this, collecting data might take some time, roughly no more than 3 rides per hour, but we do not care too much about that.
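Putting both pieces together, the activity request from section 1 can be wrapped into the retry loop like this (a sketch; client, args and exc as defined above):
while True:
    try:
        activities = client.get_activities(
            limit=args.batch_size,
            after=args.filter_after,
            before=args.filter_before
        )
        # the stravalib result iterator is lazy, so materialize it here
        # to trigger a potential RateLimitExceeded inside the try block
        activities = list(activities)
    except exc.RateLimitExceeded:
        print("API rate limit exceeded.")
        retry_loop_next_quarter()
        continue
    break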
3. More data
Thanks to the filter and workaround described above, we are not only able to retrieve a complete collection of segment efforts per activity and store it in a separate pickle file, but also to fetch and filter a whole batch of activities.
# loop over activity list
for act_summary in activities:
    # https://developers.strava.com/docs/reference/#api-models-SummaryActivity
    if act_summary.type == 'Ride' and act_summary.name != 'Gravel':  # filter road rides only
        # collect segment efforts and store them (see the sketch below)
        ...
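For reference, here is a minimal sketch of the storage step inside that loop; 'efforts' is a hypothetical list of (avg_grade, avg_power) tuples collected per segment effort as in part 1, and the metadata layout (date at line 2, word 2) mirrors what the read side below expects:
import pandas as pd
# build one dataframe row per segment effort
df = pd.DataFrame(efforts, columns=['avg_grade', 'avg_power'])
# serialize the dataframe next to a small metadata text file
basename = str(act_summary.id)
df.to_pickle(basename + '.pkl')
with open(basename + '.txt', 'w') as f:
    f.write(str(act_summary.name) + '\n')
    f.write('Date: ' + act_summary.start_date_local.strftime('%Y-%m-%d') + '\n')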
4. One single graph
In part 1 we analysed only a subset of one activity: we drew a bar chart of average power by elevation grade range. Now that we are able to collect the records of multiple complete activity tracks, we'll use Seaborn to plot and compare multiple activities with each other. In a nutshell, Seaborn is a high-level library based on Matplotlib; it handles Pandas dataframes instead of NumPy arrays and simplifies complex data representations like those involving grouping and aggregation (e.g. heat maps).
First we extend our program arguments to allow for multiple input files (activities):
# setup program usage and parse arguments
import sys, argparse
parser = argparse.ArgumentParser()
parser.add_argument("--pickle_files", "-f", help="Comma separated list of pickle file(s) ex. --pickle_files=file1.pkl,file2.pkl")
args = parser.parse_args()
# terminate if mandatory argument is not set
if args.pickle_files is None:
    sys.exit("Please specify some pickle file(s) via --pickle_files (--help for Usage)")
And then we iterate over those files, read and plot:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read and filter serialized data
for filename in args.pickle_files.split(','):
    # read metadata file
    with open(filename.split('.')[0] + '.txt') as f:
        metadata = f.readlines()
    metadata = [x.strip() for x in metadata]
    # ride date is at line 2 word 2
    activity_date = metadata[1].split(' ')[1]
    # read pickle file
    df = pd.read_pickle(filename)
    # filter wrong or non relevant elevation grades
    df = df[(df.avg_grade > -3) & (df.avg_grade < 24)]
    # add graph item
    sns.regplot(x='avg_grade', y='avg_power', data=df[['avg_grade','avg_power']], label=activity_date, fit_reg=True)

plt.xlabel('Grade (%)')
plt.ylabel('Power (W)')
plt.title('Activity efforts')
plt.legend(loc='lower right')
plt.show()
Here I am using the linear regression feature of Seaborn.
You probably remember from school maths the exercise of figuring out which line crosses 2 given points? Well, linear regression is somehow the extended exercise, where you need to figure out which line best describes a whole collection of points, that is to say the line with the shortest cumulated distance to all points.
The method is typically used for classification purposes, meaning determining which category new data belongs to, based on categories defined from a relevant set of reference data. This statistical approach is also one of the simplest forms of Machine Learning (ML).
In our case, we could use the best-fit line to determine whether new performances land above or below expectation, in order to get some training indicator. We can also "misuse" the concept for drawing a trend line of power for each training, and compare the lines with each other.
Indeed, given a first-degree function "y = ax + b", the regression factor "a" kind of reflects the intensity by grade (e.g. focus on climbs or not) while "b" reflects the overall performance level.
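To make the comparison concrete, both coefficients can be extracted per activity with NumPy, for instance inside the plotting loop above (a small sketch reusing df and activity_date):
import numpy as np
# fit y = a*x + b over the filtered efforts of one activity
a, b = np.polyfit(df['avg_grade'], df['avg_power'], deg=1)
print(activity_date + ": grade intensity a=" + str(round(a, 1)) + " W/%, base level b=" + str(round(b)) + " W")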
On the above graph, we can see a steady rise in performance over a single month. The lines don't cross in the observed range because all 3 rides are well distributed in terms of segments (x) and efforts (y), although they are effectively different training routes and distances. Obviously, the linear coefficient varies a lot more if we compare a short time trial on a flat route with a long, hilly ride. Therefore this method requires the user to take care to analyse comparable workouts, or otherwise find another type of "best fit".
Here is an example of an anti-pattern:
In this case it is not obvious whether the delivered effort (and consequently, the performance measure and training effect) is actually better at the beginning or at the end of the month.
Next
I thought about the following remaining tasks, in case I find time and motivation for a last part of this series:
- Look at data representation (i.e. filtering, chart) and description (i.e. fit function) alternatives
- Store input parameters in a .config file instead of environment variables
- Place segment efforts on a time, distance or Strava index frame
- Dedup/re-calculate overlapping segment efforts given the finest level of granularity
- Re-inject power data into Strava activity