
So in order to better understand the data science topic of linear regression, I have been trying to recreate what scikit-learn's LinearRegression module does under the hood. The problem I am having is that when I run gradient descent on the slope and intercept using my data, I cannot get the slope and intercept values to converge, no matter what step size or number of descent iterations I use. The data I am trying to find the linear relationship between is NBA FG% and NBA W/L%, which can be found here (it's only about 250 rows of data but I figured it would be easier to share in a pastebin...). You can recreate the initial graph of the data with:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def graph1(axis=[]):
    # FG_pct and W_L_pct are the two columns of the pastebin data
    x = FG_pct
    y = W_L_pct
    plt.scatter(x, y, label='FG% vs. W/L%')

    plt.title('NBA FG% vs. Win%')
    plt.xlabel('FG pct (%)')
    plt.ylabel('Win pct (%)')
    if len(axis) > 1:
        plt.axis(axis)
    plt.legend()

It will look like this (minus the color):

[scatter plot: NBA FG% vs. Win%]

There is a pretty obvious relationship between the two variables, and you can take a pretty good guess at what the line of best fit would be (my guess was a slope of 5 and an intercept of around -1.75).

The gradient descent equations I used, which are derived by taking the derivatives of the loss function with respect to both slope and intercept, are these:

def get_b_gradient(x_pts, y_pts, m, b):
    # partial derivative of the MSE loss with respect to the intercept b
    N = len(x_pts)
    tot = 0

    for x, y in zip(x_pts, y_pts):
        tot += y - (m*x + b)

    gradient = (-2/N)*tot
    return gradient

def get_m_gradient(x_pts, y_pts, m, b):
    # partial derivative of the MSE loss with respect to the slope m
    N = len(x_pts)
    tot = 0

    for x, y in zip(x_pts, y_pts):
        tot += x * (y - (m*x + b))

    gradient = (-2/N)*tot
    return gradient
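For reference, these two functions implement the partial derivatives of the mean squared error loss; writing them out (my notation, matching the code above, with N the number of data points):

L(m, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr)^2

\frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr), \qquad \frac{\partial L}{\partial m} = -\frac{2}{N}\sum_{i=1}^{N} x_i\bigl(y_i - (m x_i + b)\bigr)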

def get_step(x_pts, y_pts, m, b, learning_rate):
    # one gradient-descent update: move m and b downhill along the gradient
    init_b = get_b_gradient(x_pts, y_pts, m, b)
    init_m = get_m_gradient(x_pts, y_pts, m, b)

    final_b = b - (init_b*learning_rate)
    final_m = m - (init_m*learning_rate)

    return final_m, final_b

def gradient_descent(x_pts, y_pts, m, b, learning_rate, num_iterations):
    for i in range(num_iterations):
        m, b = get_step(x_pts, y_pts, m, b, learning_rate)
    return m, b

After getting these, it is just a matter of finding the right number of iterations and learning rate to get the slope and intercept to converge to the optimal values. Since I am unsure of a systematic way to find these values, I simply try inputting different orders of magnitude into the gradient_descent function:

# 1000 iterations, learning rate of 0.1, and initial slope and intercept guess of 0
m, b = gradient_descent(df['FG%'], df['W/L%'], 0, 0, 0.1, 1000)

You can track the convergence of your slope and intercept using a graph such as this:

def convergence_graph(iterations, learning_rate, m, b):
    # record m and b at every step, then plot both trajectories
    ms, bs = [], []
    for i in range(iterations):
        ms.append(m)
        bs.append(b)
        m, b = get_step(df['FG%'], df['W/L%'], m, b, learning_rate)

    plt.subplot(1, 2, 1)
    plt.scatter(range(iterations), bs, color='orange')
    plt.title('convergence of b')

    plt.subplot(1, 2, 2)
    plt.scatter(range(iterations), ms, color='blue')
    plt.title('convergence of m')
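For example, with the same settings as before (this assumes df is already loaded from the pastebin data):

convergence_graph(1000, 0.1, 0, 0)
plt.show()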

And this is really where the problem is evident. Using the same iterations (1000) and learning_rate as before (0.1) you see a graph that looks like this:

[two side-by-side plots, "convergence of b" and "convergence of m", both still trending linearly with no sign of leveling off]

I would say that the linearity of those graphs means the values are still converging at that point, so the answer would be to increase the learning rate. But no matter what order of magnitude I choose for the learning rate (all the way up to millions), the graphs retain their linearity and never converge. I also tried a smaller learning rate and messed with the number of iterations... nothing. Ultimately I decided to throw the data into sklearn to see if it would have any trouble:

# sklearn expects a 2-D feature array, hence the reshape
FG_pct = np.array(FG_pct)
FG_pct = FG_pct.reshape(-1, 1)

line_fitter = LinearRegression().fit(FG_pct, W_L_pct)

win_loss_predict = line_fitter.predict(FG_pct)
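To compare the fit directly against the gradient-descent output, you can read the slope and intercept off the model (coef_ and intercept_ are standard LinearRegression attributes; coef_ holds one entry per feature):

print(line_fitter.coef_[0], line_fitter.intercept_)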

It had no problem:

[scatter plot with sklearn's fitted regression line through it]

So this is getting rather long and I am sorry for that. I don't have any data sciency people to ask directly and no professors around, so I figured I'd throw it up here. Ultimately, I am unsure whether the issue arises in 1) my gradient descent equations or 2) my approach to finding a proper learning rate and number of iterations. If anyone could point out what is happening, why the slope and intercept aren't converging, and what I am doing wrong, that would be much appreciated!

  • Linear regression can be solved in closed form (least-squares estimation), without hill-climbing; why do you believe that scikit does iterative optimization? – alexis, Commented Jun 5, 2020 at 14:57
  • The only time you need to use optimization for a linear regression is for datasets too large for memory. In those cases the algorithm is a cumulative form of estimation. You can see the closed form using just the design matrix here: en.m.wikipedia.org/wiki/Linear_regression – gph, Commented Jun 5, 2020 at 15:35
  • Your gradient equations are correct - you might try lowering your learning rate (e.g. 1e-5) and increasing the number of iterations (e.g. 50_000). (I was unable to access the dataset btw, maybe the problem is on my end though.) Commented Jun 5, 2020 at 16:26
  • @alexis Gradient descent is the only method I have heard of for linear regression (I have actually never taken a statistics course in my life but have taken three levels of calculus; don't even ask me how that's even possible...). I'll take a look at least-squares estimation as well as any other resources you have for me! Commented Jun 5, 2020 at 18:49
  • @JacobGarwin, many thanks for that. The setting of learning_rate=0.5 and num_iterations=50_000 gave m = 6.506 and b = -2.46, which are practically the same as what scikit-learn gives. So it is all hyperparameter tuning: sometimes trying parameters by hand is OK, but when there are many params and possibilities, cross-validation is the way to go. Commented Jun 5, 2020 at 21:02

1 Answer


I would recommend taking a step back from the way data science material presents these topics. Linear regression, gradient descent: these are not data science topics, they are statistics concepts. I would start looking through intro stats material. Just about anything you pick up will have a chapter on ordinary least squares (OLS) regression.

Gradient descent is a close relative of Newton's method for finding zeros. I strongly recommend looking at that algorithm. It's very accessible if you have a good understanding of calculus, which it sounds like you probably do. If you do look into it, note that there are no "learning rates". That term makes me gag. In the pre "data science" days, aka about 10 years ago, it was called step size.

The step size is critical to the speed of convergence, but if it is too large you will most likely never converge. Say your step size is 10 and your derivative (univariate case) is 0.1. Your guess moves by 1. But what if the minimum was only 0.25 units away from the current guess? Congrats: your solution just got WORSE. You can bounce around the minimum all day and never find it (I suspect this might be what is happening in your code). What many algorithms use instead is a decreasing step size, usually proportional to the number of iterations. For example, on the jth iteration your step size could be 10/j. This too has problems, which can be addressed with stabilizing values and additional bounds on the shape of the step size as the iterations evolve.
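A minimal sketch of that idea, reusing get_step from the question (gradient_descent_decay and base_rate are illustrative names, not part of the original code):

def gradient_descent_decay(x_pts, y_pts, m, b, base_rate, num_iterations):
    # step size shrinks like base_rate / j: large early moves,
    # fine-grained steps later so we stop bouncing around the minimum
    for j in range(1, num_iterations + 1):
        m, b = get_step(x_pts, y_pts, m, b, base_rate / j)
    return m, b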

It's actually really great what you're trying to do. There are WAY too many people "doing data science" that don't know jack about what's really going on. The downside is that this isn't an easy route to take. I encourage you to keep going!! It's worth it. But you'll need to recognize that you've jumped into the deep end a bit. There are simpler algorithms that you'll get much more out of and will lay the foundations for more advanced stuff later.

Edit: More Direct Answer

So, the only thing in your code that needs to change is the gradients. In both gradient calculations change

gradient = (-2/N)*tot

to

gradient = (-2)*tot

Strictly speaking, whether the gradient has an N in the denominator depends on whether you define the loss as the sum or the mean of squared errors; both are valid. Dropping the 1/N multiplies both gradients by N, which is equivalent to scaling the learning rate up by a factor of N. (Some derivations keep the 1/N because they are differentiating the mean squared error; in the closed-form derivation the factor disappears entirely, since the whole thing is set equal to zero.)

It seems like the reason your parameters are going crazy is that your step size is too big. Using that one change, it returned these params:

m, b = gradient_descent(FG_pct, W_L_pct, 6, -1, 0.003, 10000)
# m = 6.465
# b = -2.44

I think in your example you were seeding the algorithm with an initial guess of 0, 0. A good initial guess can make a HUGE difference.

Closed Form Alternative

Here's an example using the closed form. It produces the exact answer with no searching.

from matplotlib.pyplot import plot, scatter
import numpy as np

Y = np.array(W_L_pct)
# design matrix: a column of ones (for the intercept) next to the FG% column;
# ravel handles FG_pct whether it is 1-D or the (N, 1) array reshaped earlier
X = np.array([np.ones(len(FG_pct)), np.ravel(FG_pct)]).T

# normal equations: beta = (X'X)^-1 X'Y
A = np.linalg.inv(np.matmul(X.T, X))
B = np.matmul(X.T, Y)

beta = np.matmul(A, B)
m, b = beta[1], beta[0]
print(m, b)
r = np.arange(0.4, 0.52, 0.01)
scatter(np.ravel(FG_pct), Y)
plot(r, m * r + b)
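If you'd rather not invert X'X yourself, numpy's least-squares solver computes the same beta and is more numerically stable; a minimal sketch reusing X and Y from above:

beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
m, b = beta[1], beta[0]  # same ordering as the design matrix: intercept first
print(m, b)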
  • Thanks for the encouragement. I want to be a data scientist but I figured there is a "right" way and a "wrong" way to do it. I don't see much benefit in typing numbers into an API like sklearn if you don't know what is actually going on, so... that's why I am in this position. I'll definitely take a look at Newton's method for finding zeros and I'll start using "step size"! Commented Jun 5, 2020 at 18:39
