So, in order to better understand the data science topic of linear regression, I have been trying to recreate what scikit-learn's LinearRegression module does under the hood. The problem I am having is that when I run a gradient descent on the slope and intercept using my data, I am unable to get them to converge, no matter what step size or number of descent iterations I use. The data I am trying to find the linear relationship between is NBA FG% and NBA W/L%, which can be found here (it's only about 250 rows of data, but I figured it would be easier to share in a pastebin...). You can recreate the initial graph of the data with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def graph1(axis=[]):
    # scatter plot of FG% against W/L%
    x = FG_pct
    y = W_L_pct
    plt.scatter(x, y)
    plt.title('NBA FG% vs. Win%')
    plt.xlabel('FG pct (%)')
    plt.ylabel('Win pct (%)')
    if len(axis) > 1:
        plt.axis(axis)  # optionally pin the viewing window
    plt.show()
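If you want to run this end to end, a minimal loading sketch might look like the following (the file name is hypothetical; the 'FG%' and 'W/L%' column names match the ones used later in this post):

df = pd.read_csv('nba_fg_wl.csv')  # hypothetical file name for the pastebin data
FG_pct = df['FG%']
W_L_pct = df['W/L%']
graph1()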
It will look like this (minus the color):
There is a pretty obvious relationship between the two variables, and you can eyeball a reasonable line of best fit (my guess was a slope of 5 and an intercept of around -1.75).
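To sanity-check an eyeballed guess like that, you can overlay the candidate line on the scatter; this is just a quick sketch of my own using the guessed values above:

# overlay my eyeballed line (m=5, b=-1.75) on the scatter for a visual check
m_guess, b_guess = 5, -1.75
xs = np.linspace(FG_pct.min(), FG_pct.max(), 100)
plt.scatter(FG_pct, W_L_pct)
plt.plot(xs, m_guess * xs + b_guess, color='red')
plt.title('NBA FG% vs. Win% with eyeballed line')
plt.show()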
The gradient descent equations I used, derived by taking the partial derivatives of the loss function with respect to both slope and intercept, are these:
def get_b_gradient(x_pts, y_pts, m, b):
    # dL/db: average residual, scaled by -2
    N = len(x_pts)
    tot = 0
    for x, y in zip(x_pts, y_pts):
        tot += y - (m*x + b)
    gradient = (-2/N)*tot
    return gradient

def get_m_gradient(x_pts, y_pts, m, b):
    # dL/dm: average of x times residual, scaled by -2
    N = len(x_pts)
    tot = 0
    for x, y in zip(x_pts, y_pts):
        tot += x * (y - (m*x + b))
    gradient = (-2/N)*tot
    return gradient
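For reference, these are the mean squared error loss and its partial derivatives, which the two functions above compute:

L(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right), \quad \frac{\partial L}{\partial m} = -\frac{2}{N} \sum_{i=1}^{N} x_i \left( y_i - (m x_i + b) \right)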
def get_step(x_pts, y_pts, m, b, learning_rate):
    # one simultaneous update: both gradients use the same current (m, b)
    init_b = get_b_gradient(x_pts, y_pts, m, b)
    init_m = get_m_gradient(x_pts, y_pts, m, b)
    final_b = b - (init_b*learning_rate)
    final_m = m - (init_m*learning_rate)
    return final_m, final_b

def gradient_descent(x_pts, y_pts, m, b, learning_rate, num_iterations):
    # repeat the update num_iterations times from the initial (m, b)
    for i in range(num_iterations):
        m, b = get_step(x_pts, y_pts, m, b, learning_rate)
    return m, b
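As an aside, the same simultaneous update can be written with NumPy vector operations instead of Python loops; this is just an equivalent sketch of the functions above, not a fix:

def gradient_descent_vectorized(x_pts, y_pts, m, b, learning_rate, num_iterations):
    # identical math to gradient_descent, with the per-point loops vectorized
    x, y = np.asarray(x_pts), np.asarray(y_pts)
    N = len(x)
    for _ in range(num_iterations):
        resid = y - (m * x + b)            # residuals at the current (m, b)
        b_grad = (-2 / N) * resid.sum()
        m_grad = (-2 / N) * (x * resid).sum()
        m, b = m - learning_rate * m_grad, b - learning_rate * b_grad
    return m, b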
After getting these, it is just a matter of finding the right number of iterations and learning rate to get the slope and intercept to converge to their optimal values. Since I am unsure of a systematic way to find these values, I simply try inputting different orders of magnitude into the gradient_descent function:
# 1000 iterations, learning rate of 0.1, and initial slope and intercept guess of 0
m, b = gradient_descent(df['FG%'], df['W/L%'], 0, 0, 0.1, 1000)
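A numerical alternative to eyeballing convergence is to watch the loss itself; here is a small sketch (get_mse is a helper I am introducing here, not part of the code above):

def get_mse(x_pts, y_pts, m, b):
    # mean squared error of the current line
    N = len(x_pts)
    return sum((y - (m*x + b))**2 for x, y in zip(x_pts, y_pts)) / N

m, b = 0, 0
for i in range(1000):
    m, b = get_step(df['FG%'], df['W/L%'], m, b, 0.1)
    if i % 100 == 0:
        print(i, get_mse(df['FG%'], df['W/L%'], m, b))  # the loss should flatten out as the fit converges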
You can track the convergence of your slope and intercept with a graph such as this:
def convergence_graph(iterations, learning_rate, m, b):
    # run the descent once, recording m and b at every step, then plot both
    b_history, m_history = [], []
    for i in range(iterations):
        b_history.append(b)
        m_history.append(m)
        m, b = get_step(df['FG%'], df['W/L%'], m, b, learning_rate)
    plt.subplot(1, 2, 1)
    plt.scatter(range(iterations), b_history, color='orange')
    plt.title('convergence of b')
    plt.subplot(1, 2, 2)
    plt.scatter(range(iterations), m_history, color='blue')
    plt.title('convergence of m')
    plt.show()
And this is really where the problem becomes evident. Using the same number of iterations (1000) and the same learning_rate as before (0.1), you see a graph that looks like this:
I would say that the linearity of those graphs means the descent is still converging at that point, so the answer would be to increase the learning rate, but no matter what order of magnitude I choose for the learning rate (all the way up to millions) the graphs retain their linearity and never converge. I also tried going with a smaller learning rate and messing with the number of iterations... nothing. Ultimately I decided to throw it into sklearn to see if it would have any trouble:
# sklearn expects X as a 2-D array of shape (n_samples, n_features)
FG_pct = np.array(FG_pct)
FG_pct = FG_pct.reshape(-1, 1)
line_fitter = LinearRegression().fit(FG_pct, W_L_pct)
win_loss_predict = line_fitter.predict(FG_pct)
It had no problem:
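For the record, the fitted slope and intercept can be read straight off the model (coef_ and intercept_ are standard LinearRegression attributes):

print(line_fitter.coef_)       # fitted slope, one entry per feature
print(line_fitter.intercept_)  # fitted intercept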
So this is getting rather long, and I am sorry for that. I don't have any data science people to ask directly and no professors around, so I figured I'd throw it up here. Ultimately, I am unsure whether the issue arises in 1) my gradient descent equations or 2) my approach to finding a proper learning rate and number of iterations. If anyone could point out what is happening, why the slope and intercept aren't converging, and what I am doing wrong, that would be much appreciated!
learning_rate=0.5 and num_iterations=50_000 gave m = 6.506 and b = -2.46, which are practically the same as what scikit-learn gives. So it is all hyperparameter tuning: sometimes trying parameters by hand is OK, but when there are many parameters and possibilities, cross-validation is the way to go.
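Concretely, using the question's own function (df and the column names as defined above):

m, b = gradient_descent(df['FG%'], df['W/L%'], 0, 0, 0.5, 50_000)
print(m, b)  # roughly 6.506 and -2.46, matching sklearn's coef_ and intercept_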