I have a pandas DataFrame that looks like this:
import pandas as pd
data = {
    "Race_ID": [1,1,1,2,2,2,2,2,3,3,3,4,4,5,5,5,5,5,5],
    "Student_ID": [3,5,4,1,2,3,4,5,4,3,7,2,3,9,10,2,3,6,5],
    "theta": [3,4,6,8,9,2,12,4,9,0,6,5,2,5,30,3,2,1,50]
}
df = pd.DataFrame(data)
and I would like to create a new column df['feature'] as follows: within each Race_ID, for the row whose Student_ID equals $i$, define feature to be
$$\sum_{k\neq i}\ \sum_{j\neq i,k} f(k,j,i), \qquad f(k,j,i):=\frac{\theta_j+\theta_i}{\theta_i+\theta_j+\theta_k\cdot \theta_i},$$
where $k,j$ range over the Student_IDs within the same Race_ID and $\theta_i$ is the theta of the student whose Student_ID equals $i$. So, for example, for Race_ID $=1$ we have
feature for Student_ID $= 3$: $f(5,4,3)+f(4,5,3)=124/175=0.708571$
feature for Student_ID $= 5$: $f(3,4,5)+f(4,3,5)=232/341=0.680352$
feature for Student_ID $= 4$: $f(3,5,4)+f(5,3,4)=97/154=0.629870$
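To make the definition concrete, here is a minimal brute-force sketch that evaluates it for Race_ID 1 only, using exact fractions. The names `theta`, `f` and `feature` are illustrative helpers rather than part of any library, and the sketch assumes each Student_ID appears at most once per race. It is meant as a reference for checking results, not as a fast solution.

```python
from fractions import Fraction
from itertools import permutations

# theta per Student_ID for Race_ID = 1, copied from the data above
theta = {3: 3, 5: 4, 4: 6}

def f(k, j, i):
    """f(k, j, i) as defined above, kept as an exact fraction."""
    t_i, t_j, t_k = theta[i], theta[j], theta[k]
    return Fraction(t_j + t_i, t_i + t_j + t_k * t_i)

def feature(i):
    """Sum f(k, j, i) over ordered pairs (k, j) of distinct students other than i."""
    others = [s for s in theta if s != i]
    return sum(f(k, j, i) for k, j in permutations(others, 2))

for student in theta:
    value = feature(student)
    # Print the exact fraction and its decimal value to six places
    print(student, value, f"{float(value):.6f}")
```

Both orderings of each pair $(k,j)$ are counted, which matches the two terms per student in the worked example.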
Here is what I have tried, but it doesn't seem to work, and it is also very slow when the dataframe is large:
def compute_sum(row, df):
    Race_ID = row['Race_ID']
    n_RaceID = df.groupby('Race_ID')['Student_ID'].nunique()[Race_ID]
    theta_i_values = df[(df['Race_ID'] == Race_ID) & (df['Student_ID'] == row['Student_ID'])]['theta'].values
    theta_values = df[(df['Race_ID'] == Race_ID) & (df['Student_ID'] != row['Student_ID'])]['theta'].values
    sum_ = sum([((theta_values[j] + theta_i_values[i]) / (theta_i_values[i] + theta_values[j] + theta_values[k] * theta_i_values[i])) for i in range(len(theta_i_values)) for k in range(len(theta_values)) if k != i for j in range(len(theta_values)) if j != i and j != k])
    return sum_
df['feature'] = df.apply(lambda row: compute_sum(row, df), axis=1)