I am computing IDF values, and my Python code runs much slower than the PySpark implementation (2+ hours for mine versus seconds). I would like to understand why. I know PySpark runs on the JVM, but the difference seems larger than Python vs. Java alone would explain. I'm using a simple function like so:
import math

from tqdm import tqdm


def calc_idf(data, terms):
    # data is a list of lists filled with tokenized data
    # terms is a list of the tokens to calculate IDF values for
    num_docs = len(data)
    idf_values = []
    for term in tqdm(terms, desc="IDF", position=0, leave=True):
        idf_val = 0  # number of documents containing the term
        for doc in data:
            if term in doc:
                idf_val += 1
        # Using base 2 as the original paper did
        idf_values.append(math.log2((num_docs + 1) / (idf_val + 1)))
    return idf_values
The IDF definition I am using comes from this documentation (https://spark.apache.org/docs/3.5.1/api/python/reference/api/pyspark.mllib.feature.IDF.html). I don't think my PySpark implementation is relevant, but it can be found in this question (Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors); just know that the Python function above is significantly slower.
Could somebody please advise on how I could improve the speed of my IDF calculation?
Edit: Each doc in data is indeed a list, as mentioned by Jerome in the comments. I converted each doc to a set and it is about 68x faster! Thank you!
Depending on the type of `doc`, the code can be even slower or not. This information is unfortunately not provided. If `doc` is a `dict` or a `set`, then each `term in doc` test runs in `O(1)` on average, and if it is a list, then it is `O(n)` (much slower for large lists). You can certainly convert `doc` to a `set` before the loop to make it faster if it is a list (e.g. with more than 10 items).