I am computing IDF values, and my Python code runs much slower than the PySpark implementation (2+ hours for mine versus seconds). I would like to understand why. I know PySpark runs on the JVM under the hood, but the difference seems larger than Python vs. Java alone would explain. I'm using a simple function like this:

import math

from tqdm import tqdm

def calc_idf(data, terms):
    # data is a list of lists filled with tokenized data (one inner list per document)
    # terms is a list of the tokens to calculate IDF values for
    num_docs = len(data)

    idf_values = []
    for term in tqdm(terms, desc="IDF", position=0, leave=True):
        # Count the documents that contain this term (document frequency)
        idf_val = 0
        for doc in data:
            if term in doc:
                idf_val += 1
        idf_values.append(math.log2((num_docs + 1) / (idf_val + 1)))  # Using base 2 as original paper did

    return idf_values

The PySpark IDF I am comparing against is from this documentation (https://spark.apache.org/docs/3.5.1/api/python/reference/api/pyspark.mllib.feature.IDF.html). I don't think my PySpark code is relevant, but it can be found in this question (Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors); just know that the Python function above is significantly slower.

Could somebody please advise on how I could improve the speed of my IDF calculation?

Edit: Each doc in data is indeed a list, as Jerome mentioned in the comments. I converted each doc to a set and it is about 68x faster! Thank you!
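For anyone reading later, the change described in the edit can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact code that produced the 68x figure; the name calc_idf_sets is made up for illustration, and the tqdm progress bar is left out to keep it self-contained.

```python
import math

def calc_idf_sets(data, terms):
    # data: list of token lists (one per document); terms: tokens to score
    num_docs = len(data)

    # Convert each document to a set once, so `term in doc` is an
    # average O(1) hash lookup instead of an O(n) list scan.
    doc_sets = [set(doc) for doc in data]

    idf_values = []
    for term in terms:
        # Document frequency: number of documents containing the term
        doc_freq = sum(1 for doc in doc_sets if term in doc)
        idf_values.append(math.log2((num_docs + 1) / (doc_freq + 1)))
    return idf_values
```

The conversion happens once, outside the loop over terms, so its cost is paid per document rather than per (term, document) pair.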

  • I do not know PySpark, but be aware that Python code is typically executed by CPython, which is an interpreter. There are alternative Python implementations with JIT compilation (e.g. PyPy, Jython), but they often do not fully support all Python features and are not that fast (PyPy is the fastest). Interpreters are generally very slow, especially CPython. Commented Jul 9 at 23:37
  • By the way, depending on what doc is, the code can be even slower. This information is unfortunately not provided. If doc is a dict or set, then membership tests are O(1) on average; if it is a list, they are O(n) (much slower for large lists). If it is a list (with, e.g., more than 10 items), you can certainly convert doc to a set before the loop to make it faster. Commented Jul 9 at 23:43
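Going one step beyond the set conversion suggested in the comments, the loop over every (term, document) pair can be avoided altogether by counting document frequencies in a single pass over the data. The following is only a sketch under the same assumptions as the question (data is a list of token lists, same smoothed log2 formula); calc_idf_counter is a hypothetical name, and this has not been benchmarked against the PySpark version.

```python
import math
from collections import Counter

def calc_idf_counter(data, terms):
    # data: list of token lists (one per document); terms: tokens to score
    num_docs = len(data)

    # One pass over the documents: count each distinct token once per
    # document, which yields the document frequency of every token.
    doc_freq = Counter()
    for doc in data:
        doc_freq.update(set(doc))

    # Counter returns 0 for tokens that never appear in any document.
    return [math.log2((num_docs + 1) / (doc_freq[term] + 1)) for term in terms]
```

The work is then roughly proportional to the total number of tokens plus the number of terms, rather than to the number of terms times the number of documents.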
