Why is my IDF Python code running much slower than PySpark?

I am computing IDF values, and my Python code runs much slower than the PySpark implementation (2+ hours for mine versus seconds for PySpark). I would like to understand why. I know PySpark runs on the JVM, but the difference seems larger than Python vs. Java alone would explain. I'm using a simple function like this:

import math

from tqdm import tqdm


def calc_idf(data, terms):
    # data is a list of lists, each holding one document's tokens
    # terms is a list of the tokens to calculate IDF values for
    num_docs = len(data)

    idf_values = []
    for term in tqdm(terms, desc="IDF", position=0, leave=True):
        idf_val = 0
        for doc in data:
            if term in doc:
                idf_val += 1
        idf_values.append(math.log2((num_docs+1)/(idf_val+1))) # Using base 2 as original paper did

    return idf_values

The PySpark IDF I am comparing against is from this documentation (https://spark.apache.org/docs/3.5.1/api/python/reference/api/pyspark.mllib.feature.IDF.html). I don't think my PySpark code is relevant, but it can be found in this question (Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors); just know that my Python function is significantly slower.

Could somebody please advise on how I could improve the speed of my IDF calculation?

Edit: Each doc in data is indeed a list, as Jérôme mentioned in the comments. Converting each doc to a set made it about 68x faster. Thank you!
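For anyone landing here, this is a minimal sketch of the set conversion described in the edit. It is not necessarily the exact code I ended up with: the function name calc_idf_with_sets is made up for illustration, and the tqdm progress bar is left out to keep it self-contained. The formula is unchanged from the function above.

```python
import math


def calc_idf_with_sets(data, terms):
    # data: list of tokenized documents (each a list of tokens)
    # terms: list of tokens to compute IDF values for
    num_docs = len(data)

    # Convert each document to a set once, up front, so that
    # `term in doc` is an average O(1) hash lookup instead of
    # an O(n) scan through a list.
    doc_sets = [set(doc) for doc in data]

    idf_values = []
    for term in terms:
        # Number of documents that contain the term at least once
        doc_freq = sum(1 for doc in doc_sets if term in doc)
        # Same smoothed base-2 formula as the original function
        idf_values.append(math.log2((num_docs + 1) / (doc_freq + 1)))
    return idf_values
```

The important detail is that the conversion happens once, outside the loop over terms. Calling set(doc) inside the inner loop would rebuild every set for every term and lose most of the benefit.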

edited Jul 10 at 2:39

asked Jul 9 at 19:33

Caden


1

I do not know PySpark, but be aware that Python code is typically executed by CPython, which is an interpreter. There are alternative Python implementations with JIT compilation (e.g. PyPy, Jython, etc.), but they often do not fully support all Python features and are not that fast (PyPy is the fastest). Interpreters are generally VERY slow, especially CPython.
– Jérôme Richard
Commented Jul 9 at 23:37
2

By the way, depending on what doc is, the code can be even slower. Unfortunately, this information is not provided. If doc is a dict or set, then membership tests are O(1) on average; if it is a list, they are O(n) (much slower for a large list). If doc is a list (with e.g. >10 items), you can certainly make the code faster by converting it to a set before the loop.
– Jérôme Richard
Commented Jul 9 at 23:43
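
Going one step beyond the set conversion suggested in the comment above, a possible variation is to count document frequencies for all tokens in a single pass over the data and then look up each term. This is a different technique from the one in the question, shown only as a sketch: the name calc_idf_single_pass is hypothetical, and it assumes the tokens are hashable (e.g. strings). The work becomes roughly proportional to the total number of tokens plus the number of terms, rather than the number of terms times the number of documents.

```python
import math
from collections import Counter


def calc_idf_single_pass(data, terms):
    # data: list of tokenized documents (each a list of tokens)
    # terms: list of tokens to compute IDF values for
    num_docs = len(data)

    # doc_freq[token] = number of documents containing that token.
    # set(doc) ensures a token counts once per document, no matter
    # how many times it repeats inside that document.
    doc_freq = Counter()
    for doc in data:
        doc_freq.update(set(doc))

    # A Counter returns 0 for tokens it has never seen, so terms
    # absent from every document need no special handling.
    return [math.log2((num_docs + 1) / (doc_freq[term] + 1)) for term in terms]
```

With the same smoothed base-2 formula, this should return the same values as the loop in the question, as long as data and terms are the same.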
