Speed-up literal_eval in a DataFrame apply

Question

I have a pandas DataFrame with the following columns:

id | value        | somedate
------------------------------
1  | [10, 13, 14] | 2024-06-01
2  | [5, 6, 7]    | 2024-07-01
3  | [1, 2, 3]    | 2024-06-01

I'm doing the following transformation to parse the value column and explode the DataFrame:

import ast
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "[5, 6, 7]", "[1, 2, 3]"],
    "somedate": ["2024-06-01", "2024-07-01", "2024-06-01"],
})
data["parsed_value"] = data["value"].apply(lambda x: ast.literal_eval(x))
data = data.explode(column="parsed_value")

This works well on small DataFrames, but becomes extremely slow on very large ones (tens to hundreds of millions of rows).

Obviously, there are options like parallel processing or switching to other packages such as dask or polars, but I first want to make sure that I'm not missing some obvious solution within the current tech stack.

mozway · Accepted Answer · 2024-07-09 09:38:08Z

If you can assume that the strings are valid and always represent lists of non-negative integers (the \d+ pattern only captures digits), you could use a regex with str.extractall, then join:

out = data.join(data['value'].str.extractall(r'(\d+)')[0]
                             .astype(int)
                             .droplevel('match').rename('parsed_value'))

Output:

   id         value    somedate parsed_value
0   1  [10, 13, 14]  2024-06-01           10
0   1  [10, 13, 14]  2024-06-01           13
0   1  [10, 13, 14]  2024-06-01           14
1   2     [5, 6, 7]  2024-07-01            5
1   2     [5, 6, 7]  2024-07-01            6
1   2     [5, 6, 7]  2024-07-01            7
2   3     [1, 2, 3]  2024-06-01            1
2   3     [1, 2, 3]  2024-06-01            2
2   3     [1, 2, 3]  2024-06-01            3
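
As a quick sanity check, the regex result can be compared against the original literal_eval pipeline. The sketch below is self-contained and rebuilds the sample frame from the question. The optional leading "-" in the pattern is my own addition (an assumption about what your data might contain) so that negative integers keep their sign.

```python
import ast
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "[5, 6, 7]", "[1, 2, 3]"],
    "somedate": ["2024-06-01", "2024-07-01", "2024-06-01"],
})

# Regex approach. The optional "-" is an addition to the pattern in the
# answer, so that negative integers are captured with their sign.
parsed = (
    data["value"].str.extractall(r"(-?\d+)")[0]
    .astype(int)
    .droplevel("match")       # keep only the original row index
    .rename("parsed_value")
)
regex_out = data.join(parsed)  # one output row per extracted number

# Reference result from the question's literal_eval + explode pipeline.
reference = (
    data.assign(parsed_value=data["value"].apply(ast.literal_eval))
        .explode("parsed_value")
)

# True when both approaches produce the same values in the same order.
same = regex_out["parsed_value"].tolist() == reference["parsed_value"].tolist()
print(same)
```

Rows whose string contains no digits at all (for example "[]") produce no match, so after the join they should come out as NaN, which would also turn the column into floats. That is worth checking if such rows occur in your data.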

timings

The extractall approach is almost 2 times faster than literal_eval for DataFrames of 1000+ rows.
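
To reproduce a comparison like this on your own machine, a rough timing harness could look like the following. It is only a sketch: make_frame, with_literal_eval and with_extractall are hypothetical helper names, the synthetic data repeats a single string, and the 10,000-row size is arbitrary. Actual ratios will depend on the pandas version, hardware and the real shape of your lists.

```python
import ast
import timeit
import pandas as pd

def make_frame(n_rows):
    """Build a synthetic frame shaped like the question's data (hypothetical helper)."""
    return pd.DataFrame({
        "id": range(n_rows),
        "value": ["[10, 13, 14]"] * n_rows,
    })

def with_literal_eval(df):
    # Original approach from the question.
    return (df.assign(parsed_value=df["value"].apply(ast.literal_eval))
              .explode("parsed_value"))

def with_extractall(df):
    # Regex approach from the answer.
    parsed = (df["value"].str.extractall(r"(\d+)")[0]
              .astype(int)
              .droplevel("match")
              .rename("parsed_value"))
    return df.join(parsed)

frame = make_frame(10_000)

# Best of three single runs per approach, in seconds.
timings = {
    fn.__name__: min(timeit.repeat(lambda: fn(frame), number=1, repeat=3))
    for fn in (with_literal_eval, with_extractall)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Increasing the argument to make_frame shows how each approach scales. Taking the minimum over several repeats reduces noise from other processes.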

PaulS · Accepted Answer · 2024-07-09 11:28:34Z

Another possible solution uses json.loads, which parses each string into a Python list:

import json

data.assign(parsed_value = data.value.map(json.loads)).explode('parsed_value')

Output:

   id         value    somedate parsed_value
0   1  [10, 13, 14]  2024-06-01           10
0   1  [10, 13, 14]  2024-06-01           13
0   1  [10, 13, 14]  2024-06-01           14
1   2     [5, 6, 7]  2024-07-01            5
1   2     [5, 6, 7]  2024-07-01            6
1   2     [5, 6, 7]  2024-07-01            7
2   3     [1, 2, 3]  2024-06-01            1
2   3     [1, 2, 3]  2024-06-01            2
2   3     [1, 2, 3]  2024-06-01            3
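
One caveat: json.loads only accepts valid JSON, so strings that are valid Python literals but not valid JSON (single-quoted strings, trailing commas) raise json.JSONDecodeError. If such rows might be present, a possible sketch is to try json.loads first and fall back to ast.literal_eval only on failure. The parse_list helper and the sample values below are hypothetical, chosen to trigger the fallback.

```python
import ast
import json
import pandas as pd

def parse_list(text):
    """Try the JSON parser first; fall back to literal_eval (hypothetical helper)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Not valid JSON (e.g. single quotes or a trailing comma),
        # so use the slower but more permissive Python literal parser.
        return ast.literal_eval(text)

# Sample data assumed for illustration: the last two strings are
# valid Python literals but not valid JSON.
data = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "['a', 'b']", "[1, 2, 3,]"],
})

out = (data.assign(parsed_value=data["value"].map(parse_list))
           .explode("parsed_value"))
print(out)
```

If every row is valid JSON, the fallback never runs and the only extra cost over the plain json.loads call is the try block.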
