
I have a pandas DataFrame with the following columns:

id | value        | somedate
------------------------------
1  | [10, 13, 14] | 2024-06-01
2  | [5, 6, 7]    | 2024-07-01
3  | [1, 2, 3]    | 2024-06-01

I'm doing the following transformation to parse the value column and explode the DataFrame:

import ast
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "[5, 6, 7]", "[1, 2, 3]"],
    "somedate": ["2024-06-01", "2024-07-01", "2024-06-01"],
})
# Parse each string into a Python list, then emit one row per list element
data["parsed_value"] = data["value"].apply(ast.literal_eval)
data = data.explode(column="parsed_value")

This works well on small DataFrames, but becomes incredibly slow on very large ones (tens to hundreds of millions of rows).

Obviously, there are options like parallel processing or switching to other packages such as dask or polars, but first I want to make sure I'm not missing an obvious solution within the current tech stack.

2 Answers

If you can assume that every string is a valid list of non-negative integers, you can extract the numbers with a regex via str.extractall, then join the result back onto the original DataFrame by index:

out = data.join(data['value'].str.extractall(r'(\d+)')[0]
                             .astype(int)
                             .droplevel('match').rename('parsed_value'))

Output:

   id         value    somedate parsed_value
0   1  [10, 13, 14]  2024-06-01           10
0   1  [10, 13, 14]  2024-06-01           13
0   1  [10, 13, 14]  2024-06-01           14
1   2     [5, 6, 7]  2024-07-01            5
1   2     [5, 6, 7]  2024-07-01            6
1   2     [5, 6, 7]  2024-07-01            7
2   3     [1, 2, 3]  2024-06-01            1
2   3     [1, 2, 3]  2024-06-01            2
2   3     [1, 2, 3]  2024-06-01            3
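
The pattern `(\d+)` only matches digits, so a minus sign would be silently dropped. The sketch below checks the regex result against ast.literal_eval on a made-up sample that contains a negative value; the helper name explode_with_regex and the signed pattern `(-?\d+)` are illustrative assumptions, not part of the answer above.

```python
import ast
import pandas as pd

# Made-up sample with one negative value (not from the question's data)
data = pd.DataFrame({
    "id": [1, 2],
    "value": ["[10, -13, 14]", "[5, 6]"],
})

def explode_with_regex(df, pattern):
    """Hypothetical helper: one row per regex match, joined back on the index."""
    matches = (df["value"].str.extractall(pattern)[0]
                          .astype(int)
                          .droplevel("match")
                          .rename("parsed_value"))
    return df.join(matches)

# Reference result from the ast.literal_eval approach in the question
expected = (data.assign(parsed_value=data["value"].apply(ast.literal_eval))
                .explode("parsed_value")["parsed_value"]
                .astype(int)
                .tolist())

unsigned = explode_with_regex(data, r"(\d+)")["parsed_value"].tolist()
signed = explode_with_regex(data, r"(-?\d+)")["parsed_value"].tolist()

print(unsigned == expected)  # False: -13 was read as 13
print(signed == expected)    # True
```

If the real data can contain negative numbers, floats, or anything other than plain digits, the pattern would need to be adapted accordingly.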

Timings:

The regex approach is almost twice as fast as ast.literal_eval for DataFrames of 1,000+ rows:

[Figure: timing comparison of ast.literal_eval and regex extractall]
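
To reproduce a comparison like this on your own machine, a rough timing harness might look like the following. This is only a sketch: the frame is the question's three sample rows repeated an arbitrary number of times, the function names are hypothetical, and the actual ratio will depend on your pandas version, hardware, and data.

```python
import ast
import timeit
import pandas as pd

# Assumed benchmark data: the three sample rows repeated (size chosen arbitrarily)
n_repeats = 2_000
base = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "[5, 6, 7]", "[1, 2, 3]"],
    "somedate": ["2024-06-01", "2024-07-01", "2024-06-01"],
})
# ignore_index=True keeps the index unique, which the index-based join relies on
data = pd.concat([base] * n_repeats, ignore_index=True)

def with_literal_eval(df):
    return (df.assign(parsed_value=df["value"].apply(ast.literal_eval))
              .explode("parsed_value"))

def with_extractall(df):
    parsed = (df["value"].str.extractall(r"(\d+)")[0]
                         .astype(int)
                         .droplevel("match")
                         .rename("parsed_value"))
    return df.join(parsed)

timings = {}
for func in (with_literal_eval, with_extractall):
    # Best of three single runs, in seconds
    timings[func.__name__] = min(timeit.repeat(lambda: func(data), number=1, repeat=3))

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Note that the join is index-based, so a DataFrame with duplicate index labels would need a reset_index first to avoid duplicated matches.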


Another option is json.loads, which parses each string into a Python list. It works here because the strings happen to be valid JSON arrays:

import json

data.assign(parsed_value = data.value.map(json.loads)).explode('parsed_value')

Output:

   id         value    somedate parsed_value
0   1  [10, 13, 14]  2024-06-01           10
0   1  [10, 13, 14]  2024-06-01           13
0   1  [10, 13, 14]  2024-06-01           14
1   2     [5, 6, 7]  2024-07-01            5
1   2     [5, 6, 7]  2024-07-01            6
1   2     [5, 6, 7]  2024-07-01            7
2   3     [1, 2, 3]  2024-06-01            1
2   3     [1, 2, 3]  2024-06-01            2
2   3     [1, 2, 3]  2024-06-01            3
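
One thing to keep in mind is that json.loads is stricter than ast.literal_eval: it only accepts JSON syntax, so Python-only literals fail. The short sketch below illustrates this on made-up strings; the try_parse helper is hypothetical and exists only for the demonstration.

```python
import ast
import json

# Made-up inputs: a JSON-compatible list, single-quoted strings, and a tuple
samples = ["[1, 2, 3]", "['a', 'b']", "(1, 2)"]

def try_parse(parser, text):
    """Hypothetical helper: return the parsed value, or the exception name on failure."""
    try:
        return parser(text)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__

for text in samples:
    print(f"{text!r}: json={try_parse(json.loads, text)!r}, "
          f"ast={try_parse(ast.literal_eval, text)!r}")
```

For columns that hold only bracketed lists of numbers, as in the question, both parsers should return the same lists.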
