I have a pandas DataFrame with the following columns:
| id | value        | somedate   |
|----|--------------|------------|
| 1  | [10, 13, 14] | 2024-06-01 |
| 2  | [5, 6, 7]    | 2024-07-01 |
| 3  | [1, 2, 3]    | 2024-06-01 |
I'm doing the following transformation to parse the `value` column and explode the DataFrame:
```python
import ast

import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "[5, 6, 7]", "[1, 2, 3]"],
    "somedate": ["2024-06-01", "2024-07-01", "2024-06-01"],
})
data["parsed_value"] = data["value"].apply(ast.literal_eval)
data = data.explode(column="parsed_value")
```
This works well on small DataFrames, but becomes incredibly slow on very large ones (tens to hundreds of millions of rows).
Obviously, there are options like parallel processing or switching to other packages such as Dask or Polars, but I first wanted to make sure I'm not missing some obvious solution within the current tech stack.
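In case it helps frame answers: one alternative I'd consider benchmarking replaces the per-row `ast.literal_eval` call with vectorized string operations, assuming every `value` entry is a well-formed bracketed list of integers with a `", "` separator (which holds for my data):

```python
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["[10, 13, 14]", "[5, 6, 7]", "[1, 2, 3]"],
    "somedate": ["2024-06-01", "2024-07-01", "2024-06-01"],
})

# Strip the brackets and split on the separator instead of evaluating
# each string with ast.literal_eval; both steps are vectorized .str ops.
data["parsed_value"] = data["value"].str.strip("[]").str.split(", ")
data = data.explode(column="parsed_value")

# explode leaves the split pieces as strings, so cast back to int.
data["parsed_value"] = data["parsed_value"].astype(int)
```

I haven't profiled this at scale yet, so I don't know whether it actually beats `apply` on hundreds of millions of rows.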