I have a use case where I want to convert the dataframe to a map. For that i am using groupbykey and mapGroups operations. But they are running into memory issues.
Eg:
EmployeeDataFrame
EmployeeId, JobLevel, JobCode
1, 4, 50
1, 5, 60
2, 4, 70
2, 5, 80
3, 7, 90
case class EmployeeModel(EmployeeId: String ,JobLevel: String, JobCode: String)
val ds : Dataset[EmployeeModel] = EmployeeDataFrame.as[EmployeeModel]
val groupedData = ds
.groupByKey(_.EmployeeId)
.mapGroups((key, rows) => (key, rows.toList))
.collect()
.toMap
Expected Map
1, [(),()]
2, [(), ()]
3, [()]
Is there a better way of doing this?
Map
.Map
for each partition work for your use case? That is having several "partial" maps, one on each worker node.