I have a use case where I want to convert a DataFrame to a Map. For that I am using the groupByKey and mapGroups operations, but they are running into memory issues.

E.g.:

EmployeeDataFrame

EmployeeId, JobLevel, JobCode
1, 4, 50
1, 5, 60
2, 4, 70
2, 5, 80
3, 7, 90
case class EmployeeModel(EmployeeId: String, JobLevel: String, JobCode: String)

val ds: Dataset[EmployeeModel] = EmployeeDataFrame.as[EmployeeModel]
val groupedData = ds
  .groupByKey(_.EmployeeId)
  .mapGroups((key, rows) => (key, rows.toList))
  .collect()
  .toMap

Expected Map

1, [(),()]
2, [(), ()]
3, [()]
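For reference, the same grouping can also be expressed with the untyped DataFrame API. A minimal sketch, assuming the `EmployeeDataFrame` from above; this only moves the aggregation into Spark SQL, and a `collect()` would still bring every group to the driver:

```scala
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Aggregate each employee's (JobLevel, JobCode) pairs into one array column.
// The shuffle and per-key aggregation run inside Spark SQL.
val grouped = EmployeeDataFrame
  .groupBy("EmployeeId")
  .agg(collect_list(struct(col("JobLevel"), col("JobCode"))).as("jobs"))
```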

Is there a better way of doing this?

  • By collecting the map, you're bringing the entire resulting dataset onto the driver node. How big is your resulting dataset? What do you intend to do with it once collected to the driver node?
    – memoryz
    Commented Sep 21, 2021 at 4:13
  • There is no way to do it in a "better" way. You will need to have the driver node with more RAM than required for this Map.
    Commented Sep 21, 2021 at 6:19
  • Would obtaining a Map for each partition work for your use case? That is having several "partial" maps, one on each worker node.
    – Gaël J
    Commented Sep 21, 2021 at 17:31
  • @Gael Yes, that can work. Can you give me an example of how to do it?
    – itisha
    Commented Sep 22, 2021 at 14:36
  • @memoryz my dataset is quite large, but I am trying to create a lookup table and this is the best approach I could come up with. I am open to suggestions for building the lookup.
    – itisha
    Commented Sep 22, 2021 at 14:38
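A per-partition variant along the lines Gaël J suggests might look like the sketch below. It assumes the `ds` and `spark` session from the question, and hash-partitions by `EmployeeId` first so that all rows for a given key land in the same partition (otherwise a key's rows could be split across several partial maps):

```scala
// Sketch: build one "partial" Map per partition, kept on the workers.
// Assumes ds: Dataset[EmployeeModel] and an active SparkSession `spark`.
import spark.implicits._

val partialMaps = ds
  .repartition($"EmployeeId")   // all rows for a key go to one partition
  .rdd
  .mapPartitions { rows =>
    // Each partition builds its own small lookup Map on the worker.
    Iterator(rows.toList.groupBy(_.EmployeeId))
  }

// The partial maps can then be consumed executor-side (e.g. via foreach or
// further transformations) without collecting one giant Map on the driver.
```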

