I have a use case where I want to convert a DataFrame to a Map. For that I am using the groupByKey and mapGroups operations, but they are running into memory issues.

E.g.:

EmployeeDataFrame

EmployeeId, JobLevel, JobCode
1, 4, 50
1, 5, 60
2, 4, 70
2, 5, 80
3, 7, 90
case class EmployeeModel(EmployeeId: String, JobLevel: String, JobCode: String)

val ds: Dataset[EmployeeModel] = EmployeeDataFrame.as[EmployeeModel]
val groupedData = ds
  .groupByKey(_.EmployeeId)
  .mapGroups((key, rows) => (key, rows.toList))
  .collect()
  .toMap

Expected Map

1, [(),()]
2, [(), ()]
3, [()]
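For reference, the same grouping can also be expressed with the untyped DataFrame API. A minimal sketch, assuming the `EmployeeDataFrame` from above; this only moves the aggregation into Spark SQL, and a `collect()` would still bring every group to the driver:

```scala
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Aggregate each employee's (JobLevel, JobCode) pairs into one array column.
// The shuffle and per-key aggregation run inside Spark SQL.
val grouped = EmployeeDataFrame
  .groupBy("EmployeeId")
  .agg(collect_list(struct(col("JobLevel"), col("JobCode"))).as("jobs"))
```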

Is there a better way of doing this?

  • By collecting the map, you're bringing the entire resulting dataset onto the driver node. How big is your resulting dataset? What do you intend to do with it once collected to the driver node?
    – memoryz
    Commented Sep 21, 2021 at 4:13
  • There is no way to do it in a "better" way. You will need to have the driver node with more RAM than required for this Map.
    Commented Sep 21, 2021 at 6:19
  • Would obtaining a Map for each partition work for your use case? That is having several "partial" maps, one on each worker node.
    – Gaël J
    Commented Sep 21, 2021 at 17:31
  • @Gael Yes, that can work. Can you give me an example of how to do it?
    – itisha
    Commented Sep 22, 2021 at 14:36
  • @memoryz my dataset is quite large, but I am trying to create a lookup table and this is the best approach I could come up with. I am open to suggestions for building the lookup.
    – itisha
    Commented Sep 22, 2021 at 14:38
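A per-partition variant along the lines Gaël J suggests might look like the sketch below. It assumes the `ds` and `spark` session from the question, and hash-partitions by `EmployeeId` first so that all rows for a given key land in the same partition (otherwise a key's rows could be split across several partial maps):

```scala
// Sketch: build one "partial" Map per partition, kept on the workers.
// Assumes ds: Dataset[EmployeeModel] and an active SparkSession `spark`.
import spark.implicits._

val partialMaps = ds
  .repartition($"EmployeeId")   // all rows for a key go to one partition
  .rdd
  .mapPartitions { rows =>
    // Each partition builds its own small lookup Map on the worker.
    Iterator(rows.toList.groupBy(_.EmployeeId))
  }

// The partial maps can then be consumed executor-side (e.g. via foreach or
// further transformations) without collecting one giant Map on the driver.
```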

