How to process scala dataframe in chunks in scala?

Question

We have a huge dataframe in scala of around 120000 rows. We want to process the dataframe into chunks of 25 each and do 1 http request for 25 rows together as we divide. What is the best way to divide the dataframe and do some operations on each chunk.

For Example:

Consider this dataframe val df = Seq((1, 2, 3, 4, 5, 6, 7, 8, 9, 10)).toDF() and let's suppose we want to first take 4 rows and perform some operation on those, then next 4 perform operation on them and then perform operation on remaining 2

Koedlt · Accepted Answer · 2023-01-04 08:40:47Z

You could repartition your dataframe to chop it up in the size that you want, and then use foreachPartition to do a certain operation on each of these partitions.

On your small example, it could look something like the following:

val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF().repartition(3)

df.rdd.foreachPartition(iterator => {
    println(s"printing: ${iterator.mkString(",")}")
})

printing: [1],[3],[8]
printing: [5],[7],[9]
printing: [2],[4],[6],[10]

In here, we're using .repartition(3) to chop up our dataset in the desired size. Note that in this case the size of our partitions is 3 or 4 (since 10 is not a multiple of 4).

So in your case, you simply have to:

change .repartition(3) to .repartition(4800) (120000/25 = 4800)
change the println(s"printing: ${iterator.mkString(",")}") line to whatever operation you want to be doing

Collectives™ on Stack Overflow

How to process scala dataframe in chunks in scala?

1 Answer 1

Not the answer you're looking for? Browse other questions tagged
scala
apache-spark
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Not the answer you're looking for? Browse other questions tagged scalaapache-spark or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
scala
apache-spark
or ask your own question.