All Questions
2,353
questions
0
votes
1
answer
22
views
How to filter a dataframe by a column from a different dataframe?
I want to filter a dataframe by a column using Strings from a different dataframe.
val booksReadBestAuthors = userReviewsWithBooksDetails.filter(col("authors").isin(userReviewsAuthorsManyRead:_*))
...
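A minimal sketch of the isin approach, assuming the author names live in a second dataframe called authorsDF (the name is illustrative); for large lists a left-semi join avoids collecting to the driver:
import org.apache.spark.sql.functions.col

val userReviewsAuthorsManyRead: Seq[String] =
  authorsDF.select("authors").distinct().collect().map(_.getString(0)).toSeq

val booksReadBestAuthors =
  userReviewsWithBooksDetails.filter(col("authors").isin(userReviewsAuthorsManyRead: _*))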
0
votes
1
answer
31
views
What is the maximum number of entries an array in a Spark column can hold?
I've created a struct that combines the data of some columns. Many of these structs now occur per unique identifier value. I want to combine these structs into an array using collect_list....
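A minimal sketch of the collect_list step, with id, colA and colB as assumed column names; in practice the constraint is less the array type itself (elements are indexed with an Int) than the fact that the whole collected row has to fit in executor memory:
import org.apache.spark.sql.functions.{col, collect_list, struct}

val packed = df
  .withColumn("details", struct(col("colA"), col("colB")))
  .groupBy(col("id"))
  .agg(collect_list(col("details")).as("details_list"))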
0
votes
1
answer
75
views
How to create a dataframe on RocksDB (SST files)
We hold our documents in RocksDB and will be syncing these RocksDB SST files to S3. I would like to create a dataframe on the SST files and later run an SQL query. When I googled, I was not able to ...
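Spark has no built-in SST reader; a driver-side sketch using the RocksDB Java bindings (assuming org.rocksdb is on the classpath, keys and values are UTF-8 strings, and the local path is illustrative; real volumes would need the per-file work moved onto executors):
import org.rocksdb.{Options, ReadOptions, RocksDB, SstFileReader}

RocksDB.loadLibrary()
val reader = new SstFileReader(new Options())
reader.open("/local/sync/from-s3/000001.sst")
val it = reader.newIterator(new ReadOptions())
it.seekToFirst()
val kvs = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
while (it.isValid) {
  kvs += ((new String(it.key(), "UTF-8"), new String(it.value(), "UTF-8")))
  it.next()
}
val docs = spark.createDataFrame(kvs.toSeq).toDF("key", "value")
docs.createOrReplaceTempView("docs") // then spark.sql("SELECT ... FROM docs")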
0
votes
0
answers
22
views
Flattening nested JSON with backslashes in an Apache Spark Scala Dataframe
{
"messageBody": "{\"task\":{\"taskId\":\"c6d9fb0e-42ba-4a3e-bd39-f2a32a6958c1\",\"serializedTaskData\":\"{\\\"clientId\\\":\\\&...
0
votes
0
answers
34
views
Spark: Read special characters from the content of a .dat file without corrupting it in Scala
I have to read all the special characters in a .dat file (e.g. testdata.dat) without corruption and load them into a dataframe in Scala using Spark.
I have one .dat file (e.g. testdata.dat),...
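Corrupted special characters usually mean the wrong charset at read time; a minimal sketch, with the delimiter and encoding as assumptions:
val df = spark.read
  .option("delimiter", "|")
  .option("encoding", "ISO-8859-1") // try the charset the file was written with
  .csv("hdfs:///data/testdata.dat")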
0
votes
0
answers
42
views
Spark recomputes the cached Dataframes
I am working on a Spark application written in Scala. It has six functions; each takes two Dataframes as input, processes them, and emits one result DF. I am caching the result of each function's ...
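cache() is lazy, which is the usual cause here: a minimal sketch, with processStep standing in for one of the six functions, forcing materialization with an action before downstream functions read the result:
val step1 = processStep(dfA, dfB).cache()
step1.count() // materialize the cache so it is not recomputed downstream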
1
vote
2
answers
68
views
Scala - Convert Map to Dataframe where Keys are the Column Titles
I wish to create a dataframe using a map such that the keys of the map are the column titles and the values of the map are the data itself. In Python and PySpark, this can be done quite easily in ...
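A minimal sketch assuming string values and one row per map (keys and values of the same Map iterate in a consistent order):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val m: Map[String, String] = Map("name" -> "abc", "city" -> "xyz")
val schema = StructType(m.keys.toSeq.map(k => StructField(k, StringType)))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row.fromSeq(m.values.toSeq))),
  schema)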
0
votes
1
answer
28
views
Is it possible to use the Spark Dataframe/Dataset API with accumulators?
I read and filter data, and I need to count how each filter operation affects the result.
Is it possible to somehow mix in Spark accumulators while using the Dataframe/Dataset API?
Sample code:
sparkSession.read
....
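A minimal sketch assuming a Dataset[String] named ds: bump a LongAccumulator inside the filter predicate and read it after an action (updates only arrive once an action runs, and stage retries can inflate the count):
val dropped = spark.sparkContext.longAccumulator("droppedRows")

val kept = ds.filter { s =>
  val keep = s.nonEmpty
  if (!keep) dropped.add(1)
  keep
}
kept.count()
println(s"rows filtered out: ${dropped.value}")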
0
votes
1
answer
17
views
Group Spark dataframe column values based on a variable scale
I have a dataframe with survey results. Each question has a varying numeric scale from 4 to 6. I would like to bucket results based on the scale, with the highest two answers being good results and the ...
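A minimal sketch assuming columns answer (the response) and scale (the question's maximum, 4-6), so the top two values on each scale count as good:
import org.apache.spark.sql.functions.{col, when}

val bucketed = df.withColumn("bucket",
  when(col("answer") >= col("scale") - 1, "good").otherwise("other"))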
0
votes
1
answer
32
views
Illegal start of simple expression when calling a Scala function
I have a function declared outside the main method to melt a wide data frame that I got from this post: How to unpivot Spark DataFrame without hardcoding column names in Scala?
def melt(preserves: Seq[...
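That compiler error usually means the def sits where an expression is expected (or a brace is unbalanced), not that the body is wrong. For reference, a sketch of a complete melt along the lines of the linked post (parameter names beyond preserves are assumptions; the melted columns must share a type):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

def melt(df: DataFrame, preserves: Seq[String], toMelt: Seq[String],
         keyName: String = "key", valueName: String = "value"): DataFrame = {
  val kvs = explode(array(
    toMelt.map(c => struct(lit(c).as(keyName), col(c).as(valueName))): _*))
  df.select(preserves.map(col) :+ kvs.as("kv"): _*)
    .select(preserves.map(col) :+ col(s"kv.$keyName") :+ col(s"kv.$valueName"): _*)
}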
0
votes
0
answers
34
views
Scala: create a dataframe column from an array where the array size is variable
I have a variable like
val activityId = "activity_" + activityNum + "_id"
The variable activityNum is incremented in a loop (1, 2, 3, ...),
so I want to create an array where it can ...
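A minimal sketch: generate the column names over the loop's range (the upper bound 3 is illustrative) and build one array column from them:
import org.apache.spark.sql.functions.{array, col}

val activityIds = (1 to 3).map(n => s"activity_${n}_id")
val withIds = df.withColumn("activity_ids", array(activityIds.map(col): _*))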
0
votes
1
answer
52
views
How to read a dataframe with inferSchema as true
I have a dataframe df1 in which all the columns are strings (100+ columns); now I want to cast them to the appropriate types with inferSchema.
For example, like what we do when we have a CSV file and we want the ...
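inferSchema belongs to the file readers, so one workaround (a sketch; the path is illustrative) is to round-trip the all-string df1 through CSV:
val tmp = "/tmp/df1_as_csv"
df1.write.mode("overwrite").option("header", "true").csv(tmp)
val typed = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(tmp)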
0
votes
1
answer
42
views
Convert a Spark DataFrame to a slightly different case class?
I have some data in HDFS that is in parquet-protobuf.
Due to some project constraints, I want to read that data into a Spark DataFrame (easy) and then convert it to a case class that is slightly ...
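A minimal sketch with an assumed target shape: rename and cast columns until they line up with the case class (defined at top level), then call as[]:
case class BookTarget(id: Long, title: String)

import spark.implicits._
val ds = df
  .withColumnRenamed("book_title", "title") // bridge the slight schema difference
  .select($"id".cast("long"), $"title")
  .as[BookTarget]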
0
votes
1
answer
31
views
Get a reference to a case class from its fully qualified name to convert a dataframe to a dataset
I have the fully qualified names of case classes. For my use case, at runtime I need to get a reference to the case class in order to convert a dataframe to a dataset.
e.g.
I have the FQN as: com.org.common....
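A sketch of one reflective route, under the assumption that manufacturing a TypeTag from the runtime mirror is acceptable here; the usage FQN is hypothetical, since the question's value is truncated:
import org.apache.spark.sql.{Encoder, Encoders}
import scala.reflect.runtime.{universe => ru}

def productEncoder(fqn: String): Encoder[Product] = {
  val mirror = ru.runtimeMirror(getClass.getClassLoader)
  val tpe = mirror.staticClass(fqn).toType
  val tag = ru.TypeTag[Product](mirror, new scala.reflect.api.TypeCreator {
    def apply[U <: scala.reflect.api.Universe with Singleton](m: scala.reflect.api.Mirror[U]): U#Type =
      tpe.asInstanceOf[U#Type]
  })
  Encoders.product[Product](tag)
}

// usage (hypothetical class name): df.as(productEncoder("com.example.MyTask"))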
-1
votes
1
answer
25
views
Filter out and log null values from a Spark dataframe
I have this dataframe:
+------+-------------------+-----------+
|brand |original_timestamp |weight     |
+------+-------------------+-----------+
|BR1   |1632899456         |4.0        |
|BR2   |...
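A minimal sketch: split on the null check so the offending rows can be logged (or written out) before being dropped:
import org.apache.spark.sql.functions.col

val nullRows = df.filter(col("weight").isNull)
nullRows.collect().foreach(r => println(s"dropping row with null weight: $r"))
val cleaned = df.filter(col("weight").isNotNull)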