Questions tagged [apache-spark]
Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.
apache-spark
82,619 questions
0 votes · 0 answers · 9 views
Spark: cast timeStamp with date and time to DateType
Given the dateTime value:
2021-02-12T16:21:22
I tried to cast it to DateType as follows:
to_date(to_timestamp(col("Date"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd")
...
0 votes · 0 answers · 15 views
Deduplication based on other dataframe columns
I have pyspark df1 as
+----+
|name|
+----+
| A|
| B|
| C|
| D|
+----+
and df2 as
+------+------+
|name_a|name_b|
+------+------+
| A| B|
| B| A|
| C| A|
| A| C|...
1 vote · 1 answer · 15 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without using any partition), I see different results depending on which column I use with orderBy.
...
0 votes · 0 answers · 17 views
Spark Shell: spark.executor.extraJavaOptions is not allowed to set Spark options
I created this Spark Shell program, but I got this error while running it:
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features ...
-3 votes · 0 answers · 19 views
Write filter on DataFrame in Spark Scala on multiple different columns
I have 3 columns in my data frame that I want to run my filter on.
Filter conditions:
dataframe.filter(col("ID") =!= X || col("y").isNotNull || col("y") =!= col("z"))
The requirement is:
To exclude data from ...
0 votes · 3 answers · 19 views
How to drop records after date based on condition
I'm looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value for each ID.
For ...
0 votes · 0 answers · 16 views
Unable to read a dataframe from s3
I am getting the following error:
24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/25 21:29:53 ...
0 votes · 0 answers · 19 views
Error while trying to show the results using PySpark show function
I am trying to show my results in PySpark.
I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything is set up correctly.
Some answers suggest adding this:
import findspark
findspark....
0 votes · 0 answers · 24 views
Can we create multiple Spark executors within a single driver node on a Databricks cluster?
I have a power user compute with a single driver node and I'm trying to parallelize forecasting across multiple series by aggregating the data and doing a groupBy and then an apply on the groupBy.
The ...
0 votes · 1 answer · 17 views
How to filter dataframe by column from different dataframe?
I want to filter a dataframe by a column using Strings from a different dataframe.
val booksReadBestAuthors = userReviewsWithBooksDetails.filter(col("authors").isin(userReviewsAuthorsManyRead:_*))
...
0 votes · 1 answer · 14 views
Spark: cast to DateType value with only month and year
Given a date value with only month and year:
03.2020
I tried to cast it to DateType as follows:
to_timestamp(col("Date"), "MM.yyyy").cast(DateType)
But this returns something that I ...
0 votes · 1 answer · 19 views
Reading Parquets From S3 With Apache Spark Slows Down At Later Stages
I have millions of parquet files on S3 with a directory structure like code/day=xx/hour=/*.parquets.
At most, under an hour folder, we have 2000 parquet files with an average size of 100 KB.
I am not able to ...
0 votes · 0 answers · 12 views
Spark doesn't recognize not-null fields while reading from the source
I'm trying to read data from database and then save it to parquet file using Kotlin and Apache Spark.
JDBC Driver I use: com.mysql.cj.jdbc.Driver
val customerDf = spark
.read()
.jdbc(
...
0 votes · 0 answers · 7 views
spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"
I use Spark on EMR with versions emr-6.13.0 and Spark 3.4.1.
I try to run a simple Spark streaming job that reads from Kafka and writes to a memory table using foreachBatch, and I get the failure "Error while ...
0 votes · 0 answers · 13 views
Could not find py4J jar at
I am trying to run my PMML pre-trained model in Python 3.9, but no matter what I do I get this error: Could not find py4j jar at.
None of the solutions provided on the blogs are working. And, even if ...