Questions tagged [apache-spark]
Apache Spark is an open-source distributed data-processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases for Apache Spark include machine/deep learning and graph processing.
apache-spark
10 questions
0 votes · 0 answers · 11 views
Pyspark Regex Lookbehind Beginning Of String [duplicate]
My string in column "Key" is:
"+One+Two+Three-Four"
I want to extract all words following the "+" sign:
df.select(regexp_extract_all("Key", F.lit(r"(?<=...
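The excerpt cuts off before the full pattern, but the lookbehind idea can be checked in plain Python with the `re` module, which uses the same `(?<=...)` syntax as the Java regex engine behind PySpark's `regexp_extract_all`:

```python
import re

# Sample value from the question's "Key" column.
key = "+One+Two+Three-Four"

# (?<=\+) matches the position right after a literal "+",
# then \w+ captures the word that follows; "-Four" is skipped
# because it is not preceded by "+".
words = re.findall(r"(?<=\+)\w+", key)
print(words)  # ['One', 'Two', 'Three']
```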
0 votes · 0 answers · 10 views
Is it a good practice to have a Spark application running indefinitely
I am new to Apache Spark and would like your expert advice on my situation:
I am developing a Spark application in Scala.
I would like to be able to execute this application when triggered ...
0 votes · 1 answer · 14 views
Spark: cast timeStamp with date and time to DateType
Given the dateTime value:
2021-02-12T16:21:22
I tried to cast it to DateType as following:
to_date(to_timestamp(col("Date"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd")
...
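The two-step shape of that expression (parse the full timestamp, then truncate to a date) can be sketched in plain Python with `datetime`; note that the inner parse already consumes the format, so the outer date-truncation step needs no second format string:

```python
from datetime import datetime

value = "2021-02-12T16:21:22"

# Step 1: parse the full timestamp (the to_timestamp step).
ts = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")

# Step 2: truncate to the date part (the to_date step).
d = ts.date()
print(d)  # 2021-02-12
```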
0 votes · 1 answer · 31 views
Deduplication based on other dataframe columns
I have pyspark df1 as
+----+
|name|
+----+
| A|
| B|
| C|
| D|
+----+
and df2 as
+------+------+
|name_a|name_b|
+------+------+
| A| B|
| B| A|
| C| A|
| A| C|...
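The excerpt is truncated, so the exact requirement is an assumption; assuming the goal is to drop pairs in df2 that are order-reversed duplicates of each other, here is a plain-Python sketch (in Spark this is typically done by building an order-insensitive key with `least`/`greatest` and calling `dropDuplicates` on it):

```python
# Pairs from the question's df2; (A,B)/(B,A) and (C,A)/(A,C) are the
# same unordered pair, so keep only the first occurrence of each.
pairs = [("A", "B"), ("B", "A"), ("C", "A"), ("A", "C")]

seen = set()
deduped = []
for a, b in pairs:
    key = frozenset((a, b))  # order-insensitive key for the pair
    if key not in seen:
        seen.add(key)
        deduped.append((a, b))

print(deduped)  # [('A', 'B'), ('C', 'A')]
```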
1 vote · 1 answer · 17 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partition), I see different results depending on which column I use with orderBy.
...
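The likely cause: when a window has an orderBy but no explicit frame, Spark's default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() becomes a running average that depends on the sort column. A plain-Python sketch of the difference:

```python
values = [10.0, 20.0, 30.0, 40.0]

# agg(avg(col)): one average over all rows.
overall = sum(values) / len(values)

# avg() over a window with an orderBy and no explicit frame:
# Spark's default frame is "unbounded preceding to current row",
# i.e. a per-row running average that changes with the sort order.
running = [sum(values[: i + 1]) / (i + 1) for i in range(len(values))]

print(overall)  # 25.0
print(running)  # [10.0, 15.0, 20.0, 25.0]
```

Specifying `rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)` on the window makes every row see the overall average regardless of orderBy.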
0 votes · 0 answers · 18 views
Spark Shell: spark.executor.extraJavaOptions is not allowed to set Spark options
I created this Spark Shell program, but I got this error while running it:
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features ...
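The excerpt cuts off before the error itself, but the message in the title typically appears when a `-Dspark.*` property is placed inside `spark.executor.extraJavaOptions`; Spark rejects that and expects Spark options to be set directly on a SparkConf, via `--conf`, or in a properties file. A sketch of the failing versus working invocation (memory values are placeholders):

```shell
# Fails: Spark refuses -Dspark.* flags inside extraJavaOptions.
spark-shell --conf spark.executor.extraJavaOptions="-Dspark.executor.memory=4g"

# Works: set Spark options directly with --conf (or in spark-defaults.conf)
# and reserve extraJavaOptions for plain JVM flags such as GC settings.
spark-shell \
  --conf spark.executor.memory=4g \
  --conf spark.executor.extraJavaOptions="-XX:+UseG1GC"
```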
-4 votes · 0 answers · 21 views
Write filter on DataFrame in Spark Scala on multiple different columns [closed]
I have 3 columns in my data frame that I want to run my filter on.
Filter conditions:
dataframe.filter(col(ID) =!= X || col(y) =!= null || col(y) =!= col(z))
Requirement is :
To exclude data from ...
0 votes · 3 answers · 20 views
How to drop records after date based on condition
I'm looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value for each ID.
For ...
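One way to read the requirement: per ID, find the latest TEST_DT where TEST_COMPONENT is 'UNSATISFACTORY' and drop rows dated before it. A plain-Python sketch with hypothetical rows (column names and the choice to keep the cutoff row itself are assumptions; in Spark this maps to a conditional `max` over a window partitioned by ID, followed by a filter):

```python
from datetime import date

# Hypothetical rows: (id, test_dt, test_component).
rows = [
    ("id1", date(2024, 1, 1), "OK"),
    ("id1", date(2024, 1, 5), "UNSATISFACTORY"),
    ("id1", date(2024, 1, 9), "OK"),
    ("id2", date(2024, 2, 1), "OK"),
]

# Latest UNSATISFACTORY date per id.
cutoff = {}
for rid, dt, comp in rows:
    if comp == "UNSATISFACTORY":
        cutoff[rid] = max(dt, cutoff.get(rid, dt))

# Keep rows on/after the cutoff; ids with no cutoff keep everything.
kept = [r for r in rows if r[1] >= cutoff.get(r[0], date.min)]
print(kept)
```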
0 votes · 0 answers · 24 views
Unable to read a dataframe from s3
I am getting the following error:
24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/25 21:29:53 ...
0 votes · 0 answers · 22 views
Error while trying to show the results using PySpark show function
I am trying to show my results in PySpark.
I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything is set up correctly.
Some answers suggest adding this:
import findspark
findspark....