Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing. Common use cases include machine learning, deep learning, and graph processing.

0 votes
0 answers
11 views

Pyspark Regex Lookbehind Beginning Of String [duplicate]

My string in column "Key" is "+One+Two+Three-Four". I want to extract all words following the "+" sign: df.select(regexp_extract_all("Key", F.lit(r"(?<=...
shwan • 568
0 votes
0 answers
10 views

Is it a good practice to have a Spark application running indefinitely

I am new to Apache Spark and would like your expert advice on my situation: I am developing a Spark application in Scala. I would like to be able to execute this application when triggered ...
frank • 1
0 votes
1 answer
14 views

Spark: cast timeStamp with date and time to DateType

Given the datetime value 2021-02-12T16:21:22, I tried to cast it to DateType as follows: to_date(to_timestamp(col("Date"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd") ...
Jelly • 1,192
0 votes
1 answer
31 views

Deduplication based on other dataframe columns

I have pyspark df1 as

+----+
|name|
+----+
|   A|
|   B|
|   C|
|   D|
+----+

and df2 as

+------+------+
|name_a|name_b|
+------+------+
|     A|     B|
|     B|     A|
|     C|     A|
|     A|     C|...
abd • 81
1 vote
1 answer
17 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partition), I see different results depending on which column I use with orderBy. ...
anurag86 • 1,685
0 votes
0 answers
18 views

Spark Shell: spark.executor.extraJavaOptions is not allowed to set Spark options

I created this Spark Shell program, but I got this error while running it: Windows PowerShell Copyright (C) Microsoft Corporation. All rights reserved. Install the latest PowerShell for new features ...
Ronit Jain
-4 votes
0 answers
21 views

Write filter on DataFrame in Spark Scala on multiple different columns [closed]

I have 3 columns in my data frame that I want to run my filter on. Filter conditions: dataframe.filter(col(ID) =!= X || col(y) =!= null || col(y) =!= col(z)) The requirement is to exclude data from ...
MrWayne • 331
0 votes
3 answers
20 views

How to drop records after date based on condition

I'm looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value for each ID. For ...
maximodesousadias
0 votes
0 answers
24 views

Unable to read a dataframe from s3

I am getting the following error: 24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/07/25 21:29:53 ...
Minu • 7
0 votes
0 answers
22 views

Error while trying to show the results using PySpark show function

I am trying to show my results in PySpark. I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything is set up correctly. Some answers suggest adding this: import findspark findspark....
rayenpe12