Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

10 questions

0 votes

0 answers

11 views

Pyspark Regex Lookbehind Beginning Of String [duplicate]

My string in column "Key" is "+One+Two+Three-Four". I want to extract all words following the "+" sign: df.select(regexp_extract_all("Key", F.lit(r"(?<=...

shwan

asked 7 hours ago

0 votes

0 answers

10 views

Is it a good practice to have a Spark application running indefinitely

I am new to Apache Spark and would like your expert advice on my situation: I am developing a Spark application in Scala. I would like to be able to execute this application when triggered ...

frank

asked 8 hours ago

0 votes

1 answer

14 views

Spark: cast timeStamp with date and time to DateType

Given the datetime value 2021-02-12T16:21:22, I tried to cast it to DateType as follows: to_date(to_timestamp(col("Date"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd") ...

Jelly

1,192

asked 16 hours ago

0 votes

1 answer

31 views

Deduplication based on other dataframe columns

I have PySpark df1 as

+----+
|name|
+----+
|   A|
|   B|
|   C|
|   D|
+----+

and df2 as

+------+------+
|name_a|name_b|
+------+------+
|     A|     B|
|     B|     A|
|     C|     A|
|     A|     C|...

abd

asked 17 hours ago

1 vote

1 answer

17 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy. ...

anurag86

1,685

asked 17 hours ago

0 votes

0 answers

18 views

Spark Shell: spark.executor.extraJavaOptions is not allowed to set Spark options

I created this Spark Shell program, but I got this error while running it: Windows PowerShell Copyright (C) Microsoft Corporation. All rights reserved. Install the latest PowerShell for new features ...

Ronit Jain

asked 19 hours ago

-4 votes

0 answers

21 views

Write filter on DataFrame in Spark Scala on multiple different columns [closed]

I have 3 columns in my data frame that I want to run my filter on. Filter conditions: dataframe.filter(col(ID) =!= X) || col(y) =!= null || col (y) =!= col(z)) The requirement is to exclude data from ...

MrWayne

asked 20 hours ago

0 votes

3 answers

20 views

How to drop records after date based on condition

I'm looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value for each ID. For ...

maximodesousadias

asked 21 hours ago

0 votes

0 answers

24 views

Unable to read a dataframe from s3

I am getting the following error: 24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/07/25 21:29:53 ...

Minu

asked 22 hours ago

0 votes

0 answers

22 views

Error while trying to show the results using PySpark show function

I am trying to show my results in PySpark. I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything is set up correctly. Some answers suggest adding this: import findspark findspark....

rayenpe12

asked yesterday
