Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala. It gives users a unified API and distributed data sets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

58 questions

0 votes

0 answers

11 views

Pyspark Regex Lookbehind Beginning Of String [duplicate]

My string in column "Key" is: "+One+Two+Three-Four". I want to extract all words following the "+" sign: df.select(regexp_extract_all("Key", F.lit(r"(?<=...

shwan

asked 7 hours ago
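
As a rough illustration of the lookbehind idea in this question, the sketch below uses Python's standard `re` module, not Spark's `regexp_extract_all` (which evaluates Java regular expressions). The pattern and the helper name `words_after_plus` are assumptions made for illustration, since the expression in the excerpt is truncated.

```python
import re


def words_after_plus(key: str) -> list:
    """Return every word that directly follows a "+" sign.

    Hypothetical helper for illustration only; it is not taken from the question.
    """
    # (?<=\+) is a lookbehind: it requires the previous character to be "+"
    # without making that "+" part of the match itself.
    return re.findall(r"(?<=\+)\w+", key)


if __name__ == "__main__":
    print(words_after_plus("+One+Two+Three-Four"))
```

Because the lookbehind only inspects the preceding character, a word that follows "-" is skipped, and a leading "+" at the very start of the string needs no special handling.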

0 votes

0 answers

10 views

Is it a good practice to have a Spark application running indefinitely

I am new to Apache Spark, and would like to have your expert advice on my situation: I am developing a Spark Application in Scala. I would like to be able to execute this application when triggered ...

frank

asked 8 hours ago

0 votes

1 answer

14 views

Spark: cast timeStamp with date and time to DateType

Given the dateTime value 2021-02-12T16:21:22, I tried to cast it to DateType as follows: to_date(to_timestamp(col("Date"), "yyyy-MM-dd'T'HH:mm:ss"), "yyyy-MM-dd") ...

Jelly

1,192

asked 15 hours ago
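
The two-step idea in this question (parse the full date-time string, then keep only the date part) can be sketched in plain Python. This is not Spark code: it uses the standard `datetime` module, and the helper name `to_date_only` is hypothetical. Note that Python's `strptime` directives (`%Y-%m-%dT%H:%M:%S`) differ from the Spark pattern letters quoted in the excerpt.

```python
from datetime import date, datetime


def to_date_only(value: str) -> date:
    """Parse an ISO-like date-time string and drop the time of day.

    Hypothetical helper for illustration; it only mirrors the idea in the question.
    """
    # Step 1: parse the whole string, including the literal "T" separator.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    # Step 2: truncate the timestamp to a calendar date.
    return parsed.date()


if __name__ == "__main__":
    print(to_date_only("2021-02-12T16:21:22"))
```

The point of the sketch is that the format string belongs to the parsing step only; once a real timestamp exists, reducing it to a date needs no second format.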

0 votes

1 answer

31 views

Deduplication based on other dataframe columns

I have pyspark df1 as

+----+
|name|
+----+
|   A|
|   B|
|   C|
|   D|
+----+

and df2 as

+------+------+
|name_a|name_b|
+------+------+
|     A|     B|
|     B|     A|
|     C|     A|
|     A|     C|
...

abd

asked 17 hours ago

1 vote

1 answer

17 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partition), I see different results depending on which column I use with orderBy. ...

anurag86

1,685

asked 17 hours ago
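
One likely explanation for the behavior described in this question is the default window frame: when a Spark window specification has an ordering but no explicit frame, the frame runs from the start of the partition up to the current row (including rows that tie on the ordering key), so `avg()` becomes a running average. The pure-Python sketch below simulates that frame. The sample rows and the function names `global_avg` and `running_avg_by` are assumptions for illustration, not data or code from the question.

```python
from statistics import mean


def global_avg(rows, value_key):
    """Average over all rows, repeated for each row (no ordering, whole-partition frame)."""
    overall = mean(r[value_key] for r in rows)
    return [overall for _ in rows]


def running_avg_by(rows, order_key, value_key):
    """Simulate a frame from the first row up to the current row, ordered by order_key.

    Rows with an equal order_key are peers and are included in each other's frame.
    """
    return [
        mean(other[value_key] for other in rows if other[order_key] <= row[order_key])
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical sample data: "a" ascends while "b" descends.
    rows = [
        {"a": 1, "b": 30, "v": 10.0},
        {"a": 2, "b": 20, "v": 20.0},
        {"a": 3, "b": 10, "v": 30.0},
    ]
    print(global_avg(rows, "v"))
    print(running_avg_by(rows, "a", "v"))
    print(running_avg_by(rows, "b", "v"))
```

Under this assumption, the per-row result depends on the ordering column, whereas a frame that spans the whole partition gives every row the same value as the plain aggregate.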

0 votes

0 answers

18 views

Spark Shell: spark.executor.extraJavaOptions is not allowed to set Spark options

I created this Spark Shell program, but I got this error while running it: Windows PowerShell Copyright (C) Microsoft Corporation. All rights reserved. Install the latest PowerShell for new features ...

Ronit Jain

asked 19 hours ago

-4 votes

0 answers

21 views

Write filter on DataFrame in Spark Scala on multiple different columns [closed]

I have 3 columns in my data frame that I want to run my filter on. Filter conditions: dataframe.filter(col(ID) =!= X || col(y) =!= null || col(y) =!= col(z)) The requirement is to exclude data from ...

MrWayne

asked 20 hours ago

0 votes

3 answers

20 views

How to drop records after date based on condition

I'm looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value for each ID. For ...

maximodesousadias

asked 21 hours ago

0 votes

0 answers

24 views

Unable to read a dataframe from s3

I am getting the following error: 24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/07/25 21:29:53 ...

Minu

asked 22 hours ago

0 votes

0 answers

22 views

Error while trying to show the results using PySpark show function

I am trying to show my results in PySpark. I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything is set up correctly. Some answers suggest adding this: import findspark findspark....

rayenpe12

asked yesterday

0 votes

0 answers

24 views

Can we create multiple Spark executors within a single driver node on a Databricks cluster?

I have a power user compute with a single driver node and I'm trying to parallelize forecasting across multiple series by aggregating the data and doing a groupBy and then an apply on the groupBy. The ...

Manav Karthikeyan

asked yesterday

0 votes

1 answer

21 views

How to filter dataframe by column from different dataframe?

I want to filter a dataframe by a column, using strings from a different dataframe. val booksReadBestAuthors = userReviewsWithBooksDetails.filter(col("authors").isin(userReviewsAuthorsManyRead:_*)) ...

Joanna Kois

asked yesterday

0 votes

1 answer

14 views

Spark: cast to DateType value with only month and year

Given a date value with only month and year, 03.2020, I tried to cast it to DateType as follows: to_timestamp(col("Date"), "MM.yyyy").cast(DateType) But this returns something that I ...

Jelly

1,192

asked yesterday

0 votes

1 answer

21 views

Reading Parquets From S3 With Apache Spark Slows Down At Later Stages

I have millions of Parquet files on S3 with the directory structure code/day=xx/hour=/*.parquets. Each hour folder holds at most 2000 Parquet files with an average size of 100 KB. I am not able to ...

chaos

asked yesterday

0 votes

0 answers

12 views

Spark doesn't recognize not null fields while reading from the source

I'm trying to read data from a database and then save it to a Parquet file using Kotlin and Apache Spark. The JDBC driver I use: com.mysql.cj.jdbc.Driver val customerDf = spark .read() .jdbc( ...

anthis

asked yesterday
