Newest 'pyspark' Questions

0 votes

1 answer

20 views

Deduplication based on other dataframe columns

I have pyspark df1 as +----+ |name| +----+ | A| | B| | C| | D| +----+ and df2 as +------+------+ |name_a|name_b| +------+------+ | A| B| | B| A| | C| A| | A| C|...

abd

81

asked 7 hours ago

1 vote

1 answer

16 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partition), I see different results depending on which column I use with orderBy. ...

anurag86

1,685

asked 7 hours ago
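
A possible reading of the question above, offered as an assumption because the excerpt is truncated: in Spark SQL, a window that has an ordering but no explicit frame defaults to a frame running from the start of the partition up to the current row, so avg() becomes a running average whose per-row values depend on the ordering column. The sketch below is plain Python, not PySpark; it only mimics that behaviour on a tiny hypothetical dataset with distinct ordering values. The names `whole_column_avg`, `running_avg`, `rows`, `id` and `value` are all invented for illustration.

```python
from statistics import mean


def whole_column_avg(values):
    """Average over every row: one number repeated for each row, order-independent."""
    overall = mean(values)
    return [overall for _ in values]


def running_avg(rows, order_key):
    """Average from the first row up to the current row, after sorting by order_key.

    This mimics a cumulative frame (start of partition .. current row) and
    assumes the ordering values are distinct.
    """
    ordered = sorted(rows, key=order_key)
    out = []
    total = 0.0
    for count, row in enumerate(ordered, start=1):
        total += row["value"]
        out.append((row["id"], total / count))
    return out


# Hypothetical data, not taken from the question.
rows = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": 20.0},
    {"id": 3, "value": 60.0},
]

whole = whole_column_avg([r["value"] for r in rows])
by_id = running_avg(rows, order_key=lambda r: r["id"])
by_value_desc = running_avg(rows, order_key=lambda r: -r["value"])

print(whole)          # same value on every row
print(by_id)          # running average when ordered by id
print(by_value_desc)  # a different running average when ordered by value
```

If this reading applies, an explicit full frame in PySpark, for example `Window.orderBy(...).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)`, or dropping the orderBy altogether, should give the same average on every row.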

0 votes

0 answers

12 views

java.io.UncheckedIOException: io.netty.channel.StacklessClosedChannelException while writing to adx

I'm trying to write some data to an Azure Data Explorer table, but I'm getting the exception below. The ADX cluster is open to all networks, and I can see that the connectivity between Databricks and ADX is ...

SONIA_29

56

asked 7 hours ago

0 votes

2 answers

22 views

How to get nested xml structure as a string from an xml document using xpath in pyspark dataframe?

I have a dataframe with a string-typed column that contains an XML string. Now I want to create a new column with a nested XML structure from the original column. For this, I tried using XPath in PySpark. ...

Krushna

13

asked 8 hours ago

0 votes

0 answers

17 views

Spark Shell: spark.executor.extraJavaOptions is not allowed to set Spark options

I created this Spark Shell program, but I got this error while running it: Windows PowerShell Copyright (C) Microsoft Corporation. All rights reserved. Install the latest PowerShell for new features ...

Ronit Jain

9

asked 8 hours ago

-3 votes

0 answers

19 views

Write filter on DataFrame in Spark Scala on multiple different columns

I have 3 columns in my data frame that I want to run my filter on. Filter conditions: dataframe.filter(col(ID) =!= X) || col(y) =!= null || col (y) =!= col(z)) The requirement is to exclude data from ...

MrWayne

333

asked 9 hours ago

0 votes

3 answers

19 views

How to drop records after date based on condition

I'm looking for an elegant way to drop all records in a DataFrame that occur before the latest occurrence of 'TEST_COMPONENT' being 'UNSATISFACTORY', based on their 'TEST_DT' value for each ID. For ...

maximodesousadias

35

asked 10 hours ago

0 votes

0 answers

20 views

Unable to read a dataframe from s3

I am getting the following error: 24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/07/25 21:29:53 ...

Minu

7

asked 11 hours ago

0 votes

0 answers

19 views

Error while trying to show the results using PySpark show function

I am trying to show my results in PySpark. I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything is set up correctly. Some answers suggest adding this: import findspark findspark....

rayenpe12

56

asked 14 hours ago

0 votes

0 answers

24 views

Can we create multiple Spark executors within a single driver node on a Databricks cluster?

I have a power user compute with a single driver node and I'm trying to parallelize forecasting across multiple series by aggregating the data and doing a groupBy and then an apply on the groupBy. The ...

Manav Karthikeyan

43

asked 18 hours ago

0 votes

1 answer

25 views

Pandas UDF to derive new column

In Spark/Databricks, I have a pandas dataframe with a string column. I need to perform multiple actions on this column (data cleansing type stuff), and produce a new column from the result. Here's my ...

Andrew

8,613

asked 20 hours ago

0 votes

0 answers

17 views

Read multiple files in parallel into separate dataframes in PySpark

I am trying to read large txt files into dataframes. Each file is 10-15 GB in size, and the IO is taking a long time. I want to read multiple files in parallel and get them in separate dataframes. I tried ...

Tejas

401

asked 21 hours ago

0 votes

0 answers

13 views

Could not find py4J jar at

I am trying to run my pre-trained PMML model in Python 3.9, but no matter what I do I get this error: Could Not find py4j jar at. None of the solutions provided on the blogs are working. And, even if ...

Tidiane Sall

1

asked yesterday

1 vote

1 answer

35 views

Replacing carriage return and line feeds in a text file using Azure Synapse

I am currently working on a project where I have to get data from an Excel file in SharePoint, this file has multiple tabs, and I am using Azure Synapse Analytics to call a SharePoint API to get each ...

NatalieEV

23

asked yesterday

0 votes

0 answers

12 views

Spark OverwritePartitions always triggering shuffle of data

I have a Spark job that reads from an Oracle data source and writes to an Iceberg table. Multiple queries execute in separate threads, and each query hits just one partition (in Iceberg). ...

LuisR

1

asked yesterday
