All Questions
3,856 questions
1 vote · 1 answer · 18 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy.
...
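This behavior is usually caused by Spark's default window frame: when a window has an orderBy but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() becomes a running average rather than a global one. A plain-Python sketch of the difference (no Spark required):

```python
# Why avg() over an ordered window differs from a plain aggregate:
# with orderBy and no explicit frame, Spark computes a running average
# (everything up to and including the current row), not the global mean.

values = [10.0, 20.0, 30.0, 40.0]

# Plain aggregate: every row would see the same global average.
global_avg = sum(values) / len(values)  # 25.0

# Ordered window with the default frame: running average up to each row.
running_avg = [sum(values[: i + 1]) / (i + 1) for i in range(len(values))]
# [10.0, 15.0, 20.0, 25.0]

print(global_avg, running_avg)
```

In PySpark, adding `.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)` to the window spec restores the global average regardless of the orderBy column.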
0 votes · 2 answers · 29 views
How to get nested xml structure as a string from an xml document using xpath in pyspark dataframe?
I have a dataframe with a string-type column containing an XML string. Now I want to create a new column holding a nested XML structure extracted from the original column. For this, I tried using XPath in PySpark.
...
0 votes · 1 answer · 41 views
PySpark data frame not returning rows with values of more than 8 digits
I have created a sample data frame in PySpark, and the ID column contains a few values with more than 8 digits. But the query returns only the rows whose ID values have fewer than 8 digits. Can ...
0 votes · 0 answers · 40 views
Error in converting pandas dataframe into spark dataframe
I'm encountering an issue in Jupyter Notebook when working with Pandas and Spark on Kubernetes (k8s). Here's the sequence of steps I follow:
Create a Pandas DataFrame.
Create a Spark session ...
0 votes · 1 answer · 30 views
Pyspark select after join raises ambiguity but column should only be present in one of the dataframes
I'm doing a join on two dataframes that come from the same original dataframe. Each then goes through some aggregations, and the selected columns differ except for the ones used to join.
So ...
0 votes · 0 answers · 38 views
Random stratified sampling in pyspark
I have created a pandas dataframe as follows:
import pandas as pd
import numpy as np
ds = {'col1' : [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
'col2' : [12,3,4,5,4,3,2,3,4,6,7,8,3,3,...
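For stratified sampling, PySpark offers `DataFrame.sampleBy(col, fractions, seed)`. A plain-Python sketch of the same idea, taking a fixed fraction from each stratum (the `col1` counts mirror the 7/4/5/9 distribution in the sample data above):

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Take roughly `fraction` of the rows from each stratum (each distinct
    value of `key`). Plain-Python illustration; in PySpark the equivalent
    is df.sampleBy(key, fractions={...}, seed=seed)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # At least one row per stratum, otherwise a rounded fraction.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Strata sized 7, 4, 5, 9 as in the question's col1.
rows = [{"col1": c, "col2": i}
        for i, c in enumerate([1] * 7 + [2] * 4 + [3] * 5 + [4] * 9)]
picked = stratified_sample(rows, "col1", 0.5)
print(len(picked))
```

Note that `sampleBy` samples each row independently with the given probability, so exact per-stratum counts are not guaranteed there; the sketch above enforces them.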
0 votes · 0 answers · 34 views
spark sql query returns column has 0 length but non null
I have a spark dataframe for a parquet file. The column is string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...
1 vote · 1 answer · 34 views
How to apply an expression from a column to another column in pyspark dataframe?
I would like to know if it is possible to apply an expression stored in one column to another column.
for example, I have this table:
new_feed_dt regex_to_apply expr_to_apply
053021 | _(\d+) | date_format(to_date(new_feed_dt, '...
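PySpark's `F.expr()` only accepts a literal string, so a per-row expression column generally means a UDF or row-wise mapping. A plain-Python sketch of the idea, using a simplified regex without the underscore (the sample value shown has none) and assuming an MMDDYY date format, mimicking `date_format(to_date(...))`:

```python
import re
from datetime import datetime

# Hypothetical rows: each carries its own regex to extract the date digits.
rows = [{"new_feed_dt": "053021", "regex_to_apply": r"(\d+)"}]

out = []
for row in rows:
    # Apply the regex stored in this row to this row's value.
    m = re.search(row["regex_to_apply"], row["new_feed_dt"])
    # Parse MMDDYY and reformat as ISO, like date_format(to_date(...)).
    parsed = datetime.strptime(m.group(1), "%m%d%y")
    out.append(parsed.strftime("%Y-%m-%d"))

print(out)  # ['2021-05-30']
```

In Spark itself, the same per-row dispatch could live inside a `udf` that receives both the value column and the expression column.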
0 votes · 2 answers · 63 views
Pyspark Filtering Array inside a Struct column
I have a column in my Spark DataFrame that has this schema:
root
|-- my_feature_name: struct (nullable = true)
| |-- first_profiles: map (nullable = true)
| | |-- key: string
| | |--...
1 vote · 1 answer · 46 views
Pyspark efficient ways to iterate over 1M columns
I have a pyspark dataframe as below:
+--------+-------------+---------+---------+---------+
| code| updatedAt|S0x223433|S1yd33333|S4r256467|
+--------+-------------+---------+---------+---------+...
2 votes · 0 answers · 31 views
PySpark: df.show() fails on the 2nd run
The first run that creates the data frame in Spark succeeds and df.show() outputs the data frame with no errors, but the second run fails even though the code is unchanged.
first run
...
0 votes · 1 answer · 32 views
PySpark: Element-wise sum ALL DenseVectors in each cell of a dataframe
I need to element-wise sum ALL DenseVectors for each sentence. There can be multiple DenseVectors (per sentence) in each row's cell, i.e. embeddings.
I'll explain with the help of multiple lists in a ...
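The core reduction the question asks for can be shown in plain Python: summing a list of equal-length vectors (stand-ins for DenseVectors) element-wise into one vector. In PySpark this logic could run inside a UDF over the embeddings column:

```python
# Element-wise sum of a list of equal-length vectors:
# zip(*vectors) pairs up the i-th components of every vector.

def elementwise_sum(vectors):
    return [sum(components) for components in zip(*vectors)]

# One cell holding three 3-dimensional "embeddings".
embeddings = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(elementwise_sum(embeddings))  # [12.0, 15.0, 18.0]
```

With real `pyspark.ml.linalg.DenseVector` objects, the `+` operator is not element-wise, so summing their `.toArray()` forms (or using NumPy inside the UDF) is the usual route.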
8 votes · 2 answers · 88 views
Where can I find an exhaustive list of actions for spark?
I want to know exactly what I can do in spark without triggering the computation of the spark RDD/DataFrame.
It's my understanding that only actions trigger the execution of the transformations in ...
0 votes · 0 answers · 45 views
Update sql table in AWS Glue through pyspark
I am writing an ETL job in AWS Glue using PySpark.
I am reading the data from S3 and loading it into a dataframe. After doing the manipulation in the dataframe I want to update the data in a SQL table.
Since ...
2 votes · 1 answer · 49 views
Pyspark - Retrieve the value from the field dynamically specified in other field of the same data frame
I'm working with PySpark and have a challenging scenario where I need to dynamically retrieve the value of a field specified in another field of the same DataFrame. I then need to compare this ...