
All Questions

1 vote
1 answer
18 views

avg() over a whole dataframe causing different output

I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partitioning), I see different results depending on which column I use with orderBy. ...
anurag86 • 1,685
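The difference described above is expected: when a window has an ORDER BY, Spark's default frame becomes RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() turns into a running average. A plain-Python sketch (not Spark itself) of the two behaviors:

```python
# Sketch (plain Python, not Spark): why avg() over an ordered window
# differs from a whole-dataframe aggregate. With an ORDER BY, Spark's
# default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
# so each row sees a *running* average rather than the global one.
values = [10.0, 20.0, 30.0, 40.0]

# avg() as a whole-dataframe aggregate: one global value.
overall_avg = sum(values) / len(values)

# avg() over a window with ORDER BY: a running average per row.
running_avgs = [sum(values[: i + 1]) / (i + 1) for i in range(len(values))]

print(overall_avg)   # 25.0
print(running_avgs)  # [10.0, 15.0, 20.0, 25.0]
```

Only the last row of the running average equals the global aggregate, which is why the result appears to depend on the orderBy column.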
0 votes
2 answers
29 views

How to get nested xml structure as a string from an xml document using xpath in pyspark dataframe?

I have a dataframe with a string-typed column containing an XML string. Now I want to create a new column holding a nested XML structure extracted from the original column. For this, I tried using XPath in PySpark. ...
Krushna • 13
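Spark SQL's xpath() functions return the string values of matched nodes, not a serialized subtree, which is the usual sticking point here. A hedged standard-library sketch of the underlying idea (xml.etree, not PySpark; the tag names are made up):

```python
# Sketch with the standard library (xml.etree), not PySpark's xpath():
# locate a nested element inside an XML string and re-serialize the
# whole subtree back to a string. Tag names here are illustrative.
import xml.etree.ElementTree as ET

xml_str = "<order><items><item><sku>A1</sku></item></items></order>"
root = ET.fromstring(xml_str)

# find() accepts a limited XPath; tostring() serializes the subtree.
nested = root.find("./items")
nested_str = ET.tostring(nested, encoding="unicode")

print(nested_str)  # <items><item><sku>A1</sku></item></items>
```

In a real PySpark job, the same parse-and-serialize logic would typically live inside a UDF applied to the string column.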
0 votes
1 answer
41 views

Pyspark Data frame not returning rows having value more than 8 digits

I have created a sample data frame in PySpark, and the ID column contains a few values with more than 8 digits. But it returns only those rows whose ID values have fewer than 8 digits. Can ...
Deveshwari Devi
0 votes
0 answers
40 views

Error in converting pandas dataframe into spark dataframe

I'm encountering an issue in Jupyter Notebook when working with Pandas and Spark on Kubernetes (k8s). Here's the sequence of steps I follow: Create a Pandas DataFrame. Create a Spark session ...
harshwardhan Singh Dodiya
0 votes
1 answer
30 views

Pyspark select after join raises ambiguity but column should only be present in one of the dataframes

I'm joining two dataframes that are both derived from the same original dataframe. Each then undergoes some aggregations, and the selected columns differ except for the ones used in the join. So ...
Miguel Rodrigues
0 votes
0 answers
38 views

Random stratified sampling in pyspark

I have created a pandas dataframe as follows: import pandas as pd import numpy as np ds = {'col1' : [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4], 'col2' : [12,3,4,5,4,3,2,3,4,6,7,8,3,3,...
Giampaolo Levorato
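PySpark's built-in tool for this is DataFrame.sampleBy(col, fractions, seed), which samples each stratum independently. A hypothetical plain-Python sketch of the same idea (group by the stratum column, then draw a fraction from each group):

```python
# Hypothetical plain-Python sketch of stratified sampling: draw a fixed
# fraction from each stratum (group) independently. In PySpark the
# analogous call is DataFrame.sampleBy(col, fractions, seed).
import random
from collections import defaultdict

rows = [(1, 12), (1, 3), (1, 4), (2, 3), (2, 4),
        (3, 8), (3, 3), (4, 6), (4, 7), (4, 8)]

def stratified_sample(rows, key_index, fraction, seed=42):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    sample = []
    for key in sorted(groups):
        group = groups[key]
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

picked = stratified_sample(rows, key_index=0, fraction=0.5)
print(picked)  # roughly half of each col1 group, every group represented
```

Unlike a global random sample, every stratum is guaranteed to appear in the output.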
0 votes
0 answers
34 views

spark sql query returns column has 0 length but non null

I have a spark dataframe for a parquet file. The column is string type. spark.sql("select col_a, length(col_a) from df where col_a is not null") +-------------------+------------------------...
Dozel • 159
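The usual explanation for the behavior above is that in SQL (Spark included) an empty string is not NULL, so `col_a is not null` happily keeps rows where length(col_a) is 0. A small Python sketch of the distinction, with None standing in for SQL NULL:

```python
# Sketch of the distinction the query surfaces: an empty string is NOT
# null, so a `col_a is not null` filter keeps rows whose length is 0.
# None below plays the role of SQL NULL.
cells = ["abc", "", None]

non_null = [c for c in cells if c is not None]
lengths = [len(c) for c in non_null]

print(non_null)  # ['abc', ''] -- the empty string survives the filter
print(lengths)   # [3, 0]      -- and reports length 0
```

Zero-width or non-printing characters in the source Parquet file can produce a similar effect, where the value looks empty but has a nonzero length.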
1 vote
1 answer
34 views

How to apply an expression from a column to another column in pyspark dataframe?

I would like to know if it is possible to apply an expression stored in one column to another column. For example, I have this table: new_feed_dt regex_to_apply expr_to_apply 053021 | _(\d+) | date_format(to_date(new_feed_dt, '...
Tomás Jullier
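In PySpark, F.expr() only accepts a literal string, so a per-row expression usually means a UDF or collecting the distinct expressions first. A hypothetical plain-Python sketch of the per-row idea, using a date-format string carried in its own column (column names invented for illustration):

```python
# Hypothetical plain-Python sketch of "apply the expression stored in a
# column to another column": each row carries its own date format, and
# we evaluate it against that row's value. In PySpark, F.expr() takes
# only a literal string, so per-row expressions typically need a UDF.
from datetime import datetime

rows = [
    {"new_feed_dt": "053021", "fmt_to_apply": "%m%d%y"},
    {"new_feed_dt": "2021-05-30", "fmt_to_apply": "%Y-%m-%d"},
]

parsed = [
    datetime.strptime(r["new_feed_dt"], r["fmt_to_apply"]).date().isoformat()
    for r in rows
]
print(parsed)  # ['2021-05-30', '2021-05-30']
```

Both rows normalize to the same date even though each used a different stored format.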
0 votes
2 answers
63 views

Pyspark Filtering Array inside a Struct column

I have a column in my Spark DataFrame that has this schema: root |-- my_feature_name: struct (nullable = true) | |-- first_profiles: map (nullable = true) | | |-- key: string | | |--...
MathLal • 392
1 vote
1 answer
46 views

Pyspark efficient ways to iterate over 1M columns

I have a pyspark dataframe as below: +--------+-------------+---------+---------+---------+ | code| updatedAt|S0x223433|S1yd33333|S4r256467| +--------+-------------+---------+---------+---------+...
datawiz879
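The usual fix for "iterate over very many columns" is to reshape wide to long in a single pass instead of looping column by column; in Spark the analogue is the stack() SQL function or arrays_zip plus explode. A plain-Python sketch of the reshape (column names taken from the excerpt above):

```python
# Sketch of a wide-to-long "melt": turn one wide row into
# (code, sensor, value) rows in a single pass, rather than iterating
# over each of the ~1M columns separately. In Spark, stack() or
# arrays_zip/explode does the same reshape.
wide_row = {"code": "abc", "S0x223433": 1, "S1yd33333": 0, "S4r256467": 1}
id_cols = {"code"}

long_rows = [
    (wide_row["code"], col, val)
    for col, val in wide_row.items()
    if col not in id_cols
]
print(long_rows)
```

Once the data is long, per-sensor logic becomes an ordinary group-by instead of a million-iteration column loop.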
2 votes
0 answers
31 views

PySpark: df.show() fails on the second run

The first run that creates the data frame in Spark is successful, and df.show() outputs the data frame with no errors, but the second run fails even though the code is unchanged from the first run ...
Window Man
0 votes
1 answer
32 views

PySpark: Element-wise sum ALL DenseVectors in each cell of a dataframe

I need to element-wise sum ALL DenseVectors for each sentence. There can be multiple DenseVectors (per sentence) in each row's cell, i.e. embeddings. I'll explain with the help of multiple lists in a ...
Shruti • 13
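Element-wise summing several vectors in one cell reduces to zipping their positions and summing each position; with MLlib DenseVectors the same idea applies to their underlying arrays (or via numpy.sum over axis 0). A plain-Python sketch:

```python
# Plain-Python sketch of element-wise summing all dense vectors in one
# cell: zip(*...) aligns positions across the vectors, sum() collapses
# each position. With MLlib DenseVectors, apply the same idea to their
# underlying arrays.
cell = [[1.0, 2.0, 3.0], [4.0, 5.0, 7.0]]  # two "embeddings" in one cell

summed = [sum(components) for components in zip(*cell)]
print(summed)  # [5.0, 7.0, 10.0]
```

In a DataFrame, this logic would typically sit inside a UDF mapped over the column of vector lists.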
8 votes
2 answers
88 views

Where can I find an exhaustive list of actions for spark?

I want to know exactly what I can do in spark without triggering the computation of the spark RDD/DataFrame. It's my understanding that only actions trigger the execution of the transformations in ...
HappilyCoding
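A rough analogy (not Spark itself, but the same laziness model): Spark transformations are like Python generator pipelines, which do no work when built, while actions (collect, count, show, ...) force evaluation the way list() or sum() forces a generator:

```python
# Analogy for Spark's lazy evaluation: building a generator pipeline
# ("transformation") does no work; consuming it ("action") does.
calls = []

def traced(x):
    calls.append(x)  # record when the work actually happens
    return x * 2

pipeline = (traced(x) for x in range(3))  # "transformation": nothing runs
assert calls == []                        # no work has been done yet

result = list(pipeline)                   # "action": forces the computation
print(result)  # [0, 2, 4]
print(calls)   # [0, 1, 2]
```

This is why only the call that materializes results (not the chain of withColumn/select/filter before it) shows up as a job in the Spark UI.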
0 votes
0 answers
45 views

Update sql table in AWS Glue through pyspark

I am writing an ETL job in AWS Glue using PySpark. I am reading the data from S3 and loading it into a dataframe. After manipulating the dataframe, I want to update the data in a SQL table. Since ...
Akash agrawal
2 votes
1 answer
49 views

Pyspark - Retrieve the value from the field dynamically specified in other field of the same data frame

I'm working with PySpark and have a challenging scenario where I need to dynamically retrieve the value of a field specified in another field of the same DataFrame. I then need to compare this ...
Piotr Wojcik
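Treating each row as a mapping and indexing it with the pointer column's value captures the idea; in PySpark, column references cannot be computed per row, so this is usually expressed by chaining F.when(F.col('ptr') == name, F.col(name)) over the known columns. A hypothetical plain-Python sketch (field names invented):

```python
# Hypothetical sketch of "retrieve the value of the field named in
# another field": index the row with the pointer column's value.
# In PySpark this is usually a chain of F.when(...) branches, one per
# candidate column, since column names cannot be computed per row.
rows = [
    {"a": 10, "b": 20, "ptr": "a"},
    {"a": 10, "b": 20, "ptr": "b"},
]

resolved = [row[row["ptr"]] for row in rows]
print(resolved)  # [10, 20]
```

The resolved value can then be compared against whatever the downstream logic requires, row by row.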
