All Questions
3,856 questions
1 vote · 1 answer · 18 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy.
...
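This behavior is usually caused by Spark's default window frame: when a window has an orderBy but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() becomes a running average rather than a global one. A plain-Python sketch of the difference (no Spark required):

```python
# Why avg() over an ordered window differs from a plain aggregate:
# with orderBy and no explicit frame, Spark computes a running average
# (everything up to and including the current row), not the global mean.

values = [10.0, 20.0, 30.0, 40.0]

# Plain aggregate: every row would see the same global average.
global_avg = sum(values) / len(values)  # 25.0

# Ordered window with the default frame: running average up to each row.
running_avg = [sum(values[: i + 1]) / (i + 1) for i in range(len(values))]
# [10.0, 15.0, 20.0, 25.0]

print(global_avg, running_avg)
```

In PySpark, adding `.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)` to the window spec restores the global average regardless of the orderBy column.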
0 votes · 2 answers · 29 views
How to get nested xml structure as a string from an xml document using xpath in pyspark dataframe?
I have a dataframe with a string-type column containing an XML string. Now I want to create a new column holding a nested XML structure extracted from the original column. For this, I tried using XPath in PySpark.
...
0 votes · 1 answer · 41 views
PySpark data frame not returning rows with values of more than 8 digits
I have created a sample data frame in PySpark, and the ID column contains a few values with more than 8 digits. But the query returns only the rows whose ID values have fewer than 8 digits. Can ...
0 votes · 0 answers · 40 views
Error in converting pandas dataframe into spark dataframe
I'm encountering an issue in Jupyter Notebook when working with Pandas and Spark on Kubernetes (k8s). Here's the sequence of steps I follow:
Create a Pandas DataFrame.
Create a Spark session ...
0 votes · 1 answer · 30 views
Pyspark select after join raises ambiguity but column should only be present in one of the dataframes
I'm doing a join on two dataframes that come from the same original dataframe. Each then goes through some aggregations, and the selected columns differ except for the ones used to join.
So ...
0 votes · 0 answers · 38 views
Random stratified sampling in pyspark
I have created a pandas dataframe as follows:
import pandas as pd
import numpy as np
ds = {'col1' : [1,1,1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,4,4,4,4],
'col2' : [12,3,4,5,4,3,2,3,4,6,7,8,3,3,...
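For stratified sampling, PySpark offers `DataFrame.sampleBy(col, fractions, seed)`. A plain-Python sketch of the same idea, taking a fixed fraction from each stratum (the `col1` counts mirror the 7/4/5/9 distribution in the sample data above):

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Take roughly `fraction` of the rows from each stratum (each distinct
    value of `key`). Plain-Python illustration; in PySpark the equivalent
    is df.sampleBy(key, fractions={...}, seed=seed)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # At least one row per stratum, otherwise a rounded fraction.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Strata sized 7, 4, 5, 9 as in the question's col1.
rows = [{"col1": c, "col2": i}
        for i, c in enumerate([1] * 7 + [2] * 4 + [3] * 5 + [4] * 9)]
picked = stratified_sample(rows, "col1", 0.5)
print(len(picked))
```

Note that `sampleBy` samples each row independently with the given probability, so exact per-stratum counts are not guaranteed there; the sketch above enforces them.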
0 votes · 0 answers · 34 views
spark sql query returns column has 0 length but non null
I have a spark dataframe for a parquet file. The column is string type.
spark.sql("select col_a, length(col_a) from df where col_a is not null")
+-------------------+------------------------...
1 vote · 1 answer · 34 views
How to apply an expression from a column to another column in pyspark dataframe?
I would like to know if it is possible to apply an expression stored in one column to another column.
for example, I have this table:
new_feed_dt regex_to_apply expr_to_apply
053021 | _(\d+) | date_format(to_date(new_feed_dt, '...
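PySpark's `F.expr()` only accepts a literal string, so a per-row expression column generally means a UDF or row-wise mapping. A plain-Python sketch of the idea, using a simplified regex without the underscore (the sample value shown has none) and assuming an MMDDYY date format, mimicking `date_format(to_date(...))`:

```python
import re
from datetime import datetime

# Hypothetical rows: each carries its own regex to extract the date digits.
rows = [{"new_feed_dt": "053021", "regex_to_apply": r"(\d+)"}]

out = []
for row in rows:
    # Apply the regex stored in this row to this row's value.
    m = re.search(row["regex_to_apply"], row["new_feed_dt"])
    # Parse MMDDYY and reformat as ISO, like date_format(to_date(...)).
    parsed = datetime.strptime(m.group(1), "%m%d%y")
    out.append(parsed.strftime("%Y-%m-%d"))

print(out)  # ['2021-05-30']
```

In Spark itself, the same per-row dispatch could live inside a `udf` that receives both the value column and the expression column.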
0 votes · 2 answers · 63 views
Pyspark Filtering Array inside a Struct column
I have a column in my Spark DataFrame that has this schema:
root
|-- my_feature_name: struct (nullable = true)
| |-- first_profiles: map (nullable = true)
| | |-- key: string
| | |--...
1 vote · 1 answer · 46 views
Pyspark efficient ways to iterate over 1M columns
I have a pyspark dataframe as below:
+--------+-------------+---------+---------+---------+
| code| updatedAt|S0x223433|S1yd33333|S4r256467|
+--------+-------------+---------+---------+---------+...
2 votes · 0 answers · 31 views
PySpark: df.show() fails on the 2nd run
The first run that creates the data frame in Spark succeeds and df.show() outputs the data frame with no errors, but the second run fails even though the code is unchanged.
first run
...
0 votes · 1 answer · 32 views
PySpark: Element-wise sum ALL DenseVectors in each cell of a dataframe
I need to element-wise sum ALL DenseVectors for each sentence. There can be multiple DenseVectors (per sentence) in each row's cell, i.e. embeddings.
I'll explain with the help of multiple lists in a ...
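The core reduction the question asks for can be shown in plain Python: summing a list of equal-length vectors (stand-ins for DenseVectors) element-wise into one vector. In PySpark this logic could run inside a UDF over the embeddings column:

```python
# Element-wise sum of a list of equal-length vectors:
# zip(*vectors) pairs up the i-th components of every vector.

def elementwise_sum(vectors):
    return [sum(components) for components in zip(*vectors)]

# One cell holding three 3-dimensional "embeddings".
embeddings = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(elementwise_sum(embeddings))  # [12.0, 15.0, 18.0]
```

With real `pyspark.ml.linalg.DenseVector` objects, the `+` operator is not element-wise, so summing their `.toArray()` forms (or using NumPy inside the UDF) is the usual route.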
8 votes · 2 answers · 88 views
Where can I find an exhaustive list of actions for spark?
I want to know exactly what I can do in spark without triggering the computation of the spark RDD/DataFrame.
It's my understanding that only actions trigger the execution of the transformations in ...
0 votes · 0 answers · 45 views
Update sql table in AWS Glue through pyspark
I am writing an ETL job in AWS Glue using PySpark.
I am reading the data from S3 and loading it into a dataframe. After doing the manipulation in the dataframe I want to update the data in a SQL table.
Since ...
2 votes · 1 answer · 49 views
Pyspark - Retrieve the value from the field dynamically specified in other field of the same data frame
I'm working with PySpark and have a challenging scenario where I need to dynamically retrieve the value of a field specified in another field of the same DataFrame. I then need to compare this ...