Questions tagged [rdd]
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
4,071 questions
1 vote · 1 answer · 17 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partitioning), I see different results depending on which column I use with orderBy.
...
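The usual explanation (an aside, not from the question itself): when a Spark window has an orderBy but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() becomes a running average whose per-row value depends on the sort column. A plain-Python sketch of the difference on toy data (no ties in the sort key, so the rows-based simulation matches the RANGE frame):

```python
# Sketch: with an orderBy and no explicit frame, Spark's window frame
# defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
# so avg().over(window) is a running average, not a global one.

def overall_avg(values):
    # What dataframe.agg(avg(col)) computes: one global mean.
    return sum(values) / len(values)

def running_avg(rows, order_key):
    # What avg().over(Window.orderBy(...)) computes under the default frame.
    ordered = sorted(rows, key=order_key)
    out = {}
    total = 0.0
    for i, (key, value) in enumerate(ordered, start=1):
        total += value
        out[key] = total / i
    return out

rows = [("a", 10.0), ("b", 20.0), ("c", 30.0)]
print(overall_avg([v for _, v in rows]))   # 20.0 for every row
print(running_avg(rows, lambda r: r[1]))   # {'a': 10.0, 'b': 15.0, 'c': 20.0}
```

Ordering by a different column reorders the running sums, which is why the per-row averages change; adding an explicit frame (rowsBetween unboundedPreceding, unboundedFollowing) restores the global mean.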
0 votes · 0 answers · 30 views
How does Spark determine the number of partitions to create in an RDD during its initial data load?
I was using the PySpark interactive shell with the following:
Python version: 3.12.4
Spark: 3.5.1
I tried to read a CSV file of size 910 MB and queried the number of partitions in it.
The commands were:
...
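For file-based sources, the initial partition count is driven mainly by spark.sql.files.maxPartitionBytes (128 MB by default), together with spark.sql.files.openCostInBytes and the default parallelism. A simplified plain-Python sketch of the dominant term (ignoring open cost and multi-file packing):

```python
import math

def estimated_partitions(file_size_bytes, max_partition_bytes=128 * 1024 * 1024):
    # Simplified: real Spark also factors in openCostInBytes and the
    # default parallelism when computing the target split size.
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

# A 910 MB file at the default 128 MB split size lands at 8 partitions.
print(estimated_partitions(910 * 1024 * 1024))  # 8
```

This is only the first-order estimate; splittability of the format (e.g. gzip is not splittable) and the number of input files also affect the real count.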
2 votes · 1 answer · 44 views
Casting RDD to a different type (from float64 to double)
I have code like the below, which uses PySpark.
test_truth_value = RDD.
test_predictor_rdd = RDD.
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
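A common pattern for this (an assumption on my part, since the excerpt is truncated): MLlib predictions often come back as NumPy float64, and wrapping each zipped element in float() inside the map yields plain Python doubles. A plain-Python sketch of the zip-cast-score pattern with illustrative variable names:

```python
# Sketch (plain Python, not PySpark): zip truth values with predictions,
# cast each element with float() so float64-like values become plain
# Python doubles, then compute a metric such as mean squared error.
truth = [1.0, 0.0, 3.5]
preds = [0.9, 0.1, 3.4]

values_and_pred = [(float(t), float(p)) for t, p in zip(truth, preds)]
mse = sum((t - p) ** 2 for t, p in values_and_pred) / len(values_and_pred)
print(round(mse, 4))  # 0.01
```

In PySpark the same cast sits inside the map, e.g. `.map(lambda x: (float(x[0]), float(x[1])))`.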
1 vote · 1 answer · 29 views
Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors
I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using PySpark's HashingTF and IDF implementations. I tried to ...
0 votes · 1 answer · 90 views
Why is my PySpark row_number column messed up when applying a schema?
I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
0 votes · 0 answers · 31 views
PySpark with RDD - How to calculate and compare averages?
I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
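The classic RDD pattern for per-key averages (a sketch of the usual approach, not the asker's code) is to aggregate (sum, count) pairs per key and divide at the end, rather than averaging averages. A plain-Python simulation of the reduceByKey logic:

```python
# Emulates:
#   rdd.mapValues(lambda v: (v, 1))
#      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
#      .mapValues(lambda sc: sc[0] / sc[1])
records = [("u1", 5.0), ("u2", 3.0), ("u1", 7.0), ("u2", 1.0), ("u2", 2.0)]

sums = {}
for user, value in records:
    s, c = sums.get(user, (0.0, 0))
    sums[user] = (s + value, c + 1)  # running (sum, count) per key

averages = {user: s / c for user, (s, c) in sums.items()}
print(averages)  # {'u1': 6.0, 'u2': 2.0}
```

With the per-user averages in hand, comparing against a global average or taking the top-k candidates is a simple follow-up transformation.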
0 votes · 1 answer · 46 views
Order PySpark Dataframe by applying a function/lambda
I have a PySpark DataFrame which needs ordering on a column ("Reference").
The values in the column typically look like:
["AA.1234.56", "AA.1101.88", "AA.904.33"...
-1 votes · 1 answer · 37 views
Problem with pyspark mapping - Index out of range after split
When trying to map our 6-column PySpark RDD into a 4-tuple, we get a list index out of range error for any list element besides 0, which returns the normal result.
The dataset is structured like this:
X,Y,FID,...
0 votes · 1 answer · 54 views
Save text files as binary format using saveAsPickleFile with pyspark
I have around 613 text files stored in Azure Data Lake Gen2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and base64-decode them, as they are base64 encoded. But ...
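The per-record decoding step itself is straightforward with the standard library's base64 module; a plain-Python sketch of what would sit inside the RDD map (the sample strings are illustrative):

```python
import base64

# Stand-ins for lines read from the base64-encoded text files.
encoded_records = [
    base64.b64encode(b"hello").decode("ascii"),
    base64.b64encode(b"world").decode("ascii"),
]

# Emulates rdd.map(lambda line: base64.b64decode(line).decode("utf-8"))
decoded = [base64.b64decode(line).decode("utf-8") for line in encoded_records]
print(decoded)  # ['hello', 'world']
```

In PySpark the decoded RDD can then go to saveAsPickleFile (or any other writer); the decode itself is independent of the storage layer.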
0 votes · 0 answers · 16 views
Why is the result of rdd.getNumPartitions() not equal to the number of the table's partitions stored in HDFS?
In my task it is necessary to get the number of partitions of a Spark DataFrame. For this purpose I tried to convert the Spark DataFrame to an RDD and call the getNumPartitions method. When I read a table from ...
1 vote · 1 answer · 35 views
Reading file using Spark RDD vs DF
I have a 2 MB file. When I read it using
df = spark.read.option("inferSchema", "true").csv("hdfs:///data/ml-100k/u.data", sep="\t")
df.rdd.getNumPartitions() # ...
0 votes · 0 answers · 13 views
Splitting Spark dataset / rdd into X smaller datasets, like randomSplit but w/o random
I thought that Spark's randomSplit with equal weights would split a dataset into equal parts without duplicates or record losses. That seems to be a wrong assumption. https://medium.com/udemy-engineering/pyspark-...
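One deterministic alternative (a suggestion, not from the question): zipWithIndex the RDD and route each record by index modulo k, which by construction produces no duplicates and no losses. A plain-Python sketch of the routing logic:

```python
def deterministic_split(items, k):
    # Emulates, for each i in range(k):
    #   rdd.zipWithIndex().filter(lambda x: x[1] % k == i).keys()
    parts = [[] for _ in range(k)]
    for index, item in enumerate(items):
        parts[index % k].append(item)
    return parts

data = list(range(10))
print(deterministic_split(data, 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Part sizes differ by at most one record, and rerunning the split on the same input always yields the same assignment, unlike randomSplit.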
0 votes · 1 answer · 33 views
Linear RDD Plot only shows two data points
I have attempted to run the below code:
data(house)
house_rdd = rdd_data(x=x, y=y, data=house, cutpoint=0)
summary(house_rdd)
plot(house_rdd)
When I plot it, I get this, which makes sense.
...
1 vote · 1 answer · 45 views
I can't save an RDD object to text files using PySpark
I am trying to create a Spark program to read the airport data from the airport.text file, find all the airports which are located in the United States, and output the airport's name and city's name to an ...
0 votes · 0 answers · 20 views
RDD lookup operation performing weirdly
I'm currently exploring PySpark and attempting to implement Dijkstra's algorithm using it. However, my query doesn't pertain to the algorithm itself; it's regarding the unexpected behavior of the ...