
Questions tagged [rdd]

Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.

1 vote
1 answer
17 views

avg() over a whole DataFrame causing different output

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy. ...
anurag86 • 1,685
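
A likely cause, sketched below with toy data (the column name Col is taken from the question): adding orderBy to a window changes the default frame from the whole partition to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() silently becomes a running average that depends on the ordering column.

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (3,)], ["Col"])

    # No orderBy: the frame is the whole partition, every row sees the global average.
    w_all = Window.partitionBy()
    df.select(F.avg("Col").over(w_all).alias("avg_all")).show()

    # With orderBy: the default frame is unbounded-preceding..current-row,
    # so avg() turns into a running average.
    w_ordered = Window.partitionBy().orderBy("Col")
    df.select(F.avg("Col").over(w_ordered).alias("running_avg")).show()

    # Pin the frame explicitly to recover the global average even with orderBy:
    w_fixed = w_ordered.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
    df.select(F.avg("Col").over(w_fixed).alias("avg_all_again")).show()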
0 votes
0 answers
30 views

How does Spark determine the number of partitions to create in an RDD during its initial data load?

I was using the PySpark interactive shell with Python 3.12.4 and Spark 3.5.1. I tried to read a CSV file of size 910 MB and tried to query the number of partitions in it. The commands were: ...
Shubham Chandra
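
For file sources, Spark sizes read splits from spark.sql.files.maxPartitionBytes (128 MB by default), so a ~910 MB CSV typically lands in roughly ceil(910 / 128) = 8 partitions, modulo openCostInBytes and the default parallelism. A minimal way to inspect this (the file path below is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Default split size for file-based sources, e.g. "134217728b" (128 MB).
    print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

    df = spark.read.csv("/data/example_910mb.csv", header=True)  # hypothetical path
    print(df.rdd.getNumPartitions())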
2 votes
1 answer
44 views

Casting RDD to a different type (from float64 to double)

I have code like the below, which uses PySpark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
Inkyu Kim • 145
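
A sketch of the usual fix, with stand-in RDDs (the variable names follow the question's excerpt): coercing each value with float() converts NumPy float64 entries to plain Python floats, which is generally enough for APIs that expect DoubleType on the JVM side.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    test_truth_value = sc.parallelize([1.0, 0.0, 1.0])
    predictions = sc.parallelize([0.9, 0.1, 0.8])  # stands in for lasso_model.predict(...)

    # float() turns NumPy float64 into Python float before the pairs cross to the JVM.
    valuesAndPred = test_truth_value.zip(predictions).map(
        lambda x: (float(x[0]), float(x[1]))
    )
    print(valuesAndPred.collect())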
1 vote
1 answer
29 views

Saving and loading an RDD (PySpark) to a pickle file changes the order of SparseVectors

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using pyspark's HashingTF and IDF implementations. I tried to ...
Caden • 67
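
A sketch under one assumption: an RDD written with saveAsPickleFile carries no meaningful row order across save/load, so pin an explicit index before saving and sort on it after loading. The vector data below is illustrative, not the question's dataset.

    from pyspark.sql import SparkSession
    from pyspark.mllib.linalg import SparseVector

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    vectors = sc.parallelize([SparseVector(4, {0: 1.0}), SparseVector(4, {2: 3.0})])

    # Attach a stable index, save, then restore the original order on load.
    vectors.zipWithIndex().map(lambda x: (x[1], x[0])).saveAsPickleFile("/tmp/tfidf_idx")
    restored = sc.pickleFile("/tmp/tfidf_idx").sortByKey().values()
    print(restored.collect())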
0 votes
1 answer
90 views

Why is my PySpark row_number column messed up when applying a schema?

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
stats_guy • 697
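
The usual cause, sketched below with illustrative column names: a row_number column is recomputed lazily on every action, so unless the plan is pinned (cache or checkpoint) the IDs can differ between the point where they are created and a later join.

    from pyspark.sql import SparkSession, functions as F, Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 3), ("b", 1), ("c", 2)], ["name", "value"])

    w = Window.orderBy("value")  # global window; fine for modest data sizes
    with_id = df.withColumn("row_id", F.row_number().over(w)).cache()
    with_id.count()  # materialize so row_id stays stable across later actions

    # Subsequent joins against with_id now see a consistent row_id.
    with_id.show()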
0 votes
0 answers
31 views

PySpark with RDD - How to calculate and compare averages?

I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
Yoel Ha
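
The question is truncated, but the core RDD pattern it needs (computing per-user averages and comparing them to a global average) can be sketched with invented (user, activity) records:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    events = sc.parallelize([("u1", 10.0), ("u1", 20.0), ("u2", 5.0), ("u2", 1.0)])

    # (sum, count) per user, then divide: the standard RDD average pattern.
    sums = events.aggregateByKey(
        (0.0, 0),
        lambda acc, v: (acc[0] + v, acc[1] + 1),
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
    )
    user_avgs = sums.mapValues(lambda s: s[0] / s[1])

    global_avg = events.values().mean()
    below = user_avgs.filter(lambda kv: kv[1] < global_avg)  # churn candidates
    print(below.collect())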
0 votes
1 answer
46 views

Order PySpark Dataframe by applying a function/lambda

I have a PySpark DataFrame which needs ordering on a column ("Reference"). The values in the column typically look like: ["AA.1234.56", "AA.1101.88", "AA.904.33"...
pymat • 1,172
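
Rather than sorting by a Python lambda, a sortable column can be derived. The "AA.1234.56" format follows the question; treating the middle segment as the numeric sort key is an assumption.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("AA.1234.56",), ("AA.1101.88",), ("AA.904.33",)], ["Reference"]
    )

    # Split on ".", take the middle segment, cast to int, and order on it.
    ordered = df.orderBy(F.split("Reference", r"\.").getItem(1).cast("int"))
    ordered.show()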
-1 votes
1 answer
37 views

Problem with PySpark mapping - index out of range after split

When trying to map our 6-column PySpark RDD into a 4-tuple, we get a list-index-out-of-range error for any list element besides 0, which returns the normal result. The dataset is structured like this: X,Y,FID,...
Toxicone 7
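
A sketch of the usual defensive fix, assuming a comma-delimited file whose header or blank/short lines break indexing after split. The column layout X,Y,FID,... follows the question; the path is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("/data/points.csv")  # hypothetical path
    header = lines.first()

    rows = (
        lines.filter(lambda l: l != header)            # drop the header row
             .map(lambda l: l.split(","))
             .filter(lambda f: len(f) >= 6)            # skip blank/short lines
             .map(lambda f: (f[0], f[1], f[2], f[3]))  # the 4-tuple
    )
    print(rows.take(3))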
0 votes
1 answer
54 views

Save text files in binary format using saveAsPickleFile with PySpark

I have around 613 text files stored in Azure Data Lake Gen2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and base64-decode them all, as they are base64 encoded. But ...
Rushank Patil
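
A minimal sketch of the read-decode-pickle pattern; the input path below stands in for the truncated ADLS path in the question, and authentication setup is omitted.

    import base64
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # wholeTextFiles yields (path, content) pairs, one per small file.
    files = sc.wholeTextFiles("/rawdata/")  # stand-in path
    decoded = files.mapValues(lambda content: base64.b64decode(content))
    decoded.saveAsPickleFile("/rawdata_decoded_pickle")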
0 votes
0 answers
16 views

Why is the result of rdd.getNumPartitions() not equal to the number of the table's partitions stored in HDFS?

In my task it is necessary to get the number of partitions of a Spark DataFrame. For this purpose I tried to convert the Spark DataFrame to an RDD and call the getNumPartitions method. When I read a table from ...
noonmare
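
The two numbers measure different things: rdd.getNumPartitions() counts Spark's read splits (driven by file sizes and spark.sql.files.maxPartitionBytes), not the table's partition directories in HDFS. A short sketch with a hypothetical table name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.table("db.partitioned_table")  # hypothetical table
    print(df.rdd.getNumPartitions())  # read splits, not directories

    # The storage-level partitions come from the metastore instead:
    spark.sql("SHOW PARTITIONS db.partitioned_table").show()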
1 vote
1 answer
35 views

Reading a file using Spark RDD vs DF

I have a 2 MB file. When I read it using df = spark.read.option("inferSchema", "true").csv("hdfs:///data/ml-100k/u.data", sep="\t") df.rdd.getNumPartitions() # ...
Youssef Alaa Etman
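
A sketch illustrating the discrepancy for a small file (the path follows the question): sc.textFile uses defaultMinPartitions, which is at least 2 on most setups, while the DataFrame reader packs a 2 MB file into a single split.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    df = spark.read.option("inferSchema", "true").csv(
        "hdfs:///data/ml-100k/u.data", sep="\t"
    )
    print(df.rdd.getNumPartitions())  # typically 1 for a 2 MB file

    rdd = sc.textFile("hdfs:///data/ml-100k/u.data")
    print(rdd.getNumPartitions())     # typically 2 (defaultMinPartitions)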
0 votes
0 answers
13 views

Splitting a Spark dataset / RDD into X smaller datasets, like randomSplit but without randomness

I thought that Spark's randomSplit with equal weights would split a dataset into equal parts without duplicates or record losses. It seems that's a wrong assumption. https://medium.com/udemy-engineering/pyspark-...
Capacytron • 3,709
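
A hedged sketch of one deterministic alternative to randomSplit: index every row, then bucket by index modulo X. No duplicates and no losses, at the cost of an extra pass. X = 3 and the toy data are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10))
    X = 3

    indexed = rdd.zipWithIndex().cache()  # (value, stable index) pairs
    parts = [
        indexed.filter(lambda kv, i=i: kv[1] % X == i).keys()
        for i in range(X)
    ]
    for i, p in enumerate(parts):
        print(i, p.collect())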
0 votes
1 answer
33 views

Linear RDD Plot only shows two data points

I have attempted to run the below code: data(house) house_rdd = rdd_data(x=x, y=y, data=house, cutpoint=0) summary(house_rdd) plot(house_rdd) When I plot it, I get this, which makes sense. ...
Andy_H • 3
1 vote
1 answer
45 views

I can't save an RDD object to text files using PySpark

I am trying to create a Spark program to read the airport data from the airport.text file, find all the airports located in the United States, and output the airport's name and city's name to an ...
Uche Kalu
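
A minimal sketch of the filter-and-save pattern. The airport.text layout assumed here (comma-separated, name in field 1, city in field 2, country in field 3) is a guess based on the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("airport.text")
    us_airports = (
        lines.map(lambda l: l.split(","))
             .filter(lambda f: len(f) > 3 and f[3].strip('"') == "United States")
             .map(lambda f: "{}, {}".format(f[1].strip('"'), f[2].strip('"')))
    )
    # saveAsTextFile writes a *directory* of part files and fails if it already exists.
    us_airports.saveAsTextFile("airports_in_usa")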
0 votes
0 answers
20 views

RDD lookup operation performing weirdly

I'm currently exploring PySpark and attempting to implement Dijkstra's algorithm using it. However, my query doesn't pertain to the algorithm itself; it's regarding the unexpected behavior of the ...
solSeeker
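
Two lookup() behaviors that often surprise, sketched with invented distance data: it works only on key-value RDDs and returns a list of all values for the key, and each call launches a job.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    distances = sc.parallelize([("A", 0), ("B", 4), ("B", 2), ("C", 7)])

    print(distances.lookup("B"))  # [4, 2]: every matching value, not just one
    print(distances.lookup("Z"))  # []: missing keys return an empty list

    # lookup() triggers a job per call; for repeated point queries inside an
    # algorithm like Dijkstra's, collect once into a driver-side dict instead:
    dist_map = dict(distances.reduceByKey(min).collect())
    print(dist_map.get("B"))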
