Questions tagged [rdd]
Resilient Distributed Datasets (RDDs) are a distributed memory abstraction that allows programmers to perform in-memory computations on large clusters while retaining the fault tolerance of data flow models like MapReduce.
4,071 questions
1 vote · 1 answer · 17 views
avg() over a whole dataframe causing different output
I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (without any partitioning), I see different results depending on which column I use with orderBy.
...
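The usual explanation (an aside, not from the question itself): when a Spark window has an orderBy but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so avg() becomes a running average whose per-row value depends on the sort column. A plain-Python sketch of the difference on toy data (no ties in the sort key, so the rows-based simulation matches the RANGE frame):

```python
# Sketch: with an orderBy and no explicit frame, Spark's window frame
# defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
# so avg().over(window) is a running average, not a global one.

def overall_avg(values):
    # What dataframe.agg(avg(col)) computes: one global mean.
    return sum(values) / len(values)

def running_avg(rows, order_key):
    # What avg().over(Window.orderBy(...)) computes under the default frame.
    ordered = sorted(rows, key=order_key)
    out = {}
    total = 0.0
    for i, (key, value) in enumerate(ordered, start=1):
        total += value
        out[key] = total / i
    return out

rows = [("a", 10.0), ("b", 20.0), ("c", 30.0)]
print(overall_avg([v for _, v in rows]))   # 20.0 for every row
print(running_avg(rows, lambda r: r[1]))   # {'a': 10.0, 'b': 15.0, 'c': 20.0}
```

Ordering by a different column reorders the running sums, which is why the per-row averages change; adding an explicit frame (rowsBetween unboundedPreceding, unboundedFollowing) restores the global mean.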
0 votes · 0 answers · 30 views
How does Spark determine the number of partitions to create in an RDD during its initial data load?
I was using the PySpark interactive shell with the following:
Python version: 3.12.4
Spark: 3.5.1
I tried to read a CSV file of size 910 MB and queried the number of partitions in it.
The commands were:
...
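For file-based sources, the initial partition count is driven mainly by spark.sql.files.maxPartitionBytes (128 MB by default), together with spark.sql.files.openCostInBytes and the default parallelism. A simplified plain-Python sketch of the dominant term (ignoring open cost and multi-file packing):

```python
import math

def estimated_partitions(file_size_bytes, max_partition_bytes=128 * 1024 * 1024):
    # Simplified: real Spark also factors in openCostInBytes and the
    # default parallelism when computing the target split size.
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

# A 910 MB file at the default 128 MB split size lands at 8 partitions.
print(estimated_partitions(910 * 1024 * 1024))  # 8
```

This is only the first-order estimate; splittability of the format (e.g. gzip is not splittable) and the number of input files also affect the real count.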
2 votes · 1 answer · 44 views
Casting RDD to a different type (from float64 to double)
I have code like the below, which uses PySpark.
test_truth_value = RDD.
test_predictor_rdd = RDD.
valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
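A common pattern for this (an assumption on my part, since the excerpt is truncated): MLlib predictions often come back as NumPy float64, and wrapping each zipped element in float() inside the map yields plain Python doubles. A plain-Python sketch of the zip-cast-score pattern with illustrative variable names:

```python
# Sketch (plain Python, not PySpark): zip truth values with predictions,
# cast each element with float() so float64-like values become plain
# Python doubles, then compute a metric such as mean squared error.
truth = [1.0, 0.0, 3.5]
preds = [0.9, 0.1, 3.4]

values_and_pred = [(float(t), float(p)) for t, p in zip(truth, preds)]
mse = sum((t - p) ** 2 for t, p in values_and_pred) / len(values_and_pred)
print(round(mse, 4))  # 0.01
```

In PySpark the same cast sits inside the map, e.g. `.map(lambda x: (float(x[0]), float(x[1])))`.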
1 vote · 1 answer · 29 views
Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors
I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using PySpark's HashingTF and IDF implementations. I tried to ...
0 votes · 1 answer · 90 views
Why is my PySpark row_number column messed up when applying a schema?
I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
0 votes · 0 answers · 31 views
PySpark with RDD - How to calculate and compare averages?
I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
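The classic RDD pattern for per-key averages (a sketch of the usual approach, not the asker's code) is to aggregate (sum, count) pairs per key and divide at the end, rather than averaging averages. A plain-Python simulation of the reduceByKey logic:

```python
# Emulates:
#   rdd.mapValues(lambda v: (v, 1))
#      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
#      .mapValues(lambda sc: sc[0] / sc[1])
records = [("u1", 5.0), ("u2", 3.0), ("u1", 7.0), ("u2", 1.0), ("u2", 2.0)]

sums = {}
for user, value in records:
    s, c = sums.get(user, (0.0, 0))
    sums[user] = (s + value, c + 1)  # running (sum, count) per key

averages = {user: s / c for user, (s, c) in sums.items()}
print(averages)  # {'u1': 6.0, 'u2': 2.0}
```

With the per-user averages in hand, comparing against a global average or taking the top-k candidates is a simple follow-up transformation.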
0 votes · 1 answer · 46 views
Order PySpark Dataframe by applying a function/lambda
I have a PySpark DataFrame which needs ordering on a column ("Reference").
The values in the column typically look like:
["AA.1234.56", "AA.1101.88", "AA.904.33"...
-1 votes · 1 answer · 37 views
Problem with pyspark mapping - Index out of range after split
When trying to map our 6-column PySpark RDD into a 4-tuple, we get a list index out of range error for any list element besides 0, which returns the normal result.
The dataset is structured like this:
X,Y,FID,...
0 votes · 1 answer · 54 views
Save text files as binary format using saveAsPickleFile with pyspark
I have around 613 text files stored in Azure Data Lake Gen2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and base64-decode them, as they are base64 encoded. But ...
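The per-record decoding step itself is straightforward with the standard library's base64 module; a plain-Python sketch of what would sit inside the RDD map (the sample strings are illustrative):

```python
import base64

# Stand-ins for lines read from the base64-encoded text files.
encoded_records = [
    base64.b64encode(b"hello").decode("ascii"),
    base64.b64encode(b"world").decode("ascii"),
]

# Emulates rdd.map(lambda line: base64.b64decode(line).decode("utf-8"))
decoded = [base64.b64decode(line).decode("utf-8") for line in encoded_records]
print(decoded)  # ['hello', 'world']
```

In PySpark the decoded RDD can then go to saveAsPickleFile (or any other writer); the decode itself is independent of the storage layer.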
0 votes · 0 answers · 16 views
Why is the result of rdd.getNumPartitions() not equal to the number of the table's partitions stored in HDFS?
In my task it is necessary to get the number of partitions of a Spark DataFrame. For this purpose I tried to convert the Spark DataFrame to an RDD and call the getNumPartitions method. When I read a table from ...
1 vote · 1 answer · 35 views
Reading file using Spark RDD vs DF
I have a 2 MB file. When I read it using
df = spark.read.option("inferSchema", "true").csv("hdfs:///data/ml-100k/u.data", sep="\t")
df.rdd.getNumPartitions() # ...
0 votes · 0 answers · 13 views
Splitting Spark dataset / rdd into X smaller datasets, like randomSplit but w/o random
I thought that Spark's randomSplit with equal weights would split a dataset into equal parts without duplicates or record losses. That seems to be a wrong assumption. https://medium.com/udemy-engineering/pyspark-...
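One deterministic alternative (a suggestion, not from the question): zipWithIndex the RDD and route each record by index modulo k, which by construction produces no duplicates and no losses. A plain-Python sketch of the routing logic:

```python
def deterministic_split(items, k):
    # Emulates, for each i in range(k):
    #   rdd.zipWithIndex().filter(lambda x: x[1] % k == i).keys()
    parts = [[] for _ in range(k)]
    for index, item in enumerate(items):
        parts[index % k].append(item)
    return parts

data = list(range(10))
print(deterministic_split(data, 3))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Part sizes differ by at most one record, and rerunning the split on the same input always yields the same assignment, unlike randomSplit.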
0 votes · 1 answer · 33 views
Linear RDD Plot only shows two data points
I have attempted to run the below code:
data(house)
house_rdd = rdd_data(x=x, y=y, data=house, cutpoint=0)
summary(house_rdd)
plot(house_rdd)
When I plot it, I get this, which makes sense.
...
1 vote · 1 answer · 45 views
I can't save an RDD object to text files using PySpark
I am trying to create a Spark program to read the airport data from the airport.text file, find all the airports which are located in the United States, and output the airport's name and city's name to an ...
0 votes · 0 answers · 20 views
RDD lookup operation performing weirdly
I'm currently exploring PySpark and attempting to implement Dijkstra's algorithm using it. However, my query doesn't pertain to the algorithm itself; it's regarding the unexpected behavior of the ...