Newest 'amazon-s3+apache-spark' Questions

0 votes

0 answers

24 views

Unable to read a dataframe from s3

I am getting the following error: 24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/07/25 21:29:53 ...

Minu

7

asked 22 hours ago

0 votes

1 answer

21 views

Reading Parquets From S3 With Apache Spark Slows Down At Later Stages

I have millions of Parquet files on S3 with the directory structure code/day=xx/hour=/*.parquets. Each hour folder holds at most 2000 Parquet files with an average size of 100 KB. I am not able to ...

chaos

1

asked yesterday

0 votes

0 answers

48 views

Spark EOF Error (Parquet Read from S3)- Spark to Pandas conversion

I am reading close to 1 million rows stored in S3 as parquet files into a dataframe (900 MB size data in a bucket). Filtering the dataframe based on values and then later converting to a Pandas ...

Don Woodward

132

asked 2 days ago

0 votes

2 answers

31 views

Too many "Authorized committer" errors after upgrading to Pyspark==3.5.1

The problem: I have recently upgraded my apps to run on Spark 3.5.1 + YARN 3.3.6 and am observing frequent failures saying "Authorized committer". The apps run PySpark and I observe the error ...

akki

2,202

asked 2 days ago

0 votes

0 answers

19 views

Apache Spark On EKS master, failed to connect S3 using IAM role

We are running our Spark application with EKS as the master and trying to access (read/write) files in an S3 bucket using an IAM role. We have configured a service account (SA) and attached the IAM role to that service account using ...

Rajashekhar Meesala

329

asked Jul 15 at 7:41

0 votes

0 answers

33 views

Glue job keeps running while throwing "ErrorMessage: Partition already exists." error

My PySpark script joins several tables and writes the result with the code below: sink = glueContext.getSink(connection_type="s3", path="s3://bucket1234/", ...

user3048641

105

asked Jul 4 at 13:10

0 votes

0 answers

31 views

Simultaneous overwrite and read causing file not found exception in S3

As part of my requirement, we have a raw S3 bucket where the data is initially dumped as binary files, and the files are then separately processed and consolidated into a JSON file. The processor job which ...

Ashit_Kumar

591

asked Jun 14 at 7:27

0 votes

1 answer

47 views

Speed up the data save to S3 buckets using spark scala

I am looking for pointers on how to speed up persisting data to S3. I am currently persisting data to S3 buckets based on the example path below: s3://...

Ashit_Kumar

591

asked Jun 12 at 6:19

2 votes

0 answers

23 views

Pyspark - Error in writing same dataframe to multiple directories on s3

I am trying to save the same dataframe to two different directories. print(out_path) s3://.../out/2012-02/ print(curr_repo_path) s3://.../consolidate_repo_hist/ new_consolidated_repo.write.mode("...

Gaurav Singhal

1,072

asked Jun 10 at 19:47

1 vote

3 answers

98 views

Unexplained s3 slowdowns when ingesting data to hudi tables using spark/python Glue jobs

I'm using AWS Glue Spark/Python jobs to ingest data into Hudi tables in an S3 bucket. I'm hitting major S3 slowdown issues, far beyond what seems reasonable, but I am unable to pin down the root cause. ...

Aamit

211

asked Jun 2 at 0:34

0 votes

0 answers

83 views

Error reading from S3 using spark-connect on AWS EMR

I'm trying to use the new spark-connect functionality in AWS EMR. I'm using AWS EMR version 7.1.0 with this software installed on the cluster: Spark 3.5.0, Hive 3.1.3, Hadoop 3.3.6. I start running the ...

Shadowtrooper

1,444

asked May 28 at 11:15

0 votes

0 answers

66 views

How to build Apache Spark 3.0 Data Source V2 or V1?

I am trying to build a Spark 3.0 custom data source (either V1 or V2). There are a large number of tutorials online, but many of them use TableProvider. I am interested in using RelationProvider ...

Moois

103

asked May 23 at 3:03

0 votes

0 answers

54 views

Spark 3.5.1 is not reading s3 object content

I am trying to read an s3 object from a local S3 storage (Ceph based) using Spark 3.5.1. I see it can access the bucket and list the files, even returning the correct size of the object. But the ...

Amin mosayyebzadeh

23

asked May 20 at 2:41

1 vote

0 answers

35 views

Kafka Unable to push data onto S3 Bucket. -- Serialization Error

I am quite new to this, so please bear with me. I am trying to push real-time streaming data produced by Kafka to my AWS S3 bucket. Kafka is able to produce data in the broker. And in the consumer ...

Subhamoy Paul

23

asked May 17 at 6:55

0 votes

0 answers

28 views

Unable to download from Minio using java sdk if the object is written by Spark

What I'm doing: (1) use Spark (version 3.4.1) to write a CSV to Minio; (2) use the Minio Java SDK (version 8.5.3) in my Java Spring app to read the CSV from Minio. All the applications are deployed on K8s. All ...

Ronn M

1

asked May 14 at 17:38
