All Questions

0 votes
0 answers
24 views

Unable to read a dataframe from s3

I am getting the following error: 24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 24/07/25 21:29:53 ...
Minu's user avatar
  • 7
0 votes
1 answer
21 views

Reading Parquets From S3 With Apache Spark Slows Down At Later Stages

I have millions of Parquet files on S3 with the directory structure code/day=xx/hour=/*.parquets. Each hour folder holds at most 2000 Parquet files with an average size of 100 KB. I am not able to ...
chaos's user avatar
  • 1
0 votes
0 answers
48 views

Spark EOF Error (Parquet Read from S3)- Spark to Pandas conversion

I am reading close to 1 million rows stored in S3 as Parquet files into a dataframe (900 MB of data in a bucket). I filter the dataframe based on values and later convert it to a Pandas ...
Don Woodward's user avatar
0 votes
2 answers
31 views

Too many "Authorized committer" errors after upgrading to Pyspark==3.5.1

The problem: I recently upgraded my apps to run on Spark 3.5.1 + YARN 3.3.6 and am observing frequent failures saying "Authorized committer". The apps run PySpark and I observe the error ...
akki's user avatar
  • 2,202
0 votes
0 answers
19 views

Apache Spark On EKS master, failed to connect S3 using IAM role

We are running our Spark application on EKS as the master and trying to access (read/write) files in an S3 bucket using an IAM role. We have configured an SA and attached the IAM role to that service account using ...
Rajashekhar Meesala's user avatar
0 votes
0 answers
33 views

Glue job keeps running while throwing "ErrorMessage: Partition already exists." error

My PySpark script joins several tables and writes the result with the code below: sink = glueContext.getSink(connection_type="s3", path="s3://bucket1234/", ...
user3048641's user avatar
0 votes
0 answers
31 views

Simultaneous overwrite and read causing file not found exception in S3

As part of my requirement, we have a raw S3 bucket where the data is initially dumped as binary files; the files are then processed separately and consolidated into a JSON file. The processor job which ...
Ashit_Kumar's user avatar
0 votes
1 answer
47 views

Speed up the data save to S3 buckets using spark scala

I am looking for pointers on how to speed up persisting data to S3. I am currently persisting data to S3 buckets based on the example path below: s3://...
Ashit_Kumar's user avatar
2 votes
0 answers
23 views

Pyspark - Error in writing same dataframe to multiple directories on s3

I am trying to save the same dataframe to two different directories. print(out_path) s3://.../out/2012-02/ print(curr_repo_path) s3://.../consolidate_repo_hist/ new_consolidated_repo.write.mode("...
Gaurav Singhal's user avatar
1 vote
3 answers
98 views

Unexplained s3 slowdowns when ingesting data to hudi tables using spark/python Glue jobs

I'm using AWS Glue Spark/Python jobs to ingest data into Hudi tables in an S3 bucket. I'm hitting major S3 slowdown issues, well beyond what seems reasonable, but I'm unable to pin down the root cause. ...
Aamit's user avatar
  • 211
0 votes
0 answers
83 views

Error reading from S3 using spark-connect on AWS EMR

I'm trying to use the new spark-connect functionality in AWS EMR. I'm using AWS EMR version 7.1.0 with this software installed in the cluster: Spark 3.5.0, Hive 3.1.3, Hadoop 3.3.6. I start running the ...
Shadowtrooper's user avatar
0 votes
0 answers
66 views

How to build Apache Spark 3.0 Data Source V2 or V1?

I am trying to build a Spark 3.0 custom data source (either V1 or V2). There are many tutorials online, but many of them use TableProvider. I am interested in using RelationProvider ...
Moois's user avatar
  • 103
0 votes
0 answers
54 views

Spark 3.5.1 is not reading s3 object content

I am trying to read an S3 object from local S3 storage (Ceph-based) using Spark 3.5.1. I see it can access the bucket and list the files, even returning the correct size of the object. But the ...
Amin mosayyebzadeh's user avatar
1 vote
0 answers
35 views

Kafka Unable to push data onto S3 Bucket. -- Serialization Error

I am quite new to this, so please bear with me. I am trying to push real-time streaming data produced by Kafka to my AWS S3 bucket. Kafka is able to produce data in the broker. And in the consumer ...
Subhamoy Paul's user avatar
0 votes
0 answers
28 views

Unable to download from Minio using java sdk if the object is written by Spark

What I'm doing: (1) use Spark (version 3.4.1) to write a CSV to Minio; (2) use the Minio Java SDK (version 8.5.3) in my Java Spring app to read the CSV from Minio. All the applications are deployed on K8s. All ...
Ronn M's user avatar
  • 1
