All Questions
Tagged with amazon-s3 apache-spark
1,447
questions
0
votes
0
answers
24
views
Unable to read a dataframe from s3
I am getting the following error:
24/07/25 21:29:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/25 21:29:53 ...
0
votes
1
answer
21
views
Reading Parquets From S3 With Apache Spark Slows Down At Later Stages
I have millions of Parquet files on S3 with the directory structure code/day=xx/hour=/*.parquets.
Each hour folder holds at most 2000 Parquet files with an average size of 100 KB.
I am not able to ...
0
votes
0
answers
48
views
Spark EOF Error (Parquet Read from S3)- Spark to Pandas conversion
I am reading close to 1 million rows stored in S3 as Parquet files into a dataframe (about 900 MB of data in a bucket). I then filter the dataframe by value and later convert it to a Pandas ...
0
votes
2
answers
31
views
Too many "Authorized committer" errors after upgrading to Pyspark==3.5.1
The problem
I recently upgraded my apps to run on Spark 3.5.1 + YARN 3.3.6 and am observing frequent failures that mention "Authorized committer". The apps run PySpark and I observe the error ...
0
votes
0
answers
19
views
Apache Spark On EKS master, failed to connect S3 using IAM role
We are running our Spark application with EKS as the master and trying to access (read/write) files in an S3 bucket using an IAM role.
We have configured a service account (SA) and attached the IAM role to that service account using ...
0
votes
0
answers
33
views
Glue job keeps running while throwing "ErrorMessage: Partition already exists." error
My PySpark script joins several tables and writes the result with the code below:
sink = glueContext.getSink(connection_type="s3", path="s3://bucket1234/",
...
0
votes
0
answers
31
views
Simultaneous overwrite and read causing file not found exception in S3
As part of my requirement, we have a raw S3 bucket where the data is initially dumped as binary files; the files are then processed separately and consolidated into a JSON file.
The processor job which ...
0
votes
1
answer
47
views
Speed up the data save to S3 buckets using spark scala
I am looking for pointers on how to speed up persisting data to S3. I am currently persisting data to S3 buckets using the example path below:
s3://...
2
votes
0
answers
23
views
Pyspark - Error in writing same dataframe to multiple directories on s3
I am trying to save the same dataframe to two different directories.
print(out_path)
s3://.../out/2012-02/
print(curr_repo_path)
s3://.../consolidate_repo_hist/
new_consolidated_repo.write.mode("...
1
vote
3
answers
98
views
Unexplained s3 slowdowns when ingesting data to hudi tables using spark/python Glue jobs
I'm using AWS Glue Spark/Python jobs to ingest data into Hudi tables in an S3 bucket. I'm hitting major S3 slowdowns, well beyond what seems reasonable, but I'm unable to pin down the root cause.
...
0
votes
0
answers
83
views
Error reading from S3 using spark-connect on AWS EMR
I'm trying to use the new spark-connect functionality in AWS EMR. I'm using AWS EMR version 7.1.0 with the following software installed on the cluster:
Spark 3.5.0
Hive 3.1.3
Hadoop 3.3.6
I start running the ...
0
votes
0
answers
66
views
How to build Apache Spark 3.0 Data Source V2 or V1?
I am trying to build a custom Spark 3.0 data source (either V1 or V2). There are many tutorials online, but most of them use TableProvider. I am interested in using RelationProvider ...
0
votes
0
answers
54
views
Spark 3.5.1 is not reading s3 object content
I am trying to read an S3 object from a local S3 storage (Ceph-based) using Spark 3.5.1.
I see it can access the bucket and list the files, even returning the correct size of the object. But the ...
1
vote
0
answers
35
views
Kafka Unable to push data onto S3 Bucket. -- Serialization Error
I am quite new to this, so please bear with me.
I am trying to push real-time streaming data produced by Kafka to my AWS S3 bucket. Kafka is able to produce data in the broker. And in the consumer ...
0
votes
0
answers
28
views
Unable to download from Minio using java sdk if the object is written by Spark
What I'm doing
Use Spark (version 3.4.1) to write a CSV to Minio
Use Minio Java SDK (version 8.5.3) in my Java Spring app to read the CSV from Minio
All the applications are deployed on K8s.
All ...