Questions tagged [amazon-emr]
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
amazon-emr
3,397
questions
0
votes
0
answers
7
views
Spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"
I use Spark on EMR with versions emr-6.13.0 and Spark 3.4.1.
I am trying to run a simple Spark streaming job that reads from Kafka and writes to an in-memory table using foreachBatch, and it fails with "Error while ...
0
votes
0
answers
24
views
EMR Serverless SparkSession builder error: ClassNotFoundException issues
I am trying to create a job in EMR Studio to run in an EMR Serverless application. It's a relatively basic script to use PySpark to read some Athena tables, do some joins, create an output dataframe ...
0
votes
0
answers
24
views
Does Spark shuffle/exchange convert compressed data to uncompressed form?
I have an input dataset of 450 GB in S3, in compressed Parquet format. However, during the exchange it shows 10 TB. Is there any way to tune it? Two large tables are being joined and no other ...
0
votes
1
answer
49
views
Spark Scala vs PySpark: is the DAG different?
I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different and the DAG that gets created is also different. Here I ...
0
votes
0
answers
15
views
Apache Oozie JA008 error - job state changed from SUCCEEDED to FAILED
I'm running Oozie HA 5.2.1 on EMR and I have an issue with this temporary directory. I have a workflow which has start node -> action node -> end node. The job starts running -> runs for 10-15 ...
0
votes
0
answers
9
views
AWS EMR - reading multiple "zip" files from S3 bucket returns "Your key is too long"
In my daily job I use EMR to process large amounts of data. This data is stored in CSV files in an S3 bucket. The idea I had was to try to process zipped CSV files instead of plain CSV.
In the Hive app I use ...
0
votes
0
answers
15
views
Airflow error while creating EMR cluster via DAG
I am looking to create an EMR cluster via an Airflow DAG using EmrCreateJobFlowOperator, with a role called dev-emr-ec2-profile-role for jobFlow. This role is used to provision an EMR cluster via Terraform ...
3
votes
0
answers
30
views
Spark Repartition/shuffle optimization
I am trying to repartition before applying any transformation logic. This takes a lot of time. The code and a snapshot of the UI are below. Can any optimization be applied here?
Cluster: AWS EMR, 200 Task ...
1
vote
0
answers
39
views
Spark on EMR: Shuffle Read Fetch Wait Time is 4 hrs
One of my Spark jobs, which failed (see emr-spark-shuffle-fetchfailedexception-with-65tb-data-with-aqe-enabled), has a high Shuffle Read Fetch Wait Time. Is there any way it can be improved?
Spark-submit
spark-...
0
votes
0
answers
27
views
Troubleshooting Kafka Integration with Spark Streaming on Amazon EMR Serverless
Objective:
To set up a streaming job on Amazon EMR Serverless to process weather data from Amazon MSK (Managed Streaming for Apache Kafka) and write the word count results to an S3 bucket.
Steps Taken:...
0
votes
1
answer
35
views
EMR Spark job creating max 1000 partitions/tasks when AQE is enabled
I always see 1000 tasks/partitions being created for a Spark job with AQE enabled. Whether I execute the job for monthly data (4 times the weekly data) or for one week of data, the shuffle partitions are the same. Which is nothing ...
0
votes
2
answers
56
views
What does retry in the Spark UI mean?
I have executed Spark in two different instances:
spark.sql.adaptive.coalescePartitions.enabled=false
spark.sql.adaptive.coalescePartitions.enabled=true
In the first instance, the stage graphs have ...
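The two settings quoted in this excerpt could be passed to two separate runs roughly as sketched below. This is only an illustration under assumptions: the helper `submit_args` and the application file name `job.py` are hypothetical and do not come from the question; the only things taken from the excerpt are the configuration key and its two values.

```python
from typing import List

# Configuration key quoted in the question excerpt.
AQE_COALESCE_KEY = "spark.sql.adaptive.coalescePartitions.enabled"


def submit_args(coalesce: bool, app: str = "job.py") -> List[str]:
    """Build a spark-submit command line that differs only in the coalesce flag.

    `app` is a placeholder script name (an assumption, not from the question).
    """
    value = "true" if coalesce else "false"
    return ["spark-submit", "--conf", f"{AQE_COALESCE_KEY}={value}", app]


# First instance: coalescing disabled. Second instance: coalescing enabled.
first_run = submit_args(False)
second_run = submit_args(True)

print(" ".join(first_run))
print(" ".join(second_run))
```

Keeping every other argument identical between the two runs makes it easier to attribute any difference in the stage graphs to this one setting.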
0
votes
0
answers
28
views
ClassCastException in Spark SQL Incremental Load with DBT
I'm encountering a ClassCastException error when running an incremental load using DBT and Spark SQL. The error message indicates an issue with casting in the Spark execution plan:
org.apache.hive....
1
vote
1
answer
60
views
Spark EMR jobs: Is the number of tasks defined by AQE (adaptive.enabled)?
I see that the number of tasks in the Spark job is only 1000 after the initial read, whereas the number of cores available is 9000 (1800 executors * 5 cores each). I have enabled AQE and coalescing of shuffle partitions. In ...
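The arithmetic in this excerpt can be sketched as follows. This is a minimal illustration rather than anything from the question itself: the function name `slot_utilization` is hypothetical, and the sketch assumes each task occupies exactly one core (Spark's default `spark.task.cpus=1`).

```python
def slot_utilization(num_tasks: int, executors: int, cores_per_executor: int) -> float:
    """Fraction of the cluster's task slots that a stage can occupy at once.

    Assumes one core per task, so the slot count equals the core count.
    """
    total_slots = executors * cores_per_executor
    return min(num_tasks, total_slots) / total_slots


# Figures from the excerpt: 1800 executors with 5 cores each, 1000 tasks.
total_cores = 1800 * 5
utilization = slot_utilization(1000, 1800, 5)

print(total_cores)            # 9000
print(f"{utilization:.1%}")   # 11.1%
```

Under these assumptions, a stage with 1000 tasks can keep at most about one ninth of the 9000 cores busy at any moment.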
0
votes
1
answer
63
views
How to enable "Use for Hive table metadata" in "AWS Glue Data Catalog settings" using Terraform?
I am using Terraform to set up a Trino cluster managed by Amazon EMR.
Here is my Terraform code:
resource "aws_emr_cluster" "hm_amazon_emr_cluster" {
name ...