Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

0 votes
0 answers
7 views

Spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"

I use Spark on EMR with versions emr-6.13.0 and Spark 3.4.1. I am trying to run a simple Spark streaming job that reads from Kafka and writes to a memory table using foreachBatch, and it fails with "Error while ...
shayms8's user avatar
  • 741
0 votes
0 answers
24 views

EMR Serverless SparkSession builder error: ClassNotFoundException issues

I am trying to create a job in EMR Studio to run in an EMR Serverless application. It's a relatively basic script to use PySpark to read some Athena tables, do some joins, create an output dataframe ...
si1287's user avatar
  • 1
0 votes
0 answers
24 views

Does Spark shuffle/exchange convert compressed data to uncompressed form?

I have an input dataset of 450 GB in compressed Parquet format on S3. However, during the exchange it shows 10 TB. Is there any way to tune it? Two large tables are being joined and no other ...
user3858193's user avatar
  • 1,438
0 votes
1 answer
49 views

Spark Scala vs PySpark: is the DAG different?

I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, the run time is different, and so is the DAG that gets created. Here I ...
user3858193's user avatar
  • 1,438
0 votes
0 answers
15 views

Apache Oozie JA008 error - job state changed from SUCCEEDED to FAILED

I'm running Oozie HA 5.2.1 on EMR and I have an issue with this temporary directory. I have a workflow with start node -> action node -> end node. The job starts running -> runs for 10-15 ...
Stefan Ss's user avatar
0 votes
0 answers
9 views

AWS EMR - reading multiple "zip" files from S3 bucket returns "Your key is too long"

In my daily job I use EMR to process large amounts of data. The data is stored in CSV files in an S3 bucket. The idea I had was to try to process zipped CSV files instead of plain CSV. In the Hive app I use ...
Vape's user avatar
  • 131
0 votes
0 answers
15 views

Airflow error while creating EMR cluster via DAG

I am looking to create an EMR cluster via an Airflow DAG using EmrCreateJobFlowOperator, with a role called dev-emr-ec2-profile-role for jobFlow. This role is used to provision the EMR cluster via Terraform ...
Anngva82's user avatar
3 votes
0 answers
30 views

Spark Repartition/shuffle optimization

I am trying to repartition before applying any transformation logic. This takes a lot of time. The code and a snapshot of the UI are below. Can any optimization be applied here? Cluster: AWS EMR, 200 Task ...
user3858193's user avatar
  • 1,438
1 vote
0 answers
39 views

Spark EMR Shuffle Read Fetch Wait Time is 4 hours

One of my Spark jobs, which failed (see emr-spark-shuffle-fetchfailedexception-with-65tb-data-with-aqe-enabled), has a high Shuffle Read Fetch Wait Time. Is there any way it can be improved? Spark-submit spark-...
user3858193's user avatar
  • 1,438
0 votes
0 answers
27 views

Troubleshooting Kafka Integration with Spark Streaming on Amazon EMR Serverless

Objective: To set up a streaming job on Amazon EMR Serverless to process weather data from Amazon MSK (Managed Streaming for Apache Kafka) and write the word count results to an S3 bucket. Steps Taken:...
user26129742's user avatar
0 votes
1 answer
35 views

EMR Spark job creates at most 1000 partitions/tasks when AQE is enabled

I always see 1000 tasks/partitions created for Spark jobs with AQE enabled. Whether I run the job for monthly data (4 times the weekly data) or for a week of data, the shuffle partitions are the same. Which is nothing ...
user3858193's user avatar
  • 1,438
0 votes
2 answers
56 views

What does retry in the Spark UI mean?

I ran Spark in two different configurations: spark.sql.adaptive.coalescePartitions.enabled=false and spark.sql.adaptive.coalescePartitions.enabled=true. In the first instance, the stage graphs have ...
user3858193's user avatar
  • 1,438
0 votes
0 answers
28 views

ClassCastException in Spark SQL Incremental Load with DBT

I'm encountering a ClassCastException error when running an incremental load using DBT and Spark SQL. The error message indicates an issue with casting in the Spark execution plan: org.apache.hive....
Raul Zinezi's user avatar
1 vote
1 answer
60 views

Spark EMR jobs: is the number of tasks defined by AQE (adaptive.enabled)?

I see that the number of tasks in the Spark job is only 1000 after the initial read, whereas the number of cores available is 9000 (1800 executors * 5 cores each). I have enabled AQE and coalescing of shuffle partitions. In ...
user3858193's user avatar
  • 1,438
0 votes
1 answer
63 views

How to enable "Use for Hive table metadata" in "AWS Glue Data Catalog settings" using Terraform?

I am using Terraform to set up a Trino cluster managed by Amazon EMR. Here is my Terraform code: resource "aws_emr_cluster" "hm_amazon_emr_cluster" { name ...
Hongbo Miao's user avatar
  • 48.7k
