Questions tagged [amazon-emr]

Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

3,397 questions

0 votes

0 answers

7 views

Spark on EMR error when using `foreachBatch`: "terminated with exception: Error while obtaining a new communication channel"

I use Spark on EMR with these versions: emr-6.13.0, Spark 3.4.1. I am trying to run a simple Spark streaming job that reads from Kafka and writes to a memory table using foreachBatch, and it fails with "Error while ...

shayms8

asked yesterday

0 votes

0 answers

24 views

EMR Serverless SparkSession builder error: ClassNotFoundException issues

I am trying to create a job in EMR Studio to run in an EMR Serverless application. It's a relatively basic script to use PySpark to read some Athena tables, do some joins, create an output dataframe ...

si1287

asked Jul 23 at 11:22

0 votes

0 answers

24 views

Does Spark shuffle/exchange convert compressed data to uncompressed form?

My input dataset is 450 GB in S3, in compressed Parquet format. However, during the exchange it shows 10 TB. Is there any way to tune this? Two large tables are being joined and no other ...

user3858193

1,438

asked Jul 21 at 22:03

0 votes

1 answer

49 views

Spark-Scala vs PySpark: is the DAG different?

I am converting a PySpark job to Scala, and the jobs execute on EMR. The parameters, data, and code are the same. However, I see that the run time is different, and the DAG that gets created is different as well. Here I ...

user3858193

1,438

asked Jul 19 at 17:39

0 votes

0 answers

15 views

Apache Oozie JA008 error - job state changed from SUCCEEDED to FAILED

I'm running Oozie HA 5.2.1 on EMR and I have an issue with this temporary directory. I have a workflow with a start node -> action node -> end node. The job starts running -> runs for 10-15 ...

Stefan Ss

asked Jul 18 at 12:15

0 votes

0 answers

9 views

AWS EMR - reading multiple "zip" files from S3 bucket returns "Your key is too long"

In my daily job I use EMR to process large amounts of data. The data is stored in CSV files in an S3 bucket. My idea was to try processing zipped CSV files instead of plain CSV. In the Hive app I use ...

Vape

asked Jul 18 at 8:14

0 votes

0 answers

15 views

Airflow error while creating EMR cluster via DAG

I am looking to create an EMR cluster via an Airflow DAG with EmrCreateJobFlowOperator, using a role called dev-emr-ec2-profile-role for jobFlow. This role is used to provision the EMR cluster via Terraform ...

Anngva82

asked Jul 16 at 15:11

3 votes

0 answers

30 views

Spark Repartition/shuffle optimization

I am trying to repartition before applying any transformation logic. This takes a lot of time. The code and a snapshot of the UI are below. Can any optimization be applied here? Cluster: AWS EMR, 200 Task ...

user3858193

1,438

asked Jul 10 at 20:02

1 vote

0 answers

39 views

Spark EMR Shuffle Read Fetch Wait Time is 4 hrs

One of my Spark jobs failed (see emr-spark-shuffle-fetchfailedexception-with-65tb-data-with-aqe-enabled) and has a high Shuffle Read Fetch Wait Time. Is there any way it can be improved? Spark-submit spark-...

user3858193

1,438

asked Jul 7 at 17:56

0 votes

0 answers

27 views

Troubleshooting Kafka Integration with Spark Streaming on Amazon EMR Serverless

Objective: To set up a streaming job on Amazon EMR Serverless to process weather data from Amazon MSK (Managed Streaming for Apache Kafka) and write the word count results to an S3 bucket. Steps Taken:...

user26129742

asked Jul 7 at 11:04

0 votes

1 answer

35 views

EMR-Spark Job creating max 1000 partitions/task when AQE is enabled

I always see 1000 tasks/partitions created for a Spark job with AQE enabled. Whether I run the job on monthly data (4 times the weekly data) or on one week of data, the shuffle partitions are the same. Which is nothing ...

user3858193

1,438

asked Jul 5 at 14:42

0 votes

2 answers

56 views

What does retry in the Spark UI mean?

I ran Spark in two different instances, one with `spark.sql.adaptive.coalescePartitions.enabled=false` and one with `spark.sql.adaptive.coalescePartitions.enabled=true`. In the first instance, the stage graphs have ...

user3858193

1,438

asked Jul 5 at 10:54

0 votes

0 answers

28 views

ClassCastException in Spark SQL Incremental Load with DBT

I'm encountering a ClassCastException error when running an incremental load using DBT and Spark SQL. The error message indicates an issue with casting in the Spark execution plan: org.apache.hive....

Raul Zinezi

asked Jul 4 at 15:23

1 vote

1 answer

60 views

Spark EMR jobs: Is the number of tasks defined by AQE (adaptive.enabled)?

I see that the number of tasks in the Spark job is only 1000 after the initial read, whereas the number of cores available is 9000 (1800 executors * 5 cores each). I have enabled AQE and shuffle partition coalescing. In ...

user3858193

1,438

asked Jul 2 at 12:10

0 votes

1 answer

63 views

How to enable "Use for Hive table metadata" in "AWS Glue Data Catalog settings" using Terraform?

I am using Terraform to set up a Trino cluster managed by Amazon EMR. Here is my Terraform code: resource "aws_emr_cluster" "hm_amazon_emr_cluster" { name ...

Hongbo Miao

48.7k

asked Jun 29 at 9:01
