GumGum’s Kafka Journey: Producer Enhancements and Optimizing Cost

Parth Patwari
Published in GumGum Tech Blog · 9 min read · Jun 27, 2024

Introduction

In the fast-paced world of digital advertising, efficient data handling is paramount. At GumGum, Apache Kafka plays a critical role as a message bus within the ad exchange ecosystem, facilitating seamless communication between the Ad Exchange and Data Pipelines. This case study delves into the ambitious project undertaken to replace the existing Kafka streaming service (V3) with a new, more efficient system (V4), exploring the motivations behind this transition, the challenges encountered, and the solutions implemented.

Note: V3 and V4 are the internal version labels for the Kafka clusters in the GumGum Ad Exchange.

What is an Ad Exchange and a data pipeline in GumGum?

GumGum’s ad exchange allows for real-time bidding on advertising impressions, ensuring that advertisers can target their ads effectively and publishers can maximize their revenue.

For GumGum, data pipelines drive business and operational decisions by providing accurate, timely data for analysis and reporting.

Background: What is Kafka and Its Role in GumGum Ad Exchange? 🔄

Apache Kafka is a distributed streaming platform that relays messages between the Ad Exchange and reporting pipelines at GumGum. It handles various essential data streams, including inventory impressions, real-time bidding (RTB) events, and ad events. Kafka’s primary role is to ensure efficient data transfer from the Ad Exchange to the data pipelines, making it a critical component for maintaining the flow of information. By acting as a reliable message bus, Kafka allows different parts of the system to communicate and share data in real-time, which is vital for the functionality and performance of the ad exchange.

Project Overview: Goals and Implementation 🎯

Motivation for the Transition

The project aimed to replace the existing Kafka cluster (V3) with a new cluster (V4) that utilizes modern architecture. This transition was motivated by several factors:

  • Cost Reduction: The new cluster was designed to significantly reduce operational costs. By moving to a more efficient architecture, the team aimed to deliver the same performance with fewer resources.
  • Storage Resilience: The V3 cluster used 5 x GP2 EBS volumes per node configured as RAID0. RAID0 has throughput benefits, but for a critical system such as Kafka it is risky: the failure of any single volume brings down the Kafka service on that node.
  • Faster Rebalancing: Rebalancing partitions required considerable time and effort because of the limited network bandwidth of the m4.2xlarge EC2 instance type. Whenever the cluster had to be scaled, the rebalancing process could take 8+ hours while handling replication and incoming/outgoing traffic.
  • Improved Performance: Network throttling during rebalancing and single-volume failures taking down a Kafka broker both needed to be addressed. The new instance and storage configuration reduces rebalancing time and effort and makes the cluster less prone to failure.

V3 vs. V4: Cluster Differences 📊

(Table: V3 vs. V4 cluster differences)

The new V4 configuration, with its ARM64 (AWS Graviton) architecture and reduced node count, promised significant cost savings. It also externalized log parsing by using Avro schemas, which decoupled the Data Engineering pipeline from the Ad Exchange log parsers.
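
To illustrate how an externalized Avro contract decouples the two sides, here is a minimal sketch using the standard Apache Avro library. The schema and field names are invented for illustration; GumGum’s actual event schemas are internal. The point is that the producer and the data pipeline only need to agree on the shared schema, not on a log format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroEventExample {
    // Hypothetical schema for illustration only; real GumGum event schemas are internal.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"ImpressionEvent\",\"fields\":["
        + "{\"name\":\"eventId\",\"type\":\"string\"},"
        + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";

    public static byte[] serialize(String eventId, long timestamp) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord record = new GenericData.Record(schema);
        record.put("eventId", eventId);
        record.put("timestamp", timestamp);

        // Binary-encode the record. The consumer side decodes with the same shared schema,
        // so the data pipeline no longer has to hand-parse Ad Exchange log lines.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```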

Implementation Details ⚙️

A new Kafka cluster was created, and producer configurations were set up using a registry key service integral to the Ad Exchange (AE). This registry service acts as a key-value store, managing crucial aspects such as traffic throttling and feature flag values across various parts of the ad exchange. Implementing this registry service ensured that the new Kafka cluster operated efficiently, meeting the system’s requirements.
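
As a rough illustration of the idea (the actual registry service and its API are internal to GumGum, so the interface and key names below are hypothetical), the registry can be thought of as a key-value lookup that the Ad Exchange consults at runtime for throttles, feature flags, and producer settings:

```java
import java.util.Map;

public final class RegistryExample {

    // Hypothetical client interface; GumGum's real registry service is internal.
    interface RegistryClient {
        String get(String key, String defaultValue);
    }

    // In-memory stand-in so the sketch is runnable on its own.
    static RegistryClient inMemory(Map<String, String> values) {
        return (key, defaultValue) -> values.getOrDefault(key, defaultValue);
    }

    public static void main(String[] args) {
        RegistryClient registry = inMemory(Map.of(
            "feature.v4-producer.enabled", "true",
            "kafka.producer.acks", "0",
            "kafka.producer.max.block.ms", "40"));

        // A feature flag can gate whether traffic is sent to the V4 cluster at all.
        boolean v4Enabled = Boolean.parseBoolean(
            registry.get("feature.v4-producer.enabled", "false"));

        // Producer settings can be changed in the registry without a code deploy.
        String acks = registry.get("kafka.producer.acks", "1");
        String maxBlockMs = registry.get("kafka.producer.max.block.ms", "60000");

        System.out.printf("v4Enabled=%s acks=%s max.block.ms=%s%n", v4Enabled, acks, maxBlockMs);
    }
}
```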

To monitor and optimize performance, we built a comprehensive Grafana dashboard from the metrics exposed by the Kafka producer. The dashboard was instrumental in tracking and analyzing the system’s performance and accuracy, enabling us to identify and address issues promptly. These metrics played a vital role in the project, helping us debug issues and assess the performance of both the old and new clusters. For more details, refer to the Kafka producer metrics documentation.

The metrics allowed the team to monitor key performance indicators such as message throughput, latency, and error rates, providing insights necessary for maintaining high throughput and availability.
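
For example, the Java Kafka producer exposes these metrics programmatically through producer.metrics(). The sketch below pulls just the handful of metrics discussed in this post; the broker address is a placeholder, and in practice the values are exported to Grafana through a metrics reporter rather than printed:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerMetricsSnapshot {
    // Metric names from the producer's "producer-metrics" group that map to the
    // dashboard panels discussed here (connection churn, errors, latency).
    private static final Set<String> WATCHED = Set.of(
        "connection-creation-rate", "connection-close-rate",
        "record-error-rate", "request-latency-avg");

    public static void logWatchedMetrics(KafkaProducer<?, ?> producer) {
        for (Map.Entry<MetricName, ? extends Metric> e : producer.metrics().entrySet()) {
            MetricName name = e.getKey();
            if ("producer-metrics".equals(name.group()) && WATCHED.contains(name.name())) {
                // Printing is only for illustration; a metrics reporter would export these.
                System.out.printf("%s = %s%n", name.name(), e.getValue().metricValue());
            }
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            logWatchedMetrics(producer);
        }
    }
}
```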

Challenges and Debugging 🛠️

Data Parity 🔍

One of the most significant challenges faced was ensuring data parity between the old and new systems. We noticed discrepancies between the number of messages sent by the producer and those received at the broker end. This discrepancy could lead to missing data, translating to incomplete reports and potentially lost revenue.

(Figure: discrepancy between V3 and V4 record writes)

Ad Exchange Latency 🕐

Another major challenge was a significant increase in latency within the Ad Exchange (AE). What used to be a smooth, sub-millisecond operation was now taking upwards of a thousand milliseconds. The slow produce calls created back pressure, which led to latency spikes across the Ad Exchange. The root cause was traced to a bug in the Ubuntu kernel release for the ARM architecture that caused TCP memory usage to spike. Identifying and resolving this bug was critical to stabilizing the system.

High TCP Memory Usage 🧠

High TCP memory usage was observed, contributing to message discrepancies and latency increases. This issue was directly linked to the Ubuntu kernel bug affecting the ARM architecture, necessitating a focused effort to address the underlying problem. Refer to: Bug #2037335 “kernel leaking TCP_MEM” — Ubuntu for more details about this issue.
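
For reference, one quick way to spot this condition on a Linux host is to compare current TCP memory usage against the kernel’s configured thresholds. The sketch below simply reads the standard /proc files; it is illustrative rather than part of GumGum’s tooling:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Linux-only helper for the kind of check used while chasing the TCP memory leak:
// compare the kernel's TCP memory usage against its configured thresholds.
public class TcpMemCheck {
    public static void main(String[] args) throws IOException {
        // tcp_mem holds three page counts: low, pressure, and high (the hard limit).
        String tcpMem = Files.readString(Path.of("/proc/sys/net/ipv4/tcp_mem")).trim();
        System.out.println("tcp_mem thresholds (pages): " + tcpMem);

        // /proc/net/sockstat reports current TCP usage; the "mem" field is in pages.
        List<String> sockstat = Files.readAllLines(Path.of("/proc/net/sockstat"));
        sockstat.stream()
                .filter(line -> line.startsWith("TCP:"))
                .forEach(line -> System.out.println("current usage: " + line));
        // If the "mem" value keeps climbing toward the hard limit while traffic stays flat,
        // that matches the leak described in Ubuntu bug #2037335.
    }
}
```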

Strategic Debugging 🛠️

The team employed a systematic debugging approach to tackle these challenges:

  • Producer Metrics: The team monitored key metrics such as connection creation and closure, error rates, and latency through Grafana. These metrics provided insights into the performance and issues of the producer. As shown in the image below, the graphs indicated high connection create/close rates and a high error rate.
    - Connection create rate: A high rate suggests frequent new connections, which can indicate network or resource issues.
    - Connection close rate: High close rates can signal instability or misconfiguration, impacting performance.
    - Error rate: A high error rate can indicate underlying issues with the producer or network, affecting data reliability.
(Figure: producer metrics showing high connection create/close rates and a high error rate)
  • Logs Analysis: Using SumoLogic, the team analyzed logs from Kafka clusters and Tomcat, seeking patterns and anomalies. This painstaking process eventually led to the discovery of the kernel bug affecting TCP memory.
  • Broker Metrics: Regular checks were conducted for dead brokers, resource outages (CPU, memory, TCP memory, network, and disk), and replication lags to ensure the brokers were operating correctly.
  • AdExchange Investigations: The team examined logs and metrics at the pod level to pinpoint latency sources and other issues affecting the Ad Exchange.

Solutions and Outcomes ✅

With our new Kafka observability and debugging capabilities, we were able to diagnose, test, and resolve several issues in both our Kafka cluster configuration and our producer configuration.

Kafka Side Solutions 📈

To address the identified issues, various solutions were implemented on the Kafka side:

  • Matched broker-side configurations to the incoming message sizes and traffic from the producers (a sketch showing how such broker settings can be applied follows this list).
  • Increased the number of I/O and network threads to improve throughput for both network and storage.
  • Increased the number of replica fetchers to improve replication performance.
  • Resolved the tcp_mem issue by moving the Ubuntu kernel from version 6.2.0 to 5.15.0.
  • Changed the instance type to c7gn.2xlarge.
  • Replaced the 5 x GP2 EBS volumes with a single GP3 volume per node.
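
These thread, fetcher, and size settings can be applied either statically in server.properties or dynamically through the Kafka AdminClient. Below is a hedged sketch of the dynamic route; the broker endpoint and numeric values are illustrative placeholders, not GumGum’s production numbers:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerTuningExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-v4-broker:9092"); // placeholder endpoint

        try (AdminClient admin = AdminClient.create(props)) {
            // An empty resource name targets the cluster-wide default broker config.
            ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");

            Collection<AlterConfigOp> ops = List.of(
                new AlterConfigOp(new ConfigEntry("num.network.threads", "8"), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("num.io.threads", "16"), AlterConfigOp.OpType.SET),
                new AlterConfigOp(new ConfigEntry("num.replica.fetchers", "4"), AlterConfigOp.OpType.SET),
                // Broker-side message size limit kept in line with what producers may send.
                new AlterConfigOp(new ConfigEntry("message.max.bytes", "1048576"), AlterConfigOp.OpType.SET));

            admin.incrementalAlterConfigs(Map.of(cluster, ops)).all().get();
        }
    }
}
```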

Ad Exchange (Producer) Solutions 🔧

Specific solutions were also implemented on the Ad Exchange side:

  • Turn Off Acknowledgements: Disabling acknowledgements temporarily helped reduce latency, although this was a short-term measure.
  • Turn Off Retries: Setting the retries value to 0 in the producer configuration reduced latency by avoiding unnecessary retry attempts.
  • Change Max Block Time: Adjusting `max.block.ms` to less than or equal to 40 ms helped improve latency by reducing the time spent in synchronous operations during message production.
  • Stop Traffic to Inventory Cluster: As a last resort, stopping traffic to the inventory cluster provided immediate relief from issues, although this was a high-risk measure that required careful monitoring and recovery procedures.
  • Registry Config: A central configuration tool was established to manage the producer configuration, allowing these settings to be adjusted asynchronously and without a code change for each configuration update.
  • The following are some of the key producer configuration settings to adjust while measuring performance (a hedged example applying them follows below):
    - acks
    - retries
    - max.block.ms
    - linger.ms
    - batch.size
(Figure: producer metrics after the changes, showing low connection create/close rates and a low error rate)
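
Below is a minimal sketch of how these settings might be applied together. The acks, retries, and max.block.ms values reflect the measures described above; linger.ms and batch.size are illustrative placeholders, and in GumGum’s setup all of these are supplied by the registry service rather than hard-coded:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerExample {
    public static KafkaProducer<String, String> build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-v4-broker:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Settings discussed in this post. In practice these come from the registry service
        // so they can change without a deploy; they are hard-coded here only to keep the
        // sketch self-contained.
        props.put(ProducerConfig.ACKS_CONFIG, "0");           // acknowledgements off (short-term measure)
        props.put(ProducerConfig.RETRIES_CONFIG, "0");        // no retry attempts
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "40");  // cap time spent blocking in send()
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");      // illustrative batching delay (ms)
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536"); // illustrative batch size (bytes)

        return new KafkaProducer<>(props);
    }
}
```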

Error Log Samples and Solutions 📝

Outcomes 💸

  • With the new c7gn.2xlarge (ARM/Graviton) instances, network performance increased by up to 50x compared to the previous m4.2xlarge.
  • The increased network performance allows us to rebalance all partitions without network throttling.
    - The previous rebalancing process took 8+ hours due to network throttling.
    - The current rebalancing takes 2 hours for all topics.
  • The new storage layout, with a single GP3 volume per node instead of the previous five GP2 volumes, cuts the number of volumes that can fail on each node by a factor of five.
  • The additional resources of the Graviton instances allowed us to shrink the cluster from 24 nodes to 4.
  • With the new cluster in place, we reduced costs by 75%.

Conclusion 🎉

The transition from the V3 to the V4 Kafka cluster was a complex but necessary project to enhance the efficiency and maintainability of GumGum’s Ad Exchange ecosystem. Despite encountering significant challenges, the project team successfully implemented solutions that improved system performance and reduced costs. This transition highlights the importance of thorough planning, robust debugging processes, and the ability to adapt solutions to evolving technical landscapes. By leveraging Avro schemas for data serialization, optimizing producer configurations, and implementing effective monitoring and debugging strategies, we achieved the project goals and set the stage for continued improvements and innovations in the future.

This case study underscores the critical role that careful planning and execution play in large-scale system transitions. The migration to V4 is not the end of the story, but the close of one chapter: it leaves the team well positioned for the next round of challenges and opportunities at GumGum.

Further Reading and Resources 🔗

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram
