Set Up Monitoring Pipeline using K8S and TIG stack: An Experience

A personal experience of creating my very first monitoring system

Liana S
5 min read · Oct 24, 2020

Last year, I was tasked with creating a monitoring system for our existing Kafka consumers. We already had several Kafka consumers deployed and running on Google Kubernetes Engine (GKE). At one point, a consumer stopped because of a configuration mismatch, and we didn’t even realize it for several days, until the product team reported that there was no updated data. This is what happens when there is no proper monitoring in place. To tackle this, we needed a monitoring system.

First things first: what is a monitoring system?

A monitoring system collects data from your servers and other network devices so you can analyze it for trends or problems. It mainly comprises three parts: metrics, values describing your systems at a specific point in time (e.g., the number of users currently logged in to a web application); monitoring, the process of collecting, storing, and analyzing that data; and alerting, which notifies users when a threshold is met.

The only things on my mind were, “What metrics can I collect from there? Is there something on the consumer that can help me collect them?” The next day, I found Jolokia and Telegraf in the consumer’s config, and after doing some research, I arrived at the TIG stack.

TIG (Telegraf, InfluxDB, and Grafana) is an end-to-end open-source monitoring solution. It has three components: Telegraf for collecting metrics from the Kafka consumer services, InfluxDB for storing the data, and Grafana for visualization and alerting.

How is the integration?

Here is an explanation of the languages and technologies I used, and how the components interact with and affect each other.

1. Container orchestration: Google Kubernetes Engine

Google Kubernetes Engine (GKE) is a managed, production-ready environment for running containerized applications. Since we use GKE as the main environment for consuming from Kafka, I deployed all the monitoring services on Kubernetes as well.

2. Cluster tooling: kubectl and Kubernetes manifests

Kubectl is a command-line tool used to run every possible Kubernetes operation, such as creating, modifying, and deleting resources. A Kubernetes manifest also operates on Kubernetes resources, but the desired state is declared in a .yaml file.
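As a sketch of the two approaches, here is a minimal, hypothetical manifest for a Telegraf Deployment; the namespace, labels, and image tag are illustrative, not taken from our actual setup:

```yaml
# telegraf-deployment.yaml -- illustrative manifest, names are hypothetical
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telegraf
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: telegraf
  template:
    metadata:
      labels:
        app: telegraf
    spec:
      containers:
        - name: telegraf
          image: telegraf:1.14
```

Applying it with `kubectl apply -f telegraf-deployment.yaml` and checking it with `kubectl get pods -n monitoring` covers both styles: the declarative manifest and the imperative kubectl commands.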

3. Gateway and routing: Istio-system

Istio is an open-source service mesh platform that provides a way to control how microservices share data with one another. It is designed to run in a variety of environments, including Kubernetes. In this case, I use Istio to secure pod-to-pod and service-to-service communication at the network and application layers. For instance, if we want to access Grafana from outside the Kubernetes cluster, we don’t have to create an internal load balancer, since one is already provided by the Istio system.

4. Java Management Extensions (JMX) monitoring: Jolokia

Jolokia is an agent that can be deployed on JVMs to expose their MBeans through a REST-like HTTP endpoint, making JVM metrics easily available to non-Java applications running on the same host.
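For instance, the Jolokia JVM agent can be attached through a JVM flag. A hedged sketch of a consumer container spec fragment follows; the image name, agent path, and port are assumptions, not our actual values:

```yaml
# Fragment of a consumer pod spec -- image, paths, and port are illustrative
containers:
  - name: kafka-consumer
    image: my-registry/kafka-consumer:latest   # hypothetical image
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-javaagent:/opt/jolokia/jolokia-jvm-agent.jar=port=8778,host=0.0.0.0"
    ports:
      - containerPort: 8778   # Jolokia REST endpoint, later queried by Telegraf
```

With the agent running, MBean values become reachable over plain HTTP (e.g., `curl http://<pod-ip>:8778/jolokia/`), which is what makes them consumable by a non-Java collector like Telegraf.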

5. Data collector: Telegraf

Telegraf is an agent written in Go for collecting metrics and writing them to InfluxDB or other supported outputs. On Linux, for instance, we open the file /etc/telegraf/telegraf.conf and edit it to:

  1. Set jolokia2_agent as an input, add a Jolokia REST agent endpoint to query, and add the metrics to read (CPU, memory, disk utilization, etc.).
  2. Set InfluxDB as an output, and add an InfluxDB host, database, username, and password for storing the metrics.
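The two steps above can be sketched in telegraf.conf roughly as follows; the URLs, MBean, database name, and credentials are placeholders, not our production values:

```toml
# telegraf.conf -- minimal sketch; hosts and credentials are placeholders
[[inputs.jolokia2_agent]]
  urls = ["http://localhost:8778/jolokia"]

  # Read JVM heap memory usage from the standard Memory MBean
  [[inputs.jolokia2_agent.metric]]
    name  = "jvm_memory"
    mbean = "java.lang:type=Memory"
    paths = ["HeapMemoryUsage"]

[[outputs.influxdb]]
  urls     = ["http://influxdb.monitoring:8086"]
  database = "telegraf"
  username = "telegraf"
  password = "changeme"
```

On Kubernetes, the same file is typically mounted into the Telegraf pod via a ConfigMap rather than edited in place.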

6. Database: InfluxDB

InfluxDB is an open-source time series database with no external dependencies, which makes it useful for recording metrics. We can use the InfluxDB Helm chart to deploy InfluxDB as a StatefulSet on a Kubernetes cluster (I prefer a StatefulSet over a Deployment since it has its own Persistent Volume Claim).
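As a sketch, deploying via the Helm chart can look roughly like this; the namespace and persistence values are illustrative, and chart names have changed across chart versions:

```shell
# Add the InfluxData chart repository and install InfluxDB with persistence
helm repo add influxdata https://helm.influxdata.com/
helm repo update
helm install influxdb influxdata/influxdb \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=8Gi
```

Enabling persistence is what gives the workload its Persistent Volume Claim, so the recorded metrics survive pod restarts.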

7. Visualizing and alerting: Grafana

In this case, I use Grafana to create a dashboard for monitoring the Kafka consumers and to send alerts to a specific channel if something goes wrong with a consumer. There’s also a Grafana Helm chart to bootstrap a Grafana service on a Kubernetes cluster.
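Bootstrapping Grafana with its chart can be sketched like this; the namespace and the admin password are placeholders:

```shell
# Install Grafana from its official chart; admin password is a placeholder
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword=changeme
```

After installation, the InfluxDB service is added as a data source inside Grafana, pointing at the in-cluster InfluxDB address.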

8. Target alerting: Slack

Slack is a channel-based messaging platform. We use Slack not only to work together more effectively, but also to connect all our software tools and services via Slack integrations or incoming webhooks, e.g., to get notifications about the Kafka consumers from Grafana.
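An incoming webhook is just an HTTP POST of a JSON payload, so it’s easy to sanity-check before wiring it into Grafana; the webhook URL below is a placeholder, not a real one:

```shell
# Post a test message to a Slack incoming webhook (URL is a placeholder)
curl -X POST -H 'Content-Type: application/json' \
  --data '{"text": "Test alert: Kafka consumer monitoring is live"}' \
  https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX
```

If the webhook is valid, Slack replies with `ok` and the message appears in the configured channel.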

High-level picture of how I implemented the TIG stack in the environment.

From the picture above, you can see how the processes happen:

  • The Kafka components require Jolokia to be deployed and started as a modern, efficient interface to JMX; Telegraf collects the exposed metrics through it.
  • Telegraf stores the Kafka component metrics in InfluxDB in near real time (every 10 seconds).
  • Once I add the InfluxDB instance as a data source in Grafana and query the data, Grafana re-runs the query periodically (every 5 seconds, 10 seconds, etc.).
  • Grafana pushes an alert to Slack once the metrics meet any of the alert rules (dead processes, revoked partitions).

After the InfluxDB data loads properly, the dashboard looks like this:

Graph panels contain information about Kafka consumer status, e.g., the number of running or dead Kafka processes per hour

I created a notification channel in Grafana using the Slack incoming webhook to push notifications to a specific Slack channel. Then, in the Alert tab of the graph panel, I configured how often the alert rule should be evaluated and what conditions need to be met for the alert to trigger its notification channel.

Of the many metrics, I chose the dead-process count to trigger the alert, because it indicates that the consuming process has stopped. This is what the result looks like:
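As an illustration, the graph panel behind such an alert could be backed by an InfluxQL query like the one below; the measurement, field, and tag names are hypothetical and depend on which Jolokia MBeans Telegraf actually collects:

```sql
-- Hypothetical InfluxQL: dead consumer processes over the last 5 minutes
SELECT count("value")
FROM "kafka_consumer_process"
WHERE "state" = 'dead' AND time > now() - 5m
GROUP BY time(1m)
```

The alert rule then fires when the counted value stays above zero for the configured evaluation window.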

Slack alert from Grafana

Summary and Constraints

The monitoring system is near real time (every 10 seconds) and scales well enough. Some people, myself included, prefer not only to get alerts but also to see the details on a visual dashboard, which I can do by creating a graph panel in Grafana. Thanks to Helm charts, the stack itself is easy to set up.

Nevertheless, there are several concerns with implementing this stack:

  • The TIG stack has a steep learning curve, and learning it takes some time, of course :).
  • I couldn’t collect several Jolokia metrics in our environment due to incompatible Java library versions.
  • There is no user interface for the Kubernetes resources, so I have to create, update, or delete the manifests through the kubectl command line.
  • It’s hard to implement Istio on Kubernetes, especially for StatefulSets. I needed some help from the DevOps team with the Istio setup.

Here are several guides I used to implement the TIG stack:

  1. Getting started with Telegraf (https://github.com/influxdata/docs.influxdata.com/blob/master/content/telegraf/v0.12/introduction/getting-started-telegraf.md)
  2. Monitoring with Telegraf, InfluxDB and Grafana — An introduction to the TIG stack and a tutorial to set it up. (https://stanislas.blog/2018/04/monitoring-telegraf-influxdb-grafana/#influxdbinstallation)
  3. How to Deploy InfluxDB / Telegraf / Grafana on K8s? — Blog post related to Kraken’s deployment on Kubernetes (https://octoperf.com/blog/2019/09/19/kraken-kubernetes-influxdb-grafana-telegraf/#map-a-configuration-file-using-configmap)
