Create Prometheus Monitoring Server With Grafana Dashboard
— Prometheus, Grafana, Golang — 5 min read
Prometheus is a tool used for systems and service monitoring. The code we write runs on actual hardware, which means we are using physical resources that cost money to operate, so we had better use those resources as efficiently as possible. This is where systems monitoring comes into the picture. Running complex applications on real servers is complicated, and things can go haywire for several reasons. Some potential problems are:
- Disk Full -> No new data can be stored
- Software Bug -> Request Errors
- High Temperature -> Hardware Failure
- Network Outage -> Services cannot communicate
- Low Memory Utilization -> money wasted
These problems occur more often than you think, so it is necessary to monitor your systems and services to keep their health in check.
Monitoring means collecting information from your system, deriving insights from it, and acting on those insights. There are different ways in which you could do the monitoring.
- Check-Based Monitoring -> Run scripts periodically to check the health of servers. Very static, with very local context based on individual machines.
- Logs/Events -> Record full details about each event. Can be structured or unstructured; further analysis is required (Loki, InfluxDB). Lacks inter-service correlation.
- Metrics/Time Series -> Numeric values sampled over time (OpenTSDB, Prometheus). Good for aggregate health monitoring, but you still need logs for detailed analysis.
For inter-service correlation, we use tracing, which tracks single requests through the entire stack (Jaeger).
Prometheus is a metrics-based monitoring and alerting stack made for dynamic cloud environments. It does not do logging or tracing.
Architecture of Prometheus:
Prometheus pulls time-series metrics from things called targets at regular intervals and stores them on a local disk.
These targets come in two flavours. For the first, you control the source code, such as your web app or API, which is what we will be doing here; in this case you use a Prometheus client library to expose an endpoint from which Prometheus can gather the data. The second is something you don't necessarily control, like a Linux VM or a SQL database instance; here you use something called an exporter, which sits alongside the system and exposes its metrics in a form Prometheus can scrape.
The Prometheus server is then configured to pull, or scrape, data from these targets and store the resulting time series on a local disk.
Having covered the what and how of Prometheus, let us create our service and monitor it using Prometheus. You can check out the GitHub repository containing all the code here.
Let’s start by creating our server.
package main

import (
	"log"
	"math/rand"
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// The three status codes our fake API can return.
var (
	listOfCodes = []int{http.StatusOK, http.StatusNotFound, http.StatusForbidden}
)

func init() {
	rand.Seed(time.Now().UnixNano())
}

func main() {
	// Expose the metrics endpoint and the two API routes.
	http.Handle("/metrics", promhttp.Handler())
	http.HandleFunc("/api/v1", getRoute)
	http.HandleFunc("/api/v2", getRoute)
	err := http.ListenAndServe(":9000", nil)
	if err != nil {
		log.Fatal(err)
	}
}

func getRoute(w http.ResponseWriter, r *http.Request) {
	// Pick a status code at random: 200 most of the time,
	// 404 for two of the nine outcomes, 403 for one.
	rnd_no := 1 + rand.Intn(9)
	statusCode := listOfCodes[0]
	if rnd_no == 7 {
		statusCode = listOfCodes[2]
	}
	if rnd_no == 3 || rnd_no == 5 {
		statusCode = listOfCodes[1]
	}

	// Increment the counter for this endpoint/status code combination.
	statusCodeProcess.With(prometheus.Labels{"endpoint": r.URL.Path, "status_code": strconv.Itoa(statusCode)}).Inc()

	if statusCode == http.StatusOK {
		w.WriteHeader(statusCode)
		w.Write([]byte("200 - Status OK!"))
	}
	if statusCode == http.StatusNotFound {
		w.WriteHeader(statusCode)
		w.Write([]byte("404 - Status Not Found!"))
	}
	if statusCode == http.StatusForbidden {
		w.WriteHeader(statusCode)
		w.Write([]byte("403 - Status Forbidden!"))
	}
}

// Counter vector labelled by endpoint and status code.
var (
	statusCodeProcess = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "myapp_status_codes",
		Help: "Gives status codes of particular route",
	},
		[]string{"endpoint", "status_code"},
	)
)
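If you are building this from scratch, the Prometheus client library needs to be on your module path before the server will compile; assuming a hypothetical module name of myapp, something like the following should do it:
go mod init myapp
go mod tidy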
So what we are essentially doing in this piece of code is creating two endpoints, /api/v1 and /api/v2. Whenever you hit these endpoints, you get either a 200, a 404, or a 403. The handler is coded in such a way that the probability of getting a 200 is higher than that of a 404, which in turn is higher than that of a 403.
The metrics we will be monitoring are exposed on the /metrics endpoint, which promhttp.Handler() has already populated with some default metrics. These include go_goroutines (number of goroutines that currently exist), go_memstats_alloc_bytes (number of bytes allocated and still in use), go_threads (number of OS threads created), and many more.
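For instance, a freshly started server should return lines roughly like the following from /metrics (the exact values will of course differ on your machine):
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 7
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 9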
There are four metric types you could use; I am using the Counter type. You can check the other types here. Every time series is uniquely identified by its metric name and optional key-value pairs called labels. Here, we have used the metric name myapp_status_codes with two labels, endpoint and status_code. The endpoint label gives us the name of the endpoint being hit, while status_code holds the status code returned.
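As a rough sketch of what the other three types look like with promauto (these particular metrics are made up for illustration and are not part of our server):
// Gauge: a value that can go up and down, e.g. the number of items currently queued.
var queueLength = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "myapp_queue_length",
	Help: "Current number of items in the queue",
})

// Histogram: samples observations into buckets, e.g. request durations.
var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "myapp_request_duration_seconds",
	Help:    "Request duration in seconds",
	Buckets: prometheus.DefBuckets,
})

// Summary: similar to a histogram, but computes quantiles on the client side.
var responseSize = promauto.NewSummary(prometheus.SummaryOpts{
	Name: "myapp_response_size_bytes",
	Help: "Response size in bytes",
})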
statusCodeProcess.With(prometheus.Labels{"endpoint": r.URL.Path, "status_code": strconv.Itoa(statusCode)}).Inc()
The above line of code increments the counter, setting the labels to the appropriate values.
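The client library also offers an equivalent, slightly shorter form, WithLabelValues, where the label values are passed positionally in the order the labels were declared:
// Same effect as the With(prometheus.Labels{...}) call above.
statusCodeProcess.WithLabelValues(r.URL.Path, strconv.Itoa(statusCode)).Inc()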
Now, we can run the server by executing go run main.go. You can visit http://localhost:9000/api/v1 or http://localhost:9000/api/v2 and check the responses. The counter value can be found at http://localhost:9000/metrics, where you will see something like
# TYPE myapp_status_codes counter
myapp_status_codes{endpoint="/api/v1",status_code="200"} 1
Here, 1 indicates the number of times the endpoint /api/v1 was hit and returned a 200 response.
Now that our target is ready, we need to create a Prometheus server that will scrape our API server. To create a Prometheus server, we need to create a config file.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'api-server'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9000']
The format for writing this config file can be found here. We can now run the server by executing
prometheus --config.file=./prometheus.yml
Now, you can go to http://localhost:9090. You will be greeted with the screen shown below.
You can now enter a PromQL expression and query the metrics. You can find a guide on how to query using PromQL here. You can do a lot of useful things from here alone, but we will go ahead and connect Prometheus with Grafana and make beautiful graphs that give us useful insights into our service as well.
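As a quick taste, an expression like the one below (illustrative only, using the metric we defined earlier) shows the per-second request rate grouped by status code:
sum by (status_code) (rate(myapp_status_codes[1m]))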
Grafana is open-source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics no matter where they are stored. In plain English, it provides you with tools to turn your time-series database (TSDB) data into beautiful graphs and visualizations.
You can install Grafana by following instructions mentioned here. Once installed you can run a Grafana server by executing
grafana-server
This will start a Grafana server on http://localhost:3000. The default username and password are both admin. Before you start creating beautiful graphs, you first need to add Prometheus as a data source. You can find the option to add a data source on the home screen itself. Select Prometheus as the data source and you will be directed to a settings page, where you need to fill in the appropriate details as shown below (the key one being the Prometheus server URL, http://localhost:9090).
Now, click on Save & Test. You should see a notification that says Data source is working.
After adding Prometheus as a data source, we will make a new dashboard. To do that, press the + icon in the left pane and click on Dashboard. You will be greeted by the screen below.
Click Add Query and you will be taken to the New Dashboard screen, where we can make those beautiful graphs using PromQL queries, as discussed above.
The query we will be performing measures the rate at which requests hit a particular endpoint and which response is being returned. The [15s] means that the rate is measured over the last 15 seconds.
rate(myapp_status_codes{endpoint="/api/v1", status_code="200"}[15s])
Similar to the above rate query for the 200 response, we have done the same for 403 and 404 as well.
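Those two queries look the same, with only the status_code label changed:
rate(myapp_status_codes{endpoint="/api/v1", status_code="403"}[15s])
rate(myapp_status_codes{endpoint="/api/v1", status_code="404"}[15s])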
You can do the same now for the other endpoint, below shown is a sample for the other endpoint.
To generate traffic on our custom server, I am using a tool called Vegeta, which simulates traffic against a given endpoint. You can install the tool from its GitHub repository. To use Vegeta to simulate traffic, run the following commands.
~ $ echo "GET http://localhost:9000/api/v1" | vegeta attack -duration=600s | tee results.bin | vegeta report~ $ echo "GET http://localhost:9000/api/v2" | vegeta attack -duration=600s | tee results.bin | vegeta report
These commands will hit the respective endpoints on the server for 600 seconds, and you can watch the results on the Grafana dashboard you just created.
You may wonder why it matters which status codes are returned on a particular endpoint. Take a payment gateway service as an example: it is important to keep track of the successful and unsuccessful responses returned while users are paying through the service. A sudden spike in unsuccessful responses means something is wrong with the server, and you should rectify the problem immediately.
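For example, one way to watch for such a spike with the metric from this post would be an error-ratio expression along these lines (illustrative only), which you could graph or alert on:
sum(rate(myapp_status_codes{status_code!="200"}[5m])) / sum(rate(myapp_status_codes[5m]))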
The final dashboard will look something like below.
Hope you were able to understand and follow along on making a Monitoring Dashboard with Prometheus + Grafana. Thanks for reading. If you have any questions, feel free to leave a response.
Resources for further reading:
- https://www.youtube.com/watch?v=5O1djJ13gRU (an amazing video for understanding why Prometheus is needed)
- https://www.scaleway.com/en/docs/vegeta-load-testing/
- https://tomgregory.com/the-four-types-of-prometheus-metrics/
- https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085