How to read metrics from the Prometheus endpoint?

jozefrebjak

We have successfully configured the Prometheus endpoint as described in https://docs.netgate.com/tnsr/en/latest/monitoring/prometheus.html

Everything is working, but how do we read that data? Is there a Grafana dashboard, for example, or something else where we can learn how to use these stats?

We really want to use TNSR, but we need statistics from the dataplane.

I believe somebody has figured this out and can help us.

Thank you.

kiokoman

From what I read in the docs, Prometheus runs on top of CentOS 8 and listens on the host management IP address.
I think any how-to you can find on the web will work, like this one:
https://prometheus.io/docs/visualization/grafana/

Also:
https://prometheus.io/docs/visualization/grafana/#importing-pre-built-dashboards-from-grafana-com
"Grafana.com maintains a collection of shared dashboards which can be downloaded and used":
https://grafana.com/grafana/dashboards?dataSource=prometheus
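If you want to look at the raw data before setting up Grafana, keep in mind that a Prometheus scrape endpoint serves plain text in the Prometheus exposition format, so it can also be fetched and parsed directly. Below is a minimal Python sketch using only the standard library. It is a simplification, not a full parser: it handles plain name{labels} value lines, skips comments and optional timestamps, and does not handle label values that contain escaped quotes or a closing brace.

```python
import re
import urllib.request

# One sample line: metric_name{label="value",...} value [timestamp]
# Simplified: label values containing escaped quotes or "}" are not handled.
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+\S+)?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Return a list of (name, labels_dict, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        m = _SAMPLE.match(line)
        if m is None:
            continue  # line uses a feature this sketch does not handle
        labels = dict(_LABEL.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def fetch_metrics(url, timeout=5.0):
    """Download the raw exposition text from a scrape endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, parse_exposition(fetch_metrics("http://192.0.2.10:9999/metrics")) lists every sample; the host, port and path in that URL are placeholders, so substitute whatever your endpoint is actually configured with.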

xhun

It's quite straightforward to pull data from Prometheus with Grafana, as kiokoman mentioned. We set it up three weeks ago in our test environment and it has been working fine so far. I didn't find any ready-to-use dashboard online, but it would be nice if the Netgate team could provide a complete one later.

There's one metric I couldn't find, though: CPU utilization. Being able to see how much CPU DPDK is actually using would be quite useful. It is possible to see how many threads are in use, but not how much each thread, or the dataplane in total, is actually using. Is there a way to get this data?
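On the Grafana side, the usual setup is that a Prometheus server scrapes the endpoint and Grafana queries that server through the Prometheus HTTP API (/api/v1/query). To check the same data from a script, a minimal standard-library sketch could look like this. The server URL in the usage note is a placeholder, and the parser assumes an instant-vector result (a scalar or string result has a different shape).

```python
import json
import urllib.parse
import urllib.request


def build_query_url(server, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return server.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload):
    """Turn an instant-vector response body into [(labels, value), ...]."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise RuntimeError(body.get("error", "query failed"))
    # Each result has its label set in "metric" and [timestamp, "value"] in "value".
    return [(r["metric"], float(r["value"][1])) for r in body["data"]["result"]]


def instant_query(server, promql, timeout=5.0):
    """Run one PromQL instant query and return the parsed vector."""
    with urllib.request.urlopen(build_query_url(server, promql), timeout=timeout) as resp:
        return parse_vector(resp.read().decode("utf-8"))
```

For example, instant_query("http://prometheus.example:9090", "up") shows whether each scrape target, including the TNSR endpoint, is currently being scraped (1) or not (0). 9090 is the Prometheus server's default port; the host name is a placeholder.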

jimp

@xhun said in How to read metrics from the Prometheus endpoint?:

There's one metric I couldn't find, though: CPU utilization. Being able to see how much CPU DPDK is actually using would be quite useful. It is possible to see how many threads are in use, but not how much each thread, or the dataplane in total, is actually using. Is there a way to get this data?

The dataplane will use 100% of a core at all times when polling data. CPU usage is not a relevant metric for TNSR.

xhun

@jimp said in How to read metrics from the Prometheus endpoint?:

The dataplane will use 100% of a core at all times when polling data. CPU usage is not a relevant metric for TNSR.

Hi,

I'm aware that the host/kernel always shows 100% utilization for each core assigned to the dataplane, which is the usual behaviour with DPDK. In other vRouters, though, I've seen a metric showing how much of that capacity the dataplane is actually using.

Let's suppose the router is "idle" with a couple of hundred pps running through it, and then with 1M pps. It makes sense to have some measurement of CPU utilization that rises and falls with the pps rate, since that is what actually happens, isn't it? It would help us know when, and whether, we need to add more cores to the dataplane.

jimp

I don't have a list of the current TNSR Prometheus data handy, but what you need to look for there are the dataplane stats. I don't think we have a command in the TNSR CLI to get these specific ones yet, but the interesting ones are in the top part of the output of sudo vppctl show runtime, such as the vector rate and loops/sec. (A higher vector rate means higher load; fewer loops/sec also means higher load.)
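Until those numbers show up elsewhere, one option is to read them from the header of sudo vppctl show runtime yourself. The Python sketch below assumes each thread section starts with a "Thread N name ..." line followed by a line containing "vector rate X loops/sec Y"; that layout is an assumption, so compare it with your own output, since it may differ between VPP versions. The subprocess helper also assumes vppctl is on the PATH and that sudo does not prompt for a password.

```python
import re
import subprocess

# Assumed header layout, one block per thread, e.g.
#   Thread 1 vpp_wk_0 (lcore 2)
#   Time 10.5, 10 sec internal node vector rate 3.25 loops/sec 1234567.00
_THREAD = re.compile(r"^Thread\s+(\d+)\s+(\S+)")
_RATE = re.compile(r"vector rate\s+([\d.]+)\s+loops/sec\s+([\d.]+)")


def parse_runtime_header(text):
    """Map thread name -> (vector_rate, loops_per_sec) from 'show runtime' output."""
    stats = {}
    current = None
    for line in text.splitlines():
        m = _THREAD.match(line.strip())
        if m:
            current = m.group(2)  # e.g. "vpp_main", "vpp_wk_0"
            continue
        m = _RATE.search(line)
        if m and current is not None:
            stats[current] = (float(m.group(1)), float(m.group(2)))
            current = None  # only the first rate line after each thread header counts
    return stats


def read_runtime():
    """Run the CLI command and return its text output (needs sudo rights)."""
    out = subprocess.run(["sudo", "vppctl", "show", "runtime"],
                         capture_output=True, text=True, check=True)
    return out.stdout
```

Calling parse_runtime_header(read_runtime()) on a schedule and recording the results gives a rough load trend per thread. When worker threads are configured, the vpp_wk_* entries are typically the ones that reflect forwarding load.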

xhun

@jimp
Thanks for the info. I'll check vppctl.

jimp

I had a chance to look at the data from Prometheus on TNSR, and the nodes you'll want to track for load appear to be:

_sys_vector_rate
_sys_vector_rate_per_worker

That's on 20.10, which will be out soon. I didn't have a 20.08 system with Prometheus handy to check whether it exposes the same data.
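To turn _sys_vector_rate_per_worker into the "when should we add cores" signal asked about earlier, one approach is to average each worker's vector rate over a window and flag workers that stay above a threshold, since a single reading can spike briefly. The sketch below assumes the readings are already grouped per worker; how you collect them, and what the worker labels look like, depend on your Prometheus setup. The threshold is a placeholder you would tune against your own traffic, not a Netgate recommendation.

```python
from statistics import mean


def busy_workers(samples, threshold, min_samples=3):
    """Return {worker: average} for workers whose mean vector rate exceeds threshold.

    samples: dict mapping a worker identifier (hypothetical label value) to a
    list of vector-rate readings taken over the observation window, e.g. one
    reading per scrape. Workers with fewer than min_samples readings are
    skipped as too noisy to judge.
    """
    result = {}
    for worker, rates in samples.items():
        if len(rates) < min_samples:
            continue
        avg = mean(rates)
        if avg > threshold:
            result[worker] = avg
    return result
```

The same check can also be done inside Prometheus itself with PromQL, for example avg_over_time(_sys_vector_rate_per_worker[10m]) > 64, where 64 is again only a placeholder threshold.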