Grafana Dashboard using Telegraf with additional plugins
-
This is as granular as you can really get with the data collected from pfBlocker. You could go deeper, but I don't think it would provide any real value. The blank borders around the DNSBL stuff really bug me!
-
Nice work there. I think you have a typo on one of the charts on the left-hand side:
IP-Top 10 Blocked - OUT (By Host/Protocol) should read IN.
You could probably lay out the objects in a different way to end up with less blank space in the bottom left/right. It's hard to visualise off the top of my head, but the layout might work better in columns, i.e. IP - Top 10 Blocked - * is currently 3 items in a row; perhaps pivoting it all to be vertical would work better?
To use up the free space, one thing that could be useful is a variation on the blocked packet stats graph. You could create stacked bar charts grouped by IP. This would give a visual indication of the proportion of blocked outbound traffic coming from each of your internal devices, and for inbound it would visually highlight when a particular IP is attacking you.
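Roughly the sort of query I have in mind, borrowing the measurement/field names from your existing panels. This is only a sketch, shown as a CLI one-liner against a database I'm assuming is called pfsense; in Grafana you'd use $timeFilter / $__interval instead of the hard-coded time range and set the visualisation to stacked bars:
# blocked outbound packets per internal source IP, bucketed into 5-minute bars
influx -database 'pfsense' -execute "SELECT count(\"action\") FROM \"autogen\".\"tail_ip_block_log\" WHERE \"action\" = 'block' AND \"direction\" = 'out' AND time > now() - 6h GROUP BY time(5m), \"src_ip\""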
What do you think of my suggestion above about hyperlinking the public IPs to greynoise.io, or another alternative? Maybe you could put the URL fragment in a variable so people can pick their favourite IP analysis website?
-
@victorrobellini 7.1.1, but I wouldn't think that would make a difference regarding my particular issue. Your design applies the value mapping function to all fields within the panel, but instead it should only be applied to the Status column exclusively under the Override settings. I've fixed this on my own dashboard and I started to attempt a pull request to fix it, but didn't have the patience to manually adjust only the relevant lines in your JSON file and reimport for testing.
-
I made a few updates to the plugins, telegraf config, and the dashboard.
More gateway details
More pfBlocker details (in a separate panel so you can hide it if you don't care)
Screenshot to be updated
-
@victorrobellini exciting. I look forward to the screenshot / git commit.
-
Screenshot updated. The only thing I need to figure out is what to put in the pfSense Details blank space. Adding graphs for the sake of filling space means unnecessary influx queries and CPU usage on the host. I'm running my influx on a J5005 with a bunch of other services. I'll deal with the blank space for now.
I added the extra pfBlocker details to its own panel so it can be collapsed if it's not used.
-
@victorrobellini I'll pull down the latest update this eve and take a look (thanks in advance). I have a few more CPU cycles available in my setup:
NUC10i7FNK -> proxmox -> debian vm -> docker -> influxdb
Latest update on the ntopng experiment: it is up and running. I had to customise the startup script so that I could create custom applications in the GUI. Bit of a pain that a key feature isn't supported OOTB.
I've noticed that Network Discovery gets itself confused and assigns device names incorrectly based on mdns. Haven't found a way to turn this off yet. For example, sometimes ntopng thinks my firewall network interface is called "google home mini", which is clearly incorrect. It might be related to running avahi / pimd to get chromecast running across vlans. More investigations to follow.
Haven't tried exporting ntopng to influxdb yet.
I'm in two minds whether to skip over ntopng and investigate elastiflow instead.
-
@VictorRobellini I've pulled down the latest version of the conf, plugins and dashboard, and I have noticed one thing that has stopped working.
This is the old gateway RTT dashboard. Also note the Gateway list:
This is the new dashboard. The gateway names have changed and the RTT chart is no longer working. Does it work for you?
When running telegraf --test I think this is the relevant snippet after changing the dashboard and plugins to the new version:
> gateways,host=fw,interface=igb0 defaultgw=1,delay=2.117,gwdescr="Interface WAN_DHCP Gateway",loss=0,monitor="192.168.0.1",source="192.168.0.30",status="online",stddev=3.885,substatus="none" 1619643426000000000
> gateways,host=fw,interface=igb0 defaultgw=0,delay=0,gwdescr="Interface WAN_DHCP6 Gateway",loss=0,monitor="",source="",status="",stddev=0,substatus="" 1619643426000000000
> gateways,host=fw,interface=igb0 defaultgw=1,delay=2.117,gwdescr="Interface WAN_DHCP Gateway",loss=0,monitor="192.168.0.1",source="192.168.0.30",status="online",stddev=3.885,substatus="none" 1619643426000000000
> gateways,host=fw,interface=igb0 defaultgw=0,delay=0,gwdescr="Interface WAN_DHCP6 Gateway",loss=0,monitor="",source="",status="",stddev=0,substatus="N/A" 1619643426000000000
-
@wrightsonm
Did you drop your gateways measurement? I eliminated unused tags in the influx data. I added a blurb in the Readme about things not rendering properly.
-
Great looking dashboard!
I'm wondering if there are step-by-step instructions on how to go about installing and configuring this, as I'm finding the GitHub instructions rather lacking.
-
OK, so here are the steps I've performed so far. I think there are bugs in the latest git repository.
docker exec -it influxdb /bin/sh
influx delete --bucket pfsense --start '1970-01-01T00:00:00Z' --stop $(date +"%Y-%m-%dT%H:%M:%SZ") --predicate '_measurement="tail_dnsbl_log"'
influx delete --bucket pfsense --start '1970-01-01T00:00:00Z' --stop $(date +"%Y-%m-%dT%H:%M:%SZ") --predicate '_measurement="tail_ip_block_log"'
influx delete --bucket pfsense --start '1970-01-01T00:00:00Z' --stop $(date +"%Y-%m-%dT%H:%M:%SZ") --predicate '_measurement="gateways"' --org-id [id]
influx delete --bucket pfsense --start '1970-01-01T00:00:00Z' --stop $(date +"%Y-%m-%dT%H:%M:%SZ") --predicate '_measurement="interface"' --org-id [id]
If your WAN config looks like mine, with IPv4 and IPv6 enabled, then you end up with a problem with the gateways telegraf lines, as igb0 exists twice. Gateways used to use gateway_name, probably for this reason. It means that the tag grouping won't work if you have IPv4 and IPv6 enabled.
So I've had to change telegraf_gateways.php to include gateway_name and then adjust the Grafana dashboard to use gateway_name instead of interface (basically reverting it back to how it worked in the previous revision).
The next issue is that telegraf is no longer exporting the interface measurement.
Looks like this changeset replaced "interface" with "gateways". Looks like a copy-paste issue to me.
github diff
As a result of this problem, none of the interfaces show up correctly on the dashboards.
The interface summaries work with the old version of telegraf_pfinterface.php.
Final note: this dashboard now requires a 1440p monitor to view everything without horizontal scrolling. It would be nice if it displayed on 1080p monitors.
-
@wrightsonm
This is the dashboard working once again for me:telegraf --test --config /usr/local/etc/telegraf.conf
Grafana Gateway Variable changed back to:
SHOW TAG VALUES FROM "gateways" WITH KEY = "gateway_name" WHERE "host" =~ /^$Host$/
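If you want to sanity-check the tag straight from influx before touching Grafana, the same query can be run from the CLI. A rough sketch, assuming the 1.x influx client and a database named pfsense (adjust names to your setup):
# list the gateway_name tag values stored in the gateways measurement
influx -database 'pfsense' -execute 'SHOW TAG VALUES FROM "gateways" WITH KEY = "gateway_name"'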
Modified plugins:
telegraf_gateways.php.txt
Reverted plugin:
telegraf_pfinterface.php.txt
Modified dashboard:
grafana.json.txt
-
@wrightsonm said in Grafana Dashboard using Telegraf with additional plugins:
with ipv4 and ipv6 enabled, then you end up with a problem with the gateways telegraf lines as igb0 exists twice
Does this happen for the interface table? If you are willing to troubleshoot this with me, please open a github issue so we don't flood this thread with troubleshooting.
-
@jpetovello This is a pet project I built for use with my homelab and documented to help others. The main prereqs are that you have Influx and Grafana already set up. I didn't document that since everyone is going to have their systems set up differently. Luckily, there's no shortage of tutorials available online. My recommendation is to get your InfluxDB and Grafana set up, read the GitHub readme, read through this thread, and search the closed issues in the GitHub project.
-
@victorrobellini yeah sure. I've not got any time until Tuesday now. The solution is in the attachments and pictures above. I can raise a Pull Request with the fixes on Tuesday if you like?
-
@wrightsonm I just reread what you wrote. I have no idea how that happened. Thanks for the heads up. I may just merge the 2 scripts since they are doing very similar stuff. I'll take a look this weekend.
Update: I merged the scripts and updated the repo. Everything should be working now.
-
@wrightsonm said in Grafana Dashboard using Telegraf with additional plugins:
Final note: this dashboard now requires a 1440p monitor to view everything without horizontal scrolling.
Sorry, I run this on either a 4k or an ultrawide 1080p display. If you have it reformatted for 1080p just add the json with a new name and submit a PR.
-
I copied the new plugin scripts and set permissions on them (755), and imported the new dashboard, but now I'm not getting any temperature readings, IP or DNSBL lists, network interface summary, or Gateway RTT/loss. Any hint as to what I missed in the upgrade process?
-
@von-papst Got temperatures working (CRLF line endings from M$ messed up the script). Still missing the lists from pfBlocker-NG and the interface summary.
-
@von-papst Check the CRLF on all the plugins. Then use the telegraf test command to check that there are no errors. Failing that, add the debug logging option to telegraf and check the log file. Instructions for the above are in the GitHub readme.
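For example, something along these lines will show and strip the CRLF endings. The plugin path is just an example, point it at wherever you copied the scripts; the chmod matches the 755 you already set:
# "CRLF line terminators" in the output means the file still has Windows line endings
file /usr/local/bin/telegraf_temperature.sh
# strip the carriage returns and restore the execute bit
tr -d '\r' < /usr/local/bin/telegraf_temperature.sh > /tmp/telegraf_temperature.sh \
  && mv /tmp/telegraf_temperature.sh /usr/local/bin/telegraf_temperature.sh \
  && chmod 755 /usr/local/bin/telegraf_temperature.sh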
-
@wrightsonm Got pfblocker working, changed queries and added some info. But Gateway RTT and loss still not working.
-
@wrightsonm Got everything working now. I needed to modify some queries in the dashboard and added gateways.py script.
-
I'm a bit confused, where do the plugins need to be placed? Am I supposed to upload them to my pfSense install?
-
@jpetovello You should upload them to your pfSense.
-
I think there was some drift between my local system and my repo. I've updated the dashboard JSON. It should work with the updated gateways/interface plugin.
- 12 days later
-
@victorrobellini What is your influxdb RAM utilisation looking like with the latest set of changes? With the Grafana dashboard, my influxdb RAM has increased to 12GB. I had to increase the RAM allocation on my docker VM (now at 20GB). I'm going to keep an eye on it.
I think high series cardinality might be related. Will do some investigations. Unfortunately InfluxDB OSS v2 doesn't currently implement the cardinality command (only Cloud version at the moment).
https://docs.influxdata.com/influxdb/v2.0/reference/flux/stdlib/influxdb/cardinality/
I also think the logged data shows something port scanning me yesterday which may be related to the big increase in cardinality of the data.
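In the meantime I'm just keeping an eye on the container from the docker host with something like this (container name as used in the delete commands earlier in the thread):
# one-shot snapshot of the influxdb container's memory and CPU usage
docker stats --no-stream influxdb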
- 27 days later
-
I found this neat little command playing around with PowerD options and it seems to be really light and work well for tracking CPU frequency changes.
sysctl dev.cpu.0.freq
I was thinking it would make a nice graph for those who are also using PowerD, which I would think is most people, but I could be wrong.
I would write this myself but it would just end up getting done better by others in this thread :)
@VictorRobellini what do you think about this, good add?
-
Interesting, I never messed with that, I just turned it on. This is an easy fit and can use the same script as the telegraf_temperature.sh script.
All you need to do is add the following line to the end of the script and build some graphs.
sysctl dev.cpu | fgrep "freq:" | tr -d '[:blank:]' | awk -v HOST="$HOSTNAME" -F '[.:]' '{print "temperature,sensor="$2$3",host="HOST" "$4"="$5""substr($7, 1, length($7)-1)}'
The better way to implement it (which I don't have the time for right now) is to completely rename the telegraf_temperature.sh script to be something like telegraf_sysctl.sh and update all of the commands to output with a similar format and then update the queries and graphs. If you just want to poke around and see what you can get, use the above recommendation.
sysctl dev.cpu | fgrep -e "freq:" -e temperature | tr -d '[:blank:]' | awk -v HOST="$HOSTNAME" -F '[.:]' '{print "sysctl,sensor="$2$3",host="HOST" "$4"="$5""substr($7, 1, length($6)-1)}'
sysctl hw.acpi.thermal | fgrep temperature | tr -d '[:blank:]' | awk -v HOST="$HOSTNAME" -F '[.:]' '{print "sysctl,sensor="$4",host="HOST" "$5"="$6"." substr($7, 1, length($7)-1)}'
Something like this:
sysctl,sensor=cpu0,host=pfSense.home freq=1900
sysctl,sensor=cpu3,host=pfSense.home temperature=47.0
sysctl,sensor=cpu2,host=pfSense.home temperature=47.0
sysctl,sensor=cpu1,host=pfSense.home temperature=49.0
sysctl,sensor=cpu0,host=pfSense.home temperature=49.0
sysctl,sensor=tz1,host=pfSense.home temperature=29.9
sysctl,sensor=tz0,host=pfSense.home temperature=27.9
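For anyone wiring this up, the telegraf side is just an exec input pointing at whatever script emits those lines. A rough sketch, assuming the combined script is saved as /usr/local/bin/telegraf_sysctl.sh (the path is only an example); on pfSense you would normally paste the stanza into the Telegraf package's additional configuration box rather than editing the file directly:
# append an exec input that runs the script and parses its output as influx line protocol
cat >> /usr/local/etc/telegraf.conf <<'EOF'
[[inputs.exec]]
  commands = ["/usr/local/bin/telegraf_sysctl.sh"]
  timeout = "5s"
  data_format = "influx"
EOF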
-
@victorrobellini said in Grafana Dashboard using Telegraf with additional plugins:
sysctl dev.cpu | fgrep "freq:" | tr -d '[:blank:]' | awk -v HOST="$HOSTNAME" -F '[.:]' '{print "temperature,sensor="$2$3",host="HOST" "$4"="$5""substr($7, 1, length($7)-1)}'
That worked like a charm, thank you so much.
-
Anyone else having problems with Uptime displaying?
Mine just shows N/A
-
@jpetovello This is what I have for my query
SELECT "uptime_format" FROM "system"
-
@jpetovello There's an entire section of the README dedicated to troubleshooting. Please review the docs and report your findings.
- Check that telegraf is actually able to collect the info (first half of the troubleshooting section)
- Check what is being stored in influx (second half of the troubleshooting section) - see the sketch below
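A minimal sketch of both checks, assuming the config path used earlier in the thread and a database named pfsense (adjust to your setup):
# 1) Can telegraf collect the info at all?
telegraf --test --config /usr/local/etc/telegraf.conf 2>/dev/null | grep 'system,'
# 2) Is it actually landing in influx? (InfluxDB 1.x CLI shown)
influx -database 'pfsense' -execute 'SELECT * FROM "system" ORDER BY time DESC LIMIT 5'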
-
@wrightsonm My influx sits around 6GB. Every now and then I start reading up on Influx retention and downsampling, but I end up just truncating my db since it's faster.
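If I ever do commit to retention, the lazy version is probably just shortening the default policy rather than proper downsampling. A sketch only (1.x syntax; the database and policy names are assumptions based on my setup):
# new writes roll off after 30 days; older shards get dropped as they expire
influx -database 'pfsense' -execute 'ALTER RETENTION POLICY "autogen" ON "pfsense" DURATION 30d'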
-
@seamonkey said in Grafana Dashboard using Telegraf with additional plugins:
@victorrobellini 7.1.1, but I wouldn't think that would make a difference regarding my particular issue. Your design applies the value mapping function to all fields within the panel, but instead it should only be applied to the Status column exclusively under the Override settings. I've fixed this on my own dashboard and I started to attempt a pull request to fix it, but didn't have the patience to manually adjust only the relevant lines in your JSON file and reimport for testing.
I just updated my dashboard and this is still an issue. Anyone with interfaces that have a MAC address starting with 00 will have 'DOWN' listed as their physical address.
-
@seamonkey What does the data in influx show?
Good point about the value mapping and threshold. I've updated the interfaces widget to apply thresholds and value mapping to just the Status column.
-
@victorrobellini said in Grafana Dashboard using Telegraf with additional plugins:
What does the data in influx show?
Sounds like you got it fixed, but just in case it's still relevant...
> select * from interface where mac_address!='Unavailable' limit 2
name: interface
time friendlyname host ip4_address ip4_subnet ip6_address ip6_subnet ip_address mac_address name source status
---- ------------ ---- ----------- ---------- ----------- ---------- ---------- ----------- ---- ------ ------
1607846650000000000 LAN fallia.thegalaxy 192.168.0.1 00:15:17:xx:xx:xx em0 pfconfig 1
1607846650000000000 WAN fallia.thegalaxy 00.00.00.00 00:15:17:xx:xx:xx em1 pfconfig 1
-
@seamonkey
The data looks good. I've already committed the update that isolates the value mapping/thresholds to Status. Is your IP really null?
-
@victorrobellini The mac addresses are there, I've just censored the last three hex pairs for the sake of privacy. Same for the WAN IP. There's nothing wrong with the influx data, and as I mentioned previously, I was able to fix the problem by moving the value mappings under the Field tab to the Overrides tab in order to explicitly apply them to the Status field - which it sounds like you've implemented in the latest update.
- 25 days later
-
@victorrobellini I've done a bit of investigation into the Series Cardinality of the database.
We changed the ip_block_log grok pattern to tag more fields, and as a result the cardinality of the database increased significantly. The downside is that querying the database to show the new pfBlocker detail section became very RAM intensive. I was using > 20GB RAM in influxdb to display the last 10 mins on the Grafana dashboard. I had an OOM (Out of Memory) issue that crashed my Docker VM, so at this point I dropped the entire measurement and influx memory usage looked much happier again.
After 2 days of collecting new pfblocker data, I looked at the Cardinality of the database using this query (I am using Influx 2.0.4):
import "influxdata/influxdb/v1" cardinalityByTag = (bucket) => v1.tagKeys(bucket: bucket) |> map(fn: (r) => ({ tag: r._value, _value: if contains(set: ["_stop","_start"], value:r._value) then 0 else (v1.tagValues(bucket: bucket, tag: r._value) |> count() |> findRecord(fn: (key) => true, idx: 0))._value })) |> group(columns:["tag"]) |> sum() |> keep(columns: ["tag","_value"]) cardinalityByTag(bucket: "pfsense")
(Whilst Influx 2 does have a cardinality function, it is currently only available in the Cloud variant, not the OSS variant. The above function does the job though.)
Cardinality was 34540! Influx is currently using 6GB RAM at this level.
Breaking this down, it is attributed to:
- src_port: 15780
- dest_port: 10603
- src_ip: 7980
- other metrics (not many)
I have now changed my telegraf config to tag less stuff. For now I've untagged src_ip, dest_ip, src_port and dest_port:
grok_patterns = ["^%{SYSLOGTIMESTAMP:timestamp:ts-syslog},%{NUMBER:rulenum},%{DATA:interface},%{WORD:friendlyname},%{WORD:action},%{NUMBER:ip_version},%{NUMBER:protocolid},%{DATA:protocol:tag},%{IPORHOST:src_ip},%{IPORHOST:dest_ip},%{WORD:src_port},%{NUMBER:dest_port},%{WORD:direction},%{WORD:geoip_code:tag},%{DATA:ip_alias_name},%{DATA:ip_evaluated},%{DATA:feed_name:tag},%{HOSTNAME:resolvedhostname},%{GREEDYDATA:clienthostname},%{GREEDYDATA:ASN},%{GREEDYDATA:duplicateeventstatus}"]
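To check what the reduced tag set looks like once it's in the database, a quick peek via the v2 CLI works (this assumes the CLI is already configured with your org and token):
# show a handful of recent block log entries; tags show up as group-key columns
influx query 'from(bucket: "pfsense") |> range(start: -1h) |> filter(fn: (r) => r._measurement == "tail_ip_block_log") |> limit(n: 5)'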
The next step is to look into rewriting the dashboard to perform the required grouping and aggregation.
I've started with this table:
IP - Top 10 Blocked - IN (By Host/Port)
Original query:
SELECT TOP("blocked",10),src_ip,dest_ip, protocol FROM ( SELECT count("action") as "blocked" FROM "autogen"."tail_ip_block_log" WHERE ("host" =~ /^$Host$/ AND "action" = 'block' AND "direction" = 'in' ) AND $timeFilter GROUP BY src_ip,dest_ip,protocol )
New V2 Query using Flux:
from(bucket: "pfsense") |> range(start: v.timeRangeStart, stop: v.timeRangeStop) |> filter(fn: (r) => r["_measurement"] == "tail_ip_block_log") |> filter(fn: (r) => r["_field"] == "src_ip" or r["_field"] == "dest_ip" or r["_field"] == "dest_port" or r["_field"] == "action" or r["_field"] == "direction" or r["_field"] == "protocolid" or r["_field"] == "host") |> pivot(rowKey:["_time"], columnKey: ["_field"], valueColumn: "_value") |> filter(fn: (r) => r.host =~ /^.*$/ and r.action == "block" and r.direction == "in" and r.protocolid =~ /^(6$|17$)/) |> group(columns:["src_ip","dest_ip","dest_port"]) |> rename(columns:{action: "Blocked"}) |> count(column: "Blocked") |> group() //use group to ungroup data and return to a single table |> top(n:10, columns: ["Blocked"]) |> sort(columns: ["Blocked"], desc: true) |> yield()
This is as far as I have got for the time being.
Thought I would share this before I got too far into updating the dashboard.
I suspect I won't get a good view of the performance improvement until I've redone all of the pfBlocker Details section.
Also raising this now as it involves an update from Influx v1 to v2 to support the flux language.
Once you get used to flux, it is really quite powerful.
The old InfluxQL language can still be used with v2; they are backwards compatible.
Your thoughts & opinions are appreciated.
- 8 months later
-
Is anyone else noticing broken pfBlocker panels with the latest pfBlocker update?
It looks like $timeFilter makes the panel show no data for some reason. I tried messing with it with no success, apart from removing the time filter altogether, which isn't really a usable solution.
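For anyone who wants to compare, this is roughly the sanity check I'd run outside of Grafana to take the time filter out of the equation (1.x CLI shown; the database name is an assumption, swap in yours):
influx -database 'pfsense' -execute "SELECT count(\"action\") FROM \"tail_ip_block_log\" WHERE \"action\" = 'block' AND time > now() - 1h"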