dpinger and ISP packet loss

fireix

I'm trying to find out why my Internet connection went down for about 3 minutes today, so I can take steps to fix it if it is something on my end. I'm still hoping it was my ISP and am waiting for their report, but Pingdom, which measures uptime from around the world, didn't record any downtime on ISP_IP even though the outage lasted for minutes.

Pingdom's checks against the IP of my pfSense WAN behind the ISP_GW, however, detected 100% packet loss at the same time.

My ISP in the data center has provided a fiber box at ISP_IP, so that's my gateway to the Internet. As you can see, dpinger on pfSense recorded packet loss against the ISP gateway, and it lasted for some minutes (this happens very rarely; the previous time was 7 months ago). Could anything on my side (the pfSense config) have caused this packet loss/downtime? I was asleep, so I wasn't actively doing anything at the time, but if this is a sign of something going on with my pfSense, it would be nice to know.

For example, could some internal routing issue in pfSense take it down temporarily, or is dpinger reliable enough in this respect that I can assume the fault is in the ISP's equipment?

Oct 31 01:00:36 fw1 dpinger[2268]: GW_WAN_2 ISP_IP: Alarm latency 668us stddev 2892us loss 22%
Apr 13 10:29:39 fw1 dpinger[34918]: GW_WAN_2 ISP_IP: Alarm latency 2300us stddev 8630us loss 22%
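
To compare alarms like the two above across a longer gateway log, the numbers can be pulled out of each line with a small script. This is a minimal sketch that assumes the log format is exactly as shown above; the function name `parse_dpinger_alarm` and the field names are made up for illustration.

```python
import re

# Matches dpinger alarm lines in the format shown in the post above.
ALARM_RE = re.compile(
    r"dpinger\[\d+\]: (?P<gateway>\S+) (?P<target>\S+): "
    r"Alarm latency (?P<latency_us>\d+)us "
    r"stddev (?P<stddev_us>\d+)us loss (?P<loss_pct>\d+)%"
)


def parse_dpinger_alarm(line):
    """Return a dict of alarm fields, or None if the line is not an alarm."""
    match = ALARM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("latency_us", "stddev_us", "loss_pct"):
        fields[key] = int(fields[key])
    return fields


line = (
    "Apr 13 10:29:39 fw1 dpinger[34918]: GW_WAN_2 ISP_IP: "
    "Alarm latency 2300us stddev 8630us loss 22%"
)
alarm = parse_dpinger_alarm(line)
print(alarm["gateway"], alarm["latency_us"] / 1000, "ms,", alarm["loss_pct"], "% loss")
```

Note that the latency and standard deviation are logged in microseconds, so both alarms above show low latency (a few milliseconds at most); it is the 22% loss that stands out.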

viragomann

@fireix
Also check the system log for hints around that time. Maybe the network connection went down temporarily for some reason.

fireix

@viragomann It happened again just now, at exactly the same time as yesterday, and this time there was actually a crash report waiting for me: pfSense had rebooted. I haven't had an error like this for two or three years (back then it happened about once a month), and now it has suddenly started again. Last time, it was solved by breaking up an LACP LAG. Not that the log helps me understand anything.

I have checked the system logs, IPsec logs, gateway logs and every single log file in the /logs directory, and nothing was going on in the seconds before the crash. It just appeared out of the blue. The firewall has very low traffic. Weird that it happened at the exact same time two days ago.

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x18
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e0fcc4
stack pointer = 0x0:0xfffffe00004d6800
frame pointer = 0x0:0xfffffe00004d6830
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (if_io_tqg_2)
trap number = 12
panic: page fault
cpuid = 2
time = 1681804483
KDB: enter: panic
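
The `time =` field in the panic message can be converted to a readable date to line the crash up with the system log. A minimal sketch, assuming the value is a Unix timestamp in seconds:

```python
from datetime import datetime, timezone

# "time =" value copied from the crash report above;
# assumed to be seconds since the Unix epoch.
panic_epoch = 1681804483

panic_utc = datetime.fromtimestamp(panic_epoch, tz=timezone.utc)
print(panic_utc.isoformat())  # 2023-04-18T07:54:43+00:00
```

The result is in UTC. If the firewall's clock runs at UTC+2 (an assumption; the timezone is not stated in this thread), that corresponds to 09:54:43 local time, which would fall inside the silent stretch of the system log excerpt in this post, between 09:50:00 and 10:09:29.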

System logs only have this before it rebooted/crashed:

Apr 18 08:48:00 fw1 sshguard[24159]: Now monitoring attacks.
Apr 18 09:04:00 fw1 sshguard[24159]: Exiting on signal.
Apr 18 09:04:00 fw1 sshguard[39953]: Now monitoring attacks.
Apr 18 09:19:00 fw1 sshguard[39953]: Exiting on signal.
Apr 18 09:19:00 fw1 sshguard[27967]: Now monitoring attacks.
Apr 18 09:34:00 fw1 sshguard[27967]: Exiting on signal.
Apr 18 09:34:00 fw1 sshguard[13151]: Now monitoring attacks.
Apr 18 09:36:00 fw1 sshguard[13151]: Exiting on signal.
Apr 18 09:36:00 fw1 sshguard[38117]: Now monitoring attacks.
Apr 18 09:50:00 fw1 sshguard[38117]: Exiting on signal.
Apr 18 09:50:00 fw1 sshguard[50187]: Now monitoring attacks.
Apr 18 10:09:29 fw1 syslogd: kernel boot file is /boot/kernel/kernel
Apr 18 10:09:29 fw1 kernel: ---<<BOOT>>---
Apr 18 10:09:29 fw1 kernel: Copyright (c) 1992-2021 The FreeBSD Project.
Apr 18 10:09:29 fw1 kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
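
To bracket the crash in a longer log, one option is to look for the longest silent gap between consecutive lines. This is a rough sketch, not a pfSense tool: the helper names `parse_stamp` and `largest_gap` are hypothetical, and because syslog timestamps carry no year, the year (2023 here) is an assumption that has to be supplied by hand.

```python
from datetime import datetime


def parse_stamp(line, year):
    # Syslog lines start with "Mon DD HH:MM:SS" (15 characters) and no year,
    # so the caller supplies the year explicitly.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year):
    """Return the (before, after) timestamps around the longest silent gap.

    Expects at least two log lines in chronological order.
    """
    stamps = [parse_stamp(line, year) for line in lines]
    pairs = zip(stamps, stamps[1:])
    return max(pairs, key=lambda pair: pair[1] - pair[0])


lines = [
    "Apr 18 09:50:00 fw1 sshguard[38117]: Exiting on signal.",
    "Apr 18 09:50:00 fw1 sshguard[50187]: Now monitoring attacks.",
    "Apr 18 10:09:29 fw1 syslogd: kernel boot file is /boot/kernel/kernel",
    "Apr 18 10:09:29 fw1 kernel: ---<<BOOT>>---",
]
before, after = largest_gap(lines, 2023)
print("silent from", before.time(), "to", after.time(), "=", after - before)
```

On a quiet firewall the longest gap is not necessarily the crash, so the result is only a starting point and should be checked against the boot marker.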

jaspery

I'm not sure the problem I had previously is similar to yours, but I'll share my solution anyway.

I noticed frequent dpinger restarts in the logs at times of heavy network load, especially upload.

I made sure my setup was correct and that there was no misconfiguration that could cause this. I was also pretty sure my ISP gateway was not actually going down.

So my hypothesis was that under heavy load dpinger simply couldn't ping the gateway, because the ping traffic was stuck in queues.

I didn't want to disable gateway pinging altogether. On the other hand, I learned about the latency threshold configuration on the gateway screen in the WebGUI (System / Routing / <My Gateway>).

Basically, dpinger logs a warning when it sees that the ping delay reaches the "Low Threshold" value, and when the delay reaches the "High Threshold" it decides to restart itself, the firewall, and who knows what other services. Apparently, the Internet connection failures in my case were caused by these restarts.

I experimented with different threshold setups for my environment and finally settled on Low=600 and High=900. I haven't seen any Internet failures since then (more than a week; previously, as I said, it could happen a few times a day under heavy load).
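
The threshold behaviour described above can be sketched as a simple comparison. This only models the description in this post; it is not dpinger's or pfSense's actual code. The function name `classify_latency`, the state labels, and the assumption that the thresholds are in milliseconds are all illustrative.

```python
def classify_latency(latency_ms, low_ms=600, high_ms=900):
    """Classify a measured gateway latency against two thresholds.

    Defaults are the Low/High values quoted in the post above,
    assumed here to be milliseconds.
    """
    if latency_ms >= high_ms:
        return "alarm"    # "High Threshold" reached
    if latency_ms >= low_ms:
        return "warning"  # "Low Threshold" reached
    return "ok"


for sample_ms in (120, 650, 950):
    print(sample_ms, "ms ->", classify_latency(sample_ms))
```

Raising both thresholds widens the "ok" band, so short latency spikes under heavy upload no longer trigger an alarm. The trade-off is that a gateway that really is degraded is also flagged later.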

fireix

This IPv6 issue, which was resolved in the bug tracker just a month ago, actually shows the same fatal trap 12. Does anyone know whether it could be related to what I'm seeing, or whether it could be something totally different? I reported mine as a possible bug with all the logs, but it was rejected because I wasn't on the latest dev build.

https://redmine.pfsense.org/issues/14077

fireix

@jaspery Based on my second episode, the one with the crash, I suspect it was the crash that caused dpinger to fail (in this case).