Temporary recurrent selective loss of traffic

clarknova

pfsense 2.1 i386
Soekris net5501-70

One of pfsense's gateways is at the far end of a 42km wireless link. The RRD quality graph (attached) shows spikes of packet loss at approximately 40 minute intervals but with no increase in latency. The WAN graph shows no corresponding packet loss. In fact, latency on the WAN is fairly static and packet loss stays close to 0.

To determine what was going on with this other gateway, I jumped on the shell and started pinging the gateway. I also kept the GUI open to the RRD quality graph for this interface, hoping to see a correspondence. Unfortunately, having tried this a couple of times, at the moment when I was expecting to see packet loss, both the ssh shell and the web UI stopped responding. I had to kill the ssh session and reload the web page, at which point the RRD graph was updated and had a new spike showing packet loss. I immediately resumed pinging the gateway upon reestablishing the ssh connection and saw no packet loss or unusual latency.

So whatever is causing the reported packet loss appears to be associated with me being disconnected from the WAN on both ssh and ssl.

I have attached the processor graph as well, although I don't see anything that looks correlated. When I look at 'top', it looks to me like 'yarrow' and 'check_reload_status' are showing unexpectedly high times.

last pid: 53857;  load averages:  0.16,  0.18,  0.18                                       up 1+12:53:05  11:38:51
99 processes:  2 running, 83 sleeping, 14 waiting
CPU:  0.0% user,  0.0% nice,  3.5% system,  4.7% interrupt, 91.8% idle
Mem: 75M Active, 34M Inact, 94M Wired, 96K Cache, 59M Buf, 279M Free
Swap:

  PID USERNAME  PRI NICE   SIZE    RES STATE     TIME   WCPU COMMAND
   10 root      171 ki31     0K     8K RUN      31.7H 90.97% idle
   11 root      -68    -     0K   112K WAIT     35:07  0.98% intr{irq5: vr1}
   11 root      -68    -     0K   112K WAIT     46:40  0.00% intr{irq11: vr0}
   13 root      -16    -     0K     8K -        24:47  0.00% yarrow
   11 root      -68    -     0K   112K WAIT     18:06  0.00% intr{irq12: vr3}
  295 root       76   20  3352K  1152K kqread   15:11  0.00% check_reload_status
   11 root      -68    -     0K   112K WAIT     12:57  0.00% intr{irq9: vr2}
    0 root      -68    0     0K    72K -        11:01  0.00% kernel{dummynet}
   11 root      -32    -     0K   112K WAIT     10:27  0.00% intr{swi4: clock}
43561 root       44    0  3412K  1412K select    7:10  0.00% syslogd

Can anybody offer a likely explanation for the reported packet loss? I will be checking the wireless monitoring as well, but it's temporarily down at the moment because of some recent changes to internal routing.

clockwork.PNG_thumb

proc.PNG_thumb

clarknova

Here are some interesting points to add to the discussion (monologue? soliloquy? :P)

1. You'll note in one of the Quality graphs above that the packet loss bars increase in size until Wednesday afternoon, then reset and start growing from nothing. Interestingly, I see the exact same trend for Thursday, where the loss dropped to zero Thursday afternoon and then started growing at regular 40-minute intervals.

2. I tried an experiment where I started pinging from three different hosts: 1) from an internet host to the WAN of the pfsense in question, 2) from an internal host to the internal interface of the pfsense in question, and 3) from the pfsense in question (via ssh) to the ISP's default gateway. After about 227 seconds the ssh session stopped responding so I immediately opened a new one and started the experiment again. Ping sequences 2 and 3 were started within 14 seconds of each other. After 1139 seconds the ssh session (3) stopped responding and hasn't begun responding after several minutes. The external host (1) saw no packet loss, and the internal host (2) received no response commencing 34 seconds after the ssh session hung, and then began seeing responses again after 49 seconds of none. At the same time, I had the WAN traffic graph displayed in a browser. The graph stopped updating around the same time, and then resumed a few seconds later with the traffic recommencing from 0.

So whatever the cause of the problem, it appears to be hanging an ssh session on the WAN and suppressing all traffic on an internal interface. Meanwhile, all other traffic continues normally on the WAN and other internal interfaces.

I'm at a loss to explain it.
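For anyone repeating the three-way ping experiment above, it may help to save each ping's output to a file and look for gaps in the sequence numbers afterwards, rather than watching the terminals live. Below is a minimal Python sketch that does this. It assumes the usual "icmp_seq=N" field in each reply line; the function name find_gaps and the sample lines (using the documentation address 192.0.2.1) are made up for illustration, and sequence-number wraparound is not handled.

```python
import re

# Matches the sequence number in a normal ping reply line.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")

def find_gaps(lines):
    """Return a list of (first_missing, last_missing) icmp_seq ranges."""
    gaps = []
    prev = None
    for line in lines:
        m = SEQ_RE.search(line)
        if m is None:
            continue  # header, summary, or other non-reply line
        seq = int(m.group(1))
        if prev is not None and seq > prev + 1:
            gaps.append((prev + 1, seq - 1))
        prev = seq
    return gaps

# Hypothetical capture: replies 2 through 4 never arrived.
sample = [
    "PING 192.0.2.1 (192.0.2.1): 56 data bytes",
    "64 bytes from 192.0.2.1: icmp_seq=0 ttl=64 time=1.2 ms",
    "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=1.1 ms",
    "64 bytes from 192.0.2.1: icmp_seq=5 ttl=64 time=1.3 ms",
]
print(find_gaps(sample))  # [(2, 4)]
```

With one ping per second, the width of each gap is roughly the outage length in seconds, which makes it easier to line up the three vantage points against each other.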

clarknova

I finally got this. The AP on the backhaul has a ping watchdog that was not getting a response from pfsense due to a firewall ruleset oversight. I created an ICMP pass rule for the AP and I expect that will be the end of the packet loss on that interface.

I'm not sure how that explains the ssh lockups on the WAN. I'll know shortly if those are resolved.

stephenw10

Is it resetting all the firewall states when an interface goes down then comes back up? Not sure quite why that would happen, maybe if you have a downstream gateway.

Those loss spikes are disturbingly regular!

Steve

clarknova

The AP was rebooting itself every 20 minutes. I was thrown off the trail by the fact that the packet loss was showing up every 40 minutes, and that the rate of loss didn't appear consistent, except in chunks of 24 hours. The latter can be explained by rounding, since the rrd samples are 5 minutes, while the down time was less than a minute. I don't know how to explain the fact that every second outage was not manifest in the rrd graph though.
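To put rough numbers on the rounding effect, here is a small Python sketch of how a short outage gets diluted when loss is averaged over a 5-minute sample. The 45-second outage length is an assumption (the only known figure is "less than a minute"), the function name loss_per_bucket is made up, and this models simple time averaging rather than whatever pfsense actually does internally. It does not explain the missing every-second outage.

```python
def loss_per_bucket(outage_starts, outage_len, bucket_len=300, total=3600):
    """Fraction of each sample bucket spent in an outage (times in seconds)."""
    loss = [0.0] * (total // bucket_len)
    for start in outage_starts:
        end = min(start + outage_len, total)
        t = start
        while t < end:
            idx = t // bucket_len
            bucket_end = (idx + 1) * bucket_len
            # Only count the part of the outage inside this bucket.
            overlap = min(end, bucket_end) - t
            loss[idx] += overlap / bucket_len
            t += overlap
    return loss

# Reboots every 20 minutes (1200 s), assumed 45 s of downtime each.
hourly = loss_per_bucket([100, 1300, 2500], 45)
print(round(max(hourly), 3))  # 0.15

# The same outage straddling a sample boundary is split across two buckets.
split = loss_per_bucket([280], 45)
print([round(x, 3) for x in split[:2]])  # [0.067, 0.083]
```

So under these assumptions a full outage shows up as at most about 15% loss in one sample, and as two smaller bars when it falls across a sample boundary, which is one way the bar heights could vary even though the downtime itself stays the same.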

As for the ssh hanging, you're right, I didn't have the box checked to override state killing on gateway failure, so pfsense was killing all states when that backhaul went down.