Netgate Discussion Forum

    Temporary recurrent selective loss of traffic

    General pfSense Questions
    • clarknova

      pfsense 2.1 i386
      Soekris net5501-70

      One of pfsense's gateways is at the far end of a 42km wireless link. The RRD quality graph (attached) shows spikes of packet loss at approximately 40-minute intervals, but with no increase in latency. The WAN graph shows no corresponding packet loss. In fact, latency on the WAN is fairly static and packet loss stays close to 0.

      To determine what was going on with this other gateway, I jumped on the shell and started pinging the gateway. I also kept the GUI open to the RRD quality graph for this interface, hoping to see a correspondence. Unfortunately, having tried this a couple of times, at the moment when I was expecting to see packet loss, both the ssh shell and the web UI stopped responding. I had to kill the ssh session and reload the web page, at which point the RRD graph was updated and had a new spike showing packet loss. I immediately resumed pinging the gateway upon reestablishing the ssh connection and saw no packet loss or unusual latency.

      So whatever is causing the reported packet loss appears to be associated with me being disconnected from the WAN on both ssh and ssl.

      I have attached the processor graph as well, although I don't see anything that looks correlated. When I look at 'top', it looks to me like 'yarrow' and 'check_reload_status' are showing unexpectedly high times.

      last pid: 53857;  load averages:  0.16,  0.18,  0.18    up 1+12:53:05  11:38:51
      99 processes:  2 running, 83 sleeping, 14 waiting
      CPU:  0.0% user,  0.0% nice,  3.5% system,  4.7% interrupt, 91.8% idle
      Mem: 75M Active, 34M Inact, 94M Wired, 96K Cache, 59M Buf, 279M Free
      Swap:

        PID USERNAME  PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
         10 root      171 ki31     0K     8K RUN     31.7H 90.97% idle
         11 root      -68    -     0K   112K WAIT    35:07  0.98% intr{irq5: vr1}
         11 root      -68    -     0K   112K WAIT    46:40  0.00% intr{irq11: vr0}
         13 root      -16    -     0K     8K -       24:47  0.00% yarrow
         11 root      -68    -     0K   112K WAIT    18:06  0.00% intr{irq12: vr3}
        295 root       76   20  3352K  1152K kqread  15:11  0.00% check_reload_status
         11 root      -68    -     0K   112K WAIT    12:57  0.00% intr{irq9: vr2}
          0 root      -68    0     0K    72K -       11:01  0.00% kernel{dummynet}
         11 root      -32    -     0K   112K WAIT    10:27  0.00% intr{swi4: clock}
      43561 root       44    0  3412K  1412K select   7:10  0.00% syslogd
      

      Can anybody offer a likely explanation for the reported packet loss? I will be checking the wireless monitoring as well, but it's temporarily down at the moment because of some recent changes to internal routing.
      clockwork.PNG
      proc.PNG

      db

      • clarknova

        Here are some interesting points to add to the discussion (monologue? soliloquy? :P)

        1. You'll note in one of the Quality graphs above that the packet loss bars increase in size until Wednesday afternoon, then reset and start growing from nothing. Interestingly, I see the exact same trend for Thursday, where the loss dropped to zero Thursday afternoon and then started growing at regular 40-minute intervals.

        2. I tried an experiment where I started pinging from three different hosts: 1) from an internet host to the WAN of the pfsense in question, 2) from an internal host to the internal interface of the pfsense in question, and 3) from the pfsense in question (via ssh) to the ISP's default gateway. After about 227 seconds the ssh session stopped responding so I immediately opened a new one and started the experiment again. Ping sequences 2 and 3 were started within 14 seconds of each other. After 1139 seconds the ssh session (3) stopped responding and hasn't begun responding after several minutes. The external host (1) saw no packet loss, and the internal host (2) received no response commencing 34 seconds after the ssh session hung, and then began seeing responses again after 49 seconds of none. At the same time, I had the WAN traffic graph displayed in a browser. The graph stopped updating around the same time, and then resumed a few seconds later with the traffic recommencing from 0.

        So whatever the cause of the problem, it appears to be hanging an ssh session on the WAN and suppressing all traffic on an internal interface. Meanwhile, all other traffic continues normally on the WAN and other internal interfaces.
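For anyone repeating this kind of multi-host ping experiment, here is a minimal sketch of how the per-host logs could be reduced to outage windows so they can be lined up against each other. It assumes each log has already been turned into `(timestamp, got_reply)` pairs; the function name `outage_windows` and the sample data are hypothetical and synthetic, not taken from the measurements above.

```python
from typing import List, Tuple


def outage_windows(samples: List[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (first_lost, last_lost) timestamps for each run of failed probes."""
    windows = []
    start = None  # timestamp of the first lost probe in the current run
    last = None   # timestamp of the most recent lost probe in the current run
    for t, ok in sorted(samples):
        if not ok:
            if start is None:
                start = t
            last = t
        elif start is not None:
            # A reply arrived, so the current run of losses has ended.
            windows.append((start, last))
            start = None
    if start is not None:
        # The log ended while probes were still being lost.
        windows.append((start, last))
    return windows


# Synthetic log: one probe per second, replies missing for seconds 10 through 14.
samples = [(float(t), not (10 <= t <= 14)) for t in range(20)]
print(outage_windows(samples))
```

Running the same function over the log from each vantage point (external host, internal host, the firewall itself) and comparing the windows would show which paths dropped at the same moment and which stayed up.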

        I'm at a loss to explain it.

        db

        • clarknova

          I finally got this. The AP on the backhaul has a ping watchdog that was not getting a response from pfsense due to a firewall ruleset oversight. I created an ICMP pass rule for the AP and I expect that will be the end of the packet loss on that interface.

          I'm not sure how that explains the ssh lockups on the WAN. I'll know shortly if those are resolved.

          db

          • stephenw10 (Netgate Administrator)

            Is it resetting all the firewall states when an interface goes down and then comes back up? Not sure quite why that would happen, maybe if you have a downstream gateway.

            Those loss spikes are disturbingly regular!

            Steve

            • clarknova

              The AP was rebooting itself every 20 minutes. I was thrown off the trail by the fact that the packet loss was showing up every 40 minutes, and that the rate of loss didn't appear consistent, except in chunks of 24 hours. The latter can be explained by rounding, since the rrd samples are 5 minutes, while the down time was less than a minute. I don't know how to explain the fact that every second outage was not manifest in the rrd graph though.
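The rounding effect can be sketched roughly as follows. This is only an illustration under stated assumptions: one probe per second (an assumption, not the monitor's actual probe rate), 300-second buckets to match the 5-minute rrd samples, and a 49-second outage borrowed from the earlier ping experiment. The function name `loss_per_bucket` is hypothetical.

```python
from collections import Counter
from typing import Dict


def loss_per_bucket(outage_start: int, outage_len: int,
                    bucket: int = 300, probe_interval: int = 1) -> Dict[int, float]:
    """Map bucket index -> fraction of that bucket's probes lost to one outage.

    Probes are assumed to fire at integer multiples of probe_interval seconds.
    """
    probes_per_bucket = bucket // probe_interval
    lost = Counter()
    # First probe time at or after the start of the outage (ceiling division).
    first = -(-outage_start // probe_interval) * probe_interval
    for t in range(first, outage_start + outage_len, probe_interval):
        lost[t // bucket] += 1
    return {b: n / probes_per_bucket for b, n in sorted(lost.items())}


# Outage entirely inside one 5-minute bucket: 49 of 300 probes lost (about 16%).
print(loss_per_bucket(100, 49))
# Same outage straddling a bucket boundary: the loss is split into two smaller bars.
print(loss_per_bucket(280, 49))
```

So the same sub-minute outage can show up as one taller bar or two shorter ones depending on where it falls relative to the 5-minute boundaries. This sketch does not account for every second outage being absent from the graph.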

              As for the ssh hanging, you're right, I didn't have the box checked to override state killing on gateway failure, so pfsense was killing all states when that backhaul went down.

              db

              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.