Tracing reason for brief 50% packet loss, lag, and connection termination



  • I have a machine and PFsense platform (ADSL) that should be able to cope with any sane demand placed on it, but at times doesn't, in a specific situation I get insane lag and packet loss, and the link is reset.  I'm not sure how to fault-trace the issue(s).  I think the issue is down to my ISP connection or PFsense. I really need ideas.

    The setup here is simple and high spec, and should be fine with a burst of (pretty modest) peak traffic here.  But for reasons unknown it isn't.

    Faced with a modest burst of http and dns requests, PFsense RRD>Quality shows that WAN RTT usually ~30ms increases to ~200 - 250ms and packet loss rises from ~ 0.6% to about 35 - 40%. That alone suggests the issue arises with PFsense or maybe the connection, because I don't think that would be affected by anything on the LAN or PC.  After a while the PPP connection resets.  PFsense has not shown any sign of overload (memory, CPU, sessions etc seem to remain normal and low throughout).  I have no idea why any of this happens.

    DATA BELOW - hardware/ISP/connection/config info, and PFsense system + PPP logs.

    Any ideas?

    SYSTEM SETUP AND CONFIG INFO

    WHAT HAPPENS

    RELEVANT SNIP OF PFSENSE SYSTEM LOG

    (Nothing for previous 30 mins)
    00:43:15 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #0 (Opened)
    00:43:15 pfsense ppp: [wan_link0] LCP: state change Opened --> Stopping
    00:43:15 pfsense ppp: [wan_link0] Link: Leave bundle "wan"
    00:43:15 pfsense ppp: [wan] Bundle: Status update: up 0 links, total bandwidth 9600 bps
    00:43:15 pfsense ppp: [wan] IPCP: Close event
    00:43:15 pfsense ppp: [wan] IPCP: state change Opened --> Closing
    00:43:15 pfsense ppp: [wan] IPCP: SendTerminateReq #129
    00:43:15 pfsense ppp: [wan] IPCP: LayerDown
    00:43:15 pfsense ppp-linkdown: Removing states from [PFSENSE PUBLIC IP]
    00:43:15 pfsense ppp-linkdown: Removing states to [ISP GATEWAY IP]
    00:43:15 pfsense check_reload_status: Rewriting resolv.conf
    00:43:15 pfsense ppp: [wan] IFACE: Down event
    00:43:15 pfsense ppp: [wan] IPCP: Down event
    00:43:15 pfsense ppp: [wan] IPCP: LayerFinish
    00:43:15 pfsense ppp: [wan] Bundle: No NCPs left. Closing links...
    00:43:15 pfsense ppp: [wan] IPCP: state change Closing --> Initial
    00:43:15 pfsense ppp: [wan_link0] LCP: SendTerminateAck #71
    00:43:15 pfsense ppp: [wan_link0] LCP: LayerDown
    00:43:15 pfsense ppp: [wan_link0] rec'd proto IP during terminate phase
    00:43:16 pfsense ppp: last message repeated 22 times
    00:43:16 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Stopping)
    00:43:16 pfsense ppp: [wan_link0] LCP: SendTerminateAck #72
    00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopping --> Stopped
    00:43:17 pfsense ppp: [wan_link0] LCP: LayerFinish
    00:43:17 pfsense ppp: [wan_link0] PPPoE: connection closed
    00:43:17 pfsense ppp: [wan_link0] Link: DOWN event
    00:43:17 pfsense ppp: [wan_link0] LCP: Down event
    00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopped --> Starting
    00:43:17 pfsense ppp: [wan_link0] LCP: LayerStart
    00:43:17 pfsense ppp: [wan_link0] Link: reconnection attempt 1 in 1 seconds
    00:43:18 pfsense ppp: [wan_link0] Link: reconnection attempt 1
    00:43:18 pfsense ppp: [wan_link0] PPPoE: Connecting to ''
    00:43:18 pfsense ppp: PPPoE: rec'd ACNAME "[ROUTERNAME] PPPoE"
    00:43:18 pfsense ppp: [wan_link0] PPPoE: connection successful
    00:43:18 pfsense ppp: [wan_link0] Link: UP event
    00:43:18 pfsense ppp: [wan_link0] LCP: Up event
    00:43:18 pfsense ppp: [wan_link0] LCP: state change Starting --> Req-Sent
    

    RELEVANT SNIP OF PFSENSE PPP LOG

    00:00:08 pfsense ppp: [wan] IFACE: Up event
    00:43:15 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #0 (Opened)
    00:43:15 pfsense ppp: [wan_link0] LCP: state change Opened --> Stopping
    00:43:15 pfsense ppp: [wan_link0] Link: Leave bundle "wan"
    00:43:15 pfsense ppp: [wan] Bundle: Status update: up 0 links, total bandwidth 9600 bps
    00:43:15 pfsense ppp: [wan] IPCP: Close event
    00:43:15 pfsense ppp: [wan] IPCP: state change Opened --> Closing
    00:43:15 pfsense ppp: [wan] IPCP: SendTerminateReq #129
    00:43:15 pfsense ppp: [wan] IPCP: LayerDown
    00:43:15 pfsense ppp: [wan] IFACE: Down event
    00:43:15 pfsense ppp: [wan] IPCP: Down event
    00:43:15 pfsense ppp: [wan] IPCP: LayerFinish
    00:43:15 pfsense ppp: [wan] Bundle: No NCPs left. Closing links...
    00:43:15 pfsense ppp: [wan] IPCP: state change Closing --> Initial
    00:43:15 pfsense ppp: [wan_link0] LCP: SendTerminateAck #71
    00:43:15 pfsense ppp: [wan_link0] LCP: LayerDown
    00:43:15 pfsense ppp: [wan_link0] rec'd proto IP during terminate phase
    00:43:16 pfsense ppp: last message repeated 22 times
    00:43:16 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Stopping)
    00:43:16 pfsense ppp: [wan_link0] LCP: SendTerminateAck #72
    00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopping --> Stopped
    00:43:17 pfsense ppp: [wan_link0] LCP: LayerFinish
    00:43:17 pfsense ppp: [wan_link0] PPPoE: connection closed
    00:43:17 pfsense ppp: [wan_link0] Link: DOWN event
    00:43:17 pfsense ppp: [wan_link0] LCP: Down event
    00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopped --> Starting
    00:43:17 pfsense ppp: [wan_link0] LCP: LayerStart
    00:43:17 pfsense ppp: [wan_link0] Link: reconnection attempt 1 in 1 seconds
    00:43:18 pfsense ppp: [wan_link0] Link: reconnection attempt 1
    00:43:18 pfsense ppp: [wan_link0] PPPoE: Connecting to ''
    00:43:18 pfsense ppp: PPPoE: rec'd ACNAME "[ROUTERNAME] PPPoE"
    00:43:18 pfsense ppp: [wan_link0] PPPoE: connection successful
    00:43:18 pfsense ppp: [wan_link0] Link: UP event
    00:43:18 pfsense ppp: [wan_link0] LCP: Up event
    00:43:18 pfsense ppp: [wan_link0] LCP: state change Starting --> Req-Sent
    00:43:18 pfsense ppp: [wan_link0] LCP: SendConfigReq #73
    

  • Netgate Administrator

    I am running a Draytek V120 connected to a 20Mbps ADSL connection in the UK and have not seen such problems.
    Do you have the latest firmware on the Draytek? Which device is it? I ask because I had trouble with it initially because it had the wrong firmware. Draytek released a series of firmwares tuned for different line conditions. For the V120 at least.

    If I run a bandwidth test I see the ping times go up substantially. See my screen shot for one day average.

    My initial thoughts on this were that apinger was resetting the connection due to excessive ping times but the system log shows that's not the case.

    Perhaps your ISP is resetting it due to excessive packets per second? Some automated process to prevent a packet flood attack. You could try setting a limiter to reduce the PPS.
    You seem to have a very large number of states from your one client machine.

    Steve

    ![adsl ping times.jpg](/public/imported_attachments/1/adsl ping times.jpg)
    ![adsl ping times.jpg_thumb](/public/imported_attachments/1/adsl ping times.jpg_thumb)



  • Thanks Stephen.

    I can ask the ISP on Monday. ISP's night support say they don't reset or do anything they know of to cause this, but will check anyway what they can.  Network here has peaked at 72k sessions in the past, transiently, during startup of this program.  It's a busy network for a household and a lot of intense burst use can happen now and then, in a way that's why I'm using a router like PFsense to handle it.  The Draytek is V2800 series and has the latest 3.3.7.3 firmware (October 2012) - standard "232201" version. I updated it from 3.3.7.1 (2011) as part of testing, and the behaviour didn't change.

    Increased bandwidth leading to slow ping responses is interesting in your comment. Why would that happen? I'm not saturating my pipe (as far as I know), and PFsense shows max 1.6 Mbits and 180 pkts/sec which is a modest burst compared to the 10 mbit link it is using, but even so it does seem something is saturating.  The RTT is from PFsense across copper to my ISP's own gateway, both PFsense and the ISP pipe should have no problems with this current kind of burst. My ping graph is below, a bit different from yours.

    The logs show WAN "LCP: rec'd Terminate Request #0" that seems to happen very early in the session shut down, and presumably has the same cause as lag and packet loss, but I'm not sure how the request is raised, whether by ISP or if it's mis described and raised internally by PFsense.  Also it doesn't explain packet lag/loss issues that precede it.  Would WAN packet capture show if this is actually an external signal?

    I suspect a bit more, that it may be due to a PFsense issue, because when I rebooted PFsense I found the behavior had reduced.  It now hits RTT ~ 150ms and ~20% packet loss, just by rebooting PFsense (was 200-250ms, 35-40% before). If it was an ISP issue then surely rebooting the router (not ADSL link) wouldn't help. Also there's an odd system log entry - the pf filter is reloaded at the same time but no obvious cause for this and I'm not doing anything at the time with filters.

    Does any of this help?

    ![Pfsense ping.png](/public/imported_attachments/1/Pfsense ping.png)
    ![Pfsense ping.png_thumb](/public/imported_attachments/1/Pfsense ping.png_thumb)


  • Netgate Administrator

    @stilez:

    Increased bandwidth leading to slow ping responses is interesting in your comment. Why would that happen?

    Because the outgoing and returning ping packets have to wait while other packets are processed. You can mitigate this to some extent using QoS to prioritise ping if it's all important.

    @stilez:

    My ping graph is below, a bit different from yours.

    Yes. That level of packet loss seems completely unacceptable whatever you're doing. You'll notice mine is at 0% max.

    @stilez:

    Would WAN packet capture show if this is actually an external signal?

    Good question. As far as know the packets capture only captures IP packets, layer 3. Thus if this is at the ppp layer it may not. I'm far from an expert in this though so I stand to be corrected.  ;)

    @stilez:

    If it was an ISP issue then surely rebooting the router (not ADSL link) wouldn't help. Also there's an odd system log entry - the pf filter is reloaded at the same time but no obvious cause for this and I'm not doing anything at the time with filters.

    The pf filter is reloaded when ever there is a change on any of the interfaces, even if the filter is set to not filter anything. That's normal.

    With an ADSL connection you are dealing with several networks. My ISP is Plusnet (who are awesome by the way!). My pfSense box talks directly to Plusnet's PPP servers. The Draytek modem talks to BT's equipment in the exchange but that connection should be transparent to the pfSense-Plusnet connection.
    Thus if you have a problem at your ISP rebooting your pfSense box may change which PPP server you end up connected to, for example.

    I have had similar problems in the past that turned out to be a faulty line card in the exchange. That's happened to me several times. Statistical anomaly or common fault?  ::)

    Do you have any line data from your modem that might indicate a problem? See my screenshot.

    Also this seems relevant: http://forum.pfsense.org/index.php/topic,32653.msg168773.html#msg168773

    Steve



  • Netgate Administrator

    Just noticed, 10Mbps is unusual for a UK ADSL connection which are usually 8, 20 or 24.
    Are you being rate adapted down? Are you at the end of a long line?

    Steve



  • Thanks Steve, I'll have to look at some of those ideas later, it's interesting.  Quick answer on the line rate and stats - the 10mbit is the usual rate adapted - it's usually between 9 - 10 mbits depending on the day. I'm 3/4 mile from the exchange, so getting near the top end of what I can hope for, I guess.  Though it's a bit lower than usual today at 5.3 mbit (still much higher than the router is trying to handle at <2 mbit/sec).  SNR is also usually reported as 5-6dB so it's high today, but also the connection's been up and down while testing so the speed's being downrated for that reason too.

    Current line stats:

    TX Cells = 1.567 M, RX Cells = 1.475 M, TX CRC errs = 0, RX CRC errs = 0
    ADSL = ADSL2(G.992.3) SHOWTIME, Up/Down = 893kbits/5.3mbits
    SNR = 14, Loop = 33

    24 mbit line I see! :)


  • Netgate Administrator

    Interesting. 5-6dB seems low for a 10Mb connection to me. Generally speaking you should be able to get a high connection speed at lower margins but at a payoff with stability. The rate adaptation should take care of that but it can be reset manually if it goes awry for whatever reason.
    However I don't think it's the cause of the disconnects, you would see that in the system logs and the modem uptime.
    It looks to me as though something at your ISP is sending the disconnect commands at the ppp layer, as JimP suggested in the other thread. Maybe you can capture those and present them as evidence to your ISP? I've never tried.

    @stilez:

    24 mbit line I see! :)

    Yes. My connection here at home has always been very good. It should be, I can see the exchange from the window!  :) Although I have a 24Mb line speed I only get 20Mb because Plusnet is not an LLU provider.

    As an aside there is a lot more information available from Draytek modems via the telnet interface.
    See: http://forum.pfsense.org/index.php/topic,52091.0.html

    Steve


Log in to reply