Tracing reason for brief 50% packet loss, lag, and connection termination
-
I have a machine and PFsense platform (ADSL) that should be able to cope with any sane demand placed on it, but at times doesn't, in a specific situation I get insane lag and packet loss, and the link is reset. I'm not sure how to fault-trace the issue(s). I think the issue is down to my ISP connection or PFsense. I really need ideas.
The setup here is simple and high spec, and should be fine with a burst of (pretty modest) peak traffic here. But for reasons unknown it isn't.
Faced with a modest burst of http and dns requests, PFsense RRD>Quality shows that WAN RTT usually ~30ms increases to ~200 - 250ms and packet loss rises from ~ 0.6% to about 35 - 40%. That alone suggests the issue arises with PFsense or maybe the connection, because I don't think that would be affected by anything on the LAN or PC. After a while the PPP connection resets. PFsense has not shown any sign of overload (memory, CPU, sessions etc seem to remain normal and low throughout). I have no idea why any of this happens.
DATA BELOW - hardware/ISP/connection/config info, and PFsense system + PPP logs.
Any ideas?
SYSTEM SETUP AND CONFIG INFO
WHAT HAPPENS
RELEVANT SNIP OF PFSENSE SYSTEM LOG
(Nothing for previous 30 mins) 00:43:15 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #0 (Opened) 00:43:15 pfsense ppp: [wan_link0] LCP: state change Opened --> Stopping 00:43:15 pfsense ppp: [wan_link0] Link: Leave bundle "wan" 00:43:15 pfsense ppp: [wan] Bundle: Status update: up 0 links, total bandwidth 9600 bps 00:43:15 pfsense ppp: [wan] IPCP: Close event 00:43:15 pfsense ppp: [wan] IPCP: state change Opened --> Closing 00:43:15 pfsense ppp: [wan] IPCP: SendTerminateReq #129 00:43:15 pfsense ppp: [wan] IPCP: LayerDown 00:43:15 pfsense ppp-linkdown: Removing states from [PFSENSE PUBLIC IP] 00:43:15 pfsense ppp-linkdown: Removing states to [ISP GATEWAY IP] 00:43:15 pfsense check_reload_status: Rewriting resolv.conf 00:43:15 pfsense ppp: [wan] IFACE: Down event 00:43:15 pfsense ppp: [wan] IPCP: Down event 00:43:15 pfsense ppp: [wan] IPCP: LayerFinish 00:43:15 pfsense ppp: [wan] Bundle: No NCPs left. Closing links... 00:43:15 pfsense ppp: [wan] IPCP: state change Closing --> Initial 00:43:15 pfsense ppp: [wan_link0] LCP: SendTerminateAck #71 00:43:15 pfsense ppp: [wan_link0] LCP: LayerDown 00:43:15 pfsense ppp: [wan_link0] rec'd proto IP during terminate phase 00:43:16 pfsense ppp: last message repeated 22 times 00:43:16 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Stopping) 00:43:16 pfsense ppp: [wan_link0] LCP: SendTerminateAck #72 00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopping --> Stopped 00:43:17 pfsense ppp: [wan_link0] LCP: LayerFinish 00:43:17 pfsense ppp: [wan_link0] PPPoE: connection closed 00:43:17 pfsense ppp: [wan_link0] Link: DOWN event 00:43:17 pfsense ppp: [wan_link0] LCP: Down event 00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopped --> Starting 00:43:17 pfsense ppp: [wan_link0] LCP: LayerStart 00:43:17 pfsense ppp: [wan_link0] Link: reconnection attempt 1 in 1 seconds 00:43:18 pfsense ppp: [wan_link0] Link: reconnection attempt 1 00:43:18 pfsense ppp: [wan_link0] PPPoE: Connecting to '' 00:43:18 pfsense ppp: PPPoE: rec'd ACNAME "[ROUTERNAME] PPPoE" 00:43:18 pfsense ppp: [wan_link0] PPPoE: connection successful 00:43:18 pfsense ppp: [wan_link0] Link: UP event 00:43:18 pfsense ppp: [wan_link0] LCP: Up event 00:43:18 pfsense ppp: [wan_link0] LCP: state change Starting --> Req-Sent
RELEVANT SNIP OF PFSENSE PPP LOG
00:00:08 pfsense ppp: [wan] IFACE: Up event 00:43:15 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #0 (Opened) 00:43:15 pfsense ppp: [wan_link0] LCP: state change Opened --> Stopping 00:43:15 pfsense ppp: [wan_link0] Link: Leave bundle "wan" 00:43:15 pfsense ppp: [wan] Bundle: Status update: up 0 links, total bandwidth 9600 bps 00:43:15 pfsense ppp: [wan] IPCP: Close event 00:43:15 pfsense ppp: [wan] IPCP: state change Opened --> Closing 00:43:15 pfsense ppp: [wan] IPCP: SendTerminateReq #129 00:43:15 pfsense ppp: [wan] IPCP: LayerDown 00:43:15 pfsense ppp: [wan] IFACE: Down event 00:43:15 pfsense ppp: [wan] IPCP: Down event 00:43:15 pfsense ppp: [wan] IPCP: LayerFinish 00:43:15 pfsense ppp: [wan] Bundle: No NCPs left. Closing links... 00:43:15 pfsense ppp: [wan] IPCP: state change Closing --> Initial 00:43:15 pfsense ppp: [wan_link0] LCP: SendTerminateAck #71 00:43:15 pfsense ppp: [wan_link0] LCP: LayerDown 00:43:15 pfsense ppp: [wan_link0] rec'd proto IP during terminate phase 00:43:16 pfsense ppp: last message repeated 22 times 00:43:16 pfsense ppp: [wan_link0] LCP: rec'd Terminate Request #3 (Stopping) 00:43:16 pfsense ppp: [wan_link0] LCP: SendTerminateAck #72 00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopping --> Stopped 00:43:17 pfsense ppp: [wan_link0] LCP: LayerFinish 00:43:17 pfsense ppp: [wan_link0] PPPoE: connection closed 00:43:17 pfsense ppp: [wan_link0] Link: DOWN event 00:43:17 pfsense ppp: [wan_link0] LCP: Down event 00:43:17 pfsense ppp: [wan_link0] LCP: state change Stopped --> Starting 00:43:17 pfsense ppp: [wan_link0] LCP: LayerStart 00:43:17 pfsense ppp: [wan_link0] Link: reconnection attempt 1 in 1 seconds 00:43:18 pfsense ppp: [wan_link0] Link: reconnection attempt 1 00:43:18 pfsense ppp: [wan_link0] PPPoE: Connecting to '' 00:43:18 pfsense ppp: PPPoE: rec'd ACNAME "[ROUTERNAME] PPPoE" 00:43:18 pfsense ppp: [wan_link0] PPPoE: connection successful 00:43:18 pfsense ppp: [wan_link0] Link: UP event 00:43:18 pfsense ppp: [wan_link0] LCP: Up event 00:43:18 pfsense ppp: [wan_link0] LCP: state change Starting --> Req-Sent 00:43:18 pfsense ppp: [wan_link0] LCP: SendConfigReq #73
-
I am running a Draytek V120 connected to a 20Mbps ADSL connection in the UK and have not seen such problems.
Do you have the latest firmware on the Draytek? Which device is it? I ask because I had trouble with it initially because it had the wrong firmware. Draytek released a series of firmwares tuned for different line conditions. For the V120 at least.If I run a bandwidth test I see the ping times go up substantially. See my screen shot for one day average.
My initial thoughts on this were that apinger was resetting the connection due to excessive ping times but the system log shows that's not the case.
Perhaps your ISP is resetting it due to excessive packets per second? Some automated process to prevent a packet flood attack. You could try setting a limiter to reduce the PPS.
You seem to have a very large number of states from your one client machine.Steve

 -
Thanks Stephen.
I can ask the ISP on Monday. ISP's night support say they don't reset or do anything they know of to cause this, but will check anyway what they can. Network here has peaked at 72k sessions in the past, transiently, during startup of this program. It's a busy network for a household and a lot of intense burst use can happen now and then, in a way that's why I'm using a router like PFsense to handle it. The Draytek is V2800 series and has the latest 3.3.7.3 firmware (October 2012) - standard "232201" version. I updated it from 3.3.7.1 (2011) as part of testing, and the behaviour didn't change.
Increased bandwidth leading to slow ping responses is interesting in your comment. Why would that happen? I'm not saturating my pipe (as far as I know), and PFsense shows max 1.6 Mbits and 180 pkts/sec which is a modest burst compared to the 10 mbit link it is using, but even so it does seem something is saturating. The RTT is from PFsense across copper to my ISP's own gateway, both PFsense and the ISP pipe should have no problems with this current kind of burst. My ping graph is below, a bit different from yours.
The logs show WAN "LCP: rec'd Terminate Request #0" that seems to happen very early in the session shut down, and presumably has the same cause as lag and packet loss, but I'm not sure how the request is raised, whether by ISP or if it's mis described and raised internally by PFsense. Also it doesn't explain packet lag/loss issues that precede it. Would WAN packet capture show if this is actually an external signal?
I suspect a bit more, that it may be due to a PFsense issue, because when I rebooted PFsense I found the behavior had reduced. It now hits RTT ~ 150ms and ~20% packet loss, just by rebooting PFsense (was 200-250ms, 35-40% before). If it was an ISP issue then surely rebooting the router (not ADSL link) wouldn't help. Also there's an odd system log entry - the pf filter is reloaded at the same time but no obvious cause for this and I'm not doing anything at the time with filters.
Does any of this help?

 -
Increased bandwidth leading to slow ping responses is interesting in your comment. Why would that happen?
Because the outgoing and returning ping packets have to wait while other packets are processed. You can mitigate this to some extent using QoS to prioritise ping if it's all important.
My ping graph is below, a bit different from yours.
Yes. That level of packet loss seems completely unacceptable whatever you're doing. You'll notice mine is at 0% max.
Would WAN packet capture show if this is actually an external signal?
Good question. As far as know the packets capture only captures IP packets, layer 3. Thus if this is at the ppp layer it may not. I'm far from an expert in this though so I stand to be corrected. ;)
If it was an ISP issue then surely rebooting the router (not ADSL link) wouldn't help. Also there's an odd system log entry - the pf filter is reloaded at the same time but no obvious cause for this and I'm not doing anything at the time with filters.
The pf filter is reloaded when ever there is a change on any of the interfaces, even if the filter is set to not filter anything. That's normal.
With an ADSL connection you are dealing with several networks. My ISP is Plusnet (who are awesome by the way!). My pfSense box talks directly to Plusnet's PPP servers. The Draytek modem talks to BT's equipment in the exchange but that connection should be transparent to the pfSense-Plusnet connection.
Thus if you have a problem at your ISP rebooting your pfSense box may change which PPP server you end up connected to, for example.I have had similar problems in the past that turned out to be a faulty line card in the exchange. That's happened to me several times. Statistical anomaly or common fault? ::)
Do you have any line data from your modem that might indicate a problem? See my screenshot.
Also this seems relevant: http://forum.pfsense.org/index.php/topic,32653.msg168773.html#msg168773
Steve
-
Just noticed, 10Mbps is unusual for a UK ADSL connection which are usually 8, 20 or 24.
Are you being rate adapted down? Are you at the end of a long line?Steve
-
Thanks Steve, I'll have to look at some of those ideas later, it's interesting. Quick answer on the line rate and stats - the 10mbit is the usual rate adapted - it's usually between 9 - 10 mbits depending on the day. I'm 3/4 mile from the exchange, so getting near the top end of what I can hope for, I guess. Though it's a bit lower than usual today at 5.3 mbit (still much higher than the router is trying to handle at <2 mbit/sec). SNR is also usually reported as 5-6dB so it's high today, but also the connection's been up and down while testing so the speed's being downrated for that reason too.
Current line stats:
TX Cells = 1.567 M, RX Cells = 1.475 M, TX CRC errs = 0, RX CRC errs = 0
ADSL = ADSL2(G.992.3) SHOWTIME, Up/Down = 893kbits/5.3mbits
SNR = 14, Loop = 3324 mbit line I see! :)
-
Interesting. 5-6dB seems low for a 10Mb connection to me. Generally speaking you should be able to get a high connection speed at lower margins but at a payoff with stability. The rate adaptation should take care of that but it can be reset manually if it goes awry for whatever reason.
However I don't think it's the cause of the disconnects, you would see that in the system logs and the modem uptime.
It looks to me as though something at your ISP is sending the disconnect commands at the ppp layer, as JimP suggested in the other thread. Maybe you can capture those and present them as evidence to your ISP? I've never tried.24 mbit line I see! :)
Yes. My connection here at home has always been very good. It should be, I can see the exchange from the window! :) Although I have a 24Mb line speed I only get 20Mb because Plusnet is not an LLU provider.
As an aside there is a lot more information available from Draytek modems via the telnet interface.
See: http://forum.pfsense.org/index.php/topic,52091.0.htmlSteve