Very busy link kills WAN PPPoE: LCP Echo Responses too slow?

martin42

Hi,

I've been running PPPoE on my ADSL link using pfSense 1.2.3 embedded on a Soekris NET5501. The modem is a Draytek Vigor 120. Usually this setup seems rock solid.

However, when running unusual network applications (e.g. Nessus) resulting in a high number of packets-per-second, the ADSL ISP link fails. From the ISP's diagnostics it appears that the PPPoE software failed to respond to LCP echo requests quickly enough. A workaround is to use pfSense's traffic shaper to set limits, so that the PPPoE WAN link is never completely saturated.

Has anyone else seen this issue?

Is there any way to tell mpd to prioritize LCP Echo responses over user traffic?

Many thanks!

Martin

martin42

Follow-up: actually, it seems unlikely that slow LCP echo replies are the problem, since they appear to be needed only when the line is idle:

/var/etc/mpd.conf on pfSense includes the line: "set link keep-alive 10 60".

The MPD man page states:-

set link keep-alive seconds max

This command enables the sending of LCP echo packets on the link. The first echo
packet is sent after seconds seconds of quiet time (i.e., no frames received from the
peer on that link). After seconds more seconds, another echo request is sent. If after
max seconds of doing this no echo reply has been received yet, the link is brought
down.

If seconds is zero, echo packets are disabled. The default values are five second
intervals with a maximum no-reply time of forty.

This feature is especially useful with modems when the carrier detect signal is
unreliable. However, in situations where lines are noisy and modems spend a lot of time
retraining, the max value may need to be bumped up to a more generous value.
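
To make the timing concrete, here is a rough sketch of how I read that man page text for "set link keep-alive 10 60". The function name is made up for illustration, and it assumes the max timer starts counting at the first echo request, which the man page does not spell out:

```python
def keepalive_timeline(seconds, max_no_reply):
    """Return (echo_times, down_time), in seconds since the last frame
    was received from the peer.

    Assumption: the max no-reply window is counted from the first echo
    request. The man page wording leaves this open.
    """
    if seconds == 0:
        # Per the man page, zero disables echo packets entirely.
        return [], None
    first_echo = seconds
    down_time = first_echo + max_no_reply
    # One echo request every `seconds` until the link would be dropped.
    echo_times = list(range(first_echo, down_time, seconds))
    return echo_times, down_time


echoes, down = keepalive_timeline(10, 60)
print("echo requests at:", echoes)
print("link brought down at:", down)
```

The point is that any frame received from the peer resets the quiet timer, so on a saturated link with traffic still arriving, mpd should not be sending echo requests at all.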

Any ideas?

Could it simply be that the slow CPU is unable to run PPPoE fast enough when the packets-per-second rate gets high? Peak rate was about 330 PPS, but a sustained rate of 250 PPS worked OK with low CPU use, so really I have no idea what's going on.

EDIT: Maybe the Nessus application represents an extreme load, because running port scans across multiple IPs means the firewall state table gets fairly full, when normally it's almost empty. The firewall interface I use for Nessus runs has an "allow any any" rule, but I guess that won't stop PF from keeping state on every session.

binco

I've had the same error with a multilink L2TP tunnel and finally found the reason:

There was a bug in FreeBSD up to 10 which had a problem with wrapping L2TP sequence numbers, so when the sequence number hit 32767 all further packets were dropped until the connection was reestablished by keep-alive. This usually occurred when the link was saturated.
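
As a rough illustration of this class of bug (this is not the actual FreeBSD code, and the function names are hypothetical), here is how treating a 16-bit sequence number as a signed value goes wrong right after 32767, compared with a wrap-aware modulo-65536 comparison:

```python
def as_int16(x):
    """Interpret the low 16 bits of x as a signed 16-bit integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def buggy_is_newer(seq, last):
    # Hypothetical faulty check: compares the values as signed 16-bit
    # integers, so 32768 looks like -32768 and is treated as "old".
    return as_int16(seq) > as_int16(last)


def wrap_aware_is_newer(seq, last):
    # Modulo-65536 comparison: seq is newer if it lies within the
    # 32767 values that follow last.
    diff = (seq - last) & 0xFFFF
    return 0 < diff < 0x8000


for seq, last in [(32767, 32766), (32768, 32767), (0, 65535)]:
    print(seq, last, buggy_is_newer(seq, last), wrap_aware_is_newer(seq, last))
```

With a check like the buggy one, every packet after 32767 would look stale and be dropped, which matches the symptom of the link staying dead until it is reestablished.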

I've opened a pfSense ticket; this error is already resolved in 2.2-SNAPSHOT and will also be fixed in 2.1.5.

martin42

That's great work!

How quickly does that counter advance? I wonder how often it hits 32767 in normal traffic conditions. In other words, why doesn't this cause links to flap more often?
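
For a back-of-envelope feel, assuming the counter starts at zero and advances by one per packet (both are my assumptions, and the packet rates are just the ones I saw on my PPPoE link, not measurements from an L2TP tunnel):

```python
def seconds_to_reach(limit, packets_per_second):
    """Time for a per-packet counter starting at 0 to reach limit."""
    return limit / packets_per_second


for pps in (250, 330):
    secs = seconds_to_reach(32767, pps)
    print(f"{pps} PPS: about {round(secs)} s")
```

If those assumptions hold, the counter would reach 32767 within a couple of minutes at these rates, which is why I'm puzzled.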

Thanks!

Martin.