NanoBSD CPU usage high on MultiWan, PHP processes abound

bryan.paradis

I'm using the Mac Backblaze client to saturate the upstream bandwidth. Backblaze is a cloud-based backup host. The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN. Everything else on the network goes through the primary WAN using the catchall default firewall rule.

Interestingly, pinging Google and OpenDNS servers with the configurator's ping tool, using the secondary WAN as the source, is unsurprisingly ugly (>3000ms and plenty of packet loss); however, pinging the same servers from the Mac that's being routed through the firewall to the secondary WAN shows reasonable ping times and no packet loss.

Can you post your /var/etc/apinger.conf ?

arriflex

Why would the pfSense ping times from the secondary WAN be so much crazier than the ones from a LAN-side client routed through the secondary WAN? Is the Mac giving priority to those packets before they get to the router?

This is the apinger.conf I'm using right now that is keeping php from going crazy on me:

# pfSense apinger configuration file. Automatically Generated!

## User and group the pinger should run as
user "root"
group "wheel"

## Mailer to use (default: "/usr/lib/sendmail -t")
#mailer "/var/qmail/bin/qmail-inject"

## Location of the pid-file (default: "/var/run/apinger.pid")
pid_file "/var/run/apinger.pid"

## Format of timestamp (%s macro) (default: "%b %d %H:%M:%S")
#timestamp_format "%Y%m%d%H%M%S"

status {
	## File where the status information should be written to
	file "/var/run/apinger.status"
	## Interval between file updates
	## when 0 or not set, file is written only when SIGUSR1 is received
	interval 5s
}

########################################
# RRDTool status gathering configuration
# Interval between RRD updates
rrd interval 60s;

## These parameters can be overridden in a specific alarm configuration
alarm default {
	command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
	command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
	combine 10s
}

## "Down" alarm definition.
## This alarm will be fired when target doesn't respond for 30 seconds.
alarm down "down" {
	time 10s
}

## "Delay" alarm definition.
## This alarm will be fired when responses are delayed more than 200ms
## it will be canceled, when the delay drops below 100ms
alarm delay "delay" {
	delay_low 200ms
	delay_high 500ms
}

## "Loss" alarm definition.
## This alarm will be fired when packet loss goes over 20%
## it will be canceled, when the loss drops below 10%
alarm loss "loss" {
	percent_low 10
	percent_high 20
}

target default {
	## How often the probe should be sent
	interval 1s

	## How many replies should be used to compute average delay
	## for controlling "delay" alarms
	avg_delay_samples 10

	## How many probes should be used to compute average loss
	avg_loss_samples 50

	## The delay (in samples) after which loss is computed
	## without this delays larger than interval would be treated as loss
	avg_loss_delay_samples 20

	## Names of the alarms that may be generated for the target
	alarms "down","delay","loss"

	## Location of the RRD
	#rrd file "/var/db/rrd/apinger-%t.rrd"
}
target "208.67.222.222" {
	description "WAN"
	srcip "174.134.xxx.xx"
	alarms override "loss","delay","down";
	rrd file "/var/db/rrd/WAN-quality.rrd"
}

alarm loss "WAN_STUDIO_DHCPloss" {
	percent_low 30
	percent_high 40
}
alarm delay "WAN_STUDIO_DHCPdelay" {
	delay_low 4000ms
	delay_high 5000ms
}
alarm down "WAN_STUDIO_DHCPdown" {
	time 120s
}
target "208.67.220.220" {
	description "WAN_STUDIO_DHCP"
	srcip "174.135.xxx.xx"
	interval 20s
	alarms override "WAN_STUDIO_DHCPloss","WAN_STUDIO_DHCPdelay","WAN_STUDIO_DHCPdown";
	rrd file "/var/db/rrd/WAN_STUDIO_DHCP-quality.rrd"
}

By contrast, this one is closer to the default configuration, which causes the extra PHP processes to take over:

# pfSense apinger configuration file. Automatically Generated!

## User and group the pinger should run as
user "root"
group "wheel"

## Mailer to use (default: "/usr/lib/sendmail -t")
#mailer "/var/qmail/bin/qmail-inject"

## Location of the pid-file (default: "/var/run/apinger.pid")
pid_file "/var/run/apinger.pid"

## Format of timestamp (%s macro) (default: "%b %d %H:%M:%S")
#timestamp_format "%Y%m%d%H%M%S"

status {
	## File where the status information should be written to
	file "/var/run/apinger.status"
	## Interval between file updates
	## when 0 or not set, file is written only when SIGUSR1 is received
	interval 5s
}

########################################
# RRDTool status gathering configuration
# Interval between RRD updates
rrd interval 60s;

## These parameters can be overridden in a specific alarm configuration
alarm default {
	command on "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
	command off "/usr/local/sbin/pfSctl -c 'service reload dyndns %T' -c 'service reload ipsecdns' -c 'service reload openvpn %T' -c 'filter reload' "
	combine 10s
}

## "Down" alarm definition.
## This alarm will be fired when target doesn't respond for 30 seconds.
alarm down "down" {
	time 10s
}

## "Delay" alarm definition.
## This alarm will be fired when responses are delayed more than 200ms
## it will be canceled, when the delay drops below 100ms
alarm delay "delay" {
	delay_low 200ms
	delay_high 500ms
}

## "Loss" alarm definition.
## This alarm will be fired when packet loss goes over 20%
## it will be canceled, when the loss drops below 10%
alarm loss "loss" {
	percent_low 10
	percent_high 20
}

target default {
	## How often the probe should be sent
	interval 1s

	## How many replies should be used to compute average delay
	## for controlling "delay" alarms
	avg_delay_samples 10

	## How many probes should be used to compute average loss
	avg_loss_samples 50

	## The delay (in samples) after which loss is computed
	## without this delays larger than interval would be treated as loss
	avg_loss_delay_samples 20

	## Names of the alarms that may be generated for the target
	alarms "down","delay","loss"

	## Location of the RRD
	#rrd file "/var/db/rrd/apinger-%t.rrd"
}
target "208.67.222.222" {
	description "WAN"
	srcip "174.134.xxx.xx"
	alarms override "loss","delay","down";
	rrd file "/var/db/rrd/WAN-quality.rrd"
}

target "208.67.220.220" {
	description "WAN_STUDIO_DHCP"
	srcip "174.135.xxx.xx"
	alarms override "loss","delay","down";
	rrd file "/var/db/rrd/WAN_STUDIO_DHCP-quality.rrd"
}
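
The comments on the loss alarm in both configs describe a hysteresis: the alarm fires when loss goes over `percent_high` and is cancelled once loss drops below `percent_low`. Below is a minimal Python sketch of that behaviour, assuming the thresholds work exactly as those comments say. It is not apinger's actual code; the function names and the sample loss values are made up for illustration. It only shows why the 30/40 thresholds in the first config should change state less often than the default 10/20, and each state change runs the `command on` / `command off` line from `alarm default`.

```python
def loss_alarm_states(loss_samples, percent_low, percent_high):
    """Return the alarm state (True = firing) after each loss sample.

    Assumed semantics, taken from the config comments: the alarm turns on
    when loss goes over percent_high and turns off when loss drops below
    percent_low. Between the two thresholds the previous state is kept.
    """
    firing = False
    states = []
    for loss in loss_samples:
        if loss > percent_high:
            firing = True
        elif loss < percent_low:
            firing = False
        states.append(firing)
    return states


def count_transitions(states):
    """Count how often the alarm changes state, starting from 'not firing'."""
    previous = [False] + states[:-1]
    return sum(1 for prev, cur in zip(previous, states) if prev != cur)


# Hypothetical loss percentages on a saturated link (made-up numbers).
samples = [5, 25, 8, 22, 35, 9, 45, 28, 12]

default_states = loss_alarm_states(samples, percent_low=10, percent_high=20)
relaxed_states = loss_alarm_states(samples, percent_low=30, percent_high=40)

print("default 10/20 transitions:", count_transitions(default_states))
print("relaxed 30/40 transitions:", count_transitions(relaxed_states))
```

With these made-up samples the default thresholds change state more than twice as often as the relaxed ones; real numbers depend on the link.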

bryan.paradis

I am not sure, but this is getting interesting! It may be some sort of priority thing. I won't be able to do much more work on this, as I am flying across the country to do some IT work.

arriflex

Safe travels. Feel free to assume that I'm making some rookie mistake and float it out there for me to check.

I rebooted the router and routed another client through the secondary WAN to test pings through it, and found the same erratic behaviour that I saw when pinging directly from the secondary WAN interface on the router. Then I went back to the original client that was saturating that WAN, which previously had no issue with pings, and found that it now showed the same poor ping performance.

I am now unable to replicate the state where I had good ping times from the client that was saturating the upstream while seeing bad ones from the router on that interface. The good news is that they are at least all in line with each other!

arri

bryan.paradis

Indeed. I missed two flights. I hate airlines. Canceled that contract. Horrible day.

Raise ICMP priority
Lower interval or use combine to lower alarm repetitions
Traffic shape so you leave a bit of space for the ICMP
Try setting the down time higher.

stephenw10

You might try a 2.1.1 snapshot that has many fixes for various things in it.
https://doc.pfsense.org/index.php/2.1.1_New_Features_and_Changes
These are still pre-release snapshots though, so new bugs may be added. ;)

Steve

phil.davis

The Mac is connected to the Alix router through two unmanaged switches, and a firewall rule sends it through the secondary WAN.

Maybe the rule is only for TCP/UDP and ICMP ping from the Mac is still going out the default gateway? That would explain the ping being so good. You could traceroute from Mac to an external host to see which way the ICMP is really going.
It does sound like the link is so saturated that the apinger monitoring is struggling to see any decent numbers and is deciding the link is bad.
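
The routing theory above can be sketched in a few lines of Python. This is only an illustration, assuming the rule list is evaluated top-down with the first match winning and that unmatched traffic falls through to the default gateway. The function name is hypothetical, the gateway names are borrowed from the `description` fields in the posted apinger.conf, and matching on the Mac's source address is left out for brevity.

```python
def pick_gateway(protocol, rules, default_gateway="WAN"):
    """Return the gateway for a packet of the given protocol.

    rules is a list of (protocols, gateway) pairs, where protocols is
    either the string "any" or a set of protocol names. The first
    matching rule wins; otherwise the default gateway is used.
    """
    for protocols, gateway in rules:
        if protocols == "any" or protocol in protocols:
            return gateway
    return default_gateway


# Policy rule limited to TCP/UDP: ICMP never matches it.
tcp_udp_only = [({"tcp", "udp"}, "WAN_STUDIO_DHCP")]
# Same rule with the protocol set to "any".
any_protocol = [("any", "WAN_STUDIO_DHCP")]

print(pick_gateway("tcp", tcp_udp_only))   # backup traffic leaves via the secondary WAN
print(pick_gateway("icmp", tcp_udp_only))  # ping falls through to the default gateway
print(pick_gateway("icmp", any_protocol))  # ping now shares the saturated link
```

Under those assumptions, a ping from the Mac would only see the saturated link once the rule covers ICMP as well.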

arriflex

You are correct, I noticed before the reboot that my rule was limited to TCP/UDP, and updated it to "any." That explains the new consistency. Nice catch!

It's working fine with the ludicrous numbers for monitoring delays and decreasing the number of checks that gateway does. If I find the time, I'll give a snapshot a try as this is the box for doing that on.

arri

bryan.paradis

I am not sure if it's going to change anything. Turn off the gateway monitor on that WAN, then test with Backblaze running and ping out to see if there is a difference. It could be that the added load on the box is degrading the connection further. If you get the same loss/latency without apinger running on that WAN, then you are dealing with normal saturation, which will need to be fixed using one of the things listed in my post above.

arriflex

I finally got around to trying this for you. With the gateway monitor off on the secondary WAN and its upstream fully saturated, ping times from both the configurator page (using the secondary WAN gateway) and the client routed through the firewall (this time all traffic, not just TCP/UDP ;) are from two to five seconds.

I think it's clearly just a congested interface the way Backblaze saturates it through their SSL tunnel.

bryan.paradis

Is that the same, worse, or better than the results with apinger on? It would have been interesting to see whether apinger perpetuates the problem through the greatly increased load.

AIMS-Informatique

We are seeing the same rising CPU behaviour on our PC Engines ALIX with a NanoBSD install (2.1.4 stable).

In France, we experience a lot of trouble with xDSL connections, mainly packet loss, caused either by poor line sync or by a user saturating the line with large downloads or uploads.
This causes pfSense to experience high CPU load whenever it considers the gateway offline.

We worked around it by relaxing the gateway polling settings under "Routing->Gateways->Edit gateway":

- Advanced->Packet Loss Thresholds = 20% / 40% (default 10% / 20%)
- Probe Interval = 5s (default = 1s)
- Down = 60s (default = 10s)
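
As a rough illustration of what those settings change, here is a small Python sketch comparing the default and tuned values. It assumes one probe is sent per interval and that the "down" alarm needs roughly `down / interval` consecutive unanswered probes; the class and method names are hypothetical and are not part of pfSense or apinger.

```python
from dataclasses import dataclass


@dataclass
class GatewayMonitor:
    """Illustrative container for the two timing values from the list above."""
    probe_interval_s: int
    down_time_s: int

    def probes_per_hour(self) -> int:
        # Assumes exactly one probe per interval.
        return 3600 // self.probe_interval_s

    def probes_before_down(self) -> int:
        # Assumes "down" fires after down_time_s without any reply.
        return self.down_time_s // self.probe_interval_s


default = GatewayMonitor(probe_interval_s=1, down_time_s=10)
tuned = GatewayMonitor(probe_interval_s=5, down_time_s=60)

for name, monitor in (("default", default), ("tuned", tuned)):
    print(name, monitor.probes_per_hour(), "probes/hour,",
          monitor.probes_before_down(), "missed probes before down")
```

Under these assumptions the tuned values send a fifth as many probes and wait six times longer before declaring the gateway down, so a briefly saturated line is less likely to trigger the alarm commands.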

What we have seen so far with those values is a more responsive PHP UI, and the RRD graph shows a drop in CPU load. We have only been testing these settings for a few hours, on pfSense boxes that are experiencing DSL sync difficulties.

It looks good so far: increasing the apinger probe interval and the failure-decision time seems to give the ALIX more time to do its other work, and the CPU graph falls dramatically (so far…).

Playing with the values above seems to be a good, quick fix to try.