pfSense on a PowerEdge 1850

vman76

That is looking better for sure. Mind posting the sysctl for that guy? Also, what size packets are you using, or were you using, in the test?

Sure, here is the current data. The firewall is now in production and averaging 150Mbps at 24,000 PPS, with no issues since around noon. I tried various iperf runs, but the money spot was this one:

iperf -c <server> -w 65000 -t 600 -P5

Which should use the full Ethernet frame. I tried a bunch of other window sizes and more flows (up to -P 50), along with UDP tests. The above gave me the best results. Looking at the distribution of packets on the last firewall, and at the router's NetFlow route-cache, the students mostly use applications with large packets (video streaming, file sharing, etc.). I'd like to have done some more testing, but time constraints did not allow it.

dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.2
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x10a4 subvendor=0x8086 subdevice=0x10a4 class=0x020000
dev.em.0.%parent: pci14
dev.em.0.nvm: -1
dev.em.0.debug: -1
dev.em.0.fc: 3
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_processing_limit: 100
dev.em.0.eee_control: 0
dev.em.0.link_irq: 0
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.dropped: 0
dev.em.0.tx_dma_fail: 0
dev.em.0.rx_overruns: 0
dev.em.0.watchdog_timeouts: 0
dev.em.0.device_control: 1209795137
dev.em.0.rx_control: 67141634
dev.em.0.fc_high_water: 30720
dev.em.0.fc_low_water: 29220
dev.em.0.queue0.txd_head: 192
dev.em.0.queue0.txd_tail: 192
dev.em.0.queue0.tx_irq: 0
dev.em.0.queue0.no_desc_avail: 0
dev.em.0.queue0.rxd_head: 531
dev.em.0.queue0.rxd_tail: 530
dev.em.0.queue0.rx_irq: 0
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.defer_count: 5793
dev.em.0.mac_stats.missed_packets: 0
dev.em.0.mac_stats.recv_no_buff: 139
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 5929
dev.em.0.mac_stats.xon_txd: 120
dev.em.0.mac_stats.xoff_recvd: 5929
dev.em.0.mac_stats.xoff_txd: 120
dev.em.0.mac_stats.total_pkts_recvd: 397413786
dev.em.0.mac_stats.good_pkts_recvd: 397401928
dev.em.0.mac_stats.bcast_pkts_recvd: 2715
dev.em.0.mac_stats.mcast_pkts_recvd: 1528
dev.em.0.mac_stats.rx_frames_64: 11419946
dev.em.0.mac_stats.rx_frames_65_127: 24122771
dev.em.0.mac_stats.rx_frames_128_255: 5438765
dev.em.0.mac_stats.rx_frames_256_511: 2942593
dev.em.0.mac_stats.rx_frames_512_1023: 13221690
dev.em.0.mac_stats.rx_frames_1024_1522: 340256163
dev.em.0.mac_stats.good_octets_recvd: 504144384891
dev.em.0.mac_stats.good_octets_txd: 70175650866
dev.em.0.mac_stats.total_pkts_txd: 199599490
dev.em.0.mac_stats.good_pkts_txd: 199599248
dev.em.0.mac_stats.bcast_pkts_txd: 1616
dev.em.0.mac_stats.mcast_pkts_txd: 2
dev.em.0.mac_stats.tx_frames_64: 83244952
dev.em.0.mac_stats.tx_frames_65_127: 68946765
dev.em.0.mac_stats.tx_frames_128_255: 3324597
dev.em.0.mac_stats.tx_frames_256_511: 2036340
dev.em.0.mac_stats.tx_frames_512_1023: 3106394
dev.em.0.mac_stats.tx_frames_1024_1522: 38940203
dev.em.0.mac_stats.tso_txd: 0
dev.em.0.mac_stats.tso_ctx_fail: 0
dev.em.0.interrupts.asserts: 106244188
dev.em.0.interrupts.rx_pkt_timer: 39933
dev.em.0.interrupts.rx_abs_timer: 0
dev.em.0.interrupts.tx_pkt_timer: 5731
dev.em.0.interrupts.tx_abs_timer: 11354
dev.em.0.interrupts.tx_queue_empty: 0
dev.em.0.interrupts.tx_queue_min_thresh: 0
dev.em.0.interrupts.rx_desc_min_thresh: 0
dev.em.0.interrupts.rx_overrun: 0
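
As a rough cross-check of the "mostly large packets" observation, the receive counters in the dump above can be turned into a size distribution. This is only a sketch: the numbers are copied from the dev.em.0.mac_stats lines, and the variable names are made up for illustration.

```python
# RX frame-size counters copied from the dev.em.0.mac_stats lines above.
rx_frames = {
    "64": 11419946,
    "65-127": 24122771,
    "128-255": 5438765,
    "256-511": 2942593,
    "512-1023": 13221690,
    "1024-1522": 340256163,
}
good_octets_recvd = 504144384891
good_pkts_recvd = 397401928

# Share of received frames that falls into each size bucket.
total_frames = sum(rx_frames.values())
shares = {bucket: count / total_frames for bucket, count in rx_frames.items()}

# Mean received frame size over the life of the counters.
avg_rx_frame = good_octets_recvd / good_pkts_recvd

for bucket, share in shares.items():
    print(f"{bucket:>10} bytes: {share:6.1%}")
print(f"average received frame: {avg_rx_frame:.0f} bytes")
```

The bucket counts add up to good_pkts_recvd, and the 1024-1522 bucket dominates, which fits the description of the traffic as mostly video streaming and file sharing.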

bryan.paradis

@stephenw10:

Impressive.
The ERL has a custom ASIC to enable it to perform like that. It's not supported by FreeBSD, so if/when pfSense runs on it don't expect those numbers. Currently tops out at 250Mbps.

Steve

Yes indeed. It is a heavily modified Vyatta-based OS on Debian for MIPS (Cavium). The driver would need to be ported. Still at $99

http://rtfm.net/FreeBSD/ERL/

Performance could be a little better, though it's more than adequate for my home Internet connection. Basic packet passing between two Gigabit hosts seems to top out at about 250Mbits/sec.

https://wiki.freebsd.org/FreeBSD/mips/Octeon

@vman76:

Interesting! Thanks for posting.

vman76

Well, it looks like I found the hardware limits of the new server as well. We were able to push about 500Mbps and 80,000 PPS with no issue. Once we get to 600Mbps and 100,000 PPS, we get input errors (NIC buffer overruns). While doing some real-time troubleshooting, I noticed that the errors occur exactly when one of the 4 CPUs hits 100% in the kernel "em0 queue" process. em0 is my outside interface. So it appears my earlier suspicion applies in this case: the CPU is too busy to pull the packets off the NIC buffer in time, and I end up with overruns. The CPU I'm using is an Intel(R) Xeon(R) CPU 5130 @ 2.00GHz, so it looks like I'm going to be searching for another box. I'm doing 1:1 NAT for over 5,000 hosts, so I think that might be driving the CPU higher than I expected. The attached pic shows CPU1 at 84%, but "top -P" shows that it gets to 100% when the packet loss occurs.
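
To put the limit in perspective, here is a back-of-the-envelope budget for one core at the point where the overruns started. It is a sketch under the assumption that all receive processing for em0 lands on a single 2.00GHz core, as the top output suggests. The variable names are illustrative only.

```python
# Figures from the post above; the single-core assumption is mine.
pps = 100_000            # packets per second where input errors started
clock_hz = 2.0e9         # Xeon 5130 @ 2.00GHz
throughput_bps = 600e6   # roughly 600Mbps at that point

# Time and CPU cycles available per packet if one core handles everything.
time_per_packet_us = 1e6 / pps
cycles_per_packet = clock_hz / pps

# Average packet size implied by the two rates.
avg_packet_bytes = throughput_bps / 8 / pps

print(f"{time_per_packet_us:.1f} us per packet")
print(f"{cycles_per_packet:.0f} cycles per packet")
print(f"{avg_packet_bytes:.0f} bytes per packet on average")
```

Interrupt handling, the pf rules, the 1:1 NAT lookup and the syslog logging all have to fit inside that per-packet budget, so anything that adds per-packet work lowers the ceiling.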

I'd love to put the Ubiquiti Edgerouter inline and test their PPS claim here since I'm way under 1,000,000 PPS :P (j/k)

Out of curiosity, does anyone know why the RRD graphs don't show individual CPU/core stats? The CPU data there looks like it's the average of all 4 CPUs, which doesn't really help in troubleshooting a problem like this. I did an snmpwalk and found utilization data for all the CPUs, so I'm graphing it separately in Cacti now (HOST-RESOURCES-MIB::hrProcessorLoad.x).
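
A small sketch of why the averaged graph hides this kind of problem. It parses snmpwalk-style text for hrProcessorLoad and compares the average with the busiest core. The sample values are hypothetical, not readings from this firewall, and the helper name parse_loads is made up for the example.

```python
import re

# Hypothetical snmpwalk output; the load values are invented for illustration.
SAMPLE = """\
HOST-RESOURCES-MIB::hrProcessorLoad.1 = INTEGER: 12
HOST-RESOURCES-MIB::hrProcessorLoad.2 = INTEGER: 100
HOST-RESOURCES-MIB::hrProcessorLoad.3 = INTEGER: 35
HOST-RESOURCES-MIB::hrProcessorLoad.4 = INTEGER: 9
"""

LINE = re.compile(r"hrProcessorLoad\.(\d+) = INTEGER: (\d+)")


def parse_loads(text):
    """Return {processor index: load percent} from snmpwalk-style text."""
    return {int(index): int(load) for index, load in LINE.findall(text)}


loads = parse_loads(SAMPLE)
average = sum(loads.values()) / len(loads)
busiest = max(loads, key=loads.get)

print(f"average load: {average:.1f}%")
print(f"busiest core: index {busiest} at {loads[busiest]}%")
```

With these made-up values the average is only 39%, while one core is pinned at 100%. That is the situation the per-core Cacti graphs are meant to catch.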

Some data from my troubleshooting is below in case someone spots something. I have a lot of experience troubleshooting networks in general, but I'm very new to BSD, so I could be missing something.

            input        (Total)           output
   packets  errs idrops      bytes    packets  errs      bytes colls
       86k    83      0        73M        87k     0        73M     0
      100k   155      0        85M       101k     0        85M     0
       96k     0      0        82M        97k     0        82M     0
       99k    74      0        82M       101k     0        82M     0
       96k     0      0        82M        98k     0        82M     0
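
For reference, the input side of the samples above can be converted to Mbps, average packet size and error percentage. This sketch assumes one-second samples (as from something like netstat -w 1), and that "k" and "M" mean 1,000 and 1,000,000. The post states neither, so treat the results as approximate.

```python
# (packets/s, input errors/s, bytes/s) for the five samples above.
# Assumptions: one-second interval, "k" = 1,000 and "M" = 1,000,000.
samples = [
    (86_000, 83, 73_000_000),
    (100_000, 155, 85_000_000),
    (96_000, 0, 82_000_000),
    (99_000, 74, 82_000_000),
    (96_000, 0, 82_000_000),
]


def summarize(pps, errs, bytes_per_s):
    """Convert one sample into Mbps, mean packet size and error percentage."""
    return {
        "mbps": bytes_per_s * 8 / 1e6,
        "avg_packet_bytes": bytes_per_s / pps,
        "error_pct": 100 * errs / pps,
    }


rows = [summarize(*sample) for sample in samples]
for row in rows:
    print(f"{row['mbps']:6.0f} Mbps  "
          f"{row['avg_packet_bytes']:5.0f} B/pkt  "
          f"{row['error_pct']:.3f}% errors")
```

Under those assumptions the errors show up in the samples at and just below 100k packets per second, and they stay well under 1% of the input packets.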

dev.em.0.mac_stats.missed_packets: 2294752
dev.em.0.mac_stats.recv_no_buff: 4617837
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 9112
dev.em.0.mac_stats.xon_txd: 120
dev.em.0.mac_stats.xoff_recvd: 9112
dev.em.0.mac_stats.xoff_txd: 120
dev.em.0.mac_stats.total_pkts_recvd: 10671726540
dev.em.0.mac_stats.good_pkts_recvd: 10669413564
dev.em.0.mac_stats.bcast_pkts_recvd: 15097
dev.em.0.mac_stats.mcast_pkts_recvd: 9664
dev.em.0.mac_stats.rx_frames_64: 240300603
dev.em.0.mac_stats.rx_frames_65_127: 744037531
dev.em.0.mac_stats.rx_frames_128_255: 281908686
dev.em.0.mac_stats.rx_frames_256_511: 135974542
dev.em.0.mac_stats.rx_frames_512_1023: 172724810
dev.em.0.mac_stats.rx_frames_1024_1522: 9094467392
dev.em.0.mac_stats.good_octets_recvd: 13931850472813
dev.em.0.mac_stats.good_octets_txd: 1173620928614
dev.em.0.mac_stats.total_pkts_txd: 5912173538
dev.em.0.mac_stats.good_pkts_txd: 5912173297
dev.em.0.mac_stats.bcast_pkts_txd: 2117
dev.em.0.mac_stats.mcast_pkts_txd: 2

vmstat -i
interrupt                          total       rate
irq14: ata0                          376          0
irq20: uhci1                      437491          0
irq21: uhci0 uhci2+               541201          0
cpu0: timer                   1165155769       1997
irq256: bce0                    23965829         41
irq257: mfi0                     1297902          2
irq258: em0                   2536851814       4350
irq259: em1                   2695135942       4621
cpu2: timer                   1165155721       1997
cpu3: timer                   1165155724       1997
cpu1: timer                   1165155721       1997
Total                         9918853490      17008
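
A rough way to read the interrupt totals is to relate them to the em0 packet counters posted above. This sketch assumes the vmstat and mac_stats numbers were captured at about the same time and cover the same period. It also assumes the vmstat rate column is the total divided by uptime, so the timer line gives an uptime estimate.

```python
# Counters copied from the vmstat -i and dev.em.0.mac_stats output above.
em0_interrupts = 2536851814
em0_pkts_recvd = 10671726540   # total_pkts_recvd
em0_pkts_txd = 5912173538      # total_pkts_txd
timer_total = 1165155769       # cpu0: timer, total
timer_rate = 1997              # cpu0: timer, rate per second

# Average number of packets (both directions) handled per em0 interrupt.
pkts_per_interrupt = (em0_pkts_recvd + em0_pkts_txd) / em0_interrupts

# Uptime estimate, assuming rate = total / uptime in seconds.
uptime_days = timer_total / timer_rate / 86400

print(f"about {pkts_per_interrupt:.1f} packets per em0 interrupt")
print(f"about {uptime_days:.1f} days of uptime")
```

The em0 rate of 4350 interrupts per second is an average over that whole period, so it says little about the interrupt load during the 100,000 PPS peaks.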

[Attachment: highCPU.jpg]

stephenw10

I don't really have experience at this sort of traffic level but it seems like you should be able to do better than that on those servers. That's just a general impression though. It would be useful to get an opinion from someone more experienced.

Could this be a situation where IP fastforwarding could be usefully enabled? It can cause problems, notably with IPSec.
https://forum.pfsense.org/index.php?topic=57723.0

What hardware offloading options do you have enabled?

Steve

vman76

@stephenw10:

I don't really have experience at this sort of traffic level but it seems like you should be able to do better than that on those servers. That's just a general impression though. It would be useful to get an opinion from someone more experienced.

Could this be a situation where IP fastforwarding could be usefully enabled? It can cause problems, notably with IPSec.
https://forum.pfsense.org/index.php?topic=57723.0

Steve

I thought it could do better too but the numbers say otherwise. I have a simple ruleset of about 5 rules on each interface. I have not loaded any packages. No VPN. I do log everything to syslog but that is a requirement that I can't get away from.

Hmm, interesting option. We will not be using IPSec terminated directly on this box, so that's not an issue. However, students do use VPN clients which will go through the firewall. I have to research it more to see if anything else might break by applying it. With over 3,000 users and every device you can imagine a student might bring into a dorm room, I'm apprehensive about what it might break.

stephenw10

Hmm, I imagine it would break IPSec through the box and probably generate some complaints! It can dramatically increase throughput in some instances, though. There may be other opportunities for tuning.

Earlier I said that the ERL had an ASIC to increase throughput, but I think that was wrong (I can't edit it now). It looks like it has a closed-source IP forwarding module that can run separately on one of its 8 cores. No chance of a FreeBSD driver, but maybe an equivalent in the future.

Steve

podilarius

The results are somewhat expected. Currently pfSense is using an old pf that is single-core only. The only real reason to run pfSense on a multicore machine is for the add-ons to use the other cores while pf filtering is stuck on one.
The faster the clock speed of a single core, the more throughput you will observe. The pfSense hardware sizing guide has 2GHz machines topping out at around 500Mbps. You got it to go a bit higher. I would imagine that you could get a lot more if you had a 3.6GHz machine or an overclocked machine at 4GHz.
There has been talk about upgrading to the newer pf, but I don't know much about it or even when. Perhaps 2.2 or 2.3. It should have multicore support if based on the newer code. (Note, I am not with ESF and I don't know the plans, at all.) Just hoping that we can get to multicore/multithreaded before I need it.

vman76

@podilarius:

The results are somewhat expected. Currently pfSense is using an old pf that is single-core only. The only real reason to run pfSense on a multicore machine is for the add-ons to use the other cores while pf filtering is stuck on one.
The faster the clock speed of a single core, the more throughput you will observe. The pfSense hardware sizing guide has 2GHz machines topping out at around 500Mbps. You got it to go a bit higher. I would imagine that you could get a lot more if you had a 3.6GHz machine or an overclocked machine at 4GHz.
There has been talk about upgrading to the newer pf, but I don't know much about it or even when. Perhaps 2.2 or 2.3. It should have multicore support if based on the newer code. (Note, I am not with ESF and I don't know the plans, at all.) Just hoping that we can get to multicore/multithreaded before I need it.

I looked at the CPU requirements and saw that 3GHz was recommended, but it doesn't mention anything about the CPU architecture. The Dell 1850 at the beginning of this thread was a 3GHz Xeon, but an older architecture (800 FSB). My current 2GHz (1333 FSB) is pushing twice the traffic, so it gets kind of tricky comparing the older CPUs with the newer models.
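
The comparison can be made explicit with a little arithmetic. This sketch uses only the two rough figures from the post (3GHz versus 2GHz, and "twice the traffic") and normalizes the old box to 1.0. The dictionary layout and names are just for illustration.

```python
# Relative figures from the post; the old box is normalized to 1.0.
old_box = {"name": "PowerEdge 1850", "clock_ghz": 3.0, "relative_throughput": 1.0}
new_box = {"name": "current server", "clock_ghz": 2.0, "relative_throughput": 2.0}


def throughput_per_ghz(box):
    """Relative throughput delivered per GHz of clock speed."""
    return box["relative_throughput"] / box["clock_ghz"]


ratio = throughput_per_ghz(new_box) / throughput_per_ghz(old_box)
print(f"the newer CPU does about {ratio:.1f}x the work per GHz")
```

On those numbers the newer architecture moves roughly three times as much traffic per GHz, which is why clock speed alone is a poor guide across CPU generations.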

Do you know what the name of the actual pf process is so I could monitor it? I see that the kernel process is the one taking up all the CPU and it is across 2 cores (cpu1 em0, cpu2 em1 in my last screenshot). Is that the OS itself pulling packets off the NIC before the packet filtering process? I'm used to the Cisco ASAs, where I would look at the dispatcher process for filtering CPU usage. Not sure what the equivalent is here.

Lastly, do you know what the "top" command equivalent to Diagnostics -> System Activity is? The closest I got to it was "top -P", but it didn't show me as much detail as the System Activity menu.

Thanks for your patience with my newb questions.

podilarius

I agree it doesn't mention that, but if you went with a 1950 with faster proc, you might do well.
Not sure about the top command, but you can do a ps -ef while that is running and it would probably tell you.

stephenw10

top -SH

The hardware guide is a little outdated, as you've found.

Steve

Aluminum

From the little bit of reading I've done, it's basically about how many interrupts a second the core talking to that device can handle, so clock speed is judge, jury and executioner.
(And since newer architectures have improved IPC over time, I would think that might include interrupts as well, but I'm not sure.)

The HFT guys apparently have the same problems that busy networks do, which makes sense as both are doing tons of small random I/O.

From what I understand, if even a 4.x GHz core cannot do your workload and you can't spread it to other cores, the next step is to offload it to specialty hardware. That definitely explains some of those odd dual-core, high-clocked Xeon models out there.

stephenw10

@podilarius:

There has been talk about upgrading to the newer pf, but I don't know much about it or even when. Perhaps 2.2 or 2.3.

I missed this earlier. I'm not associated with ESF either.
The SMP-friendly pf is in FreeBSD 10, so pfSense 2.2, which will be built on that, should include it.

http://svnweb.freebsd.org/base?view=revision&revision=240233

Steve