WAN inbound stalls



  • Not sure exactly when it began, as the modem and switch were the first suspects.  With them ruled out, I don't know where to start debugging WAN inbound stalls that occur 5-10 times a day.  They drop real-time traffic such as VoIP, but don't affect outbound or LAN traffic.  I've looked at system.log and the interface stats and found nothing glaring.  The drops last about 5-10 seconds, then traffic takes off again, at times unrelated to cron or anything else I can see; rather random, though they tend to occur more often when throughput is above 50 Mbps.  Running basic NAT with a few QoS and other rules, plus a few packages, on version 2.2 amd64.

    Any pointers on where else to look would be most helpful.



  • Try disabling the QoS.



  • What's in your System log (Status - System logs) when this happens?  Is this virtual or physical, appliance or PC?



  • Drops still occur with the shaper disabled and the limiter enabled.  ICMP replies stall in concert with the throughput stalls (drops) through pfSense.  Physical PC with an AMD A6-6400K.  Nothing in the system log ever coincides in time.  Troubleshooting is somewhat constrained given it's a production box.  Stumped where to begin.



  • Replace the WAN NIC and see if the problem persists.



  • Ethernet stats don't indicate any errors or dropped packets, but I can try a replacement.  Currently using an Intel PRO/1000 PCI-X dual-port adapter.  Isn't there a more detailed debug log that can be enabled?



  • Anything in your Gateway log?  Anything in Status - RRD Graphs - Quality?  If nothing, I'm out of ideas other than a NIC & a prayer.



  • A prayer or lots of luck. Intermittents are a PITA. Graphs & logs are clean.  Guess I'll try a card swap.



  • @markn62:


    Isn't there a more detailed debug log that can be enabled?

    Or Diagnostics: Packet Capture ?

    Is the problem related to throughput speed?  Can all the NICs handle 50 Mbps easily?



  • I have one dual-port GigE PCI-X Intel 350 adapter, one Intel GigE PCI adapter, and another PCI GigE adapter, not sure of the brand.  There's also one on-board NIC, disabled.  I tried placing both WAN and LAN on the dual-port adapter and the drops got worse.  I then tried placing the WAN/LAN on the other two PCI adapters and drops still occur, albeit less frequently.  All cards test at 350+ Mbps up/down with jperf.

    I tried an unfiltered packet capture, but the 60 GB SSD fills up in ~15 minutes, so it's difficult to capture the event.  No logs or diagnostics that I've tried correlate with the timing of the events.  It sure looks like a buffer or cache is backing up, stalling, then recovering.

    Any other troubleshooting ideas?



  • Grab an El Cheapo PC from a landfill or your neighbour's basement and try with that.  Isolate the problem as best you can.  Maybe a noisy bus on your mb is giving your NICs the sharts.



  • I've tried two separate adapter sets with no improvement.  I don't see much value in trying a third pair of adapters.  I've been shotgunning this for six months and gotten nowhere.  That is the premise of this post: to hopefully learn a better way to troubleshoot this more statistically, rather than take additional shots in the dark.  It's already burned up an inordinate amount of time.  Changing out the MB is a substantial effort, and again, no statistics or logs suggest it's the problem.

    Btw, I've also tried removing all rules and shapers, no help. And removing all packages, no help. Also replaced the GigE Lan switch, no help.

    Is it possible a "System: Advanced: System Tunables" setting may be responsible?  Attached is a PRTG image showing a drop occurring.  It happens about 3 to 10 times in a 24-hour period, unrelated to traffic load or time of day.
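    Before changing tunables blindly, FreeBSD's own counters may show whether a buffer or queue is actually backing up, without enabling any extra debug log. A diagnostic sketch, assuming standard FreeBSD 10.x commands run from the pfSense shell (the signal to look for is a counter that climbs during a stall):

```shell
# mbuf cluster usage; non-zero "denied" counts mean the pool ran dry
netstat -m

# IP input-queue drops; a value that climbs during a stall points at
# the netisr input queue overflowing under load
sysctl net.inet.ip.intr_queue_drops
sysctl net.inet.ip.intr_queue_maxlen

# per-interface drop counters (-d) that summary stats can miss
netstat -i -d
```

    Sampling these immediately before and after a stall (rather than once) is what makes the comparison meaningful.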




  • How are you measuring the drop? Remote service? Service on the firewall? Service attached to the same WAN/LAN segments?

    Do you know for a fact if it's the router or the upstream?



  • Isolated to the router using:

    PRTG throughput graphing of the router LAN on a PC via a layer-2 switch; see image 10a.
    Sessoft MultiPing session of the router LAN on a PC via a layer-2 switch; see image 11a.
    Sessoft MultiPing session at the modem gateway on a laptop wired to the modem's 2nd port, bypassing the router; see image 12a.

    You'll notice that behind the router (PC), both throughput and latency coincide.  Ahead of the router (laptop), latency is not impacted.








  • I've tried two separate adapter sets with no improvement.  I don't see much value in trying a third pair of adapters.

    My suggestion was for you to try a different PC altogether.  Rule things out one by one if you can.



  • Gonna try another Ethernet adapter to see if it has any influence.  I would have bought a quad-port PCI-X card, but it won't fit my case.  So I settled on the Intel dual-port PRO/1000 MT, as it's on the FreeBSD 10.1 hardware list.  It uses a different driver, em instead of igb, which is where I'm putting most of my hope, not in hardware.



  • I upgraded to 2.2.4-RELEASE and swapped out a dual-port Intel adapter using the igb driver for an Intel PRO/1000 MT Dual Port Server Adapter (82546) using the em driver.  The new adapter's jperf TCP speed tests on the LAN port run at ~300 Mbps in both directions.  Still getting WAN stalls at the same frequency as before: short, but several throughout the day.

    Running out of ideas…


  • Netgate

    I would put a managed switch between the modem and WAN port on a blank VLAN, make a mirror port of the modem switch port and put a looping tcpdump capture on it and see what you see when it stalls.  If you just stop getting packets from the ISP, you know what your next call is - and you'll at least have some evidence level 2 support can use.

    If you don't see anything on the modem port, mirror the WAN port and run the same capture.



  • @markn62:

    I have one dual-port GigE PCI-X Intel 350 adapter, one Intel GigE PCI adapter, and another PCI GigE adapter, not sure of the brand.  There's also one on-board NIC, disabled.  I tried placing both WAN and LAN on the dual-port adapter and the drops got worse.  I then tried placing the WAN/LAN on the other two PCI adapters and drops still occur, albeit less frequently.  All cards test at 350+ Mbps up/down with jperf.

    I tried an unfiltered packet capture, but the 60 GB SSD fills up in ~15 minutes, so it's difficult to capture the event.  No logs or diagnostics that I've tried correlate with the timing of the events.  It sure looks like a buffer or cache is backing up, stalling, then recovering.

    Any other troubleshooting ideas?

    Tcpdump piped to another machine with a 4 TB or bigger hard drive might give you some breathing space; something like this:

    http://socpuppet.blogspot.co.uk/2013/05/using-netcat-to-push-dumped-traffic-to.html
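    The idea in that link can be sketched as a tcpdump-to-netcat pipe. A minimal sketch, assuming em0 is the WAN NIC; the address 192.0.2.50 and port 9999 are placeholders for the machine with the big disk:

```shell
# On pfSense: capture full frames on the WAN NIC (-s 0), flush per
# packet (-U), write the pcap stream to stdout (-w -), and ship it
# over the LAN to the remote box.
tcpdump -i em0 -U -s 0 -w - | nc 192.0.2.50 9999

# On the remote machine: listen and write everything to disk.
# (BSD nc syntax shown; GNU netcat wants "nc -l -p 9999".)
nc -l 9999 > wan_capture.pcap
```

    Note the capture stream itself consumes LAN bandwidth, so sending it out a NIC other than the ones under test keeps it from polluting the measurement.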


  • Netgate

    You can tell tcpdump to use a certain number of files of a certain size then overwrite them in a loop.  The "buffer" only has to be long enough for you to stop the dump after a drop occurs to get the info you need.

    tcpdump -i em0 -C 20 -W 100 -w filename

    Saves 100 20-MByte files named filenameXXX in a rolling capture.  (Note: on FreeBSD the interface will be em0/igb0 or similar, not eth0.)

    Initially, the file size should be something wireshark can comfortably load for you.  Later, when you know what you're looking for, you can capture larger files and filter them with tcpdump for the info you're looking for.
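    With the numbers above (-C 20, -W 100) and this thread's ~50 Mbit/s peak, quick arithmetic shows how much traffic the rolling buffer retains before overwriting itself:

```shell
# How long a rolling tcpdump buffer lasts before it wraps:
# 100 files x 20 MB at a sustained 50 Mbit/s.
files=100
size_mb=20
rate_mbps=50

total_mb=$((files * size_mb))           # on-disk footprint in MB
seconds=$((total_mb * 8 / rate_mbps))   # MB -> Mbit, divided by rate

echo "${total_mb} MB on disk = ~${seconds} s of traffic at ${rate_mbps} Mbit/s"
# prints: 2000 MB on disk = ~320 s of traffic at 50 Mbit/s
```

    A 5-10 second stall therefore leaves minutes of margin to stop the dump afterwards, well within even the 60 GB SSD mentioned earlier.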



  • Great suggestions, guys; of them I like Derelict's tcpdump loop, and I'll have to give it a try.  I did find that the latest managed-switch firmware now supports a mirror port and will soon support a packet-header-only mirror, so that remains an option.

    But before I go that route, I had a recent discovery I thought I had ruled out, but it appears relevant.  I've kept an eye out for drop patterns and now see that, although random, they hit on 15-minute increments such as 4:22pm, 4:37pm, 5:07, 6:07, etc., although nothing in the log corresponds.  However, the only cron job on a 0,15,30,45 schedule is /etc/rc.filter_configure_sync.  I changed the interval to */60 and now the drops don't occur more than once every 60 minutes.  So what is this job for, and does it have to run on such frequent intervals?  Perhaps a better question is how it might cause the drop, so I can modify or remove the root cause.
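    To confirm what that job's schedule actually is (and that the interval change stuck), the system crontab can be checked from the shell. A minimal sketch, assuming a stock pfSense 2.2 layout where system cron entries live in /etc/crontab and the filter log is in its default location:

```shell
# Show the filter-sync cron entry and its current schedule
grep filter_configure_sync /etc/crontab

# Follow the filter log across a 15-minute boundary to see whether a
# filter reload lines up with a stall (clog reads pfSense's circular logs)
clog -f /var/log/filter.log
```

    Timestamping a reload against a PRTG drop this way would turn the 15-minute correlation into direct evidence.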