Need reliable quad GigEthernet pcie card

wkk2

I have a Dell R320 with an Intel Pro 1000VT (4 Ethernet ports). The system is unreliable: OpenVPN TCP sessions drop, and SSH works and then freezes at random times. HTTPS console access fails on one port but continues to work on another. Outgoing traffic seems to restore operation. I'm running 2.0.3 and nothing shows up in the log files. It might work for 5 minutes or even an hour before trouble starts. I've tried setting the WAN port to 100 and disabling checksum offload.

Can anyone suggest a reliable quad GigEthernet PCIe card?

IOerror

I used those cards for many years. May I bug you for some info to help troubleshoot (TS)?

Are some/all of the ports connected to the same switch?

Are you trunking between the server and switch on any interfaces?
If you are trunking, is it set up for auto-negotiated trunks on the switch?
Are all or some of the connections hard-coded for 1000 Mb/s, 100 Mb/s, or 10 Mb/s?
Have you ensured your VLANs are configured properly? Native VLAN on trunks can be a monster to TS.

Are you NATing?

Are you NATing at the server or at the boundary router?
Are you using Perfect Forward Secrecy at the server and/or the boundary? This could be a complicated one to TS.
Have you done a trace to see if any packets are hitting a routing loop or being inspected by a deep packet inspection service outside the server?

Can you describe in detail what happens on which ports along with a scrubbed topology?

stephenw10

I think your bridge issue may have been confusing things earlier.
If you're running a multiport Intel card you should definitely be trying this:
https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Intel_igb.284.29_and_em.284.29_Cards

Steve

wkk2

I was doing bridging (WAN-DMZ) but I removed it to simplify my troubles.

I have a WAN port with a public address that was auto/Gig, but I've tried 100 full duplex and turned off checksum offload.

A LAN port is connected to a managed switch port that is configured as an access port (vlan1). It's set for outgoing NAT with the WAN interface IP. The interface is now disabled.

I have a DMZ port with a static private address. It's connected to an unmanaged switch that is now empty.

The 4th port is configured as a trunk and connected to the same switch as the LAN interface. The switch port is configured as a trunk with two VLANs. The first VLAN (VLAN/PVID) has two access ports that are empty; no address is assigned. The second VLAN has one associated switch access port that is connected to another network and has a static IP that gives me a back door when the WAN interface stops working.

With the LAN port disabled, I rebooted and started a TCP OpenVPN session to the WAN address with all traffic routed through the tunnel. I started pinging the private DMZ address. The tunnel failed after about 20 minutes, following a few ping timeouts and a few sporadic replies.

I was also pinging the WAN public address from another computer. There was a block of timeouts followed by a few replies, followed by complete loss. Pings to the cable modem were OK. This time, after about 5 minutes, pings to the WAN port started working again.
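
For anyone repeating this kind of test, it can help to reduce the ping log to runs of replies and timeouts, so that "a block of timeouts, a few replies, then complete loss" becomes something that can be compared between attempts. The following is only a minimal Python sketch under assumptions: it expects the per-ping outcomes to have already been collected as a list of booleans (True for a reply, False for a timeout), and the function names and the one-second interval are illustrative, not anything from pfSense.

```python
from itertools import groupby


def summarize_runs(results):
    """Collapse per-ping outcomes into (outcome, count) runs.

    `results` is an iterable of booleans: True for a reply, False for a
    timeout. How the outcomes were collected is left to the reader
    (an assumption of this sketch).
    """
    return [
        ("reply" if ok else "timeout", sum(1 for _ in group))
        for ok, group in groupby(results)
    ]


def longest_outage(results, interval_s=1.0):
    """Return the longest run of consecutive timeouts, in seconds.

    `interval_s` is the assumed spacing between pings.
    """
    runs = summarize_runs(results)
    longest = max((n for kind, n in runs if kind == "timeout"), default=0)
    return longest * interval_s
```

Feeding it the outcomes from both the external pinger and a pinger on the back-door network would show whether the outages line up in time, which points at the WAN side rather than the card.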

wkk2

I tried adding kern.ipc.nmbclusters="131072", hw.igb.num_queues=1, and hw.igb.fc_settings=0 with no luck. Web management access is gone again…

More clues from packet trace on WAN interface during failure:

I started pinging the WAN interface from an external site and while receiving timeouts, I started a packet trace on the WAN interface.

Then, from a firewall shell, I pinged an external address (this usually clears the access bottleneck).

The trace:

I see lots of echo requests from the WAN to cable modem and the replies, so they appear to be talking.

I see a few outgoing UDP.53 packets to name servers (no answers).
Also, I see a little incoming SSDP from a private address; somebody has a leaky network.

The trace shows 4 outgoing echo requests to the external server with no replies (my shell ping).

This is followed by an ARP who-has from the cable modem asking about the firewall IP address.
The firewall gives an ARP-Reply.

Things are now working.

Now I start seeing echo replies to my shell ping of the external site, and the incoming echo requests start appearing and getting answers.

I don't understand why the cable modem suddenly sends the ARP request for the firewall's WAN MAC. The firewall was sending echo requests and getting replies all through the trace.
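
To check whether every recovery is preceded by an ARP request from the modem, the trace can be dumped as text and scanned for who-has lines. This is a rough Python sketch, not a definitive tool: it assumes the capture was printed with numeric addresses in the usual tcpdump one-line form ("ARP, Request who-has X tell Y"), and the addresses in the usage example are documentation placeholders, not the real ones from this setup.

```python
import re

# Assumed line shape (tcpdump text output with numeric addresses), e.g.
#   12:00:05.000000 ARP, Request who-has 203.0.113.10 tell 203.0.113.1, length 46
# An optional "(ff:ff:ff:ff:ff:ff)" after the target address is tolerated.
ARP_REQUEST = re.compile(
    r"^(?P<time>\S+) ARP, Request who-has (?P<target>\S+)"
    r"(?: \([^)]*\))? tell (?P<sender>[^,\s]+)"
)


def arp_requests(lines):
    """Return (time, sender, target) for each ARP request line found."""
    found = []
    for line in lines:
        match = ARP_REQUEST.match(line.strip())
        if match:
            found.append((match["time"], match["sender"], match["target"]))
    return found


def requests_for(lines, target_ip):
    """Keep only the ARP requests asking for `target_ip`."""
    return [req for req in arp_requests(lines) if req[2] == target_ip]
```

Calling requests_for(trace_lines, "203.0.113.10") with the firewall's WAN address in place of the placeholder gives the timestamps of the modem's who-has requests, which can then be lined up against the moments when the external pings start getting answers.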

IOerror

That is interesting indeed. Have you spoken with your ISP yet? You could provide a scrubbed pcap and see if there is an issue with the modem and/or the circuit you are riding across. You could also ask them to do a MAC flush and reboot the modem. One way to isolate this to the modem is to connect a laptop/PC directly to the modem and pcap the same tests. If you have a spare router lying around, you could hook the pfSense box to it and see if the same behavior exists. Granted, you won't have VPN.

I hate pointing fingers at someone else, but maybe it's a mix of the ISP and the Server. It seems like you've done everything you can on the server side.

stephenw10

I wouldn't disable flow control unless you have some really good reason for doing so.

Steve

wkk2

SOLVED… The Motorola cable modem configuration was reloaded and upgraded from 3.3.1 to 3.5.8, and the modem was finally replaced with a Ubee DDW3611. Replacing the modem solved the problem with the intermittent incoming/ARP issues.

So the Pro 1000VT is now working. I never determined why the firewall ARP replies wouldn't satisfy the old modem. It might have been a CM issue or MSO problem that was cleared by replacing the modem.

Thanks for all the help.