Web sites not loading when accessing pfSense through VLAN trunk.



  • Firewall in pfSense is set to allow all on all interfaces except WAN.

    setup that doesn't work:

    
    PC -> (VLAN16 port)
                 |
             [switch]
                 |
         (LAGG ports, trunk)
              |  |  | (ports 3-5)
             [pfSense] -> cable modem                  
    
    

    The result is that DHCP works, the rest of VLAN16 is accessible, the route to management VLAN1 works, and sites on the internet are generally reachable, but most don't finish loading; some stop as soon as the <title> tag has been transmitted.

    setup that works fine:

    PC -> (VLAN80 port)
                |
              [switch]
                |
            (VLAN80 port)
                |  (LAN port)
            [pfSense] -> cable modem

    All works fine this way.

    I have set up DHCP for the LAN and OPT1 ports, which is how most of my network has operated to date. Now that I have a proper managed switch, I decided to segregate my traffic into VLANs, mainly to facilitate running a bunch of VMs on my network, splitting their WAN-directed ports into different VLANs to allow pfSense to easily control the traffic (WWW servers on VLAN16, p2p on VLAN32, users and other LAN PCs on VLAN128). When I connect via the VLAN/LAGG trunk, connecting my PC to VLANs 16/32/128, the internet connection becomes a bit laggy and web sites don't load properly. But things work in general.

    This might seem like an overly elaborate config, which it is, because I'm just learning this stuff. Being a noob at this, I don't know where to look for possible causes of the problem.

    Ports 3-5 are aggregated with LACP. I can try using only one port to see if that is the source of the problem, but things have generally worked there - I installed several VMs through the aggregated ports, downloading the base packages without a single problem.

    The hardware is:
    Watchguard firebox running pfSense (was x700)
    Dell Powerconnect 5324 (switch)
    also present are:
    Dell T7400 workstation (my PC, runs win vista)
    Dell Poweredge 1900 (server running VMs on xen)

    the full setup

                [other LAN PCs]
                    |    |    |
    PC -> (generic gigabit switch) -> (VLAN128 ports)  <- [other LAN PCs]
                                            |
                                      [5324 switch] <- (LACP aggregated VLAN trunk) <-[PE1900]
                                            |                                            |
                              (LACP aggregated VLAN trunk)                      (old LAN interface)
                                          | | |                                          |
    [cable modem] <- (WAN port) <-  [pfSense on Firebox] -> (LAN port)->[other generic gigabit switch]
                                                                                        | | |
                                                                                  [PCs on old LAN]

    Where should I look for the source of the problem? What settings should I post to help solve it? Most things work; one other annoyance I noticed is that ssh connections to PE1900 break up after a bit of inactivity. I haven't had that happen when not using VLANs.


  • Netgate Administrator

    Are you seeing any 'watchdog timeout' errors in the system logs? This is a common problem on those fireboxes.

    connecting my PC to VLANS 16/32/128, the internet connection becomes a bit laggy

    At the same time? You can't do that without some far more complex config; it would break routing.

    You also seem to have two connections from your PE1900 to pfSense which could potentially introduce routing problems if you haven't taken steps to avoid it.

    The way to do anything like this is one step at a time. Don't set up a complex network and expect all aspects of it to work, unless you're lucky/very good! If you have made a mistake somewhere, it is very difficult to find because you have introduced so many new things at once.
    Take out the LAGG setup until you have other stuff working reliably.

    Steve



  • @stephenw10:

    Are you seeing any 'watchdog timeout' errors in the system logs? This is a common problem on those fireboxes.

    I'll take a look, thanks.

    @stephenw10:

    connecting my PC to VLANS 16/32/128, the internet connection becomes a bit laggy

    At the same time? You can't do that without some far more complex config; it would break routing.

    What I meant was switching (on the 5324) the port to which the test PC was connected so that it carried traffic for each of these VLANs in turn. My point was that the problem occurred when connecting to any of the three VLANs I set up, which went through the VLAN trunk rather than being partitioned on the switch, as in the working configuration.

    @stephenw10:

    You also seem to have two connections from your PE1900 to pfSense which could potentially introduce routing problems if you haven't taken steps to avoid it.

    The link between PE1900 and 5324 switch carries only VLAN tagged traffic, so it shouldn't clash with my pre-VLAN network. At least to my knowledge :). The previous config had PE1900 appear on the LAN as a NAS box with a static DHCP lease, and it still does that not to aggravate other LAN users :). At least until I can get the new setup to work.

    @stephenw10:

    The way to do anything like this is one step at a time. Don't set up a complex network and expect all aspects of it to work, unless you're lucky/very good! If you have made a mistake somewhere, it is very difficult to find because you have introduced so many new things at once.

    Take out the LAGG setup until you have other stuff working reliably.

    Thanks, I certainly will do that. If this doesn't sound like some more or less common issue, I've no other option than to take this new network apart and start testing it piece by piece. That's not to say I just plugged it in yesterday and prayed it would work; it's just that now, when I'm this far into the upgrade, I noticed my internet connection was laggy when on VLANs, while everything works fine when I plug my PC into the other switch.



  • So I tried testing whether it's an MTU issue and didn't come up with any conclusive results. Regardless of MTU settings, certain websites didn't load when connected through a VLAN. I only need to set MTU on the WAN interface, right?

    Then I tried to switch the LAGG from LACP mode to failover mode (effectively just one cable transmitting, others in standby). Didn't work as expected - no traffic would come through that link at all, until I switched it back into LACP mode. Then the normal LAN traffic resumed on the LAGG and I could connect to the switch, but VLAN trunk didn't work anymore. I guess this might be a bug, when switching LAGG modes?

    Anyhow, I'm suspecting that packets are getting messed up somewhere inside my network, just not sure where. The path is

    
                     [5324 switch]                             [pfSense]
    [PC]->{ (VLAN128 port)->(VLAN trunk) }->{ (VLAN trunk)->(VLAN128 interface)->(WAN) }->[modem] 
    
    

    Any advice on how to test this?


  • Netgate Administrator

    @superbob:

    Regardless of MTU settings, certain websites didn't load when connected through a VLAN.

    The same websites every time or more random?

    Anything in the firewall logs?

    Do you see packet loss in a ping test? Traceroute?

    Steve



  • @stephenw10:

    The same websites every time or more random?

    http://www.bodybuilding.com/ is one example. I've also had problems accessing imgur.com and support.dell.com, and a bunch of other sites I don't remember now. I bookmarked a set of sites that fail to load to varying degrees (from just the start of the <title> tag to parts of the site's top navigation). The problem recurs consistently when connecting via the VLAN trunk; after assigning a different VLAN on the switch (VLAN80, effectively bridging the pfSense LAN interface to my PC), pages load quickly after refreshing in the browser.

    @stephenw10:

    Anything in the firewall logs?

    I have firewall logs? ;) If stuff gets logged automatically then I probably do; otherwise I'll enable logging of the VLAN interfaces and see if anything unusual comes up when accessing problem sites.

    @stephenw10:

    Do you see packet loss in a ping test? Traceroute?

    Pinging with packet size 1473-1474 always fails (is this typical? I've googled this and only found inconclusive answers); otherwise it appears to work as normal unless I reduce the MTU on the WAN port. With WAN MTU=1500 I can ping with 1472 bytes without fragmenting (which is just as it should be). When I reduce the MTU on the WAN port, I get no ping replies for pings within the range ( 1472 - (1500 - current_mtu) ) to 1472. In other words, the range in which I lose packets moves down by the number of bytes deducted from the WAN MTU. Interestingly, when pinging within that range with the no-fragment flag, packets are not fragmented. Could this indicate a problem with path MTU discovery on my PC?

    That said, this appears to happen regardless of whether I'm connected via the VLAN trunk or 'directly'.

    I've used mturoute (http://www.elifulkerson.com/projects/mturoute.php) on my PC to see my MTU at the problem sites and it detects 1500. Is this right?
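
The 1472-byte ceiling mentioned in this post follows from header arithmetic: a ping payload shares the MTU with a 20-byte IPv4 header and an 8-byte ICMP header. A minimal sketch of that calculation, assuming IPv4 without options; the function names are made up for illustration:

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte ICMP header.
max_ping_payload() {
    echo $(( $1 - 20 - 8 ))
}

# Total IP packet size produced by "ping -s <payload>".
ping_packet_size() {
    echo $(( $1 + 20 + 8 ))
}

max_ping_payload 1500    # prints 1472
ping_packet_size 1474    # prints 1502
```

So a 1473- or 1474-byte ping is the first size that cannot travel as a single packet at MTU 1500 and has to be fragmented.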


  • Netgate Administrator

    @superbob:

    Pinging with packet size 1473-1474 always fails.

    Starting to look more and more like an MTU problem.
    As I recall, if you are using a NIC that doesn't support hardware VLAN tagging, then the tag has to be included in the packet, reducing the MTU. I believe it's only 4 bytes, but this could be combining with some other problem to produce the effect you are seeing.

    Since you aren't seeing any problem on your LAN interface I would suggest that this is happening in the VLAN stage of the transfer. Try reducing the MTU of your internal interface somehow. Perhaps:
    http://forum.pfsense.org/index.php/topic,47387.0.html
    Though I would expect that to be taken care of without issue. It's as if you have a 'do not fragment' flag set somewhere?

    Steve
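
To put numbers on the 4 bytes mentioned above: an 802.1Q tag adds 4 bytes to the Ethernet frame, so a full 1500-byte packet needs a 1522-byte frame instead of 1518. The sketch below works through that, and what the VLAN MTU would have to be if the NIC could not carry the longer frame. Function and variable names are illustrative only, not anything pfSense provides:

```shell
#!/bin/sh
ETH_HEADER=14   # destination MAC + source MAC + EtherType
ETH_FCS=4       # frame check sequence
DOT1Q_TAG=4     # 802.1Q VLAN tag

# On-wire frame size for a given IP MTU; pass "tagged" as the
# second argument to include the VLAN tag.
frame_size() {
    fs_size=$(( $1 + ETH_HEADER + ETH_FCS ))
    if [ "${2:-}" = "tagged" ]; then
        fs_size=$(( fs_size + DOT1Q_TAG ))
    fi
    echo "$fs_size"
}

# Largest VLAN MTU that still fits into untagged-size frames,
# i.e. when the parent NIC cannot handle long frames.
vlan_mtu_without_long_frames() {
    echo $(( $1 - DOT1Q_TAG ))
}

frame_size 1500                      # prints 1518
frame_size 1500 tagged               # prints 1522
vlan_mtu_without_long_frames 1500    # prints 1496
```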



  • Ok, so I did some testing and it looks like it may be a driver issue. I tried pinging hosts on my network with various packet sizes and it looks like the firebox doesn't like to receive packets sized 1474 bytes and to send them out via VLAN interfaces. Transmitting on normal interfaces is fine, as is receiving the ping replies.

    here's a log of me pinging the PE1900 via the old LAN and via VLAN'ed network: http://pastebin.com/xvd2gd86
    I tail the syslog to see if outgoing traffic causes watchdog timeouts. So far it looks like only incoming traffic does, more on that later. Also what we see here is that the particular packet size is only a problem when going through a vlan trunk. Also, I did 'ifconfig' at the end.

    I attach the results of pinging the pfSense box and my laptop from PE1900, results of doing:

    
    for (( size=1470; size<=1492; size++ )); do ping -s $size -c1 192.168.0.22; done > ping_laptop.txt
    for (( size=1470; size<=1492; size++ )); do ping -s $size -c1 192.168.0.1; done > ping_pfsense.txt
    for (( size=1470; size<=1492; size++ )); do ping -s $size -c1 192.168.10.1; done > ping_pfsenseVlan.txt
    
    

    So it looks like even though it's possible to directly send out requests of 1474 bytes, the firebox has trouble receiving and processing them. Also, pinging the pfSense box with 1473/1474 bytes causes a watchdog timeout on the interface. Pinging VLANs causes timeouts on random LAGG interfaces which receive the packet.

    Here's how it looks when I run tcpdump: http://pastebin.com/xkAcD1Q2

    Just to be sure it's not an issue with the LACP/VLAN trunk on the switch, I connected the laptop to VLAN 16 and pinged it from PE1900: http://pastebin.com/5pPuykqX

    I'm still not sure if this has anything to do with websites not loading for me. Assuming it does, would changing the MTU on my VLAN interfaces help? Should I make it higher or lower? And more importantly, how do I change it?

    [2.0.1-RELEASE][root@pfsense.bobnet]/root(90): ifconfig re3 mtu 1492
    ifconfig: ioctl (set mtu): Invalid argument
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(91): ifconfig lagg0 mtu 1492
    ifconfig: ioctl (set mtu): Invalid argument
    
    

    I can't change the MTU on the LAGG interfaces, or on the LAGG itself. I know the interfaces themselves should support changing of the MTU:

    
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(92): ifconfig re0 mtu 1492
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(93): ifconfig re1 mtu 1492
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(94): ifconfig re2 mtu 1492
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(95): ifconfig re3 mtu 1492
    ifconfig: ioctl (set mtu): Invalid argument
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(96): ifconfig re4 mtu 1492
    ifconfig: ioctl (set mtu): Invalid argument
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(97): ifconfig re5 mtu 1492
    ifconfig: ioctl (set mtu): Invalid argument
    
    

    Is this a matter of tricking the ifconfig script by assigning the LAGG ports an IP addr? Or… is this a bigger issue?

    ping_laptop.txt
    ping_pfsense.txt
    ping_pfsenseVlan.txt
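
The size sweep in the post above writes the full ping output to files. A variation that prints only the payload sizes that got no reply can make the failing band easier to spot. This is a sketch, not something taken from the thread: `failed_ping_sizes` and the `PING_CMD` override are hypothetical names, and the override exists only so the loop can be exercised without a network.

```shell
#!/bin/sh
# Print every payload size in [first, last] for which a single ping
# to the host gets no reply. PING_CMD is a hypothetical override that
# lets a stand-in command replace ping.
failed_ping_sizes() {
    fp_host=$1
    fp_size=$2
    fp_last=$3
    while [ "$fp_size" -le "$fp_last" ]; do
        if ! ${PING_CMD:-ping} -s "$fp_size" -c 1 "$fp_host" >/dev/null 2>&1; then
            echo "$fp_size"
        fi
        fp_size=$(( fp_size + 1 ))
    done
}
```

Run as `failed_ping_sizes 192.168.10.1 1470 1492`, it would list only the sizes that are dropped on the way to the VLAN interface.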


  • Netgate Administrator

    This could be an important result in narrowing down the watchdog timeout errors that have plagued that box. Until now, about the best advice we have had is to use a managed switch connected to the box, based on anecdotal evidence that fragmented packets can cause a problem.

    Anyway I think you will have to bring the interface down before changing the MTU or possibly remove it from the LAGG? Though I have no problem doing that on my box.  :-\ You want to be doing this on the VLAN parent interface but this is the LAGG interface for you. Hmm. Trial and error I think!

    Also I see that the re driver and the chip support hardware VLAN tagging. It might be interesting to try disabling it, especially vlanmtu:
    @ifconfig:

    vlanmtu, vlanhwtag, vlanhwfilter, vlanhwtso
        If the driver offers user-configurable VLAN support, enable
        reception of extended frames, tag processing in hardware, frame
        filtering in hardware, or TSO on VLAN, respectively.  Note that
        this must be issued on a physical interface associated with
        vlan(4), not on a vlan(4) interface itself.

    -vlanmtu, -vlanhwtag, -vlanhwfilter, -vlanhwtso
        If the driver offers user-configurable VLAN support, disable
        reception of extended frames, tag processing in hardware, frame
        filtering in hardware, or TSO on VLAN, respectively.

    Steve



  • Some Realtek NICs have broken long frame support, so when you're trying to pass packets that have the full 1500 MTU and then add the VLAN tag, they refuse to send or receive them. Your symptoms match that scenario 100%. Not sure anyone has done VLANs on those boxes, but it'd be far from the first NIC issues people have seen on them.


  • Netgate Administrator

    I am having difficulty understanding the various flags reported by ifconfig but it seems to me that the vlanmtu flag is used by the VLAN driver to determine whether or not the interface supports VLAN frames larger than 1500. If it is set and the interface in fact does not support this then it could be the cause of many problems. There is a suggested solution to this:

    The vlan driver automatically recognizes devices that natively support
    long frames for vlan use and calculates the appropriate frame MTU based
    on the capabilities of the parent interface.  Some other interfaces not
    listed above may handle long frames, but they do not advertise this
    ability of theirs.  The MTU setting on vlan can be corrected manually if
    used in conjunction with such a parent interface.

    My own X700 box has died completely so I can't check.  :(
    What MTU size has the VLAN driver determined is correct on your box? What flags are reported by ifconfig on the re interfaces?

    Steve


  • Netgate Administrator

    Here's a quote from YongHyeon PYUN, the re(4) maintainer/author:
    @http://freebsd.1045724.n5.nabble.com/Abysmal-re-4-performance-under-8-1-STABLE-mid-August-td3946608.html:

    I'm sure this has nothing to do that this issue.
    If you want to disable checksum offloading of VLAN
    interface, use vlan interface instead of parent interface
    of the VLAN interface(i.e. ifconfig vlan0 -txcsum -rxcsum).
    And you can't disable VLAN_MTU on re(4). There is no
    reason to disable supporting VLAN oversized frames.

    So perhaps a manual MTU reduction is necessary.

    Steve



  • Thanks for all the insight!

    So far I tried disabling all the hardware features on the LAGG interfaces and reducing the MTU. I'll try doing one thing at a time this weekend, just wanted to see the result of the extreme set of changes. I had to delete the LAGG and all the VLANs to be able to change the underlying interfaces.

    There are two differences I noticed so far:
    1 - now I can ping VLAN PCs with all the packet sizes with no packet loss. I.e. doing ping -v -c 1 -g 1470 -G 1492 -S 192.168.10.1 192.168.10.64 on the pfSense box doesn't exhibit packet loss anymore.
    2 - while pinging the pfSense box via the VLAN interfaces (with hw. features disabled), when using packet sizes that didn't work before (~1474, basically MTU - 28), there are no echo replies detected in tcpdump. Previously, echo replies were logged but nothing got out

    Also, no watchdog timeouts logged yet, but I've done no stress testing yet either.

    Certain web sites are still inaccessible when connecting via VLAN interfaces. Actually, I think it even got worse since I can't even load imgur now, while previously it was just a matter of refreshing the page until the main pic loaded.

    Next up I'll try disabling the LAGG. Even though I don't suspect it of inducing errors, it doesn't make changing interface settings any easier. It did inherit all the relevant changes I made to the interfaces, like MTU and hw. features.


  • Netgate Administrator

    @superbob:

    2 - while pinging the pfSense box via the VLAN interfaces (with hw. features disabled), when using packet sizes that didn't work before (~1474, basically MTU - 28), there are no echo replies detected in tcpdump. Previously, echo replies were logged but nothing got out.

    Hmm, anything in the firewall log? Did you reinstate the firewall rules? Easily overlooked.  ;)
    Looks like you're making some progress.

    Steve



  • What I meant in point 2:

    PE1900, pinging normal interface with HW acceleration enabled:

    
    root@bobeus:~# ping -c3 -s 1474 -I 192.168.2.16 192.168.2.1
    PING 192.168.2.1 (192.168.2.1) from 192.168.2.16 : 1474(1502) bytes of data.
    ^C
    --- 192.168.2.1 ping statistics ---
    3 packets transmitted, 0 received, 100% packet loss, time 2015ms
    
    

    tcpdump on pfSense box:

    
    23:58:29.303739 IP 192.168.2.16 > 192.168.2.1: ICMP echo request, id 22611, seq 1, length 1480
    23:58:29.303762 IP 192.168.2.16 > 192.168.2.1: icmp
    23:58:29.303871 IP 192.168.2.1 > 192.168.2.16: ICMP echo reply, id 22611, seq 1, length 1480
    23:58:29.303875 IP 192.168.2.1 > 192.168.2.16: icmp
    23:58:30.316443 IP 192.168.2.16 > 192.168.2.1: ICMP echo request, id 22611, seq 2, length 1480
    23:58:30.316464 IP 192.168.2.16 > 192.168.2.1: icmp
    23:58:30.316517 IP 192.168.2.1 > 192.168.2.16: ICMP echo reply, id 22611, seq 2, length 1480
    23:58:30.316521 IP 192.168.2.1 > 192.168.2.16: icmp
    23:58:31.329564 IP 192.168.2.16 > 192.168.2.1: ICMP echo request, id 22611, seq 3, length 1480
    23:58:31.329586 IP 192.168.2.16 > 192.168.2.1: icmp
    23:58:31.329646 IP 192.168.2.1 > 192.168.2.16: ICMP echo reply, id 22611, seq 3, length 1480
    23:58:31.329650 IP 192.168.2.1 > 192.168.2.16: icmp
    
    

    PE1900, pinging a VLAN interface with hw acceleration disabled:

    
    root@bobeus:~# ping -c3 -s 1468 -I 192.168.10.64 192.168.10.1
    PING 192.168.10.1 (192.168.10.1) from 192.168.10.64 : 1468(1496) bytes of data.
    ^C
    --- 192.168.10.1 ping statistics ---
    3 packets transmitted, 0 received, 100% packet loss, time 2015ms
    
    

    tcpdump on pfSense box:

    
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(19): tcpdump -i re3_vlan128 host 192.168.10.64
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on re3_vlan128, link-type EN10MB (Ethernet), capture size 96 bytes
    00:03:14.496355 IP 192.168.10.64 > 192.168.10.1: ICMP echo request, id 22616, seq 1, length 1476
    00:03:15.500857 IP 192.168.10.64 > 192.168.10.1: ICMP echo request, id 22616, seq 2, length 1476
    00:03:16.505921 IP 192.168.10.64 > 192.168.10.1: ICMP echo request, id 22616, seq 3, length 1476
    00:03:19.529010 ARP, Request who-has 192.168.10.1 tell 192.168.10.64, length 42
    00:03:19.529036 ARP, Reply 192.168.10.1 is-at 00:90:7f:2e:84:db (oui Unknown), length 28
    ^C
    5 packets captured
    5 packets received by filter
    0 packets dropped by kernel
    
    

    PE1900, pinging a VLAN interface with a smaller payload:

    
    root@bobeus:~# ping -c3 -s 1452 -I 192.168.10.64 192.168.10.1
    PING 192.168.10.1 (192.168.10.1) from 192.168.10.64 : 1452(1480) bytes of data.
    1460 bytes from 192.168.10.1: icmp_req=1 ttl=64 time=0.503 ms
    1460 bytes from 192.168.10.1: icmp_req=2 ttl=64 time=0.443 ms
    1460 bytes from 192.168.10.1: icmp_req=3 ttl=64 time=0.434 ms
    
    --- 192.168.10.1 ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 1998ms
    rtt min/avg/max/mdev = 0.434/0.460/0.503/0.030 ms
    
    

    tcpdump on pfSense box:

    
    [2.0.1-RELEASE][root@pfsense.bobnet]/root(25): tcpdump -i re3_vlan128 host 192.168.10.64
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on re3_vlan128, link-type EN10MB (Ethernet), capture size 96 bytes
    00:08:47.328097 IP 192.168.10.64 > 192.168.10.1: ICMP echo request, id 22659, seq 1, length 1460
    00:08:47.328201 IP 192.168.10.1 > 192.168.10.64: ICMP echo reply, id 22659, seq 1, length 1460
    00:08:48.332135 IP 192.168.10.64 > 192.168.10.1: ICMP echo request, id 22659, seq 2, length 1460
    00:08:48.332183 IP 192.168.10.1 > 192.168.10.64: ICMP echo reply, id 22659, seq 2, length 1460
    00:08:49.336806 IP 192.168.10.64 > 192.168.10.1: ICMP echo request, id 22659, seq 3, length 1460
    00:08:49.336850 IP 192.168.10.1 > 192.168.10.64: ICMP echo reply, id 22659, seq 3, length 1460
    ^C
    6 packets captured
    6 packets received by filter
    0 packets dropped by kernel
    
    

    I've also looked at the interface itself, no replies visible either.

    Pinging with large payloads (2000+) works well, whenever the packets are fragmented.

    Question - if I set the interface/VLAN interface MTU very low, say 300, should a ping of 1200 bytes directed to it be automatically split up into smaller chunks? Or should it always fail (like it does here)? I think I need to read up on the basics…


  • Netgate Administrator

    If you don't have 'do not fragment' set then it should simply fragment the packets into suitably sized frames. The problem is how it decides whether it needs to do that and how it decides what a suitable size is.
    To be honest this is now well outside my own experience!  ;)

    Steve



  • @superbob:

    Question - if I set the interface/VLAN interface MTU very low, say 300, should a ping of 1200 bytes directed to it be automatically split up into smaller chunks? Or should it always fail (like it does here)? I think I need to read up on the basics…

    It'll get dropped, can't accept frames larger than your MTU. Nothing in the path to fragment it.
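
For the cases where fragmentation does happen (a sender or router splitting a packet to fit its own outgoing MTU), the number of fragments can be estimated as below. This is only a sketch under stated assumptions: a 20-byte IPv4 header without options, an 8-byte ICMP header, and fragment data sized in multiples of 8 bytes. The function name is made up for illustration. It says nothing about a receiver whose MTU is smaller than the sender's, which is the drop case described above.

```shell
#!/bin/sh
# Number of IPv4 fragments needed for "ping -s <payload>" when the
# sending side fragments to <mtu>.
# Assumes a 20-byte IPv4 header and an 8-byte ICMP header; every
# fragment except the last carries a multiple of 8 data bytes.
ping_fragments() {
    pf_payload=$1
    pf_mtu=$2
    pf_ip_data=$(( pf_payload + 8 ))            # ICMP header + payload
    pf_per_frag=$(( (pf_mtu - 20) / 8 * 8 ))    # data bytes per fragment
    echo $(( (pf_ip_data + pf_per_frag - 1) / pf_per_frag ))
}

ping_fragments 1472 1500    # prints 1
ping_fragments 1474 1500    # prints 2
ping_fragments 1200 300     # prints 5
```

The second case matches the earlier tcpdump output, where each 1474-byte ping shows up as a 1480-byte piece followed by a second, unlabelled icmp fragment.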



  • I tried, tested, tuned and couldn't get it to work in a reasonable amount of time. So I dropped the idea of using VLANs on the pfSense/firebox combo.

    Instead of aggregating four ports into a LAGG and passing VLANs through that, I just mapped four ports on the switch to different VLANs and set up the interfaces normally. It works fine this way, but I really liked the idea of having a theoretical throughput of 400Mb/s to play with, along with a flexible number of VLAN interfaces to control.


  • Netgate Administrator

    A disappointing result, but hopefully it will save someone else some time.  ::)
    I'm sure it could be made to work but whether it would be worth the effort or not is debatable. It would probably be easier to just put an Intel gigabit card in the PCI slot with the case mods that requires.

    Steve

