pfsense 2.4.4 Rel.2 checksum error / after reboot fine for 20 sec



  • Hello,

    I am facing a problem: my pfSense reports checksum errors even though TCP offloading is disabled. Right after I reboot the box I can reach the services behind the pfSense (INTERNAL => DMZ) without any problems, but after about 20 seconds the connections get stuck again. All services can be reached from outside the LAN without any issues at all 😉

    The problem is that not all services are affected. Services like SSH or ping work fine.

    So I am asking here whether anyone else has noticed this behavior?

    thanks


  • Netgate Administrator

    Works but then fails after 20s sounds like asymmetric routing:
    https://docs.netgate.com/pfsense/en/latest/firewall/troubleshooting-blocked-log-entries-due-to-asymmetric-routing.html

    Do you see blocked traffic in the firewall log when it fails?

    Steve



  • Gateway set when it should not be set

    If a gateway is set on an internal interface, such as LAN, it can cause problematic behavior. Setting a gateway on an internal interface will tag that interface’s outbound rules with route-to, and inbound rules with reply-to which will cause packets to be forwarded to the defined gateway rather than following their natural path. For WANs this is typically a good thing! For LANs it is not. Among other ill effects, it can lead to a loop of sorts where packets bounce between the firewall and the defined gateway, eventually being blocked or dropped when their TTL expires.

    I don't think this is the case, since TCP on port 22 passes to the host without issues. HTTPS on port 445 (a non-nginx Apache service) also passes the firewall to the dedicated host. Only HTTPS to the nginx server gets lost.

    So if this were a "sloppy state" problem, why are other packets / protocols / ports not affected?

    Addition: neither interface has a gateway set, since they are real subnet interfaces:

    VLAN_DMZ => 192.168.13.2
    INTERNAL => 192.168.11.2
    

    So in general the pfSense box takes care of routing between the specified LANs, and the firewall rules are in place to allow communication between both networks.


  • Netgate Administrator

    I would expect all TCP traffic to be affected. There would need to be some other route involved too, so something else routing between Internal and DMZ.

    Do you see blocked traffic in the firewall log though?

    Steve



  • No, I don't see any blocked traffic; if that were the case I would open the port. So the firewall rule logs are empty. I made a pcap of traffic to the specified service from VLAN_DMZ:

    21:36:00.307570 0c:c4:7a:ad:28:85 > 00:0c:29:dd:25:02, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        192.168.13.2.21281 > 192.168.13.17.443: Flags [S], cksum 0x769e (correct), seq 3168589996, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 871614 ecr 0], length 0
    21:36:00.307757 00:0c:29:dd:25:02 > 0c:c4:7a:ad:28:85, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        192.168.13.17.443 > 192.168.13.2.21281: Flags [S.], cksum 0xcb12 (correct), seq 4180594802, ack 3168589997, win 28960, options [mss 1460,sackOK,TS val 5861164 ecr 871614,nop,wscale 7], length 0
    21:36:00.307797 0c:c4:7a:ad:28:85 > 00:0c:29:dd:25:02, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
        192.168.13.2.21281 > 192.168.13.17.443: Flags [.], cksum 0x68fe (correct), seq 1, ack 1, win 513, options [nop,nop,TS val 871614 ecr 5861164], length 0
    21:36:00.308000 0c:c4:7a:ad:28:85 > 00:0c:29:dd:25:02, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        192.168.13.2.21285 > 192.168.13.17.443: Flags [S], cksum 0x30a5 (correct), seq 3470265510, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 871614 ecr 0], length 0
    21:36:00.308105 00:0c:29:dd:25:02 > 0c:c4:7a:ad:28:85, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
        192.168.13.17.443 > 192.168.13.2.21285: Flags [S.], cksum 0x9d81 (correct), seq 782795409, ack 3470265511, win 28960, options [mss 1460,sackOK,TS val 5861164 ecr 871614,nop,wscale 7], length 0
    

    This is from the Client:

    21:38:49.994816 60:f8:1d:cb:cb:9c > 0c:c4:7a:ad:28:86, ethertype IPv4 (0x0800), length 78: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 64)
        192.168.11.81.63884 > 192.168.13.17.443: Flags [S], cksum 0xbf2a (correct), seq 1173459999, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 119996831 ecr 0,sackOK,eol], length 0
    

    wireshark

    routing table

    Here is a pcap of SSH traffic to the IP:

    22:15:36.684419 0c:c4:7a:ad:28:85 > 00:0c:29:dd:25:02, ethertype IPv4 (0x0800), length 102: (tos 0x48, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 88)
        192.168.11.81.65076 > 192.168.13.17.22: Flags [P.], cksum 0x68ed (correct), seq 920942514:920942550, ack 1142215424, win 2048, options [nop,nop,TS val 122192186 ecr 6446297], length 36
    22:15:36.684770 00:0c:29:dd:25:02 > 0c:c4:7a:ad:28:85, ethertype IPv4 (0x0800), length 102: (tos 0x10, ttl 64, id 36928, offset 0, flags [DF], proto TCP (6), length 88)
    

  • Netgate Administrator

    The only thing that looks like an issue there is that one TCP retransmission packet. The checksums all look good there too.

    Need to see more of the failing traffic. Or a pcap covering the traffic at the point where it fails.

    Steve



  • Reverted, still no success. I reinstalled the whole pfSense and still can't access port 443 through the pfSense. This is still port 443 related. 🐛

    I have uploaded the pcap here, if you would like to view the scenario: pcap

    The filter needs to be set to: ip.addr eq 192.168.13.17 and tcp.port==443 to view the problem from the source host.

    I am still debugging this error, and yes, it works while the pfSense reboots. So I don't know what the pfSense does in the first 20 seconds.

    The routing table looks like this:

    Destination        Gateway            Flags     Netif Expire
    default            4x.237.xx.1       UGS        igb0
    1.1.1.1            4x.237.xx.1       UGHS       igb0
    4.2.2.2            4x.237.xx.1       UGHS       igb0
    8.8.4.4            4x.237.xx.1       UGHS       igb0
    4x.237.xx.0/26    link#1             U          igb0
    4x.237.xx.8       link#1             UHS         lo0
    4x.237.xx.10      link#4             UHS         lo0
    4x.237.xx.11      link#2             UHS         lo0
    4x.237.xx.12      link#3             UHS         lo0
    82.212.30.1        4x.237.xx.1       UGHS       igb0
    95.xx.xx.178       4x.237.xx.1       UGHS       igb0
    127.0.0.1          link#10            UH          lo0
    192.168.10.0/24    link#8             U          igb7
    192.168.10.2       link#8             UHS         lo0
    192.168.11.0/24    link#7             U          igb6
    192.168.11.2       link#7             UHS         lo0
    192.168.13.0/24    link#6             U          igb5
    192.168.13.2       link#6             UHS         lo0
    192.168.177.0/24   link#5             U          igb4
    192.168.177.10     link#5             UHS         lo0
    

    All of a sudden SSH now gets lost as well. So there might be a problem between the pfSense and the network.

    This is how my setup looks: I have a single interface for each subnet. DMZ uses its own, and so does INTERNAL. I have not added any routing table entries, since I thought pfSense would take care of it.

    • DMZ (igb5) 192.168.13.2 (pfsense)
    • INTERNAL (igb6) 192.168.11.2 (pfsense)
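
As a side note on "no routing table entries": the table above already contains connected routes for both subnets, so the firewall knows where to send this traffic. A minimal sketch of a longest-prefix lookup over the private entries from that table follows. The redacted WAN host routes are left out, and the `lookup` helper is purely illustrative, not something pfSense exposes.

```python
import ipaddress

# Networks and interfaces copied from the routing table in the post.
# The redacted WAN entries are omitted; only the default route is kept.
ROUTES = [
    ("0.0.0.0/0", "igb0"),         # default route
    ("192.168.10.0/24", "igb7"),
    ("192.168.11.0/24", "igb6"),   # INTERNAL
    ("192.168.13.0/24", "igb5"),   # DMZ
    ("192.168.177.0/24", "igb4"),
]


def lookup(dst):
    """Return the interface of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = []
    for net, netif in ROUTES:
        network = ipaddress.ip_network(net)
        if addr in network:
            matches.append((network.prefixlen, netif))
    # The longest prefix (most specific route) wins.
    return max(matches)[1]


print(lookup("192.168.13.17"))  # igb5
print(lookup("192.168.11.81"))  # igb6
```

So from the routing table alone, INTERNAL to DMZ traffic should leave on igb5 unless a firewall rule forces it elsewhere.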


  • OK, I need some more help, but I believe I have figured out the problem. I have multiple WAN interfaces. To get the right gateway I added different rules, so the problem might reside there.


  • Netgate Administrator

    Just based on the time it takes it's almost certainly some state timeout.

    If you have policy-based rules on the source interface for WAN failover, you would need to have a rule above them to reach the other interface(s) without being policy routed.

    It's possible that rule did not apply before the WANs came up at boot, so some traffic was initially allowed.

    Steve
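
A rough sketch of why the rule order matters here, assuming the usual pfSense behavior that interface rules are evaluated top down and the first match wins. The rule names, the `LOCAL_NETS` alias and the `WAN_FAILOVER` gateway group are hypothetical and only serve the illustration; this models the decision, it does not talk to pf.

```python
import ipaddress

# Hypothetical alias holding the RFC 1918 private ranges.
LOCAL_NETS = [ipaddress.ip_network(n) for n in
              ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def in_alias(dst, alias):
    """True if dst falls inside any network of the alias."""
    addr = ipaddress.ip_address(dst)
    return any(addr in net for net in alias)


def first_match(rules, dst):
    """Return (rule name, gateway) of the first rule matching dst.

    A gateway of None means the packet follows the normal routing table.
    """
    for name, matches, gateway in rules:
        if matches(dst):
            return name, gateway
    return "default deny", None


# Only a policy routing rule: everything is pushed to the gateway group.
rules_policy_only = [
    ("pass any via gateway group", lambda dst: True, "WAN_FAILOVER"),
]

# Same rule, but with a bypass rule for local networks placed above it.
rules_with_bypass = [
    ("pass to local networks", lambda dst: in_alias(dst, LOCAL_NETS), None),
    ("pass any via gateway group", lambda dst: True, "WAN_FAILOVER"),
]

print(first_match(rules_policy_only, "192.168.13.17"))
print(first_match(rules_with_bypass, "192.168.13.17"))
```

With only the policy rule, traffic to 192.168.13.17 would be sent to the WAN gateway group instead of the DMZ interface. With the bypass rule on top it is routed normally.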



  • OK, so let's get this straight. The docs say:

    On Firewall > Rules, visit the tab for the internal interface to be used with the gateway group, either edit the existing pass rules and add the gateway setting, choosing the desired gateway, or add a new rule to match only certain traffic to direct into the gateway group. Remember that rules are processed from the top down, and once a rule is matched, processing stops.
    

    So on the DMZ interface I would add a rule which states:

    SOURCE: VLAN_DMZ => DEST: !RFC_NETWORK => GW: GW1
    

    On the INTERNAL net I would add a different rule, since I want to make sure INTERNAL outbound traffic goes to GW3:

    SOURCE: INTERNAL => DEST: !RFC_NETWORK => GW: GW3
    

    Would I add all of them as the first rule? So far I had them as the last rule.

    But I still get this state:

    LAN 	tcp 	192.168.11.62:60664 -> 192.168.13.17:443 	CLOSED:SYN_SENT 	2 / 0 	104 B / 0 B 	
    VLAN_DMZ 	tcp 	192.168.11.62:60664 -> 192.168.13.17:443 	SYN_SENT:CLOSED 	2 / 0 	104 B / 0 B
    

    and no connection :/


  • Netgate Administrator

    Yes, those rules should not catch that traffic as long as the RFC_NETWORK alias has been set correctly.

    That just looks like 192.168.13.17 is not responding. 2 packets to it going over both interfaces, 0 packets back.

    Steve
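
For anyone who wants to check a longer state dump the same way, here is a small sketch that flags one-way states. It assumes the column layout of the two lines pasted above and reads the packet column the way Steve does (packets towards the destination / packets back). The regex and the `is_one_way` helper are made up for this example.

```python
import re

# Matches lines laid out like the two states pasted in the post:
# interface, protocol, "src -> dst", state, then "packets out / packets back".
STATE_RE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<proto>\S+)\s+"
    r"(?P<src>\S+) -> (?P<dst>\S+)\s+"
    r"(?P<state>\S+)\s+"
    r"(?P<pkts_out>\d+) / (?P<pkts_back>\d+)"
)

STATES = [
    "LAN tcp 192.168.11.62:60664 -> 192.168.13.17:443 CLOSED:SYN_SENT 2 / 0 104 B / 0 B",
    "VLAN_DMZ tcp 192.168.11.62:60664 -> 192.168.13.17:443 SYN_SENT:CLOSED 2 / 0 104 B / 0 B",
]


def is_one_way(line):
    """True if packets were sent but no packet came back on this state."""
    m = STATE_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognized state line: " + line)
    return int(m.group("pkts_out")) > 0 and int(m.group("pkts_back")) == 0


for line in STATES:
    iface = line.split()[0]
    print(iface, "one-way" if is_one_way(line) else "two-way")
```

Both states come out as one-way, which matches the reading that the SYN leaves the firewall on the DMZ side and nothing returns through it.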



  • So I have set this up properly, but why is it not working for port 443? I added HAProxy in TCP mode to the WAN interface.

    WAN => 443 => HAPROXY (tcp/ssl mode) => tcp rule to matching ssl name to the specified server
    

    Since I added this I am not able to communicate back to the desired server over port 443. If I added the SSH port to HAProxy in TCP mode as well, I believe I would not get SSH working again on an internal net.

    This drives me crazy, and I can find neither a solution nor any failures.


  • Netgate Administrator

    Wait you have HAProxy in there running on those ports?

    If you disable HAProxy does it all work as expected?

    Steve



  • Nope, HAProxy runs on the external interfaces (igb0/igb3) and forwards from there on 443 to the internal server. igb5/6 are standalone.


  • Netgate Administrator

    Right that's what it's configured to do but you just said:

    Since I added this I am not able to communicate back to the desired server over port 443.

    So before you added HAProxy it was working?

    Steve



  • Yes, that's what I am saying, but I don't know if the two are related. Before I brought HAProxy into play (fresh install) I was able to browse internally to the dedicated HTTPS port. So I would assume something is broken within HAProxy while it's redirecting ....



  • @fgro79
    Sorry for late reply, but anyhow.

    I suspect if you disable the haproxy transparent-client-ip feature on the backend your troubles of reaching the webserver directly are gone. That feature comes with a warning message for a reason..



  • I noticed. Sorry, I cannot update my configs yet, since I am facing the issue described here

    So I need to wait until I can modify or downgrade the system to safely remove the transparent-client-ip feature. I was going to use this feature for an internal SMTP server to forward the original IP.

