[Solved] Postfix timeout caused by lost packets


  • Hi,

    I just discovered a lot of Postfix timeouts, apparently caused by some weird network error.
    The firewall has two WAN interfaces, WAN1 and WAN2, and the mail server is on a DMZ interface.
    WAN1 is used by the mail server and for DNS queries, while WAN2 (which is connected to the default gateway) carries everything else.
    There is a rule that forces all outgoing traffic from the DMZ to the gateway of WAN1.

    Here is an excerpt from Wireshark of what seems to create the network problem:

    No.    Time          Source                Destination          Protocol Length Info
        168 1.340890      213.205.33.215        192.168.1.2          TLSv1.2  1376  [TCP Previous segment not captured] Ignored Unknown Record

    Frame 168: 1376 bytes on wire (11008 bits), 1376 bytes captured (11008 bits)
    Ethernet II, Src: CiscoInc_a4:fa:fc (00:13:80:a4:fa:fc), Dst: Fabiatec_07:94:78 (00:04:a7:07:94:78)
    Internet Protocol Version 4, Src: 213.205.33.215, Dst: 192.168.1.2
    Transmission Control Protocol, Src Port: 38411 (38411), Dst Port: 25 (25), Seq: 137865, Ack: 2149, Len: 1322
    Secure Sockets Layer

    No.    Time          Source                Destination          Protocol Length Info
        169 1.341198      192.168.1.2          213.205.33.215        TCP      60    [TCP Dup ACK 167#1] 25 → 38411 [ACK] Seq=2149 Ack=122001 Win=63456 Len=0

    Frame 169: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
    Ethernet II, Src: Fabiatec_07:94:78 (00:04:a7:07:94:78), Dst: CiscoInc_a4:fa:fc (00:13:80:a4:fa:fc)
    Internet Protocol Version 4, Src: 192.168.1.2, Dst: 213.205.33.215
    Transmission Control Protocol, Src Port: 25 (25), Dst Port: 38411 (38411), Seq: 2149, Ack: 122001, Len: 0
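
    As a rough reading of the two frames above: the mail server is still acknowledging byte 122001 while the segment that arrived starts at 137865. The sketch below only does that arithmetic. It assumes the sequence numbers are Wireshark's relative ones and that the missing segments were the same size as frame 168, which the excerpt does not confirm.

```python
# Values copied from frames 168 and 169 in the excerpt above.
received_seq = 137865  # Seq of the segment that arrived in frame 168
acked_up_to = 122001   # Ack sent back by the mail server in frame 169
segment_len = 1322     # Len of frame 168; assumed typical for this flow


def missing_bytes(seq, ack):
    """Bytes between the last contiguous byte acknowledged and the arriving segment."""
    return seq - ack


gap = missing_bytes(received_seq, acked_up_to)
full_segments, remainder = divmod(gap, segment_len)
print(gap, full_segments, remainder)
```

    Under those assumptions the hole is 15864 bytes, which is exactly 12 segments of 1322 bytes, so a whole burst went missing rather than a single packet.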

    It seems that some TCP segments were lost and that the peers were not able to recover.
    I googled a lot and found that the problem could be related to path MTU discovery. I already tried lowering the MTU and permitting ICMP traffic, but neither helped.
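
    One way to check the MTU theory is to probe for the largest packet that passes with the "don't fragment" bit set. This is only a sketch: `ping_df` assumes the Linux iputils `ping` (options `-M do`, `-s`, `-c`, `-W`), and `largest_working_mtu` is a hypothetical helper that assumes every size below a working one also works. On a link that is dropping packets anyway, a single lost ping will look like a failure, so the result is a lower bound at best.

```python
import subprocess

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP echo header


def ping_df(host, mtu):
    """Send one echo request of total IP size `mtu` with DF set.

    Returns True if a reply came back. Assumes Linux iputils ping.
    """
    payload = mtu - IP_ICMP_OVERHEAD
    cmd = ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def largest_working_mtu(probe, low=576, high=1500):
    """Binary-search the largest MTU in [low, high] for which probe(mtu) is True.

    Returns None if even `low` fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid works, so the answer is mid or larger
        else:
            high = mid - 1  # mid fails, so the answer is smaller
    return low
```

    It would be called as `largest_working_mtu(lambda m: ping_df(host, m))`, with `host` set to a remote address that answers pings.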

    I've attached the full decoded tcpdump.

    What can it be?

    Thanks,
    Stenio

    Edit:

    It seems that the problem was the provider's router. After a reboot no more packets were lost.

    capture.txt


  • Not sure if it's related, but if you are using DKIM to sign your Postfix email, Cisco has a habit of corrupting the packets and then dropping them as malformed.

    http://www.arschkrebs.de/postfix/postfix_cisco_pix_bugs.shtml

    Steve


  • Hi Steve,

    No, I'm not using DKIM.
    The problem seems to be related to TLS and to the length of the email message: the bigger the email, the more likely the network problem and hence the timeout.
    The "distance" between the servers also seems to have an influence, probably because more hops mean more time and a greater chance of losing segments.

    A lot of messages come from Google's servers (209.85.128.0/17, 74.125.0.0/16).
    I tried decreasing the MTU of the server's interface from 1500 to 1362, and this had a positive effect. I'll try lowering it further.
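
    For reference, this is how an interface MTU maps to the TCP payload size seen on the wire. It is a minimal sketch that assumes plain 20-byte IPv4 and TCP headers (no options such as timestamps) and an untagged 14-byte Ethernet header; the function names are just for illustration.

```python
IPV4_HEADER = 20      # bytes, assuming no IP options
TCP_HEADER = 20       # bytes, assuming no TCP options
ETHERNET_HEADER = 14  # bytes, assuming no VLAN tag


def tcp_payload_for_mtu(mtu):
    """Largest TCP payload (the MSS) that fits in one IP packet of size `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


def frame_len_for_mtu(mtu):
    """Length of a full-size Ethernet frame as Wireshark reports it (no FCS)."""
    return mtu + ETHERNET_HEADER


for mtu in (1500, 1362):
    print(mtu, tcp_payload_for_mtu(mtu), frame_len_for_mtu(mtu))
```

    With an MTU of 1362 this gives a payload of 1322 bytes and a frame of 1376 bytes, which matches the Len and frame size of frame 168 in my first post, so that capture looks consistent with the lowered MTU already being in effect.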

    Thanks,
    Stenio