[Solved] Postfix timeout caused by lost packets

stenio

Hi,

I just discovered a lot of Postfix timeouts, apparently caused by some odd network error.
The firewall has two wan interfaces, WAN1 and WAN2, and the mail server is on a DMZ interface.
WAN1 is used by the mail server and for DNS queries, while WAN2 (which is connected to the default gateway) carries everything else.
There is a rule that forces all outgoing traffic from the DMZ to the gateway of WAN1.

Here is an excerpt from Wireshark of what seems to create the network problem:

No. Time Source Destination Protocol Length Info
168 1.340890 213.205.33.215 192.168.1.2 TLSv1.2 1376 [TCP Previous segment not captured] Ignored Unknown Record

Frame 168: 1376 bytes on wire (11008 bits), 1376 bytes captured (11008 bits)
Ethernet II, Src: CiscoInc_a4:fa:fc (00:13:80:a4:fa:fc), Dst: Fabiatec_07:94:78 (00:04:a7:07:94:78)
Internet Protocol Version 4, Src: 213.205.33.215, Dst: 192.168.1.2
Transmission Control Protocol, Src Port: 38411 (38411), Dst Port: 25 (25), Seq: 137865, Ack: 2149, Len: 1322
Secure Sockets Layer

No. Time Source Destination Protocol Length Info
169 1.341198 192.168.1.2 213.205.33.215 TCP 60 [TCP Dup ACK 167#1] 25 → 38411 [ACK] Seq=2149 Ack=122001 Win=63456 Len=0

Frame 169: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: Fabiatec_07:94:78 (00:04:a7:07:94:78), Dst: CiscoInc_a4:fa:fc (00:13:80:a4:fa:fc)
Internet Protocol Version 4, Src: 192.168.1.2, Dst: 213.205.33.215
Transmission Control Protocol, Src Port: 25 (25), Dst Port: 38411 (38411), Seq: 2149, Ack: 122001, Len: 0

It seems that some segments were lost and that the peers were not able to recover.
I googled a lot and found that the problem could be related to path MTU discovery. I have already tried lowering the MTU and permitting ICMP traffic, but neither helped.
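
As a rough sanity check on the two frames above, the gap between what the server has acknowledged and what just arrived can be computed directly. This is only a minimal Python sketch: the function name is made up for illustration, and it assumes the sequence numbers are the relative ones that Wireshark displays.

```python
# Hypothetical helper: how many bytes sit between the receiver's ACK
# and the first byte of the segment that just arrived.
SEQ_SPACE = 2**32  # TCP sequence numbers wrap at 32 bits

def missing_bytes(acked: int, arrived_seq: int) -> int:
    """Bytes the receiver is still waiting for before arrived_seq."""
    return (arrived_seq - acked) % SEQ_SPACE

# Values taken from frames 168 and 169 in the capture excerpt.
gap = missing_bytes(acked=122001, arrived_seq=137865)
segment_len = 1322  # Len of frame 168

print(gap)                # size of the hole in bytes
print(gap / segment_len)  # how many segments of that size would fit in it
```

For these values the hole is 15864 bytes, which would be twelve segments if they were all 1322 bytes long. That suggests a run of consecutive segments went missing, not just one.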

I've attached the full decoded tcpdump.

What can it be?

Thanks,
Stenio

Edit:

It seems that the problem was the provider's router. After a reboot no more packets were lost.

capture.txt

AEITS_Inc

Not sure if it's related, but if you are using DKIM to sign your Postfix email, Cisco has a habit of corrupting the packets and then dropping them as malformed.

http://www.arschkrebs.de/postfix/postfix_cisco_pix_bugs.shtml

Steve

stenio

Hi Steve,

No, I'm not using DKIM.
The problem seems to be related to TLS and to the length of the email message: the bigger the email, the more likely the network problem and hence the timeout.
The "distance" between the servers also seems to have an influence, probably because more hops mean more time and more chances to lose segments.
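
The size effect is what a simple loss model would predict. The sketch below assumes every segment is lost independently with the same probability, which is a simplification (real loss tends to be bursty, and TCP normally recovers isolated losses by retransmitting). The loss rate and message sizes are made-up illustration values, not measurements.

```python
import math

def segments_needed(message_bytes: int, mss: int) -> int:
    """Number of TCP segments needed to carry message_bytes at the given MSS."""
    return math.ceil(message_bytes / mss)

def p_any_loss(p_loss: float, n_segments: int) -> float:
    """Probability that at least one of n independent segments is lost."""
    return 1.0 - (1.0 - p_loss) ** n_segments

P_LOSS = 0.001  # assumed per-segment loss rate, purely illustrative
MSS = 1322      # payload size seen in the capture (Len: 1322)

for size in (10_000, 1_000_000, 10_000_000):  # hypothetical message sizes
    n = segments_needed(size, MSS)
    print(size, n, round(p_any_loss(P_LOSS, n), 3))
```

Under these assumptions the chance of hitting at least one lost segment grows quickly with message size, which fits the observation that large messages time out more often.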

A lot of messages come from Google's servers (209.85.128.0/17, 74.125.0.0/16).
I tried to decrease the MTU of the server's interface from 1500 to 1362 and this had a positive effect. I'll try to lower it more.
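
For reference, lowering the interface MTU also lowers the largest TCP segment the server will accept. A minimal sketch of the arithmetic, assuming IPv4 and TCP headers of 20 bytes each with no options (the function name is just for illustration):

```python
IPV4_HEADER = 20  # bytes, without IP options
TCP_HEADER = 20   # bytes, without TCP options (e.g. no timestamps)

def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one IPv4 packet of size mtu."""
    return mtu - IPV4_HEADER - TCP_HEADER

for mtu in (1500, 1362):
    print(mtu, mss_for_mtu(mtu))
```

With an MTU of 1362 this gives 1322, which matches the Len: 1322 of frame 168 in the first post, so that capture appears to have been taken with the reduced MTU already in place.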

Thanks,
Stenio