Network issue with very small frames (TCP, padded)
We are using pfSense stable 2.6.0, which is built on FreeBSD 12.3-STABLE.
Recently we activated a number of servers in a new, very fast (virtualized) environment, which is connected to the internal network at 10 Gbit. The pfSense firewall is a physical machine with 1 Gbit NICs on both the LAN and WAN side.
We have now encountered an issue that probably results from an overload situation or an MTU problem (although the MTU is 1500 everywhere):
When transmitting a medium-sized file (1.3 MB) over HTTP (Apache HTTP Server running on Ubuntu in a VM on the server side), the traffic sometimes, but not always, contains a very small frame (4-5 bytes of payload, padded to the 64-byte minimum frame length on the wire) which corrupts the output.
We can see in a packet capture taken on the LAN side of the firewall that the data frames are normally 1514 bytes, but all of a sudden a 60-byte frame (flagged PSH,ACK) arrives which contains only 4 bytes of TCP segment data. This frame is padded with one or two zero bytes at the end, AFTER the payload.
Extract from wireshark:
Frame 136159: 60 bytes on wire (480 bits), 60 bytes captured (480 bits)
Ethernet II, Src: e2:84:72:d3:14:5c (e2:84:72:d3:14:5c), Dst: IETF-VRRP-VRID_04 (00:00:5e:00:01:04)
Destination: IETF-VRRP-VRID_04 (00:00:5e:00:01:04)
Source: e2:84:72:d3:14:5c (e2:84:72:d3:14:5c)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 192.168.25.41, Dst: 80.xxx.xxx.xxx
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
Total Length: 44
Identification: 0x594c (22860)
Flags: 0x40, Don't fragment
...0 0000 0000 0000 = Fragment Offset: 0
Time to Live: 64
Protocol: TCP (6)
Header Checksum: 0x37c2 [validation disabled]
[Header checksum status: Unverified]
Source Address: 192.168.25.41
Destination Address: 80.xx.xx.xx
Transmission Control Protocol, Src Port: 80, Dst Port: 37710, Seq: 553466, Ack: 188, Len: 4
Source Port: 80
Destination Port: 37710
[Stream index: 77]
[Conversation completeness: Complete, WITH_DATA (47)]
[TCP Segment Len: 4]
Sequence Number: 553466 (relative sequence number)
Sequence Number (raw): 2934325866
[Next Sequence Number: 553470 (relative sequence number)]
Acknowledgment Number: 188 (relative ack number)
Acknowledgment number (raw): 2019009064
0101 .... = Header Length: 20 bytes (5)
Flags: 0x018 (PSH, ACK)
[Calculated window size: 42153]
[Window size scaling factor: -2 (no window scaling used)]
Checksum: 0xc1c8 [unverified]
[Checksum Status: Unverified]
Urgent Pointer: 0
TCP payload (4 bytes)
[Reassembled PDU in frame: 137331]
TCP segment data (4 bytes)
0000 00 00 5e 00 01 04 e2 84 72 d3 14 5c 08 00 45 00
0010 00 2c 59 4c 40 00 40 06 37 c2 c0 a8 19 29 50 XX
0020 XX XX 00 50 93 4e ae e6 42 6a 78 57 a2 28 50 18
0030 a4 a9 c1 c8 00 00 ee 86 11 a2 00 00
The real payload is only 4 bytes (0xee 0x86 0x11 0xa2). After that, two zero bytes are appended (a sort of padding), resulting in a total frame length of 60 bytes.
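The padding can be read off directly from the dump: the IP Total Length field says 44 bytes (20 IP header + 20 TCP header + 4 payload), so anything in the 60-byte frame beyond 14 + 44 bytes is Ethernet pad. A quick sanity check in plain Python (the masked address bytes XX are replaced with zeros here purely as placeholders):

```python
# Verify the Ethernet padding in the captured frame above.
# The redacted destination-address bytes (XX) are filled with
# zeros purely as placeholders; they do not affect the lengths.
frame = bytes.fromhex(
    "00005e000104e28472d3145c08004500"
    "002c594c4000400637c2c0a819295000"
    "00000050934eaee6426a7857a2285018"
    "a4a9c1c80000ee8611a20000"
)

ETH_HDR = 14
ip_total_len = int.from_bytes(frame[ETH_HDR + 2:ETH_HDR + 4], "big")  # IP Total Length field
ip_hdr_len = (frame[ETH_HDR] & 0x0F) * 4
tcp_off = ETH_HDR + ip_hdr_len
tcp_hdr_len = (frame[tcp_off + 12] >> 4) * 4

payload = frame[tcp_off + tcp_hdr_len:ETH_HDR + ip_total_len]
padding = frame[ETH_HDR + ip_total_len:]

print(payload.hex())  # ee8611a2 -> the 4 real data bytes
print(len(padding))   # 2        -> pad bytes that must NOT reach the application
```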
We do not know why this padding occurs, but it is copied into the forwarded frame, leading to corrupted output.
Any idea how to solve this is welcome. The padding seems to be the underlying cause of the data corruption, but we do not know whether there is a way to prevent the padding on the sending side, or whether there is simply a bug in the TCP code of the FreeBSD kernel (which is probably modified in pfSense) that forwards these padded bytes as content.
Traffic between the internal machines (mostly Linux) is not affected, for whatever reason.
The corruption does not occur on every transfer, only sometimes.
Most of the time these small frames are simply not sent, for whatever reason.
I found TCP Window Full / TCP Window Update messages in the cap file, but these are rare and look normal.
We have always tested with the same file, but the frame sizes vary on each transfer.
Padding is apparently a well-known mechanism to bring small Ethernet frames up to the 64-byte minimum length (60 bytes plus the 4-byte FCS), as I have read in some articles.
We tried all kinds of switches to reduce side influences. The error occurs less often if the receive buffer is increased, but it is not eliminated. From my point of view, the bigger receive buffer leads to fewer window-size events, which means the data flow is more stable with full-size frames (1514 bytes). We found that, relatively close to the small frame, a TCP Window Full message appears in the capture. This might indicate to the server that it is sending too fast and cause it to flush the last 4-5 bytes and wait a moment before sending the next full-size packet (my amateur understanding of window sizes ...).
From my point of view, the TCP code should be able to handle this.
I can of course provide the .cap file with the defective flow if that makes sense.
But my C knowledge is very limited, so I am not able to debug the kernel (and, as always, this happens in production, is not reproducible in testing, and we do not have an identical network setup for tests).
We have eliminated almost all interfering factors and were able to reproduce the behaviour with a single "fetch http:// ...." command (when called very often in a loop). The error sometimes occurs only after > 500 calls, sometimes earlier.
The original setup includes HAProxy, but in our final tests we were able to reproduce it with only the LAN interface and the test script.
In the damaged file in the "fetch" output we always see the 00 bytes that were appended in the padded Ethernet frame. So some internal code probably does not respect the actual payload length but copies the whole buffer.
The following frame was received:
0000 a0 36 9f 5f 90 42 e2 84 72 d3 14 5c 08 00 45 00
0010 00 2d cb 45 40 00 40 06 bc 08 c0 a8 19 29 c0 a8
0020 19 03 00 50 2b d2 a6 44 37 b5 d6 94 85 ed 50 18
0030 a4 cf d8 4e 00 00 b0 b9 89 d3 de 00
(padded with a single 00 byte); the content was 5 bytes (b0 b9 89 d3 de).
Next packet received:
0000 a0 36 9f 5f 90 42 e2 84 72 d3 14 5c 08 00 45 00 .6._.B..r....E.
0010 05 dc cb 46 40 00 40 06 b6 58 c0 a8 19 29 c0 a8 ...F@.@..X...)..
0020 19 03 00 50 2b d2 a6 44 37 ba d6 94 85 ed 50 10 ...P+..D7.....P.
0030 a4 cf 22 8b 00 00 5e 2c 5b ad de 09 e6 d0 27 59 .."...^,[.....'Y
(data starts with 0x5e 0x2c ....)
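If the hypothesis is right that some code copies the whole buffer instead of respecting the payload length, the difference is easy to illustrate: the receiver must cut the frame at 14 + IP Total Length, not at the captured frame length. A toy sketch in plain Python (not the actual FreeBSD/pfSense code path), using the first frame above:

```python
# Toy illustration of the suspected bug: taking the payload length
# from the captured frame instead of from the IP Total Length field
# forwards the Ethernet pad byte to the application.
frame = bytes.fromhex(
    "a0369f5f9042e28472d3145c08004500"
    "002dcb4540004006bc08c0a81929c0a8"
    "190300502bd2a64437b5d69485ed5018"
    "a4cfd84e0000b0b989d3de00"
)

ETH = 14
ip_len = int.from_bytes(frame[ETH + 2:ETH + 4], "big")       # 45 = 20 IP + 20 TCP + 5 data
ihl = (frame[ETH] & 0x0F) * 4
data_off = ETH + ihl + ((frame[ETH + ihl + 12] >> 4) * 4)

buggy = frame[data_off:]                # trusts the frame length: pad included
correct = frame[data_off:ETH + ip_len]  # trusts the IP header: pad stripped

print(buggy.hex())    # b0b989d3de00 -> the trailing pad byte leaks into the stream
print(correct.hex())  # b0b989d3de
```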
Resulting defect (hex dump of the defective file vs. the correct file):
defective:
000eeac0 bd ec e8 40 92 5f 88 ef ed dd 10 7c 3e 88 a3 23 |œìè@._.ïíÝ.|>.£#|
000eead0 e8 6c 67 b0 b9 89 d3 de 00 5e 2c 5b ad de 09 e6 |èlg°¹.ÓÞ.^,[Þ.æ|
000eeae0 d0 27 59 1e f7 57 56 42 b3 db 91 18 1b 43 d2 eb |Ð'Y.÷WVB³Û...CÒë|
correct:
000eeac0 bd ec e8 40 92 5f 88 ef ed dd 10 7c 3e 88 a3 23 |œìè@._.ïíÝ.|>.£#|
000eead0 e8 6c 67 b0 b9 89 d3 de 5e 2c 5b ad de 09 e6 d0 |èlg°¹.ÓÞ^,[Þ.æÐ|
000eeae0 27 59 1e f7 57 56 42 b3 db 91 18 1b 43 d2 eb 85 |'Y.÷WVB³Û...CÒë.|
The extra 0x00 byte was inserted between 0xde and 0x5e.
So, to be clear: in a pcap on the WAN side you do not see the rogue 00s appended to the packet?
Have you tried disabling pf-scrub?
It obviously shouldn't do that but it's an easy test.
The 0 bytes are in the middle of a frame when looking at the WAN side.
When doing the test locally, there is of course no WAN trace.
Just the defective file (we ran a script calling fetch in a loop and then checked for the correct checksum).
And yes, we switched off scrub for a test, plus the hardware offload options, and anything else we could think of.
It does not happen as often if we increase the read buffers, but that is potentially because the small packet is triggered by a TCP Window Full event or something similar.
The point here is not the small frame itself, but that the code seems to forward the Ethernet padding to the application layer. It just happens more often when the read buffers are smaller.
We got another hint that the hardware could be a factor, since the machine is somewhat old, but either way the padding should not be forwarded.
Another thing: the final file has the original size! The bytes added in between some frames are missing at the end, so the total length of the TCP stream is miscalculated somewhere. This is weird.
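That last symptom actually fits the copy-the-whole-buffer theory: if each padded segment contributes its pad byte(s) to the stream, but the output is still cut off at the expected content length, the pad shifts everything behind it and pushes the real tail bytes off the end. A small simulation in plain Python (segment contents taken from the dumps above, everything else hypothetical):

```python
# Simulate the suspected failure mode: per-segment pad bytes are
# concatenated into the stream, but the output is still truncated
# to the original content length.
seg1 = bytes.fromhex("b0b989d3de")  # 5 real payload bytes of the small frame
pad1 = b"\x00"                      # Ethernet pad byte on that frame
seg2 = bytes.fromhex("5e2c5bad")    # start of the next full-size segment

original = seg1 + seg2                           # what the file should contain
corrupted = (seg1 + pad1 + seg2)[:len(original)] # pad leaks in, tail byte is cut

print(corrupted.hex())                  # b0b989d3de005e2c5b: 00 between de and 5e
print(len(corrupted) == len(original))  # True: same size, but shifted content
```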
Hmm, maybe I've misunderstood how you're testing here then.
The file is being served from a server in the new VM environment which is behind pfSense?
And the client fetching that file and getting the wrong result is on a different internal interface?
We initially ran into the issue when some clients reported wrong files.
We then tested whether we could reproduce the wrong download from our office network.
This was possible. At that point, the wrong bytes appeared in the middle of a random frame for no apparent reason.
We then checked with different internal servers (mostly Linux) using wget whether the error could be reproduced (which was not the case).
Then we took packet traces on the pfSense box while performing the download via WAN and found the small (under 60 bytes of content, padded) frames, which occurred on the LAN side from time to time, leading to the problematic result.
This setup is Client -> Internet -> WAN -> haproxy -> LAN -> virtual env/Server (Apache)
In the end, we tried to reproduce this on the firewall itself (in a shell) in order to eliminate factors like HAProxy and reduce the number of components in line.
So the latest test was just: pfSense local shell -> LAN -> virtualized server.
Nothing else in between but switches (with a large MTU) and the hosting environment.
The situation occurs less often in this local setup, but still irregularly (sometimes after 50 tries, sometimes after 400).
As to why the error does not hit other machines, I can only guess that they either have no issue with the small padded frames, or never receive them because they are better at buffering fast responses (so no TCP window alerts, which seem to play a role here).
We have other (older) servers in place which should be converted into the virtual environment, but these have 1GBit NICs, which does not seem to trigger the issue (this is an old setup that has been running smoothly for a number of years). We have stopped the conversion for the moment until this unexpected issue is solved.
On the other machines in the LAN segment we only have tcpdump to trace the traffic, and that output is not really comparable: it contains very big frames (8k to 16k) because the capture point differs from a pfSense packet trace (as said, the MTU is 1500 everywhere, so the capture is presumably taken before NIC segmentation or after offload reassembly).
It could be that an old NIC is a factor (an Intel I350 copper card is detected, igb driver, which covers a lot of models and is not an uncommon card type). I have now ordered new hardware, but it does not seem logical to me that hardware should be the reason for such an issue.
We have a failover machine with identical hardware and were able to reproduce the issue there, so it is not a defect of this particular NIC.
We also have an "internal" firewall (pfSense, same version), but that one runs with a 10GBit NIC (ix driver) and we were not able to reproduce the situation there. Maybe it never receives such small frames, or that driver handles the padding correctly.
Hmm, so where is the MTU change, packet re-assembly happening? If it is?
Does the other pfSense with the 10G NIC have larger frames enabled?
Are you able to test a different pfSense version as a client?
Did you test a connection through pfSense but without HAProxy?