No NAT processing for certain packets

turrican64

Can we reopen the bug ticket?

Thank you!
Best regards

stephenw10

Done. Though I still think it must be something causing a state conflict somehow. It's going to be very difficult to reproduce.

I assume you cannot produce this on demand? You just have to wait for it to happen?

turrican64

Hi @stephenw10

Thank you for reopening the ticket.

@stephenw10 said in No NAT processing for certain packets:

I still think it must be something causing a state conflict somehow

If you have something specific in mind that you would like to see, I am happy to share it.

@stephenw10 said in No NAT processing for certain packets:

I assume you cannot produce this on demand? You just have to wait for it to happen?

Correct, unfortunately I don't have a method to reproduce it, I have to wait for it to happen. Last time it took 4 weeks.

stephenw10

Nothing easy! What I expect is happening is that when the state tries to open, a conflicting state already exists. But to actually see that would require dumping the state table at that point, so we would need to script something to do it.

turrican64

Hi @stephenw10

How would it be possible to identify that moment and trigger the script?

I’d like to ask some questions regarding the bug ticket updates:

Like I mentioned a couple comments up, the way that happens is when something tries and fails to make a NAT state. Usually static port is the easiest way it happens

May I ask what static port means?

Another way is if they have so many connections to the same remote ip:port that they exhausted their pool of unique external source ports.

There’s only one computer talking to this remote IP:Port. Regardless, I can’t imagine a scenario that would create enough connections to exhaust the pool of 65536 source ports. Is this a likely scenario?
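As a rough check on the port-exhaustion scenario: the limit applies per combination of protocol, external address and remote IP:port, and the usable range is usually somewhat smaller than 65536. The sketch below assumes a translation range of 1024 to 65535, which is what pfSense's generated outbound NAT rules normally use (worth verifying in /tmp/rules.debug on the box in question). The function names are illustrative only.

```python
def nat_port_pool(low: int = 1024, high: int = 65535) -> int:
    """Number of distinct external source ports available for one remote IP:port.

    The default range is an assumption; check the actual NAT rule.
    """
    if not (0 <= low <= high <= 65535):
        raise ValueError("invalid port range")
    return high - low + 1


def pool_exhausted(active_states: int, low: int = 1024, high: int = 65535) -> bool:
    """True if this many simultaneous states to one remote IP:port use up the pool."""
    return active_states >= nat_port_pool(low, high)
```

With a single host sending one flow to the remote IP:Port, only one port from that pool is in use, so exhaustion is not a plausible explanation under these assumptions.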

Need to see things like their entire ruleset

How can I send it to the developers without exposing my ruleset publicly?

and state table at the exact moment a packet failed to get NAT applied.

I kept pfSense in this state for 3 days, which means NAT was not applied to these packets for that entire time, and I posted the state table entry for this remote IP:Port. There was only one entry, which means no conflict.

Is this alone not proof of a bug?

turrican64

Additionally I can summarise my assumptions in this way:

The NAT state was successfully created and worked well for 4 weeks.
The NAT state disappeared for some reason and its re-creation failed, so a faulty state was created instead.
Outgoing packets from the host kept arriving at a rate of 1 pps, which kept this faulty state alive in the state table.
Once packets stopped arriving for 50 seconds (because I shut down the computer’s switch port), the faulty state disappeared from the state table.
Once packets from the computer started arriving again after those 50 seconds, NAT state creation succeeded and it has been working since.

My expectation would be that even if a NAT state cannot be created at a certain point in time for some reason (e.g. a conflict), a faulty state should not remain in the state table once the conflict no longer exists.

A log level that prints the reason why NAT failed would be helpful. Do you know if it is possible to set up such logging?

stephenw10

Yeah, good question. I imagine a floating outbound rule with logging enabled that passes traffic with a source in the internal subnet. That rule should never match, because the NAT rules should normally change the source before the traffic reaches it. However, that's exactly what's happening here. So it would at least log the event and give an idea of when it happens and how often. But then we would want a script that dumps the state table when that log entry is created. I'm thinking about that.
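A minimal sketch of such a watcher, assuming Python 3 is available on the firewall, that the logging rule's tracker ID is known, and that the firewall log is a plain-text file (on pfSense normally /var/log/filter.log, but check your own system). The function names, output directory and tracker ID are illustrative and not part of pfSense. The dump happens a moment after the packet is logged, so a very short-lived conflicting state could still be missed.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def is_trigger(line: str, tracker_id: str) -> bool:
    """True if the log line carries the tracker ID as one of its comma-separated fields."""
    return tracker_id in line.strip().split(",")


def dump_states(out_dir: Path) -> Path:
    """Save the verbose state table (pfctl -vss) to a timestamped file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"states-{datetime.now():%Y%m%d-%H%M%S}.txt"
    result = subprocess.run(
        ["pfctl", "-vss"], capture_output=True, text=True, check=False
    )
    out_file.write_text(result.stdout)
    return out_file


def follow_and_dump(log_path, tracker_id, out_dir, min_interval=10.0):
    """Follow the log like `tail -f` and dump the state table on each match.

    min_interval (seconds) limits how often dumps are written.
    Log rotation is not handled in this sketch.
    """
    last_dump = None
    with open(log_path, "r", errors="replace") as log:
        log.seek(0, 2)  # start at the end of the file, only new entries matter
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.2)
                continue
            if not is_trigger(line, tracker_id):
                continue
            now = time.monotonic()
            if last_dump is None or now - last_dump >= min_interval:
                dump_states(Path(out_dir))
                last_dump = now
```

It could be started with something like `follow_and_dump("/var/log/filter.log", "1700000000", "/root/state-dumps")`, where the tracker ID is a placeholder for the one shown on the logging rule.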

'Static port' there means a rule where source port randomisation is disabled. pfSense adds such a rule for IPsec on port 500, since a lot of IPsec servers will not accept connections from other source ports. Some VoIP devices are also commonly known to break when the source port changes. So the most likely scenario we see this in is two VoIP phones trying to connect to the same remote PBX where they use the same outbound NAT (OBN) rule. The state already exists with the SIP port as both source and destination port, resulting in a conflict and the second device not being NAT'd.
But since you only have one device and no static source port rules, that cannot be happening here.
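The two-phone conflict described above can be illustrated with a toy model. This is a simplified sketch, not how pf is implemented: it only assumes that each translated connection must be unique on the WAN side by (protocol, external IP, external port, remote IP, remote port). The class name and the addresses are made up for the example, and ports are tried in order rather than randomly so the result is deterministic.

```python
from typing import Dict, Optional, Tuple

# Tuple that must be unique on the WAN side:
# (proto, ext_ip, ext_port, dst_ip, dst_port)
WanKey = Tuple[str, str, int, str, int]


class NatTable:
    """Toy outbound NAT table (illustrative model only)."""

    def __init__(self, ext_ip: str, low: int = 1024, high: int = 65535):
        self.ext_ip = ext_ip
        self.low, self.high = low, high
        self.states: Dict[WanKey, Tuple[str, int]] = {}

    def open_state(self, proto: str, src_ip: str, src_port: int,
                   dst_ip: str, dst_port: int,
                   static_port: bool = False) -> Optional[int]:
        """Return the external port used, or None if no unique mapping exists."""
        # With static port the only candidate is the original source port.
        candidates = [src_port] if static_port else range(self.low, self.high + 1)
        for ext_port in candidates:
            key = (proto, self.ext_ip, ext_port, dst_ip, dst_port)
            if key not in self.states:
                self.states[key] = (src_ip, src_port)
                return ext_port
        return None


nat = NatTable("198.51.100.1")
first = nat.open_state("udp", "192.168.1.10", 5060, "203.0.113.5", 5060, static_port=True)
second = nat.open_state("udp", "192.168.1.11", 5060, "203.0.113.5", 5060, static_port=True)
print(first, second)  # 5060 None
```

In this model the second phone gets no mapping because its only candidate port is already taken. Without static port, the second connection would simply receive a different external port.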

The time point that matters is when the state was created. A conflicting state might expire seconds later and the bad state will remain.

turrican64

Hi @stephenw10

Thank you for the detailed explanation. For some reason the issue hasn't happened on this box in the last 1.5 months, so I was not able to collect logs. However, there is another location with one phone which just stopped working.

Here the situation is similar in the sense that an invalid state is being used, but the difference is that the NAT translation is happening to an IP address which doesn't belong to pfSense (probably an old IP address).

The real WAN IP address is 87.97.33.xx but the SIP packets from the phone are translated to 84.230.48.58

After disconnecting the phone for two minutes the state cleared from pfSense, and after connecting the phone again the translation now happens to the correct IP address and the phone is working again.

This is a completely different setup with a completely different phone. There is also a third location, again a completely different setup with a different phone, where the same issue has happened.
There are no static ports configured, only automatic rule creation with source port randomization enabled.

In this case, what Jim states in the bug ticket does not apply:

"use NAT in such a way that it would try to make two connections use the same conflicting information, it will fail to create a NAT state and the second connection will egress without NAT"

Here the packets don't egress without NAT. The NAT is happening, but to an IP address which doesn't exist on the pfSense, probably an old WAN IP address.

What is your opinion?

Thank you!

stephenw10

Ok, that's different to the no-nat case. If 84.236.48.58 was an old address it looks like that's just a stale state.

If you run: pfctl -vss | grep -a2 84.236.48.58

You will see the age of the states in question, which should show whether old states are lingering.

That also shouldn't normally happen. How is the gateway state killing configured there?
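The age check mentioned above could be automated along these lines. This is a sketch that assumes the detail line printed under each state by `pfctl -vss` has the form `age HH:MM:SS, expires in HH:MM:SS, ...`; verify that against real output before relying on it. The function names and the idea of comparing against the time since the last WAN address change are illustrative.

```python
import re
from typing import Optional

# Matches the "age HH:MM:SS" part of a verbose state detail line (assumed format).
AGE_RE = re.compile(r"\bage (\d+):(\d{2}):(\d{2})")


def state_age_seconds(detail_line: str) -> Optional[int]:
    """Return the state's age in seconds, or None if the line has no age field."""
    match = AGE_RE.search(detail_line)
    if match is None:
        return None
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def is_stale(detail_line: str, seconds_since_wan_ip_change: int) -> bool:
    """True if the state is older than the most recent WAN address change."""
    age = state_age_seconds(detail_line)
    return age is not None and age > seconds_since_wan_ip_change
```

A state that is older than the WAN interface's current uptime was created before the address changed, which would be consistent with a stale state surviving the change.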

turrican64

@stephenw10

@stephenw10 said in No NAT processing for certain packets:

Ok, that's different to the no-nat case.

Might be better if I start a new topic for this then.

@stephenw10 said in No NAT processing for certain packets:

If you run: pfctl -vss | grep -a2 84.236.48.58

This returns no results, probably because I disconnected the phone for 2 minutes and the state cleared. The service provider bounces the PPPoE interface from time to time to trigger an IP address change. I saw that the PPPoE interface's uptime was a couple of hours, so I assume the IP address change happened at that time.

The gateway state killing settings are at their default values. I've never changed them.

Thank you!

stephenw10

Yes, you would need to run that at the time you were seeing those states.

You only have a single WAN there?

One thing that could provide useful evidence for both these situations would be to set up pflow exporting. That would show if there are any conflicting states when a non-natted connection is created.

It should also catch failures to close out old states.