Repeated ENA TX Timeout on AWS pfSense Instances (Affecting Multiple Firewalls Randomly)

rattle007_beat

Hello everyone,

We’re running multiple pfSense+ firewalls (24.11-RELEASE (amd64)) on AWS EC2 (mostly m6i.xlarge and m6i.large instances), and we have been seeing recurring network interruptions caused by ENA driver TX timeouts, even during periods of low or moderate traffic.

At random times (not tied to peak load), the pfSense network interface drops out. After a reboot the system comes back up fine, but the logs consistently show messages like:

ena0: Found a Tx that wasn't completed on time, qid 2, index 501. 18980 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.
ena0: Found a Tx that wasn't completed on time, qid 2, index 506. 18980 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.
ena0: Found a Tx that wasn't completed on time, qid 2, index 508. 18980 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.
ena0: Found a Tx that wasn't completed on time, qid 2, index 510. 18980 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.
ena0: The number of lost tx completion is above the threshold (244 > 128). Reset the device
ena0: ena_com_validate_version() [TID:100038]: ENA device version: 0.10
ena0: ena_com_validate_version() [TID:100038]: ENA controller version: 0.0.1 implementation version 1
ena0: Trigger reset is on
ena0: device is going DOWN
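
For anyone comparing notes across several firewalls, here is a minimal Python sketch that tallies the "Found a Tx that wasn't completed on time" messages per device and queue. It only assumes the message format quoted above; the function name `summarize_tx_timeouts` is made up for illustration, and the sample lines are copied from the log excerpt in this post.

```python
import re
from collections import Counter

# Matches the "late Tx" lines exactly as they appear in the log excerpt above.
# The apostrophe class tolerates both straight and curly quotes.
TX_LATE = re.compile(
    r"(?P<dev>ena\d+): Found a Tx that wasn['’]t completed on time, "
    r"qid (?P<qid>\d+), index (?P<index>\d+)\. "
    r"(?P<elapsed>\d+) msecs have passed since last cleanup\. "
    r"Missing Tx timeout value (?P<timeout>\d+) msecs\."
)


def summarize_tx_timeouts(lines):
    """Return (counts, worst): late-Tx count and max elapsed msecs per (dev, qid)."""
    counts = Counter()
    worst = {}
    for line in lines:
        m = TX_LATE.search(line)
        if m is None:
            continue  # not a late-Tx message (e.g. the reset lines)
        key = (m["dev"], int(m["qid"]))
        counts[key] += 1
        worst[key] = max(worst.get(key, 0), int(m["elapsed"]))
    return counts, worst


if __name__ == "__main__":
    sample = [
        "ena0: Found a Tx that wasn't completed on time, qid 2, index 501. "
        "18980 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.",
        "ena0: Found a Tx that wasn't completed on time, qid 2, index 506. "
        "18980 msecs have passed since last cleanup. Missing Tx timeout value 5000 msecs.",
        "ena0: Trigger reset is on",
    ]
    counts, worst = summarize_tx_timeouts(sample)
    for (dev, qid), n in counts.most_common():
        print(f"{dev} qid {qid}: {n} late Tx, worst {worst[(dev, qid)]} msecs")
```

If the same qid shows up every time, that would suggest the stall is tied to whichever CPU core services that queue, which is worth comparing against the core that monitoring shows as maxed out.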

Every time this happened, our monitoring tool showed the CPU core in question maxed out.

AWS support confirmed:
No EC2 or hypervisor-level issues (status checks OK).

We still don't know what is causing the CPU to spike, or whether the ENA driver itself is the cause.

ENA Driver version:
ena0: Elastic Network Adapter (ENA)ena v2.8.0
ena0: ena_com_validate_version() [TID:100000]: ENA device version: 0.10
ena0: ena_com_validate_version() [TID:100000]: ENA controller version: 0.0.1 implementation version 1

Has anyone seen similar ENA TX timeout or “Found a Tx that wasn’t completed” issues on AWS-based pfSense instances?
Any best practices for interrupt balancing / RSS tuning on AWS instances with ENA?

marcosm

That's expected behavior on AWS when the CPU is maxed. See https://docs.netgate.com/pfsense/en/latest/solutions/aws-vpn-appliance/instance-type-and-sizing.html

You'll need to find the cause of the core maxing out.
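
One way to start narrowing that down is to see which interrupt sources are busiest while a core is pegged. The sketch below is only an illustration, not an official procedure: it assumes you capture the text output of FreeBSD's `vmstat -i` twice, a known number of seconds apart, and it assumes the usual three-column layout (source name, cumulative total, rate). The function names are hypothetical.

```python
def parse_vmstat_i(text):
    """Parse `vmstat -i` text into {source name: cumulative interrupt total}."""
    totals = {}
    for line in text.splitlines():
        # Source names can contain spaces, so split the two numeric
        # columns off the right-hand side.
        parts = line.rsplit(None, 2)
        if len(parts) != 3:
            continue
        name, total, rate = parts
        if not (total.isdigit() and rate.isdigit()):
            continue  # skips the column header line
        name = name.strip()
        if name == "Total":
            continue  # skip the summary row
        totals[name] = int(total)
    return totals


def busiest_sources(before, after, seconds, top=5):
    """Return the `top` sources by interrupts per second between two snapshots."""
    first = parse_vmstat_i(before)
    second = parse_vmstat_i(after)
    rates = {
        name: (count - first.get(name, 0)) / seconds
        for name, count in second.items()
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:top]
```

Pass it the two captured outputs and the interval, e.g. `busiest_sources(snapshot_a, snapshot_b, 10)`. Comparing the top entries with a per-CPU, per-thread view such as `top -HSP` should show whether the busy core is handling NIC interrupts or something else.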