@heper OK. I realized my ISP also has an iperf3 server listening, so I used that instead of speedtest. I attached a virtio (vtnet) interface to pfSense and made a better comparison: single vs. parallel flows, IPv4 vs. IPv6, and traffic coming from the physical port (ix) vs. the virtio interface (vtnet). The WAN is always the physical port.
You were right. I think it is related to the tx/rx queue issue you linked above.
My results show that a single flow (ix or vtnet) reaches around 5 Gb/s on my system (with packet filtering enabled). With parallel flows, ix reaches 9 Gb/s regardless of IPv4 (NAT) or IPv6 (no NAT), and uses around 70% CPU (4 cores). However, with vtnet and parallel flows, neither throughput nor CPU use changes: it stays around 5-6 Gb/s, similar to a single flow. That said, vtnet is actually very good for a single flow (as good as the physical port), and its CPU consumption is about the same (maybe a few percent higher).