Is anyone else seeing lots of Oerrs on a PPPoE ISP connection running over VLAN 911?
-
@stephenw10 So I switched back to the new driver and also disabled checksum offloading. Guess what - no errors reported now. Not sure what that means exactly but an interesting result I think.
-
Ah, interesting. Unexpected for output errors but that's the sort of thing I might expect to cause errors.
-
@stephenw10 Sadly, I spoke too soon. This morning when I checked there were a load of Oerrs against the VLAN and pppoe interfaces (but NOT the base interface). Again the error count for the VLAN was a few more than the pppoe count and overall it represents an error rate of ~0.02%. No Ierrs.
Mac stats show one CRC error and one collision (though given this is just a cable between the NetGate port and the ONT it's hard to see how there could be a collision):
dev.igc.0.mac_stats.tso_txd: 0
dev.igc.0.mac_stats.tx_frames_1024_1522: 106987943
dev.igc.0.mac_stats.tx_frames_512_1023: 575465
dev.igc.0.mac_stats.tx_frames_256_511: 549671
dev.igc.0.mac_stats.tx_frames_128_255: 729276
dev.igc.0.mac_stats.tx_frames_65_127: 25797022
dev.igc.0.mac_stats.tx_frames_64: 131923
dev.igc.0.mac_stats.mcast_pkts_txd: 0
dev.igc.0.mac_stats.bcast_pkts_txd: 3
dev.igc.0.mac_stats.good_pkts_txd: 134771300
dev.igc.0.mac_stats.total_pkts_txd: 134771300
dev.igc.0.mac_stats.good_octets_txd: 165755531650
dev.igc.0.mac_stats.good_octets_recvd: 131017743687
dev.igc.0.mac_stats.rx_frames_1024_1522: 86115978
dev.igc.0.mac_stats.rx_frames_512_1023: 425579
dev.igc.0.mac_stats.rx_frames_256_511: 211253
dev.igc.0.mac_stats.rx_frames_128_255: 806539
dev.igc.0.mac_stats.rx_frames_65_127: 7578356
dev.igc.0.mac_stats.rx_frames_64: 102537
dev.igc.0.mac_stats.mcast_pkts_recvd: 0
dev.igc.0.mac_stats.bcast_pkts_recvd: 0
dev.igc.0.mac_stats.good_pkts_recvd: 95240242
dev.igc.0.mac_stats.total_pkts_recvd: 95240243
dev.igc.0.mac_stats.mgmt_pkts_txd: 0
dev.igc.0.mac_stats.mgmt_pkts_drop: 0
dev.igc.0.mac_stats.mgmt_pkts_recvd: 0
dev.igc.0.mac_stats.unsupported_fc_recvd: 0
dev.igc.0.mac_stats.xoff_txd: 0
dev.igc.0.mac_stats.xoff_recvd: 0
dev.igc.0.mac_stats.xon_txd: 0
dev.igc.0.mac_stats.xon_recvd: 0
dev.igc.0.mac_stats.alignment_errs: 0
dev.igc.0.mac_stats.crc_errs: 1
dev.igc.0.mac_stats.recv_errs: 0
dev.igc.0.mac_stats.recv_jabber: 0
dev.igc.0.mac_stats.recv_oversize: 0
dev.igc.0.mac_stats.recv_fragmented: 0
dev.igc.0.mac_stats.recv_undersize: 0
dev.igc.0.mac_stats.recv_no_buff: 0
dev.igc.0.mac_stats.recv_length_errors: 0
dev.igc.0.mac_stats.missed_packets: 0
dev.igc.0.mac_stats.defer_count: 0
dev.igc.0.mac_stats.sequence_errors: 0
dev.igc.0.mac_stats.symbol_errors: 0
dev.igc.0.mac_stats.collision_count: 1
dev.igc.0.mac_stats.late_coll: 0
dev.igc.0.mac_stats.multiple_coll: 0
dev.igc.0.mac_stats.single_coll: 0
dev.igc.0.mac_stats.excess_coll: 0
Not sure where this leaves me. The errors don't seem to be 'real' but if so why is the device counting so many?
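For anyone wanting to pick the non-zero error counters out of a long mac_stats dump like the one above, here is a minimal Python sketch. It assumes the text comes from something like `sysctl dev.igc.0.mac_stats` saved to a string; the `ERROR_HINTS` list and the function names are made up for this example and are not part of any pfSense or FreeBSD tool.

```python
# A few lines copied from the dev.igc.0.mac_stats output above.
SAMPLE = """\
dev.igc.0.mac_stats.good_pkts_txd: 134771300
dev.igc.0.mac_stats.good_pkts_recvd: 95240242
dev.igc.0.mac_stats.total_pkts_recvd: 95240243
dev.igc.0.mac_stats.crc_errs: 1
dev.igc.0.mac_stats.symbol_errors: 0
dev.igc.0.mac_stats.collision_count: 1
dev.igc.0.mac_stats.late_coll: 0
"""

# Substrings treated as marking an error-type counter. This list is an
# assumption for the sketch, chosen to match the counter names shown above.
ERROR_HINTS = ("err", "coll", "missed", "jabber", "oversize", "undersize",
               "fragmented", "no_buff", "drop", "defer")


def parse_counters(text):
    """Return {short_name: value} for every 'name: value' line."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'name: value'
        short_name = name.strip().rsplit(".", 1)[-1]
        counters[short_name] = int(value)
    return counters


def nonzero_errors(counters):
    """Keep only error-type counters whose value is not zero."""
    return {name: value for name, value in counters.items()
            if value and any(hint in name for hint in ERROR_HINTS)}


counters = parse_counters(SAMPLE)
errors = nonzero_errors(counters)
print(errors)
print(counters["total_pkts_recvd"] - counters["good_pkts_recvd"])
```

On the figures posted, the only non-zero error counters are crc_errs and collision_count, and total_pkts_recvd exceeds good_pkts_recvd by exactly one, which would be consistent with the single CRC error being on the receive side.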
-
A single collision or other error like that can be from when the link came up, for example, or if it renegotiated. Not anything I'd worry about.
You could try switching to a different igc NIC in case it's somehow just not reporting errors on the parent that are in fact there.
If it really was seeing errors though I'd expect to see your upload speed impacted but you're seeing good throughput?
-
@stephenw10 I'm seeing good throughput for both upload and download. In fact upload is generally better than download (but that is likely due to Internet factors more than anything). The 'old' PPPoE driver does not report any errors, so it seems to me that this is somehow related to the new driver. Since switching to a different igc NIC is a bit of a faff (and disruptive), I think I won't try that for now. It certainly looks (to me at least) like there is at least one additional bug beyond the one you identified (unless that is somehow also responsible for this).
-
The interesting part to me is that it's shown on the VLAN. We know the old driver doesn't log some errors like the dropped packets from traffic shaping. We only found that when using the new driver.
But the VLAN should only be passing the PPPoE packets which should be the same for both drivers. I assume you don't see errors on the VLAN with the old driver?
-
@stephenw10 Nope, no errors at all (on pppoe0 or igc0.911) with the old driver. And of course nothing in mac_stats for igc.0 with old or new drivers.
-
Hmm, could be a clue there.
-
@stephenw10 I've been doing more experiments and have some more info on all of this.
-
With the new driver it seems that heavy traffic triggers an increase in the number of reported errors, for example an HTTP or iperf3 speed test that pushes the link close to its limit (though the speeds are still very good).
-
I'm now running with the old driver (as an experiment for comparison purposes). I saw a small number of errors on the VLAN only just after the router had restarted and some small increase in those over time, and no errors at all on the pppoe0 interface, even after several hours of running including many speed tests.
After boot
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll
igc0 1500 <Link#1> XX:XX:77:7f:c9:d6 4804079 0 0 5717254 0 0
...
igc0.911 1500 <Link#15> XX:XX:77:7f:c9:d6 4804079 0 0 5717254 820 0
igc0.911 - fe80::%igc0.911/64 fe80::XXXX:77ff:fe7f:c9d6%igc0.911 0 - - 0 - -
...
pppoe0 1492 <Link#17> pppoe0 4803848 0 0 5717807 0 0
pppoe0 - XXX.69.48.XXX/32 abcdef.com 15982 - - 7 - -
pppoe0 - fe80::%pppoe0/64 fe80::XXXX:77ff:fe7f:c9d6%pppoe0 1386 - - 1391 - -
pppoe0 - abcdef.com abcdef.com 1232 - - 3075 - -
pppoe0 - 2XX2:XXX:62fb::123/128 2XX2:XXX:62fb::123 501 - - 1 - -
pppoe0 - fe80::%pppoe0/64 fe80::XXXX:77ff:fe7f:c9d9%pppoe0 0 - - 0 - -
pppoe0 - 2XX2:XXX:feed:62fb::/64 2XX2:XXX:feed:62fb:92ec:77ff:fe7f:c9d6 1848 - - 0 - -
...
An hour, and several speed tests, later
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll
igc0 1500 <Link#1> XX:XX:77:7f:c9:d6 15059231 0 0 18533104 0 0
...
igc0.911 1500 <Link#15> XX:XX:77:7f:c9:d6 15059231 0 0 18533104 820 0
igc0.911 - fe80::%igc0.911/64 fe80::XXXX:77ff:fe7f:c9d6%igc0.911 0 - - 0 - -
...
pppoe0 1492 <Link#17> pppoe0 15058839 0 0 18533476 0 0
pppoe0 - XXX.69.48.XXX/32 abcdef.com 111491 - - 28 - -
pppoe0 - fe80::%pppoe0/64 fe80::XXXX:77ff:fe7f:c9d6%pppoe0 8527 - - 8535 - -
pppoe0 - abcdef.com abcdef.com 4958 - - 12415 - -
pppoe0 - 2XX2:XXX:62fb::123/128 2XX2:XXX:62fb::123 3912 - - 1 - -
pppoe0 - fe80::%pppoe0/64 fe80::XXXX:77ff:fe7f:c9d9%pppoe0 0 - - 0 - -
pppoe0 - 2XX2:XXX:feed:62fb::/64 2XX2:XXX:feed:62fb:92ec:77ff:fe7f:c9d6 8376 - - 0 - -
...
Several hours later
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll
igc0 1500 <Link#1> XX:XX:77:7f:c9:d6 57841186 0 0 68454730 0 0
igc0 - fe80::%igc0/64 fe80::XXXX:77ff:fe7f:c9d6%igc0 0 - - 1 - -
...
igc0.911 1500 <Link#15> XX:XX:77:7f:c9:d6 57841186 0 0 68454730 1590 0
igc0.911 - fe80::%igc0.911/64 fe80::XXXX:77ff:fe7f:c9d6%igc0.911 0 - - 0 - -
...
pppoe0 1492 <Link#17> pppoe0 57840234 0 0 68454319 0 0
pppoe0 - XXX.69.48.XXX/32 abcdef.com 513438 - - 250 - -
pppoe0 - fe80::%pppoe0/64 fe80::XXXX:77ff:fe7f:c9d6%pppoe0 39368 - - 39387 - -
pppoe0 - abcdef.com abcdef.com 21290 - - 62223 - -
pppoe0 - 2XX2:XXX:62fb::123/128 2XX2:XXX:62fb::123 18804 - - 1 - -
pppoe0 - fe80::%pppoe0/64 fe80::XXXX:77ff:fe7f:c9d9%pppoe0 0 - - 0 - -
pppoe0 - 2XX2:XXX:feed:62fb::/64 2XX2:XXX:feed:62fb:92ec:77ff:fe7f:c9d6 40841 - - 0 - -
...
mac_stats for igc.0 show no errors of any kind. I'd be interested to know where the (very few) errors counted against the VLAN are coming from.
It seems to me that maybe the new driver is not quite ready for prime time.
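To put a number on "very few", the Oerrs column can be turned into a percentage of Opkts per interface. The following is only a rough Python sketch: it assumes the interface listing is available as text with whitespace-separated columns in the order shown above, and it only looks at the `<Link#N>` rows that have all ten fields. The function names are invented for the example.

```python
# Link-level rows copied from the "Several hours later" listing above.
SAMPLE = """\
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll
igc0 1500 <Link#1> XX:XX:77:7f:c9:d6 57841186 0 0 68454730 0 0
igc0.911 1500 <Link#15> XX:XX:77:7f:c9:d6 57841186 0 0 68454730 1590 0
igc0.911 - fe80::%igc0.911/64 fe80::XXXX:77ff:fe7f:c9d6%igc0.911 0 - - 0 - -
pppoe0 1492 <Link#17> pppoe0 57840234 0 0 68454319 0 0
"""


def link_rows(text):
    """Return {interface: {'opkts': n, 'oerrs': n}} for <Link#N> rows only."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: link rows carry all ten columns, as in the sample.
        if len(fields) == 10 and fields[2].startswith("<Link#"):
            rows[fields[0]] = {"opkts": int(fields[7]),
                               "oerrs": int(fields[8])}
    return rows


def oerr_percent(row):
    """Output errors as a percentage of output packets."""
    if row["opkts"] == 0:
        return 0.0
    return 100.0 * row["oerrs"] / row["opkts"]


rows = link_rows(SAMPLE)
for name, row in rows.items():
    print(f"{name}: {oerr_percent(row):.4f}% output errors")
```

With the old driver figures above this comes out at roughly 0.002% for igc0.911 and zero for igc0 and pppoe0.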
-
Hmm, seeing a few errors when the interface comes up is not that unusual. Errors on the VLAN only is odd though. Especially as they are increasing after boot.
You have hardware VLAN tagging enabled on igc0?
Shown in options out of capabilities like:
[25.07.1-RELEASE][admin@6100.stevew.lan]/root: ifconfig -vm igc0
igc0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1300
        options=48020b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,HWSTATS,MEXTPG>
        capabilities=4f43fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
Try disabling that and see if the errors stop:
ifconfig igc0 -vlanhwtag
-
@stephenw10 Yes it is enabled:
igc0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=4e427bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        capabilities=4f43fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether XX:XX:77:7f:c9:d6
        inet6 fe80::XXXX:77ff:fe7f:c9d6%igc0 prefixlen 64 scopeid 0x1
        media: Ethernet autoselect (2500Base-T <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 2500Base-T
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        drivername: igc0
I've turned it off now:
igc0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=4e427ab<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        capabilities=4f43fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether XX:XX:77:7f:c9:d6
        inet6 fe80::XXXX:77ff:fe7f:c9d6%igc0 prefixlen 64 scopeid 0x1
        media: Ethernet autoselect (2500Base-T <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 2500Base-T
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        drivername: igc0
Is that likely to have any detrimental effect?
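As a sanity check that the command changed only what was intended, the two options lines can be compared mechanically. This is a small Python sketch using the before and after options lines from this post; `parse_options` is a helper invented for the example, and it assumes the line has the `options=<hex><FLAG,FLAG,...>` shape shown by ifconfig above.

```python
import re

# The options lines from the two ifconfig outputs in this post.
BEFORE = ("options=4e427bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,"
          "VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWTSO,RXCSUM_IPV6,"
          "TXCSUM_IPV6,HWSTATS,MEXTPG>")
AFTER = ("options=4e427ab<RXCSUM,TXCSUM,VLAN_MTU,JUMBO_MTU,"
         "VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWTSO,RXCSUM_IPV6,"
         "TXCSUM_IPV6,HWSTATS,MEXTPG>")


def parse_options(line):
    """Split 'options=<hex><A,B,...>' into (bitmask, set of flag names)."""
    match = re.search(r"options=([0-9a-f]+)<([^>]*)>", line)
    if match is None:
        raise ValueError("no options=...<...> field found")
    return int(match.group(1), 16), set(match.group(2).split(","))


before_bits, before_flags = parse_options(BEFORE)
after_bits, after_flags = parse_options(AFTER)

removed = before_flags - after_flags   # flags that were switched off
added = after_flags - before_flags     # flags that were switched on
changed_bits = before_bits ^ after_bits

print("removed:", sorted(removed))
print("added:", sorted(added))
print("changed bits:", hex(changed_bits))
```

For the two outputs above the only difference is VLAN_HWTAGGING (a single bit, 0x10, in the hex mask), so the other offloads such as VLAN_HWCSUM and VLAN_HWTSO are still enabled.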
-
Potentially it might make the connection fractionally slower but I'd be surprised if you're able to detect it!
-
@stephenw10 I'll post an update after it has been running like this for several hours (this is still with the old PPPoE driver). If this eliminates the errors then I could also try it with the new driver to see if it has any effect on that.
-
@stephenw10 Looking good so far; since turning off hardware VLAN tagging no further errors have accrued on the vlan interface, and still zero on the base interface and the pppoe interface, this being with the old PPPoE driver.
If the situation remains the same by the morning (UK time), is it worth me trying with the new PPPoE driver to see if this also eliminates the errors that it was reporting? I've set up a 'scriptcmd' to disable the hardware tagging on every reboot in case I forget.
-
Yes, definitely try the if_pppoe driver if you can. That would be odd if the pppoe packets are somehow triggering some issue there but at least consistent. And it does nicely tie in with the 'vlan not parent' behaviour. You might not be seeing it as much in the old driver simply because it's not pushing the NIC as hard.
-
@stephenw10 So over a 12 hour period with the old driver and vlanhwtag turned off there were a total of 19 Oerrs reported against the VLAN device, with zero on the base interface and zero on the PPPoE interface. It's unclear why there were any errors on the VLAN device but at least the rate of increase is minuscule now.
This morning I switched back to the new PPPoE driver and ensured that vlanhwtag was still turned off (it was). After the reboot there were 357 Oerrs showing on the VLAN interface and 514 on the PPPoE interface, a difference of 157 (previously with the new driver the PPPoE error count was always 5 fewer than the VLAN error count).
After 30 minutes the Oerr count has increased to 4333 on the VLAN device and 4490 on the PPPoE device (still a difference of 157). So turning off hardware VLAN tagging hasn't resolved the fundamental issue when using the new driver. I'll leave it to run in this configuration for the rest of today at least.
I'm not sure where this leaves me; there is clearly some kind of issue with the new PPPoE driver. Maybe it is just mis-reporting errors (though why that also percolates down to the VLAN interface is a bit concerning) or maybe there truly is some kind of problem. If so it seems likely that it is a driver issue rather than an actual interface/cable/ONT issue.
So, should I stick with the new driver permanently or switch back to the old driver until these issues get resolved? Is there anything more we can do to diagnose this so that someone can fix the issue?
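The counts quoted in this post can be checked with a little arithmetic. The Python sketch below uses only the four figures given above and treats "after 30 minutes" as exactly 30 minutes, which is an approximation.

```python
# Oerr counts quoted in this post (new driver, hardware VLAN tagging off).
samples = [
    {"minutes": 0, "vlan": 357, "pppoe": 514},     # just after the reboot
    {"minutes": 30, "vlan": 4333, "pppoe": 4490},  # about 30 minutes later
]
first, last = samples

# Gap between the PPPoE and VLAN counters at each sample.
offsets = [s["pppoe"] - s["vlan"] for s in samples]

# How much each counter grew between the two samples.
vlan_growth = last["vlan"] - first["vlan"]
pppoe_growth = last["pppoe"] - first["pppoe"]

# Average growth per minute over the interval.
elapsed = last["minutes"] - first["minutes"]
per_minute = vlan_growth / elapsed

print("offsets:", offsets)
print("growth:", vlan_growth, pppoe_growth)
print(f"about {per_minute:.1f} Oerrs per minute")
```

Both counters grew by the same amount (3976) over the interval, which is why the gap stays at 157. One possible reading is that each event is being counted once at the PPPoE layer and once at the VLAN layer, but the numbers alone can't confirm that.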
-
Well if it's not hurting the speeds and the total throughput is still higher I'd stick with the new driver. That will at least show any other issues you might have with your particular ISP for example.
Also it will be much easier to run tests against when we come up with some!
-
@stephenw10 Yeah, that was kind of my thinking too. I'll stick with it for now despite the (seemingly) high error rate (0.1% if it could be believed).
Do let me know if anyone has any ideas for trying to home in on the cause of these 'errors'.