PPPoE with VLAN tag not reconnecting with pfSense on ProxMox VM, wrong MTUs?
-
Hi all,
I may have found some sort of bug with pfSense and DSL PPPoE connections. It seems that pfSense in a virtualised setup has issues handling a PPPoE connection with a VLAN tag. In my case, any time the PPPoE went down it didn't reconnect anymore.
About a year ago I replaced the home network router with a virtualised pfSense setup in ProxMox. The previous modem/router, a Draytek2862B, could not function properly without a scheduled reboot every week. With each new firmware update it was also getting slower at handling firewall rules. And there were mysterious, very short outages/delays on the DSL line that didn't show up in the logs (logging is sparse on Draytek routers like this one). This and some other considerations made me migrate to pfSense.
That actually went very well at first. The DSL to WAN connection first looked like this after migration:
VDSL bridge modem (with VLAN tag6) --> local switch (only passes PPPoE protocol between dedicated ports) --> NIC on ProxMox --> [VM pfSense vtnet0 interface, VLAN tag6 --> VLAN6 interface --> PPPoE interface on WAN].
The LAN side (with several other VLANs) is on a different NIC. The internet connection was very stable in the first two months. Even though I still used the Draytek2862B in full bridge mode, the internet speed was a bit higher than before. No more scheduled reboots, no reboots after firewall changes, I was very happy.
In the initial setup I made an error in the MTU setting: both the physical NIC in ProxMox and the virtual NIC in the VM (vtnet0) were set to 1508. That should have been 1512, since the PPPoE connection is also inside a VLAN (in my case 6). But... the internet connection worked when I started to set everything up, on both IPv4 and IPv6. I didn't notice the MTU error until months later.
After around 8 weeks without any interruptions, the PPPoE connection suddenly went down and would not reconnect. Internet was down for hours since I wasn't home when it happened. When I got home I restarted the VM, but that had no effect. So I checked the Draytek2862B bridge modem: it was still reporting a perfect DSL connection... but the PPPoE interface in pfSense still could not get a connection (the log already showed hundreds of reconnection attempts). Only after a reboot of the modem was the connection restored.
This made me think that the Draytek was still causing problems. I had other weird issues with it when I still used it as a router, and back then I could not let it run for 2 weeks without a reboot. However, I assumed that this was just coincidental, since in full bridge mode and with pfSense it had worked perfectly for 8 weeks. Then after just a few weeks it happened again, and 2 weeks later again.
So I tried to reproduce the issue by simply pulling cables. And every time I reconnected the cables, the WAN reconnected perfectly. I had already dug through all the logs, Googled many times, and visited several forums including this one from Netgate. I read many PPPoE issue topics, but almost none of them were very clear on causes and possible solutions. I contacted my ISP about this, but they could not give me any information other than that their PPPoE server has a security policy that temporarily locks out any modem that makes many login attempts within a minute (DDoS protection). They assumed that my router's reconnection timer was set too aggressively. Unfortunately they had no easy way of checking, plus there's no setting in pfSense to change the 10 second delay between PPPoE reconnect attempts.
Then I saw someone in a forum (can't remember which) mentioning that PPPoE connection problems can be an MTU issue. That's when I found that I had the MTU set up wrong. I changed the incoming interface to 1512, rebooted everything and thought it was fixed. Yeah... not... after a month the PPPoE went down again. I rebooted the modem and everything was restored.
Now I was convinced the Draytek was the cause, so I ordered a new VDSL bridge modem (Zyxel). Incidentally, that one drew less power than the Draytek. But it did not resolve the PPPoE problem: a few weeks later it went down again.
This time I started checking the logs and MTU parameters more thoroughly. In the logs I had already noticed that every time the PPPoE connection went down, the message "shutdown requested" from the ISP side appeared in the PPP logs. That didn't explain, though, why rebooting the external modem helped and rebooting pfSense did not. Even briefly pulling the cables had no effect.
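For reference, the arithmetic behind 1508 versus 1512 can be written out as a small sketch. It assumes a PPP MTU of 1500 is wanted (as in RFC 4638), 8 bytes of PPPoE/PPP overhead and 4 bytes for the 802.1Q VLAN tag. The function and constant names are made up for illustration, and the "add 4 for the tag" step follows the reasoning in this post rather than any pfSense documentation.

```python
# Byte counts per frame (assumptions stated in the text above).
PPPOE_OVERHEAD = 8  # 6-byte PPPoE header + 2-byte PPP protocol ID
VLAN_TAG = 4        # 802.1Q tag, counted here as in the post's reasoning


def required_ethernet_mtu(ppp_mtu: int, vlan_tagged: bool) -> int:
    """Return the MTU the interface underneath the PPPoE session must carry."""
    mtu = ppp_mtu + PPPOE_OVERHEAD
    if vlan_tagged:
        # The tag is added inside the VM, so the NIC has to carry it too.
        mtu += VLAN_TAG
    return mtu


print(required_ethernet_mtu(1500, vlan_tagged=False))  # 1508
print(required_ethernet_mtu(1500, vlan_tagged=True))   # 1512
```

So with the VLAN tag handled inside pfSense the NIC would need 1512, and with the tag handled outside the VM 1508 is enough.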
I opened the CLI to check the status of each interface directly... and there I found something strange. The virtual network adaptor vtnet0 reported MTU=1508. That's odd, it should be 1512 as configured by Proxmox (virtio network interface). In pfSense it is set to 1512 as well (MLPPP parameters). So I changed the WAN and PPPoE MTU via the pfSense GUI just to see what would happen. No matter what value I entered in the WAN interface and in the PPPoE interface... the MTU on vtnet0 remained at 1508!!
Could it be that I had to create a secondary WAN interface with vtnet0 as its interface? Up till then I only had a VLAN interface connected to vtnet0, and that does not have an MTU setting. The PPPoE interface was connected to that VLAN interface. After creating the secondary WAN connected to vtnet0 with 1512 as the MTU in pfSense, the MTU status via ifconfig in the CLI now reported 1504 !!!
So I took a different approach: I changed the virtual NIC in ProxMox connected to the pfSense VM. I set it to use VLAN6 untagged with MTU 1508, keeping the bridge at 1512. Inside pfSense I removed the VLAN6 interface and connected the PPPoE interface directly to vtnet0. When I checked the MTU via the CLI it was now correct, 1508. Now it looks like this:
VDSL bridge modem (with VLAN tag6) --> local PoE switch (only passes PPPoE protocol between dedicated ports) --> NIC on ProxMox --> [VM pfSense vtnet0 interface, untagged --> PPPoE interface on WAN]
After a reboot everything worked normally and I started waiting for the next outage. Fortunately I didn't have to wait weeks this time. The PPPoE went down within a week. The logs showed that PPPoE had received the "shutdown requested" message again, but now it reconnected within a minute. In fact, there were a couple of incidents after that and all of them reconnected perfectly.
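The CLI check of configured versus reported MTU could be scripted along these lines. This is only a rough sketch: it assumes FreeBSD-style `ifconfig` header lines, the sample text is made up for illustration (not captured from my system), and the helper names are hypothetical.

```python
import re

# Matches FreeBSD-style interface header lines such as:
#   vtnet0: flags=8863<UP,...> metric 0 mtu 1508
HEADER = re.compile(r"^(\S+): flags=.*\bmtu (\d+)", re.MULTILINE)


def reported_mtus(ifconfig_output: str) -> dict[str, int]:
    """Map interface name -> MTU as reported in the ifconfig text."""
    return {name: int(mtu) for name, mtu in HEADER.findall(ifconfig_output)}


def mtu_mismatches(reported: dict[str, int], expected: dict[str, int]) -> dict:
    """Return {interface: (expected, reported)} for every value that differs."""
    return {
        name: (want, reported.get(name))
        for name, want in expected.items()
        if reported.get(name) != want
    }


# Illustrative sample only, not real output from the setup in this post.
SAMPLE = """\
vtnet0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1508
\tether 00:00:00:00:00:01
pppoe0: flags=88d1<UP,POINTOPOINT,RUNNING,NOARP,SIMPLEX,MULTICAST> metric 0 mtu 1500
"""

expected = {"vtnet0": 1512, "pppoe0": 1500}
print(mtu_mismatches(reported_mtus(SAMPLE), expected))
# {'vtnet0': (1512, 1508)}
```

On a real system the text would come from running `ifconfig` in the pfSense shell instead of the sample string.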
Pondering about this...
it seems that the PPPoE interface can't connect reliably over a VLAN interface in pfSense itself. It's as if the PPPoE shuts down the VLAN interface permanently and doesn't restart it. And for some reason it also sets the wrong MTU on the (virtual) NIC. I got it working with the PPPoE data untagged by ProxMox. Maybe this is partly caused by virtualisation, I don't know. For now I have solved my problem, but I would still like to know why rebooting the external bridge modem restored the PPPoE connection and simply pulling the ethernet cable did not. What's the difference?
Since I have read many other posts about similar problems with PPPoE, I think this is something that developers could look into.
Setup in use during this time:
version: pfSense 2.6.0 to 2.7.2
ProxMox 7.x on an Apple MacMini 2014 with Thunderbolt ethernet adaptors (WAN and LAN).
WAN is PPPoE via vtnet0 and LAN via vtnet1 - virtIO drivers (Open vSwitch).