dhclient exiting on WAN
-
I guess that there is no easy solution for this timing/race problem - somebody would provide it already :). I could try to look into it but I would gladly learn which files or scripts should interest me. That would help a lot - so I don't start to search by looking everywhere. Or some pointers regarding scripts logic when link gets up or down.
-
@tomashk said in dhclient exiting on WAN:
But if I have problem described above and I will go to Status->Interfaces to Renew connection on WAN pfsense will tell me that now it has connection but
At that moment, and before your PC on LAN has a connection, does pfSEnse have a connection ?
Go to the console, option 8 and ping a remote IP like 8.8.8.8 - if ok, it's up.
ping dns.google - if ok, dns is also working.Check the man file of dhclient, what these time settings mean.
I tend to say : use Backoff ( so when dhclient start, it will wait a bit) or use the Initial interval ?When a hardware event occurs, hardware management (devd ?) will start to signal processes that use that hardware. Like, for example, interface igb0 will trigger a dhclient startup.
The thing is, this parts works. But you have a modem that will then bring the interface down, up down up down up. dhclient looses track.
And it gets even better : a WAN link event triggers also the restart of many other processes, which will affect the system's reactivity, as the processor has a lot to do.@tomashk said in dhclient exiting on WAN:
it will take around 50s to first link up
but after 22s it will get link down
to get final link up after 3sKnown. Modems tend to do these things. There are even reasons about the 'why' part.
@tomashk said in dhclient exiting on WAN:
Apr 10 11:44:44 dhclient 25640 Cannot open or create pidfile: No such file or directory
Known bug without much consequences.
dhclient wants to store the pid file on a place that doesn't exist.
It can be corrected if needed but won't solve the issue.@tomashk said in dhclient exiting on WAN:
servicewatchdog_cron
There are some really rare cases where this tool can be helpful, but if you are looking for the issue, it will create just more issues.
Some possible things to test :
Try several timing settings shown above.
Put a switch between your modem and pfSense.
Locate the place where dhclient is started, and add a 'wait xx seconds' instruction. -
Thank you very much for suggestions. I'll start analysis after work and check what works as quick fix for me and if I have time in the next few days I'll try to go deeper, but in that case I will try to at least understand why it is broken for me because it might help others. But I guess first I have to read suggested sources and only after that think about fixing anything :)
-
- dhclient problem:
- Timing parameters (backoff, etc.) doesn't work in a few configurations I tried. Probably because dhclient is no able to start. The only line with dhclient in logs is:
rc.bootup: The command '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb0 > /tmp/igb0_output 2> /tmp/igb0_error_output' returned exit code '1', the output was ''
and contents of /tmp/igb0_output is
igb0: no link .............. giving up
- I tried to wait but without my intervention (renew button in interfaces status page) dhclient is never called in my scenario
- Now I remember that I had almost the same problem many years ago but at some point it just started to work. Maybe something changed in pfsense or maybe modem timings changed.
- "Additional" problem
- Still when I renew WAN dhclient is called and I'm getting WAN IP. But all devices on LAN's side doesn't get Internet access. It is strange because pfsense shows IP for WAN and dpinger (after forcing its start) provide stats for gateway widget
- As suggested I connected with putty to pfsense and tried pinging google. Both IP and domain complained about connection. Especially interesting was pinging of IP address because it returned "ping: sendto: No route to host" error.
- After that I looked to Diagnostics / Routes and it looked properly until I compared it with properly working situation and I have noticed that default route is not there. I mentioned that I have to save for example routing configuration to have working internet connection. The change that happens at that point is that default route is added to routing table and with that internet works.
So that was short update in case somebody sees something interesting. Thanks to previous suggestions now I have some new information to work with.
-
Is WAN_DHCP set as the default gateway in System > Routing?
Do you have more than one IPv4 gateway shown there?
-
@stephenw10
Now it is like that:
Previously it has been set as Automatic but I have just switched it to WAN_DHPC and it behaves exactly the same (or at least I don't see any changes). Only after pressing "Save" on System/Routing/Gateways page I got default route.Also as you can see IPv6 gateway is set to none if it matters but I don't use IPv6 so I don't think that it means anything.
-
Also I looked in the logs to see how it behaves when I used renew button for interface (scenario 1) to get IP (when dhclient did not work for me on the start of modem) and how it behaves when I get properly working internet and I just unplug ethernet cable from modem and plug it back (scenario 2).
- scenario 1
- scenario 2
As we can see first one doesn't add default route. Unfortunately I have not analysed it enough to know if it is bad or not. I just show my observations
- scenario 1
-
Hmm, so in scenario 1 you end up without a default route even though you do have a gateway?
-
@stephenw10
Yes. Now with working internet I have situation from the picture below. With scenario 1 I don't have the first line starting with default
I will test a little bit more soon because I want to rule any situation that those problems are caused by my mistakes with configuration. I believe those problems started with ver 2.6/22.01 but I can't be sure - I haven't noticed it previously because I had long uptime and there was no need for turning it off.
-
Mmm, there are a number of threads reporting similar issues. It look to be real IMO but it's not something we can easily replicate. It seems to be timing related.
Steve
-
Scenario 1: from bottom to top :
Scenario 2: from bottom to top :
So, if scenario 2 is what should happen = adding the 'default' route, it is not logged and thus not happening in scenario 1.
Maybe the IP was the same so the "add route default" wasn't needed ?
As far as know, dhclient uses shell scripts to do this. I'll have a look into that this weekend. -
Sorry for late response - I was away for a few days.
@gertjan said in dhclient exiting on WAN:
Maybe the IP was the same so the "add route default" wasn't needed ?
Probably IP was the same. Unfortunately I've lost those logs but currently whenever I renew I get the same WAN IP.
PS. For test I installed pfsense on another device and restored my configuration. Of course it works properly every time. Probably it will take some time to find how it differs when compared with "wrong logs". The main difference I see from the start is that when it works (on new installation) dhclient is able to start properly. So there is nothing like
rc.bootup: The command '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb0 > /tmp/igb0_output 2> /tmp/igb0_error_output' returned exit code '1', the output was ''
in the logs. And later I can see standard DHCPDISCOVER (or others) or dhclient's messages about link state like
igb0 link state up -> down igb0 link state down -> up
I guess I'll try to find out what might stop dhclient from starting and throwing
igb0: no link .............. giving up
into /tmp/igb0_output. Of course I'll try to find other differences, but this one looks like the main one.
-
@tomashk said in dhclient exiting on WAN:
I guess I'll try to find out what might stop dhclient from starting and throwing
igb0: no link .............. giving up
into /tmp/igb0_output
You've answered your own question.
The dhclient aborts when there is no WAN interface any more.
It stops executing, as the interface WAN isn't listed as a system resource any more.Your issue isn't dhclient , that does exacly what it is designed to do.
Focus on the WAN interface : make it stopping going down - up - down - up - etc.
The thing is, you know already who brings the interface down : the other device in front of it.
Modems do so very often. It's their way to 'signal' that they created a "soft link" with the upstream ISP.
Or : this scenario : it powers up. The internal web GUI comes up, and the modem brings up it's LAN == the wan of pfSense. When the dhclient starts now, and send a DHCPDISCOVER, it will obtain an RFC1918 from the DHCP server of the modem. This RFC1918 will permit you to connect to the GUI of the modem to change it settings.
Moments later, the modem finished building the soft link with the ISP.
It brings down it's LAN, as it has to as this is the only way to reset the current connection, and brings it again. Now, when the dhclient of pfSense send a DHCPDISCOVER, the modem will be transparent, and the DHCPDISCOVER will reach the ISP. This time, the DHCP server of the ISP will reply, and give you a 'real' WAN IP.
This is how most modems work. For decades now.Things become even more interesting if initially the modem has a hard time syncing with the ISP. Every time it looses the connection, because, for example, the initial speed was to high, it aborts, (bring its LAN down), and tries a connection with a lower speed.
dhclient, on the pfSense side, looses track.
Your mission, as a modem user, is to find the right timings settings.
You know it('s possible, because most often, when you hook up a PC to your modem and use that pppoe connection setting, your PC will connect.
Your PC also uses a dhcp client like software. -
@tomashk said in dhclient exiting on WAN:
For test I installed pfsense on another device and restored my configuration. Of course it works properly every time.
What's the difference between those devices?
The is probably a timing issue. If the upstream device resets the link it may only be down for a short time and that must coincide with the dhclient script running in order to fail.
Running on hardware that is significantly faster or slower could well change the timing enough to prevent it happening.It should not be possible to hit that but it may also be easy to work around by adding a boot delay in pfSense.
Steve
-
@stephenw10 The one failing is based on small Asrock motherboard with Intel Celeron J4005 and working one is hp t620 plus thin client. I believe the second one is taking longer to start.
As you said this is probably a timing issue. The idea with boot delay sounds like good quick fix once I'll try to go back to J4005. For now I'll have to leave it because work doesn't leave me much time to experiment. For a moment I have that temporary new device which works. But in a week or two I am going to get back to it and try to understand it better.
By understand I mean for example that long time ago I looked into https://github.com/pfsense/FreeBSD-src/blob/devel-12/sbin/dhclient/dhclient.c (the file is important - I just took random branch from github as example) and I found the code that gives up (line 469)
Due to value 10 in line 468 and sleep from line 472 dhlient will try to wait for WAN interface only for 10 seconds. I was thinking that maybe it would be good to change value 10 to environment variable (with default value 10) so user can set this, but there are two problems with my current poor understanding:
- I'm not able to predict what would be broken in completely different part of the system if dhclient starts too long (maybe nothing but maybe some kind of the butterfly effect :) )
- there might already exist some system script that should be used for such situations
-
Mmm, I agree increasing that initially seems to be a good idea but could easily cause problems on systems with multiple disconnected WANs.
It would be nice to have that value configurable.
As a simple test you could just set the line:
autoboot_delay="30"
In /boot/loader.conf. That's normally 3s. If that works you might put it in loader.conf.local instead to prevent it getting overwritten.
Steve
-
Quick update. Because I wanted to test the release candidate for 22.05 I took the old device, that had problems described above. After updating to the RC version it still existed. But I used suggested autoboot_delay and it fixed this timing problem. For me 10 seconds were enough to make it work.
-
Anyone still hitting this who can test there is a patch now available here:
https://redmine.pfsense.org/issues/13671#note-4
We need feedback there to confirm it.Steve