dhclient exiting on WAN
-
Re: pfSense WAN dhcp client exiting (error)
I believe I have exactly the same problem as the one mentioned above. Probably race condition not letting dhclient to start. I'll add some new information. Maybe it will let somebody point something that would be at least partial solution. Normally I would ignore it but after power loss it is problem if I am not in the same place.
Main problem
I get this problem if I start pfsense (22.01) at the same time as modem. And modem starts really strange:- it will take around 50s to first link up
- but after 22s it will get link down
- to get final link up after 3s
Dhclient tries to run when we get link down change. Additionally dpinger is not able to start so I have Service watchdog running and checking it.
Logs showing the problem (WAN is igb0) with first WAN link UP at 11:45:04:
Apr 10 11:44:44 kernel lo0: link state changed to UP Apr 10 11:44:44 kernel vlan0: changing name to 'igb1.10' Apr 10 11:44:44 kernel vlan1: changing name to 'igb1.20' Apr 10 11:44:44 kernel vlan2: changing name to 'igb1.50' Apr 10 11:44:44 kernel vlan3: changing name to 'igb1.200' Apr 10 11:44:45 sshd 26424 Server listening on :: port 22. Apr 10 11:44:45 sshd 26424 Server listening on 0.0.0.0 port 22. Apr 10 11:44:45 sshguard 27262 Now monitoring attacks. Apr 10 11:44:48 check_reload_status 417 Linkup starting igb1 Apr 10 11:44:48 kernel igb1: link state changed to UP Apr 10 11:44:48 kernel igb1.50: link state changed to UP Apr 10 11:44:48 kernel igb1.200: link state changed to UP Apr 10 11:44:48 kernel igb1.20: link state changed to UP Apr 10 11:44:48 kernel igb1.10: link state changed to UP Apr 10 11:44:48 check_reload_status 417 Linkup starting igb1.50 Apr 10 11:44:48 check_reload_status 417 Linkup starting igb1.200 Apr 10 11:44:48 check_reload_status 417 Linkup starting igb1.20 Apr 10 11:44:48 check_reload_status 417 Linkup starting igb1.10 Apr 10 11:44:56 php 435 rc.bootup: The command '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb0 > /tmp/igb0_output 2> /tmp/igb0_error_output' returned exit code '1', the output was '' Apr 10 11:44:56 php 435 rc.bootup: Resyncing OpenVPN instances. Apr 10 11:44:56 kernel pflog0: promiscuous mode enabled Apr 10 11:44:56 kernel load_dn_sched dn_sched FIFO loaded Apr 10 11:44:56 kernel load_dn_sched dn_sched QFQ loaded Apr 10 11:44:56 kernel load_dn_sched dn_sched RR loaded Apr 10 11:44:56 kernel load_dn_sched dn_sched WF2Q+ loaded Apr 10 11:44:56 kernel load_dn_sched dn_sched PRIO loaded Apr 10 11:44:56 kernel load_dn_sched dn_sched FQ_CODEL loaded Apr 10 11:44:56 kernel load_dn_sched dn_sched FQ_PIE loaded Apr 10 11:44:56 kernel load_dn_aqm dn_aqm CODEL loaded Apr 10 11:44:56 kernel load_dn_aqm dn_aqm PIE loaded Apr 10 11:44:57 php 435 rc.bootup: sync unbound done. Apr 10 11:45:04 check_reload_status 417 Linkup starting igb0 Apr 10 11:45:04 kernel igb0: link state changed to UP Apr 10 11:45:26 check_reload_status 417 Linkup starting igb0 Apr 10 11:45:26 kernel igb0: link state changed to DOWN Apr 10 11:45:27 php 435 rc.bootup: The command '/usr/bin/timeout -k 45 30 /usr/local/sbin/ntpd -g -q -c /dev/null 216.239.35.0 216.239.35.4 216.239.35.8 216.239.35.12' returned exit code '124', the output was '10 Apr 11:44:57 ntpd[16849]: ntpd 4.2.8p15@1.3728-o Wed Jan 12 15:39:52 UTC 2022 (1): Starting 10 Apr 11:44:57 ntpd[16849]: Command line: /usr/local/sbin/ntpd -g -q -c /dev/null 216.239.35.0 216.239.35.4 216.239.35.8 216.239.35.12 10 Apr 11:44:57 ntpd[16849]: ---------------------------------------------------- 10 Apr 11:44:57 ntpd[16849]: ntp-4 is maintained by Network Time Foundation, 10 Apr 11:44:57 ntpd[16849]: Inc. (NTF), a non-profit 501(c)(3) public-benefit 10 Apr 11:44:57 ntpd[16849]: corporation. Support and training for ntp-4 are 10 Apr 11:44:57 ntpd[16849]: available at https://www.nwtime.org/support 10 Apr 11:44:57 ntpd[16849]: ---------------------------------------------------- 10 Apr 11:44:57 ntpd[16849]: proto: precision = 0.144 usec (-23) 10 Apr 11:44:57 ntpd[16849]: line 0 column 0 syntax error, unexpected $end 10 Apr 11:44:57 ntpd[16849]: basedate set to 2021-12-31 10 Apr 11:44:57 ntpd[16849]: gps base set to 2022-01-02 (week 2191) 10 Apr 11:44:57 ntpd[16849]: Listen and drop on 0 v6wildcard [::]:123 10 Apr 11:44:57 ntpd[16849]: Listen and drop on 1 v4wildcard 0.0.0.0:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 2 igb0 [fe80::219:99ff:feec:5d44%1]:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 3 igb1 [fe80::219:99ff:feec:5d45%2]:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 4 lo0 [::1]:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 5 lo0 [fe80::1%5]:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 6 lo0 127.0.0.1:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 7 lo0 10.127.127.1:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 8 igb1.10 [fe80::219:99ff:feec:5d45%8]:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 9 igb1.10 192.168.10.1:123 10 Apr 11:44:57 ntpd[16849]: Listen normally on 10 igb1.20 [fe80::2 Apr 10 11:45:27 php 435 rc.bootup: NTPD is starting up. Apr 10 11:45:28 check_reload_status 417 Updating all dyndns Apr 10 11:45:29 php 435 rc.bootup: The command '/usr/local/sbin/strongswanrc stop' returned exit code '1', the output was 'strongswan not running? (check /var/run/daemon-charon.pid).' Apr 10 11:45:29 php-fpm 378 /rc.dyndns.update: Dynamic DNS custom (): IP address could not be extracted from Check IP Service Apr 10 11:45:29 php-fpm 378 /rc.dyndns.update: Dynamic DNS () There was an error trying to determine the public IP for interface - wan (igb0 ). Apr 10 11:45:29 check_reload_status 417 Linkup starting igb0 Apr 10 11:45:29 kernel igb0: link state changed to UP Apr 10 11:45:33 php 435 rc.bootup: Creating rrd update script Apr 10 11:45:33 syslogd exiting on signal 15 Apr 10 11:45:33 syslogd kernel boot file is /boot/kernel/kernel Apr 10 11:45:34 php-fpm 379 /rc.start_packages: Restarting/Starting all packages. Apr 10 11:45:34 vnstatd 50422 vnStat daemon 2.8 started. (pid:50422 uid:0 gid:0) Apr 10 11:45:34 vnstatd 50422 Monitoring (10): re0 (1000 Mbit) pfsync0 (1000 Mbit) pflog0 (1000 Mbit) igb1.50 (1000 Mbit) igb1.200 (1000 Mbit) igb1.20 (1000 Mbit) igb1.10 (1000 Mbit) igb1 (1000 Mbit) igb0 (1000 Mbit) enc0 (1000 Mbit) Apr 10 11:45:44 vnstatd 50422 SIGTERM received, exiting. Apr 10 11:45:44 vnstatd 61236 vnStat daemon 2.8 started. (pid:61236 uid:0 gid:0) Apr 10 11:45:44 vnstatd 61236 Monitoring (10): re0 (1000 Mbit) pfsync0 (1000 Mbit) pflog0 (1000 Mbit) igb1.50 (1000 Mbit) igb1.200 (1000 Mbit) igb1.20 (1000 Mbit) igb1.10 (1000 Mbit) igb1 (1000 Mbit) igb0 (1000 Mbit) enc0 (1000 Mbit) Apr 10 11:45:44 lighttpd_pfb 67220 [pfBlockerNG] DNSBL Webserver stopped Apr 10 11:45:44 lighttpd_pfb 70109 [pfBlockerNG] DNSBL Webserver started Apr 10 11:45:45 php-fpm 379 [pfBlockerNG] Starting firewall filter daemon Apr 10 11:45:45 tail_pfb 33043 [pfBlockerNG] Firewall Filter Service stopped Apr 10 11:45:45 php_pfb 33690 [pfBlockerNG] filterlog daemon stopped Apr 10 11:45:45 tail_pfb 35030 [pfBlockerNG] Firewall Filter Service stopped Apr 10 11:45:45 php_pfb 35245 [pfBlockerNG] filterlog daemon stopped Apr 10 11:45:45 vnstatd 39241 Error: pidfile "/var/run/vnstat/vnstat.pid" lock failed (Resource temporarily unavailable), exiting. Apr 10 11:45:45 tail_pfb 37691 [pfBlockerNG] Firewall Filter Service started Apr 10 11:45:45 root 39812 Bootup complete Apr 10 11:45:45 lighttpd_pfb 44210 [pfBlockerNG] DNSBL Webserver stopped Apr 10 11:45:45 php 39676 [pfBlockerNG] filterlog daemon started Apr 10 11:45:45 tail_pfb 44005 [pfBlockerNG] Firewall Filter Service started Apr 10 11:45:45 lighttpd_pfb 45987 [pfBlockerNG] DNSBL Webserver started Apr 10 11:45:45 php 45685 [pfBlockerNG] filterlog daemon started Apr 10 11:45:46 login 49840 login on ttyv0 as root Apr 10 11:45:46 sshguard 52583 Now monitoring attacks. Apr 10 11:46:00 php 97796 servicewatchdog_cron.php: Service Watchdog detected service dpinger stopped. Restarting dpinger (Gateway Monitoring Daemon) Apr 10 11:47:00 php 43610 servicewatchdog_cron.php: Service Watchdog detected service dpinger stopped. Restarting dpinger (Gateway Monitoring Daemon) Apr 10 11:48:00 php 66035 servicewatchdog_cron.php: Service Watchdog detected service dpinger stopped. Restarting dpinger (Gateway Monitoring Daemon)
From DHCP logs I get only one line with dhclient - it never gets back to do anything more (until I somehow force connection):
Apr 10 11:44:44 dhclient 25640 Cannot open or create pidfile: No such file or directory
Additional problem
I have not looked into it so I don't know if it is related or I missunderstood it. But if I have problem described above and I will go to Status->Interfaces to Renew connection on WAN pfsense will tell me that now it has connection but my computer on LAN won't be able to connect to Internet. I have to (for example) go to routing setting and save to make connection fully work for my computer.WAN and routing configuration
I put below what is not default in case it is needed.- Disable Gateway Monitoring Action is checked but unchecking it did not solve anything
- WAN interface configuration (increasing Timeout did not help)
-
I guess that there is no easy solution for this timing/race problem - somebody would provide it already :). I could try to look into it but I would gladly learn which files or scripts should interest me. That would help a lot - so I don't start to search by looking everywhere. Or some pointers regarding scripts logic when link gets up or down.
-
@tomashk said in dhclient exiting on WAN:
But if I have problem described above and I will go to Status->Interfaces to Renew connection on WAN pfsense will tell me that now it has connection but
At that moment, and before your PC on LAN has a connection, does pfSEnse have a connection ?
Go to the console, option 8 and ping a remote IP like 8.8.8.8 - if ok, it's up.
ping dns.google - if ok, dns is also working.Check the man file of dhclient, what these time settings mean.
I tend to say : use Backoff ( so when dhclient start, it will wait a bit) or use the Initial interval ?When a hardware event occurs, hardware management (devd ?) will start to signal processes that use that hardware. Like, for example, interface igb0 will trigger a dhclient startup.
The thing is, this parts works. But you have a modem that will then bring the interface down, up down up down up. dhclient looses track.
And it gets even better : a WAN link event triggers also the restart of many other processes, which will affect the system's reactivity, as the processor has a lot to do.@tomashk said in dhclient exiting on WAN:
it will take around 50s to first link up
but after 22s it will get link down
to get final link up after 3sKnown. Modems tend to do these things. There are even reasons about the 'why' part.
@tomashk said in dhclient exiting on WAN:
Apr 10 11:44:44 dhclient 25640 Cannot open or create pidfile: No such file or directory
Known bug without much consequences.
dhclient wants to store the pid file on a place that doesn't exist.
It can be corrected if needed but won't solve the issue.@tomashk said in dhclient exiting on WAN:
servicewatchdog_cron
There are some really rare cases where this tool can be helpful, but if you are looking for the issue, it will create just more issues.
Some possible things to test :
Try several timing settings shown above.
Put a switch between your modem and pfSense.
Locate the place where dhclient is started, and add a 'wait xx seconds' instruction. -
Thank you very much for suggestions. I'll start analysis after work and check what works as quick fix for me and if I have time in the next few days I'll try to go deeper, but in that case I will try to at least understand why it is broken for me because it might help others. But I guess first I have to read suggested sources and only after that think about fixing anything :)
-
- dhclient problem:
- Timing parameters (backoff, etc.) doesn't work in a few configurations I tried. Probably because dhclient is no able to start. The only line with dhclient in logs is:
rc.bootup: The command '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb0 > /tmp/igb0_output 2> /tmp/igb0_error_output' returned exit code '1', the output was ''
and contents of /tmp/igb0_output is
igb0: no link .............. giving up
- I tried to wait but without my intervention (renew button in interfaces status page) dhclient is never called in my scenario
- Now I remember that I had almost the same problem many years ago but at some point it just started to work. Maybe something changed in pfsense or maybe modem timings changed.
- "Additional" problem
- Still when I renew WAN dhclient is called and I'm getting WAN IP. But all devices on LAN's side doesn't get Internet access. It is strange because pfsense shows IP for WAN and dpinger (after forcing its start) provide stats for gateway widget
- As suggested I connected with putty to pfsense and tried pinging google. Both IP and domain complained about connection. Especially interesting was pinging of IP address because it returned "ping: sendto: No route to host" error.
- After that I looked to Diagnostics / Routes and it looked properly until I compared it with properly working situation and I have noticed that default route is not there. I mentioned that I have to save for example routing configuration to have working internet connection. The change that happens at that point is that default route is added to routing table and with that internet works.
So that was short update in case somebody sees something interesting. Thanks to previous suggestions now I have some new information to work with.
-
Is WAN_DHCP set as the default gateway in System > Routing?
Do you have more than one IPv4 gateway shown there?
-
@stephenw10
Now it is like that:
Previously it has been set as Automatic but I have just switched it to WAN_DHPC and it behaves exactly the same (or at least I don't see any changes). Only after pressing "Save" on System/Routing/Gateways page I got default route.Also as you can see IPv6 gateway is set to none if it matters but I don't use IPv6 so I don't think that it means anything.
-
Also I looked in the logs to see how it behaves when I used renew button for interface (scenario 1) to get IP (when dhclient did not work for me on the start of modem) and how it behaves when I get properly working internet and I just unplug ethernet cable from modem and plug it back (scenario 2).
- scenario 1
- scenario 2
As we can see first one doesn't add default route. Unfortunately I have not analysed it enough to know if it is bad or not. I just show my observations
- scenario 1
-
Hmm, so in scenario 1 you end up without a default route even though you do have a gateway?
-
@stephenw10
Yes. Now with working internet I have situation from the picture below. With scenario 1 I don't have the first line starting with default
I will test a little bit more soon because I want to rule any situation that those problems are caused by my mistakes with configuration. I believe those problems started with ver 2.6/22.01 but I can't be sure - I haven't noticed it previously because I had long uptime and there was no need for turning it off.
-
Mmm, there are a number of threads reporting similar issues. It look to be real IMO but it's not something we can easily replicate. It seems to be timing related.
Steve
-
Scenario 1: from bottom to top :
Scenario 2: from bottom to top :
So, if scenario 2 is what should happen = adding the 'default' route, it is not logged and thus not happening in scenario 1.
Maybe the IP was the same so the "add route default" wasn't needed ?
As far as know, dhclient uses shell scripts to do this. I'll have a look into that this weekend. -
Sorry for late response - I was away for a few days.
@gertjan said in dhclient exiting on WAN:
Maybe the IP was the same so the "add route default" wasn't needed ?
Probably IP was the same. Unfortunately I've lost those logs but currently whenever I renew I get the same WAN IP.
PS. For test I installed pfsense on another device and restored my configuration. Of course it works properly every time. Probably it will take some time to find how it differs when compared with "wrong logs". The main difference I see from the start is that when it works (on new installation) dhclient is able to start properly. So there is nothing like
rc.bootup: The command '/sbin/dhclient -c /var/etc/dhclient_wan.conf igb0 > /tmp/igb0_output 2> /tmp/igb0_error_output' returned exit code '1', the output was ''
in the logs. And later I can see standard DHCPDISCOVER (or others) or dhclient's messages about link state like
igb0 link state up -> down igb0 link state down -> up
I guess I'll try to find out what might stop dhclient from starting and throwing
igb0: no link .............. giving up
into /tmp/igb0_output. Of course I'll try to find other differences, but this one looks like the main one.
-
@tomashk said in dhclient exiting on WAN:
I guess I'll try to find out what might stop dhclient from starting and throwing
igb0: no link .............. giving up
into /tmp/igb0_output
You've answered your own question.
The dhclient aborts when there is no WAN interface any more.
It stops executing, as the interface WAN isn't listed as a system resource any more.Your issue isn't dhclient , that does exacly what it is designed to do.
Focus on the WAN interface : make it stopping going down - up - down - up - etc.
The thing is, you know already who brings the interface down : the other device in front of it.
Modems do so very often. It's their way to 'signal' that they created a "soft link" with the upstream ISP.
Or : this scenario : it powers up. The internal web GUI comes up, and the modem brings up it's LAN == the wan of pfSense. When the dhclient starts now, and send a DHCPDISCOVER, it will obtain an RFC1918 from the DHCP server of the modem. This RFC1918 will permit you to connect to the GUI of the modem to change it settings.
Moments later, the modem finished building the soft link with the ISP.
It brings down it's LAN, as it has to as this is the only way to reset the current connection, and brings it again. Now, when the dhclient of pfSense send a DHCPDISCOVER, the modem will be transparent, and the DHCPDISCOVER will reach the ISP. This time, the DHCP server of the ISP will reply, and give you a 'real' WAN IP.
This is how most modems work. For decades now.Things become even more interesting if initially the modem has a hard time syncing with the ISP. Every time it looses the connection, because, for example, the initial speed was to high, it aborts, (bring its LAN down), and tries a connection with a lower speed.
dhclient, on the pfSense side, looses track.
Your mission, as a modem user, is to find the right timings settings.
You know it('s possible, because most often, when you hook up a PC to your modem and use that pppoe connection setting, your PC will connect.
Your PC also uses a dhcp client like software. -
@tomashk said in dhclient exiting on WAN:
For test I installed pfsense on another device and restored my configuration. Of course it works properly every time.
What's the difference between those devices?
The is probably a timing issue. If the upstream device resets the link it may only be down for a short time and that must coincide with the dhclient script running in order to fail.
Running on hardware that is significantly faster or slower could well change the timing enough to prevent it happening.It should not be possible to hit that but it may also be easy to work around by adding a boot delay in pfSense.
Steve
-
@stephenw10 The one failing is based on small Asrock motherboard with Intel Celeron J4005 and working one is hp t620 plus thin client. I believe the second one is taking longer to start.
As you said this is probably a timing issue. The idea with boot delay sounds like good quick fix once I'll try to go back to J4005. For now I'll have to leave it because work doesn't leave me much time to experiment. For a moment I have that temporary new device which works. But in a week or two I am going to get back to it and try to understand it better.
By understand I mean for example that long time ago I looked into https://github.com/pfsense/FreeBSD-src/blob/devel-12/sbin/dhclient/dhclient.c (the file is important - I just took random branch from github as example) and I found the code that gives up (line 469)
Due to value 10 in line 468 and sleep from line 472 dhlient will try to wait for WAN interface only for 10 seconds. I was thinking that maybe it would be good to change value 10 to environment variable (with default value 10) so user can set this, but there are two problems with my current poor understanding:
- I'm not able to predict what would be broken in completely different part of the system if dhclient starts too long (maybe nothing but maybe some kind of the butterfly effect :) )
- there might already exist some system script that should be used for such situations
-
Mmm, I agree increasing that initially seems to be a good idea but could easily cause problems on systems with multiple disconnected WANs.
It would be nice to have that value configurable.
As a simple test you could just set the line:
autoboot_delay="30"
In /boot/loader.conf. That's normally 3s. If that works you might put it in loader.conf.local instead to prevent it getting overwritten.
Steve
-
Quick update. Because I wanted to test the release candidate for 22.05 I took the old device, that had problems described above. After updating to the RC version it still existed. But I used suggested autoboot_delay and it fixed this timing problem. For me 10 seconds were enough to make it work.
-
Anyone still hitting this who can test there is a patch now available here:
https://redmine.pfsense.org/issues/13671#note-4
We need feedback there to confirm it.Steve