pfSense loses connection every 28-30 days.

jacksnack2

pfSense 2.4.4-RELEASE-p2 (amd64)
modem: Arris TM822G
Personal computer build.
Armstrong Networks, Mercer PA.
This is a remote site.
Initial install: December 2018.
Running BIND package.
WAN is DHCP
Running Dynamic DNS

Hello,

I have a recurring issue I cannot resolve. Every 28-30 days the pfSense box loses a connection.

The first several times this happened I was able to access the box via ssh. However, the last thing I tried was to disable dpinger. Yesterday (after another drop) I lost an IP address. (I assume because dpinger was disabled?)

I initially changed the forwarders to Google's DNS servers, thinking this was a DNS issue. It now appears it is not, as last night I lost an IP address and had a remote user physically reboot the router.

The first several times this happened I was able to ssh into the box. It appears the subnets behind the pfSense box were not able to get DNS resolution. The pfSense box itself was able to ping external machines (Yahoo).

However, as mentioned above, I don't think DNS is the issue, as yesterday the IP was lost.

Because this is cyclical (every 28-30 days) I would think this is either:
-ISP IP renewal/polling of sorts
-A service running on the pfSense box that runs every 28-30 days, causing a conflict

Thoughts on how I can test this (without waiting 28-30 days)?

I'm out of options (and can't afford commercial support).

Any help is greatly appreciated!

johnpoz

@jacksnack2 said in pfSense loses connection every 28-30 days.:

pfSense box loses a connection.

So you mean it loses the lease on its IP, the IP changes, and your DDNS no longer points to your current WAN IP?

What do the logs say when this happens... You say when it happened before you could ssh to it? But not access the gui? If you have ssh open to the internet, you can always tunnel down the ssh connection to hit the gui.

How is it that clients behind it cannot resolve but pfSense can? Are you pointing pfSense to something other than itself? Maybe your BIND just went offline, if that is what is providing your users' DNS?

stephenw10

@jacksnack2 said in pfSense loses connection every 28-30 days.:

I lost an IP address

Yes. What exactly do you mean by that?

Steve

jacksnack2

Sorry, let me clarify.

The WAN does not show an IP address assigned via DHCP from the ISP.

More information, even though "Disable Gateway Monitoring Daemon" is checked under the WAN gateway, I still show this from yesterday from the gateway.log:

Mar 31 14:50:39 hostanme dpinger: WAN_DHCP GATEWAY_IP: Alarm latency 26705us stddev 25984us loss 21%
Mar 31 14:52:28 hostanme dpinger: WAN_DHCP GATEWAY_IP: Clear latency 22320us stddev 21882us loss 14%
Mar 31 15:45:31 hostanme dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr GATEWAY_IP bind_addr HOST_IP identifier "WAN_DHCP "
Mar 31 15:45:31 hostanme dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 172.16.110.1 bind_addr 172.16.110.1 identifier "LAN101VPN3JEAN_VPNV4 "
Mar 31 15:45:31 hostanme dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr GATEWAY_IP bind_addr HOST_IP identifier "WAN_DHCP "
Mar 31 15:45:31 hostanme dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 172.16.110.1 bind_addr 172.16.110.1 identifier "LAN101VPN3JEAN_VPNV4 "
Mar 31 15:45:31 hostanme dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr GATEWAY_IP bind_addr HOST_IP identifier "WAN_DHCP "
Mar 31 15:45:31 hostanme dpinger: send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 172.16.110.1 bind_addr 172.16.110.1 identifier "LAN101VPN3JEAN_VPNV4 "

Derelict

Those logs are from March 31.

jacksnack2

Copied too quickly; indeed they are.

stephenw10

I would check the dhcp logs when this happens. What is dhclient doing on WAN when it fails?

Steve

jacksnack2

Looks like dhclient is having issues.

Apr 30 16:52:54 hostanme dhcpd: DHCPREQUEST for 172.16.21.3 from 00:1b:a9:a9:7f:eb via em1
Apr 30 16:52:54 hostanme dhcpd: DHCPACK on 172.16.21.3 to 00:1b:a9:a9:7f:eb via em1
Apr 30 17:04:10 hostanme dhcpd: DHCPREQUEST for 172.16.11.2 from 28:18:78:7d:5e:e1 via em0
Apr 30 17:04:10 hostanme dhcpd: DHCPACK on 172.16.11.2 to 28:18:78:7d:5e:e1 via em0
Apr 30 17:04:26 hostanme dhcpd: DHCPREQUEST for 172.16.21.114 from 3c:5c:c4:e2:04:4d (amazon-51f9125fe) via em1
Apr 30 17:04:26 hostanme dhcpd: DHCPACK on 172.16.21.114 to 3c:5c:c4:e2:04:4d (amazon-51f9125fe) via em1
Apr 30 17:07:55 hostanme dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 via em1
Apr 30 17:07:56 hostanme dhcpd: DHCPOFFER on 172.16.21.116 to 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:08:00 hostanme dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:08:00 hostanme dhcpd: DHCPOFFER on 172.16.21.116 to 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:08:05 hostanme dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:08:05 hostanme dhcpd: DHCPOFFER on 172.16.21.116 to 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:08:05 hostanme dhcpd: DHCPREQUEST for 172.16.21.116 (172.16.21.1) from 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:08:05 hostanme dhcpd: DHCPACK on 172.16.21.116 to 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1
Apr 30 17:12:29 hostanme dhcpd: DHCPDISCOVER from 38:59:f9:b9:9c:2a via em1
Apr 30 17:12:30 hostanme dhcpd: DHCPOFFER on 172.16.21.105 to 38:59:f9:b9:9c:2a (Hellstrom) via em1
Apr 30 17:12:30 hostanme dhcpd: DHCPREQUEST for 172.16.21.105 (172.16.21.1) from 38:59:f9:b9:9c:2a (Hellstrom) via em1
Apr 30 17:12:30 hostanme dhcpd: DHCPACK on 172.16.21.105 to 38:59:f9:b9:9c:2a (Hellstrom) via em1
Apr 30 17:12:34 hostanme dhcpd: DHCPINFORM from 172.16.21.105 via em1
Apr 30 17:12:34 hostanme dhcpd: DHCPACK to 172.16.21.105 (38:59:f9:b9:9c:2a) via em1
Apr 30 17:13:49 hostanme dhcpd: DHCPINFORM from 172.16.21.105 via em1
Apr 30 17:13:49 hostanme dhcpd: DHCPACK to 172.16.21.105 (38:59:f9:b9:9c:2a) via em1
Apr 30 17:15:24 hostanme dhcpd: DHCPREQUEST for 172.16.31.2 from 50:e5:49:3d:8a:37 via em3
Apr 30 17:15:24 hostanme dhcpd: DHCPACK on 172.16.31.2 to 50:e5:49:3d:8a:37 via em3
Apr 30 17:15:58 hostanme dhcpd: DHCPINFORM from 172.16.21.105 via em1
Apr 30 17:15:58 hostanme dhcpd: DHCPACK to 172.16.21.105 (38:59:f9:b9:9c:2a) via em1
Apr 30 17:17:06 hostanme dhcpd: DHCPREQUEST for 172.16.21.107 from 54:ae:27:26:7f:ca (iPad) via em1
Apr 30 17:17:06 hostanme dhcpd: DHCPACK on 172.16.21.107 to 54:ae:27:26:7f:ca (iPad) via em1
Apr 30 17:18:00 hostanme dhcpd: DHCPINFORM from 172.16.21.105 via em1
Apr 30 17:18:00 hostanme dhcpd: DHCPACK to 172.16.21.105 (38:59:f9:b9:9c:2a) via em1
Apr 30 17:18:31 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:33 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:35 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:39 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:50 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:19:16 hostanme dhcpd: DHCPINFORM from 172.16.21.105 via em1
Apr 30 17:19:16 hostanme dhcpd: DHCPACK to 172.16.21.105 (38:59:f9:b9:9c:2a) via em1
Apr 30 17:19:28 hostanme dhclient[8918]: send_packet: No route to host

stephenw10

Hmm, OK and that's it from dhclient?

Do you have to reboot to fix this? Resave the WAN? Replug the cable?

Check the system log for any interface events at that time. "No route to host" implies the interface may have gone down entirely.

Steve

jacksnack2

Yesterday the remote user had to physically restart the modem. An IP was then obtained.
Other times, if I can ssh in, I simply restart the pfSense box (and DNS seems to start serving the subnets again). See original post.

Yes, that is it from dhclient.
A 'good' entry would be something like this, I assume:
Apr 30 22:32:08 host dhclient: RENEW
Apr 30 22:32:08 host dhclient: Creating resolv.conf

system.log shows nothing regarding links during the time in question.
Apr 30 15:58:40 host sshd[30847]: Bad protocol version identification '' from 125.64.94.212 port 33564
Apr 30 15:58:40 host sshguard[5379]: Attack from "125.64.94.212" on service 100 with danger 10.
Apr 30 15:58:40 host sshd[31019]: Bad protocol version identification '\026\003\001' from 125.64.94.212 port 35787
Apr 30 15:58:40 host sshguard[5379]: Attack from "125.64.94.212" on service 100 with danger 10.
Apr 30 18:12:00 host syslogd: exiting on signal 15

stephenw10

Hmm, well, both restarting the modem and restarting the pfSense box re-negotiate the Ethernet link there. You might try just pulling the link and reconnecting it.
Otherwise I think you'd have to get connected from the LAN side and see what is really happening, what works, what doesn't work.

Steve

jacksnack2

Thanks Stephen.

It is my experience that NICs rarely fail. Apart from a hardware problem, do you have any ideas what is causing this issue every 28-30 days?

jacksnack2

@johnpoz Thanks for the input.

I originally blamed BIND as well, but with the complete loss of the IP lease yesterday, I assumed other factors are at work.

"How is it that clients behind it cannot resolve but pfSense can? Are you pointing pfSense to something other than itself? Maybe your BIND just went offline, if that is what is providing your users' DNS?"

I am assuming this has something to do with BIND. But more importantly, it appears to be a result of the DHCP lease in some way. I only disabled dpinger 30 days ago. I assume dpinger worked its magic to restore the IP address, but routing tables were hosed in the process.

Yesterday dpinger had been disabled already, so I assume this is why the IP collapsed completely.

Gertjan

@jacksnack2 said in pfSense loses connection every 28-30 days.:

Looks like dhclient is having issues.

You gave a list with mostly DHCP server logs, from the process that hands out IPs on your LAN.

These are the lines from the DHCP client:

Apr 30 17:18:31 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:33 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:35 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:39 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:18:50 hostanme dhclient[8918]: send_packet: No route to host
Apr 30 17:19:28 hostanme dhclient[8918]: send_packet: No route to host

Conclusion: the WAN interface is down (it doesn't exist as far as dhclient is concerned).

Edit: by the way, why does your LAN client "PHN-LPC06TU0C" have to repeat its DHCP request (DISCOVER) several times? pfSense received it and replied with an OFFER, yet the client waited about 5 seconds and sent out another DISCOVER.
After several tries the LAN client finally accepts (receives?) the OFFER and sends a REQUEST, which pfSense acknowledges with an ACK.
A set of bad network cables? Network overload? VLAN issues? A bad wifi connection?
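
One way to spot clients that keep retrying is to count DISCOVER messages per MAC address in the dhcpd log. The following is only a minimal sketch: the function name `discover_counts` and the `SAMPLE` list are made up for illustration, and the regular expression assumes the dhcpd line format shown in the logs posted above.

```python
import re
from collections import Counter

# Assumes dhcpd lines shaped like the ones posted in this thread, e.g.
# "... dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1"
DISCOVER_RE = re.compile(
    r"dhcpd: DHCPDISCOVER from (?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})"
)

def discover_counts(lines):
    """Count DHCPDISCOVER messages per client MAC address."""
    counts = Counter()
    for line in lines:
        m = DISCOVER_RE.search(line)
        if m:
            counts[m.group("mac")] += 1
    return counts

# Lines copied from the log excerpt earlier in the thread.
SAMPLE = [
    "Apr 30 17:07:55 hostanme dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 via em1",
    "Apr 30 17:08:00 hostanme dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1",
    "Apr 30 17:08:05 hostanme dhcpd: DHCPDISCOVER from 94:65:9c:22:7c:70 (PHN-LPC06TU0C) via em1",
    "Apr 30 17:12:29 hostanme dhcpd: DHCPDISCOVER from 38:59:f9:b9:9c:2a via em1",
]

if __name__ == "__main__":
    for mac, n in discover_counts(SAMPLE).most_common():
        print(mac, n)
```

A client that shows several DISCOVERs within a few seconds of each other, like the first MAC here, is probably not receiving the OFFERs reliably.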

Derelict

Look at the DHCP logs and filter on process dhclient and post what's there.

The interval from before it fails to after it recovers would be most telling.
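
If the log has been copied off the box, a small script can do the same filtering. This is a rough sketch, not a pfSense tool: `filter_process` and `SAMPLE` are hypothetical names, the year is an assumption (syslog lines carry none), and the pattern assumes the line layout seen in the excerpts above and English month abbreviations.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
# "Apr 30 17:18:31 hostanme dhclient[8918]: send_packet: No route to host"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"(?P<proc>[\w.-]+)(?:\[\d+\])?: (?P<msg>.*)$"
)

def filter_process(lines, process="dhclient", start=None, end=None, year=2019):
    """Yield (timestamp, message) for one process, optionally inside a time window.

    The year is an assumption supplied by the caller, since syslog omits it.
    """
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("proc") != process:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        if (start and ts < start) or (end and ts > end):
            continue
        yield ts, m.group("msg")

# Lines copied from the excerpts posted earlier in the thread.
SAMPLE = [
    "Apr 30 17:18:00 hostanme dhcpd: DHCPACK to 172.16.21.105 (38:59:f9:b9:9c:2a) via em1",
    "Apr 30 17:18:31 hostanme dhclient[8918]: send_packet: No route to host",
    "Apr 30 22:32:08 host dhclient: RENEW",
]

if __name__ == "__main__":
    for ts, msg in filter_process(SAMPLE):
        print(ts, msg)
```

Passing `start` and `end` as `datetime` values narrows the output to the interval around the failure and the recovery.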

My hunch is that the WAN port is asking for a renewal and not getting it, then reverting to a full DHCP request and not getting that either. Then the lease expires and it just stops working, which would probably be an ISP/modem problem. All pfSense can do is ask for a renewal. The server has to respond to it.
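
To get a feel for when those steps would happen, the standard DHCP timers can be worked out from the lease length. The sketch below uses the RFC 2131 defaults (renewal at 50% of the lease, rebinding at 87.5%); a server may override both with options 58 and 59. The lease length and acquisition time are placeholders, since the ISP's real values are not known in this thread, and `lease_timeline` is a made-up helper name.

```python
from datetime import datetime, timedelta

def lease_timeline(acquired, lease_seconds, t1_frac=0.5, t2_frac=0.875):
    """Return when a DHCP client would renew, rebind and finally lose its lease.

    t1_frac and t2_frac are the RFC 2131 defaults; the server may send
    different values, so treat the result as an estimate.
    """
    lease = timedelta(seconds=lease_seconds)
    return {
        "renew (T1)": acquired + lease * t1_frac,
        "rebind (T2)": acquired + lease * t2_frac,
        "expire": acquired + lease,
    }

if __name__ == "__main__":
    # Hypothetical values: a 24-hour lease obtained at midnight.
    timeline = lease_timeline(datetime(2019, 4, 30, 0, 0, 0), 86400)
    for name, when in timeline.items():
        print(f"{name:12} {when}")
```

Comparing these estimated times with the dhclient entries in the log shows whether the failure lines up with a renewal attempt or with the lease expiring.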

Something else you might want to do is just start a Diagnostics > Packet Capture on WAN for UDP 67, set it for something like 1000000 packets and just let it run. Stop it after it fails. Even better would be to try something like disconnect/reconnect ethernet or restart the modem to see if you can get a capture including a recovery too.

jacksnack2

@Derelict I have attached filtered DHCP logs as you suggested: dhcp.log.filtered.txt

Thanks all for your help.

Derelict

Please use wireshark to filter what you want to show and upload the actual pcap.

jacksnack2

Thanks again all,

I have enabled Wireshark and will report back when more information is available.

I should also note that the previous router, a Netgear, did not seem to have these issues. I replaced it with pfSense because the pfSense box sports faster interfaces and more functionality.

johnpoz

@jacksnack2 said in pfSense loses connection every 28-30 days.:

I have enabled Wireshark and will report back when more information is available.

What dude - just download the pcap from pfSense into Wireshark, filter out what you don't want to show, save the pcap and upload it.

jacksnack2

Sorry, I don't know what "What dude" means.

I understand how to use Wireshark.

Derelict advised to "Stop it after it fails". It may take weeks before another failure event takes place.

Until then Wireshark will show normal traffic. I doubt this is of any use.