Back to odd problem -- lose WAN at random points with a week or more between events
-
@Wylbur said in Back to odd problem -- lose WAN at random points with a week or more between events:
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 38 fc 03 40 09 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 f8 17 e9 40 08 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 e0 19 ab 40 0a 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 2a 28 40 00 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 2c cf 40 1d 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 2e cf 40 1d 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 30 40 fc 03 40 09 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 18 70 fc 03 40 09 00 00 00 00 00
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed
Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
Mar 10 09:08:42 syslogd kernel boot file is /boot/kernel/kernel
Mar 10 09:08:42 kernel ---<<BOOT>>---
These errors indicate a failing disk drive (whether it's an SSD or an old spinning disk, it is failing).
You need to be sure you have a backup of the firewall configuration on separate media (such as a USB stick), then replace the failing drive and reinstall pfSense from an install image, restoring your config during the install process.
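For a warranty claim it can help to boil the log down to a per-device count rather than pasting pages of output. Here is a minimal Python sketch for lines in the shape pasted above. The regular expression and the function name `summarize_cam_errors` are assumptions made for illustration, not anything pfSense ships, and the pattern only covers the line format shown in this thread.

```python
import re
from collections import Counter

# Assumed line shape, copied from the excerpt in this thread:
#   Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error
CAM_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+kernel\s+"
    r"\((?P<dev>[a-z]+\d+):[^)]*\):\s+(?P<msg>.*)$"
)


def summarize_cam_errors(lines):
    """Count failed queued commands and unretryable errors per device."""
    commands = Counter()      # keyed by (device, command name)
    unretryable = Counter()   # keyed by device
    for line in lines:
        m = CAM_LINE.match(line.strip())
        if not m:
            continue  # not a CAM line (for example the boot marker)
        dev, msg = m.group("dev"), m.group("msg")
        if "FPDMA_QUEUED" in msg:
            # "WRITE_FPDMA_QUEUED. ACB: ..." -> "WRITE_FPDMA_QUEUED"
            commands[(dev, msg.split(".", 1)[0])] += 1
        elif msg.startswith("Error 5"):
            unretryable[dev] += 1
    return commands, unretryable


# A few lines taken from the excerpt above.
sample = [
    "Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 08 38 fc 03 40 09 00 00 00 00 00",
    "Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): CAM status: Auto-Sense Retrieval Failed",
    "Mar 10 01:55:17 kernel (ada1:ahcich1:0:0:0): Error 5, Unretryable error",
    "Mar 10 09:08:42 syslogd kernel boot file is /boot/kernel/kernel",
]
cmds, errs = summarize_cam_errors(sample)
for dev, count in errs.items():
    print(f"{dev}: {count} unretryable error(s)")
for (dev, op), count in cmds.items():
    print(f"{dev}: {count} failed {op}")
```

Every error line in the excerpt above names ada1, so a summary like this points at one specific drive, which is the thing to show the seller.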
-
@Wylbur I think you need to have a look in the DHCP log and see if the issue arises when dhclient (the WAN DHCP client) tries to renew the DHCP lease. Some ISPs are quite picky about other hardware on their infrastructure and require a fairly strict DHCP client configuration.
You know that DISCOVER/OFFER/REQUEST/ACK works (obtaining a new DHCP lease), but does renewal of an existing lease work too?
Thank you for your input. But, I've already done that. This is what the WAN port is running with. The LAN port is whatever the MOBO has and I never seem to have problems with that port. The weirdness is, this MOBO will not accept connections on both the Intel ports of the dual Intel port ethernet adapter that I'm using on this machine. Yet when I ran that adapter in another machine, both ports were usable so one was for WAN and the other for LAN.
But we have to ask this question: Why was I able to run for months on end when using Realtek ports with Untangle or IPFire with this ISP? That is the thing that puzzles me.
Now, if I were a "c" (or assembly language) programmer and knew the x86 architecture as well as I do z/Architecture machines (IBM Mainframes), I could probably code a trap to capture this failure and know why it was happening. Or I could run a trace of it that we could examine once it failed. But I don't know this architecture at that level. So unfortunately, I'm more of a knowledgeable user that knows enough to just be dangerous.
Wylbur
-
I see no failures that indicate a problem with renewal of lease with the ISP. What I do see are some changes where the fiber optic modem may get its IPv4 IP address changed and then the WAN is given a new IPv4 address. And then some 8.8.8.8 pings take place and some latency is noted.
Now and then I see alerts for latency with the ISP against 8.8.8.8 and the system recovers.
What should I be looking for in the logs that would show me I have the problem you are suspecting? I've been scanning logs off and on for weeks looking for anomalies that would tell me something. Meanwhile, on these latency issues, we know that the ISP has the ability to run Gigabit connections. What we have is 200/200 Mbps. And I generally have no stuttering within my LAN with this, even with multiple devices streaming. I am constantly listening to European radio via TuneIn (old iPhone), which tells me pretty quickly if I've just lost connections.
Wylbur
-
@Wylbur You would know from the logs if renewal was failing, because the logs would fill with a lot of renew attempts (with an increasing timer). So that's not the root of your problem.
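If you want to check this without eyeballing the log, a small script can look for runs of renew requests that never get an acknowledgement. This is only a sketch: it assumes the DHCP log contains dhclient messages beginning `DHCPREQUEST on <interface>` and `DHCPACK from <server>`, and the sample lines in it are made up for illustration, not taken from Wylbur's logs.

```python
import re

# Assumed message shapes; adjust the patterns if your log differs.
REQUEST = re.compile(r"\bDHCPREQUEST on (?P<ifname>\S+)")
ACK = re.compile(r"\bDHCPACK from (?P<server>\S+)")


def longest_unanswered_run(lines):
    """Return the longest run of DHCPREQUEST lines with no DHCPACK in between."""
    longest = 0
    current = 0
    for line in lines:
        if REQUEST.search(line):
            current += 1
            longest = max(longest, current)
        elif ACK.search(line):
            current = 0  # the request was answered, so the run ends
    return longest


# Hypothetical log excerpt (interface name and addresses are invented).
sample = [
    "dhclient: DHCPREQUEST on em0 to 192.0.2.1 port 67",
    "dhclient: DHCPACK from 192.0.2.1",
    "dhclient: DHCPREQUEST on em0 to 192.0.2.1 port 67",
    "dhclient: DHCPREQUEST on em0 to 192.0.2.1 port 67",
    "dhclient: DHCPREQUEST on em0 to 255.255.255.255 port 67",
]
print("longest run of unanswered requests:", longest_unanswered_run(sample))
```

A healthy log should show runs of one or two at most. A long run of requests with no acknowledgement is the pattern described above for a failing renewal.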
-
This is rather disconcerting for a refurbished machine that is less than 30 days old.
I had to put in a second SSD because the system would not install for some reason. So that has me wondering if the refurbisher didn't detect a bad HDD.
I really hate to pull this right now and run Knoppix to do diagnostics, because it takes me about 30 minutes to get the Intel Ethernet adapter out of this box and into the backup unit so I can start that whole process.
Any thoughts on what diagnostics I can run with pfSense in order to capture this?
I had thought this was related to the time change since it happened right about that time. But this clock should be GMT/UTC, so only the offset would/should have changed.
Since this box is under warranty, I would like to be able to demonstrate this to the entity where I got it.
Wylbur
-
@Wylbur said in Back to odd problem -- lose WAN at random points with a week or more between events:
Mar 10 09:08:42 kernel hdacc0: <Realtek ALC221 HDA CODEC> at cad 0 on hdac0
Mar 10 09:08:42 kernel hdaa0: <Realtek ALC221 Audio Function Group> at nid 1 on hdacc0
Mar 10 09:08:42 kernel pcm0: <Realtek ALC221 (Analog)> at nid 23 and 26,27 on hdaa0
Mar 10 09:08:42 kernel pcm1: <Realtek ALC221 (Analog 2.0+HP)> at nid 20,33 on hdaa0
Mar 10 09:08:42 kernel hdacc1: <Intel Skylake HDA CODEC> at cad 2 on hdac0
Mar 10 09:08:42 kernel hdaa1: <Intel Skylake Audio Function Group> at nid 1 on hdacc1
Mar 10 09:08:42 kernel pcm2: <Intel Skylake (HDMI/DP 8ch)> at nid 3 on hdaa1
Try disabling all that in the BIOS, along with anything else you're not using there. Some of those things could be conflicting with the add-on NIC, preventing it from being detected.
-
@Wylbur said in Back to odd problem -- lose WAN at random points with a week or more between events:
Thank you for your input. But, I've already done that. This is what the WAN port is running with. The LAN port is whatever the MOBO has and I never seem to have problems with that port. The weirdness is, this MOBO will not accept connections on both the Intel ports of the dual Intel port ethernet adapter that I'm using on this machine. Yet when I ran that adapter in another machine, both ports were usable so one was for WAN and the other for LAN.
But we have to ask this question: Why was I able to run for months on end when using Realtek ports with Untangle or IPFire with this ISP? That is the thing that puzzles me.
If every ISP used Juniper, Extreme (or other not-so-bad) hardware at the aggregation level, and every user used Intel, Mellanox (or other bug-free) hardware with well-written and tested drivers, none of us would have puzzles like this anymore.
So “catch, fix and forget” is the best strategy in this mixed-hardware world. ;)
Now, if I were a "c" (or assembly language) programmer and knew the x86 architecture as well as I do z/Architecture machines (IBM Mainframes), I could probably code a trap to capture this failure and know why it was happening. Or I could run a trace of it that we could examine once it failed. But I don't know this architecture at that level. So unfortunately, I'm more of a knowledgeable user that knows enough to just be dangerous.
Just replace the SSD, choose NICs that the ISP recommends as working well WITH ITS APPLIANCE, make backups regularly (both config.xml and ZFS snapshots), and be happy until the next device upgrade/change.
-
@Wylbur said in Back to odd problem -- lose WAN at random points with a week or more between events:
I see no failures that indicate a problem with renewal of lease with the ISP. What I do see are some changes where the fiber optic modem may get its IPv4 IP address changed and then the WAN is given a new IPv4 address. And then some 8.8.8.8 pings take place and some latency is noted.
Why exactly does the IP on the “fiber optic modem” change?
As far as I know, this is a very rare situation on fiber networks in Europe. What is this device exactly? (Manufacturer and model)
Meanwhile on these latency issues, we know that the ISP has the ability to run Gigabit connections. What we have is 200/200 Mbps.
Which country are you in, and which ISP is it?
-
@keyser said in Back to odd problem -- lose WAN at random points with a week or more between events:
@Wylbur You would know from the logs if renewal was failing, because the logs would fill with a lot of renew attempts (with an increasing timer). So that's not the root of your problem.
Agree!!!
-
@Wylbur said in Back to odd problem -- lose WAN at random points with a week or more between events:
This is rather disconcerting for a refurbished machine that is less than 30 days old.
I had to put in a second SSD because the system would not install for some reason. So that has me wondering if the refurbisher didn't detect a bad HDD.
Since this box is under warranty, I would like to be able to demonstrate this to the entity where I got it.
Save your time: don't spend it on “demonstrations”. Return the box and buy something more powerful, or something from a well-known and reputable brand.
-
I captured a packet trace when I ran into another loss of the system (DHCP working just fine, no ISP access). Unfortunately, I lost that text file. The funny thing is, it appeared that packets were passing through the WAN. So it seems that something causes communications to fail, which is why a reboot clears the issue.
Meanwhile, I have to find a point where I can take the system down and come back up on the backup machine while I figure out how to make the BIOS changes. Hopefully this will be simple and not get blocked by the built-in security, so I can make the changes to the BIOS.
Wylbur.
-
I am in the USA. The ISP is a company called Metronet. The fiber optic interface system is by Nokia, and is an Intertek unit.
Metronet, Spectrum, AT&T, Comcast, etc. all change your IP address whenever they feel like it, so you can't have a static address and host a web site unless you pay them for a static address.
-
@Wylbur said in Back to odd problem -- lose WAN at random points with a week or more between events:
I am in the USA. The ISP is a company called Metronet. The fiber optic interface system is by Nokia, and is an Intertek unit.
OK, thanks. Just wanted to know.
Metronet, Spectrum, AT&T, Comcast, etc. all change your IP address whenever they feel like it, so you can't have a static address and host a web site unless you pay them for a static address.
Does DynDNS (or any other such service) give you the ability to have remote access?
-
I have swapped systems so that the backup is running and the new system is out for me to change BIOS settings.
So with the new machine that had the log errors below, I do not see any correlation of the following to anything I can change in the BIOS.
Mar 10 09:08:42 kernel hdacc0: <Realtek ALC221 HDA CODEC> at cad 0 on hdac0
Mar 10 09:08:42 kernel hdaa0: <Realtek ALC221 Audio Function Group> at nid 1 on hdacc0
Mar 10 09:08:42 kernel pcm0: <Realtek ALC221 (Analog)> at nid 23 and 26,27 on hdaa0
Mar 10 09:08:42 kernel pcm1: <Realtek ALC221 (Analog 2.0+HP)> at nid 20,33 on hdaa0
Mar 10 09:08:42 kernel hdacc1: <Intel Skylake HDA CODEC> at cad 2 on hdac0
Mar 10 09:08:42 kernel hdaa1: <Intel Skylake Audio Function Group> at nid 1 on hdacc1
Mar 10 09:08:42 kernel pcm2: <Intel Skylake (HDMI/DP 8ch)> at nid 3 on hdaa1
I got into the BIOS and did not find anything for changing any of the above.
However, I did find where the system can "sleep" or change to low power for several items. I set all of that off.
I also ran the I/O tests on the SSDs while I had the opportunity, and the initial tests came back good. I ran the extended tests and they are also good.
Then I ran the RAM tests and they came out with no errors detected.
The big question I have is: what would cause pfSense to "fail" and stop responding to ping and stop serving its web page(s), yet still allow an iPhone 5 attached via an RJ45 (Ethernet) adapter to continue streaming data via TuneIn (out of Europe in this case), while not responding to the keyboard/mouse attached to the server via USB? Oh, and it caused a Roku box (connected by Wi-Fi) to lose its connection, so the TV(s) attached to it lost their connections too. In other words, what makes that iPhone 5 special that it did not lose its connections?
And I think this has happened now 3 times.
-
If the firewall is unable to open new states it would present like that. Existing states stay open so traffic continues. I would expect to see that logged though. Especially if it actually ran out of states.
You should be able to disable on-board audio in the BIOS unless it's significantly locked down.
-
@stephenw10 said in Back to odd problem -- lose WAN at random points with a week or more between events:
You should be able to disable on-board audio in the BIOS unless it's significantly locked down.
I've swapped it back in this morning. That unit doesn't have a speaker; it has audio connections. But I saw nothing related to audio that I could disable.
BTW this is an HP box and they don't make a lot of doc available -- security by obscurity.
So now waiting to see if it has this lack of connections problem again, or the loss of WAN issue.
-
The chipset has that audio hardware in it though and it's consuming resources. We have seen that cause conflicts with other hardware.
-
Would this cause the system to run out of space in the "states table" and is that where I should look to see if we are headed into problems? I've been looking in the doc trying to figure this out. <big interruption> Had the system get locked up and had to swap the backup unit in.
I do not know why, but it is not taking very long to run into the situation where it cannot handle any new traffic. It breaks connections for some currently running things (such as my connection to a mainframe where I was working on a product), while others keep running (like the iPhone streaming music out of Germany). Everything else got stopped, such that I could not ping the server from inside the LAN with either W11 laptop that was connected by wire.
Wylbur.
-
It's unlikely to be exhausting the state table in my opinion. You can see the states usage on the dashboard like:
State table size 0% (3/98000)
It's also logged in Status > Monitoring Graphs so you would see there too if it were ever getting close to 100%.
Just to confirm you're not using the re NIC any longer?
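To keep an eye on state usage over time, the dashboard line quoted above can be parsed and compared against a threshold. A minimal Python sketch, assuming the text always has the `(used/limit)` shape shown in that line; the function names and the 90% threshold are illustrative choices, not pfSense settings.

```python
import re

# Matches the "(3/98000)" part of "State table size 0% (3/98000)".
STATE_COUNTS = re.compile(r"\((?P<used>\d+)/(?P<limit>\d+)\)")


def state_usage(text):
    """Return (used, limit, percent_used) parsed from a dashboard-style line."""
    m = STATE_COUNTS.search(text)
    if m is None:
        raise ValueError(f"no (used/limit) pair found in: {text!r}")
    used = int(m.group("used"))
    limit = int(m.group("limit"))
    if limit <= 0:
        raise ValueError("state table limit must be positive")
    return used, limit, 100.0 * used / limit


def near_exhaustion(text, threshold_pct=90.0):
    """True if state usage is at or above threshold_pct (an arbitrary default)."""
    return state_usage(text)[2] >= threshold_pct


line = "State table size 0% (3/98000)"
used, limit, pct = state_usage(line)
print(f"{used} of {limit} states in use ({pct:.3f}%)")
print("near exhaustion:", near_exhaustion(line))
```

If usage were ever close to the limit at the time of a lockup, that would support the state exhaustion theory. A value near 0% at that moment would point elsewhere.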