WG x750e - automatic speed adjustment: mbmon going crazy
This time, mbmon started producing bogus data (CPU temp around 9°C) after 24 days of uptime.
Unfortunately, the new version of WGXepc reports the same reading.
Let me know if you want me to try something else, otherwise, I'll just reboot the machine.
So, will an automatic fan speed control daemon show up in the next release as a click-to-enable option?
This script is specifically for the Watchguard Firebox X-e boxes, so it's very unlikely to be in a pfSense release. Even if it were more generic it probably wouldn't ever be included as more than a package. Any script that can control the fan speeds in a box has the potential to cause damage by overheating the CPU. Usually in something with thermal fan control, like a laptop, the control is handled directly by the SuperIO chip such that it will continue to cool correctly even if the OS crashes. Unfortunately the SuperIO chip in the X-e box doesn't support this.
So reading the temperature with WGXepc gives the same value as mbmon. That implies the SuperIO chip is actually reporting the wrong value. Why could that be? It could be that the temperature offset register has been set somehow or that it's reading the wrong register for some reason. We could try investigating that but it will be quite involved. You could try using WGXepc in your script instead. Since it only reads one register (whereas mbmon reads all registers every time it's run) it may make a difference.
Unfortunately, it did not work much better. After 2 weeks, I started getting temperatures of 255°C.
When I tried mbmon from the command line, it reported an error stating it could not access the hardware.
Everything went back to normal after a reboot.
Ah, so what were you using in the script that didn't work? I forget where we left off. ::)
Something goes wrong after a short while:
- WGXepc does not properly set the fan speed anymore
- WGXepc does not report the proper temperature anymore
- mbmon does not execute properly anymore.
$ /usr/local/bin/WGXepc -f 50
Found Firebox X-E
Fanspeed set to 50
$ /usr/local/bin/WGXepc -f
Found Firebox X-E
Fanspeed is ff
$ /usr/local/bin/WGXepc -t
Found Firebox X-E
SuperIO sensor 2 reads: 255
$ /usr/local/bin/mbmon -I -i -c1 -T2
No ISA-IO HWM available!!
InitMBInfo: Unknown error: 0
At this stage, I guess the best option would be to modify the WGXepc source code to allow it to keep executing in the background and set the fan speed depending on the temperature that is read.
Ok, but was your script using mbmon or WGXepc to read the temperature? The difference between them is that mbmon reads a whole load of values every time it's run, even if you only need one. Getting all those values requires setting the SuperIO chip in various modes. It's possible that under certain conditions mbmon leaves the SuperIO chip in some error state. It may be possible to determine what the error state is and recover from it, or to avoid it in the first place.
My script was using only WGXepc.
Well the only thing to do then is try to find out why the SuperIO chip is no longer responding usefully. Quite how to do that though…. ::)
Did you guys ever resolve this? I have exactly the same problem. Running mbmon returns reasonable temps for a while and suddenly starts reporting this:
Temp.= 255.0, 0.0, 0.0; Rot.= 0, 0, 0
Vcore = 4.08, 4.08; Volt. = 4.08, 6.85, 15.50, 6.07, 5.11
When I cancel mbmon and run it again, I get:
ioctl(smb0:open): No such file or directory
No Hardware Monitor found!!
InitMBInfo: Bad file descriptor
WGXepc -t reports good temps and then nothing but 255.
Only way to fix this is to reboot.
I'm assuming the SuperIO chip got itself hosed somehow. Is there any way to reset or reboot the chip?
This is on two X-Core 550e boxes and the default SL6N7 Banias chips have been replaced with SL7EP Dothan chips according to https://doc.pfsense.org/index.php/PfSense_on_Watchguard_Firebox#Further_Enhancements_3.
All dip switches set correctly. No other changes (powerd/speedstep) were made.
No, or at least I haven't heard from anyone who did. However reading back through the data sheet there appear to be a number of possible things we could try. Since the chip is just giving results that are registers full of all 1s we don't know if it's actually returning anything or if we're even talking to it properly. Though it would seem likely we are because under mbmon the voltage readings continue to come back as reasonable numbers.
As a rather extreme option it looks like there is a register that can re-initialise the chip, back to its power-on defaults. However I've no way of knowing what registers are configured by the BIOS at boot so the results could be…. unpredictable. ;)
Thanks for responding, Steve. I posted a new topic because this one was old and I wasn't sure it was still active. I'm happy to stay in this one.
This has happened on two boxes, but I have another that works fine. I suppose it's possible that two of the Dothan chips I put in these boxes are creating this problem, but it seems unlikely it would be two. I'm going to replace the replacement in one of them with another chip and wait for results.
Is the data sheet you spoke of available electronically? Where can I get it?
Yes, it's available in many places such as here.
Interacting with the chip manually for test purposes is a PITA. ::) It involves writing many individual registers to read one value. For example:
The reset register will not require so much though. It would be interesting to install superiotool to see what has stopped and what is still readable when the chip enters its uncooperative state. We may get a clue.
I replaced the chip and have been running and watching mbmon for an hour. No problems. I guess it's possible I had two bad chips with the same symptoms. If the problem shows up again, I'll try your idea with superiotool. Until then, I'm moving on.
Hard to see how changing the CPU could make much difference. A change in the temperature sensor perhaps?
Hard to see how that would affect the fan control though.
Beats the crap out of me. Been watching mbmon for 2 1/2 hours and still working great.
I noticed that when I took out the other chip, I had smeared on a lot of paste (Arctic Silver). I was much more conservative with the new chip.
Could that have anything to do with this?
Spoke too soon. It took 3 hours but mbmon started showing this:
Temp.= 255.0, 0.0, 0.0; Rot.= 0, 0, 0
Vcore = 4.08, 4.08; Volt. = 4.08, 6.85, 15.50, 6.07, 5.11
I guess I'll look into resetting the chip as you suggested.
Are you running bios B8? If so you can also get the temperature via ACPI on the dashboard or sysctl. Does that fail also?
Looks like I'm running B6. Temps not available through ACPI.
The superio chip should be a Winbond, right? On other pictures of x550e boards, you can clearly see the Winbond logo. On this machine, the chip is covered with a Phoenix Technologies sticker. How can I determine what chip I have?
This is what I get if I cancel mbmon and restart it:
[2.1.5-RELEASE]/root(50): mbmon -d -S
SMBus[Intel8XX(ICH/ICH2/ICH3/ICH4/ICH5/ICH6)] found, but No HWM available on it!!
InitMBInfo: Device not configured
They all have the Phoenix sticker on the Winbond chip, nothing unusual there. Superiotool will identify the chip.
Ok, here's the superiotool dump before the problem:
superiotool r4.0-2827-g1a00cf0
Found Winbond W83627HF/F/HG/G (id=0x52, rev=0x41) at 0x2e
Register dump:
idx 02 20 21 22 23 24 25 26 28 29 2a 2b 2c 2e 2f
val ff 52 41 ff fe c0 00 00 00 00 fc c4 ff 00 ff
def 00 52 NA ff 00 MM 00 00 00 00 7c c0 00 00 00
LDN 0x00 (Floppy)
idx 30 60 61 70 74 f0 f1 f2 f4 f5
val 00 00 00 00 04 0e 00 ff 00 00
def 01 03 f0 06 02 0e 00 ff 00 00
LDN 0x01 (Parallel port)
idx 30 60 61 70 74 f0
val 01 03 78 07 04 38
def 01 03 78 07 04 3f
LDN 0x02 (COM1)
idx 30 60 61 70 f0
val 01 03 f8 04 00
def 01 03 f8 04 00
LDN 0x03 (COM2)
idx 30 60 61 70 f0 f1
val 01 02 f8 03 00 00
def 01 02 f8 03 00 00
LDN 0x05 (Keyboard)
idx 30 60 61 62 63 70 72 f0
val 01 00 60 00 64 01 00 80
def 01 00 60 00 64 01 0c 80
LDN 0x06 (Consumer IR)
idx 30 60 61 70
val 00 00 00 00
def 00 00 00 00
LDN 0x07 (Game port, MIDI port, GPIO 1)
idx 30 60 61 62 63 70 f0 f1 f2
val 01 00 00 00 00 00 00 00 00
def 00 02 01 03 30 09 ff 00 00
LDN 0x08 (GPIO 2, watchdog timer)
idx 30 f0 f1 f2 f3 f5 f6 f6 f7
val 00 ff ff ff 00 00 00 00 00
def 00 ff 00 00 00 00 00 00 00
LDN 0x09 (GPIO 3)
idx 30 f0 f1 f2 f3
val 00 ff ff ff 00
def 00 ff 00 00 00
LDN 0x0a (ACPI)
idx 30 70 e0 e1 e2 e3 e4 e5 e6 e7 f0 f1 f3 f4 f6 f7 f9 fe ff
val 00 00 00 00 14 00 00 00 00 00 00 af 32 00 00 00 00 00 00
def 00 00 00 00 NA NA 00 00 00 00 00 00 00 00 00 00 00 00 00
LDN 0x0b (Hardware monitor)
idx 30 60 61 70 f0
val 01 02 90 00 00
def 00 00 00 00 00
After mbmon starts failing as above, there is only one difference in the dump. A single byte in the hardware monitor:
LDN 0x0b (Hardware monitor)
idx 30 60 61 70 f0
val 01 02 0b 00 00
def 00 00 00 00 00
Steve, how would you go about doing the reset you were talking about?
EDIT: I assume the reset you mentioned was the initialization bit in the configuration register at 40h. I tried writing to this register with your readio/writeio tools but still got nothing back but FF. In fact, all reads are returning FF. Except for the results of the superiotool dump, I would think the chip had simply shut down.
I can live without the temperature if I must. But I have no ability to change the fan speed now. I had it set up to automatically adjust the fan speed based on the CPU temp. Now I can neither determine the temp nor adjust the fans. The only way to fix this appears to be to reboot which is not acceptable.
Would enabling speedstep or powerd make any difference?
Steve, you're gonna love this!
After learning what the output from superiotool actually meant and reviewing the Winbond data sheet, I realized the only change between the two dumps was that the hardware monitor base address changed from 0290h to 020Bh. I wrote a little script using your readio/writeio tools to reset the base address. I never expected this to work but lo and behold it did! Everything started working again.
I guess I should modify WGXepc to check for and correct this but I don't have a FreeBSD platform to do it on. Maybe I'll try and do it on one of my Fireboxes.
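For anyone who wants to check for this from a saved dump, here is a minimal sketch that pulls the hardware monitor base address out of superiotool output. It assumes the dump layout shown earlier in this topic (an "LDN 0x0b" header followed by "idx" and "val" rows with matching columns). The function name hwm_base_from_dump is made up for illustration and is not part of superiotool or WGXepc.

```sh
#!/bin/sh
# Sketch only: read a superiotool register dump on stdin and print the
# hardware monitor base address built from registers 0x60 (high byte)
# and 0x61 (low byte) of logical device 0x0b.
# hwm_base_from_dump is a hypothetical helper name.
hwm_base_from_dump() {
    awk '
        # Track whether we are inside the hardware monitor section.
        /^LDN /               { in_hwm = ($2 == "0x0b") }
        # Remember which column each register index sits in.
        in_hwm && $1 == "idx" { for (i = 2; i <= NF; i++) col[$i] = i }
        # Print high byte then low byte from the matching "val" row.
        in_hwm && $1 == "val" { printf "0x%s%s\n", $(col["60"]), $(col["61"]); exit }
    '
}
```

Fed the healthy dump it should print 0x0290, and fed the broken one 0x020b, so a script could compare the result against 0x0290 before deciding whether the address needs resetting.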
Nice work. That seems odd though. I'll have to read the data sheet myself unless you can enlighten me. How was anything able to be read if the base address had changed? Only the extended registers changed? Why did it change? :-
Like you say it should be relatively easy to check that and set it back. :)
Hmm, just wondering if the base address changed due to some other piece of hardware requiring access to that address space. I didn't think it could change except at boot on the ISA bus but really I don't know. Perhaps even our own script is trying to access the space twice causing the shift.
It might be better to read the base address and just use it rather than trying to change it back.
Nothing was able to read from the hardware monitor but everything else seems to have been okay. I've been doing some thinking about that and looking at the WGXepc code.
As I said, the only difference in the registers was device 0x0b (Hardware Monitor) register 0x61 changed from 0x90 to 0x0b. I run a daemon that uses WGXepc to continually check the temp and adjust the fan speed accordingly. The basic script came from https://forum.pfsense.org/index.php?topic=66129.msg360358#msg360358. bigramon was doing the same thing which started this topic.
I'm working on the theory that WGXepc is writing 0x0b to the 0x61 register for device 0x0b. I can see how this could happen if port_out(EFDR, 0x0b) was done after a call to get_w83627_addr_port() but it does not look like that is what's happening.
I decided I didn't want to modify WGXepc. It's not my code and I really don't know where any repository for it is. I believe this problem is caused by a bug in it, though. There is one issue I saw: get_w83627_addr_port() leaves the chip in Extended Function mode. The result of subsequent port I/O might be unpredictable at times.
What I did instead, Steve, was to modify my daemon script to check for a temp of 255 and then use your writeio tool to reset the base address for the hardware monitor. Not the most elegant solution but it works reliably, so far.
For anyone interested in this solution, here is the relevant section of the script:
cpu_temp=`/usr/local/bin/WGXepc -t | sed '1,2d'`
# Temperature of 255 means Winbond chip hardware monitor base address
# has been hosed. Use writeio to reset the address. This has only been
# tested on the X550e.
if [ $cpu_temp -ge 255 ]
then
    /usr/local/bin/writeio 0x2e 0x87 > /dev/null # Put Winbond into
    /usr/local/bin/writeio 0x2e 0x87 > /dev/null # Extended Function mode.
    /usr/local/bin/writeio 0x2e 0x07 > /dev/null # Set logical device number
    /usr/local/bin/writeio 0x2f 0x0b > /dev/null # to Hardware Monitor.
    /usr/local/bin/writeio 0x2e 0x60 > /dev/null # Reset
    /usr/local/bin/writeio 0x2f 0x02 > /dev/null # monitor base
    /usr/local/bin/writeio 0x2e 0x61 > /dev/null # address to
    /usr/local/bin/writeio 0x2f 0x90 > /dev/null # 0290h.
    /usr/local/bin/writeio 0x2e 0xaa > /dev/null # Exit Extended Function mode.
    continue
fi
You can get the writeio tool from Steve's Google site. There is a link to it earlier in this topic.
Thanks again for your help, Steve.
Thanks for that. I'll have to look into that bug. However I don't think it should be an issue for just reading the temperature because to do so does not require using extended function mode.
It's also interesting to note that in the third post in this thread Bigramon's box fails to a state where the temperature is showing 88C and the fan speeds stop reading but the hardware monitor base address is unchanged. Perhaps a completely different bug. ::)
get_w83627_addr_port() enters extended mode to get the base address, adds 0x05 to it, and returns the result. getcputemp() calls get_w83627_addr_port() and uses the result (0295h) as the index port and +1 as the data port. The reads fail afterward because they're using the wrong ports.
This was why I didn't want to change the code. It was unclear why this was done this way so I didn't have enough confidence in my level of knowledge and I only had the X550e to test with.
I agree bigramon's original problem must have been something else.
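To make the port arithmetic above concrete, here is a small sketch of how the index and data ports follow from the base address: base + 5 for the index port and one more for the data port, as described for get_w83627_addr_port() and getcputemp(). The function name hwm_ports is hypothetical and this is not the WGXepc code itself.

```sh
#!/bin/sh
# Sketch only: derive the hardware monitor index/data ports from the
# base address held in registers 0x60/0x61.
# hwm_ports is a hypothetical helper name, not a WGXepc function.
hwm_ports() {
    base=$(( $1 ))    # accepts hex such as 0x0290
    # Index port is base + 5, data port is the next port up.
    printf 'index=0x%04x data=0x%04x\n' $(( base + 5 )) $(( base + 6 ))
}

hwm_ports 0x0290    # normal base address
hwm_ports 0x020b    # base address seen after the failure
```

With the normal base this gives 0295h/0296h, the ports the scripts in this topic use. With the corrupted value it gives 0210h/0211h, which is why code that trusts the register ends up talking to the wrong ports.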
Ha, I should really re-familiarise myself with the code before commenting. ;)
Interestingly that's how it should be done and what I suggested above. Read in the base address and use it, don't assume it's 290. However that implies that if the base address changed it should still work. :-\ It seems then that the register indicating the base address changed but the actual address perhaps did not. It might have worked better if I had just assumed it was at 290! ::)
Actually, from reading the data sheet, your approach seems correct. Didn't realize you were the original author. Also, after the chip got hosed, I tried getting the temp using writeio/readio and got the same result. Don't know if that's significant.
I suggest you change get_w83627_addr_port() so it exits extended function mode. Let me know if you do and I'll test it here.
Hmm, you mean you got an invalid temperature value? What base address did you use?
I'll definitely look at that function, it shouldn't be left in extended mode.
After it was showing 020Bh I used 0290h directly and it gave me 255. For fan speed too.
I will confirm that again on Monday and let you know for sure. I don't have the equipment here.
I've confirmed the problem happens outside of WGXepc when the hardware address in 60 and 61 has been altered.
breakwinbond script sets 60 and 61 to a bad value just as I believe WGXepc is doing:
./writeio 0x2e 0x87
./writeio 0x2e 0x87
./writeio 0x2e 0x07
./writeio 0x2f 0x0b
./writeio 0x2e 0x61
./writeio 0x2f 0x0b
./writeio 0x2e 0xaa
fixwinbond script restores 60 and 61 to the correct value:
./writeio 0x2e 0x87
./writeio 0x2e 0x87
./writeio 0x2e 0x07
./writeio 0x2f 0x0b
./writeio 0x2e 0x60
./writeio 0x2f 0x02
./writeio 0x2e 0x61
./writeio 0x2f 0x90
./writeio 0x2e 0xaa
getwinbond returns the fan speed and cpu temp:
./writeio 0x0295 0x5a > /dev/null
fan=`./readio 0x296 | cut -d: -f2 | tr '[:lower:]' '[:upper:]'`
./writeio 0x0295 0x4e > /dev/null
./writeio 0x0296 0x01 > /dev/null
./writeio 0x0295 0x50 > /dev/null
cpu=`./readio 0x296 | cut -d: -f2 | tr '[:lower:]' '[:upper:]'`
cpu=`echo "ibase=16; $cpu" | bc`
./writeio 0x0295 0x4e > /dev/null
./writeio 0x0296 0x00 > /dev/null
echo "Fan Speed = $fan"
echo "CPU Temp = $cpu"
[2.1.5-RELEASE]/tmp(81): WGXepc -f 40
Found Firebox X-E
Fanspeed set to 40
[2.1.5-RELEASE]/tmp(81): WGXepc -f
Found Firebox X-E
Fanspeed is 40
[2.1.5-RELEASE]/tmp(81): WGXepc -t
Found Firebox X-E
SuperIO sensor 2 reads: 36
[2.1.5-RELEASE]/tmp(81): ./getwinbond
Fan Speed = 40
CPU Temp = 37
[2.1.5-RELEASE]/tmp(81): ./breakwinbond
Setting 2e to 87
Setting 2e to 87
Setting 2e to 7
Setting 2f to b
Setting 2e to 61
Setting 2f to b
Setting 2e to aa
[2.1.5-RELEASE]/tmp(81): WGXepc -f
Found Firebox X-E
Fanspeed is ff
[2.1.5-RELEASE]/tmp(81): WGXepc -t
Found Firebox X-E
SuperIO sensor 2 reads: 255
[2.1.5-RELEASE]/tmp(81): ./getwinbond
Fan Speed = FF
CPU Temp = 255
[2.1.5-RELEASE]/tmp(93): ./fixwinbond
Setting 2e to 87
Setting 2e to 87
Setting 2e to 7
Setting 2f to b
Setting 2e to 60
Setting 2f to 2
Setting 2e to 61
Setting 2f to 90
Setting 2e to aa
[2.1.5-RELEASE]/tmp(81): WGXepc -f
Found Firebox X-E
Fanspeed is 40
[2.1.5-RELEASE]/tmp(81): WGXepc -t
Found Firebox X-E
SuperIO sensor 2 reads: 37
[2.1.5-RELEASE]/tmp(94): ./getwinbond
Fan Speed = 40
CPU Temp = 36
WGXepc uses registers 60 and 61 as the base address from which to build the port numbers for the Index and Data port of the hardware monitor. After register 61 is corrupted, WGXepc returns bad data because it is accessing the wrong ports.
The getwinbond script uses the 295h and 296h ports directly and still gets bad data. I'm guessing this is because changing the value in register 61 changes the circuit path in the LPC bus. The data sheet indicates the LAD pins of the LPC interface are used to control peripheral addressing so maybe this has something to do with it. Just as getting the temp from sensor 2 requires setting the bank before reading the output, 60 and 61 must be setting the correct location for port I/O to work as expected.
As I said, I'm guessing but this seems most likely.
Hmm, interesting. Having read back through the code (amazing what you forget ::)) I remember now I added the code to read in the base address correctly because the XTM5 uses a different address. However I think that was before I added the temperature code for the X-e. So the bug that left it in extended function mode may have always been there. However since it also happens in mbmon it's probably not the cause. Anyway it's easy enough to put in a check so I'll do that when I get a chance. Another solution would be to use the B8 bios and read the temperature exported by ACPI via the sysctl. I have this on my home box where the temperature is shown on the dashboard and it hasn't failed yet. It probably only polls the value while the dash is shown though.
Anyway some good progress. :)
I've updated WGXepc to include code to check the SuperIO data base address when the X-e model is detected. I've also fixed the bug that left the chip in extended function mode and attempted to rationalise some of the code somewhat. It's still pretty awful though! ::)
Anyway the updated source and binaries are on my Google site as normal. I'd welcome some testing.
One thing I did notice when testing here is that entering something unexpected for the fan speed can result in a setting of 0. I have never tried 0. It really does shut off the fans completely. Do you think that should be possible or should there be some check to make sure a minimum fan speed is maintained?
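On the minimum fan speed question, one option is to do the check in the calling script before the value ever reaches WGXepc. The sketch below is only an illustration: it assumes the fan speed is given as a one-byte hex value (which the 40/ff readings in this topic suggest), and both the name safe_fan_speed and the floor of 10 are made up rather than taken from WGXepc.

```sh
#!/bin/sh
# Sketch only: validate a requested fan speed and enforce a floor.
# MIN_FAN and safe_fan_speed are hypothetical; the floor of 10 (hex) is
# an arbitrary example and has not been tested for adequate cooling.
MIN_FAN=10

safe_fan_speed() {
    case "$1" in
        ''|*[!0-9a-fA-F]*) return 1 ;;   # reject empty or non-hex input
    esac
    [ "${#1}" -le 2 ] || return 1        # reject anything wider than one byte
    val=$(( 0x$1 ))
    min=$(( 0x$MIN_FAN ))
    if [ "$val" -lt "$min" ]; then
        val=$min                         # never go below the floor
    fi
    printf '%02x\n' "$val"
}

# Possible use in a fan script (not run here):
#   speed=$(safe_fan_speed "$requested") && /usr/local/bin/WGXepc -f "$speed"
```

Unexpected input is rejected outright instead of silently becoming 0, and a request of 0 is raised to the floor, so the fans can't be shut off by accident.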
I will do some testing.
Something weird happened today. The router stopped passing traffic inbound or outbound. Fixed when I rebooted. I noticed the temp from WGXepc kept coming back as 0 and the fanspeed as well. Don't know if this is related to WGXepc. I will update if this happens again.
Steve, I tried to install superiotool with this:
pkg_add -r ftp://ftp.freebsd.org/pub/FreeBSD/ports/i386/packages/All/superiotool-20121019.tbz
Same command I used before and it worked fine on the other box. Now when I try to use it I get this:
/libexec/ld-elf.so.1: Shared object "libz.so.6" not found, required by "superiotool"
If you're doing this in pfSense 2.1.5 you need to use the FreeBSD 8.3 repo, which has now been archived. It looks like you're using a current repo. Try:
pkg_add -r http://ftp-archive.freebsd.org/pub/FreeBSD-Archive/ports/i386/packages-8.3-release/Latest/superiotool.tbz
You may have to remove whatever you previously installed first and it may have installed some dependencies. :-\
I had not seen you had been investigating further before today.
Just downloaded the new WGXepc and restarted my auto-speed script.
Let's see how it behaves now :)