zfs zpool status DEGRADED - correct procedure to replace the failed disk ?
-
pool: zroot
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 1.49G in 00:01:02 with 0 errors on Tue Apr 4 02:30:48 2023
config:
  NAME        STATE     READ WRITE CKSUM
  zroot       DEGRADED     0     0     0
    mirror-0  DEGRADED     0     0     0
      ada0p3  ONLINE       0     0     0
      ada1p3  FAULTED      3     5     1  too many errors
errors: No known data errors
Do I take the "easy" option (back up my config.xml, replace the failed disk and install the firewall again), or is there a correct procedure to replace the disk without reinstalling?
With all the storage kit I get to use at work, when a disk fails you just swap it out and the array takes care of the rest, so going the route of installing the firewall again seems silly.
-
@alactus On a straight FreeBSD system, I've added a 3rd disk to the mirror (yes a 3 way mirror), let it resilver, then removed the faulted device from the mirror.
But that assumes there are enough power/data connectors to do this.
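The add-then-remove sequence described above can be sketched roughly as follows. This is only a sketch: ada2p3 is a hypothetical third partition that does not exist on the system in this thread, and `zpool wait` assumes an OpenZFS 2.x userland. The helper prints the commands by default and only executes them when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Arguments: pool, healthy member, faulted member, new partition.
# All device names passed in are the caller's assumptions.
replace_via_third_disk() {
    pool=$1 healthy=$2 faulted=$3 new=$4
    run zpool attach "$pool" "$healthy" "$new"   # mirror becomes 3-way, resilver starts
    run zpool wait -t resilver "$pool"           # block until the resilver finishes
    run zpool detach "$pool" "$faulted"          # drop the faulted member
}
```

Calling `replace_via_third_disk zroot ada0p3 ada1p3 ada2p3` with the default dry run just lists the three zpool commands so they can be reviewed before anything touches the pool.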
Otherwise you should be able to "google freebsd replace a failed device in a mirror" and find the commands you need.
Regardless, I'm guessing you're powering down the system in order to add/replace a device. If you need to do that, I think your "easy" option is the cleanest.
If you can physically hot swap the devices, the standard commands should work.
-
@mer It's a Dell R210 1U server and there aren't any spare SATA ports (I have spare disks sitting on the shelf).
Because it is a 1U box it will have to be powered off to get the faulted drive out of the case. Sad that Dell never put 2.5" drive sleds in this design; I knew I should have nabbed the HP 1U server rather than this :D
I did try looking for answers on this, but they all point to systems that already have spare disks installed. I really wish Netgate had an option in the UI to replace a failed disk; I'll go poke at that Google search you suggest.
Thanks
Drew
-
So first hit on google looks to be just the ticket
https://www.adminbyaccident.com/freebsd/how-to-freebsd/how-to-replace-a-disk-on-a-zfs-mirror-pool/
Slight tweak for the swapsize but it seems to cover the task needed, thanks
-
@alactus said in zfs zpool status DEGRADED - correct procedure to replace the failed disk ?:
https://www.adminbyaccident.com/freebsd/how-to-freebsd/how-to-replace-a-disk-on-a-zfs-mirror-pool/
So I replaced the disk and followed the guide, but it's not right: I ended up with two ada1p3 entries, and for some reason the zpool status in pfSense doesn't offer a way to remove the stale one.
pool: zroot
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
scan: resilvered 3.83G in 00:01:58 with 0 errors on Mon Apr 17 16:00:44 2023
config:
  NAME        STATE     READ WRITE CKSUM
  zroot       DEGRADED     0     0     0
    mirror-0  DEGRADED     0     0     0
      ada0p3  ONLINE       0     0     0
      ada1p3  UNAVAIL      0     0     0  cannot open
      ada1p3  ONLINE       0     0     0
errors: No known data errors
I am tempted to just back the cfg up and install fresh
-
So I ran
zpool detach zroot ada1p3
This is odd; I would have thought it would want the GUID of the disk to remove (as the guide seems to suggest). It feels like the ZFS support in pfSense is a little half-baked?
pool: zroot
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: resilvered 3.83G in 00:01:58 with 0 errors on Mon Apr 17 16:00:44 2023
config:
  NAME        STATE     READ WRITE CKSUM
  zroot       ONLINE       0     0     0
    mirror-0  ONLINE       0     0     0
      ada0p3  ONLINE       0     0     0
      ada1p3  ONLINE       0     0     0
errors: No known data errors
Think the lesson here is to back the cfg up and reinstall perhaps
-
The 'features not enabled' is just an artifact of how the ZFS pool is created in the installer and should not be any issue here. You can run the upgrade to enable them if you want.
That guide seems to have additional steps, though they explain why.
In the past I have used this: https://farrokhi.net/posts/2020/05/replacing-a-faulty-disk-in-zfs/
We should have our own docs for that though, I'll open a request.
-
The thing that threw me was that I had just created ada1p* and attached ada1p3 to the pool; detaching the same 'device' name (even though it isn't the same physical device anymore) didn't feel like the right command to run.
I guess ZFS in pfSense doesn't expose the raw names for the devices? Different feature sets, perhaps?
Maybe the correct procedure would have been to use zpool replace, but with nothing official from Netgate I was left to experiment. It worked out in the end.
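One way around the ambiguity of two members both called ada1p3 is to address them by GUID. `zpool status -g` prints vdev GUIDs in place of device names, and those GUIDs are accepted by zpool detach and zpool replace. A minimal sketch, not verified on pfSense itself: the helper name below is made up for illustration and simply picks out the members whose state is not healthy.

```shell
#!/bin/sh
# Read `zpool status -g <pool>` output on stdin and print the GUID of every
# vdev whose STATE column is UNAVAIL, FAULTED, REMOVED or OFFLINE.
# (unhealthy_vdevs is a hypothetical helper name, not a ZFS command.)
unhealthy_vdevs() {
    awk '$1 ~ /^[0-9]+$/ && $2 ~ /^(UNAVAIL|FAULTED|REMOVED|OFFLINE)$/ { print $1 }'
}

# Intended use (left commented out so nothing runs against a pool by accident):
#   guid=$(zpool status -g zroot | unhealthy_vdevs)
#   zpool detach zroot "$guid"            # drop only the dead member, or
#   zpool replace zroot "$guid" ada1p3    # instead of attach + detach, before attaching
```

Working from the GUID removes any doubt about which of the two identically named entries a detach will hit.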
There are slight differences in the partition layout, but I don't suspect they will cause an issue (worst case, I install fresh).
=>       40  976773088  ada0  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   67108864     2  freebsd-swap  (32G)
   67110912  909662208     3  freebsd-zfs  (434G)
  976773120          8        - free -  (4.0K)
=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064   67108864     2  freebsd-swap  (32G)
   67109928  909663200     3  freebsd-zfs  (434G)
-
Yeah, I wouldn't expect the difference in unformatted space to make a difference.
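For reference, the size difference between the two freebsd-zfs partitions can be worked out from the sector counts in the gpart output. This assumes 512-byte sectors, which is consistent with 976773088 sectors being reported as 466G.

```shell
#!/bin/sh
# Sector counts of the freebsd-zfs partitions taken from the gpart listings
old_p3=909662208   # ada0p3 (original disk)
new_p3=909663200   # ada1p3 (replacement disk)
sector=512         # assumed sector size in bytes

diff_sectors=$((new_p3 - old_p3))
diff_bytes=$((diff_sectors * sector))
echo "ada1p3 is larger by $diff_sectors sectors ($((diff_bytes / 1024)) KiB)"
# prints: ada1p3 is larger by 992 sectors (496 KiB)
```

The 992 sectors match the 984 + 8 free sectors that ada0 leaves unused, and the new partition is the larger of the two, which is the direction a mirror attach needs.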
-
@alactus You've run into why some folk (myself included) prefer to partition devices for ZFS instead of raw devices. "One man's 256GB device is not the same size as another's 256GB device".
In general ZFS will be "smallest size" of the devices. You have a 1TB device and add a 3TB device as a mirror: ZFS says "you have a 1TB mirror".
The cool thing is:
I have a 1TB device and add a 3TB device to make a mirror. I now have a 1TB mirror. I wait for it to resilver, then add another 3TB device to the mirror, wait for it to resilver again, then remove the 1TB device from the mirror. Now I have a 3TB mirror.
Yes, there are a couple of steps left out, but that is how you can grow a mirror device. I've done it and it's pretty neat.
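A rough sketch of that grow sequence, printed as a plan and not executed. The pool and device names passed in are hypothetical, `zpool wait` assumes OpenZFS 2.x, and setting autoexpand is my guess at one of the steps left out; if it was not set beforehand, `zpool online -e` on the remaining members is the usual alternative.

```shell
#!/bin/sh
# Print (do not run) the commands for growing a mirror by swapping in larger disks.
# Arguments: pool, small device, first large device, second large device.
grow_mirror_plan() {
    pool=$1 small=$2 big1=$3 big2=$4
    printf '%s\n' \
        "zpool set autoexpand=on $pool" \
        "zpool attach $pool $small $big1" \
        "zpool wait -t resilver $pool" \
        "zpool attach $pool $big1 $big2" \
        "zpool wait -t resilver $pool" \
        "zpool detach $pool $small"
}
```

For example, `grow_mirror_plan tank da0 da1 da2` lists the six commands in order for review.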
-
Having a poke round in
/var/log/bsdinstall_log
[23.01-RELEASE][admin@pfSense.localdomain]/var/log: grep "freebsd-boot" bsdinstall_log
DEBUG: zfs_create_diskpart: gpart add -a 4k -l gptboot0 -t freebsd-boot -s 512k "ada0"
DEBUG: zfs_create_diskpart: gpart add -a 4k -l gptboot1 -t freebsd-boot -s 512k "ada1"
[23.01-RELEASE][admin@pfSense.localdomain]/var/log: grep "freebsd-swap" bsdinstall_log
DEBUG: zfs_create_diskpart: gpart add -a 1m -l swap0 -t freebsd-swap -s 34359738368b "ada0"
DEBUG: zfs_create_diskpart: gpart add -a 1m -l swap1 -t freebsd-swap -s 34359738368b "ada1"
[23.01-RELEASE][admin@pfSense.localdomain]/var/log: grep "freebsd-zfs" bsdinstall_log
DEBUG: zfs_create_diskpart: gpart add -a 1m -l zfs0 -t freebsd-zfs "ada0"
DEBUG: zfs_create_diskpart: gpart add -a 1m -l zfs1 -t freebsd-zfs "ada1"
Gives me the exact commands run when i set this up (only found this trick out in my travels of looking this up)
Realistically this should all sit behind a menu in the GUI.
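The three greps can be folded into one helper that pulls out the installer's gpart add commands for a single disk, ready to paste. A small sketch, assuming the log lines look exactly like the ones shown above; gpart_cmds_for is a made-up name.

```shell
#!/bin/sh
# Read bsdinstall_log on stdin and print the `gpart add` commands the
# installer ran for the given disk, with the DEBUG prefix and quotes removed.
gpart_cmds_for() {
    disk=$1
    sed -n "s/^DEBUG: zfs_create_diskpart: \(gpart add .* \)\"$disk\"\$/\1$disk/p"
}
```

Usage would be `gpart_cmds_for ada1 < /var/log/bsdinstall_log`, which prints only the lines for that disk.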
-
So, a mini write-up of the actions above for future reference (so it's all in one spot).
Assumptions
pfSense set up with 2 disks in a ZFS mirror, ada0 and ada1 (as seen from the WebUI)
One of the disks in the mirror fails; you can see this if you have the WebUI widget enabled to monitor the disks
You have backed up your config and have a USB key with the install image ready to go in case of issues
You have physically removed the failed disk from the system and replaced it with a new disk of the same size or bigger
Enable SSH access to the firewall via the WebUI, use your favourite client to SSH in and get to the root shell
zpool status
This will show you the status of the zpool mirror. In my case it said the pool was degraded because of one failed disk.
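If you would rather check the state from a script than read it by eye, the state: line can be pulled out of the status output. A tiny sketch; pool_state is a hypothetical helper name.

```shell
#!/bin/sh
# Read `zpool status <pool>` output on stdin and print the word after "state:"
# (for example ONLINE or DEGRADED).
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}
```

Used as `zpool status zroot | pool_state`.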
We create the partition table on the new disk ada1 (change this for the actual disk in the mirror you are replacing)
gpart create -s gpt ada1
The sizes in the following commands are the ones that were used when I installed pfSense on this hardware. To check the exact sizes used on your system, look in the install log, /var/log/bsdinstall_log.
Example:
[23.01-RELEASE][admin@pfSense.localdomain]/var/log: grep "freebsd-boot" bsdinstall_log
DEBUG: zfs_create_diskpart: gpart add -a 4k -l gptboot0 -t freebsd-boot -s 512k "ada0"
DEBUG: zfs_create_diskpart: gpart add -a 4k -l gptboot1 -t freebsd-boot -s 512k "ada1"
[23.01-RELEASE][admin@pfSense.localdomain]/var/log: grep "freebsd-swap" bsdinstall_log
DEBUG: zfs_create_diskpart: gpart add -a 1m -l swap0 -t freebsd-swap -s 34359738368b "ada0"
DEBUG: zfs_create_diskpart: gpart add -a 1m -l swap1 -t freebsd-swap -s 34359738368b "ada1"
[23.01-RELEASE][admin@pfSense.localdomain]/var/log: grep "freebsd-zfs" bsdinstall_log
DEBUG: zfs_create_diskpart: gpart add -a 1m -l zfs0 -t freebsd-zfs "ada0"
DEBUG: zfs_create_diskpart: gpart add -a 1m -l zfs1 -t freebsd-zfs "ada1"
Knowing the sizes, you can continue (swap in the commands found in your own log if your disk or layout is different).
Create boot partition
gpart add -a 4k -l gptboot1 -t freebsd-boot -s 512k ada1
Create swap partition
gpart add -a 1m -l swap1 -t freebsd-swap -s 34359738368b ada1
Create the partition that will actually be added to the zfs mirror
gpart add -a 1m -l zfs1 -t freebsd-zfs ada1
In each case ada1 is the disk that had failed in my system; change it to the one that failed in yours.
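The partitioning steps above can be wrapped so the disk name, label index and swap size are passed in once. A sketch only: the flags are the ones from my install log, the function name is made up, and by default it just prints the commands (set DRY_RUN=0 to really run them).

```shell
#!/bin/sh
# Print each command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Arguments: disk (e.g. ada1), label index (e.g. 1), swap size in bytes.
partition_new_disk() {
    disk=$1 idx=$2 swap_bytes=$3
    run gpart create -s gpt "$disk"
    run gpart add -a 4k -l "gptboot$idx" -t freebsd-boot -s 512k "$disk"
    run gpart add -a 1m -l "swap$idx" -t freebsd-swap -s "${swap_bytes}b" "$disk"
    run gpart add -a 1m -l "zfs$idx" -t freebsd-zfs "$disk"
}
```

For my setup the call would be `partition_new_disk ada1 1 34359738368`; check the printed commands against your own bsdinstall_log before running them for real.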
We can now add this disk (ada1) to the pool.
zpool attach zroot ada0p3 ada1p3
At this point (if everything is OK) all the data will be copied from ada0p3 to ada1p3 through a process called 'resilvering'.
zpool status will show the progress.
Once the resilver is done, you need to add the boot code to the new disk so that either half of the mirror can boot.
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
That is the command I had to run for my setup.
-i 1 is the index of the partition the boot code is written to, and ada1 is the disk it is on.
To check which is the boot partition (it should be 1 in the case of pfSense, but for your own information) you can run gpart show, which lists all the disks and the partitions on each disk.
Once the resilver is done, the pool might still show an error because the failed disk is still listed as a member. In my case I had to issue the command
zpool detach zroot ada1p3
This seems counterintuitive because you have just attached ada1p3. In this case I suspect ZFS knows the original disk has failed and gone, so running the command removed the failed entry and the pool health returned to normal.
Is this the best way of doing it? Possibly not, but it worked for this setup and returned the pool to normal for me; adjust the above commands to fit your own setup.
And if in doubt, if you have a copy of your config and a bootable pfSense install stick, just install the firewall again and recover your config that way.