Another Netgate with storage failure, 6 in total so far

andrew_cb

@w0w
I appreciate your input. My comments below are not targeted at you specifically, but I believe they are helpful for illustrating why I disagree with any "should have known" arguments.

it would be reasonable to assume that people buying it have some understanding of what they're purchasing. However, it seems that the topic of storage has somehow passed by a significant portion of users.

I disagree that it is a reasonable assumption to make. I have been working with firewalls for 20 years and have never had to consider the type of storage medium used. I also believe the purchaser's knowledge of storage types should be irrelevant in this matter.

Looking at the product page for the 6100, the two choices are as follows:

BASE
8GB Memory
16GB Storage

MAX
8GB Memory
128GB Storage

Further down, the storage options are clarified:

Storage: 16 GB eMMC (or optional 128 GB NVMe M.2 SSD)

and

Storage
16 GB eMMC (onboard - soldered)  upgradeable to 128 GB NVMe M.2 SSD with 6100 Max

That is all the store page says with regard to storage.

The rest of the page is filled with performance ratings and all the great things that pfSense can do when using various packages.

Not including the header and footer, there are 1333 words on the page.
411 words, or about 31%, cover the capabilities and benefits of pfSense. A mere 32 words, or about 2%, are in the sentences related to storage.

There is absolutely nothing on the page that:

Indicates that there are any differences between eMMC and regular SSD storage
Indicates that some features/packages require an SSD and are not recommended for use with eMMC storage
Gives endurance ratings for the eMMC and SSD storage to highlight the difference between them
Provides the purchaser with additional information to help inform and guide their purchasing decision

Would you agree that if the choice of which type of storage to get is so critical, it should be significantly more prominent on the page?

We're talking about a complex network device

A major reason for purchasing a pre-built firewall from a vendor is to avoid the hassle and deep knowledge involved with building a custom device. Firewalls are a commodity item nowadays, and other firewall vendors can do IDS and IPS for years without storage failures. I have seen many 10+ year old Sonicwall and Sophos firewalls do this without any issues.

If we revisit jwt's statements regarding storage media:

The principle difference between eMMC and NVMe or SSD device is the amount of flash present on a typical eMMC .vs SSD or NVMe drive.
Larger devices have more sectors and as a direct result, can engage "wear leveling" algorithms in the controller to spread the erase cycles across more sectors.
Larger devices also cost more, due largely to market dynamics.
Used within its limitations, eMMC is a good solution. Your phone likely has eMMC inside it. Many network devices, even from companies such as Cisco and HP/Juniper have eMMC inside them for storage.
our [high] level of effort and engagement with Silicom

Which we can reduce down to:

No major difference between eMMC and NVMe storage other than capacity
Larger storage devices can wear-level better
Larger storage devices cost more
Netgate works closely with Silicom on the hardware that is used in their devices

Taking the above into consideration, in the absence of any stated warnings, cautions, limitations, recommendations, or disclaimers, a purchaser should be able to trust that what the vendor is offering is capable of performing the advertised functions.

Why should a purchaser or user be concerned about the difference when Netgate itself is arguing that eMMC storage is just as good as NVMe storage and makes no effort to distinguish the two other than by capacity?

The product page of the 1100 describes it as

the ideal microdevice for the home and small office network

It does not sound like the target market for the 1100 is people with many years of storage technology and Unix filesystem knowledge.
Yet the 1100, which is only available with eMMC storage and cannot be upgraded to an SSD, lists all the exact same pfSense features as the 8300 MAX.

But how can that be? Is it possible that there are some inaccuracies or that important information has been forgotten on the product pages?

w0w

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

I disagree that it is a reasonable assumption to make. I have been working with firewalls for 20 years and have never had to consider the type of storage medium used. I also believe the purchaser's knowledge of storage types should be irrelevant in this matter.

I don't have extensive experience with various firewalls, but I've come across cases on Reddit where Sophos internal storage failed, and even on forums, there were reports of failures with Cisco's FTD. I don't know the failure rate of such devices, but their price range is significantly higher. I'm not justifying anyone, but shit happens.

It also probably depends on usage conditions, settings, and many other factors.

Larger devices have more sectors and as a direct result, can engage "wear leveling" algorithms in the controller to spread the erase cycles across more sectors.

I would also note that if the minimum eMMC size were 16GB, we probably wouldn't be having this discussion right now.

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

Used within its limitations, eMMC is a good solution. Your phone likely has eMMC inside it.

Actually, eMMC is on its way out of phones. UFS 3.1 is the next level. But this is a bit off topic.

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

The product page of the 1100 describes it as

the ideal microdevice for the home and small office network
It does not sound like the target market for the 1100 is people with many years of storage technology and Unix filesystem knowledge.
Yet the 1100, which is only available with eMMC storage and cannot be upgraded to an SSD, lists all the exact same pfSense features as the 8300 MAX.

But how can that be? Is it possible that there are some inaccuracies or that important information has been forgotten on the product pages?

You can include it in the product description, but that falls under marketing.

And today's marketing trend is: never tell the customer something they didn't ask about.

Documentation, however, should probably contain footnotes and explanations. Or, as I already mentioned, perhaps every setting or checkbox that could potentially generate a large number of logs should have a footnote or a note for users explaining the consequences.

andrew_cb

@w0w said in Another Netgate with storage failure, 6 in total so far:

I would also note that if the minimum eMMC size were 16GB, we probably wouldn't be having this discussion right now.

I think you meant to say "if the minimum eMMC size were NOT 16GB, we probably wouldn't be having this discussion right now."
And I agree - our 7100s, which come with 32GB of eMMC, seem to last about twice as long as our 4100s and 6100s. Silicom offers larger eMMC sizes on several models, so just increasing the minimum eMMC to 32 or 64GB would likely significantly reduce this problem.
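
As a rough illustration of why doubling the eMMC size could roughly double its life at the same write rate, here is a back-of-the-envelope sketch. The program/erase cycle count, the write amplification factor, and the 20 GB/day write load are assumed placeholder values, not figures from Netgate, Silicom, or any datasheet, and estimate_lifetime_years is a hypothetical helper.

```php
<?php
// Rough flash endurance estimate. All constants are illustrative assumptions,
// not vendor specifications.
function estimate_lifetime_years(
    float $capacityGb,      // flash capacity in GB
    float $hostWritesGbDay, // host writes per day in GB
    int $peCycles = 3000,   // assumed program/erase cycles per cell
    float $writeAmp = 3.0   // assumed write amplification factor
): float {
    // Total host data the flash can absorb before it is worn out
    $enduranceGb = $capacityGb * $peCycles / $writeAmp;
    return $enduranceGb / ($hostWritesGbDay * 365);
}

// Same assumed write load on different eMMC sizes
foreach ([8, 16, 32, 64] as $gb) {
    printf("%2d GB eMMC: ~%.1f years at 20 GB/day\n", $gb, estimate_lifetime_years($gb, 20));
}
```

Whatever the real constants are, the estimate scales linearly with capacity in this model, which would fit the observation that the 32GB units outlast the 16GB ones by about a factor of two.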

Actually, eMMC is on its way out of phones. UFS 3.1 is the next level. But this is a bit off topic.

That is interesting to know!

You can include it in the product description, but that falls under marketing.

And today's marketing trend is: never tell the customer something they didn't ask about.

This is the #1 issue causing this whole problem: there is a lack of any useful information, yet when the storage fails, everyone is quick to blame the user for not knowing.

Documentation, however, should probably contain footnotes and explanations. Or, as I already mentioned, perhaps every setting or checkbox that could potentially generate a large number of logs should have a footnote or a note for users explaining the consequences.

I completely agree. I think both you and I have mentioned this several times.

SteveITS

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

I think you meant to say "if the minimum eMMC size were NOT 16GB

The 1100 and 2100 base units have 8 GB.

w0w

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

I think you meant to say "if the minimum eMMC size were NOT 16GB, we probably wouldn't be having this discussion right now."

Exactly!
I would even rephrase it to say that 32GB would likely be the minimum sufficient for something else to fail first, such as the power supply.

w0w

emmc_health.widget.php

<?php
require_once("functions.inc");
require_once("guiconfig.inc");

// Function to retrieve eMMC health data
function get_emmc_health() {
    $cmd = "/usr/local/bin/mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'";
    $output = shell_exec($cmd);
    
    if (!$output) {
        return ["status" => "error", "message" => "Failed to retrieve eMMC health data."];
    }
    
    preg_match('/LIFE_A\s+:\s+(0x[0-9A-F]+)/i', $output, $matchA);
    preg_match('/LIFE_B\s+:\s+(0x[0-9A-F]+)/i', $output, $matchB);
    
    $lifeA = isset($matchA[1]) ? hexdec($matchA[1]) * 10 : null;
    $lifeB = isset($matchB[1]) ? hexdec($matchB[1]) * 10 : null;
    
    if (is_null($lifeA) || is_null($lifeB)) {
        return ["status" => "error", "message" => "Invalid eMMC health data."];
    }
    
    return ["status" => "ok", "lifeA" => $lifeA, "lifeB" => $lifeB];
}

$data = get_emmc_health();

// Determine color class based on wear level
function get_color_class($value) {
    if ($value < 70) {
        return "success"; // Green
    } elseif ($value < 90) {
        return "warning"; // Yellow
    } else {
        return "danger"; // Red
    }
}

// Send email notification if wear level is critical
function send_emmc_alert($lifeA, $lifeB) {
    global $config;
    
    $subject = "[pfSense] eMMC Wear Level Warning";
    $message = "Warning: eMMC wear level is high!\n\n" .
               "Life A: {$lifeA}%\nLife B: {$lifeB}%\n\n" .
               "Consider replacing the storage device.";
    
    if ($lifeA >= 90 || $lifeB >= 90) {
        notify_via_smtp($subject, $message);
    }
}

if ($data["status"] === "ok") {
    send_emmc_alert($data["lifeA"], $data["lifeB"]);
}
?><div class="panel panel-default">
    <div class="panel-heading">
        <h3 class="panel-title">eMMC Disk Health</h3>
    </div>
    <div class="panel-body">
        <?php if ($data["status"] === "error"): ?>
            <div class="alert alert-danger"><?php echo $data["message"]; ?></div>
        <?php else: ?>
            <table class="table">
                <tr>
                    <th>Life A</th>
                    <td class="bg-<?php echo get_color_class($data['lifeA']); ?>"> <?php echo $data['lifeA']; ?>%</td>
                </tr>
                <tr>
                    <th>Life B</th>
                    <td class="bg-<?php echo get_color_class($data['lifeB']); ?>"> <?php echo $data['lifeB']; ?>%</td>
                </tr>
            </table>
        <?php endif; ?>
    </div>
</div>

Place the Widget File

Make sure your widget file (e.g., emmc_health.widget.php) is located in:

/usr/local/www/widgets/widgets/

Register the Widget in widgets/widgets.inc

Edit the file:

/usr/local/www/widgets/widgets.inc

Add the following line to register the widget:

$widgets["emmc_health"] = "eMMC Disk Health";

This ensures the widget appears in the dashboard widget selection menu.

Ensure Permissions

Run the following command to set the correct permissions:

chmod 644 /usr/local/www/widgets/widgets/emmc_health.widget.php

Reload the Dashboard

Go to Status → Dashboard in the pfSense web UI.

Click on "+" (Add Widget) at the top-right.

Find "eMMC Disk Health" in the list and add it.

Verify the Widget

Ensure that the widget loads correctly and displays the expected values.

I don't know if this will work, but this is the code that ChatGPT put together with me in 15 minutes.

andrew_cb

@w0w Thanks for doing this!

I tried out the script and it needed a few modifications to make it work for me. I also added a function to automatically install mmc-utils if needed.
The widgets.inc file does not need to be modified; pfSense will automatically pick up the widget as long as the file name ends with '.widget.php'.

Here are the revised instructions:

Code for emmc_health.widget.php:

<?php
require_once("functions.inc");
require_once("guiconfig.inc");

// Function to retrieve eMMC health data
function get_emmc_health() {

    $cmd = "/usr/local/sbin/mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'";
    $output = shell_exec($cmd);
    
    if (!$output) {
        return ["status" => "error", "message" => "Failed to retrieve eMMC health data."];
    }

    // Explode the output into separate lines
    $outputArray = explode("\n", $output);
   
    // Get the value of 'TYP_A' (SLC) wear
    preg_match('/.*TYP_A]:\s+(0x[0-9A-F]+)/i', $outputArray[0], $matchA);
    // Get the value of 'TYP_B' (MLC) wear
    preg_match('/.*TYP_B]:\s+(0x[0-9A-F]+)/i', $outputArray[1], $matchB);
    
    // Convert the wear values from hex to decimal
    $lifeA = isset($matchA[1]) ? hexdec($matchA[1]) * 10 : null;
    $lifeB = isset($matchB[1]) ? hexdec($matchB[1]) * 10 : null;
    
    if (is_null($lifeA) || is_null($lifeB)) {
        return ["status" => "error", "message" => "Invalid eMMC health data."];
    }
    
    return ["status" => "ok", "lifeA" => $lifeA, "lifeB" => $lifeB];
}

// Determine color class based on wear level
function get_color_class($value) {
    if ($value < 70) {
        return "success"; // Green
    } elseif ($value < 90) {
        return "warning"; // Yellow
    } else {
        return "danger"; // Red
    }
}

// Send email notification if wear level is critical
function send_emmc_alert($lifeA, $lifeB) {
    global $config;
    
    $subject = "[pfSense] eMMC Wear Level Warning";
    $message = "Warning: eMMC wear level is high!\n\n" .
               "Life A: {$lifeA}%\nLife B: {$lifeB}%\n\n" .
               "Consider replacing the storage device.";
    
    if ($lifeA >= 90 || $lifeB >= 90) {
        notify_via_smtp($subject, $message);
    }
}

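// Check for the mmc-utils binary and install it if missing.
// Returns null on success or an error array on failure.
function install_mmc_utils() {
    if (file_exists("/usr/local/sbin/mmc")) {
        return null;
    }
    // exec() fills $output with the command output and $code with the exit status
    exec("pkg install -y mmc-utils", $output, $code);
    if ($code !== 0) {
        return ["status" => "error", "message" => "Failed to install mmc-utils."];
    }
    return null;
}

// Main program logic
// Make sure mmc-utils is installed, then get the eMMC health data
$data = install_mmc_utils() ?? get_emmc_health();

// If the health data was read successfully, send an email notification
// when the wear level is critical (the threshold check is in send_emmc_alert)
if ($data["status"] === "ok") {
    send_emmc_alert($data["lifeA"], $data["lifeB"]);
}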

// Format the data into HTML for display in the widget
?><div class="panel panel-default">
    <div class="panel-heading">
        <h3 class="panel-title">eMMC Disk Health</h3>
    </div>
    <div class="panel-body">
        <?php if ($data["status"] === "error"): ?>
            <div class="alert alert-danger"><?php echo $data["message"]; ?></div>
        <?php else: ?>
            <table class="table">
                <tr>
                    <th>Type A Wear (Lower is better)</th>
                    <td class="bg-<?php echo get_color_class($data['lifeA']); ?>"> <?php echo $data['lifeA']; ?>%</td>
                </tr>
                <tr>
                    <th>Type B Wear (Lower is better)</th>
                    <td class="bg-<?php echo get_color_class($data['lifeB']); ?>"> <?php echo $data['lifeB']; ?>%</td>
                </tr>
            </table>
        <?php endif; ?>
    </div>
</div>

Navigate to Diagnostics > File Editor.
Paste the code for emmc_health.widget.php (above) into the editor.
Paste the following path into the Path to file to be edited box and select Save (the file will automatically be created):

/usr/local/www/widgets/widgets/emmc_health.widget.php

Navigate to Diagnostics > Command Prompt and run the following command to set the file permissions:

chmod 644 /usr/local/www/widgets/widgets/emmc_health.widget.php

Navigate to Status > Dashboard.
Select the "+" button from the top-right.
Select Emmc Health from the list.
The Emmc Health widget will be added to the bottom of the page. Move it up top so it is easily visible.
Select the Save button at the top-right to save the dashboard layout.
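
To check the parsing step on its own before installing the widget, a small stand-alone sketch like the following can be run from a shell with the php command. The sample text follows the line format that mmc-utils prints for mmc extcsd read as far as I know it, but the values are made up, and parse_emmc_wear is a hypothetical helper rather than part of the widget above.

```php
<?php
// Sample lines in the format printed by `mmc extcsd read` (values invented).
$sample = <<<TXT
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x03
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x0b
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
TXT;

// Extract TYP_A / TYP_B from the whole output, independent of line order.
// Returns null if either value is missing.
function parse_emmc_wear(string $output): ?array {
    if (!preg_match('/TYP_A\]:\s+0x([0-9a-f]+)/i', $output, $a) ||
        !preg_match('/TYP_B\]:\s+0x([0-9a-f]+)/i', $output, $b)) {
        return null;
    }
    // Each step of the register stands for 10% of the estimated life used
    return ["lifeA" => hexdec($a[1]) * 10, "lifeB" => hexdec($b[1]) * 10];
}

print_r(parse_emmc_wear($sample));
```

A value of 0x0b comes out as 110%, which corresponds to a device that has exceeded its estimated life.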

stephenw10

Probably want some way to limit or suppress the number of alerts/emails. Those values never go back so you could end up with.... a lot!

You might also argue that since it only does it when opening the dashboard an alert shown there might be better. Or maybe both.
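
One way to keep the number of alerts down, sketched here under assumptions: since the wear values only ever go up and are reported in 10% steps, an alert could be sent only when the value is higher than the one last alerted on. The should_alert helper and the state file location are hypothetical; on a real system the state file would need to live somewhere sensible.

```php
<?php
// Alert only when the wear level is at or above the threshold AND higher
// than the level that was last alerted on. $stateFile is a hypothetical path.
function should_alert(int $wearPercent, string $stateFile, int $threshold = 90): bool {
    if ($wearPercent < $threshold) {
        return false;
    }
    $lastAlerted = is_file($stateFile) ? (int) file_get_contents($stateFile) : 0;
    if ($wearPercent <= $lastAlerted) {
        return false; // already alerted at this level
    }
    // At most one small write per 10% step of wear
    file_put_contents($stateFile, (string) $wearPercent);
    return true;
}

$state = sys_get_temp_dir() . "/emmc_last_alert_demo";
@unlink($state);
var_dump(should_alert(80, $state));  // bool(false): below threshold
var_dump(should_alert(90, $state));  // bool(true): first time at 90%
var_dump(should_alert(90, $state));  // bool(false): same level again
var_dump(should_alert(100, $state)); // bool(true): level increased
```

With this approach the state file is written only when an alert actually goes out, so the alerting itself adds almost no wear.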

andrew_cb

@stephenw10 said in Another Netgate with storage failure, 6 in total so far:

Probably want some way to limit or suppress the number of alerts/emails. Those values never go back so you could end up with.... a lot!

You might also argue that since it only does it when opening the dashboard an alert shown there might be better. Or maybe both.

Good suggestions!
I was already thinking of using a temp file to store the health data and only updating it when it is older than a certain age. A similar thing could be done to set a flag/rate limiter for alerting.

Ideally, the health check would run as a cron job and store the latest data in a file so that it works in the background, and then the dashboard would read the file instead of having to run the check every time the dashboard is loaded.
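
A minimal sketch of the temp-file idea, assuming a JSON cache file and a maximum age in seconds. The get_cached_health helper, the cache path, and the stand-in check function are all hypothetical; in the widget, the refresh callback would be the function that runs mmc.

```php
<?php
// Return cached health data if the cache file is fresh enough; otherwise
// call $refresh (e.g. a function that runs mmc) and rewrite the cache.
function get_cached_health(string $cacheFile, int $maxAgeSeconds, callable $refresh): array {
    clearstatcache(true, $cacheFile);
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }
    $data = $refresh();
    file_put_contents($cacheFile, json_encode($data));
    return $data;
}

// Demo with a stand-in check instead of a real mmc call
$calls = 0;
$fakeCheck = function () use (&$calls) {
    $calls++;
    return ["status" => "ok", "lifeA" => 20, "lifeB" => 10];
};
$cache = sys_get_temp_dir() . "/emmc_health_cache_demo.json";
@unlink($cache);
get_cached_health($cache, 3600, $fakeCheck);
$data = get_cached_health($cache, 3600, $fakeCheck);
echo "checks run: {$calls}, lifeA: {$data['lifeA']}%\n";
```

The second call is served from the cache, so the check runs only once per hour here. If a cron job did the refreshing, the dashboard would only ever take the cached branch.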

dennypage

@stephenw10 said in Another Netgate with storage failure, 6 in total so far:

Probably want some way to limit or suppress the number of alerts/emails. Those values never go back so you could end up with.... a lot!

Each of which will trigger a write...

w0w

@dennypage

Yes, you are right.
This was just a sample to start with.
Here is another idea:

<?php
require_once("functions.inc");
require_once("guiconfig.inc");

// Path for the timestamp file to limit email notifications
const NOTIFY_TIMESTAMP_FILE = "/var/db/emmc_health_notify_time";
const NOTIFY_INTERVAL = 2592000; // 30 days in seconds

// Function to retrieve eMMC health data
function get_emmc_health() {
    $cmd = "/usr/local/bin/mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'";
    $output = shell_exec($cmd);
    
    if (!$output) {
        return ["status" => "error", "message" => "Failed to retrieve eMMC health data."];
    }
    
    preg_match('/LIFE_A\s+:\s+(0x[0-9A-F]+)/i', $output, $matchA);
    preg_match('/LIFE_B\s+:\s+(0x[0-9A-F]+)/i', $output, $matchB);
    
    $lifeA = isset($matchA[1]) ? hexdec($matchA[1]) * 10 : null;
    $lifeB = isset($matchB[1]) ? hexdec($matchB[1]) * 10 : null;
    
    if (is_null($lifeA) || is_null($lifeB)) {
        return ["status" => "error", "message" => "Invalid eMMC health data."];
    }
    
    return ["status" => "ok", "lifeA" => $lifeA, "lifeB" => $lifeB];
}

$data = get_emmc_health();

// Determine color class based on wear level
function get_color_class($value) {
    if ($value < 70) {
        return "success"; // Green
    } elseif ($value < 90) {
        return "warning"; // Yellow
    } else {
        return "danger"; // Red
    }
}

// Check if email notification should be sent
function should_send_email() {
    if (!file_exists(NOTIFY_TIMESTAMP_FILE)) {
        return true;
    }
    $last_sent = file_get_contents(NOTIFY_TIMESTAMP_FILE);
    return (time() - (int)$last_sent) > NOTIFY_INTERVAL;
}

// Send email notification if wear level is critical
function send_emmc_alert($lifeA, $lifeB) {
    global $config;
    
    if (!should_send_email()) {
        return;
    }
    
    $subject = "[pfSense] eMMC Wear Level Warning";
    $message = "Warning: eMMC wear level is high!\n\n" .
               "Life A: {$lifeA}%\nLife B: {$lifeB}%\n\n" .
               "Consider replacing the storage device.";
    
    if ($lifeA >= 90 || $lifeB >= 90) {
        notify_via_smtp($subject, $message);
        file_put_contents(NOTIFY_TIMESTAMP_FILE, time()); // Update last sent time
    }
}

// Ensure that email is sent only when eMMC is the boot disk and no RAM disk is used
function is_valid_environment() {
    if (file_exists("/etc/rc.ramdisk")) {
        return false; // RAM disk is enabled
    }
    $boot_disk = trim(shell_exec("mount | grep 'on / ' | awk '{print $1}'"));
    return strpos($boot_disk, "mmcsd") !== false; // Ensure eMMC is the boot device
}

if ($data["status"] === "ok" && is_valid_environment()) {
    send_emmc_alert($data["lifeA"], $data["lifeB"]);
}
?><div class="panel panel-default">
    <div class="panel-heading">
        <h3 class="panel-title">eMMC Disk Health</h3>
    </div>
    <div class="panel-body">
        <?php if ($data["status"] === "error"): ?>
            <div class="alert alert-danger"><?php echo $data["message"]; ?></div>
        <?php else: ?>
            <table class="table">
                <tr>
                    <th>Life A</th>
                    <td class="bg-<?php echo get_color_class($data['lifeA']); ?>"> <?php echo $data['lifeA']; ?>%</td>
                </tr>
                <tr>
                    <th>Life B</th>
                    <td class="bg-<?php echo get_color_class($data['lifeB']); ?>"> <?php echo $data['lifeB']; ?>%</td>
                </tr>
            </table>
        <?php endif; ?>
    </div>
</div>

You can send it once a month. You can skip sending if eMMC is no longer the primary storage or if RAM disks are being used… Well, I don't need to explain to an experienced programmer how such issues can be handled. You could even store this data and the lock file for sending alerts on your own RAM disk.

<?php
require_once("functions.inc");
require_once("guiconfig.inc");

// Define RAM disk path and ensure it exists
const RAMDISK_PATH = "/mnt/health/emmc_health_notify_time";
const RAMDISK_MOUNT_POINT = "/mnt/health";
const NOTIFY_INTERVAL = 2592000; // 30 days in seconds

// Function to set up RAM disk if not already mounted
function setup_ramdisk() {
    if (!is_dir(RAMDISK_MOUNT_POINT)) {
        mkdir(RAMDISK_MOUNT_POINT, 0777, true);
    }
    
    $mounted = trim(shell_exec("mount | grep ' " . RAMDISK_MOUNT_POINT . " '"));
    
    if (!$mounted) {
        shell_exec("mdmfs -s 100M md " . RAMDISK_MOUNT_POINT);
    }
}

// Function to retrieve eMMC health data
function get_emmc_health() {
    $cmd = "/usr/local/bin/mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'";
    $output = shell_exec($cmd);
    
    if (!$output) {
        return ["status" => "error", "message" => "Failed to retrieve eMMC health data."];
    }
    
    preg_match('/LIFE_A\s+:\s+(0x[0-9A-F]+)/i', $output, $matchA);
    preg_match('/LIFE_B\s+:\s+(0x[0-9A-F]+)/i', $output, $matchB);
    
    $lifeA = isset($matchA[1]) ? hexdec($matchA[1]) * 10 : null;
    $lifeB = isset($matchB[1]) ? hexdec($matchB[1]) * 10 : null;
    
    if (is_null($lifeA) || is_null($lifeB)) {
        return ["status" => "error", "message" => "Invalid eMMC health data."];
    }
    
    return ["status" => "ok", "lifeA" => $lifeA, "lifeB" => $lifeB];
}

$data = get_emmc_health();

// Determine color class based on wear level
function get_color_class($value) {
    if ($value < 70) {
        return "success"; // Green
    } elseif ($value < 90) {
        return "warning"; // Yellow
    } else {
        return "danger"; // Red
    }
}

// Check if email notification should be sent
function should_send_email() {
    if (!file_exists(RAMDISK_PATH)) {
        return true;
    }
    $last_sent = file_get_contents(RAMDISK_PATH);
    return (time() - (int)$last_sent) > NOTIFY_INTERVAL;
}

// Send email notification if wear level is critical
function send_emmc_alert($lifeA, $lifeB) {
    global $config;
    
    if (!should_send_email()) {
        return;
    }
    
    $subject = "[pfSense] eMMC Wear Level Warning";
    $message = "Warning: eMMC wear level is high!\n\n" .
               "Life A: {$lifeA}%\nLife B: {$lifeB}%\n\n" .
               "Consider replacing the storage device.";
    
    if ($lifeA >= 90 || $lifeB >= 90) {
        notify_via_smtp($subject, $message);
        file_put_contents(RAMDISK_PATH, time()); // Update last sent time on RAM disk
    }
}

// Ensure that email is sent only when eMMC is the boot disk and no RAM disk is used
function is_valid_environment() {
    if (file_exists("/etc/rc.ramdisk")) {
        return false; // RAM disk is enabled
    }
    $boot_disk = trim(shell_exec("mount | grep 'on / ' | awk '{print $1}'"));
    return strpos($boot_disk, "mmcsd") !== false; // Ensure eMMC is the boot device
}

// Set up RAM disk if necessary
setup_ramdisk();

if ($data["status"] === "ok" && is_valid_environment()) {
    send_emmc_alert($data["lifeA"], $data["lifeB"]);
}
?><div class="panel panel-default">
    <div class="panel-heading">
        <h3 class="panel-title">eMMC Disk Health</h3>
    </div>
    <div class="panel-body">
        <?php if ($data["status"] === "error"): ?>
            <div class="alert alert-danger"><?php echo $data["message"]; ?></div>
        <?php else: ?>
            <table class="table">
                <tr>
                    <th>Life A</th>
                    <td class="bg-<?php echo get_color_class($data['lifeA']); ?>"> <?php echo $data['lifeA']; ?>%</td>
                </tr>
                <tr>
                    <th>Life B</th>
                    <td class="bg-<?php echo get_color_class($data['lifeB']); ?>"> <?php echo $data['lifeB']; ?>%</td>
                </tr>
            </table>
        <?php endif; ?>
    </div>
</div>

andrew_cb

Someone reported a dead 4200 today, killed by ntopng in 10 months. The user was unaware of any risks from running ntopng on 16GB of eMMC, and there is no way to monitor the eMMC on the 4200. Luckily the device is still under warranty, so it's being replaced under RMA.

https://www.reddit.com/r/PFSENSE/s/fzeuC0icCQ

Mission-Ghost

Based on what I've learned from this thread, I added a 256GB Samsung SSD to my 4200 today, replacing the built-in drive, and it's working fine. The Netgate instructions had me hopping around from place to place in the documentation, but they did the job.

I don't want foreseeable future problems, so thank you to everyone who contributed here. Hopefully this will lead to a longer life than this box might otherwise have had.

andrew_cb

@Mission-Ghost I am glad you found this thread useful. A 256GB SSD should last a long time!

andrew_cb

One thing that has always stood out to me about my data has been the 8 devices with average write rates below 50KBps.

Today I checked our devices and confirmed that those 8 outliers are all running UFS and everything else is using ZFS.
Compared to the highest UFS rate, the ZFS rate is from 2.5x to 7.5x higher.

I also looked at some of the devices that have high storage wear. They are in smallish offices and are just doing basic functions. The only packages installed are Zabbix Agent and Zabbix Proxy. A few had the logging enabled for the default rules so I turned those off.

I tried to find a reason why all the devices using ZFS have such high average writes compared to the devices using UFS, but could find no explanation. We use a standardized configuration and nearly all devices are low-load, and just have the Zabbix packages. On most, the log entries for each category fit within the default 500 events shown. I copied a day's worth of general system log events into a text file - it was 38KB.
I went so far as to raise the update interval from 1 minute to 5 minutes for nearly all items in the Zabbix template, but that made no difference.

300KB/sec is 18MB/min, 1.1GB/hour, about 26GB/day, 9.4TB/year, 18.8TB/2 years, 28.2TB/3 years. This is in the ballpark for the maximum write life of the storage. No wonder we are seeing so many failures at the 2-3 year mark!

Comparatively, a device doing 50KB/sec would be at 4.7TB after 3 years and 9.4TB after 6 years.
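
The arithmetic above can be reproduced with a few lines of PHP. This is only a sketch: it assumes decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes) and a constant write rate, and tb_per_year is a hypothetical helper name.

```php
<?php
// Convert a sustained write rate in KB/s into TB written per year,
// using decimal units (1 KB = 1000 bytes, 1 TB = 10^12 bytes).
function tb_per_year(float $kbPerSec): float {
    $secondsPerYear = 60 * 60 * 24 * 365;
    return $kbPerSec * 1000 * $secondsPerYear / 1e12;
}

foreach ([50, 300, 1000] as $rate) {
    $perYear = tb_per_year($rate);
    printf("%4d KB/s: %.2f TB/year, %.1f TB in 3 years\n", $rate, $perYear, $perYear * 3);
}
```

For 300 KB/s this gives about 9.46 TB/year, and 50 KB/s takes six years to write the same amount, in line with the figures above.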

This could explain why our older 3100 and 7100 units on UFS have lasted 6-7 years and the eMMC is still in good health, meanwhile we have many 4100 that have failed or are near death in only 2 years.

In his thread "eMMC Write endurance", @keyser noted:

With ZFS, pfBlockerNG in default config with only 4 feeds loaded and NTopNG running, my box averages about 1 MB/s sustained write to the SSD.

I am only 700KBps less (300KBps vs 1000KBps) yet am not running pfblockerng or ntopng.

I will need to dig in deeper with iostat, top, and systat to try and find the cause of the writes. At this point it would appear that ZFS itself is the major cause of the increased write activity compared to UFS.

andrew_cb

@stephenw10 said in Another Netgate with storage failure, 6 in total so far:

Hmm, not sure why the pkg isn't in the CE repo. I guess there wasn't much call for it at the time. Seems like we could add that pretty easily. Let me see....

Did you have any luck getting mmc-utils added to the CE repo?

fireodo

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

I will need to dig in deeper with iostat, top, and systat to try and find the cause of the writes.

Hi,

I got a reduction from ~19 GB written/day to 1.8 GB written/day by using these settings:

zfs set sync=disabled zroot/tmp (pfSense/tmp)
zfs set sync=disabled zroot/var (pfSense/var) (after reviewing my settings, I saw that I had set it to disabled)

and fine tuning:

vfs.zfs.txg.timeout=120

(The ZFS pool in my case is "zroot"; current systems use "pfSense".)

Remark: this is a private system for private use.
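
To get a feel for what the txg timeout change does, the following sketch compares the maximum number of timer-driven transaction group commits per day at two timeout values. It assumes the usual OpenZFS default of 5 seconds for vfs.zfs.txg.timeout and ignores commits that are forced early by sync writes or by a large amount of dirty data; max_timed_txg_per_day is a hypothetical helper.

```php
<?php
// Upper bound on timer-driven ZFS transaction group commits per day for a
// given vfs.zfs.txg.timeout (in seconds). Commits forced early by sync
// writes or by a large amount of dirty data are not modelled here.
function max_timed_txg_per_day(int $timeoutSeconds): int {
    return intdiv(86400, $timeoutSeconds);
}

$default = max_timed_txg_per_day(5);   // assumed default timeout of 5 s
$tuned   = max_timed_txg_per_day(120); // value from the settings above
printf("default: %d/day, tuned: %d/day (%dx fewer)\n", $default, $tuned, intdiv($default, $tuned));
```

The flip side is that up to 120 seconds of buffered asynchronous writes can be lost on a power failure instead of up to 5.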

w0w

@fireodo
A wonderful idea and discovery! It seems quite reasonable not to synchronize the tmp folder and to use a 2 minute delay for transaction writes. A good alternative to RAM disks if they cannot be used for some reason.

fireodo

@w0w said in Another Netgate with storage failure, 6 in total so far:

2 minutes delay

PS. If you test this, you can set the delay to greater values. The write rate will decrease, but you have a greater risk of losing data when a power failure occurs ... (it reduces the robustness of the ZFS filesystem)

w0w

@fireodo

In the case of a firewall, I think it is acceptable.
Most critical logs should be sent to an external syslog server, and I don't see any risks that could compromise the system. I can't think of any scenarios where this would be critical for pfSense, but I might be wrong. I don't know, but some major updates are also managed by boot environments (BE) and shouldn't be affected.