PD-1569 / 13.0 / Pd 1569 re investigate and refresh disk replace procedures (by DjP-iX) #3298

Merged
merged 5 commits into from
Nov 26, 2024
130 changes: 97 additions & 33 deletions content/CORETutorials/Storage/Disks/DiskReplace.md
@@ -9,18 +9,21 @@ tags:

{{< toc >}}

Hard drives or solid-state drives (SSDs) have a finite lifetime and can fail unexpectedly.
When a disk fails in a Stripe (RAID0) pool, the entire pool has to be recreated and all data restored from backups.
Creating non-stripe storage pools that have disk redundancy is always recommended.
Hard drives and solid-state drives (SSDs) have a finite lifetime and can fail unexpectedly.
When a disk fails in a Stripe (RAID0) pool, you must recreate the entire pool and restore all data from backups.
We always recommend creating non-stripe storage pools that have disk redundancy.

To prevent further loss of redundancy or eventual data loss, always replace a failed disk as soon as possible!
TrueNAS integrates new disks into a pool to restore the pool to full functionality.
To prevent further redundancy loss or eventual data loss, always replace a failed disk as soon as possible!
TrueNAS integrates new disks into a pool to restore it to full functionality.

## Replacing a Disk
{{< hint type=important >}}
TrueNAS requires you to replace a disk with another disk of the same or greater capacity as a failed disk.
You must install the disk in the TrueNAS system.
It should not be part of an existing storage pool.
TrueNAS wipes the data on the replacement disk as part of the process.

Another disk of the same or greater capacity is required to replace a failed disk.
This disk must be installed in the TrueNAS system and not part of an existing storage pool.
Any data on the replacement disk is wiped as part of the process.
Disk replacement automatically triggers a pool resilver.
{{< /hint >}}
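
The requirements in the hint above can be expressed as a small pre-check. This is a minimal sketch, not how TrueNAS implements the check: the `Disk` type, the `replacement_problems` function, and the device names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str              # hypothetical device name, for example "ada2"
    size_bytes: int        # raw capacity in bytes
    in_pool: bool = False  # True if the disk already belongs to a storage pool


def replacement_problems(failed: Disk, candidate: Disk) -> list[str]:
    """Return the reasons a candidate is unsuitable; an empty list means it qualifies."""
    problems = []
    # The replacement must have the same or greater capacity as the failed disk.
    if candidate.size_bytes < failed.size_bytes:
        problems.append("capacity is smaller than the failed disk")
    # The replacement should not be part of an existing storage pool.
    if candidate.in_pool:
        problems.append("disk is already part of a storage pool")
    return problems


failed = Disk("ada2", 4_000_000_000_000)
# Same capacity and not in a pool: no problems reported.
print(replacement_problems(failed, Disk("ada5", 4_000_000_000_000)))
# Smaller and already in a pool: both problems reported.
print(replacement_problems(failed, Disk("ada6", 2_000_000_000_000, in_pool=True)))
```

The sketch only covers the two eligibility rules. It does not model the data wipe or the resilver that the replacement triggers.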

{{< expand "Can I replace a disk in a GELI-encrypted (Legacy) pool?" "v" >}}
Although GELI encryption is deprecated, TrueNAS implements GELI encryption during a "GELI-Encrypted (Legacy) pool" disk replacement. TrueNAS uses GELI encryption for the lifetime of that pool, even after replacement.
@@ -30,45 +33,52 @@ The TrueNAS **Dashboard** shows when a disk failure degrades a pool.

{{< trueimage src="/images/CORE/12.0/DashboardPoolDegraded.png" alt="Degraded Pool" id="Degraded pool on dashboard widget." >}}

Click the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> on the pool card to go to the **Storage > Pools > Pool Status** screen and locate the failed disk.
Click the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> on the pool card to open the **Storage > Pools > Pool Status** screen and locate the failed disk.

### Taking a Failed Disk Offline
{{< expand "My disk is faulted. Should I replace it?" "v" >}}
If a disk shows a faulted state, TrueNAS has detected an issue with that disk and you should replace it.
{{< /expand >}}

To replace a disk in a pool without a hot spare available:

1. [Take the disk offline](#taking-a-failed-disk-offline).
2. [Replace the disk](#replacing-a-failed-disk).
3. Refresh the screen.

Clicking <i class="material-icons" aria-hidden="true" title="Options">more_vert</i> for the failed disk shows additional operations.
To replace a disk in a pool with a hot spare:

{{< trueimage src="/images/CORE/12.0/StoragePoolsStatusDiskFailedOptions.png" alt="Disk Options" id="Pool Status disk options." >}}
1. [Take the disk offline](#taking-a-failed-disk-offline).
2. [Detach the failed disk](#detaching-a-failed-disk) to promote the hot spare.
3. Refresh the screen.
4. [Recreate the hot spare VDEV](#recreating-the-hot-spare).
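
The two procedures above differ only in what follows the offline step. As a rough illustration, the choice can be written as a function of whether the pool has a hot spare. The `replacement_steps` name and the step strings are hypothetical; they summarize the lists above and do not correspond to any TrueNAS API.

```python
def replacement_steps(has_hot_spare: bool) -> list[str]:
    """Return the ordered UI steps for replacing a failed disk."""
    # Both procedures start by taking the failed disk offline.
    steps = ["Take the failed disk offline"]
    if has_hot_spare:
        # Detaching the failed disk promotes the hot spare to a full pool member,
        # so the spare vdev has to be recreated afterwards.
        steps += [
            "Detach the failed disk to promote the hot spare",
            "Refresh the screen",
            "Recreate the hot spare vdev",
        ]
    else:
        steps += [
            "Replace the disk",
            "Refresh the screen",
        ]
    return steps


for number, step in enumerate(replacement_steps(has_hot_spare=True), start=1):
    print(f"{number}. {step}")
```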

We recommend you *offline* the disk before starting the replacement.
This removes the device from the pool and can prevent swap issues. To offline a disk:
## Taking a Failed Disk Offline

Go to **Storage > Pools** screen.

Click on the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> settings icon, and then select **Status** to display the list of disks in the pools.
Go to the **Storage > Pools** screen, click on the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> settings icon, and then select **Status** to open the **Pool Status** screen and display the disks in the pools.

Click the <i class="material-icons" aria-hidden="true" title="Options">more_vert</i> icon for the disk you plan to remove, and then select **Offline**.
Click the <i class="material-icons" aria-hidden="true" title="Options">more_vert</i> icon for the disk you plan to remove and then click **Offline**.

Select **Confirm** to activate the **OFFLINE** button, then click **OFFLINE**. The disk should now be offline.
{{< trueimage src="/images/CORE/13.0/StoragePoolsStatusDiskFailedOptions.png" alt="Disk Options" id="Pool Status disk options" >}}

{{< expand "Can I use a disk that is failing but still active?" "v" >}}
There are some situations where a disk that has not completely failed can be left online to provide additional redundancy during the replacement procedure.
We don't recommend leaving failed disks online unless you know the exact condition of the failing disk!
Attempting to replace a heavily degraded disk without off-lining it first results in a significantly slower replacement process.
{{< /expand >}}
Select **Confirm**, then click **OFFLINE**.

When the disk status shows as **Offline**, physically remove the disk from the system.

{{< trueimage src="/images/CORE/13.0/StoragePoolsStatusOffline.png" alt="Offline Disk" id="Pool Status disk offline" >}}

{{< expand "The offline failed?" "v" >}}
If the offline operation fails with a **Disk offline failed - no valid replicas** message, go to **Storage > Pools**, click the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> for the degraded pool, and select **Scrub Pool**.
When the scrub operation finishes, reopen the pool **Status** and try to offline the disk again.
{{< /expand >}}

When the disk status shows as **Offline**, physically remove the disk from the system.

{{< trueimage src="/images/CORE/12.0/StoragePoolsStatusOffline.png" alt="Offline Disk" id="Pool Status disk offline." >}}
## Replacing a Failed Disk

If the replacement disk is not already physically added to the system, add it now.
If replacing the failed disk that you have taken offline and removed, insert the replacement disk now.
If replacing a failed disk with an available disk in the system, proceed to the next step.

### Bringing a New Disk Online

In the **Pool Status**, open the options for the offline disk and click **Replace**
In the **Pool Status** screen, open the options for the offline disk and click **Replace**.

{{< trueimage src="/images/CORE/12.0/StoragePoolsStatusDiskReplace.png" alt="Replacing Disk" id="Replacing disk screen." >}}

@@ -77,16 +87,70 @@ The new disk must have the same or greater capacity as the disk you are replacin
The replacement fails when the chosen disk has partitions or data present.
To destroy any data on the replacement disk and allow the replacement to continue, set the **Force** option.

When the disk wipe completes and TrueNAS starts replacing the failed disk, the **Pool Status** changes to show the in-progress replacement.
When the disk wipe completes and TrueNAS starts replacing the failed disk, the **Pool Status** screen changes to show the in-progress replacement.

{{< trueimage src="/images/CORE/12.0/StoragePoolsStatusReplaceStart.png" alt="Replacing Started" id="Pool Status replacing disk." >}}

TrueNAS resilvers the pool during the replacement process.
For pools with large amounts of data, resilvering can take a long time.
When the resilver completes, the pool status screen updates to show the new disk, and the pool status returns to **Online**.

When the resilver completes, the **Pool Status** screen updates to show the new disk, and the pool status returns to **Online**.

{{< trueimage src="/images/CORE/12.0/StoragePoolsStatusReplaceComplete.png" alt="Replacement Complete" id="Pool Status disk replacement complete." >}}

{{< taglist tag="corerecovery" limit="10" >}}
## Replacing a Failed Disk with a Hot Spare

A **Hot Spare** vdev holds drives in reserve to guard against larger pool and data loss scenarios. TrueNAS automatically inserts an available hot spare into a **Data** vdev when an active drive fails.
The pool resilvers after the hot spare is activated.

To replace a disk in a pool with a hot spare:

1. [Take the disk offline](#taking-a-failed-disk-offline).
2. [Detach the failed disk](#detaching-a-failed-disk) to promote the hot spare.
3. Refresh the screen.
4. [Recreate the hot spare VDEV](#recreating-the-hot-spare).

### Detaching a Failed Disk

Go to the **Storage > Pools** screen, click on the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> settings icon, and then select **Status** to open the **Pool Status** screen and display the disks in the pools.

After you take the failed disk offline and remove it from the system, the disk status changes to **REMOVED** and the disk name displays the gptid.

{{< trueimage src="/images/CORE/13.0/StoragePoolsStatusHotSpareActive.png" alt="Disk Removed - Hot Spare Active" id="Disk Removed - Hot Spare Active" >}}

Click the <i class="material-icons" aria-hidden="true" title="Options">more_vert</i> icon for the removed disk and then click **Detach**.

Select **Confirm**, then click **DETACH**.
TrueNAS detaches the disk from the pool and promotes the hot spare disk to a full member of the pool.

### Recreating the Hot Spare

After promoting the hot spare, recreate the **Spare** vdev and assign a disk to it.

{{< expand "Do I really need to promote the hot spare and then recreate the spare vdev?" "v" >}}
If you have a hot spare inserted into the pool and then follow the instructions in [Replacing a Failed Disk](#replacing-a-failed-disk), TrueNAS automatically returns the hot spare disk to the existing **Spare** vdev and **ONLINE** status.

However, we do not recommend this method, because it causes two resilver events: one when activating the hot spare and again when replacing the failed disk.
Resilvering degrades system performance until completed and causes unnecessary strain on the disk.

To avoid unnecessary resilvers, [promote the hot spare](#detaching-a-failed-disk), then recreate the hot spare vdev.
{{< /expand >}}
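
The trade-off described above comes down to how many resilver events each method causes. The sketch below counts them under the assumptions stated in this section: one resilver when the hot spare activates, and a second one only if the failed disk is then replaced in place. The `resilver_events` function and its `method` values are hypothetical names used for illustration.

```python
def resilver_events(method: str) -> list[str]:
    """List the resilver events for a pool whose hot spare has already activated."""
    # Activating the hot spare resilvers the pool in either case.
    events = ["hot spare activated"]
    if method == "replace":
        # Replacing the failed disk resilvers again and returns the spare
        # to the existing Spare vdev.
        events.append("failed disk replaced")
    elif method == "promote":
        # Detaching the failed disk promotes the spare; per the section above,
        # this avoids the second resilver.
        pass
    else:
        raise ValueError(f"unknown method: {method!r}")
    return events


for method in ("replace", "promote"):
    print(method, len(resilver_events(method)))
```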

If recreating the spare with a replacement in place of the failed disk, insert the replacement disk now.
If recreating the spare with an available disk in the system, proceed to the next step.

Go to the **Storage > Pools** screen, click on the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> settings icon, and then select **Add Vdevs** to open the **Pool Manager** screen and display the disks in the pools.

Click **ADD VDEV** and select **Hot Spare**.

{{< trueimage src="/images/CORE/13.0/AddVdevsScreenHotSpare.png" alt="Add Vdev Hot Spare" id="Add Vdev Hot Spare" >}}

Select an available disk and click <i class="fa fa-arrow-right" aria-hidden="true" title="Right Arrow"></i> to add it to the **Spare VDev**.

Click **ADD VDEVS**.
Select **Confirm**, then click **ADD VDEVS**.

After completing the job, TrueNAS returns to the **Storage > Pools** screen.
Click on the <i class="material-icons" aria-hidden="true" title="Settings">settings</i> settings icon, and then select **Status** to open the **Pool Status** screen and confirm the hot spare is added.

{{< taglist tag="corestorage" limit="10" title= "Related Storage Articles" >}}