Detecting and replacing failed hard drives – OpenStack
OpenStack Object Storage won’t be of much use if it can’t access the hard drives where our data is stored, so being able to detect and replace failed hard drives is essential. OpenStack Object Storage can be configured to detect hard drive failures with the swift-drive-audit command. Detecting failures promptly lets us replace the failed hard drive, which is essential to the health and performance of the system.
How to accomplish it…
To detect a failing hard drive, carry out the following:
- We first need to configure a cron job that monitors /var/log/kern.log for failed disk errors on our storage nodes. To do this, we start by creating a configuration file named /etc/swift/swift-drive-audit.conf on each storage node.
- We then add a cron job that executes swift-drive-audit hourly, or as often as needed for your environment, as follows:
printf '#!/bin/sh\n/usr/bin/swift-drive-audit /etc/swift/swift-drive-audit.conf\n' | sudo tee /etc/cron.hourly/swift-drive-audit
sudo chmod +x /etc/cron.hourly/swift-drive-audit
- With this in place, when a drive is detected as faulty, the script unmounts it so that OpenStack Object Storage can work around the issue. Once a disk has been marked as faulty and taken offline, you can replace it.
Without swift-drive-audit taking care of this automatically, you would need to act manually to ensure that the disk has been unmounted and removed from the ring.
- Once the disk has been physically replaced, we can follow the instructions in the Managing swift cluster capacity recipe to add our node or device back into our cluster.
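For the configuration file from the first step, a minimal sketch might look like the following. The option names follow the sample drive-audit configuration that ships with Swift, but every value here is an assumption to adjust for your environment, as is the CONF_OUT variable, which exists only so the file can be reviewed locally before it is installed.

```shell
#!/bin/sh
# Sketch: generate a drive-audit configuration locally, then install it by hand.
# Option names follow Swift's sample drive-audit config; values are assumptions.
CONF_OUT="${CONF_OUT:-./swift-drive-audit.conf}"

cat > "$CONF_OUT" <<'EOF'
[drive-audit]
# Directory under which the storage devices are mounted (assumed layout)
device_dir = /srv/node
# Syslog facility and level used for the script's own log messages
log_facility = LOG_LOCAL0
log_level = INFO
# How many minutes of kernel log to look back over on each run
minutes = 60
# Number of logged errors at which a device is unmounted
error_limit = 1
EOF

echo "Wrote $CONF_OUT; review it, then install it with:"
echo "  sudo cp $CONF_OUT /etc/swift/swift-drive-audit.conf"
```

Keeping minutes in line with the cron interval means each hourly run looks at roughly the log window that has passed since the previous run.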
How it works…
Failed hard drives are detected automatically by the swift-drive-audit tool, which we set up as a cron job to run hourly. With this in place, it scans the kernel log for disk errors and unmounts a failing drive so it cannot be used. OpenStack Object Storage then works around the missing drive, so that data isn’t being stored or replicated to it.
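To make the detection step more concrete, the sketch below counts kernel log lines that mention both the word "error" and an sdX-style device name. It is only a rough approximation of the idea, not the tool's actual implementation; the function name count_disk_errors and the sample log lines are hypothetical.

```shell
#!/bin/sh
# Sketch: count log lines mentioning both "error" and an sdX device name.
# Approximates the idea behind the audit; not the tool's real matching rules.
count_disk_errors() {
    # Reads log lines on stdin and prints "<count> <device>" for each device
    awk 'tolower($0) ~ /error/ {
        n = split($0, words, /[^A-Za-z0-9]+/)
        for (i = 1; i <= n; i++) {
            if (words[i] ~ /^sd[a-z][a-z]?[0-9]?$/) {
                count[words[i]]++
                break   # count each log line at most once
            }
        }
    }
    END { for (dev in count) print count[dev], dev }'
}

# Hypothetical sample input; on a real node, feed in /var/log/kern.log instead
printf '%s\n' \
    'kernel: end_request: I/O error, dev sdb, sector 1234' \
    'kernel: sd 2:0:1:0: [sdb] Unhandled error code' \
    'kernel: EXT4-fs (sda1): mounted filesystem' \
    | count_disk_errors
```

A device whose count reaches the configured error limit within the look-back window is the kind of drive the audit would take offline.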
Once the drive has been removed from the rings, we can run maintenance on that device and replace the drive.
With a new drive in place, we can then put the device back in service on the storage node by adding it back into the rings. We can then rebalance the rings by running the swift-ring-builder commands.
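As an illustration of that last step, the sketch below prints the swift-ring-builder add and rebalance commands for the account, container, and object rings. The region, zone, IP address, device name, weight, and ports are all placeholder assumptions, the helper name readd_device_commands is hypothetical, and the r<region>z<zone> device format assumes a Swift release with region support.

```shell
#!/bin/sh
# Sketch: print the ring commands for returning a replaced device to service.
# REGION, ZONE, IP, DEVICE, WEIGHT and the ports are placeholder assumptions.
REGION=1
ZONE=1
IP=172.16.0.11
DEVICE=sdb1
WEIGHT=100

readd_device_commands() {
    # Emit one "add" and one "rebalance" command per ring (ring name:port)
    for ring in account:6002 container:6001 object:6000; do
        name=${ring%%:*}
        port=${ring##*:}
        echo "swift-ring-builder ${name}.builder add r${REGION}z${ZONE}-${IP}:${port}/${DEVICE} ${WEIGHT}"
        echo "swift-ring-builder ${name}.builder rebalance"
    done
}

readd_device_commands
```

The commands are printed rather than executed so they can be reviewed first. They are meant to be run from the directory that holds the builder files, and the resulting ring files then need to be copied out to the nodes in the cluster.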