Replacing a faulty node in a clustered system

You can use the command-line interface (CLI) and the system front panel to replace a faulty node in a clustered system.

Before you begin

Before you attempt to replace a faulty node with a spare node, you must ensure that you meet the following requirements:
  • You know the name of the system that contains the faulty node.
  • A spare node is installed in the same rack as the system that contains the faulty node.
  • You have recorded the last five characters of the original worldwide node name (WWNN) of the spare node. If you repair a faulty node and want to make it a spare node, you can use this WWNN for the node. Every WWNN must be unique, so never duplicate one. Having the WWNN on record makes it easier to swap in a node.
Attention: Never connect a node with a WWNN of 00000 to a system. If this node is no longer required as a spare and is to be used for normal attachment, you must change the WWNN to the number that you recorded when the spare was created. Using any other number might cause data corruption.
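The record-keeping step and the 00000 warning above can be sketched as a small pre-flight check. This is only an illustration, not part of the product: the function names front_panel_wwnn and safe_to_connect are hypothetical, and the sketch assumes that a WWNN is written as 16 hexadecimal digits, as in the example 50050768010000F6 used later in this topic.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def front_panel_wwnn(wwnn: str) -> str:
    """Return the last five characters of a WWNN (the part to record).

    Assumes the WWNN is written as 16 hexadecimal digits.
    """
    wwnn = wwnn.strip().upper()
    if len(wwnn) != 16 or not set(wwnn) <= HEX_DIGITS:
        raise ValueError(f"expected 16 hexadecimal digits, got {wwnn!r}")
    return wwnn[-5:]


def safe_to_connect(suffix: str) -> bool:
    """Return False for the 00000 WWNN that must never be connected."""
    return suffix != "00000"
```

For example, front_panel_wwnn("50050768010000F6") returns "000F6", and safe_to_connect("00000") returns False, which signals that the recorded WWNN must be restored before the node is attached.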

About this task

If a node fails, the system continues to operate with degraded performance until the faulty node is repaired. If the repair operation takes an unacceptable amount of time, it is useful to replace the faulty node with a spare node. However, you must follow the appropriate procedures and take precautions so that you do not interrupt I/O operations or compromise the integrity of your data.

In particular, ensure that the partner node in the I/O group is online.
  • If the partner node in the I/O group is offline, start the fix procedures to determine the fault.
  • If you have been directed here by the fix procedures, and subsequently the partner node in the I/O group has failed, see the procedure for recovering from offline volumes after a node or an I/O group failed.
  • If you are replacing the node for other reasons, determine the node you want to replace and ensure that the partner node in the I/O group is online.
  • If the partner node is offline, you will lose access to the volumes that belong to this I/O group. Start the fix procedures and fix the other node before proceeding to the next step.
Table 1 describes the changes that are made to your configuration when you replace a faulty node in a clustered system.

Table 1. Summary of changes made to node attributes

Node attributes
  Description

Front panel ID
  This ID is the number that is printed on the front of the node and is used to select the node that is added to a system.

Node ID
  This ID is assigned to the node. A new node ID is assigned each time a node is added to a system; the node name remains the same following service activity on the system. You can use the node ID or the node name to perform management tasks on the system. However, if you are using scripts to perform those tasks, use the node name rather than the node ID. This ID changes during this procedure.

Node name
  The node name is the name that is assigned to the node. The system automatically re-adds failed nodes to the system. If the system reports a missing node (error code 1195) and that node has been repaired and restarted, the system automatically re-adds the node to the system.

  If you choose to assign your own names, you must type the node name on the Adding a node to a cluster panel. You cannot manually assign a name that matches the naming convention used for names assigned automatically by the system. If you are using scripts to perform management tasks on the system and those scripts use the node name, you can avoid the need to make changes to the scripts by assigning the original name of the node to a spare node. This name might change during this procedure.

Worldwide node name
  This is the WWNN that is assigned to the node. The WWNN is used to uniquely identify the node and the Fibre Channel ports. During this procedure, the WWNN of the spare node changes to that of the faulty node. The node replacement procedures must be followed exactly to avoid any duplication of WWNNs. This name does not change during this procedure.

Worldwide port names
  These are the WWPNs that are assigned to the node. WWPNs are derived from the WWNN that is written to the spare node as part of this procedure. For example, if the WWNN for a node is 50050768010000F6, the four WWPNs for this node are derived as follows:

  WWNN                          50050768010000F6
  WWNN displayed on front panel 000F6
  WWPN Port 1                   50050768014000F6
  WWPN Port 2                   50050768013000F6
  WWPN Port 3                   50050768011000F6
  WWPN Port 4                   50050768012000F6

  These names do not change during this procedure.
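
The WWPN example above can be reproduced with a short sketch. It is inferred only from the single example in Table 1, where the eleventh character of the WWNN is replaced with 4, 3, 1, and 2 for ports 1 through 4. The name derive_wwpns and the assumption that this pattern holds for other WWNNs or node models are illustrative, not taken from the product documentation.

```python
# Character written at the eleventh position of the WWNN for each port.
# Assumption: read from the single worked example in Table 1.
PORT_DIGIT = {1: "4", 2: "3", 3: "1", 4: "2"}


def derive_wwpns(wwnn: str) -> dict:
    """Map port number to WWPN for a 16-character WWNN.

    Hypothetical helper; it only reproduces the pattern in the example.
    """
    wwnn = wwnn.strip().upper()
    if len(wwnn) != 16:
        raise ValueError(f"expected a 16-character WWNN, got {wwnn!r}")
    # Keep the first ten characters and the last five; swap only the
    # eleventh character (index 10) for the port-specific digit.
    return {
        port: wwnn[:10] + digit + wwnn[11:]
        for port, digit in PORT_DIGIT.items()
    }
```

Calling derive_wwpns("50050768010000F6") gives 50050768014000F6 for port 1 and 50050768012000F6 for port 4, matching the table. Because the WWPNs follow from the WWNN, writing the faulty node's WWNN to the spare node is what keeps the port names unchanged.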

Go to the procedure Replacing nodes nondisruptively for the specific steps to replace a faulty node in a system.