1 Introduction
This document describes the Storage Engine, PLDB Cluster Node Down alarm and the troubleshooting steps to take when it is raised.
1.1 Alarm Description
This alarm is raised when one of the nodes of the cluster database is down or unreachable.
The alarm is issued in the following situations:
- The data node (NDB) of the cluster database is down or unreachable.
- The management node (MGM) of the cluster database is down or unreachable.
- The replication node (SQL) of the cluster database is down or unreachable.
- The access node (SQL) of the cluster database is down or unreachable.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| A management node of the database is down or unreachable. | One of the two management node processes cannot start up or is unreachable. | | Blade or Virtual Machine (VM). | No impact, as each cluster database has two management nodes. |
| A data node of the database cluster is down or unreachable. | The data node process cannot start up due to file system consistency errors, or is unreachable. | | Blade or VM. | Database cluster performance is lower while the data node is down. |
| A replication node of the database cluster is down or unreachable. | One of the replication node processes cannot start up or is unreachable. | | Blade or VM. | No impact, as each cluster database has two replication servers per replication type (master and slave). |
| An access node of the database cluster is down or unreachable. | One of the access node processes cannot start up or is unreachable. | | Blade or VM. | No impact, as each cluster database has two access servers. |
Note: An alarm can appear as a result of maintenance activity.
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | Yes |
| Module | STORAGE-ENGINE |
| Error Code | 2 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raised. |
| Resource ID | .1.3.6.1.4.1.193.169.1.1.2.<ND>.<IP> |
| Alarm Model Description | Cluster node down, Storage Engine. |
| Alarm Active Description | Storage Engine (PLDB): <NT> node #<ND> down @ <IP>, uuid: <uuid> |
| ITU Alarm Event Type | processingErrorAlarm (4) |
| ITU Alarm Probable Cause | softwareProgramError (546) |
| ITU Alarm Perceived Severity | (4) – Major |
| Originating Source IP | Node ID where the alarm was raised. |
| Sequence Number | Number which indicates the order in which alarms were raised. |
In Table 2, the indicated variables are as follows:
- <NT> is the faulty node type (NDB, SQL, MGM).
- <ND> is the node number within the database cluster.
- <IP> is the IP address of the faulty node.
- <uuid> is the universally unique identifier of the computing resource (blade or virtual machine). It is blank if the value cannot be determined.
For further information about attribute descriptions, refer to the Alarm Format and Description section of CUDB Node Fault Management Configuration Guide, Reference [1].
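When alarms are processed by scripts, the variables in the Alarm Active Description and Resource ID can be handled as in the following minimal sketch. The function names `parse_pldb_alarm` and `pldb_resource_id` are hypothetical and not part of CUDB; the sketch assumes the description text matches the Table 2 pattern exactly.

```shell
#!/bin/sh
# Hypothetical helpers (not CUDB commands), based on the patterns in Table 2.

# Split an Alarm Active Description of the form
#   Storage Engine (PLDB): <NT> node #<ND> down @ <IP>, uuid: <uuid>
# into "NT ND IP UUID" on one line. A blank uuid is printed as "-".
# Prints nothing if the text does not match the pattern.
parse_pldb_alarm() {
    printf '%s\n' "$1" |
        sed -n 's/^Storage Engine (PLDB): \([A-Z][A-Z]*\) node #\([0-9][0-9]*\) down @ \([^,]*\), uuid: *\(.*\)$/\1 \2 \3 \4/p' |
        sed 's/ $/ -/'
}

# Build the Resource ID from the node number <ND> and the IP address <IP>.
pldb_resource_id() {
    printf '.1.3.6.1.4.1.193.169.1.1.2.%s.%s\n' "$1" "$2"
}
```

For example, `parse_pldb_alarm 'Storage Engine (PLDB): NDB node #3 down @ 10.22.0.5, uuid: '` prints `NDB 3 10.22.0.5 -`; the address shown is an invented sample value.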
1.2 Prerequisites
This section lists the prerequisites required for the procedure described in Section 2.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- CUDB Node Fault Management Configuration Guide, Reference [1]
- System Safety Information, Reference [6]
- Personal Health and Safety Information, Reference [7]
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
This section describes the procedure to follow when this alarm is received.
2.1 Actions for Data Node Goes Down or Is Unreachable
If the alarm is not cleared automatically in a short period of time, do the following:
- Check if there is an outstanding BSP alarm or alert related to the hardware identified by the <IP> address. If the blade is broken and cannot be fixed, replace the faulty blade. For more information on blade replacement, refer to Server Platform, Blade Replacement, Reference [3].
- Check if the Operating System, Disk Usage Too High alarm is raised. If it is raised, refer to Operating System, Disk Usage Too High, Reference [2].
- Check if the data node can start up. To do so, run the following command on the System Controller blades to search for error code 2341 in the log:
grep "error 2341" /local/cudb/mysql/mgmt/pl/ndb_1_cluster.log
If any data node failed to start with this error, file system or disk errors are the probable causes. Contact the next level of support.
- Confirm that the alarm has ceased. If the alarm remains, consult the next level of maintenance support. Further actions are outside the scope of this Operating Instruction.
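The log check above can be wrapped in a small function so that it returns a count instead of raw log lines. This is a minimal sketch: the function name `count_error_2341` and the `CLUSTER_LOG` variable are hypothetical, and the default path is the one used in the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the grep command from this procedure.
# Default log path as given in the procedure; override by setting CLUSTER_LOG
# or by passing a path as the first argument.
CLUSTER_LOG=${CLUSTER_LOG:-/local/cudb/mysql/mgmt/pl/ndb_1_cluster.log}

# Print the number of log lines that mention error 2341.
# Returns 2 if the log file cannot be read.
count_error_2341() {
    log=${1:-$CLUSTER_LOG}
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # grep -c prints the count of matching lines. It exits with status 1
    # when the count is 0, so that status is deliberately ignored here.
    grep -c "error 2341" "$log" || true
}
```

A count greater than zero corresponds to the case described above, where file system or disk errors are the probable causes.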
2.2 Actions for Node Type MGM Cannot Retrieve Cluster Status
If the alarm is not cleared automatically in a short period of time, do the following:
- Check if there is an outstanding BSP alarm or alert related to the hardware identified by the <IP> address. If the blade is broken and cannot be fixed, replace the faulty blade. For more information on blade replacement, refer to Server Platform, Blade Replacement, Reference [3].
- Confirm that the alarm has ceased. If the alarm remains, consult the next level of maintenance support. Further actions are outside the scope of this instruction.
2.3 Actions for Data Node Is Unable to Start Up
If the alarm is not cleared automatically in a short period of time, do the following:
- Check if the Operating System, Disk Usage Too High alarm is raised. If it is raised, refer to Operating System, Disk Usage Too High, Reference [2].
- Check if the data node can start up. To do so, run the following command on the System Controller blades to search for error code 2341 in the log:
grep "error 2341" /local/cudb/mysql/mgmt/pl/ndb_1_cluster.log
If any data node failed to start with this error, file system or disk errors are the probable causes. Contact the next level of support.
- Confirm that the alarm has ceased. If the alarm remains, consult the next level of maintenance support. Further actions are outside the scope of this instruction.
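To see which data nodes logged error 2341, the matching log lines can be reduced to a list of node numbers, as in the following sketch. It assumes that each cluster log line identifies the reporting node as `Node <id>:`, which is the usual MySQL NDB Cluster log layout but is not confirmed by this document. It also assumes that this number corresponds to <ND> in the alarm. The function name `nodes_with_error_2341` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper. Assumption: log lines look like
#   <timestamp> [MgmtSrvr] ALERT -- Node 3: ... Caused by error 2341: ...
# Prints the distinct node numbers that logged error 2341, one per line,
# in ascending numeric order. Prints nothing if there are no matches.
nodes_with_error_2341() {
    grep "error 2341" "$1" |
        sed -n 's/.*Node \([0-9][0-9]*\):.*/\1/p' |
        sort -un
}
```

The path from the procedure can be passed as the argument: `nodes_with_error_2341 /local/cudb/mysql/mgmt/pl/ndb_1_cluster.log`.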
If the faulty node is an NDB node, check whether the failed node belongs to the master replica of its Processing Layer Database (PLDB) by following the instructions in the Listing the Master Replicas section of CUDB System Administrator Guide, Reference [4]. If it does, CUDB might not be able to process the nominal amount of traffic for that PLDB. If the nominal traffic-processing capacity is likely to be needed before the corrective actions are finished, consider moving the mastership of the affected PLDB to a healthy replica by following the Changing DSG or PLDB Mastership Manually section of CUDB System Administrator Guide, Reference [4].
Glossary
For the terms, definitions, acronyms, and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [5].
Reference List
| Other Ericsson Documents |
|---|
| [6] System Safety Information. |
| [7] Personal Health and Safety Information. |
