Storage Engine, PLDB Cluster Node Down
Ericsson Centralized User Database

Contents

1     Introduction
1.1   Alarm Description
1.2   Prerequisites

2     Procedure
2.1   Actions When a Data Node Goes Down or Is Unreachable
2.2   Actions When Node Type MGM Cannot Retrieve Cluster Status
2.3   Actions When a Data Node Is Unable to Start Up

Glossary

Reference List

1   Introduction

This document provides the description and troubleshooting steps to take for the Storage Engine, PLDB Cluster Node Down alarm.

1.1   Alarm Description

This alarm is raised when one of the nodes of the cluster database is down or unreachable.

The alarm is issued in the situations listed in the Alarm Cause column of Table 1, together with the corresponding fault reasons, fault locations, and impacts.

Table 1    Alarm Causes

Alarm Cause:     A management node of the database is down or unreachable.
Description:     One of the two management node processes cannot start up or is unreachable.
Fault Reason:
  • Network connection error.
  • Hardware error.
  • Disk is almost full.
Fault Location:  Blade or Virtual Machine (VM).
Impact:          No impact, as each cluster database has two management nodes.

Alarm Cause:     A data node of the database cluster is down or unreachable.
Description:     The data node process cannot start up due to file system consistency errors, or is unreachable.
Fault Reason:
  • Non-graceful shutdown.
  • Uncontrolled crash.
  • Hardware error.
Fault Location:  Blade or VM.
Impact:          Database cluster performance is lower while the data node is down.

Alarm Cause:     A replication node of the database cluster is down or unreachable.
Description:     One of the replication node processes cannot start up or is unreachable.
Fault Reason:
  • Network connection error.
  • Corrupted binlog or relay log files.
  • Hardware error.
Fault Location:  Blade or VM.
Impact:          No impact, as each cluster database has two replication servers per replication type (master and slave).

Alarm Cause:     An access node of the database cluster is down or unreachable.
Description:     One of the access node processes cannot start up or is unreachable.
Fault Reason:
  • Network connection error.
  • Hardware error.
Fault Location:  Blade or VM.
Impact:          No impact, as each cluster database has two access servers.

Note:  
An alarm can appear as a result of maintenance activity.

The alarm attributes are listed and explained in Table 2.

Table 2    Alarm Attributes

Attribute Name                Attribute Value
Auto Cease                    Yes
Module                        STORAGE-ENGINE
Error Code                    2
Timestamp First               Date and time when the alarm was raised for the first time.
Repeated Counter              Number which indicates how many times the alarm was raised.
Timestamp Last                Date and time of the most recent alarm raised.
Resource ID                   .1.3.6.1.4.1.193.169.1.1.2.<ND>.<IP>
Alarm Model Description       Cluster node down, Storage Engine.
Alarm Active Description      Storage Engine (PLDB): <NT> node #<ND> down @ <IP>, uuid: <uuid>
ITU Alarm Event Type          processingErrorAlarm (4)
ITU Alarm Probable Cause      softwareProgramError (546)
ITU Alarm Perceived Severity  (4) – Major
Originating Source IP         Node ID where the alarm was raised.
Sequence Number               Number which indicates the order in which alarms were raised.

In Table 2, the indicated variables are as follows:

  • <ND>: ID of the node for which the alarm was raised.
  • <IP>: IP address of the node.
  • <NT>: Type of the node, for example MGM or NDB.
  • <uuid>: Universally unique identifier of the node.
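
For illustration, a hypothetical active alarm for a data node could read as follows; the node ID, IP address, and UUID are invented example values, not taken from a real system:

    Storage Engine (PLDB): NDB node #3 down @ 10.0.1.13, uuid: 550e8400-e29b-41d4-a716-446655440000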

For further information about attribute descriptions, refer to the Alarm Format and Description section of CUDB Node Fault Management Configuration Guide, Reference [1].

1.2   Prerequisites

This section lists the prerequisites required for the procedure described in Section 2.

1.2.1   Documents

Before starting this procedure, ensure that you have read the following documents:

  • System Safety Information, Reference [6].
  • Personal Health and Safety Information, Reference [7].

1.2.2   Tools

Not applicable.

1.2.3   Conditions

Not applicable.

2   Procedure

This section describes the procedure to follow when this alarm is received.

2.1   Actions When a Data Node Goes Down or Is Unreachable

If the alarm does not clear automatically within a short period of time, do the following:

  1. Check if there is an outstanding BSP alarm or alert related to the hardware identified by the <IP> address. A sketch of a manual cluster status check is given after this list.

    If the blade is broken and cannot be fixed, replace it. For more information on blade replacement, refer to Server Platform, Blade Replacement, Reference [3].

  2. Check if the Operating System, Disk Usage Too High alarm is raised. If it is, refer to Operating System, Disk Usage Too High, Reference [2].
  3. Check if the data node can start up. To do so, run the following command on the System Controller blades to search for error code 2341 in the log:

    grep "error 2341" /local/cudb/mysql/mgmt/pl/ndb_1_cluster.log

    If any data node failed to start with this error, file system or disk errors are the probable cause. Contact the next level of support.

  4. Confirm that the alarm has ceased. If the alarm remains, consult the next level of maintenance support. Further actions are outside the scope of this Operating Instruction.
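
The following is a minimal sketch of the manual cluster status check referenced in step 1. It assumes that the ndb_mgm management client is available on the System Controller blade and that <mgm_host> stands for the address of a reachable management node; the host placeholder and the default management port 1186 are assumptions, so adjust them to your deployment:

    # Query the management server for the status of all cluster nodes;
    # a node reported as "not connected" matches the alarm's <NT> and <ND>.
    ndb_mgm -c <mgm_host>:1186 -e show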

2.2   Actions When Node Type MGM Cannot Retrieve Cluster Status

If the alarm does not clear automatically within a short period of time, do the following:

  1. Check if there is an outstanding BSP alarm or alert related to the hardware identified by the <IP> address. A sketch of a manual check of the management node is given after this list.

    If the blade is broken and cannot be fixed, replace it. For more information on blade replacement, refer to Server Platform, Blade Replacement, Reference [3].

  2. Confirm that the alarm has ceased. If the alarm remains, consult the next level of maintenance support. Further actions are outside the scope of this instruction.
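
The following is a minimal sketch of the manual management node check referenced in step 1. It assumes the standard MySQL Cluster process name ndb_mgmd for the management server process; verify the process name against your deployment:

    # Verify that the management server process is running on the node.
    pgrep -a ndb_mgmd

    # Check basic IP reachability of the management node from a peer blade.
    ping -c 3 <IP>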

2.3   Actions When a Data Node Is Unable to Start Up

If the alarm does not clear automatically within a short period of time, do the following:

  1. Check if the Operating System, Disk Usage Too High alarm is raised. If it is, refer to Operating System, Disk Usage Too High, Reference [2]. A sketch of a manual disk usage check is given after this list.
  2. Check if the data node can start up. To do so, run the following command on the System Controller blades to search for error code 2341 in the log:

    grep "error 2341" /local/cudb/mysql/mgmt/pl/ndb_1_cluster.log

    If any data node failed to start with this error, file system or disk errors are the probable cause. Contact the next level of support.

  3. Confirm that the alarm has ceased. If the alarm remains, consult the next level of maintenance support. Further actions are outside the scope of this instruction.
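
The following is a minimal sketch of the manual disk usage check referenced in step 1, together with an optional context search around the error code from step 2. The log path is taken from the command above; treating /local/cudb as the relevant file system is an assumption:

    # List usage of the file system holding the CUDB data area; a partition
    # close to 100% is consistent with the Disk Usage Too High alarm.
    df -h /local/cudb

    # Optionally, show surrounding log lines for more context on the failure.
    grep -B 2 -A 2 "error 2341" /local/cudb/mysql/mgmt/pl/ndb_1_cluster.log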

If the faulty node is an NDB node, determine whether the failed node belongs to the master replica of its Processing Layer Database (PLDB) by following the instructions in the Listing the Master Replicas section of CUDB System Administrator Guide, Reference [4]. If it does, CUDB might not be able to process the nominal amount of traffic for that PLDB. If the nominal traffic-processing capacity is likely to be needed before the corrective actions are finished, consider moving the mastership of the affected PLDB to a healthy replica by following the procedure in the Changing DSG or PLDB Mastership Manually section of CUDB System Administrator Guide, Reference [4].


Glossary

For the terms, definitions, acronyms, and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [5].


Reference List

CUDB Documents
[1] CUDB Node Fault Management Configuration Guide.
[2] Operating System, Disk Usage Too High.
[3] Server Platform, Blade Replacement.
[4] CUDB System Administrator Guide.
[5] CUDB Glossary of Terms and Acronyms.
Other Ericsson Documents
[6] System Safety Information.
[7] Personal Health and Safety Information.


Copyright

© Ericsson AB 2016-2018. All rights reserved. No part of this document may be reproduced in any form without the written permission of the copyright owner.

Disclaimer

The contents of this document are subject to revision without notice due to continued progress in methodology, design and manufacturing. Ericsson shall have no liability for any error or damage of any kind resulting from the use of this document.

Trademark List
All trademarks mentioned herein are the property of their respective owners. These are shown in the document Trademark Information.
