1 Introduction
This instruction concerns alarm handling for the Control, Blackboard Coordination Server Down alarm.
1.1 Alarm Description
The alarm is issued when a Blackboard Coordination (BC) server is down.
The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.
| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| The blade or Virtual Machine (VM) hosting a BC server is down. | The blade or VM is rebooting or shut down, and cannot provide any service. | The blade or VM holding the BC server (that is, the System Controllers (SCs), or PL_2_5). | BC server redundancy is decreased, since the system is running with one less BC server instance. | |
| A BC server goes down, or becomes unreachable. | The BC server process is not running | The process has been stopped or killed, and cannot be started. | BC server redundancy is decreased, since the system is running with one less BC server instance. | |
| A BC server does not provide any service. | The BC server process is running, but is unable to provide any service. | The BC server process is running, but in an unhealthy state. | BC server redundancy is decreased, since the system is running with one less BC server instance. | |
| The files on a BC server are corrupted because of inconsistent information in the data directory. | The information stored in the files of the BC server is corrupted, or inconsistent. | Problem in the /local file system in the blade or VM running the BC server, or wrong information in the BC server files. | The files in the /local/cudb/BCServer folder on the SCs, or PL_2_5. | BC server redundancy is decreased, since the system is running with one less BC server instance. |
The alarm attributes are listed and explained in Table 2.
| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | Yes |
| Module | CONTROL |
| Error Code | 4 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raised. |
| Resource ID | .1.3.6.1.4.1.193.169.7.4.<port> |
| Alarm Model Description | Blackboard Coordination Server Down, Control. |
| Alarm Active Description | Control: Blackboard Coordination Server down on <hostname>, uuid: <uuid> |
| ITU Alarm Event Type | processingErrorAlarm (4) |
| ITU Alarm Probable Cause | softwareProgramError (546) |
| ITU Alarm Perceived Severity | (4) - Major |
| Originating source IP | Node IP where the alarm was raised. |
| Sequence Number | Number which indicates the order in which the alarms are raised. |
In Table 2, the indicated variables are as follows:
- <port> is the port of the BC server that is down. For more information about BC deployment and configuration, refer to CUDB High Availability, Reference [2].
- <hostname> is the hostname of the blade or VM hosting the BC server instance that is down.
- <uuid> is the universally unique identifier of the computing resource (blade or VM). It is blank if its value cannot be determined.
For further information about attribute descriptions, refer to the Alarm Format and Description section of CUDB Node Fault Management Configuration Guide, Reference [3].
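When the alarm is processed from a terminal or a script, the variable parts of the attributes in Table 2 can be extracted with plain shell string handling. The following is a minimal sketch, not part of the CUDB tooling. The function names are hypothetical, and the sketch assumes that the Resource ID and the Alarm Active Description follow exactly the patterns shown in Table 2.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the CUDB command set.
# Assumption: the input strings match the patterns listed in Table 2.

# Print <port>, that is, the last dot-separated field of the Resource ID.
bc_port_from_resource_id() {
    printf '%s\n' "${1##*.}"
}

# Print <hostname> from
# "Control: Blackboard Coordination Server down on <hostname>, uuid: <uuid>".
bc_hostname_from_description() {
    rest=${1#*"down on "}               # drop the text up to "down on "
    printf '%s\n' "${rest%%", uuid:"*}" # drop the ", uuid: ..." tail
}

# Print <uuid>. Prints an empty line if the alarm carries a blank uuid.
bc_uuid_from_description() {
    case $1 in
        *"uuid: "*) printf '%s\n' "${1##*"uuid: "}" ;;
        *)          printf '\n' ;;
    esac
}
```

The port printed by bc_port_from_resource_id identifies which BC server instance is down, and the hostname printed by bc_hostname_from_description identifies the blade or VM to log in to.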
1.2 Prerequisites
This section provides information on the documents, tools, and conditions that apply to the procedure.
1.2.1 Documents
Before starting this procedure, ensure that you have read the following documents:
- CUDB Node Fault Management Configuration Guide, Reference [3].
- CUDB Node Commands and Parameters, Reference [4].
- The "Zookeeper" section of CUDB Node Logging Events, Reference [5].
- System Safety Information, Reference [7].
- Personal Health and Safety Information, Reference [8].
1.2.2 Tools
Not applicable.
1.2.3 Conditions
Not applicable.
2 Procedure
If the alarm is raised, then do the following:
- Wait for a short time for the alarm to clear. If the alarm clears, no further action is required. If it does not clear, continue with the next step.
- Try to restart the process manually with the following command:
/opt/ericsson/cudb/OAM/bin/cudbManageBCServer -restart
- Check the log file of the failing BC Server on the blade or VM holding the BC Server (look for an IOException raised while loading the database). The log file is located at the following path:
/var/log/bc_server.err
For further details, check the "Zookeeper" section of CUDB Node Logging Events, Reference [5].
- If the BC Server is unable to read its database, and fails to start because of file corruption in the transaction logs, then do the following:
  - Make sure that all the other BC Servers in the BC Cluster are up and running with the following command:
    cudbSystemStatus -B
  - If all the other BC Servers of the BC Cluster are up, then clean the database of the corrupt BC Server with the following command:
    rm -rf /local/cudb/BCServer/version-2
  - Try to restart the process manually with the following command:
    /opt/ericsson/cudb/OAM/bin/cudbManageBCServer -restart
  - Wait for a short time for the alarm to clear.
- If the problem is not identified, or the alarm does not cease with the measures taken, consult the next level of maintenance support. Further actions are outside the scope of this instruction.
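The log check and the database cleanup in the procedure above can be wrapped in a small script, so that the destructive rm command is not run before the cluster status has been reviewed. The following is a minimal sketch under stated assumptions, not a CUDB-provided tool. The function names and the BC_LOG and BC_DB_DIR variables are hypothetical, and the paths and commands are the ones given in this instruction. The sketch does not assume any output format for cudbSystemStatus -B, so the operator reads the output and confirms manually.

```shell
#!/bin/sh
# Hypothetical wrapper for illustration; not part of the CUDB command set.
# Paths and commands are taken from the procedure in this instruction.

BC_LOG=${BC_LOG:-/var/log/bc_server.err}
BC_DB_DIR=${BC_DB_DIR:-/local/cudb/BCServer/version-2}

# Print the log lines that mention IOException, prefixed with line numbers.
# Returns 0 if at least one line matches, non-zero otherwise.
bc_log_ioexceptions() {
    grep -n 'IOException' "${1:-$BC_LOG}"
}

# Ask the operator a question; returns 0 only on the exact answer "yes".
confirm() {
    printf '%s [yes/no] ' "$1"
    read -r answer
    [ "$answer" = "yes" ]
}

# Clean the database of the corrupt BC Server and restart the process,
# but only after the operator has reviewed the cluster status by hand.
bc_clean_and_restart() {
    cudbSystemStatus -B
    confirm "Are all the other BC Servers in the BC Cluster up?" || {
        echo "Aborted: database not removed." >&2
        return 1
    }
    rm -rf "$BC_DB_DIR"
    /opt/ericsson/cudb/OAM/bin/cudbManageBCServer -restart
}
```

Run bc_log_ioexceptions first on the blade or VM holding the failing BC Server. Call bc_clean_and_restart only if the log shows that the database could not be loaded.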
Glossary
For the terms, definitions, acronyms and abbreviations used in this document, refer to CUDB Glossary of Terms and Acronyms, Reference [6].
Reference List
| Other Ericsson Documents |
|---|
| [7] System Safety Information. |
| [8] Personal Health and Safety Information. |
