Operating Instructions 28/1543-CSH 109 067/10 Uen F

Server Platform, Storage Performance Degradation Detected
Ericsson Centralized User Database


1 Introduction

This instruction concerns alarm handling for the Server Platform, Storage Performance Degradation Detected alarm.

1.1 Alarm Description

The alarm is issued when an Ericsson Centralized User Database (CUDB) application detects that application components are impacted by a degradation of storage performance.

The alarm is issued in the following situations:

  • A monitored I/O-heavy process remains stuck in uninterruptible sleep because the storage system does not respond.

  • The storage system responds to a file system probe request with an I/O error or a timeout.
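
As an illustration of the first situation, the following Python sketch shows one way a process in uninterruptible sleep can be recognized on Linux, by reading the state field of /proc/<pid>/stat (state D means "disk sleep"). It is not the CUDB monitoring implementation. The function names and the three-sample threshold are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')'.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]


def is_in_disk_sleep(pid: int) -> bool:
    """True if the process is in uninterruptible sleep (state 'D') right now."""
    stat_line = Path(f"/proc/{pid}/stat").read_text()
    return parse_proc_state(stat_line) == "D"


def looks_stuck(states: Iterable[str], min_consecutive: int = 3) -> bool:
    """True if the most recent `min_consecutive` samples are all 'D'.

    The threshold is an assumption for this sketch: a single 'D' sample
    is normal for a process doing I/O and does not indicate a fault.
    """
    recent = list(states)[-min_consecutive:]
    return len(recent) == min_consecutive and all(s == "D" for s in recent)
```

A monitor built on this idea would sample is_in_disk_sleep() periodically for each monitored process and pass the collected states to looks_stuck().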

The possible alarm causes and the corresponding fault reasons, fault locations, and impacts are described in Table 1.

Table 1   Alarm Causes

| Alarm Cause | Description | Fault Reason | Fault Location | Impact |
|---|---|---|---|---|
| File system probe detected an error. | Monitored partition could not be written for a longer period of time (preset timeout) due to I/O error. | Most probably faulty infrastructure. | Storage system. | Performance degradation in the CUDB system. |
| Lightweight process state check detected an error. | Monitored I/O heavy process got stuck in uninterruptible sleep ("disk sleep"). | Most probably faulty infrastructure. | Storage system. | Performance degradation in the CUDB system. |
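
The file system probe cause can be pictured with the following Python sketch, which writes and syncs a small file on a partition and reports an I/O error or a timeout. It is only an illustration under stated assumptions: the function name probe_write, the 5-second default timeout, and the ".probe-" file prefix are hypothetical and are not taken from the CUDB product.

```python
import os
import tempfile
import threading


def probe_write(directory: str, timeout_s: float = 5.0) -> str:
    """Write and fsync a small file in `directory`.

    Returns "ok", "io_error", or "timeout".
    """
    result = {}

    def _write() -> None:
        try:
            fd, path = tempfile.mkstemp(dir=directory, prefix=".probe-")
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # force the data down to the storage system
            finally:
                os.close(fd)
                os.unlink(path)
            result["status"] = "ok"
        except OSError:
            result["status"] = "io_error"

    # A daemon thread is used because a write blocked on unresponsive
    # storage cannot be interrupted; the caller only stops waiting for it.
    worker = threading.Thread(target=_write, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "timeout"
    return result.get("status", "io_error")
```

The probe runs in a separate thread so that the caller can give up after the timeout even when the write itself never returns.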

The following are the consequences for the node if the alarm is not resolved:

  • In case the alarm is raised in a payload blade or Virtual Machine (VM):

    • Performance degradation for the impacted Data Store/Processing Layer Database (DS/PLDB).

    • Lost local redundancy for the impacted DS/PLDB.

    • Lost DS/PLDB geographical redundancy if both DS blades or VMs, or all PLDB blades or VMs, fail in a node.

    • Lost node in case all PLDB blades or VMs fail.

  • In case the alarm is raised in a System Controller (SC):

    • Service degradation in controlling processes running on the impacted SC.

    • Possible node reboots.

  • Unplanned mastership changes, which can cause data durability issues.

The alarm attributes are listed and explained in Table 2.

Table 2   Alarm Attributes

| Attribute Name | Attribute Value |
|---|---|
| Auto Cease | No |
| Module | SERVER-PLATFORM |
| Error Code | 2 |
| Timestamp First | Date and time when the alarm was raised for the first time. |
| Repeated Counter | Number which indicates how many times the alarm was raised. |
| Timestamp Last | Date and time of the most recent alarm raised. |
| Resource ID | .1.3.6.1.4.1.193.169.4.2.<Blade ID> |
| Alarm Model Description | Storage performance degradation detected, Server Platform |
| Alarm Active Description | Server Platform: Storage performance degradation detected on host <Blade>. <Additional info> |
| ITU Alarm Event Type | equipmentAlarm (5) |
| ITU Alarm Probable Cause | replaceableUnitProblem (69) |
| ITU Alarm Perceived Severity | Major (4) |
| Originating Source IP | Node IP where the alarm was raised. |
| Sequence Number | Number which indicates order in which alarms are raised. |

In Table 2, the indicated variables are as follows:

  • <Blade ID> is the LDE or LOTC node ID for the blade or VM.

  • <Blade> is the LDE or LOTC hostname for the blade or VM.

  • <Additional info> is different depending on the CUDB system deployment and blade type:

    • For CUDB systems deployed on native BSP 8100 and payload blade: Automatic shutdown was performed.

    • In all other cases: The variable has no value.
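
When alarms are processed by a script, the variables can be extracted from the Resource ID and Alarm Active Description formats given in Table 2. The following Python sketch shows one possible way to do this. The function names are hypothetical, and the sketch assumes that <Blade ID> is a decimal number and that the hostname contains no dots or spaces; host names used with it, such as PL_2_5, are illustrative only.

```python
import re
from typing import Tuple

RESOURCE_ID_PREFIX = ".1.3.6.1.4.1.193.169.4.2."

DESCRIPTION_RE = re.compile(
    r"^Server Platform: Storage performance degradation detected on host "
    r"(?P<host>[^\s.]+)\.\s*(?P<info>.*)$"
)


def parse_blade_id(resource_id: str) -> int:
    """Return <Blade ID> from a Resource ID of the form shown in Table 2."""
    if not resource_id.startswith(RESOURCE_ID_PREFIX):
        raise ValueError(f"unexpected Resource ID: {resource_id!r}")
    suffix = resource_id[len(RESOURCE_ID_PREFIX):]
    if not suffix.isdigit():  # assumption: the blade ID is one numeric arc
        raise ValueError(f"unexpected blade ID: {suffix!r}")
    return int(suffix)


def parse_active_description(text: str) -> Tuple[str, str]:
    """Return (<Blade>, <Additional info>); the second item may be empty."""
    match = DESCRIPTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unexpected description: {text!r}")
    return match.group("host"), match.group("info").strip()
```

An empty second item corresponds to the case in which <Additional info> has no value.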

The possible cause is a failure in the storage system.

1.2 Prerequisites

This section provides information on the documents, tools, and conditions that apply to the procedure.

1.2.1 Documents

Before starting this procedure, ensure that you have read the following documents:

1.2.2 Tools

Not applicable.

1.2.3 Conditions

Not applicable.

2 Procedure

This section describes the procedure to follow when this alarm is received.

2.1 Procedure for CUDB Systems Deployed on Native BSP 8100

Do!

For a payload blade, Step 1 and Step 2 must be performed immediately, even if the physical blade replacement is performed later. For more information about blade replacement, refer to the Replacing GEP Boards section of Server Platform, Blade Replacement.

Steps

  1. Identify the blade.
    For more information, refer to the Identifying the Faulty Blade section of Server Platform, Blade Replacement.
  2. Lock the blade.
    For more information, refer to Manage Blade in the BSP 8100 CPI.
  3. Perform the blade replacement.
    For more information, refer to the Replacing GEP Boards section of Server Platform, Blade Replacement or contact the next level of maintenance support.
  4. After the blade is replaced, clear the alarm.
    For more information, refer to the Clearing Alarms section of CUDB Node Fault Management Configuration Guide.

2.2 Procedure for CUDB Systems Deployed on a Cloud Infrastructure

In case the Storage Performance Degradation Detected alarm is raised, check the following in the cloud infrastructure:

  • Check if there is any ongoing infrastructure activity (for example, maintenance of the file systems used by the cloud infrastructure).

  • Check if there is a problem with the cloud infrastructure software.

  • Check if the cloud infrastructure hardware is hosting a faulty VM.

If everything is working correctly, clear the alarm (refer to the Clearing Alarms section of CUDB Node Fault Management Configuration Guide).

In case problems in the cloud infrastructure are identified:

Steps

  1. Make sure that the problems are fixed according to the Actions in the Case of Infrastructure Activities section of Virtualized CUDB Virtual Machine Recovery.
    Do!

    Until infrastructure issues are resolved, the VM must be shut down according to the steps provided in the document.

  2. Once the VM is recovered, clear the alarm.
    For more information, refer to the Clearing Alarms section of CUDB Node Fault Management Configuration Guide.
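
For operators who keep runbooks in script form, the two procedures in this section can be summarized as a small decision helper. The following Python sketch is only an aid for reading the procedures and does not replace the referenced documents. The function name, the "native" and "cloud" labels, and the step wording are assumptions made for this example.

```python
from typing import List


def next_actions(deployment: str, payload_blade: bool = False,
                 infra_ok: bool = True) -> List[str]:
    """Return the ordered actions for this alarm.

    deployment: "native" (BSP 8100) or "cloud" (labels chosen for this sketch).
    payload_blade: only relevant for "native".
    infra_ok: only relevant for "cloud"; True if the infrastructure
        checks found no problem.
    """
    if deployment == "native":
        # For a payload blade, the first two steps must not be delayed.
        urgent = " (perform immediately)" if payload_blade else ""
        return [
            "Identify the blade" + urgent,
            "Lock the blade" + urgent,
            "Perform the blade replacement",
            "Clear the alarm",
        ]
    if deployment == "cloud":
        if infra_ok:
            return ["Clear the alarm"]
        return [
            "Shut down the VM until the infrastructure issues are resolved",
            "Fix the infrastructure problems",
            "Clear the alarm once the VM is recovered",
        ]
    raise ValueError(f"unknown deployment: {deployment!r}")
```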

Reference List