Availability Management Framework

1   Understanding Availability Management Framework

1.1   Basic Availability Management Framework Concepts

The Availability Management Framework (AMF) manages applications and keeps the services they provide available at all times. This includes the following major responsibilities:

1.2   Application

In the AMF context, an application usually means the server part of a client-server application. There are many types of servers, such as web servers, database servers, and gaming servers.

Green field applications are applications written from scratch, possibly with AMF integration in mind. If so, they can freely use the AMF concepts, depending on their ambition level, to provide service availability and become Service Availability-aware (SA-aware).

Third-Party Programs (3PPs) or legacy applications are existing applications that are not integrated with the AMF. Such applications are referred to as non-SA-aware. Certain features exist in the AMF to support integration of these types of applications. Such integration is important to provide a complete highly available system solution that includes databases and storage solutions.

The AMF environment is a clustered environment and the application can be distributed in a cluster. The subparts of a distributed application do either or both of the following:

An AMF application can consist of only a single operating system process, but the AMF modeling requirements then add considerable overhead. It is, however, a good starting point when there are plans to make the application High Availability (HA) or distributed, or both.

1.3   Cluster and Node

Cluster and Node are logical entities of the AMF system model. An AMF node corresponds to an operating system instance. The set of AMF nodes forms the AMF cluster. Nodes in a cluster belong to the same communication subnet; no routing is needed within a cluster.

1.4   Component and Service Unit

The component is one logical entity of the AMF system model. A component represents a program in execution under control of the AMF. Usually a component corresponds 1:1 to an operating system process.

The term SA-aware component describes a component that is integrated with the AMF and uses its API.

Components are grouped into Service Units (SUs), a logical entity completely associated with an AMF node. All components in an SU execute on the same AMF node.
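
The containment rules in this section can be sketched as a small data model. This is an illustration only: the class and field names (Component, ServiceUnit, node) are hypothetical and are not part of the AMF API or its information model.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A program in execution under AMF control (hypothetical model class)."""
    name: str
    pid: int  # usually one operating system process per component


@dataclass
class ServiceUnit:
    """A group of components hosted on exactly one AMF node."""
    name: str
    node: str
    components: list = field(default_factory=list)

    def add(self, component: Component) -> None:
        self.components.append(component)

    def nodes_of_components(self) -> set:
        # Every component inherits the node of its SU, so this set
        # never holds more than one element.
        return {self.node for _ in self.components}


su = ServiceUnit(name="SU1", node="node-1")
su.add(Component(name="web", pid=1201))
su.add(Component(name="cache", pid=1202))
```

Because the node is stored on the SU rather than on each component, the model cannot express an SU whose components run on different nodes.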

1.5   Health Monitoring

Health monitoring is important to achieve service availability and is used to detect errors and anomalies in the system. Monitoring is always performed on a per-component basis and is also called component monitoring.

The AMF supports three different types of monitoring: passive monitoring, external active monitoring, and internal active monitoring.

With active monitoring, latent faults, such as a program that loops and no longer responds, can be detected, which is not the case using passive monitoring.

When active monitoring is used, it is also possible to validate the data received from the monitored service. For example, if the system uptime is requested from an SNMP agent as part of actively monitoring it, the result can be checked to see if it is reasonable. This kind of validation is service-specific and out of the scope of the AMF and this document. If used, it gives even higher service availability because another class of errors can be detected.

The recovery action taken by the AMF when a fault has been detected is configurable. One example is COMPONENT_RESTART: if a monitored process dies, the AMF restarts it. A recommended recovery action can also be specified in the API call used to report errors.

For more information, refer to AIS AMF Specification at http://www.saforum.org.

1.5.1   Passive Monitoring

In passive monitoring, the AMF uses operating system features to assess the health of a component. Currently only monitoring the death of a process is defined but one can envision monitoring other system resources like main memory use.

As operating system features are used, the component is not actively involved in the monitoring and its code is not instrumented, hence the name passive monitoring.

The AMF implicitly performs passive monitoring on SA-aware components. If such a component dies, for example because of segmentation fault, the AMF automatically detects it.

To use passive monitoring for other types of components (or for a subprocess), the monitoring must be started using function saAmfPmStart() and stopped using function saAmfPmStop().

The time between fault and detection is implementation-specific and cannot be configured using either the API or through configuration objects.
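
What passive monitoring observes can be illustrated with plain operating system facilities. The sketch below is not the AMF implementation and does not call saAmfPmStart(). It starts a child process with the Python standard subprocess module and detects its death without any cooperation from the child. The function name wait_for_death, the polling interval, and the time-out are assumptions made for illustration.

```python
import subprocess
import sys
import time


def wait_for_death(proc, poll_interval=0.05, timeout=5.0):
    """Poll the OS until the process has died; return its exit code.

    Returns None if the process is still alive after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = proc.poll()  # None while the process is still running
        if code is not None:
            return code
        time.sleep(poll_interval)
    return None


# The monitored program is not instrumented; here it simply exits with code 3.
child = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
exit_code = wait_for_death(child)
```

The polling interval in this sketch plays the role of the implementation-specific detection time mentioned above.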

1.5.2   External Active Monitoring

In external active monitoring, the component code is not instrumented; instead, an external entity called a monitor is used to assess the health of the component.

The monitor preferably sends real service requests to the monitored component and checks that a correct response is received in a timely manner.

An AMF component can be configured with optional Application Monitoring (AM) commands. Command AM_START is called after the instantiate command and AM_STOP is called before the terminate command.

AM_START starts a monitor process that periodically assesses the health of the monitored application by making a simple service request to it. The AMF is not involved in the actual monitoring; that is the responsibility of the monitor process.

When the monitor detects a health problem with its monitored service, it is to call function saAmfComponentErrorReport(). This implies that the monitor itself is written in C/C++ or that a helper command exists that wraps saAmfComponentErrorReport() so that it can be called by a script-implemented monitor.

In this case no one monitors the monitor, but as the monitor is simple and small, it can probably be considered fault-free by review. If this is not appropriate, the monitor can be implemented as an AMF SA-aware component to which the AM commands send monitoring requests.
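
The behavior of a monitor process as described above can be sketched as a short loop. This is a minimal sketch under stated assumptions: probe stands in for a real service request, report_error stands in for a helper that would wrap saAmfComponentErrorReport(), and the names run_monitor, period, and response_limit are hypothetical. No AMF library is called.

```python
import time


def run_monitor(probe, report_error, rounds, period=0.0, response_limit=0.5):
    """Send up to `rounds` service requests; report the first failure and stop.

    A check fails when the probe raises, returns a false value, or takes
    longer than `response_limit` seconds to answer.
    """
    for _ in range(rounds):
        started = time.monotonic()
        try:
            ok = bool(probe())
        except Exception:
            ok = False
        elapsed = time.monotonic() - started
        if not ok or elapsed > response_limit:
            report_error("service check failed")
            return False
        time.sleep(period)  # wait before the next periodic check
    return True


# Simulated service: answers correctly twice, then gives a wrong answer.
answers = iter([True, True, False])
reports = []
healthy = run_monitor(lambda: next(answers), reports.append, rounds=5)
```

Keeping the loop this small is what makes the "fault-free by review" argument plausible for a monitor.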

For more information, refer to AIS AMF Specification at http://www.saforum.org.

1.5.3   Internal Active Monitoring

With internal active monitoring, the component must be specifically designed for monitoring: its code is instrumented with checks. The purpose of such code is to monitor the component health and discover latent faults. In the AMF, the execution of such code (often called an audit) is called a health check.

As the code is instrumented, this type of monitoring is normally only used for SA-aware components.

A health check can be triggered by the component itself or by the AMF. When triggered by the AMF, health check requests are sent periodically to the component with a certain configurable period. The AMF expects a response within a certain configurable time called the duration. The duration is always shorter than the period.

A component can have several health checks active at the same time. Each health check is identified by a key, that is, a name. The reason for allowing several is that the impact on the provided service varies with the check performed. A normal service request has little impact and can be run with a shorter period. More detailed component audits can have more service impact and are to be run with a longer period.

Active monitoring means that the provided service is to be checked. Therefore, health checks cannot be acted on by, for example, a separate decoupled thread in the component, unless it actually does a service request internally.

Configuration of the period (and duration) must be done with high load in mind. It is a trade-off between fast detection of true errors and avoidance of false error detection. A longer period helps to avoid false error detection, but latent faults then take longer to detect. A health check period is normally in the range of seconds or even tens of seconds; it is most likely not less than a second. The health check duration most likely must be longer than the callback time-out, typically twice as long. Whether two supervision timers run at the same time, or health checks are skipped while some other supervision (for example, a callback time-out) is active, depends on the AMF implementation.
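
The relation between key, period, and duration can be sketched as follows. This is an illustrative model only: the class HealthCheck, its method first_failure, and the numeric values are assumptions and do not correspond to AMF configuration attributes or API calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    key: str         # name that identifies the health check
    period: float    # seconds between AMF-invoked health check requests
    duration: float  # seconds the component has to respond

    def __post_init__(self):
        # The duration is always shorter than the period.
        if not 0 < self.duration < self.period:
            raise ValueError("duration must be positive and shorter than period")

    def first_failure(self, response_times):
        """Return the index of the first late response, or None if all are in time."""
        for index, seconds in enumerate(response_times):
            if seconds > self.duration:
                return index
        return None


# A light check runs often; a detailed audit runs rarely.
light = HealthCheck(key="request", period=2.0, duration=1.0)
audit = HealthCheck(key="audit", period=30.0, duration=10.0)
late = light.first_failure([0.1, 0.4, 1.5, 0.2])
```

In this example the third response (1.5 s) exceeds the 1.0 s duration of the light check, so it is the one that would be treated as an error.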

An unexpected death of the registered process for an AMF component is instantly detected by the AMF and requires no active monitoring.

Errors are reported to the AMF in two ways. When AMF-invoked health checks are used, the component gives a negative response using function saAmfResponse(). When component-invoked health checks are used, the component gives a negative response using function saAmfHealthcheckConfirm().

For more information, refer to AIS AMF Specification at http://www.saforum.org.

1.6   Workload

A normal non-AMF-aware program provides service directly when started. There is no distinction between the program and the service it provides. However, if the service or work the program performs can be categorized and quantified, it can also be modeled and managed. This categorized and quantified work or service is what the AMF means by workload. Workload is a core concept used by the AMF to enable high availability and is important to understand. When an application uses the workload concept, the AMF can apply sophisticated redundancy schemes.

An application designed with the workload separation in mind is called SA-aware in AMF terms. That is, it can be started and be "idle" – do nothing until the AMF tells it to be active or standby for a certain workload.

A simple example can be a web server that starts and initializes but does not bind to port 80 until assigned the corresponding active workload. On another node, the same program can be running as standby waiting to be activated if the other instance goes down. This is an example of a simple 2N redundancy scheme.
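
The web server example can be sketched as a component that stays idle until it receives an assignment. This is a simplified model, not AMF API usage: the class WebServerComponent and its assign method are hypothetical, and binding to the port is simulated by setting an attribute instead of opening a socket.

```python
class WebServerComponent:
    """Starts idle and serves only while it holds the active assignment."""

    def __init__(self, name, port=80):
        self.name = name
        self.port = port
        self.ha_state = None    # no workload assigned yet: idle
        self.bound_port = None  # stands in for a real listening socket

    def assign(self, ha_state):
        """Hypothetical handler for a workload assignment from the AMF."""
        if ha_state not in ("active", "standby"):
            raise ValueError(f"unknown HA state: {ha_state}")
        # Bind only for the active workload; a standby instance stays unbound.
        self.bound_port = self.port if ha_state == "active" else None
        self.ha_state = ha_state

    def serving(self):
        return self.bound_port is not None


node1 = WebServerComponent("web@node1")
node2 = WebServerComponent("web@node2")
node1.assign("active")
node2.assign("standby")
before = (node1.serving(), node2.serving())

# node1 goes down; the standby instance is told to take over.
node1.bound_port = None
node1.ha_state = None
node2.assign("active")
after = (node1.serving(), node2.serving())
```

At every point in this sequence at most one instance serves the workload.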

With AMF concepts, the workload is called a Service Instance (SI) and these are assigned to SUs. An SI is further broken down into Component Service Instances (CSIs), which are assigned to components (processes) and visible in the API for the program designer.
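
The breakdown of an SI into CSIs can be sketched as a mapping from the components of the chosen SU to the CSIs they receive. The function assign_si and all names in the example are hypothetical, and pairing CSIs with components by position is an assumption made for this sketch, not the rule the AMF applies.

```python
def assign_si(si_csis, su_components, ha_state):
    """Map each CSI of an SI to one component of the chosen SU.

    Returns a dict {component: (csi, ha_state)}.
    """
    if len(si_csis) > len(su_components):
        raise ValueError("not enough components in the SU for all CSIs")
    return {comp: (csi, ha_state) for csi, comp in zip(si_csis, su_components)}


assignment = assign_si(["csi-http", "csi-cache"], ["web", "cache"], "active")
```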

1.7   Assignment

The AMF assigns a workload in active or standby state to an application. This means that the application upon receiving the assignment is to start providing service according to the state of the assignment, and the amount and type of service as described by the workload.

For simplicity, the application is often designed so that, when assigned an active workload, it already knows the amount and type of service the workload represents. In the web server example, the active workload means bind to port 80. But if the bind port number can vary, the AMF concepts for categorizing the workload can be used.

More complex schemes can be used to describe workloads. For example, a workload can describe a range of subscribers and certain properties of them. One can then imagine a set of application "workers" that collectively provide the high-level service: each worker contributes its piece, and all workers together provide the complete service.
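
A workload that describes a subscriber range could, for example, be divided among workers as in the sketch below. The function split_range and the subscriber numbers are made up for illustration; how a real application partitions its workloads is a design decision outside the AMF.

```python
def split_range(first, last, workers):
    """Split the inclusive range [first, last] into `workers` contiguous parts."""
    total = last - first + 1
    if workers < 1 or total < workers:
        raise ValueError("need at least one subscriber per worker")
    base, extra = divmod(total, workers)
    parts = []
    start = first
    for index in range(workers):
        # The first `extra` workers take one additional subscriber each.
        size = base + (1 if index < extra else 0)
        parts.append((start, start + size - 1))
        start += size
    return parts


workloads = split_range(1000, 1999, 3)
```

Each resulting tuple could be modeled as one workload, so that the loss of one worker affects only its own part of the subscriber range.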

1.8   Failover and Switchover

In this section, the "operator" can mean either a human or a management application running within the system such as software management.

Failover means an unexpected reassignment (from an operator point of view) of a workload to another instance of an application. In AMF terms, the SI is reassigned to another SU. Failover is always a consequence of a fault in the system of which the AMF is aware.

Switchover means an expected reassignment of a workload to another instance of an application. It is expected because it is either initiated by an operator or by the AMF itself. When recovering from a fault, the AMF can fail over some SIs and switch over others. This occurs in some conditions, always as a consequence of a fault and depending on the application model and configuration. This is to minimize disturbance in the system.

A switchover is supposed to be less intrusive to the service provided by the application. SA-aware components are to be designed with this objective.
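
The difference between the two reassignments can be sketched for a single SI in a 2N-style setup with one active and one standby SU. The class TwoNAssignment is a hypothetical bookkeeping model, not the AMF algorithm: it records only which SU holds which assignment.

```python
class TwoNAssignment:
    """Tracks which SU holds the active and standby assignment of one SI."""

    def __init__(self, active_su, standby_su):
        self.active = active_su
        self.standby = standby_su
        self.log = []

    def _require_standby(self):
        if self.standby is None:
            raise RuntimeError("no standby SU available")

    def switchover(self):
        """Expected swap: the old active SU remains available as standby."""
        self._require_standby()
        self.log.append(f"switchover {self.active} -> {self.standby}")
        self.active, self.standby = self.standby, self.active

    def failover(self):
        """Unexpected move after a fault: the failed SU holds no assignment."""
        self._require_standby()
        self.log.append(f"failover {self.active} -> {self.standby}")
        self.active, self.standby = self.standby, None


si = TwoNAssignment(active_su="SU1", standby_su="SU2")
si.switchover()  # operator-initiated: SU2 active, SU1 standby
si.failover()    # fault in SU2: SU1 active, no standby left
```

After the failover in this sketch the SI has no standby until the failed SU has been repaired and assigned again.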

1.9   Error Detection, Recovery, Repair, and Escalation

Error detection is the responsibility of all entities in the system.

After an error has been detected and reported, the AMF tries to recover the application-provided service from the error. Recovery is performed automatically by the AMF to ensure that all assignments are reassigned to a non-erroneous component. If the AMF cannot reassign the workload, it sends the alarm "workload unassigned", which means that a service is not available at all.

A recovery action can be recommended when an error is reported. A default action is also configured for the component. The executed action is never weaker than the one recommended but can be stronger.

Normally the first level of recovery is restart of the erroneous component. The objective is to avoid reassigning the workload to another component. If component restart fails or another error occurs within the component probation time, the next action – because of escalation – is restart of the whole SU.

If the SU is restarted too many times during the SU probation time, the recovery action is escalated to failover.

If restart is disabled by configuration or the restart failed, the next level of recovery action, because of escalation or recommendation, is failover. This means assigning the workload to an SU other than the one the failed component belongs to. Through escalation, the failover scope can be extended from SU to node (all SUs hosted by a node).
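
The escalation steps described in this section can be sketched as a small policy object. This is a simplified model under stated assumptions: the Recovery names form an illustrative subset ordered from weakest to strongest, the parameters comp_probation, su_probation, and max_su_restarts are hypothetical stand-ins for configuration, and the real AMF rules contain more cases than shown here.

```python
from enum import IntEnum


class Recovery(IntEnum):
    """Illustrative recovery actions, ordered from weakest to strongest."""
    COMPONENT_RESTART = 1
    SU_RESTART = 2
    SU_FAILOVER = 3
    NODE_FAILOVER = 4


class EscalationPolicy:
    """Counts errors inside probation windows and escalates the recovery."""

    def __init__(self, comp_probation, su_probation, max_su_restarts):
        self.comp_probation = comp_probation
        self.su_probation = su_probation
        self.max_su_restarts = max_su_restarts
        self.last_comp_restart = None
        self.su_restarts = []  # timestamps of earlier SU restarts

    def on_error(self, now, recommended=Recovery.COMPONENT_RESTART):
        action = Recovery.COMPONENT_RESTART
        # Another error within the component probation time: restart the SU.
        if (self.last_comp_restart is not None
                and now - self.last_comp_restart <= self.comp_probation):
            action = Recovery.SU_RESTART
        # Too many SU restarts within the SU probation time: fail over.
        if action == Recovery.SU_RESTART:
            recent = [t for t in self.su_restarts if now - t <= self.su_probation]
            if len(recent) >= self.max_su_restarts:
                action = Recovery.SU_FAILOVER
        # The executed action is never weaker than the recommended one.
        action = max(action, recommended)
        if action == Recovery.COMPONENT_RESTART:
            self.last_comp_restart = now
        elif action == Recovery.SU_RESTART:
            self.su_restarts.append(now)
        return action


policy = EscalationPolicy(comp_probation=10, su_probation=60, max_su_restarts=1)
actions = [
    policy.on_error(now=0),
    policy.on_error(now=5),
    policy.on_error(now=8),
    policy.on_error(now=100, recommended=Recovery.NODE_FAILOVER),
]
```

In this run the first error leads to a component restart, the second (within the probation time) to an SU restart, the third to an SU failover, and the last error is handled with the stronger recommended action.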

After recovery, repair is by default automatically performed on the erroneous entity. By configuration, automatic repair can be disabled, which makes repair a non-AMF responsibility. Restart recovery actions are considered as repair actions and no further action is needed. However, if the recovery action was failover, the AMF tries to reinstantiate the component and possibly reassign it.

For more information, refer to AIS AMF Specification at http://www.saforum.org.

1.10   Information Model

An SA-Forum system is managed through an information model. The information model consists of managed objects that represent various logical entities in the system.

The information model is managed by the Information Model Management service (IMM). It is out of scope of this document to describe the IMM. For more information, refer to AIS IMM Specification at http://www.saforum.org.

Most SA-Forum specified services define an information model. This is particularly true for the AMF, which defines a rich information model to support application modeling. For more information, refer to AIS AMF Specification at http://www.saforum.org.

The IMM supports administrative operations, which can be seen as Remote Procedure Calls on an object in the model. For example, an operator can stop an application because it is about to be upgraded.

An application can also use the IMM to store its specific configuration data, so that it can be configured and managed in the way the SA Forum intends.

1.11   Redundancy Model

The AMF provides the concept of redundancy models. The redundancy model helps the AMF to keep the application service available per its requirements.

Historically, telecommunications applications have been designed to have standby entities. The AMF supports those types of applications by providing redundancy models that include standby workload assignments. Other SA-Forum services, such as the Checkpoint service (CKPT), provide means to make standby entities "warmer", that is, more ready to take over an active assignment. The CKPT enables an application to replicate its state data.

By leveraging the separation of program and workload, the AMF can manage many instances of a program and transfer the active workload from a non-operational program to an operational program.

The following redundancy models are defined: 2N, N+M, N-way, N-way active, and no redundancy.

For more information, refer to AIS AMF Specification at http://www.saforum.org.

1.12   Administrative Operations

The AMF model specifies quite a few administrative operations defined for certain entities. The AMF is the implementer of such a call with help and cooperation from the affected application component or components.

Examples of administrative operations are LOCK and UNLOCK for workload management, but other operations also exist.

Administrative operations are needed so that an operator can communicate with and control the AMF. For example, upgrading an AMF application without involving the AMF causes the AMF to consider the application to have failed.

Administrative operations are used by an operator or more likely a management program acting on behalf of an operator at a network management system.

One example of the latter is software management. When a program is upgraded, it is locked, updated, and finally unlocked again.
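
The lock, update, unlock sequence can be sketched as follows. The class ManagedSU, its admin_operation method, and the version strings are hypothetical. The sketch only models the ordering constraint and does not use the IMM or AMF APIs.

```python
class ManagedSU:
    """Hypothetical managed object with LOCK and UNLOCK administrative operations."""

    def __init__(self, name, version):
        self.name = name
        self.version = version
        self.admin_state = "UNLOCKED"
        self.history = []

    def admin_operation(self, operation):
        if operation == "LOCK":
            self.admin_state = "LOCKED"    # workload is withdrawn first
        elif operation == "UNLOCK":
            self.admin_state = "UNLOCKED"  # workload can be assigned again
        else:
            raise ValueError(f"unsupported operation: {operation}")
        self.history.append(self.admin_state)

    def update(self, new_version):
        # Updating while unlocked would look like a failure to the AMF.
        if self.admin_state != "LOCKED":
            raise RuntimeError("lock the entity before updating it")
        self.version = new_version


def upgrade(su, new_version):
    """Lock, update, and unlock, as a software management program might."""
    su.admin_operation("LOCK")
    su.update(new_version)
    su.admin_operation("UNLOCK")


su1 = ManagedSU(name="SU1", version="1.0")
upgrade(su1, "1.1")
```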

For more information, refer to AIS AMF Specification at http://www.saforum.org.