Fault Tolerance/Failover

Document ID : KB000053269
Last Modified Date : 14/02/2018

Description:

Typically, failover (the switch from a primary system to a backup system) is configured to occur automatically, while "fall back" (the return to the original system) is designated as a manual task. This ensures that the problem that triggered the failover is properly identified and corrected before functionality is returned to the original machine.

This section includes links to a number of documents that provide useful insights and best practices for fault tolerance across BSO and ESM products. While Unicenter NSM includes specific technology to deliver fault tolerance, other r11 products (e.g., Unicenter Service Desk, UAPM, Unicenter DSM) can apply fault tolerance best practices to their management components. Where best practices exist, a link is provided.

Solution:

Working with Microsoft Clusters
For a SQL MDB, the SQL database engine must be launched by the Cluster Administrator and not by the Windows Service Control Manager. When Ingres is used for the MDB, this also applies to the Ingres Intelligent Database service. When Enterprise Management's CA Unicenter service is run in High Availability mode, it must also be launched by the Cluster Administrator. This enables cluster-controlled services to switch together from one node to another during a failover event.
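
One way to confirm that a service is not set to start on its own through the Service Control Manager is to inspect the START_TYPE line reported by the Windows `sc qc <service>` command. The sketch below only parses text that you have already captured from that command. It rests on an assumption that is not stated in this document: that a cluster-controlled service is normally configured for manual (DEMAND_START) startup. The function names are hypothetical.

```python
import re
from typing import Optional

# Matches the START_TYPE line of `sc qc` output, for example:
#   START_TYPE         : 3   DEMAND_START
_START_TYPE = re.compile(r"START_TYPE\s*:\s*\d+\s+(\w+)")


def start_type(sc_qc_output: str) -> Optional[str]:
    """Return the start type name found in `sc qc` output, or None if absent."""
    match = _START_TYPE.search(sc_qc_output)
    return match.group(1) if match else None


def starts_on_demand(sc_qc_output: str) -> bool:
    """True when the service is configured for manual (demand) start.

    Assumption: a service that the cluster launches should not be set to
    AUTO_START, so DEMAND_START is treated as the expected value.
    """
    return start_type(sc_qc_output) == "DEMAND_START"
```

Pass the captured output of `sc qc` for the relevant service (the actual service names depend on your installation) to `starts_on_demand` on each node.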

In highly available installations, the MDB data files reside on the cluster shared disk instead of their regular location. When an r11 Manager and MDB fail over from one node to the next, the shared disk (containing the MDB data files) becomes available to the new node and the r11 Manager and MDB server on the new node pick up where they left off on the old node.

Click here for an overview of Highly Available Unicenter and BSO solutions.
Click here for a document which summarizes options for deploying a shared HA MDB in an MSCS environment.
The following presentations detail Best Practice Recommendations for deploying in a Microsoft Cluster Server (MSCS) environment:

For Unicenter NSM r11.0 (Ingres):

For Unicenter NSM r11.1 (MS-SQL):

For information about installing non-HA Unicenter NSM components in an MSCS environment, click here (for PowerPoint presentation) or here (for PDF document). Click here to download the NSM Resource Cluster kit (nsmCluster) which is referenced in these documents and can be used to expedite creation of cluster resources for non-HA components.

Note: Additional discussions regarding High Availability for Unicenter NSM can be found in the Systems Management Greenbook which is available through the following link:
https://support.ca.com/irj/portal/anonymous/phpdocs?filePath=0/common/greenbooks/NSM_SystemsManagementGreenBook_ENU.pdf.

Unicenter Desktop and Server Management (DSM) r11.2 supports clustering when DSM itself is installed on the cluster. However, if you install the MDB with CCS from the DSM media on a remote SQL cluster, the CCS install will fail on the second node. If you must use a clustered remote MDB, install the MDB and the CCS Worldview Manager component from the CA Network and System Management (NSM) r11 media. Additional details are provided in Techdoc # TEC434407.

For Unicenter Service Desk MSCS r11.1 and r11.2 (MS-SQL):

In addition, click here to download the usdCluster.zip file referenced in both the first presentation and the PDF file. Click here to download the usdPSCluster.zip file referenced in the third presentation.

Note: Additional High Availability discussions for Unicenter Service Desk are provided in the Incident and Problem Management Greenbook (v1.1). This document is available through the following link:
https://support.ca.com/irj/portal/anonymous/phpdocs?filePath=0/common/greenbooks/Incident_and_Problem_Management_Green_Book_113007.pdf.

Note that ahd.dll is not Best Practice and will not work in a cluster.
If you are using USD r11.2, you should be aware that a change to the install behavior may generate the following error:
Cannot load JDBC driver class 'com.ca.common.EncodedPwDriver'

This occurs because, as part of the install process, a simple check is made to determine whether the pm.xml and wl.xml files exist in CATALINA_BASE\webapps. If they are not detected, the install copies the epdc.jar file from CATALINA_BASE\common\lib to the tomcat\4.1.31\common\lib directory in Shared Components. This works for the installation on the first node in the cluster. On subsequent nodes, however, the install detects that the files already exist and performs an upgrade rather than an install. Since the epdc.jar copy step is not included in an upgrade, the above error is generated.

To prevent this, manually copy the epdc.jar file to the Shared Components\tomcat\4.1.31\common\lib directory before starting the service on subsequent cluster nodes.
Note that this caveat applies only to USD r11.2; it does not affect r11.1.
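
The manual copy step above can also be scripted. The following is a minimal sketch, not a supported utility: the helper name `ensure_epdc_jar` is hypothetical, and the two directory arguments must be supplied by you, since the actual locations of CATALINA_BASE and Shared Components depend on the installation.

```python
import shutil
from pathlib import Path


def ensure_epdc_jar(source_lib: Path, shared_lib: Path,
                    jar_name: str = "epdc.jar") -> bool:
    """Copy jar_name from source_lib into shared_lib if it is missing there.

    source_lib: the CATALINA_BASE common\\lib directory (caller supplies the path).
    shared_lib: the Shared Components tomcat common\\lib directory
                (caller supplies the path).
    Returns True if the file was copied, False if it was already present.
    """
    source = source_lib / jar_name
    target = shared_lib / jar_name
    if target.exists():
        return False  # nothing to do; the jar is already in place
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found")
    if not shared_lib.is_dir():
        # Refuse to create the directory: a missing target usually means
        # the path was entered incorrectly.
        raise NotADirectoryError(f"{shared_lib} is not a directory")
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return True
```

Run it on each subsequent cluster node before starting the service, passing the two directories as `Path` objects. The function does nothing if the jar is already present, so it is safe to run more than once.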

For Unicenter Asset Portfolio Management (UAPM) r11.2:

The second UAPM presentation references the uAPM Cluster Resource Kit (uAPMCluster.zip) which can be downloaded by clicking here.
Doc versions of these presentations are available in PDF format by clicking here for the Optional Components steps and here for the UAPM Manager Components steps.
Note: The Unicenter Asset Portfolio Management presentations are currently in draft status. Additional presentations and updates to the existing presentations will be posted when they are available.

HAS Compliance
The High Availability Service (HAS) is provided as part of CA Common Services to support fault tolerant functionality when running Unicenter in a cluster environment. HAS includes support for multiple node clients and graceful failover between client nodes.
HAS is already included in products such as Unicenter Management NSM r11, but not all components are HAS compliant yet. Components that are not yet compliant must be installed in a non-cluster environment. HAS is also available for the NSM 3.x release and is installed automatically by NSM 3.x setup when a HAS compliant solution is installed. For NSM 3.1, Microsoft SQL Agent (A2), Exchange Agent, and Job Management Option (JMO) are HAS compliant. JMO requires test maintenance for HAS support. For NSM release 3.x, the unicluster package, a collection of field-developed utilities, provides a simplified approach to implementing Unicenter NSM 3.x in a cluster environment. It also integrates with HAS. For further information, review the documents provided in the unicluster zip file.

Additional information regarding high availability with Unicenter Service Desk can be found in the product Implementation Guide.

BrightStor High Availability Considerations
BrightStor High Availability (BHA) provides an alternative HA solution for the Windows platform. It performs selective Data Replication from one or more Windows 2000/2003 server(s) to one Windows 2000/2003 server and is entirely software based.
When deciding between BHA and cluster-based solutions, there are several considerations to keep in mind. In general, BHA is the less expensive option. It does not require special hardware and has no stringent O/S requirements such as Windows 2003 Enterprise Edition. If no HA tools (e.g., MSCS) are currently in use, or if failover is required for geographically separate servers, then BHA should be reviewed.
On the other hand, if Unicenter NSM is already installed in an MSCS environment, then BHA should not be considered. Also, since BHA requires more expertise in its setup, it should generally be used only by trained staff or in conjunction with professional services.

MTTR (Mean Time To Recover) is typically more favorable for MSCS.
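
The considerations above can be summarized as a small decision helper. This is only an illustrative sketch of the guidance in this section, not a sizing tool: the `Site` fields, the function name, and the returned strings are all hypothetical, and cost and MTTR still need to be weighed separately.

```python
from dataclasses import dataclass


@dataclass
class Site:
    # Hypothetical field names that mirror the considerations in the text.
    nsm_already_on_mscs: bool
    ha_tools_in_use: bool
    geographically_separate_servers: bool
    trained_staff_available: bool


def suggest_ha_option(site: Site) -> str:
    """Return a rough suggestion based on the considerations in this section."""
    # NSM already running in an MSCS environment: BHA should not be considered.
    if site.nsm_already_on_mscs:
        return "stay with MSCS"
    # No HA tooling in use, or geographically separate servers: review BHA.
    if not site.ha_tools_in_use or site.geographically_separate_servers:
        if site.trained_staff_available:
            return "review BHA"
        # BHA setup needs more expertise, so involve professional services.
        return "review BHA with professional services"
    # The text gives no explicit recommendation for the remaining case.
    return "no specific guidance"
```

For example, a site with no HA tools in use and no trained staff would get "review BHA with professional services".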
Click here for an overview presentation discussing BrightStor High Availability and Unicenter NSM r11.x (overview). Further details, including installation and configuration procedures, are provided in the BrightStor High Availability and Unicenter NSM r11 presentation. The corresponding presentation guidelines for Unicenter NSM r11.1 are in review and will be provided in a future update.

Other Considerations (pre-r11)
Following are additional topics for your review. Keep in mind, however, that the use of HAS and, where applicable, unicluster is considered best practice. Failover and fault tolerance practices that rely on a great deal of manual intervention should be attempted only after careful consideration and consultation with CA Technical Support and Services:

Additional information regarding high availability and cluster management with Unicenter NSM can be found in "Making Components Cluster Aware and Highly Available" in the Administrator Guide.