Developing the SNMP Interface - Configuration and Administration

The table below documents the SNMP interface. All headings map directly to the bind variables in the MIB file. Use the Event Source Type, Metric, Event Severity, and Event Message fields to inform SNMP development efforts. Some alerts (e.g., Num Good Proxy Errors, Number of GEMS User Events) are generated based on a number or percentage of events detected in the logs of the source system (e.g., Good Proxy logs). These are designed to capture repeated errors, high percentages of errors, percentages of users with errors, and other aggregations that may indicate a problem. When these aggregate alerts are generated, the resulting email and SNMP trap also contain a concatenated string of all of the events that triggered the alerts.
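As a rough illustration of how a trap receiver might model one row of the table, the sketch below keeps the four documented fields together with the list of triggering events that aggregate alerts carry. The class name, attribute names, and the "; " separator are assumptions made for this example; the actual bind variable names and the concatenation format come from the MIB file and the product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrapRecord:
    """One notification row. Attribute names are hypothetical and only
    mirror the table headings (Event Source Type, Metric, Event Severity,
    Event Message)."""
    event_source_type: str
    metric: str
    event_severity: str
    event_message: str
    # Individual log events behind an aggregate alert (empty for simple alerts).
    triggering_events: List[str] = field(default_factory=list)

    def full_message(self, separator: str = "; ") -> str:
        """Return the message, followed by the concatenated triggering events.

        The separator is an assumption; the document only says the events
        are concatenated into one string.
        """
        if not self.triggering_events:
            return self.event_message
        return self.event_message + " " + separator.join(self.triggering_events)
```

A simple alert leaves `triggering_events` empty, so `full_message()` returns the event message unchanged.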

Sample Notification

Good Application Group

Total App Requests

Critical The <ApplicationName> application has not received any application requests for the last <numZeroRqstMinutes> minutes. This may indicate a decrease in app activity or an availability issue. This server normally receives <avgAppRequests> requests each 3 minute sample at this hour.

Possible Action: Ensure that network connectivity between the Good Proxy servers and external networks is functioning properly.

Warning The <ApplicationName> application has received fewer application requests than normal. This may indicate a decrease in app activity or an availability issue. This server normally receives <avgAppRequests> requests each 3 minute sample at this hour. The last 3 minute sample showed only <TotalAppRequests> requests.

Possible Action: Ensure that network connectivity between the Good Proxy and external networks is functioning properly.
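The two Total App Requests messages above suggest a two-step check: a run of zero-request samples raises Critical, and a last sample below the normal range raises Warning. The sketch below is one possible reading of that logic, not the product's rule. The function name, the `abnormal_low` bound, and the way zero minutes are counted from 3 minute samples are all assumptions.

```python
from typing import List, Optional


def classify_app_requests(samples: List[int],
                          abnormal_low: float,
                          zero_minutes: int,
                          sample_minutes: int = 3) -> Optional[str]:
    """Classify request counts taken once per sample period (oldest first).

    abnormal_low is a hypothetical lower bound of the learned normal range;
    zero_minutes plays the role of <numZeroRqstMinutes>.
    """
    if not samples:
        return None

    # Count consecutive zero-request samples at the end of the window.
    zero_run = 0
    for count in reversed(samples):
        if count != 0:
            break
        zero_run += 1

    if zero_run * sample_minutes >= zero_minutes:
        return "Critical"
    if samples[-1] < abnormal_low:
        return "Warning"
    return None
```

For example, five trailing zero samples cover 15 minutes, so with `zero_minutes=15` the result is "Critical", while a single low but non-zero sample yields "Warning".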

Good Proxy Cluster Group

Total App Requests

Critical The Good Proxy cluster <GPSClusterName> has not received any application requests for the last <numZeroRqstMinutes> minutes. This may indicate a decrease in app activity or an availability issue. This cluster normally receives <avgAppRequests> requests each 3 minute sample at this hour.

Possible Action: Ensure that network connectivity between the Good Proxy servers and external networks is functioning properly. For Good Proxy servers configured for Direct Connect, please see the GP Direct Connect guide.

Warning The Good Proxy cluster <GPSClusterName> has received fewer application requests than normal for this hour in the last <numAnblRqstMinutes> minutes. This may indicate a decrease in app activity or an availability issue. This cluster normally receives <abnlAppRequests> requests each 3 minute sample at this hour. The last 3 minute sample showed only <TotalAppRequests> requests.

Warning 1 of <NumOfTotalGPS> servers in the proxy cluster <GPSClusterName> has been unavailable for <nMinutes> minutes. The affected server is <GPSList> . Please check the details of the affected server for possible actions.

Unavailable The only server in the Good Proxy cluster <GPSClusterName> has been Unavailable for <nMinutes> minutes. The affected server is <GPSList> . Please check the details of the affected server for possible actions.

Warning The Good Proxy cluster <GPSClusterName> does not contain any member servers.
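The event messages in this table are templates whose angle-bracket tokens are replaced with live values. When testing a trap receiver it can help to render sample messages locally. The sketch below shows one way to do that. The `render` helper and the sample values are invented for illustration and are not part of Good MSM.

```python
import re

# Matches tokens such as <GPSClusterName> or <App Server List>.
PLACEHOLDER = re.compile(r"<\s*([A-Za-z_][A-Za-z0-9_ ]*?)\s*>")


def render(template: str, values: dict) -> str:
    """Replace each <Name> token with values[Name], if a value is supplied."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a value.
        return str(values[name]) if name in values else match.group(0)
    return PLACEHOLDER.sub(substitute, template)


template = ("1 of <NumOfTotalGPS> servers in the proxy cluster "
            "<GPSClusterName> has been unavailable for <nMinutes> minutes.")
sample = render(template, {"NumOfTotalGPS": 4,
                           "GPSClusterName": "gp-east",
                           "nMinutes": 9})
print(sample)
```

Leaving unmatched tokens in place makes it easy to spot a placeholder for which the test data has no value.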

Good Proxy Server Group

Proxy Server Host Availability

Critical Good Proxy host <GoodProxyServerHost> has not been reached for <numDownMinutes> minutes. This will increase the load on other Good Proxy servers and may impact response times and service quality. Possible Action: Ensure that the Good Proxy host is up and connected to the network. If the host is available, ensure that there are no network issues or firewall rules preventing Good MSM from reaching the host via WMI.

Proxy Server Window Service Availability

Critical Good Proxy Server is not running on <GoodProxyServerHost> . This will increase the load on other Good Proxy servers and may impact response times and service quality. Possible Action: Ensure that the Good Proxy Server service is running on the host. If the service is not running or does not stay running, check the Application log in Windows Event Viewer for errors related to the Good Proxy Server service. Ensure that anti-virus services are not preventing the Good Proxy Server service from starting.

Escalated_Proxy_Log_Errors

Combined Log Events

Failed App Server Connection In Percent

Critical [...] or exceeds the critical threshold of <criticalThreshold> %.

The application servers with the highest number of connection errors occurring through this Proxy Server are (Application Server, Error Count for Past 15 minutes, Last Error) : <App Server List> .

Possible Action: Have the application owner for the affected application servers check that they are running correctly. If the application servers are running correctly, check that network connectivity between the Good Proxy servers and application servers is working correctly.

Warning The Good Proxy <GoodProxyServerHost> has experienced a high percentage of application server connection errors for the last <numMinutes> minutes. These errors may result in app service errors for users connecting to these application servers. <FailedAppServerConnectionsInPercent> % of <TotalAppServerConnections> connection attempts have failed which meets or exceeds the warning threshold of <warningThreshold> %.

The application servers with the highest number of connection errors occurring through this Proxy Server are (Application Server, Error Count for Past 15 minutes, Last Error) : <App Server List> .
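The connection error alerts compare a failure percentage against a warning and a critical threshold, firing when the value meets or exceeds the threshold. A minimal sketch of that comparison follows. The function and parameter names are assumptions, and the thresholds themselves are configured in the product.

```python
from typing import Optional, Tuple


def failed_connection_severity(failed: int,
                               total: int,
                               warning_pct: float,
                               critical_pct: float) -> Tuple[Optional[str], float]:
    """Return (severity, failed percentage) for one evaluation window."""
    if total <= 0:
        # No connection attempts, so no percentage can be computed.
        return None, 0.0
    pct = 100.0 * failed / total
    # "Meets or exceeds" is modelled with >=; check the higher threshold first.
    if pct >= critical_pct:
        return "Critical", pct
    if pct >= warning_pct:
        return "Warning", pct
    return None, pct
```

Here `pct` corresponds to <FailedAppServerConnectionsInPercent> and `total` to <TotalAppServerConnections> in the Warning message.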

TotalAppRequests

Critical The Good Proxy <GoodProxyServerHost> has received fewer application requests than normal for the last <numAnblRqstMinutes> minutes. This server normally receives an average of <abnlAppRequests> requests every 3 minute sample at this hour of the day. The last 3 minute sample showed only <TotalAppRequests> requests.

Possible Action: Ensure that network connectivity between the Good Proxy and external networks is functioning properly. For Good Proxy servers configured for Direct Connect, please refer to the GD Direct Connect documentation for detailed configuration guidance.

Warning The Good Proxy <GoodProxyServerHost> has received no application requests in the last <numZeroRqstMinutes> minutes. This server normally receives an average of <avgAppRequests> requests every 3 minute sample at this hour of the day. Possible Action: Ensure that network connectivity between the Good Proxy and external networks is functioning properly. For Good Proxy servers configured for Direct Connect, please refer to the GD Direct Connect documentation for detailed configuration guidance.

Most Critical Event

Combined User Events

Good for Domino Summary Group

High Availability/Monitor Primary GMM

Critical The GMM Server <GroupName> has failed over from the primary machine <GoodServerHost> to the standby machine <GoodStandbyServerHost>.

Primary Services Availability/NumServiceAvailabilityErrors

Most Critical Event

The Service Status for <GroupName> at <GoodServerHost> :

Standby Services Availability/NumServiceAvailabilityErrors

Most Critical Event

The Service Status for <GroupName> at <GoodStandbyServerHost> :

Escalated_GMM_Log_Errors

Most Critical Event

Combined Server Events

GDToHHFlows

Unavailable The number of mail flows from Good Server <GroupName> at host <GoodServerHost> to handheld has been 0 consecutively for at least <warningThreshold> samples, and also it has been below normal baseline range consecutively for at least <abnl_warningThrshld> samples, while the average amount of flows at this hour is <avgFlowPerHour>.

Critical The GMM Server <GroupName> has consistently been running at high RAM value ( <GMMS_RAM_ToDangerLevel> % of <GMMS_RAM_DangerLevel> MB) for the last <numMinutes> minutes. When GMMS is running at <GMMS_RAM_DangerLevel> MB, it is considered critical.

Standby GMMS Memory/GMMSProcTotalMem

Primary Host Availability/GoodServerHostAvailable

Unavailable The Good Server host <GoodServerHost> has not been reachable for <GoodServerHostDownCount> samples. The host may be unavailable, or a network issue may be preventing Good MSM from reaching the host.

Standby Host Availability/Good Server Host Available

Unavailable The Good Server host <GoodStandbyServerHost> has not been reachable for <StandbyHostDownCount> samples. The host may be unavailable, or a network issue may be preventing Good MSM from reaching the host.

Good for Exchange Summary Group

GDToHHFlows

Unavailable The GMMS <GroupName> at <GoodServerHost> is running and accessible, but Good MSM has detected no device sync activity for the last <thresholdMinutes> minutes. Normally <thresholdFlows> sync activities would be expected in this amount of time. For this hour of the day, Good MSM has learned that an average of <avgFlows> sync activities usually occur every 3 minutes. For an individual 3 minute sample period, Good MSM has observed that flows normally range from <abnlFlows> per sample up to <abnhFlows> per sample.

Critical The GMMS <GroupName> at <GoodServerHost> is running and accessible, but Good MSM has detected no device sync activity for the last <thresholdMinutes> minutes. Normally

GEMS User Errors

Most critical event

Combined User Events

GEMS Group

GEMSWindowServicesAvailable

Unavailable GEMS host <GEMSHost> has not been reached for <num_down_minutes> minutes. Possible Action: Ensure that the GEMS host is up and connected to the network. If the host is available, ensure that there are no network issues or firewall rules preventing Good MSM from reaching the host via WMI.

Critical The Windows service <Caption> is not running on <GEMSHost> . Possible Action: Ensure that the service is running on the host. If the service is not running or does not stay running, check the Application log in Windows Event Viewer for errors related to this service. Ensure that anti-virus services are not preventing the service from starting.

CAS Events

CAS Summary Group

CAS Host Availability

Unavailable The server <CASHost> has not been reachable for <CASHostDownCount> samples. The host may be unavailable, or a network issue may be preventing MSM from reaching the host.

Escalated_CAS_

Warning The number of users on the ActiveSync Mailbox Exchange Server GroupName with service errors has exceeded the warning threshold of warningThreshold. Here is the list of errors and number of users with that error: <user list>

CASUserGroup

ActiveSync_User_Errors

Most Critical Event

Combined User Events

BES Events

BES User Domino Support

Number of User Events

Most Critical Event

Combined User Events

HHFlashFreeMB

Warning The user's BlackBerry smartphone is low on available memory ( HHFlashFreeMemory / 1048576.0 of memory available). As a smartphone runs low on memory (approximately 1.4 MB of free memory), it may begin deleting out-of-date calendar entries, messages, and call logs. Low memory may also slow the performance and responsiveness of the BlackBerry smartphone. Memory is consumed by media files (e.g., pictures, music, video) and applications, as well as your data (messages, calendar entries, and contacts). To free additional memory on the user's smartphone, instruct the user to transfer media files to a media card or delete any unneeded media files, remove any rarely used applications, and/or purge older messages and calendar entries. For step-by-step instructions, look up this user in the Good MSM Help Desk.
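The expression HHFlashFreeMemory / 1048576.0 in the message above is a unit conversion: 1048576 is 1024 * 1024, so dividing by it turns a value in bytes into megabytes. The sketch below reproduces that arithmetic. That the raw value is in bytes is inferred from the divisor, and the function names and one-decimal formatting are assumptions for this example.

```python
BYTES_PER_MB = 1048576.0  # 1024 * 1024


def free_memory_mb(hh_flash_free_memory: int) -> float:
    """Convert the raw free-memory value (assumed to be bytes) to megabytes."""
    return hh_flash_free_memory / BYTES_PER_MB


def format_free_memory(hh_flash_free_memory: int) -> str:
    """Render the value for display, e.g. in a received trap message."""
    return f"{free_memory_mb(hh_flash_free_memory):.1f} MB"


print(format_free_memory(1572864))
```

A receiver that stores the raw value can apply the same conversion before comparing it with the roughly 1.4 MB level mentioned in the message.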

BES User Exchange Support

Number of User Events Most Critical Event

Combined User Events

HHFlashFreeMB

BESDominoSummaryGroup

BESHostAvailable

Unavailable The BES host BESHost has not been reachable for 0 samples. The BES host may be unavailable or a network issue may be preventing Good MSM from reaching the BES host.

Critical The BES host BESHost has not been reachable for BESHostDownCnt> 1 samples. The BES host may be unavailable or a network issue may be preventing Good MSM from reaching the BES host.

Unavailable The Standby BES host StandbyBESHost has not been reachable for 0 samples. The Standby BES host may be unavailable or a network issue may be preventing Good MSM from reaching the Standby BES host.

Critical The Standby BES host StandbyBESHost has not been reachable for StandbyBESHostDownCnt <> 1 samples. The Standby BES host may be unavailable or a network issue may be preventing Good MSM from reaching the Standby BES host.

Warning The number of mail flows from BES <GroupName> at <ActiveBESHost> to handheld has been 0 consecutively for at least <warningThreshold> samples, and also it has been below normal baseline range consecutively for at least <abnl_warningThrshld> samples, while the average amount of flows at this hour is <avgFlowPerHour> .

Escalated_BES_Log_Errors

Most Critical Event

Combined server events

Escalated_SRP_Log_Errors

Unavailable strDownErrs <tmpStr>

Critical strCriticalErrs <tmpStr>

Warning strWarnErrs <tmpStr>

Licenses Remaining

Critical You currently have <LicensesRemain> licenses remaining which is less than the Critical threshold of <criticalThrsld> . You currently are using <license_used> CALs from a total pool of <license_total> .

Warning You currently have <LicensesRemain> licenses remaining which is less than the Warning threshold of <warningThrsld> . You currently are using <license_used> CALs from a total pool of <license_total> .
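Both Licenses Remaining messages fire when the remaining count is less than a threshold. The sketch below shows that comparison, under the assumption that the remaining count equals the total pool minus the licenses in use. The document does not state how <LicensesRemain> is derived, and the function name is hypothetical.

```python
from typing import Optional, Tuple


def license_severity(license_total: int,
                     license_used: int,
                     warning_threshold: int,
                     critical_threshold: int) -> Tuple[Optional[str], int]:
    """Return (severity, licenses remaining).

    Assumes remaining = total - used, which the source does not confirm.
    """
    remaining = license_total - license_used
    # The messages say "less than", so a value equal to a threshold does not fire.
    if remaining < critical_threshold:
        return "Critical", remaining
    if remaining < warning_threshold:
        return "Warning", remaining
    return None, remaining
```

With a pool of 500 CALs, a warning threshold of 25, and a critical threshold of 10, using 480 licenses leaves 20 and yields "Warning".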

NumBESMachines/High Availability

Warning For an HA BES deployment there should be two BES hosts. For non-HA there should be one primary BES host. However, we detected NumRows(HABESTable) BES hosts. The HA rule will not fire.

NumLogLinesPerUser

Abnormally large amounts of log lines may indicate that a problem is occurring on the BES that is causing excessive activity or error rates. Utilize the Good MSM consoles to investigate further.

Critical For the last num min minutes BES <GroupName> at <ActiveBESHost> has generated an abnormally high amount of log lines - more than <LoggingRateCritical> times the expected amount. In the last sample <NumOfLogLines> Good MSM monitored log lines have been generated at an average rate of <LogLinesPerUser> log lines per user. Abnormally large amounts of log lines may indicate that a problem is occurring on the BES that is causing excessive activity or error rates. Utilize the Good MSM consoles to investigate further.

Warning BES <GroupName> at <ActiveBESHost> has not generated any Good MSM monitored log lines for the last <thresholdMins> minutes. Normally <thresholdLogLines> monitored log lines would be expected in this amount of time. For this hour of the day, Good MSM has learned that an average of <totalLogLinesPerSample> monitored log lines usually occur every 3 minutes, and it can normally range from <abnlLogLinesPerSmpl> per sample up to <abnhLogLinesPerSmpl> per sample. Good MSM generally monitors for message flows, status indicators and errors but has not detected any of these expected log lines. This may indicate that the BES is not providing any service to its end users. Utilize the Good MSM consoles to investigate further.

NumServiceAvailabilityErrors

Most Critical Event

The Service Status for BES <GroupName> at <ActiveBESHost> :

NumServiceAvailabilityErrors-Standby Services Availability

Most Critical Event

The Service Status for BES <GroupName> at <ActiveBESHost> :

NumUsersWithHungThread

Critical criticalThreshold or more users on BES ‘GroupName’ at ‘ActiveBESHost’ have hung threads for longer than AdjustableTable[0, HungThreadDurationThrshld] minutes. [These users are: UserList]

Warning The number of users on BES ‘GroupName’ whose Hung Thread Duration is greater than groupThreshold minutes is above the warning threshold of AdjustableTable[0, HungThreadCountWarning] users. [These users are: UserList]

PercentUsersWithMsgPendingCount

Critical <CurrUserCount> * <PercentUsersWithMsgPendingCount> / 100.0 of <CurrUserCount> total users on BES <Name> at <ActiveBESHost> with Message Pending Counts higher than <groupThreshold> is above the baselined threshold of <pvalue> percent

Warning <NumUsersWithHighPendingCount> of <CurrUserCount> total users on BES <Name> at <ActiveBESHost> with Message Pending Counts higher than AdjustableTable[0, <MsgPendingCntThrshld>] is above the baselined threshold of <pvalue> percent.

Critical <NumUsersWithHighPendingCount> of <CurrUserCount> total users on BES <Name> at <ActiveBESHost> with Message Pending Counts higher than <groupThreshold> is above the baselined threshold of 100.0 * <numUserPendMsgCntCritical> / <CurrUserCount> percent.

Warning <NumUsersWithHighPendingCount> of <CurrUserCount> total users on BES <GroupName> at <ActiveBESHost> with Message Pending Counts higher than <groupThreshold> is above the baselined threshold of 100.0 * <numUserPendMsgCntWarning> / <CurrUserCount> percent.
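The PercentUsersWithMsgPendingCount messages embed two small conversions: <CurrUserCount> * <PercentUsersWithMsgPendingCount> / 100.0 turns a percentage into a user count, and 100.0 * <numUserPendMsgCntCritical> / <CurrUserCount> turns a user-count threshold into a percentage. The sketch below restates that arithmetic. The function names are hypothetical and the zero-user guard is an addition for safety.

```python
def users_from_percent(curr_user_count: int, percent_users: float) -> float:
    """Number of users represented by a percentage of the current user count."""
    return curr_user_count * percent_users / 100.0


def percent_from_users(num_users: int, curr_user_count: int) -> float:
    """Express a user-count threshold as a percentage of the current user count."""
    if curr_user_count == 0:
        # Guard added for this sketch; the source does not cover this case.
        return 0.0
    return 100.0 * num_users / curr_user_count
```

For a BES with 200 users, 12.5 percent corresponds to 25 users, and a threshold of 25 users corresponds to 12.5 percent.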

TotalMessagesPending

Critical The total message pending for <BESHost> is regularly above the critical threshold of <TotalMsgsPendingCritical>. This could be an indicator of a wireless carrier failure (affecting many users at once), a RIM service failure, a network failure causing an SRP connect failure between the BES Server and RIM, a problem with the BES SQL Server database, or hung worker thread(s) causing delays in message delivery and eventually a BES Messaging Agent restart on the BES Server. Check (1) whether the RIM Service is up (e.g., ping srp.na.blackberry.net), (2) whether a large number of users on the same wireless carrier are down, (3) whether the SRP connection between the BES server and RIM and (4) the SQL connection are up, by going to the BlackBerry Server Configuration console BlackBerry Router tab and clicking the Test Network Connection button, and then the Database Connectivity tab and clicking the Test SQL Server Connection button, and (5) look in the BES server Messaging Agent logfile for log entries with the phrase No Response for a specific worker thread and a specific user name.

Warning The total message pending for <BESHost> is regularly above the critical threshold of <TotalMsgsPendingCritical>. This could be an indicator of a wireless carrier failure (affecting many
