IBM Tivoli Network Manager
3.9 Fix Pack 4 / 4.1.1
Best Practices for Network Monitoring
Version 1
Rob Clark
This edition applies to 3.9 Fix Pack 4 and 4.1.1 of IBM Tivoli Network Manager and to all subsequent releases and modifications until otherwise indicated in new editions.
© Copyright IBM Corporation 2014.
Contents
About this Guide
Chapter 1: Poller concepts and terms to get started
Policy
Poll Definitions
Multiple domains
Chapter 2: How do the various scopes work together?
How scopes are applied
Poll Definition Classes filter
Poll Definition Interface filter
Policy level filters
Chapter 3: How to get the best out of threshold events
Event ID
Event summary description
Rules file
Event enrichment in the Event Gateway
Chapter 4: Making the most of historical data
Short-term diagnostic tool
Do I need Tivoli Data Warehouse?
Storage capacity
Use of the Data Label
Chapter 5: What is adaptive polling?
Chapter 6: How many poller instances do I need?
How many?
Tips for defining multiple pollers
Using multiple pollers
Chapter 7: Are the pollers healthy?
1) Is the historical poll data table being maintained?
2) Is the poller keeping up with the policy load at the scheduled frequencies?
3) Is the poller's memory stable?
4) Is the poller successfully storing data?
5) Do I need to add a new poller?
Chapter 8: Am I pinging all the IP addresses I want?
Generate the report
The report
Chapter 9: Poller Configuration
Poller settings
For IBM Support use
Notices
Trademarks
About this Guide
Assuring the health of your network is one of the most important functions of Network Manager. This guide helps you take advantage of the full capabilities available in Network Manager to plan your polling policies and to enrich the monitoring events for your operational needs.
The Network Manager poller is very flexible and provides many options to mold it to your business needs and environment. This guide explains various concepts and provides examples to help you get the most out of using the poller to monitor your network.
IBM Tivoli Network Manager 3.9 Fix Pack 4 and 4.1.1 introduced many improvements to the poller for scalability and manageability, and this guide assumes you are using those releases or later.
Chapter 1: Poller concepts and terms to get started covers the basic concepts and terms needed to get started and provides a context that the following chapters can build on.
Chapter 2: How do the various scopes work together? explains how the scope settings on the poll definitions work with the scopes defined at the policy level. Filtering poll definitions at the class level for vendor specific MIBs, for instance, can make it easier when defining the scope at the policy level. But you should always check the class filtering at the poll definition level to avoid surprises.
Chapter 3: How to get the best out of threshold events, using examples, helps you make the most of the events, in context with the poller, to provide useful contextual information for operators.
Chapter 4: Making the most of historical data covers the how and why on storing data for following trends or reporting on top offenders for diagnosing network problems.
Chapter 5: What is adaptive polling? explains how you can take advantage of using event-based network view scopes to make your polling coverage more efficient.
Chapter 6: How many poller instances do I need? looks at the best practice setup for pollers to maximize their capability.
Chapter 7: Are the pollers healthy? looks at using the new health metrics to quickly understand how the pollers are running. You can use these graphs to monitor load conditions and decide when to set up additional poller instances.
Chapter 8: Am I pinging all the IP addresses I want? looks at troubleshooting your ping polls. It uses a supplied list of IP addresses that you are responsible for monitoring for availability. It shows you how to ensure the poller is actually polling all of them for your peace of mind, and if it isn't, why not, so you can fix it.
Chapter 9: Poller Configuration provides insights into the best practices for configuring the poller processes.
Chapter 1: Poller concepts and terms to get started
This chapter explains the Network Manager monitoring policy concepts that you can use right away to get started, and also provides a platform for the other chapters to build on. Refer to the Knowledge Center for IBM Tivoli Network Manager for full documentation on monitoring and the poller:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_monitornw.html
Setting up the polling policies is fairly straightforward, especially if your needs are simple. But it will be useful to point out some options and best practices along the way.
Policy
A policy is a package that describes a set of devices and the data to poll for. You assign a set of devices to a policy and then add one or more poll definitions that describe what data will be polled and its threshold condition and details of the alert. For each poll definition, you set the polling frequency and whether to store the data. Then assign the policy to a specific poller.
Availability
Start by setting up your availability assurance. Apply a policy to all devices that will test for availability using some or all of the Chassis ping, Interface ping, and End Node ping polls. Using both the Chassis and Interface Pings will allow Root Cause Analysis (RCA) to correlate a Chassis ping failure as root cause for interface ping failures.
By default, the Default Chassis Ping and Default Interface Ping polls are set up to poll all the network device classes and the End Node Ping only pings the devices classified as end nodes, that is, belonging to the classes under the EndNode superclass (AIX, Linux, NoSNMPAccess, Sun, Windows, and so on). This class-based scope is configured on the poll definition, not usually on the policy. See Chapter 2: How do the various scopes work together?
Port Link State
Create another policy for all the switches and use the SNMP Link State poll definition to test the ifOperStatus and ifAdminStatus of all ports and send an alert on state changes. Unlike all the other poll definitions, the poller will only send an alert on state changes – the event count does not increase for these events. For all other poll definitions, the poller sends an alert for each poll that breaches the threshold condition and lets Netcool/OMNIbus perform de-duplication to simply increase the count on that event.
Other SNMP polls
Other standard polls to consider are interface bandwidth usage, errors, and discards, and also memory and CPU usage for the network devices. If the same thresholds can apply to all devices, then this task is straightforward. Otherwise, copy the poll definitions, edit the threshold conditions, and assign them to separate policies with appropriate scopes.
Poll Definitions
Poll definitions describe the data to collect, the threshold and clear conditions, the description and severity of the threshold event, and some class and
interface filtering capability that is covered in the next chapter. This allows you to target a poll definition to a specific set of classes, so that, for instance, this poll definition will collect Cisco data only from Cisco class devices. Or you might use different threshold conditions for the Cisco64xx class than other Cisco devices.
Note: Don't forget to set the event severity. By default it is zero, which is Clear. If left unchanged, the events will be removed from Netcool/OMNIbus after 2 or 3 minutes, leaving you wondering why events are not being generated!
Default event severities: 0 Clear, 1 Indeterminate, 2 Warning, 3 Minor, 4 Major, 5 Critical
Here are the main differences between the poll definition types.
Chassis Ping & Interface Ping
These poll definitions use ICMP polls. You can store the up/down value and, optionally, the ping response time and the packet loss percentage. Note that the timeout and retries can be adjusted on the poll definition Ping tab, so if you need to allow for slow links to different regions, for instance, you can create separate poll definitions with different values. Use the Copy button to duplicate poll definitions, since most attributes will be the same.
Basic Threshold
This poll definition evaluates to an integer which is used in the threshold condition. The value can also be stored in the historical database.
Generic Threshold
The threshold expression defines the data to collect and the expression evaluates to a boolean. This value cannot be stored. But you can create powerful boolean logic correlating MIB values.
However, while you can include several MIB variables in the expression, they must all be from the same MIB table – since there is no way to explicitly identify table instances from other tables.
Typically you use the Basic expression builder to construct straightforward expressions. For more complex logic expressions for Basic and Generic Thresholds, use the Advanced panel to enter the expression using the OQL
and eval statement syntax, which is covered in the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/reference/nmip_poll_syntaxforpolldefexpressions.html
SNMP Link State
This is a fixed poll definition – you cannot change its logic, which is described in the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/concept/nmip_poll_linkstate.html
One other thing to note is the initial state that the poller uses to calculate whether a change has occurred. By default, it assumes the Up state if there is no current event. This can cause a flood of events on startup for unconnected ports where the ifAdminStatus has not been set to Down. Deleting those events does not prevent another flood of events the next time the poller starts. Either set those ports to ifAdminStatus Down on the devices, or if you don't want to do that, you can configure the poller to use the initial state from the first poll if no event exists. With this second option you risk missing ports that have gone down while the poller is down for maintenance.
To set this option, edit the file,
$NCHOME/etc/precision/NcPollerSchema.domain.cfg
and set this property to 1:
update config.properties set UseFirstPollForInitialState = 1;
Multiple domains
Policies are tied to a domain, so if you want to use them in another domain, you can copy the policies (with their poll definitions) when creating a new domain, or afterwards. See the details of the domain_create.pl and get_policies.pl in the Knowledge center
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/ref/reference/nmip_ref_perlscripts.html
However, poll definitions themselves are global and can be used from any domain without copying. So use a naming convention for convenience if you design them to be domain specific.
Chapter 2: How do the various scopes work together?
When creating policies we tend to focus on the main scope definitions that are defined within the policy. However, by ignoring the Classes filter in the poll definitions, you can find that you are not polling all the devices you expected.
How scopes are applied
As you can see from the figure, for a device to be actually polled for a set of data, it must first pass through the policy filters (Network Views and the Device Filter), and then any filter defined within the poll definition for that piece of data (the Classes filter and the Interface filter).
Do not assume the Class filter is correctly set, even for default poll
definitions. Always check that the classes you want are selected. For instance, if the Cisco parent class is unchecked, that is because one of the sub-classes is also unchecked and this is easy to miss at a quick glance.
Note that the parent class itself can also contain devices – thus if the parent class itself is not selected those devices will not get polled.
If you intend not to use the class filter and let all devices through, then simply uncheck all the classes. A quick way to ensure nothing is checked is to click the Core class, which will check everything, and then click it again to clear it.
Tip: After any new AOC classes are added to the system, you must revisit the poll definitions in use and select the new classes if you want to poll those devices. Note that new classes are not selected in the filter by default, and that can cause the parent class to become unselected.
As a best practice, set up the Classes filter in the poll definition relative to the context of the data, rather than using it as part of a policy level broader scope. That will make it less confusing as you reuse poll definitions with other policies over time. For example, if you are polling a specific Cisco MIB, then use the class filter to select Cisco devices. So now your policy scope can be created based on more general criteria, such as geography or device type.
Poll Definition Interface filter
The Interface filter is used to:
• reduce the load on devices which have very large interface MIB tables when responding to SNMP requests
• prevent unimportant interfaces from generating non-interesting alerts
• reduce poller processing resources
A poll definition interface filter is effective in reducing poller processing only when the percentage of interfaces selected is small. An interface filter that only excludes a few rows of data is more resource-intensive for the poller since it queries the SNMP table one row at a time with snmpget requests, instead of the more efficient snmpgetnext table query. So a useful tip is to avoid use of the interface filter unless it reduces the number of interfaces selected considerably.
Policy level filters
Network Manager 3.9 introduced the ability to use Network Views as the scope for polling. This allows you to establish complex, but easily verified, reusable scopes for monitoring. Using the tabular view of the Network View is useful to quickly see the membership at any time. You can also use the
Monitoring reports to verify the scope based on the policy filtering – however, it does not take account of the poll definition filters.
Once you have selected the Network View or Views, you can refine the scope further with a device filter in the Device Filter tab if necessary. You can also select All Devices in the Network View tab and define a simple device filter if you don't want to create a Network View just for this policy.
Chapter 3: How to get the best out of threshold events
This chapter covers the various controls to enrich and manipulate the events generated by the poller in order to maximize the relevant information about the problem and the device for the operator.
This section requires a deeper level of knowledge of scripting, the Network Manager OQL language, and the Netcool/OMNIbus probe rules language. It provides ideas to help you tailor the events to your environment and needs.
Note: Netcool/Impact is a powerful tool that can correlate and enrich any event with data from virtually any data source. It is beyond the scope of this guide, but worth considering if you are looking to enrich events from custom tables and data sources beyond what is possible with the methods described here.
Event ID
When creating new poll definitions, decide whether you need an existing event ID or a new one. The event ID is used to identify that event for event handling throughout Network Manager and Netcool/OMNIbus. It corresponds to the EventId field within the event, such as NmosPingFail, inbandwidth, and so on. When you create a new poll definition, a new event ID is automatically assigned, but you will need to reopen the poll definition edit window to see the new ID.
If you are simply creating a variant of the poll definition, for example with a different threshold value, then use the Copy button to duplicate the existing poll definition and keep the same event ID. This will ensure the event is treated the same in terms of event enrichment, filtering, and so on through the system.
If you are creating a new poll definition for new data or expression, then use the New button so a new event ID is created and can be distinguished from the others.
Event summary description
You can build the event description for the Basic Threshold and Generic Threshold type events as part of the poll definition. You can change the descriptions to suit your operating processes. In the description, you can embed live SNMP variables (even if they are not in the poll expression itself), information from NCIM for the device, and information from the policy and poll definition. See the details, including the syntax, in the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/reference/nmip_poll_syntaxforpolldefexpressions.html
For some event types (Ping Polls, Link State, Remote Ping), the description is hard coded, but you have an opportunity to change it at the next step in the event processing, in the nco_p_ncpmonitor probe rules file. While there is no way to include the value of an expression in the description in the poll definition edit window, you can add it at the next step of the event processing – this is described in Example 3 in the Rules file section below.
Rules file
The poller sends the events to the nco_p_ncpmonitor probe, which converts them into the Netcool/OMNIbus event format. The probe has a rules file, $NCHOME/probes/<arch>/nco_p_ncpmonitor.rules, which controls the conversion and import into the Netcool/OMNIbus ObjectServer. For the syntax details, see the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHTQ_7.4.0/com.ibm.netcool_OMNIbus.doc_7.4.0/omnibus/wip/probegtwy/reference/omn_prb_proberulesfilesyntax.html
TIP: Customizing this file is fully documented, but you should understand the logic carefully so that your changes are in the right logic path for the events you are working with. Also be careful to avoid changing other event variables that play a key role in the life of the event.
TIP: When customizing system files, create a backup of them, and then clearly separate your code from the default code. This will be very helpful later on when someone has to migrate this file.
Customizing the Probe rules file is a very powerful way to customize and standardize your polling events. You can add information to the event or filter events under certain conditions so they are discarded and never reach the ObjectServer. A full treatment of the rules file is beyond this guide, but these examples will give you some ideas.
Example 1: Standardize the identification of devices
In the event viewer, some events use the device IP address in the Node field of the event, others use the name. If you prefer to always see the name, then one way to do this is to use the EntityName for the Node field in the event. This includes the ifName for certain events such as interface Ping Fails and SNMP Link State which provides better interface identification for the operator than just the ifIndex value.
Edit the nco_p_ncpmonitor.rules file and locate the beginning of the standard fields. Add the following lines to set the Node field to the EntityName as shown here.
Note: This will only work if you use an interface filter in the SNMP Link State poll definition – otherwise the entityName will always be that of the main node.

#
# populate some standard fields
#
@Severity = $Severity
@Summary = $Description
.
.
# CUSTOM: Use the EntityName to standardize on the name
# rather than the IP address in general. Do the same for
# Link State events, but using a different variable.
# rob 4/23/2014
@Node = $EntityName
if (match($EventName, "NmosLinkState"))
{
    @Node = $ExtraInfo_ENTITYNAME
}
This is the result:
Example 2: Modify the description for Ping Fail events
This example modifies the Summary field for the event by prefixing the name of the domain. If you are using Netcool/OMNIbus to forward events, for example by email, SMS, or to a ticketing system, you might want to pack the description with information for convenience. See the end of this section for a description of the details table, which contains variables the poller passed on for that event that might not be in an event field yet. The event fields are in the ObjectServer's alerts.status table.
Tip: You can do this in many places within the file. Examine the logic to make sure you are placing new code in a path that will be executed for your event.

#
# CUSTOM: Prefix ping fail events with the domain name
# (rob 4/23/2014)
#
if ( match( $EventName, "NmosPingFail" ))
{
    @Summary = $Domain + ": " + @Summary
}
Example 3: Adding value calculation to event summary
This is a technique you can use to add the value of an expression to the event. You cannot do any calculations within the event description in the poll
definition itself, but you can bring the numbers to the rules file and do the calculation here.
Edit the snmpInBandwidth poll definition and put the MIB values on the line for the Threshold and Clear descriptions, as in the following example, with a space between each one:

eval(text,"&SNMP.VALUE.ifName")
eval(long64,"&SNMP.DELTA.ifOutOctets")
eval(long64,"&POLL.POLLINTERVAL")
eval(long64,"&SNMP.VALUE.ifSpeed")
Exceeded

In the Clear description, use the word "Clear" instead of "Exceeded".
Calculate the value in the rules file. Edit the nco_p_ncpmonitor.rules file and add this section in the path of the standard events:
#
# CUSTOM: Calculate value for snmpInBandwidth event
# rob 4/24/2014
#
if (match($EventName, "inbandwidth"))
{
1    [$if_name, $octets, $pollint, $ifspd, $msg] =
         scanformat(@Summary, "%s %d %d %d %s")
2    $calculated = ($octets * 800) / ($pollint * $ifspd)
3    $percent = int($calculated)
4    @Summary = "Bandwidth threshold " + $msg +
         ", value is currently " + $percent + "% on " + $if_name
}
Line 1: scans the 5 values from the Summary into the variables.
Line 2: performs the same bandwidth utilization calculation as in the poll definition (the factor of 800 combines the conversion from octets to bits, ×8, with the conversion to a percentage, ×100). The division forces the result to be a decimal number.
Line 3: converts the real number back to an integer.
Line 4: rebuilds the summary string with the calculation result.
Here are the events after implementing Examples 2 and 3:
Some nuances to be aware of
Note that @ prefixes the event fields (in the ObjectServer alerts.status table) and $ prefixes the variables from the poller shown in the Details section. To see the fields passed to the probe from the poller, add the following line at the end of the rules file:
details($*)
This will populate the ObjectServer details table which can be seen by looking at the Information for an event, and clicking the Detail tab.
After making changes to the rules file you can either restart the
nco_p_ncpmonitor process or just send a SIGHUP signal to the running
process to reread the rules file:
# itnm_status ncp
Network Manager: Domain: ITNMDEMO
ncp_ctrl            RUNNING    PID=1245    ITNMDEMO
.
.
nco_p_ncpmonitor    RUNNING    PID=1423    ITNMDEMO
.
.
# kill -HUP 1423
Check $NCHOME/log/precision/nco_p_ncpmonitor.domain.log for syntax errors.
Event enrichment in the Event Gateway
Once events have been inserted into the ObjectServer by the probe, they are acted on by other agents: ObjectServer triggers, for instance, as well as the Network Manager Event Gateway. This gateway reads events from the ObjectServer and updates them with enrichment from the various Network Manager plugins, including the Root Cause Analysis engine and the StandardEventEnrichment stitcher.
For this example, we want to include the ifAlias for interface events. We could do this in the rules file, but that would only affect events from the poller. The Event Gateway can affect events from all sources that can be matched to the discovered topology.
Step 1: Add new field to alerts.status in the ObjectServer
TIP: Keep a record of fields you add to the ObjectServer for future reminder during upgrades.
Create a new field called InterfaceAlias. Add this line to a file: let's call it customObjectServerFields.sql,
alter table alerts.status add column InterfaceAlias varchar(64);
Run this Netcool/OMNIbus command to execute the script,
nco_sql -server objectservername -user root -password password < customObjectServerFields.sql
Step 2: Edit EventGatewaySchema.cfg to act on the new field
Near the bottom of this file you will see the insert statements for the two tables,
nco2ncp (controls events read from the ObjectServer (nco) into the Gateway (ncp))
ncp2nco (controls events being written back to the ObjectServer)
Add the new InterfaceAlias field to the nco2ncp table, so it is read in from the ObjectServer,
insert into config.nco2ncp
(
    EventFilter,
    StandbyEventFilter,
    FieldFilter
)
values
(
    "LocalNodeAlias <> '' and (NmosDomainName = '$DOMAIN_NAME' or NmosDomainName = '')",
    "EventId in ('ItnmHealthChk', 'ItnmDatabaseConnection')",
    [
        "Acknowledged", "AlertGroup", "EventId",
        .
        //CUSTOM: added by rob 4/24/2014
        "InterfaceAlias",
        .
        .
Step 3: Add new field to the ncp2nco table
Now add the new InterfaceAlias field to the ncp2nco table so that it will be written back to the ObjectServer:

insert into config.ncp2nco
(
    FieldFilter
)
values
(
    [
        "NmosCauseType", "NmosDomainName", "NmosEntityId",
        "NmosManagedStatus", "NmosObjInst", "NmosSerial",
        //CUSTOM: added by rob 4/24/2014
        "InterfaceAlias"
    ]
);
Step 4: Populate the new InterfaceAlias field
Edit the file,
$NCHOME/precision/eventGateway/stitchers/StandardEventEnrichment.stch
and add code to populate the new field above the line:
GwEnrichEvent( enrichedFields );
//CUSTOM: Populate the new InterfaceAlias field
1    if ( entityType == 2 )
     {
2        text ifAlias = @entity.interface.IFALIAS;
3        if ( ifAlias <> eval(text, '&InterfaceAlias') )
         {
4            @enrichedFields.InterfaceAlias = ifAlias;
         }
     }
5    GwEnrichEvent( enrichedFields );

Line 1: Only do this for interface events (entityType of 2).
Line 2: Declare and initialize the ifAlias variable with the current value from the NCIM interface table.
Line 3: Check whether the current value is different from the value in the event (which will be empty the first time it is read from the ObjectServer).
Line 4: Set the event field to the NCIM value, if different.
Line 5: Process the newly enriched event variables.
After making changes to these two files you can either restart the
ncp_g_event process to reread them or send a SIGHUP signal to the running
process to reread the files:
# itnm_status ncp
Network Manager: Domain: ITNMDEMO
ncp_ctrl       RUNNING    PID=1245    ITNMDEMO
.
.
ncp_g_event    RUNNING    PID=1569    ITNMDEMO
.
.
# kill -HUP 1569
Check $NCHOME/log/precision/ncp_g_event.domain.log for syntax errors.
Chapter 4: Making the most of historical data
Short-term diagnostic tool
Network Manager provides short-term historical data information by storing SNMP and ICMP data collected from network devices and using Tivoli Common Reporting to view and analyze the data. Features typical of a full performance management product, such as optimized data storage, routine data gathering over the extended periods required for capacity planning, and regulatory reporting, are possible with Tivoli Netcool Performance Manager, but are not a goal of the Network Manager historical reports.
Use the Network Polling configuration panels to:
• Define data to collect, including setting any threshold triggers for alerts
• Define the scope and time interval for polling
• Determine what data to store
• Start the data collection
You can use the Tivoli Common Reporting viewer to:
• View sets of defined reports detailing trends and analysis based on SNMP and ICMP short-term historical data collections for a subset of the collected data
• View generic Trend and TopN graphs of ad hoc collections of stored data
You can use this for closely monitoring problematic or key network devices after a maintenance period, or an area where you suspect problems and want to get a better understanding of behavior trends with throughput, device CPU/memory resources, interface usage, errors, discards, etc. The TopN reports can help operators compare and focus on the right devices and drill down to see patterns over time. Summarization reports (with Tivoli Data Warehouse) can help extend the time period you want to compare
performance over. The default reports can be used as examples to edit to meet your needs.
Do I need Tivoli Data Warehouse?
If storing and reporting on performance data is important and you are not using a dedicated performance management product, then you might want to consider using Tivoli Data Warehouse to store the historical data.
By default the data is stored and maintained in the local NCIM database. If your environment includes IBM Tivoli Monitoring and Tivoli Data
Warehouse, you can take advantage of these tools for storing, summarizing, and managing the performance data collected by Network Manager.
Tivoli Data Warehouse supports summarizing data across time periods such as
hourly, daily, weekly, and monthly.
There are sample Summary reports that make use of the Summarization tables in TDW (and therefore cannot be run without TDW).
The Device Summarization report and Interfaces Summarization report
present data in raw, hourly, and daily graphs on the same page for the data you have stored. This allows you to view behavior over a longer period of time. The Device Availability and Interface Availability reports present ping response time as well as graphs for availability in the last 24 hours, last 30 days, and last 3 months.
Tivoli Data Warehouse also supports advanced data pruning and archiving to other stores.
Storage capacity
When calculating how much data you can store, you need to consider not just the number of rows or data points, but also the rate of storage.
By default, the Network Manager poller applies a pruning policy that keeps the latest 5 million rows in the local database. You can modify this limit if you are achieving satisfactory performance results when generating the reports in your environment. Reset the limit in the
$NCHOME/etc/precision/NcPollerSchema.cfg file for the local cache, as described in the Knowledge center,
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/admin/task/nmip_adm_increasestoragelimitforhistperfdata.html.
You can store up to 20 million rows in either the local database or Tivoli Data Warehouse. Depending on your hardware and database performance, you might see degradation in the storage and reporting performance above 20 million rows. With increasing storage rates and table size, you might also need the services of your Database Administrator to optimize, run statistics and perform transaction log maintenance regularly on the database.
The sustained rate of data storage depends on a number of factors:
• Number of polled entities
• Number of metrics polled
• Frequency of polling
• Number of policies
• Number of pollers
• Database performance (for large rates, a slow database will have an impact)
Insertion rates to the local ncpolldata database have been seen up to 7 million data points per day across all pollers. However, with high rates like this you need to watch the pruning to make sure it is keeping up.
A single polling policy will not store much above 1 million data points per day, but you can use multiple policies and multiple pollers within the
suggested overall range. You can increase the number of pollers, as described in Chapter 6: How many poller instances do I need? To avoid excessive impact on the database, don’t use more than 3 pollers for storing data. Create one ITM agent instance per Network Manager poller.
Throughput to the Tivoli Data Warehouse is up to a total of 7 million rows per day across all pollers. Exceeding this limit shortens the tolerance for outages and might cause loss of data. While higher rates can be achieved, you will want to allow for error conditions on the network link and the transfer processes, which include store-and-forward techniques. ITNM will tolerate and recover from transfer outages without loss of data for a period of time that depends on the data rate and available disk space. The longer the outage, the longer it takes for the system to recover.
To calculate the rate of data points you want to store, follow these steps.
For data based on the device, e.g. memory, CPU, for each SNMP poll definition:
Datapoints per day = Number of devices × (Number of minutes in a day / Polling frequency in minutes)
For data based on the interface, e.g. bandwidth, ifInDiscards, for each SNMP poll definition:
Datapoints per day = Number of interfaces × (Number of minutes in a day / Polling frequency in minutes)
To get a feel for the storage capacity, here is an example.
A user wants to poll 1000 devices with an average of 5 network interfaces for the following historical polled data with the following polling intervals:
• Device level:
o memory utilization, 5 minute intervals
• Interface level:
o ifInDiscards, 10 minute intervals
o ifInErrors, 10 minute intervals
o bandwidth, 10 minute intervals
o Pings for up/down status and response time, 5 minute intervals
Based on these device level and interface level polling requirements, a user would calculate the daily rate of database rows using the previously described guidelines.
Note: In the example, the SNMP poll specifies a count of three data points for the
ifInDiscards, ifInErrors, and bandwidth historical polled data. The ICMP poll specifies a count of two data points, one for the up/down status and another for the response time.
Number of device level database rows per day (SNMP)
= 1000 devices * (60 * 24 / 5) polls per day = 288,000 rows
Number of network interface level database rows per day (SNMP)
= 1000 * 5 interfaces * 3 data points * (60 * 24 / 10) polls per day = 2,160,000 rows
Number of ICMP database rows per day
= 1000 devices * 5 interfaces * 2 data points * (60 * 24 / 5) polls per day = 2,880,000 rows
Total database rows per day = 5,328,000
The previous example shows the total database rows per day of 5,328,000, which is within the upper guidance of 7 million database rows per day. Thus, this example shows a scenario that results in maintaining about 4 days of raw data after increasing the storage limit for historical polled data to 21 or 22 million database rows.
This example is copied from the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/concept/nmip_poll_storagecapacityexample.html
For ICMP data, you can choose to store the up/down value, and optionally the response time and/or packet loss. Use the above formula for the SNMP collection based on whether you are using device or interface based pings and multiply the result by 1, 2 or 3 depending on whether or not you need the response time and packet loss data in addition to the up/down data point.
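If you want to sanity-check your own numbers against these formulas, the arithmetic is simple to script. The following is a minimal, illustrative Python sketch (not part of the product); it reuses the device counts, intervals, and data point counts from the worked example above, which you would replace with your own values:

# Illustrative storage-rate calculator based on the formulas in this chapter.
# The inputs below are the values from the worked example; substitute your own.

MINUTES_PER_DAY = 60 * 24

def datapoints_per_day(entity_count, data_points_per_poll, interval_minutes):
    """Datapoints per day = entities x data points x (minutes in a day / polling interval)."""
    return entity_count * data_points_per_poll * (MINUTES_PER_DAY / interval_minutes)

devices = 1000
interfaces = devices * 5          # average of 5 interfaces per device

rows_per_day = (
    datapoints_per_day(devices, 1, 5)        # memory utilization, 5 minute interval
    + datapoints_per_day(interfaces, 3, 10)  # ifInDiscards, ifInErrors, bandwidth, 10 minutes
    + datapoints_per_day(interfaces, 2, 5)   # ping up/down plus response time, 5 minutes
)

print("Estimated database rows per day: %d" % rows_per_day)
print("Days of raw data in a 21 million row cap: %.1f" % (21e6 / rows_per_day))

For this example it prints 5,328,000 rows per day and roughly 3.9 days of raw data under a 21 million row cap, matching the figures above.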
Use of the Data Label
It is often useful to graph together the results gathered from different poll definitions. For example, you might have duplicated poll definitions for ifInDiscards with different threshold values and perhaps different event severities, but want to compare the values on the same reports regardless of the specific poll definition. Most reports use the Data Label field for grouping purposes. By default, the Data Label is set to the poll definition name, but you can set a common name across the different poll definitions in the poll definition GUI panel. When selecting the parameters for the report, you choose the data to present using the common data label. The report will present data tagged with the same data label but collected by more than one poll definition.
Chapter 5: What is adaptive polling?
You can define a network view based on events associated with the device, which makes the population of this view variable and dependent on the life cycle of these events. Adaptive polling allows you to poll devices only under certain conditions, for example when they are at risk of failure. When an interface is experiencing high throughput, that might not be a red flag by itself, but the interface is now more at risk of discarding packets or generating errors and is worth monitoring more closely.
If you wanted to conserve polling on all your devices for bandwidth, discards and errors, you could set up a policy just to poll for discards and errors on devices that exceeded the bandwidth threshold.
First, define a network view based on the existence of the event; let's call it HighThroughput. Select the Filtered view type and define the filter based on the activeEvent table. To find the eventId of the event, go to the poll definition for snmpInBandwidth and note the Event ID on the General tab. In this case it is inbandwidth. So the filter will be,
Using activeEvent table: eventId = 'inbandwidth'
Of course you can combine this with any other filter when defining network views to narrow down or expand the scope to suit your needs.
Now you can view all the devices with at least one interface that exceeded the bandwidth threshold. As the throughput declines and the events clear, you will see that the devices no longer appear in the view.
Now set up a policy using this new HighThroughput view to poll for ifInDiscards and ifInErrors.
Note that on the policy General tab there is an edit box for Policy Throttle. By default this is zero, so the policy is not affected. But when using polling scopes that can vary like this, it is sometimes prudent to enter a value for the maximum number of devices to poll. In this case you might feel that even in the worst case the poller will be able to handle the load, but if you were thinking of setting up an accelerated polling scheme, as described in the examples in the Knowledge center,
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_managingadaptivepolling.html, then it would be advisable to guard against event storms.
This is an advanced technique so consider starting small, and evaluate how useful it is for you.
Chapter 6: How many poller instances do I need?
When you increase the amount of polling, you might reach the point where you need to set up more poller instances. The number of devices and interfaces, the frequency of polling, the number of metrics being polled, and your network latency all contribute to the load. Use the new poller metrics in Chapter 7: Are the pollers healthy? to monitor the pollers as you expand the polling demands across your network. This will help you determine when to set up more pollers.
How many?
We suggest, as best practice, that you set up three pollers per domain.
1. One poller to perform the administration functions for all pollers in this domain. In addition, use this poller to perform the MIB Grapher real-time polling so that this variable use does not impact the pollers handling the regular polls.
2. One poller to perform ICMP polls. These are lightweight polls and one poller is generally enough for the biggest networks with 5 minute frequency.
3. One poller to perform SNMP polls. These polls require more resources for the poller. Use the poller metrics to monitor the poller so that you can determine if you need to set up additional pollers. See Chapter 7: Are the pollers healthy?
Availability polling is your mission-critical monitoring and is relatively lightweight on the poller, so use a separate poller for all the pings and verify from the poller metrics that they are healthy even during high-load discovery periods. Ideally you do not want to risk overloading or destabilizing this poller over time with people adding ad-hoc policies, since it is the most important. Add additional pollers for SNMP and performance polling. Policies that store data place additional burdens on the poller, and it is important to ensure that the poller will be able to sustain the loads even under mild duress, such as brief database maintenance outages.
If you are storing polled data, try to store data from as few pollers as possible, no more than 3, to avoid undue database contention when using high volumes.
Tips for defining multiple pollers
By default, all poller instances will perform the administration duties, so designate just one of them for the admin role and give it a name like “AdminPoller” to remind everyone not to assign policies to it. Explicitly designate each poller using the command line arguments -admin or -noadmin when defining the poller instances in the CtrlServices.cfg file.
Note that the default poller does not have an explicit instance name and you will see it referred to in the Network Polling GUI as DEFAULT_POLLER. There is nothing special about the default poller and you are free to rename it
explicitly for clarity.
It is a good idea when setting up the new pollers in
CtrlServices.domain.cfg, to modify the service name (serviceName field
in the services.inTray table) to be the same as the poller name. By doing
this you ensure the name is used consistently for that poller across the product.
Tip: Poller naming convention
1. Avoid spaces, since the poller name is used as part of file names, such as the logs, metrics, and cfg files (for example, NcPollerSchema.AdminPoller.NCOMS.cfg). To make life easier, choose names that are compatible with the file system.
2. ServiceName: since the poller processes run under ncp_ctrl (see
itnm_status ncp), use a naming convention with the “ncp” prefix for
the serviceName when you add additional pollers. This allows you to continue to use the ps -ef|grep ncp command to view all the core processes.
For example, in CtrlServices.domain.cfg, use something like,
"ncp_poller_AdminPoller" when the -name argument is "AdminPoller"
"ncp_poller_PingPoller" when the -name argument is "PingPoller"

insert into services.inTray
(
    serviceName,
    .
    argList,
    .
)
values
(
    "ncp_poller_AdminPoller",
    .
    [ "-domain", "$PRECISION_DOMAIN", "-latency", "100000",
      "-debug", "0", "-messagelevel", "warn",
      "-admin", "-name", "AdminPoller" ],
    .
);

insert into services.inTray
(
    serviceName,
    .
    argList,
    .
)
values
(
    "ncp_poller_PingPoller",
    .
    [ "-domain", "$PRECISION_DOMAIN", "-latency", "100000",
      "-debug", "0", "-messagelevel", "warn",
      "-noadmin", "-name", "PingPoller" ],
    .
);
MIB Grapher
Don't forget to configure the MIB Grapher to use the admin designated poller. Edit
$NCHOME/precision/profiles/TIPProfile/etc/tnm/tnm.properties.
By default, the MIB Grapher is configured to use the default poller:
tnm.graph.poller=DEFAULT_POLLER
Change it to the admin-designated poller name (the -name argument):
tnm.graph.poller=AdminPoller
Using multiple pollers
When dividing up your policies to specific poller instances, consider the following:
• Multiple pollers are only supported on the same server. This ensures a consistent source point with the discovery for event correlation in RCA.
• OQL service name. Normally you do not need to query the pollers with OQL, but it can be useful when diagnosing some issues and you want to see exactly what devices and data the poller has scheduled for polling and when it last polled each data point. When using OQL to query the pollers, use the following syntax for the unnamed default poller (if you have one),
ncp_oql -domain NCOMS -service SnmpPoller
and for named pollers, such as “PingPoller”,
ncp_oql -domain NCOMS -service SnmpPoller -poller 'PingPoller'
For full details, see the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/admin/task/nmip_adm_admindistpoll.html
Chapter 7: Are the pollers healthy?
In order to prevent problems from occurring on a poller, it is important for an administrator to monitor the health of their pollers. To assist in doing this, the poller outputs a set of metrics that show how the poller is handling the load placed upon it. These metrics show the state of both the poller and its active policies.
Metric Name        Description
Health             Policy health. This is the percentage of devices that are polled during a policy cycle. If this value is 100%, the poller is working properly. If the value is below 100%, not all the devices were polled during the last polling interval.
Memory             The amount of system memory (in MB) that the poller is using. Memory usage increases as more devices are discovered or more policies are enabled.
BatchQueueSize     The number of SNMP batch requests waiting for a thread in which to complete the operation.
PollDataQueueSize  The number of INSERT statements that are queued to the NCPOLLDATA database. Shows whether the poller is successfully storing polling data at a rate consistent with the rate of polling.
PollDataRowCount   The number of rows in the ncpolldata.polldata table. This table stores the historical poll data and should not exceed the maximum set.
Table 1: Poller Metric Data
The metrics are written to a file, one per poller:
$NCHOME/log/precision/ncp_poller.SnmpPoller.<pollername>.<domain>.metrics
and are structured such that they can be easily parsed, or manually scanned by the user:
2014-04-09T16:36:24 PollerStart
2014-04-09T16:36:30 PollStart Policy=41 PollDef=1
2014-04-09T16:36:30 Memory=724
2014-04-09T16:39:00 BatchQueueSize=1
2014-04-09T16:39:34 PollDataRowCount=1909311
2014-04-09T16:45:45 PollDataRowCount=1909169
2014-04-09T16:51:48 PollDataRowCount=1903141
2014-04-09T16:57:51 PollDataRowCount=1903141
2014-04-09T17:03:54 PollDataRowCount=1903141
2014-04-09T17:05:34 Health=100 Monitors=44 Behind=0 Policy=41 PollDef=1
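Because each record is a timestamp followed by space-separated name=value pairs, the file is easy to post-process with your own tooling as well as with the supplied script described next. The following is a minimal, illustrative Python sketch (not part of the product) that assumes only the format shown above:

# Illustrative parser for the poller metrics file format shown above.
# Each line: ISO timestamp, then an event name or name=value pairs, e.g.
#   2014-04-09T17:05:34 Health=100 Monitors=44 Behind=0 Policy=41 PollDef=1
import sys
from collections import defaultdict

def parse_metrics(path):
    """Yield (timestamp, record-dict) for every non-empty line in the metrics file."""
    with open(path) as metrics:
        for line in metrics:
            fields = line.split()
            if not fields:
                continue
            timestamp, rest = fields[0], fields[1:]
            record = {}
            for token in rest:
                if "=" in token:
                    name, value = token.split("=", 1)
                    record[name] = value
                else:
                    record["event"] = token        # e.g. PollerStart, PollStart
            yield timestamp, record

if __name__ == "__main__":
    latest = defaultdict(str)                      # last seen value per metric
    for timestamp, record in parse_metrics(sys.argv[1]):
        for name, value in record.items():
            latest[name] = value
    for name in sorted(latest):
        print("%-18s %s" % (name, latest[name]))

This only summarizes the most recent value of each metric; the supplied itnm_poller.pl script, described next, remains the supported way to chart the data over time.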
Use this command line tool to graph the metric data:
$NCHOME/precision/scripts/perl/scripts/itnm_poller.pl.
The script scans the metrics file and presents simple charts of the data.
Run this script in the location of the metric file you wish to view. For example,
ncp_perl $NCHOME/precision/scripts/perl/scripts/itnm_poller.pl -domain <name> [-poller <pollername>] -metrics -window <interval in hours>
For full information on the itnm_poller.pl utility, use the -help argument or see the Knowledge center:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_monitorpollerhealth.html
The script produces charts for each metric, lining them up on the same timeline, which makes it easy to get a complete picture of the factors involved.
Illustration 1: Policy Health (one for each active policy)
Illustration 3: SNMP Batches in Queue
Illustration 4: Data Collection and Storage row count
Each of these metrics is designed to assist the administrator in answering specific questions about the state of their poller.
1. Is the historical poll data table being maintained?
2. Is the poller keeping up with the policy load at the scheduled frequencies?
3. Is the poller's memory stable?
4. Is the poller successfully storing data?
5. Do I need to add a new poller?
We will go through each of these questions and show how to use these metrics
to help in answering them.
1) Is the historical poll data table being maintained?
As part of its normal operation the poller performs the task of pruning old and obsolete data from the NCPOLLDATA database. The NCPOLLDATA
database is kept trimmed to the cap set in the poller configuration file. If the poller is unable to keep up with deleting records from this database, it can run into issues when it attempts to store more data. By reviewing the metric data for the poll data row count, the administrator can assess whether they have such a situation. If an issue is detected, the reasons can vary; by reviewing the complete set of charts, the administrator can make an assessment as to the cause.
In the chart below we can see an upward trend in the poll data row count. The cap for this instance is set at 5,000,000, and in the beginning the poller is able to maintain that level. At around 12:00 we can see the trend go upward, with the poller unable to keep the count below the desired cap.
This chart alone would not indicate the reason for the trend; we have to review the other metric data available. We first look at the Poll Data Queue size and from this metric chart we see a jump in the queue during the same time period.
The poll data queue increases, but it is not a continual upward trend; there are occasional drops shown in the chart. This type of trend would direct us to check the polling load, specifically the amount and rate at which the poll data is being collected. The next thing to check is what policies are currently enabled. Reviewing the metrics file (or graph), we can see a large number of new polls started at the point the Poll Data Queue started to climb:
2014-04-09T08:11:08 PollStart Policy=11 PollDef=6
2014-04-09T08:11:09 PollStart Policy=11 PollDef=6
2014-04-11T12:01:56 PollStart Policy=91 PollDef=4    <=== new policy starts
2014-04-11T12:01:56 PollStart Policy=91 PollDef=13
2014-04-11T12:01:56 PollStart Policy=91 PollDef=20
2014-04-11T12:01:56 PollStart Policy=91 PollDef=21
2014-04-11T12:01:56 PollStart Policy=91 PollDef=22
2014-04-11T12:01:56 PollStart Policy=91 PollDef=23
2014-04-11T12:01:56 PollStart Policy=91 PollDef=27
2014-04-11T12:01:56 PollStart Policy=91 PollDef=28
2014-04-11T12:01:56 PollStart Policy=91 PollDef=29
2014-04-11T12:01:56 PollStart Policy=91 PollDef=30
2014-04-11T12:01:56 PollStart Policy=91 PollDef=31
2014-04-11T12:01:56 PollStart Policy=91 PollDef=1
At this point we would want to review the actual policy scope, polling rate, and storage settings. The poller's profiling.policy OQL table shows the
target load in the policy.
ncp_oql -domain <domain> -service SnmpPoller -tabular -query "select * from profiling.policy;"
The polling rate can be seen in the ncpoller.job OQL table:
ncp_oql -domain <domain> -service SnmpPoller -tabular -query "select * from ncpoller.job;"
From these tables we see a large number of targets being polled at a rate of 18 seconds. The polling interval is much too aggressive, probably a typo on the part of the user who configured the policy, and is likely the cause of the climb in our Poll Data Queue and storage counts. At this point the administrator would want to disable the policy, allow the poller to catch up in storing and pruning the poll data, then fix the policy polling rate and restart it.
This is just an example of how this data can be used to diagnose data storage issues. The inability of the poller to prune the data might not always be related to the storage rate. A poller can also be unable to prune if the database is experiencing issues that prevent the batches of poll data deletes from completing.
In this example we see the Poll Data Row count steadily increasing, with no drops.
Reviewing the Poll Data Queue we see that it is steady, with no increases at all:
At this point we would suspect the poller's ability to prune the poll data. If the database is fully loaded, it might not be able to handle the batches of deletes from the poller. If the batches of deletes are failing, this information would be recorded in the poller log file.
There are times when database performance causes issues with the poller's ability to perform the poll data delete in a timely manner. When this happens it can result in the pollData database table exceeding capacity.
In this next chart we see an example of the Poll Data Row count exceeding the limit and being pruned below the desired level. In this example our chosen limit is 250,000 records.
The downward steps in the chart span roughly 1 hour each, implying that it is taking the poller an hour to delete 50,000 records. The lengthy amount of time to perform the delete would be a cause for concern, and at this point the user would want to contact their database administrator. A typical cause is poor transaction log maintenance.
2) Is the poller keeping up with the policy load at the scheduled frequencies?
As policies are enabled on a poller it is important to monitor the policy health to determine if the poller is able to keep up with the load. Since there can be multiple poll definitions on a policy, the health is computed for each poll definition in the policy separately.
The poller records the Policy Health in the metrics file. This health is the percentage of the devices in the poll that the poller is able to completely poll during the polling interval.
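The exact calculation is not documented in the metrics file itself, but judging from the sample Health records shown later in this chapter (an observation from the samples, not a product specification), the fields appear to relate as:

Health ≈ 100 × (Monitors − Behind) / Monitors

where Monitors is the number of entities scheduled for that policy/poll definition and Behind is the number that were not polled within the interval.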
Example 1 – increasing the number of policies
Below is a sample chart of the health of a policy/poll definition. As you can see the policy/poll definition is considered healthy up until 11:30, at which point it begins to decline. The administrator is made aware of this by a status alert that the poller sends for the poll.
Since this particular poll was once perfectly healthy we need to dig down a bit further to find out the cause of the change. Reviewing other metric charts we see that there are other active polls that belong to this poller:
Both of these polls appear to have been started around the time of the
declining health of the first poll, so it is a strong indication that the load from these two additional polls is more than the poller can handle.
Example 2 – Increased number of entities
There are other factors that might cause a policy/poll definition to become unhealthy, such as increased load after a discovery. Below is another example of a Policy Health chart, but this time no other policies are enabled:
In this case the health fluctuates a little at first but then reaches 100 percent. Later on it drops off, never fully recovering. By reviewing the actual Health records in the metrics file for this policy/poll definition we can see a jump in the entity count (Monitors) at 11:10:
2014-04-11T09:52:42 Health=100 Monitors=53 Behind=0 Policy=89 PollDef=38
2014-04-11T10:00:53 Health=100 Monitors=53 Behind=0 Policy=89 PollDef=38
2014-04-11T10:09:39 Health=98 Monitors=1240 Behind=24 Policy=89 PollDef=38
2014-04-11T10:11:49 Health=31 Monitors=1240 Behind=855 Policy=89 PollDef=38
2014-04-11T10:13:57 Health=31 Monitors=1240 Behind=855 Policy=89 PollDef=38
2014-04-11T10:16:05 Health=31 Monitors=1240 Behind=855 Policy=89 PollDef=38
2014-04-11T10:18:47 Health=74 Monitors=1240 Behind=322 Policy=89 PollDef=38
2014-04-11T10:21:16 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:04:00 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:06:17 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:08:36 Health=100 Monitors=1240 Behind=0 Policy=89 PollDef=38
2014-04-11T11:10:42 Health=92 Monitors=6278 Behind=450 Policy=89 PollDef=38
2014-04-11T11:12:55 Health=11 Monitors=6278 Behind=5547 Policy=89 PollDef=38
2014-04-11T11:15:07 Health=52 Monitors=6278 Behind=2974 Policy=89 PollDef=38
2014-04-11T11:17:16 Health=62 Monitors=6278 Behind=2372 Policy=89 PollDef=38
2014-04-11T11:19:31 Health=5 Monitors=6278 Behind=5902 Policy=89 PollDef=38
2014-04-11T11:21:39 Health=0 Monitors=6278 Behind=6278 Policy=89 PollDef=38
2014-04-11T12:04:59 Health=25 Monitors=6278 Behind=4708 Policy=89 PollDef=38
2014-04-11T12:14:59 Health=50 Monitors=6278 Behind=3139 Policy=89 PollDef=38
2014-04-11T12:24:59 Health=50 Monitors=6278 Behind=3139 Policy=89 PollDef=38
So in this example the policy health declined as a result of a discovery or scope change that caused a drastic increase in the number of entities being polled. At this point the user would need to review the policy and either reduce the scope or increase the polling interval.
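Spotting jumps like the 53 to 1240 and 1240 to 6278 increases above is easy on a chart but tedious in a large metrics file. The following is a minimal, illustrative Python sketch (not part of the product), assuming the Health record format shown above; it flags records below 100 percent health and reports when the entity count for a policy/poll definition more than doubles:

# Illustrative scan of a poller metrics file for unhealthy Health records
# and for jumps in the number of monitored entities per policy/poll definition.
import sys

previous_monitors = {}                      # (Policy, PollDef) -> last Monitors value

with open(sys.argv[1]) as metrics:
    for line in metrics:
        fields = dict(t.split("=", 1) for t in line.split()[1:] if "=" in t)
        if "Health" not in fields:
            continue
        key = (fields["Policy"], fields["PollDef"])
        monitors = int(fields["Monitors"])
        timestamp = line.split()[0]
        last = previous_monitors.get(key)
        if last is not None and monitors > last * 2:
            print("%s policy %s/%s: entity count jumped from %d to %d"
                  % (timestamp, key[0], key[1], last, monitors))
        if int(fields["Health"]) < 100:
            print("%s policy %s/%s: health %s%%, %s of %d entities behind"
                  % (timestamp, key[0], key[1],
                     fields["Health"], fields["Behind"], monitors))
        previous_monitors[key] = monitors

Run against the metrics file from this example, it would report the two entity-count jumps and a line for each interval where the poll fell behind.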
Example 3 – loss of SNMP access
In this next example we have an SNMP based poll that was running fine and then dropped in health by a small amount.
A review of the metric file shows no change in the number of policies enabled or in the total target count being handled. Looking at the event console and the poller trace we can see a flood of SNMPTIMEOUT alerts:
Description='SNMP poll failure (SNMPTIMEOUT) for poll aFastPoll/ifOutErrors and target 172.31.23.52'
When the poller gets an SNMPTIMEOUT it will try to retest each credential that is scoped for the target. This added testing, as well as the SNMP timeouts, results in the poll taking an extended amount of time. If a large number of targets experience this issue it can result in a poll falling behind. In this example, the trace file shows a poll failure for each entity in scope, so at this point the user needs to check if someone updated the SNMP credentials incorrectly.
The next two poller health questions are frequently related.
3) Is the poller's memory stable?
and
4) Is the poller successfully storing data?
During normal operation of a poller, the amount of system memory being used can fluctuate, as policies are enabled or discovery scope increases. On many systems the amount of memory an individual program can consume is limited and when a program reaches that limit it results in a failure. To help diagnose when this does happen, the poller records its memory usage in the metrics file.
If a poller experiences an out-of-memory condition, the corresponding metric chart can show whether there was a growth issue.
Example 1 – poller fails on startup
When a poller first starts up, the expected behavior is for the memory to climb as the monitors for each target in scope are started. Once all of the monitors have been started, the memory should level off and remain at a constant level until more monitors are started, either because new targets are added to the scope or more policies are enabled.
In this chart we see an example of runaway memory growth by a poller:
Admittedly, this is a bit of an extreme case, but it illustrates the concept. You can easily see the memory usage climbing until it reaches the system limit, then dropping at the point where the poller fails because it is unable to allocate more memory. The poller is restarted by ncp_ctrl and the pattern repeats until the poller reaches its limit on restarts. This pattern indicates that the load is such that the poller cannot start all of the needed monitors before running out of memory. At this point the user needs to review the polling load and consider either reducing the load or creating additional pollers.
Example 2 – failure to store data to database
In this next example we see a poller with steady memory followed by growth that continues until the limit is reached.
At this point we want to take a look at some of the other charts, such as Policy Health. If the memory growth were the result of too much load then the policies would show signs of being unhealthy. For this example we see two policies enabled:
The health charts show that the policies are doing fine, so the next chart to look at is the Poll Data Queue. The poller keeps a queue of poll data waiting to be written out to the database; as poll data is collected, it is added to this queue. If the poller is unable to write the data to the database, the queue can grow and, in extreme cases, it can grow to the point that the poller runs out of memory.
From this chart we can easily see that the data queue is growing, indicating a problem writing the data to the NCPOLLDATA database. At this point the user needs to determine the cause of the growth. First, check whether there is a database connection issue. If there is a connection problem, the poller writes a message to the log file and sends a database connection alert. If no connection issues are being reported, then the user needs to review the amount of data being collected and determine whether the collection rate exceeds the rate at which data can be stored in the database. This rate limit depends upon the database. If the user suspects the collection rate is too high, they can review what the policy is storing by using the Policy Details Tivoli Common Reporting report, which provides an Insert Rate Estimate graph for the policy:
If the insert rate is too high, the user needs to review the reasons for the policy collections they have enabled and determine what action to take, such as increasing the interval, reducing the items collected, or adding a new poller instance.
Gracefully handle Poll Data Queue growth
Regardless of the reason for the Poll Data Queue growth, you can set a cap on the queue. This prevents a poller from exceeding its memory limit during prolonged database outages or periods of excessive data collection. A configuration option in the poller's configuration file, NcPollerSchema.cfg, directs the poller to dump the queue to avoid excessive growth:
update config.properties set PollDataQueueLimit = 5000;
When the queue exceeds this number of data points waiting to be inserted into the database, the poller writes the data to a flat file instead:
ncp_poller.SnmpPoller.<domain>.data
You can use this file to import the data into the database later if the data is still desired.
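If you suspect the cap has been reached, you can check for the dump file and watch whether it is still growing. This is a minimal sketch; the directory is an assumption (dump files are commonly written under $NCHOME/log), so adjust the path for your installation:

# Check whether the poller has started dumping poll data to the flat file.
# The directory is an assumption; adjust it for your installation.
ls -lh $NCHOME/log/ncp_poller.SnmpPoller.*.data
# Count the lines written so far to gauge how quickly the file is growing.
wc -l $NCHOME/log/ncp_poller.SnmpPoller.*.data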
5) Do I need to add a new poller?
Network Manager can run multiple pollers within a domain, which allows the polling to scale as needed. When to add additional pollers is not always obvious. Three metrics can indicate that a new poller is needed: Batch Queue, Memory, and Policy Health.
If there are multiple policies enabled on a poller, you can compare the health of each. In this example we have some long-running policies that have been perfectly healthy in the past. Just after the 8:40 timestamp we see that the SnmpBandWidth poll is starting to fall behind:
By reviewing these charts, the user can easily see that some of the policies do not extend as far back, indicating that they were enabled during the period when the existing policies started to fall behind. These added policies caused the drop in policy health. If the new policies are important to keep, then a new poller is needed to handle the additional load.
Chapter 8: Am I pinging all the IP addresses I want?
After setting up your ICMP polls, you can check that the poller will ping all the devices you are responsible for. Some devices might not have been discovered, some addresses might be outside the discovery scope on a discovered device, some might not be in the scope of the policies you set up, and some might be unmanaged for some reason.
Ideally, you start with an independent list of IP addresses that you are responsible for and maintain separately, and run the check against that list. Failing that, you could extract a list of access IP addresses from the NCIM database, if you are confident that all the devices you are responsible for are actually discovered.
List of management IP addresses for all discovered devices
select accessIPAddress
from chassis c
inner join domainMembers dm on dm.entityId = c.entityId
inner join domainMgr d on dm.domainMgrId = d.domainMgrId
where d.domainName = 'domainname';
List of management IP addresses for all interfaces (including the device management address)
select ip.address
from ipEndPoint ip
inner join domainMembers dm on dm.entityId = ip.entityId
inner join domainMgr d on dm.domainMgrId = d.domainMgrId
where d.domainName = 'domainname';
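If you extract the reference list from NCIM, you can write the query output directly to the one-address-per-line file that the registration script in Step 1 below expects. This is a minimal sketch assuming a DB2 backend with the default database name NCIM and the db2 command-line processor; the file name expected_ips.txt is only an example, and Oracle or other databases need the equivalent client commands:

# Write one access IP address per line to a file for the registration script.
# Assumes a DB2 backend and a database named NCIM; adapt for other databases.
db2 connect to NCIM
db2 -x "select accessIPAddress from chassis c \
  inner join domainMembers dm on dm.entityId = c.entityId \
  inner join domainMgr d on dm.domainMgrId = d.domainMgrId \
  where d.domainName = 'domainname'" > expected_ips.txt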
Generate the report
Step 1
You start by registering the list from a file that contains the IP addresses, one per line. The script loads them into a table. You can do this to check one IP address or thousands.
cd $NCHOME/precision/scripts/perl/scripts
ncp_perl ncp_upload_expected_ips.pl -domain domainname -file filename
This step only needs to be run when the list changes. The script first removes all existing entries, so each execution replaces the table contents with the new list.
Step 2
Run this command each time you want to generate a new snapshot to correlate the IP addresses with the poller's list of IP addresses:
ncp_perl ncp_ping_poller_snapshot.pl -domain domainname
Step 3
Run the report using the following command:
ncp_perl ncp_polling_exceptions.pl -domain domainname
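Steps 2 and 3 can be combined into a small script and scheduled, for example with cron, so a current report is always on hand. This is a minimal sketch that reuses only the commands shown above; the domain name NCOMS and the output file path are placeholders, and it assumes the report is written to standard output:

#!/bin/sh
# Refresh the ping poller snapshot and regenerate the polling exceptions report.
# NCOMS and the output path are placeholders; replace them for your environment.
cd $NCHOME/precision/scripts/perl/scripts
ncp_perl ncp_ping_poller_snapshot.pl -domain NCOMS
ncp_perl ncp_polling_exceptions.pl -domain NCOMS > /tmp/polling_exceptions_NCOMS.txt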
The report
The following table describes the categories that are checked and explains why the IP addresses listed in each category are not being polled.

Undiscovered
Check the scope and seed lists of the discovery configuration.

Out of scope
These IP addresses are missing from the policy scope for the Default Ping polls.

Unmanaged, status = 1
Devices or interfaces that have been unmanaged from the GUI or by using the UnmanagedNode.pl script are considered in maintenance mode and are not polled. They have a Status of 1.

Unmanaged, status = 2
These entities are unmanaged during discovery, usually in the TagManagedEntities.stch stitcher. Check the filter in this stitcher if it is unmanaging interfaces that it should not.

Secondary or unpingable interfaces
The discovery selects the management address for each interface with multiple IP addresses, and only those addresses are pinged. Network Manager does not ping the secondary IP addresses.
IBM Tivoli Network Manager Monitoring Status
============================================

UNDISCOVERED
============
List of IP addresses from the reference list that are not in the management database.

| IP Address |
|            |

OUT OF SCOPE
============

| IP Address  | Hostname      | AOC Class      |
| 172.31.23.4 | 172.30.23.4   | JuniperMSeries |
| 172.30.5.1  | 172.31.23.112 | 3ComSuperStack |
| 172.30.23.4 | 172.30.23.4   | JuniperMSeries |

UNMANAGED
=========
List of IP addresses from the reference list that are not being monitored because they were unmanaged in the GUI (status = 1)

| IP Address | Hostname | AOC Class | Entity Status | Device Status |
|            |          |           |               |               |

List of IP addresses unmanaged from Discovery (status = 2)
Check ifDescr in TagManagedEntities.stch for the following interfaces:

| IP Address | Hostname | ifDescr | Entity Status |
|            |          |         |               |

SECONDARY or UNPINGABLE
=======================
List of IP addresses not polled as they are considered secondary addresses

| IP Address | Hostname     | ifIndex | Primary IP  |
| 10.0.0.1   | 172.30.23.4  | 14      | 128.0.0.4   |
| 10.0.0.4   | 172.30.23.4  | 14      | 128.0.0.4   |
| 127.0.0.1  | 172.31.23.26 | 16      | 172.25.0.81 |

NOT POLLED IN LAST 15 MINS
==========================
List of IP addresses that have not been polled during the last 15 minutes:

| IP Address | Hostname | AOC Class |
|            |          |           |

FALLING BEHIND
==============
List of IP addresses in policies that are falling behind by more than twice the polling interval

| IP Address | Hostname | Policy | Poll Interval | Last Poll Int | Time since Last Poll |
|            |          |        |               |               |                      |

NO SNMP ADDRESS
===============
These devices may have other IP addresses that were not discovered, but only the management address shown here will be polled (unless unmanaged above).

| IP Address | Hostname | Node Managed |
|            |          |              |
For full details, see:
http://www.ibm.com/support/knowledgecenter/SSSHRK_4.1.1/com.ibm.networkmanagerip.doc_4.1.1/itnm/ip/wip/poll/task/nmip_poll_troubleshootingnwpolling.html