Chapter 4. Continuous availability and manageability
4.3 Serviceability
4.3.5 Locating and servicing
The final component of a comprehensive design for serviceability is the ability to effectively locate and replace parts requiring service. POWER processor-based systems utilize a combination of visual cues and guided maintenance procedures to ensure that the identified part is replaced correctly, every time.
Packaging for service
The following service enhancements are included in the physical packaging of the systems to facilitate service:
Color coding (touch points)
– Terracotta-colored touch points indicate that a component (FRU/CRU) can be concurrently maintained.
– Blue-colored touch points delineate components that are not concurrently maintained, that is, components that require the system to be turned off for removal or repair.
Tool-less design: Selected IBM systems support tool-less or simple tool designs. These designs require either no tools or only simple tools, such as flathead screwdrivers, to service the hardware components.
Positive retention: Positive retention mechanisms help to assure proper connections between hardware components, such as cables to connectors, and between two cards that attach to each other. Positive retention mechanisms such as latches, levers, thumbscrews, and cables are included to help prevent loose connections and aid in installing (seating) parts correctly. These positive retention items do not require tools.
Light Path
The Light Path LED feature is for low-end systems, including Power Systems up to models 750 and 755, that might be repaired by clients. In the Light Path LED implementation, when a fault condition is detected on the POWER7 processor-based system, an amber FRU fault LED is illuminated, and the fault is rolled up to the system fault LED. The Light Path system pinpoints the exact part by turning on the amber FRU fault LED associated with the part to be replaced.
The system can clearly identify components for replacement by using specific component-level LEDs, and can also guide the servicer directly to the component by signaling (turning on solid) the system fault LED, enclosure fault LED, and the component FRU fault LED.
After the repair, the LEDs shut off automatically if the problem is fixed.
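The roll-up behavior described above can be illustrated with a minimal Python sketch. The class names (Fru, Enclosure, System) and the location strings are hypothetical and do not correspond to any IBM interface. The sketch only models the idea that a lit FRU fault LED implies a lit enclosure fault LED and a lit system fault LED, and that all of them go dark once the fault is cleared.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fru:
    """Hypothetical model of a replaceable part with an amber fault LED."""
    location: str
    fault_led: bool = False


@dataclass
class Enclosure:
    name: str
    frus: List[Fru] = field(default_factory=list)

    @property
    def fault_led(self) -> bool:
        # The enclosure fault LED is on if any FRU inside reports a fault.
        return any(fru.fault_led for fru in self.frus)


@dataclass
class System:
    enclosures: List[Enclosure] = field(default_factory=list)

    @property
    def fault_led(self) -> bool:
        # The system fault LED is the roll-up of all enclosure fault LEDs.
        return any(enc.fault_led for enc in self.enclosures)

    def lit_fault_leds(self) -> List[str]:
        """Return the LEDs a servicer would see lit, from system down to FRU."""
        lit: List[str] = []
        for enc in self.enclosures:
            faulty = [fru.location for fru in enc.frus if fru.fault_led]
            if faulty:
                lit.append(enc.name)
                lit.extend(faulty)
        return ["system"] + lit if lit else []
```

Because the enclosure and system LEDs are derived from the FRU state rather than stored separately, clearing the FRU fault after a repair turns off the whole path without any extra bookkeeping.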
Guiding Light
The enclosure and system identify LEDs turn solidly on and can be used to follow the path from the system to the enclosure and down to the specific FRU.
Guiding Light uses a series of flashing LEDs, allowing a service provider to quickly and easily identify the location of system components. Guiding Light can also handle multiple error conditions simultaneously, which might be necessary in some very complex high-end configurations.
In these situations, Guiding Light waits for the servicer’s indication of what failure to attend first and then illuminates the LEDs to the failing component.
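The handling of multiple simultaneous errors can be sketched as follows. This is an illustrative Python model only: the GuidingLight class, its method names, and the location strings are assumptions made for the example, not an IBM API. It shows that several faults can be pending at once, that no identify LEDs flash until the servicer chooses a fault, and that the chosen fault then lights the path down to the component.

```python
from typing import Dict, List, Optional, Tuple

# A path is an ordered tuple of locations, outermost first,
# for example (rack, system, enclosure, component).
Path = Tuple[str, ...]


class GuidingLight:
    """Hypothetical model of servicer-driven identify LEDs."""

    def __init__(self) -> None:
        self._pending: Dict[str, Path] = {}
        self._selected: Optional[str] = None

    def report_fault(self, fault_id: str, path: Path) -> None:
        # Several faults may be outstanding at the same time.
        self._pending[fault_id] = path

    def pending(self) -> List[str]:
        return sorted(self._pending)

    def select(self, fault_id: str) -> None:
        # The servicer indicates which failure to attend to first.
        if fault_id not in self._pending:
            raise KeyError(f"unknown fault: {fault_id}")
        self._selected = fault_id

    def flashing_leds(self) -> List[str]:
        # Nothing flashes until the servicer has made a selection.
        if self._selected is None:
            return []
        return list(self._pending[self._selected])
```

For example, after reporting two faults and calling select on one of them, flashing_leds returns only the locations on the path to that one component, while the other fault remains pending.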
Data centers can be complex places, and Guiding Light is designed to do more than identify visible components. When a component might be hidden from view, Guiding Light can flash a sequence of LEDs that extend to the frame exterior, clearly guiding the service representative to the correct rack, system, enclosure, drawer, and component.
Service labels
Service providers use these labels to assist them in performing maintenance actions. Service labels are found in various formats and positions, and are intended to transmit readily available information to the servicer during the repair process. Several of these service labels and the purpose of each are:
Location diagrams are strategically located on the system hardware, relating information regarding the placement of hardware components. Location diagrams might include location codes, drawings of physical locations, concurrent maintenance status, or other data pertinent to a repair. Location diagrams are especially useful when multiple components are installed, such as DIMMs, CPUs, processor books, fans, adapter cards, LEDs, and power supplies.
The remove or replace procedure labels contain procedures often found on a cover of the system or in other spots accessible to the servicer. These labels provide systematic procedures, including diagrams, detailing how to remove/replace certain serviceable hardware components.
Numbered arrows are used to indicate the order of operation and serviceability direction of components. Certain serviceable parts such as latches, levers, and touch points must be pulled or pushed in a certain direction and certain order for the mechanisms to engage or disengage.
The operator panel
The operator panel on a POWER processor-based system is a four-row by 16-element LCD display used to present boot progress codes, indicating advancement through the system power-on and initialization processes. The operator panel is also used to display error and location codes when an error occurs that prevents the system from booting. It includes several buttons allowing a service support representative (SSR) or client to change various boot-time options and other limited service functions.
Concurrent maintenance
The IBM POWER7 processor-based systems are designed with the understanding that certain components have higher intrinsic failure rates than others. The movement of fans, power supplies, and physical storage devices naturally makes them more susceptible to wearing down or burning out; other devices such as I/O adapters might begin to wear from repeated plugging and unplugging. For this reason, these devices are specifically designed to be concurrently maintainable, when properly configured.
In other cases, a client may be in the process of moving or redesigning a data center, or planning a major upgrade. At times like these, flexibility is crucial. The IBM POWER7 processor-based systems are designed for redundant or concurrently maintainable power, fans, physical storage, and I/O towers.
The most recent members of the IBM Power Systems family based on the POWER7 processor continue to support concurrent maintenance of power, cooling, PCI adapters, media devices, I/O drawers, the GX adapter, and the operator panel. In addition, they support concurrent firmware fix pack updates when possible. Whether a firmware fix pack release can be updated concurrently is identified in the readme file released with the firmware.
Blind-swap cassette
Blind-swap PCIe adapters represent significant service and ease-of-use enhancements in I/O subsystem design, and maintain high PCIe adapter density.
Standard PCI designs supporting hot-add and hot-replace require top access so that adapters can be slid into the PCI I/O slots vertically; this is the case for the Power 750 and 755. Blind-swap allows PCIe adapters to be concurrently replaced without having to put the I/O drawer into a service position. Since blind-swap was first delivered, minor carrier design adjustments have improved an already well-thought-out service design.
For PCIe adapters on the POWER7 processor-based servers, blind-swap cassettes include the PCIe slot, which avoids the top-to-bottom movement that previous designs required to insert the card into the slot. Sliding in the cassette connects the adapter correctly.
Firmware updates
Firmware updates for Power Systems are released in a cumulative sequential fix format, packaged as an RPM for concurrent application and activation. Administrators can install and activate many firmware patches without cycling power or rebooting the server.
When an HMC is connected to the system, the new firmware image is loaded from any of the following sources:
A download from the IBM Fix Central Web site: http://www.ibm.com/support/fixcentral/
IBM supports multiple firmware releases in the field, so under expected circumstances, a server can operate on an existing firmware release, using concurrent firmware fixes to stay up-to-date with the current patch level. Because changes to several server functions (for example, changing initialization values for chip controls) cannot occur during system operation, a patch in this area requires a system reboot for activation. Under normal operating conditions, IBM provides patches for an individual firmware release level for up to two years after first making the release code generally available. After this period, clients should plan to update in order to stay on a supported firmware release.
Activation of new firmware functions, as opposed to patches, will require installation of a new firmware release level. This process is disruptive to server operations because it requires a scheduled outage and full server reboot.
In addition to concurrent and disruptive firmware updates, IBM also offers concurrent patches that include functions which are not activated until a subsequent server reboot. A server with these patches operates normally. The additional concurrent fixes are installed and activated when the system reboots after the next scheduled outage.
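The three kinds of update described in this section can be summarized in a short Python sketch. It is a simplification under stated assumptions: the FirmwareUpdate fields and the classify function are hypothetical names introduced for illustration, and in practice the readme file released with the firmware is what states how a given fix pack can be applied.

```python
from dataclasses import dataclass
from enum import Enum


class Activation(Enum):
    CONCURRENT = "installed and activated while the server runs"
    DEFERRED = "installed while the server runs, fully active after next reboot"
    DISRUPTIVE = "requires a scheduled outage and full server reboot"


@dataclass(frozen=True)
class FirmwareUpdate:
    # Both flags are hypothetical inputs for this sketch.
    new_release_level: bool        # activates new functions, not only fixes
    has_reboot_only_content: bool  # e.g. changes chip initialization values


def classify(update: FirmwareUpdate) -> Activation:
    """Map an update to the activation style described in the text."""
    if update.new_release_level:
        # A new release level always needs an outage and a full reboot.
        return Activation.DISRUPTIVE
    if update.has_reboot_only_content:
        # The patch installs concurrently; some functions wait for a reboot.
        return Activation.DEFERRED
    return Activation.CONCURRENT
```

The release-level check comes first because a new release level is disruptive regardless of what else the update contains.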
The firmware also adds the capability to view the status of a system power control network background firmware update. This subsystem updates as necessary when migrated nodes or I/O drawers are added to the configuration. The new firmware provides an interface to view the progress of the update and to start or stop the background update if a more convenient time becomes available.
Repair and verify
Repair and verify (R&V) is a system used to guide a service provider step-by-step through the process of repairing a system and verifying that the problem has been repaired. The steps are customized in the appropriate sequence for the particular repair for the specific system being repaired. Repair scenarios covered by repair and verify include:
Replacing a defective field-replaceable unit (FRU)
Reattaching a loose or disconnected component
Correcting a configuration error
Removing or replacing an incompatible FRU
Updating firmware, device drivers, operating systems, middleware components, and IBM applications after replacing a part
Repair and verify procedures are designed to be used both by service providers who are familiar with the task at hand and by those who are not. Education On Demand content is placed in the procedure at the appropriate locations. Throughout the repair and verify procedure, repair history is collected and provided to the Service and Support Problem Management Database for storage with the serviceable event, to ensure that the guided maintenance procedures are operating correctly.
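The general shape of a guided procedure (ordered steps, a final verification, and a history that is recorded along the way) can be sketched in Python. The RepairProcedure class and the step descriptions in the usage example are hypothetical and are not taken from the actual repair and verify implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Each step pairs a description with an action that reports success.
Step = Tuple[str, Callable[[], bool]]


@dataclass
class RepairProcedure:
    """Hypothetical guided repair: run steps in order, then verify."""
    steps: List[Step]
    verify: Callable[[], bool]
    history: List[str] = field(default_factory=list)

    def run(self) -> bool:
        for description, action in self.steps:
            ok = action()
            # Record every step so the history can be stored with the event.
            self.history.append(f"{description}: {'done' if ok else 'failed'}")
            if not ok:
                return False
        verified = self.verify()
        self.history.append(f"verify: {'passed' if verified else 'failed'}")
        return verified
```

A caller would build the step list for the specific repair, call run, and then forward the collected history to wherever serviceable events are stored.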
Clients can subscribe through the Subscription Services to obtain notifications about the latest updates available for service-related documentation. The latest version of the documentation is accessible through the Internet, and a CD-ROM-based version is also available.