This case study describes the complete root cause analysis and resolution of a File Descriptor (Too many open files) related problem that we faced following a migration from Oracle ALSB 2.6 running on Solaris OS to Oracle OSB 11g running on AIX.
This section will also provide you with proper AIX OS commands you can use to troubleshoot and validate the File Descriptor configuration of your Java VM process.
Environment specifications
• Java EE server: Oracle Service Bus 11g
• Middleware OS: IBM AIX 6.1
• Java VM: IBM JRE 1.6.0 SR9 - 64 bit
• Platform type: Service Bus - Middle Tier
Problem overview
Problem type: a java.net.SocketException: Too many open files error was observed under heavy load, causing our Oracle OSB managed servers to suddenly hang.
The problem was observed only under high load and required our support team to take corrective action, e.g. shutting down and restarting the affected Weblogic OSB managed servers.
Gathering and validation of facts
As usual, a Java EE problem investigation requires gathering of technical and non technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
• What is the client impact? HIGH; Full JVM hang
• Recent change of the affected platform? Yes, recent migration from ALSB 2.6 (Solaris OS) to Oracle OSB 11g (AIX OS)
• Any recent traffic increase to the affected platform? No
• What is the health of the Weblogic server? Affected managed servers were no longer responsive along with closure of the Weblogic HTTP (Server Socket) port
• Did a restart of the Weblogic Integration server resolve the problem? Yes but temporarily only
Conclusion #1: The problem appears to be load related
Weblogic server log files review
File Descriptor - Why so important for an Oracle OSB environment?
The File Descriptor capacity is quite important for your Java VM process. The key concept you must understand is that File Descriptors are not only required for pure File Handles but also for inbound and outbound Socket communication. Each new Java Socket created to (inbound) or from (outbound) your Java VM by the Weblogic kernel Socket Muxer requires a File Descriptor allocation at the OS level.
An Oracle OSB environment can require a significant number of Sockets, depending on how much inbound load it receives and how many outbound connections (Java Sockets) it has to create in order to send and receive data from external / downstream systems (System End Points).
For that reason, you must ensure that you allocate enough File Descriptors / Sockets to your Java VM process in order to support your daily load, including problematic scenarios such as a sudden slowdown of external systems, which typically increases the demand on the File Descriptor allocation.
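The effect of a downstream slowdown on File Descriptor demand can be illustrated with a rough sizing sketch. It relies on Little's law (concurrent connections = request rate x time each request holds its connection) and assumes one outbound Socket per in-flight request with no connection reuse. The function names and all numbers are hypothetical and are not taken from the environment described here.

```shell
#!/bin/sh
# Rough sizing sketch; every number below is hypothetical.

# Estimated concurrent outbound sockets, using Little's law:
#   concurrent = request rate x time each request holds its socket
# $1 = requests per second, $2 = downstream response time in milliseconds
# Assumes one socket per in-flight request (no connection reuse).
outbound_sockets() {
    echo $(( ($1 * $2 + 999) / 1000 ))   # integer result, rounded up
}

# Total descriptor demand: plain file handles + inbound + outbound sockets.
# $1 = file handles, $2 = inbound sockets, $3 = outbound sockets
fd_demand() {
    echo $(( $1 + $2 + $3 ))
}

healthy=$(outbound_sockets 100 200)     # downstream answers in 0.2 s
degraded=$(outbound_sockets 100 5000)   # downstream slows down to 5 s

echo "healthy:  $(fd_demand 300 400 "$healthy") descriptors"
echo "degraded: $(fd_demand 300 400 "$degraded") descriptors"
```

With these made-up inputs the outbound Socket count alone grows from 20 to 500 when the downstream system slows down. In practice inbound connections would likely pile up as well, so the real increase may be larger.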
Runtime File Descriptor capacity check for Java VM and AIX OS
Following the discovery of this error, our technical team did perform a quick review of the current observed runtime File Descriptor capacity & utilization of our OSB Java VM processes. This can be done easily via the AIX commands procfiles <Java PID> | grep rlimit and lsof -p <Java PID> | wc -l, as in the example below:
## Java VM process File Descriptor total capacity
>> procfiles 5425732 | grep rlimit
Current rlimit: 2000 file descriptors
## Java VM process File Descriptor current utilization
>> lsof -p <Java PID> | wc -l
1920
As you can see, the current capacity was found at 2000; which is quite low for a medium size Oracle OSB environment. The average utilization under heavy load was also found to be quite close to the upper limit of 2000.
The next step was to verify the default AIX OS File Descriptor limit via the command:
>> ulimit -S -n
2000
Conclusion #2: The current File Descriptor limit for both OS and OSB Java VM appears to be quite low and setup at 2000. The File Descriptor utilization was also found to be quite close to the upper limit which explains why so many JVM failures were observed at peak load.
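To make "quite close to the upper limit" concrete, the two figures gathered above can be turned into a utilization percentage. The following is a minimal sketch; the function name fd_utilization_pct is hypothetical, and the inputs are simply the values reported in this case study.

```shell
#!/bin/sh
# Percentage of the File Descriptor limit currently in use (rounded down).
# $1 = descriptors in use, $2 = File Descriptor limit
fd_utilization_pct() {
    # Reject empty or non-numeric input (e.g. a limit reported as "unlimited").
    for fd_arg in "$1" "$2"; do
        case "$fd_arg" in
            ''|*[!0-9]*)
                echo "usage: fd_utilization_pct USED LIMIT" >&2
                return 1 ;;
        esac
    done
    if [ "$2" -le 0 ]; then
        echo "limit must be greater than zero" >&2
        return 1
    fi
    echo $(( $1 * 100 / $2 ))
}

# Values observed in this case study: 1920 descriptors in use, limit 2000.
fd_utilization_pct 1920 2000
```

With the observed values this reports 96, i.e. only about 4% headroom at peak load. Keep in mind that the line count of lsof output includes a header line and some entries that are not numbered descriptors, so the figure should be read as an approximation.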
Weblogic File Descriptor configuration review
The File Descriptor limit can typically be overridden when you start your Weblogic Java VM. This configuration is managed by the WLS core layer, and the script can be found at the following location:
<WL_HOME>/wlserver_10.3/common/bin/commEnv.sh
Root cause: File Descriptor override only working for Solaris OS!
In this script, the override of the File Descriptor limit via ulimit is only applied for Solaris OS (SunOS), which explains why our current OSB Java VM running on AIX OS ended up with the default value of 2000, whereas our older ALSB 2.6 environment running on Solaris OS had a File Descriptor limit of 65536.
** Please note that the activation of any change to the Weblogic File Descriptor configuration requires a restart of both the Node Manager (if used) along with the managed servers. **
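The Solaris-only behaviour described above can be pictured with a small sketch. This is not the actual content of commEnv.sh: the function names are hypothetical, and extending the OS check to AIX is shown only as one plausible remedy, not necessarily the change that was applied in this environment.

```shell
#!/bin/sh
# Hypothetical sketch only: this is NOT the actual content of commEnv.sh.
# Each function prints the File Descriptor limit a start script would end
# up with.  $1 = OS name (as printed by `uname -s`),
#           $2 = OS default limit, $3 = desired limit

# Override applied for Solaris only, as described in the case study.
limit_solaris_only() {
    case "$1" in
        SunOS) echo "$3" ;;
        *)     echo "$2" ;;   # every other OS keeps its default
    esac
}

# One possible remedy (an assumption): treat AIX the same way as Solaris.
limit_with_aix() {
    case "$1" in
        SunOS|AIX) echo "$3" ;;
        *)         echo "$2" ;;
    esac
}

echo "AIX, Solaris-only logic: $(limit_solaris_only AIX 2000 65536)"
echo "AIX, extended logic:     $(limit_with_aix AIX 2000 65536)"

# In a real start script the result would then be handed to ulimit, e.g.
#   ulimit -n "$(limit_with_aix "$(uname -s)" "$(ulimit -S -n)" 65536)"
# which only succeeds if the hard limit (ulimit -H -n) allows the new value.
```

The first line of output shows the AIX process left at the default of 2000, the second shows it receiving the intended 65536 once AIX is included in the check.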
A runtime validation was also performed following the activation of the new configuration, which confirmed the new active File Descriptor limit.
No failure has been observed since then.
Conclusion and recommendations
When upgrading your Weblogic Java EE container to a new version, please ensure that you verify your current File Descriptor limit as per the above case study. From a capacity planning perspective, please ensure that you monitor your File Descriptor utilization on a regular basis in order to identify any potential capacity problem, Socket leak, etc.
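As a starting point for such regular monitoring, a periodic check could classify the current utilization against warning and critical thresholds. The sketch below is only an illustration: the function name fd_status and the 70% / 90% thresholds are assumptions, and the 65536 limit in the last call is the value used by the older Solaris environment, taken here as an example target.

```shell
#!/bin/sh
# Classify File Descriptor utilization against two thresholds.
# $1 = descriptors in use, $2 = File Descriptor limit,
# $3 = warning threshold (%), $4 = critical threshold (%)
fd_status() {
    if [ "$2" -le 0 ]; then
        echo "limit must be greater than zero" >&2
        return 1
    fi
    fd_pct=$(( $1 * 100 / $2 ))          # integer percentage, rounded down
    if [ "$fd_pct" -ge "$4" ]; then
        echo "CRITICAL ${fd_pct}%"
    elif [ "$fd_pct" -ge "$3" ]; then
        echo "WARNING ${fd_pct}%"
    else
        echo "OK ${fd_pct}%"
    fi
}

fd_status 1920 2000 70 90     # peak load against the original limit
fd_status 1500 2000 70 90     # a hypothetical intermediate reading
fd_status 1920 65536 70 90    # same peak load against a 65536 limit
```

Run from a scheduler at a fixed interval and logged over time, readings like these would also reveal a steadily climbing descriptor count, which is the typical signature of a Socket leak.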