Monitoring and Managing a JVM
Erik Brakkee & Peter van den Berkmortel
#jfall15
Overview
• About Axxerion
• Challenges and example
• Troubleshooting
• Memory management
• Tooling
• Best practices
• Conclusion
#jfall15
Axxerion is an Integrated Workplace Management System, aiming to
• make organizations more efficient
• enable collaboration
• adapt to an organization Axxerion organization aims to
• offer employees a stable and innovative workplace
#jfall15
About Axxerion
• 10 developers
• 30 consultants
• 100 virtual and physical servers
• 14,000 monitored items
• 62 items per second
• 6 clusters
• 300+ clients in 14 countries
• 80,000 users
#jfall15
Axxerion Metrics
Middleware stack
• CentOS 6 & 7
• KVM Virtualization (standard linux)
• MySQL 5.6
• JBoss 5
• Java EE 5
• Java 8
• Eclipse Link (JPA 2)
#jfall15
Middleware Stack
• Multi-tenancy
• Trace back issues
• Load
• Some issues only occur under load
• Difficult to reproduce in test
• After-the-fact troubleshooting
• Monitoring many variables
• Is there a problem?
• Distinguish between cause and effect
• Pro-active
• Self-defeating monitoring
#jfall15
Challenges
One year of troubleshooting
• Crashes in native code
• Replace native libraries
• Application logs freeze
• Automatic reload of log configuration
• Heap space problems
• Finalizer queue size monitoring
• Introduced our own log appender
• Server generates 25 million exceptions per hour
• Introduced exception log analysis
#jfall15
Example of an Incident
One year of troubleshooting
• Server log fragments end up in unexpected places
• File descriptor errors
• Final solution found through monitoring and reverse thinking
#jfall15
Example of an Incident
All of this was caused by an object cloner
After-the-fact
• Which actions led to a problem?
• Users do not remember what they were doing
#jfall15
Troubleshooting
Therefore, we log a lot
• Every login & logout
• Every user action
• Every update of a field of an object
• Anything else that can be useful
Event logging
• Start and end statements
• At one-minute intervals
• Number of bytes allocated
• CPU usage
• Allows troubleshooting of individual threads
#jfall15
Troubleshooting
Event logging
2015-11-03 13:29:23,509 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 0 @cputime 231809
@allocated 1336 @state RUNNABLE @blockedCount 0 @blockedTime -1 @waitedCount 0
@waitedTime -1 @lockInfo null @subevent workertask.start @realtime 9615399862973957
2015-11-03 13:29:23,559 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 10000000 @cputime 20074642 @allocated 2361888 @state RUNNABLE @blockedCount 11 @blockedTime -1
@waitedCount 1 @waitedTime -1 @lockInfo null @subevent workertask.end @realtime 9615399912328419
(based on ThreadMXBean)
#jfall15
Troubleshooting
Exception logging
• Bulk approach
• Exceptions similar on stack frames
• Ignore message
• Custom log appender/handler
• Periodically write hash data
• Use hash as key
• Store hash on disk with full (first seen) exception
#jfall15
Troubleshooting
com.axxerion.Fault: error_no_administrator_defined: client axdsr1
at com.axxerion.server.B.B.execute(DirectMessageQueueEntry.java:180) at com.axxerion.server.B.B.execute(DirectMessageQueueEntry.java:31) at com.axxerion.R.execute(RunnableExecutableWrapper.java:38)
… 10 more
Caused by: Fault: error_no_administrator_defined: client axdsr1
at com.axxerion.server.util.ContactUtil.getAdministratorSystemUserId(ContactUtil.java:384) ... 8 more
#jfall15
Troubleshooting
GIT principle: content addressable storage (hash == object)
Secure hash based on exception class, stack frames, cause of exception (recurse) Log the hash in exception.log file
Simple hash — script can add up similar hashes
Log file sample
2015-11-02 16:28:16,184 INFO [com.axxerion.exceptiontracker]
stats 2d4eed091276af79d526e88115f68a82bbb0c1de INFO occurrences delta 2 cumulative 1101
#jfall15
Troubleshooting
Script output
Processing /var/…/log/exceptions.log ...
Most occurring exceptions Occurrences: 1647
Hash: 2d4eed091276af79d526e88115f68a82bbb0c1de First exception of this kind seen:
Time: 2015-‐09-‐28 23:30:20.954301678 +0200 Level: INFO
xyz.svc.webservices.data.ServiceResponseException: The specified object was not found at xyz.svc.ws.data.ServiceResponse.internalThrow(ServiceResponse.java:266) at xyz.svc.ws.data.Request.execute(Request.java:152)
at xyz.svc.ws.data.svcService.lookupItems(Service.java:1364) at xyz.svc.ws.data.svcService.bindToItem(Service.java:1407) Total number of unique exceptions: 160
#jfall15
Troubleshooting
Troubleshooting experience
• Ad-hoc scripting
• Some issues take days others might take years
• Issues getting harder to find
• Psychology
• Relax
• Switch to stand-by
• Gather data
• This might be your only chance
• Get out of the denial phase
• Reverse thinking
Some essential tips
• GC logs
• Heap usage
• Heap fragmentation
• Initiating occupancy fraction
• Maximum chunk size
• Trigger full GCs
• jmap -histo:live
#jfall15
Memory Management
Identify core parameters
• User experience
• Response time
• Outstanding requests
• Technical
• Multiple full GCs within a short time frame
• Failing scheduled tasks (e.g. backups, restores)
#jfall15
Monitoring
See also techblog.netflix.com for several ideas and tooling around monitoring.
Zabbix
• Flexible
• Easily customizable
• Monitor large numbers of servers
#jfall15
Tooling
Technical approach
• Middleware, OS, Database
• Use standard items
• Use your own scripts
• Application level
• Use JMX as much as possible
• Internal statistics service
#jfall15
Tooling
JStack
• Annotate and sort threads with script
• Deadlock detection and debugging
• Stack dump (jstack -l) has negligible disruption
#jfall15
Tooling
JMap
• Heap dumps
• jmap -dump:format=b,file=$dumpfile $pid: stop-the-world up to 3 minutes
• After disaster
• Histogram
• jmap -histo: negligible disruption
• After-the-fact analysis
• Trigger full GC
• jmap -histo:live: stop-the-world up to 30 seconds
• Avoid heap fragmentation
#jfall15
Tooling
JVisual VM
• MBean monitoring
• CPU sampling (5s interval)
Do not use
• Memory sampling
• Profiling
#jfall15
Tooling
Eclipse memory analyzer
• Works with 17 GB memory dumps
• Has saved the day
• Run on separate machine
#jfall15
Tooling
#jfall15
Wishlist
Missing JVM features
• Stop running threads
• Deallocated bytes per thread
• MetaSpace GC can trigger full GC
• Garbage collection
• Compacting
• Predictable performance
• Guaranteed no stop-the-world
• Simple statements can kill your server
• System.getProperty(…)
• exception.toString()
• InetAddres.getLocalHost().getHostname()
• Third-party libraries
• Reflection bottleneck
• Risk management
• Runtime control over new features
• Assure a loop or recursion breaks off
• Do not use finalizers
#jfall15
Best Practices
#jfall15