Monitoring and Managing a JVM

(1)

Monitoring and Managing a JVM

Erik Brakkee & Peter van den Berkmortel

#jfall15

(2)

Overview

• About Axxerion

• Challenges and example

• Troubleshooting

• Memory management

• Tooling

• Best practices

• Conclusion

#jfall15

(3)

Axxerion is an Integrated Workplace Management System, aiming to

• make organizations more efficient

• enable collaboration

• adapt to an organization Axxerion organization aims to

• offer employees a stable and  innovative workplace

#jfall15

About Axxerion

(4)

• 10 developers

• 30 consultants

• 100 virtual and physical servers

• 14,000 monitored items

• 62 items per second

• 6 clusters

• 300+ clients in 14 countries

• 80,000 users

#jfall15

Axxerion Metrics

(5)

Middleware stack

• CentOS 6 & 7

• KVM Virtualization (standard linux)

• MySQL 5.6

• JBoss 5

• Java EE 5

• Java 8

• Eclipse Link (JPA 2)

#jfall15

Middleware Stack

(6)

• Multi-tenancy

• Trace back issues

• Load

• Some issues only occur under load

• Difficult to reproduce in test

• After-the-fact troubleshooting

• Monitoring many variables

• Is there a problem?

• Distinguish between cause and effect

• Pro-active

• Self-defeating monitoring

#jfall15

Challenges

(7)

One year of troubleshooting

• Crashes in native code

• Replace native libraries

• Application logs freeze

• Automatic reload of log configuration

• Heap space problems

• Finalizer queue size monitoring

• Introduced our own log appender

• Server generates 25 million exceptions per hour

• Introduced exception log analysis

#jfall15

Example of an Incident

(8)

One year of troubleshooting

• Server log fragments end up in unexpected places

• File descriptor errors

• Final solution found through monitoring and reverse thinking

#jfall15

Example of an Incident

All of this was caused by an object cloner

(9)

After-the-fact

• Which actions led to a problem?

• Users do not remember what they were doing

#jfall15

Troubleshooting

Therefore, we log a lot

• Every login & logout

• Every user action

• Every update of a field of an object

• Anything else that can be useful

(10)

Event logging

• Start and end statements

• At one-minute intervals

• Number of bytes allocated

• CPU usage

• Allows troubleshooting of individual threads

#jfall15

Troubleshooting

(11)

Event logging

2015-11-03 13:29:23,509 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 0 @cputime 231809

@allocated 1336 @state RUNNABLE @blockedCount 0 @blockedTime -1 @waitedCount 0

@waitedTime -1 @lockInfo null @subevent workertask.start @realtime 9615399862973957

2015-11-03 13:29:23,559 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 10000000 @cputime 20074642 @allocated 2361888 @state RUNNABLE @blockedCount 11 @blockedTime -1

@waitedCount 1 @waitedTime -1 @lockInfo null @subevent workertask.end @realtime 9615399912328419

(based on ThreadMXBean)

#jfall15

Troubleshooting

(12)

Exception logging

• Bulk approach

• Exceptions similar on stack frames

• Ignore message

• Custom log appender/handler

• Periodically write hash data

• Use hash as key

• Store hash on disk with full (first seen) exception

#jfall15

Troubleshooting

(13)

com.axxerion.Fault: error_no_administrator_defined: client axdsr1

at com.axxerion.server.B.B.execute(DirectMessageQueueEntry.java:180) at com.axxerion.server.B.B.execute(DirectMessageQueueEntry.java:31) at com.axxerion.R.execute(RunnableExecutableWrapper.java:38)

… 10 more

Caused by: Fault: error_no_administrator_defined: client axdsr1

at com.axxerion.server.util.ContactUtil.getAdministratorSystemUserId(ContactUtil.java:384) ... 8 more

#jfall15

Troubleshooting

GIT principle: content addressable storage (hash == object)

Secure hash based on exception class, stack frames, cause of exception (recurse) Log the hash in exception.log file

Simple hash — script can add up similar hashes

(14)

Log file sample

2015-11-02 16:28:16,184 INFO [com.axxerion.exceptiontracker]

stats 2d4eed091276af79d526e88115f68a82bbb0c1de INFO occurrences delta 2 cumulative 1101

#jfall15

Troubleshooting

Script output

Processing /var/…/log/exceptions.log ...

Most occurring exceptions Occurrences: 1647

Hash: 2d4eed091276af79d526e88115f68a82bbb0c1de First exception of this kind seen:

Time: 2015-‐09-‐28 23:30:20.954301678 +0200 Level: INFO

xyz.svc.webservices.data.ServiceResponseException: The specified object was not found at xyz.svc.ws.data.ServiceResponse.internalThrow(ServiceResponse.java:266) at xyz.svc.ws.data.Request.execute(Request.java:152)

at xyz.svc.ws.data.svcService.lookupItems(Service.java:1364) at xyz.svc.ws.data.svcService.bindToItem(Service.java:1407) Total number of unique exceptions: 160

(15)

#jfall15

Troubleshooting

Troubleshooting experience

• Ad-hoc scripting

• Some issues take days others might take years

• Issues getting harder to find

• Psychology

• Relax

• Switch to stand-by

• Gather data

• This might be your only chance

• Get out of the denial phase

• Reverse thinking

(16)

Some essential tips

• GC logs

• Heap usage

• Heap fragmentation

• Initiating occupancy fraction

• Maximum chunk size

• Trigger full GCs

• jmap -histo:live

#jfall15

Memory Management

(17)

Identify core parameters

• User experience

• Response time

• Outstanding requests

• Technical

• Multiple full GCs within a short time frame

• Failing scheduled tasks (e.g. backups, restores)

#jfall15

Monitoring

Zabbix

• Flexible

• Easily customizable

• Monitor large numbers of servers

#jfall15

Tooling

(19)

Technical approach

• Middleware, OS, Database

• Use standard items

• Use your own scripts

• Application level

• Use JMX as much as possible

• Internal statistics service

#jfall15

Tooling

(20)

JStack

• Annotate and sort threads with script

• Deadlock detection and debugging

• Stack dump (jstack -l) has negligible disruption

#jfall15

Tooling

(21)

JMap

• Heap dumps

• jmap -dump:format=b,file=$dumpfile $pid: stop-the-world up to 3 minutes

• After disaster

• Histogram

• jmap -histo: negligible disruption

• After-the-fact analysis

• Trigger full GC

• jmap -histo:live: stop-the-world up to 30 seconds

• Avoid heap fragmentation

#jfall15

Tooling

(22)

JVisual VM

• MBean monitoring

• CPU sampling (5s interval)

Do not use

• Memory sampling

• Profiling

#jfall15

Tooling

(23)

Eclipse memory analyzer

• Works with 17 GB memory dumps

• Has saved the day

• Run on separate machine

#jfall15

Tooling

(24)

#jfall15

Wishlist

Missing JVM features

• Stop running threads

• Deallocated bytes per thread

• MetaSpace GC can trigger full GC

• Garbage collection

• Compacting

• Predictable performance

• Guaranteed no stop-the-world

(25)

• Simple statements can kill your server

• System.getProperty(…)

• exception.toString()

• InetAddres.getLocalHost().getHostname()

• Third-party libraries

• Reflection bottleneck

• Risk management

• Runtime control over new features

• Assure a loop or recursion breaks off

• Do not use finalizers

#jfall15

Best Practices

(26)

#jfall15

Conclusion

Results

• Server stability (99.998% uptime)

• Getting better Troubleshooting

• A lot of tools required

• Do not assume anything

• Go top-down

• Sometimes more issues than one

• Cause or effect

• There is no silver bullet

(27)

Monitoring and Managing a JVM