• No results found

Monitoring and Managing a JVM

N/A
N/A
Protected

Academic year: 2021

Share "Monitoring and Managing a JVM"

Copied!
27
0
0

Loading.... (view fulltext now)

Full text

(1)

Monitoring and Managing a JVM

Erik Brakkee & Peter van den Berkmortel

#jfall15

(2)

Overview

• About Axxerion

• Challenges and example

• Troubleshooting

• Memory management

• Tooling

• Best practices

• Conclusion

#jfall15

(3)

Axxerion is an Integrated Workplace Management System, aiming to

• make organizations more efficient

• enable collaboration

• adapt to an organization Axxerion organization aims to

• offer employees a stable and
 innovative workplace

#jfall15

About Axxerion

(4)

• 10 developers

• 30 consultants

• 100 virtual and physical servers

• 14,000 monitored items

• 62 items per second

• 6 clusters

• 300+ clients in 14 countries

• 80,000 users

#jfall15

Axxerion Metrics

(5)

Middleware stack

• CentOS 6 & 7

• KVM Virtualization (standard linux)

• MySQL 5.6

• JBoss 5

• Java EE 5

• Java 8

• Eclipse Link (JPA 2)

#jfall15

Middleware Stack

(6)

• Multi-tenancy

• Trace back issues

• Load

• Some issues only occur under load

• Difficult to reproduce in test

• After-the-fact troubleshooting

• Monitoring many variables

• Is there a problem?

• Distinguish between cause and effect

• Pro-active

• Self-defeating monitoring

#jfall15

Challenges

(7)

One year of troubleshooting

• Crashes in native code

• Replace native libraries

• Application logs freeze

• Automatic reload of log configuration

• Heap space problems

• Finalizer queue size monitoring

• Introduced our own log appender

• Server generates 25 million exceptions per hour

• Introduced exception log analysis

#jfall15

Example of an Incident

(8)

One year of troubleshooting

• Server log fragments end up in unexpected places

• File descriptor errors

• Final solution found through monitoring and reverse thinking

#jfall15

Example of an Incident

All of this was caused by an object cloner

(9)

After-the-fact

• Which actions led to a problem?

• Users do not remember what they were doing

#jfall15

Troubleshooting

Therefore, we log a lot

• Every login & logout

• Every user action

• Every update of a field of an object

• Anything else that can be useful

(10)

Event logging

• Start and end statements

• At one-minute intervals

• Number of bytes allocated

• CPU usage

• Allows troubleshooting of individual threads

#jfall15

Troubleshooting

(11)

Event logging

2015-11-03 13:29:23,509 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 0 @cputime 231809

@allocated 1336 @state RUNNABLE @blockedCount 0 @blockedTime -1 @waitedCount 0

@waitedTime -1 @lockInfo null @subevent workertask.start @realtime 9615399862973957

2015-11-03 13:29:23,559 FINE [com.axxerion.performance] (rp-worker-3821) performance @id 395972 @name rp-worker-3821 @usertime 10000000 @cputime 20074642 @allocated 2361888 @state RUNNABLE @blockedCount 11 @blockedTime -1

@waitedCount 1 @waitedTime -1 @lockInfo null @subevent workertask.end @realtime 9615399912328419

(based  on  ThreadMXBean)

#jfall15

Troubleshooting

(12)

Exception logging

• Bulk approach

• Exceptions similar on stack frames

• Ignore message

• Custom log appender/handler

• Periodically write hash data

• Use hash as key

• Store hash on disk with full (first seen) exception

#jfall15

Troubleshooting

(13)

com.axxerion.Fault: error_no_administrator_defined: client axdsr1

at com.axxerion.server.B.B.execute(DirectMessageQueueEntry.java:180) at com.axxerion.server.B.B.execute(DirectMessageQueueEntry.java:31) at com.axxerion.R.execute(RunnableExecutableWrapper.java:38)

… 10 more

Caused by: Fault: error_no_administrator_defined: client axdsr1

at com.axxerion.server.util.ContactUtil.getAdministratorSystemUserId(ContactUtil.java:384) ... 8 more

#jfall15

Troubleshooting

GIT  principle:  content  addressable  storage  (hash  ==  object)  

Secure  hash  based  on  exception  class,  stack  frames,  cause  of  exception  (recurse)   Log  the  hash  in  exception.log  file  

Simple  hash  —  script  can  add  up  similar  hashes

(14)

Log  file  sample  

2015-11-02 16:28:16,184 INFO [com.axxerion.exceptiontracker]

stats 2d4eed091276af79d526e88115f68a82bbb0c1de INFO occurrences delta 2 cumulative 1101

#jfall15

Troubleshooting

Script  output  

Processing    /var/…/log/exceptions.log  ...  

Most  occurring  exceptions   Occurrences:            1647  

Hash:                          2d4eed091276af79d526e88115f68a82bbb0c1de   First  exception  of  this  kind  seen:  

Time:    2015-­‐09-­‐28  23:30:20.954301678  +0200   Level:  INFO                                                                

xyz.svc.webservices.data.ServiceResponseException:  The  specified  object  was  not  found                  at  xyz.svc.ws.data.ServiceResponse.internalThrow(ServiceResponse.java:266)                  at  xyz.svc.ws.data.Request.execute(Request.java:152)  

               at  xyz.svc.ws.data.svcService.lookupItems(Service.java:1364)                  at  xyz.svc.ws.data.svcService.bindToItem(Service.java:1407)   Total  number  of  unique  exceptions:  160

(15)

#jfall15

Troubleshooting

Troubleshooting experience

• Ad-hoc scripting

• Some issues take days others might take years

• Issues getting harder to find

• Psychology

• Relax

• Switch to stand-by

• Gather data

• This might be your only chance

• Get out of the denial phase

• Reverse thinking

(16)

Some essential tips

• GC logs

• Heap usage

• Heap fragmentation

• Initiating occupancy fraction

• Maximum chunk size

• Trigger full GCs

• jmap -histo:live

#jfall15

Memory Management

(17)

Identify core parameters

• User experience

• Response time

• Outstanding requests

• Technical

• Multiple full GCs within a short time frame

• Failing scheduled tasks (e.g. backups, restores)

#jfall15

Monitoring

See also techblog.netflix.com for several ideas and tooling around monitoring.

(18)

Zabbix

• Flexible

• Easily customizable

• Monitor large numbers of servers

#jfall15

Tooling

(19)

Technical approach

• Middleware, OS, Database

• Use standard items

• Use your own scripts

• Application level

• Use JMX as much as possible

• Internal statistics service

#jfall15

Tooling

(20)

JStack

• Annotate and sort threads with script

• Deadlock detection and debugging

• Stack dump (jstack -l) has negligible disruption

#jfall15

Tooling

(21)

JMap

• Heap dumps

• jmap -dump:format=b,file=$dumpfile $pid: stop-the-world up to 3 minutes

• After disaster

• Histogram

• jmap -histo: negligible disruption

• After-the-fact analysis

• Trigger full GC

• jmap -histo:live: stop-the-world up to 30 seconds

• Avoid heap fragmentation

#jfall15

Tooling

(22)

JVisual VM

• MBean monitoring

• CPU sampling (5s interval)

Do not use

• Memory sampling

• Profiling

#jfall15

Tooling

(23)

Eclipse memory analyzer

• Works with 17 GB memory dumps

• Has saved the day

• Run on separate machine

#jfall15

Tooling

(24)

#jfall15

Wishlist

Missing JVM features

• Stop running threads

• Deallocated bytes per thread

• MetaSpace GC can trigger full GC

• Garbage collection

• Compacting

• Predictable performance

• Guaranteed no stop-the-world

(25)

• Simple statements can kill your server

• System.getProperty(…)

• exception.toString()

• InetAddres.getLocalHost().getHostname()

• Third-party libraries

• Reflection bottleneck

• Risk management

• Runtime control over new features

• Assure a loop or recursion breaks off

• Do not use finalizers

#jfall15

Best Practices

(26)

#jfall15

Conclusion

Results

• Server stability (99.998% uptime)

• Getting better Troubleshooting

• A lot of tools required

• Do not assume anything

• Go top-down

• Sometimes more issues than one

• Cause or effect

• There is no silver bullet

(27)

“please rate my talk in the official J-Fall app”

techblog.axxerion.com

www.axxerion.com/nl/careers/

References

Related documents

Hipped roof, covered in red plain clay tiles, with red brick construction to walls and sandstone plinth, and brick dentil course to eaves. Small diamond-shaped aeration

Det finns många och utbredda uppfattningar om stereotypier bland hästägare, som till exempel att det är beteenden som är skadliga och kan ”smitta” andra

Complete social and political equality for the Palestinian Arab minority in Israel is only possible, even purely as an idea, if the notion of a Jewish state is understood simply as

Performance guides / guide are influenced by many factors, internal factors and external factors. Internal factors are associated with limited knowledge and

Our alliance combines Deloitte’s breadth and depth of experience in Information Management with Informatica’s industry-leading software in the areas of data integration,

Member, Pennsylvania Physical Therapy Association 2000-present Member, Virginia Physical Therapy Association 2006-2012 Member, Virginia Physical Therapy Association 2006-2012

Alex, skim through those articles and see if there is any useful information about a link between caff eine and Parkinson’s Disease while I call Granddad,” added

As the European Union (EU) now accounts for a lower share of world trade, investment, currency holdings, defence expenditure, and development assistance, this shift