Troubleshooting Your SUSE® OpenStack Cloud (TUT19873). Adam Spiers, SUSE Cloud/HA Senior Software Engineer


(1)

Troubleshooting Your SUSE® OpenStack Cloud

TUT19873

(2)

SUSE® OpenStack Cloud ...

(3)

SUSE® OpenStack Cloud

(4)

SUSE® OpenStack Cloud: 4653 Parameters

(5)

SUSE® OpenStack Cloud: 14

(6)

SUSE® OpenStack Cloud: 2 Hours

(7)

SUSE® OpenStack Cloud Troubleshooting: <1

(8)

[Architecture diagram: SUSE® OpenStack Cloud. Cloud management (Heat, Horizon dashboard) and the cloud APIs sit on top of the OpenStack services: AUTH (Keystone), Images (Glance), Compute (Nova), Network (Neutron), Block (Cinder), Object (Swift) and Telemetry, backed by the required services (RabbitMQ message queue, PostgreSQL database). Underneath: hypervisors (Xen, KVM, VMware, Hyper-V), SUSE Linux Enterprise Server 11 SP3 and the physical infrastructure (x86-64, switches, storage). SUSE Cloud adds SUSE Manager, SUSE Studio and Ceph (RADOS RBD, RadosGW); partner solutions cover billing, VM management, image tooling, portal, app monitoring, and security & performance.]

(9)

Non-HA SUSE® OpenStack Cloud

(10)

HA SUSE® OpenStack Cloud

(11)

Crowbar and Chef

(12)


Generic SLES® Troubleshooting

All Nodes in SUSE® OpenStack Cloud are SLES based

Watch out for typical issues:

dmesg for hardware-related errors, OOM, interesting kernel messages

usual syslog targets, e.g. /var/log/messages

Check general node health via:

top, vmstat, uptime, pstree, free

core files, zombies, etc
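A minimal first sweep might look like this (the commands only illustrate the points above):

dmesg | grep -iE 'error|oom|fail'     # hardware errors, OOM killer activity
tail -n 100 /var/log/messages         # recent syslog entries
uptime; free -m; vmstat 1 5           # load, memory and I/O at a glance
ps aux | awk '$8 ~ /Z/'               # zombie processes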

(13)

Supportconfig

supportconfig can be run on any cloud node

supportutils-plugin-susecloud.rpm

installed on all SUSE OpenStack Cloud nodes automatically

collects precious cloud-specific information for further analysis
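For example (with default options the report tarball normally lands under /var/log):

rpm -q supportutils-plugin-susecloud   # confirm the cloud plugin is installed
supportconfig                          # collect the full report on the affected node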

(14)


Admin Node: Crowbar UI

A useful Export page is available in the Crowbar UI for exporting various log files

(15)

Cloud Install

screen install-suse-cloud --verbose

Install log: /var/log/crowbar/install.log
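The install can then be followed from a second shell, e.g.:

tail -f /var/log/crowbar/install.log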

(16)

SUSE® OpenStack Cloud Admin

[Admin node stack: SLES 11 SP3, SUSE Cloud add-on, Crowbar UI, Crowbar services, Chef/RabbitMQ, repository mirror]

Install logs:
/var/log/crowbar/install.log

Chef/Rabbit:
/var/log/rabbitmq/*.log
/var/log/chef/server.log
/var/log/couchdb/couchdb.log

Crowbar repo server:
/var/log/apache2/provisioner*log

Crowbar:
/var/log/crowbar/production.{out,log}

(17)

Chef

Cloud uses Chef for almost everything:

All Cloud and non-core SLES packages are installed

All config files are overwritten

All daemons are started

Database tables are initialized

http://docs.getchef.com/chef_quick_overview.html

(18)

Admin Node: Using Chef

knife node list

knife node show <nodeid>

knife search node "*:*"
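Two illustrative invocations (the node name follows the d<macid>.<domain> convention used on later slides; -l prints the full attribute set):

knife node show d<macid>.<domain> -l
knife search node founder:true -i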

(19)

SUSE® OpenStack Cloud Admin

Populate ~root/.ssh/authorized_keys prior to install

Barclamp install logs:

/var/log/crowbar/barclamp_install

Node discovery logs:

/var/log/crowbar/sledgehammer/d<macid>.<domain>.log

Syslog of crowbar installed nodes sent via rsyslog to:

/var/log/nodes/d<macid>.log

(20)


Useful Tricks

Root login to Crowbar-installed cloud nodes should be possible from the admin node (even in the discovery stage)

If admin network is reachable:

~/.ssh/config:

Host 192.168.124.*
  StrictHostKeyChecking no
  User root

(21)

SUSE® OpenStack Cloud Admin

When a proposal is applied, chef-client logs are at:

/var/log/crowbar/chef-client/<macid>.<domain>.log

Useful crowbar commands:

crowbar machines help

crowbar transition <node> <state>

crowbar <barclamp> proposal list|show <name>

crowbar <barclamp> proposal delete default

crowbar_reset_nodes

crowbar_reset_proposal <barclamp> default
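For example, to inspect the database barclamp's default proposal (the barclamp name is chosen only as an illustration):

crowbar database proposal list
crowbar database proposal show default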

(22)


Admin Node: Crowbar Services

Nodes are deployed via PXE boot:

/srv/tftpboot/discovery/pxelinux.cfg/*

Installed via AutoYaST; profile generated to:

/srv/tftpboot/nodes/d<mac>.<domain>/autoyast.xml

Can delete the generated profile and rerun chef-client on the admin node to regenerate it

Can add useful settings to autoyast.xml:

<confirm config:type="boolean">true</confirm>

(don’t forget to chattr +i the file)
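A sketch of that workflow, using the profile path shown above:

vi /srv/tftpboot/nodes/d<mac>.<domain>/autoyast.xml        # add the <confirm> setting
chattr +i /srv/tftpboot/nodes/d<mac>.<domain>/autoyast.xml  # stop chef-client regenerating it
chattr -i /srv/tftpboot/nodes/d<mac>.<domain>/autoyast.xml  # remove the flag when done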

(23)

Admin Node: Crowbar UI

Raw settings in barclamp proposals give access to "expert" (hidden) options. Most interesting are:

debug: true
verbose: true

(24)

Admin Node: Crowbar Gotchas

(25)

Admin Node: Crowbar Gotchas

Be patient

Only transition one node at a time

Only apply one proposal at a time

Cloud nodes should boot from:

1. Network
2. First disk

(26)

SUSE® OpenStack Cloud: Cloud Node

[Cloud node stack: SLES 11 SP3, SUSE Cloud add-on, Chef client, node-specific services]

All managed via Chef:

/var/log/chef/client.log
rcchef-client status

chef-client can be invoked manually

(27)

SUSE® OpenStack Cloud: Control Node

[Control node stack: SUSE Cloud add-on on SLES, running the OpenStack API services]

Just like any other cloud node:

/var/log/chef/client.log
rcchef-client status
chef-client

Chef overwrites all config files it touches

• chattr +i is your friend
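A small illustration (keystone.conf is just an example file; Chef cannot update it while the flag is set):

chattr +i /etc/keystone/keystone.conf   # freeze a hand-edited config
lsattr /etc/keystone/keystone.conf      # verify the immutable flag
chattr -i /etc/keystone/keystone.conf   # unfreeze before the next proposal run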

(28)

High Availability

(29)

What is High Availability?

Availability = Uptime / Total Time

99.99% (“4 nines”) == ~53 minutes / year

99.999% (“5 nines”) == ~5 minutes / year
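For example: a year has 365 × 24 × 60 = 525,600 minutes, so 99.99% availability allows 0.0001 × 525,600 ≈ 53 minutes of downtime per year, and 99.999% allows ≈ 5 minutes.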

High Availability (HA)

Typically accounts for mild / moderate failure scenarios

e.g. hardware failures and recoverable software errors

automated recovery by restarting / migrating services

HA != Disaster Recovery (DR)

Cross-site failover

Partially or fully automated

(30)


Internal architecture

(31)

Resource Agents

Executables which start / stop / monitor resources

RA types:

LSB init scripts

OCF scripts (~ LSB + meta-data + monitor action + ...)

/usr/lib/ocf/resource.d/

Legacy Heartbeat RAs (ancient, irrelevant)

systemd services (in HA for SLE12+)
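A minimal sketch of the OCF calling convention ("my-daemon" is made up; real RAs live under /usr/lib/ocf/resource.d/ and ship full meta-data):

#!/bin/bash
# toy OCF resource agent skeleton (illustrative only)
. /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs   # provides $OCF_SUCCESS, $OCF_NOT_RUNNING, ...

case "$1" in
  start)     /usr/sbin/my-daemon start || exit $OCF_ERR_GENERIC; exit $OCF_SUCCESS ;;
  stop)      /usr/sbin/my-daemon stop  || exit $OCF_ERR_GENERIC; exit $OCF_SUCCESS ;;
  monitor)   /usr/sbin/my-daemon status && exit $OCF_SUCCESS; exit $OCF_NOT_RUNNING ;;
  meta-data) echo '<resource-agent name="my-daemon">...</resource-agent>'; exit $OCF_SUCCESS ;;
  *)         exit $OCF_ERR_UNIMPLEMENTED ;;
esac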

(32)


Results of resource failures

If fail counter is exceeded, clean-up is required:

crm resource cleanup $resource

Failures are expected:

when a node dies

when storage or network failures occur

Failures are not expected during normal operation:

applying a proposal

starting or cleanly stopping resources or nodes

Unexpected failures usually indicate a bug!

Do not get into the habit of cleaning up and ignoring!
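For example (resource and node names are borrowed from elsewhere in this deck, purely as illustration):

crm_mon -1 --failcounts                                         # one-shot status including fail counts
crm resource failcount neutron-ha-tool show d52-54-01-77-77-01
crm resource cleanup neutron-ha-tool                            # only after understanding the failure!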

(33)

Before diagnosis

Understand initial state / context

crm configure graph is awesome!

crm_mon

Which fencing devices are in use?

What's the network topology?

What was done leading up to the failure?

Look for first (relevant) failure

Failures can cascade, so don't confuse cause and effect
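For instance, a quick snapshot of the initial state:

crm_mon -1rf          # one-shot status: resources (incl. inactive) and fail counts
crm configure show    # full CIB configuration, including constraints
crm configure graph   # graphical view of resources and dependencies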

(34)


crm configure graph FTW!

(35)

Diagnosis

What failed?

Resource?

Node?

Orchestration via Crowbar / chef-client?

cross-cluster ordering

Pacemaker config? (e.g. incorrect constraints)

Corosync / cluster communications?

chef-client logs are usually a good place to start

More on logging later

(36)


Symptoms of resource failures

Failures reported via Pacemaker UIs

Failed actions:
    neutron-ha-tool_start_0 on d52-54-01-77-77-01 'unknown error' (1): call=281, status=complete, last-rc-change='Thu Jun 4 16:15:14 2015', queued=0ms, exec=1734ms
    neutron-ha-tool_start_0 on d52-54-02-77-77-02 'unknown error' (1): call=259, status=complete, last-rc-change='Thu Jun 4 16:17:50 2015', queued=0ms, exec=392ms

Services become temporarily or permanently unavailable

Services migrate to another cluster node

(37)

Symptoms of node failures

Services become temporarily or permanently unavailable, or migrated to another cluster node

Node is unexpectedly rebooted (STONITH)

Crowbar web UI may show a red bubble icon next to a controller node

Hawk web UI stops responding on one of the controller nodes (should still be able to use the others)

ssh connection to a cluster node freezes

(38)


Symptoms of orchestration failures

Proposal / chef-client failed

Synchronization time-outs are common and obvious

INFO: Processing crowbar-pacemaker_sync_mark[wait-keystone_database] action guess (keystone::server line 232)

INFO: Checking if cluster founder has set keystone_database to 5...

FATAL: Cluster founder didn't set keystone_database to 5!

Find synchronization mark in recipe:

crowbar_pacemaker_sync_mark "wait-keystone_database"

# Create the Keystone Database

database "create #{node[:keystone][:db][:database]} database" do ...

So node timed out waiting for cluster founder to create keystone database

i.e. you're looking at the wrong log! So ...

root@crowbar:~ # knife search node founder:true -i
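Then read that node's chef-client log on the admin node, e.g.:

less /var/log/crowbar/chef-client/d<macid>.<domain>.log   # use the founder's node name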

(39)

Logging

All changes to cluster configuration driven by chef-client

either from application of barclamp proposal

admin node: /var/log/crowbar/chef-client/$NODE.log

or run by chef-client daemon every 900 seconds

/var/log/chef/client.log on each node

Remember chef-client often runs in parallel across nodes

All HAE components log to /var/log/messages on each cluster node

(40)


HAE logs

Which nodes' log files to look at?

Node failures:

/var/log/messages from DC

Resource failures:

/var/log/messages from DC and node with failed resource

but remember the DC can move around (elections)

Use hb_report or crm history or Hawk to assemble chronological cross-cluster log

Saves a lot of pain – strongly recommended!
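For example (start time and destination directory are illustrative):

hb_report -f "2015/06/04 16:00" /tmp/hb_report-neutron-failure
crm history            # interactive cluster history browser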

(41)

Syslog messages to look out for

Fencing going wrong

pengine[16374]: warning: cluster_status: We do not have quorum - fencing and resource management disabled

pengine[16374]: warning: stage6: Node d52-54-08-77-77-08 is unclean!

pengine[16374]: warning: stage6: Node d52-54-0a-77-77-0a is unclean!

pengine[16374]: notice: stage6: Cannot fence unclean nodes until quorum is attained (or no-quorum-policy is set to ignore)

Fencing going right

crmd[16376]: notice: te_fence_node: Executing reboot fencing operation (66) on d52-54-0a-77-77-0a (timeout=60000)

stonith-ng[16371]: notice: handle_request: Client crmd.16376.f6100750 wants to fence (reboot) 'd52-54-0a-77-77-0a' with device '(any)'

Reason for fencing is almost always earlier in log.

Don't forget all the possible reasons for fencing!

Lots more – get used to reading /var/log/messages!

(42)


Stabilising / recovering a cluster

Start with a single node

Stop all others

rcchef-client stop

rcopenais stop

rccrowbar_join stop

Clean up any failures

crm resource cleanup

crm_resource -C is buggy

crm_resource -o | \
  awk '/\tStopped |Timed Out/ { print $1 }' | \
  xargs -n1 crm resource cleanup

Make sure chef-client is happy

(43)

Stabilising / recovering a cluster

(cont.)

Add another node in

rm /var/spool/corosync/block_automatic_start

service openais start

service crowbar_join start

Ensure nothing gets fenced

Ensure no resource failures

If fencing happens, check /var/log/messages to find out why, then rectify cause

Repeat until all nodes are in cluster

(44)


Degrade Cluster for Debugging

crm configure location fixup-cl-apache cl-apache \
  rule -inf: '#uname' eq $HOSTNAME

• Allows degrading an Active/Active resource to only one instance per cluster

• Useful for tracing requests

(45)

TL; DR: Just Enough HA

crm resource list

crm_mon

crm resource restart <X>

crm resource cleanup <X>

(46)

Now Coming to OpenStack…

(47)

OpenStack Architecture Diagram

(48)

OpenStack Block diagram

Keystone: SPOF

Accesses almost everything

(49)

OpenStack Architecture

Typically each OpenStack component provides:

an API daemon / service

one or many backend daemons that do the actual work

openstack / <proj> command line client to access the API

<proj>-manage client for admin-only functionality

dashboard ("Horizon") Admin tab for a graphical view on the service

uses an SQL database for storing state
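Illustrated with Nova (the rc* init script name follows the packaging convention on the next slide):

rcopenstack-nova-api status      # the API daemon
nova --debug list                # CLI client call against the API, showing the HTTP traffic
nova-manage service list         # admin-only management client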

(50)


OpenStack Packaging Basics

Packages are usually named:

openstack-<codename>

usually a subpackage for each service (-api, -scheduler, etc)

log to /var/log/<codename>/<service>.log

each service has an init script:

dde-ad-be-ff-00-01:~ # rcopenstack-glance-api status
Checking for service glance-api ...running

(51)

OpenStack Debugging Basics

Log files often lack useful information without verbose enabled

TRACEs of processes are not logged without verbose

Many reasons for API error messages are not logged unless debug is turned on

Debug is very verbose (>10GB per hour)

https://ask.openstack.org/

http://docs.openstack.org/icehouse/

(52)

OpenStack Architecture

Keystone: SPOF

Accesses almost everything

(53)

OpenStack Dashboard: Horizon

/var/log/apache2

openstack-dashboard-error_log

Get the exact URL it tries to access!

Enable “debug” in Horizon barclamp

Test components individually

(54)


OpenStack Identity: Keystone

Needed to access all services

Needed by all services for checking authorization

Use keystone token-get to validate credentials and test service availability
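For example, assuming admin credentials are available in an openrc file (the path is an assumption):

. ~/.openrc             # OS_USERNAME, OS_PASSWORD, OS_TENANT_NAME, OS_AUTH_URL
keystone token-get      # returns a token if Keystone and the credentials are healthy
keystone endpoint-list  # verify the service catalog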

(55)

OpenStack Imaging: Glance

To validate service availability:

glance image-list

glance image-download <id> > /dev/null
glance image-show <id>

• Don’t forget hypervisor_type property!

(56)

OpenStack Compute: Nova

nova-manage service list

nova-manage logs errors

nova show <id> shows the compute node

virsh list, virsh dumpxml (on the compute node)
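Putting these together, an illustrative trace of one instance (host and instance names are placeholders):

nova show <id> | grep -E 'host|instance_name|vm_state'   # find the compute node and libvirt name
ssh <compute-node> virsh list --all
ssh <compute-node> virsh dumpxml <instance-name>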

(57)

Nova Overview

[Diagram: nova-api, nova-scheduler and nova-conductor in front of multiple nova-compute services]

(58)

Nova Booting VM Workflow

(59)

Nova: Scheduling a VM

Nova scheduler tries to select a matching compute node for the VM

(60)


Nova Scheduler

Typical errors:

No suitable compute node can be found

All suitable compute nodes failed to launch the VM with the required settings

nova-manage logs errors

INFO nova.filters [req-299bb909-49bc-4124-8b88-732797250cf5 c24689acd6294eb8bbd14121f68d5b44 acea50152da04249a047a52e6b02a2ef] Filter RamFilter returned 0 hosts

(61)

OpenStack Volumes: Cinder

[Diagram: cinder-api and cinder-scheduler in front of multiple cinder-volume services]

(62)


OpenStack Cinder: Volumes

Similar syntax to Nova:

cinder-manage service list
cinder-manage logs errors
cinder-manage host list
cinder list|show

(63)

OpenStack Networking: Neutron

Swiss Army knife for SDN

neutron agent-list
neutron net-list
neutron port-list
neutron router-list

There's no neutron-manage

(64)

Basic Network Layout

(65)

Networking with OVS:

Compute Node

(66)

Networking with LB:

Compute Node

(67)

Neutron Troubleshooting

Neutron uses IP Networking Namespaces on the Network node for routing overlapping networks

neutron net-list
ip netns list

ip netns exec qrouter-<id> bash
ping ..
arping ..
ip ro ..
curl ..
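An illustrative session on the network node (169.254.169.254 is the Nova metadata address; addresses are placeholders):

ip netns list
ip netns exec qrouter-<id> ip addr                           # interfaces inside the router namespace
ip netns exec qrouter-<id> ping -c 3 <tenant-vm-ip>          # reach a VM on the tenant network
ip netns exec qrouter-<id> curl -s http://169.254.169.254/   # metadata service reachability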

(68)


Q&A

http://ask.openstack.org/

http://docs.openstack.org/

https://www.suse.com/documentation/suse-cloud4/

Thank you

(69)
(70)

OpenStack Orchestration: Heat

(71)

OpenStack Orchestration: Heat

Uses Nova, Cinder, Neutron to assemble complete stacks of resources

heat stack-list

heat resource-list|show <stack>

heat event-list|show <stack>

Usually necessary to query the actual OpenStack service for further information

(72)


OpenStack Imaging: Glance

Usually issues are in the configured Glance backend itself (e.g. RBD, Swift, filesystem), so debugging concentrates on those

Filesystem:

/var/lib/glance/images

RBD:

ceph -w

rbd -p <pool> ls

(73)

SUSE® OpenStack Cloud

(74)
(75)

Unpublished Work of SUSE. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary, and trade secret information of SUSE.

Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE.

Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose.
