Application Scalability in Proactive Performance & Capacity Management

(1)

Bernhard Brinkmoeller,

SAP AGS IT Planning

(2)

What is Scalability?

How would you define scalability?

In the context of PPCM, is scalability a characteristic of the load or the hardware?

How would you define scalable load?

(3)

What is Scalability?

Definition from Wikipedia

In electronics (including hardware, communication and software), scalability is the ability of a system, network, or process to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth. ….

Scalability, as a property of systems, is generally difficult to define [2] and in any particular case it is necessary to define the specific requirements for scalability on those dimensions that are deemed important. It is a highly significant issue in electronics systems, databases, routers, and networking. A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system….

An algorithm, design, networking protocol, program, or other system is said to scale if it is suitably efficient and practical when applied to large situations ….

The general definition of scalability depends strongly on the context in which it is used. Even in a given context it is questionable whether it is precise enough to form the basis for defining concrete work packages for a “scalability analysis”.

It is very important that we reach a common understanding of what we want to achieve in Proactive Performance & Capacity Management before we start.

(4)

Content

Definition of Scalability

Proactive Performance & Capacity Management according to ITIL

Scalability of load with…

… the amount of business data processed in a step

… the number of parallel processes

… the size of the DB

Scalability of service time with…

… the number of CPUs available

… the capacity of the I/O Subsystem

… Database locks

Non-scalability introduced by application server buffering

Consequences for risk assessment and quality control

(5)

Capacity Management Service According To ITIL

ITIL Service Delivery v.2.1 – Published For OGC By TSO

• According to the Information Technology Infrastructure Library (ITIL), a capacity management service consists of three sub-processes.

• The output with the highest value is obtained when the results of the sub-processes are brought together.

• While the entry points of the sub-processes differ, all of them aim at establishing a connection from the business requirement via the services (reports and transactions) to the resource (CPU, memory, disk) consumption.

(6)

What is Scalability?

Definition of Scalability in PPCM

1/3

Business Capacity Management: Business volume for process X

Service Capacity Management: Resource consumption for Service Y

Resource Capacity Management: Service time for resource Z

Load is scalable when the resource consumption of the services necessary to run the business process depends linearly on the (business) volume and there are no unexpected load drivers.

Hardware is scalable when it is capable of providing the necessary resources for the required number of services in a given time interval without a degradation of service times.

(7)

What is Scalability?

Definition of Scalability in PPCM

2/3

Load is scalable when the consumption of expected resources depends linearly on the (business) volume and there are no unexpected load drivers.

Hardware is scalable when it is capable of providing the necessary resources for the required number of services in a given time interval without a degradation of service times.

Examples:

• The response time for processing an order should depend linearly on the number of items in the order. Signs of non-scalability are:

  • Quadratic dependence on the number of line items in the order. → Use sorted tables and read with binary search in ABAP.

  • Dependence on the network latency between the front end and the server. → The number of communication steps has to be so small that the network latency can be neglected.

  • Dependence on the amount of data stored in the DB. → Read only new data from chronologically sorted indices.

• The throughput of order processing should depend linearly on the CPU capacity provided by the infrastructure. Signs of non-scalability are:

  • Dependence on the length of the critical path of DB locks. → Avoid long critical paths for updates with a high likelihood of lock collision.

  • I/O bottlenecks caused by high redo volume. → Avoid unproductive database changes (e.g. by using “set update task local”).
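The quadratic dependence on the number of line items can be illustrated outside ABAP. The following Python sketch is only an illustration, not SAP code; the function names and the integer keys are assumptions. It counts the key comparisons needed to look up each of n items in a table of n entries, once by scanning an unsorted table and once by binary search on a sorted table.

```python
def comparisons_linear_scan(n):
    """Look up each of n keys in a table of n keys by scanning from the start."""
    table = list(range(n))
    count = 0
    for key in range(n):
        for entry in table:
            count += 1
            if entry == key:
                break
    return count


def comparisons_binary_search(n):
    """Look up each of n keys in a sorted table of n keys by binary search."""
    table = list(range(n))  # already sorted
    count = 0
    for key in range(n):
        lo, hi = 0, len(table) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            count += 1  # one probe of the table per loop iteration
            if table[mid] == key:
                break
            if table[mid] < key:
                lo = mid + 1
            else:
                hi = mid - 1
    return count
```

Doubling n roughly quadruples the result of `comparisons_linear_scan`, while the result of `comparisons_binary_search` only slightly more than doubles. This is the difference between quadratic and nearly linear dependence on the number of line items.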

(8)

What is Scalability?

Definition of Scalability in PPCM

3/3

On a detailed level scalability describes the relationships between:

• Volumes of all business processes supported by a system

• The consumption of the various resources provided by the system

• The service request times the system is capable of providing

A system is scalable up to the required limit when, even under high load, the share of non-scalable contributions in the overall resource consumption and service times remains below an acceptable limit and the hardware can provide the required resources at peak time without unacceptable degradation of service request times.

For very large systems the acceptable contribution of non-scalable load to the resource consumption is typically set at about 20%. For smaller systems it is much higher, as it is cheaper to provide more hardware.

A system is scalable when the load constituting about 80% of the resource consumption is proven to be scalable and the hardware can provide the required resources at peak time with a degradation of service request times of less than 20%. (Limits are debatable.)

A scalability analysis is always restricted to the (expected) top load contributors.
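The 80%/20% criterion can be written down as a simple check. The following Python sketch is an illustration under assumptions: the function name, the input layout (a list of top load contributors with their resource consumption and a flag for proven scalability) and the default limits are hypothetical, and the limits themselves are debatable.

```python
def is_scalable(contributors, degradation_at_peak,
                min_scalable_share=0.80, max_degradation=0.20):
    """Apply the 80%/20% rule of thumb to a list of load contributors.

    contributors: list of (resource_consumption, proven_scalable) pairs.
    degradation_at_peak: relative increase of service request times at
        peak load, e.g. 0.10 for 10%.
    Both limits are assumptions taken from the rule of thumb.
    """
    total = sum(consumption for consumption, _ in contributors)
    if total <= 0:
        raise ValueError("total resource consumption must be positive")
    scalable = sum(consumption for consumption, proven in contributors if proven)
    share = scalable / total
    return share >= min_scalable_share and degradation_at_peak < max_degradation
```

For example, three contributors with consumption 50, 35 and 15, of which only the first two are proven scalable, give a scalable share of 85% and pass the check as long as the degradation at peak stays below 20%.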

(9)

Content

Definition of Scalability

Proactive Performance & Capacity Management according to ITIL

Scalability of load with…

… the amount of business data processed in a step

… the number of parallel processes

… the size of the DB

Scalability of service time with…

… the number of CPUs available

… the capacity of the I/O Subsystem

… Database locks

Non-scalability introduced by application server buffering

Consequences for risk assessment and quality control

Consequences for monitoring

(10)

The size of the DB

Principal Scalability Patterns

The following scaling behavior of the load with the DB size can be observed:

1. Constant resource consumption independent of table size

All fully indexed accesses to data that have a high likelihood of touching only data blocks in the buffer are independent of the table size (the small dependency on the depth of the B-tree of the index can be neglected).

2. Decrease of the buffer hit ratio with table size

In case the chance that a data block is found in the buffer decreases with the index or table size, a weak linear dependency of the resource consumption on the table size can be observed.

3. Directly proportional to the table size

In case the number of data blocks that need to be read increases with the table size, a strong linear dependency of the resource consumption on the table size can be observed.

[Figure: Scaling behaviour with DB size. x-axis: DB table growth; y-axis: factor of load increase; curves: independent; buffer hit ratio depends on table size; amount of data read depends on table size]
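The three patterns can be told apart from two measurements of the same statement at different table sizes. The following Python sketch is an illustration under assumptions: the function name and the tolerance are hypothetical, and the classification simply compares the slope of load increase against table growth with 0 (independent) and 1 (directly proportional).

```python
def classify_scaling(growth_factor, load_factor, tolerance=0.1):
    """Classify the scaling of load with table size.

    growth_factor: table size now / table size before (must be > 1).
    load_factor: resource consumption now / resource consumption before.
    tolerance: hypothetical margin for measurement noise.
    """
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")
    slope = (load_factor - 1) / (growth_factor - 1)
    if slope <= tolerance:
        return "independent of table size"
    if slope >= 1 - tolerance:
        return "directly proportional to table size"
    return "weak linear dependency (buffer hit ratio)"
```

A statement whose load grows by a factor of 1.6 while the table triples has a slope of 0.3 and falls into the second pattern.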

(11)

The size of the DB

Amount of Data Read Depends on Table Size

1/3

• In the cursor cache, statements that create a load proportional to the size of the DB can be identified by a large (and growing) number of Bgets/row or Rproc/Exec.

• In case Rproc/Exec is large, the most common technical issue is a select for all entries with an empty selection table. This always needs to be checked.

• In case this is not the cause, it has to be checked how the processes can be changed to reduce the number of records read.

• In case Bgets/row is large, the index layout has to be checked. In case the access is to a single table, correct indexing will always make it possible to reduce the Bgets/row to < 6.
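The screening described above can be automated on an export of the cursor cache. The following Python sketch is an illustration under assumptions: the record layout and field names are hypothetical, the Bgets/row limit of 6 is taken from the rule of thumb above, and the Rproc/Exec limit is an arbitrary placeholder that has to be chosen per system.

```python
from dataclasses import dataclass


@dataclass
class StatementStats:
    """One cursor cache entry (field names are hypothetical)."""
    sql_id: str
    executions: int
    rows_processed: int
    buffer_gets: int


def screen_statements(statements, max_bgets_per_row=6.0, max_rows_per_exec=1000.0):
    """Return (sql_id, hint) pairs for statements worth a closer look.

    max_bgets_per_row follows the "< 6" rule of thumb for single-table access;
    max_rows_per_exec is an arbitrary placeholder threshold.
    """
    findings = []
    for s in statements:
        if s.executions <= 0 or s.rows_processed <= 0:
            continue  # no usable ratios for this entry
        rows_per_exec = s.rows_processed / s.executions
        bgets_per_row = s.buffer_gets / s.rows_processed
        if rows_per_exec > max_rows_per_exec:
            findings.append((s.sql_id, "large Rproc/Exec: check for empty selection table"))
        if bgets_per_row >= max_bgets_per_row:
            findings.append((s.sql_id, "large Bgets/row: check index layout"))
    return findings
```

The hints only point to the first check to make. Whether a large Rproc/Exec is justified still has to be decided from the business process behind the statement.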

(12)

The size of the DB

Amount of Data Read Depends on Table Size

2/3

• In case the large number of buffer gets is seen for a join with distributed selectivity, it is not always possible to improve the situation by technical means. The most prominent and frequently seen example of such a join is the selection of material movements, either in the standard or, as seen here, in customer coding.

• The main issue here is the distributed selectivity: the date is on MKPF and all other fields are on MSEG. In this special case the only stable solution is described in SAP Note 1598760 (FAQ: MSEG extension and MB51/MB5B redesign). The changes necessary to avoid such non-scalability are very complex, as it is necessary to change not only the coding but also the table layout.

• In many cases it is therefore not feasible to implement a solution, and most customers refuse to implement the changes. In that case knowledge of the non-scalability can be used to estimate the largest allowed residence time before archiving that stays within acceptable performance limits.
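The estimate of the largest allowed residence time can be sketched as a linear extrapolation. The following Python sketch is an illustration under assumptions: it presumes that the runtime of the non-scalable statement grows linearly with the residence time of the data, that two measurements are available, and that a runtime limit has been agreed. The function name and the units (months, seconds) are hypothetical.

```python
def max_residence_months(residence_a, runtime_a, residence_b, runtime_b, runtime_limit):
    """Extrapolate linearly to the residence time at which the limit is reached.

    residence_a, residence_b: residence times of the data at two measurements (months).
    runtime_a, runtime_b: runtimes measured at these residence times (seconds).
    runtime_limit: largest acceptable runtime (seconds).
    """
    if residence_a == residence_b:
        raise ValueError("the two measurements need different residence times")
    slope = (runtime_b - runtime_a) / (residence_b - residence_a)
    if slope <= 0:
        raise ValueError("runtime does not grow with residence time")
    intercept = runtime_a - slope * residence_a
    return (runtime_limit - intercept) / slope
```

With a runtime of 2 s at 12 months and 4 s at 24 months, a limit of 6 s would be reached at about 36 months of residence time.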

(13)

The size of the DB

Amount of Data Read Depends on Table Size

3/3

• A nice example of non-scalable load with an ever increasing amount of data read can be found in customer systems with long-running delivery contracts, which are typical for the automotive industry.

• Such a statement is the select from EKBE, which is the second most time consuming in the snapshot of the cursor cache, with already more than a billion recorded disk reads. It is a select with specified EBELN and POSNR, so it looks quite harmless. Rproc/Exec is not that high, as it is diluted by many accesses caused by simple purchase orders. But the huge number of disk reads triggered is suspicious.

• EKBE is the order history containing all deliveries made on behalf of a contract. With JIT this might be one delivery every 3 minutes for more than a year for each position of a contract. As more and more old data is touched, this drives up the I/O load for this access dramatically. A solution for this special issue is transaction ME87, which needs to be run regularly to summarize the order history (see SAP Note 417933 for details).

• The example shows once again how important it is to understand the business processes associated with the top resource consuming statements in order to find relevant performance improvements.

(14)

Example: VAPMA-VBUK

Non-scalable: runtime increases with the number of orders in the DB

• The database uses index VAPMA~Z01 to access the data; this way all entries belonging to one plant are read each time.

• The runtime of this statement will increase with the number of orders in the system.

→ It is necessary to change the access so that the number of open orders determines the runtime. This is most securely done by selecting from VBUK first or by introducing Oracle hints to use index VBUK~z02.

(15)

The size of the DB

Buffer hit ratio depends on table size

1/2

Less obvious than the dependencies discussed before are cases where we have fully indexed access to data and only necessary data is read. Nevertheless, among the most expensive statements in the cursor cache there are statements that are executed in huge but justified numbers and have a rather bad ratio of disk reads to buffer gets compared to the overall buffer quality.

Very often this is caused by an access via a non-chronologically sorted index. The theory behind this is elaborated in more detail in: “Data Archiving Improves Performance — Myth or Reality?”

http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/d0b0de48-0701-0010-fcb1-fb99d43920e3?QuickLink=index&overridelayout=true&5003637390373

or, at greater length, in “Performance Aspects of Data Archiving”

https://websmp108.sap-ag.de/~sapidb/011000358700005070382005E/DA_and_Performance_11_en.pdf

The main principle is easy to understand: If the data accessed is randomly distributed over the full width of an index or even a DB table, the buffer hit ratio will depend heavily on the ratio of index/table size to buffer size. Assuming fixed buffer sizes and growing index and table sizes, the buffer hit ratio will go down.

This is not the case when the access is concentrated on a small part of the index/table that does not grow with the DB size.
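The principle can be reproduced with a small simulation. The following Python sketch is an illustration under assumptions: it models the DB buffer as a plain LRU cache of data blocks, which is a simplification of a real database buffer, and all names and sizes are hypothetical. Access is either spread randomly over all blocks or concentrated on a fixed number of "hot" blocks at the end of the table.

```python
import random
from collections import OrderedDict


def simulate_hit_ratio(total_blocks, buffer_blocks, hot_blocks=None,
                       accesses=20000, seed=0):
    """Return the buffer hit ratio of random block accesses with an LRU buffer.

    hot_blocks=None: accesses are spread over all total_blocks.
    hot_blocks=k: accesses are concentrated on the last k blocks.
    """
    rng = random.Random(seed)
    span = total_blocks if hot_blocks is None else min(hot_blocks, total_blocks)
    first = total_blocks - span
    buffer = OrderedDict()  # block number -> True, ordered by last use
    hits = 0
    for _ in range(accesses):
        block = first + rng.randrange(span)
        if block in buffer:
            hits += 1
            buffer.move_to_end(block)  # mark as most recently used
        else:
            buffer[block] = True
            if len(buffer) > buffer_blocks:
                buffer.popitem(last=False)  # evict least recently used block
    return hits / accesses
```

With a fixed buffer of 500 blocks, growing the table from 1000 to 4000 blocks lowers the hit ratio of the randomly distributed access considerably, while the access concentrated on 400 hot blocks keeps a high hit ratio at both table sizes.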

(16)

The size of the DB

Buffer hit ratio depends on table size

2/2

Data is “touched” by the DB when it resides in the same data block as data that is needed to fulfill a request. It follows that the DB load can only be scalable when old data that is no longer needed does not reside in data blocks that also contain new data needed to fulfill a request.

The pictures below show the insertion points of new data in a chronologically sorted index compared to those of an index that is not chronologically sorted. In a chronologically sorted index the amount of data touched for all business transactions remains constant and is independent of the number of entries in the table.

If the index is not chronologically sorted this is not the case: the number of data blocks that are touched increases as the fraction of new data per index block gets smaller and smaller, until the growth of old data is stopped, for instance by data archiving.

Classifying the indices used for data access by how well they preserve chronological order allows a very good estimate of the scalability of an application even from single user measurements.
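The effect of chronological sorting on the number of blocks touched can be illustrated with a simplified model. The following Python sketch is an illustration under assumptions: an index is modelled as a sorted list of keys cut into blocks of fixed size, which ignores block splits and fill factors of a real B-tree, and all names and sizes are hypothetical. It counts the blocks that contain at least one new entry.

```python
import random


def blocks_with_new_data(old_keys, new_keys, block_size=100):
    """Count index blocks that contain at least one new key."""
    new_set = set(new_keys)
    index = sorted(list(old_keys) + list(new_keys))
    touched = {pos // block_size for pos, key in enumerate(index) if key in new_set}
    return len(touched)


def compare_indices(n_old, n_new=1000, block_size=100, seed=0):
    """Return (chronological, non_chronological) counts of blocks with new data."""
    rng = random.Random(seed)
    # Chronological index: keys grow with time, e.g. a document number.
    chrono = blocks_with_new_data(range(n_old), range(n_old, n_old + n_new), block_size)
    # Non-chronological index: key values are unrelated to the insertion time.
    keys = rng.sample(range(10 * (n_old + n_new)), n_old + n_new)
    non_chrono = blocks_with_new_data(keys[:n_old], keys[n_old:], block_size)
    return chrono, non_chrono
```

In this model the chronological index keeps the 1000 new entries in the same small number of blocks whether 10,000 or 100,000 old entries exist, while the number of blocks holding new data in the non-chronological index grows with the amount of old data.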

(17)

The size of the DB

Tools to check

The most important tools to check this are the SQL trace in single user measurements and the cursor cache after go-live.

There may be several reasons for this kind of non-scalability. In the cursor cache, statements need to be checked for a large number of Bgets/execution, a large number of Rproc/execution, and even for a worse than average buffer hit ratio.

In a single user trace it is necessary to check indexing and table design, with special attention given to the explicit or implicit time constraints in the where clause of each statement and how these are handled in the index.

In particular, the different buffer quality for old and new data is neglected in many table and index designs, and it very often makes the decisive difference between scalable and non-scalable load.

(18)

Praxis Check:

Indices of DFKKOP

Table DFKKOP (Items in contract account document) is typically the largest and most important table of FI-CA, with billions of entries in customer systems. In the standard, 6 indices are defined for this table:

None of the indices is explicitly chronologically sorted. Specifying a time as the last field of an index (AUGDT in indices ~5 and ~6) only creates a chronological order for entries with equal VTREF and BUKRS, which does not prevent the mixture of new and old data in one block. All of the indices are implicitly chronologically sorted, by the use of either a document number (ascending with time) or the clearing status (open → new; closed → old).

Note: The clearing status was explicitly chosen as the second field of all indices that do not contain a document number, to achieve a separation between old and new data and to enhance the scalability of access to new data with status “open”. Note also: there is no chronological order among the “closed” records for indices ~1, ~4, ~5 and ~6.

Any access to the “closed” records via one of the indices ~1, ~4, ~5 or ~6 creates a non-scalable load.

[Table: fields of the indices dfkkop~0 to dfkkop~6]

First field: MANDT in all indices.

Second field: OPBEL (~0), AUGST (~1), AUGBL (~2), ABWBL (~3), AUGST (~4), AUGST (~5), AUGST (~6).

Further fields listed on the slide: OPUPW, GPART, AUGST, VKONT, VTREF, ABWKT, OPUPK, BUKRS, WHGRP, OPUPZ, XMANL, AUGDT.

Field descriptions:

OPBEL: Number of Contract Accts Rec. & Payable Doc.
AUGST: Clearing status
AUGBL: Clearing Document or Printed Document
ABWBL: Number of the substitute FI-CA document
OPUPW: Repetition Item in Contract Account Document
GPART: Business partner
VKONT: Contract Account Number
VTREF: Reference Specifications from Contract
ABWKT: Alternative contract account for collective bills
OPUPK: Item number in contract account document
BUKRS: Company Code
WHGRP: Repetition group
OPUPZ: Subitem for a Partial Clearing in Document
XMANL: Exclude Item from Dunning Run
AUGDT: Clearing date

(19)

Praxis Check:

Access to DFKKOP

Insert of new records into DFKKOP

When new records are inserted into DFKKOP the status is open. Insertion points are concentrated locally for new items

Access to open items

Use of all indices guarantees local access to new items only

Clearing run

The change of the clearing status distributes the entry points in indices ~1, ~4, ~5 and ~6 evenly over the complete index range, forcing access to the complete range of these indices.

Open item list for settlement day

(21)

Expected Performance impact of HANA Migration

Insert of new records into DFKKOP

Any insert is just done into the L1-delta. While the merge will be very resource intensive the insert itself should be fast.

Access to open items

Being column based, HANA is at a disadvantage in principle here, which will result in higher access times.

Clearing run

The update of the records again is just an insert into the L1-delta.

Open item list for settlement day

While this touches only recent data (as long as the report is executed a short time after the settlement day), the amount of data that has to be read is large enough that this disadvantage is offset by the efficient access to the data in the column store.

(23)

Example: DB Cursor Cache Analysis

Resource Consumption of Top 20 SQL Statements

[Figure: Resource consumption of the top 20 SQL statements. y-axis: ratio to top contributor [%]; series: duration, disk reads, buffer reads, rows read]

(24)

Importance of top 20 Resource Consumers

• Efficient optimization concentrates on the largest resource consumers.

• The longer and the more extensively this approach is followed, the smaller the relative importance of the top n resource consumers compared to the rest will be. The effect of each optimization becomes smaller and less significant for overall sizing.

• SYS2 has reached a state where this approach no longer shows any significant potential for improvement. This can be seen very clearly using the example of the shared cursor cache analysis:

• The figure shows the top 20 statements with respect to the number of buffer gets from SYS2 at time 1 and time 2, together with an example of another customer system before and after optimization.

• The slopes of the curves for SYS2 are so small that the top 20 are virtually meaningless for the overall resource consumption.

[Figure: Relative cost of top resource consuming statements. x-axis: top n statements sorted by buffer gets (1 to 20); y-axis: # buffer gets normalized to top statement (0 to 1); series: SYS1 (time 1), SYS1 (time 2), Z2L Jun 2012, Z2L Dec 2012, SYS2 (time 1), SYS2 (time 2)]
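The normalization used for such a curve can be sketched in a few lines. The following Python sketch is an illustration under assumptions: the function names are hypothetical and the input is simply a list of buffer gets per statement taken from a cursor cache snapshot. It normalizes the top n statements to the top statement and, as an additional measure, computes the share of the top n in the total.

```python
def normalized_top_n(buffer_gets, n=20):
    """Return the top n buffer gets, each divided by the largest value."""
    top = sorted(buffer_gets, reverse=True)[:n]
    if not top or top[0] <= 0:
        raise ValueError("need at least one statement with positive buffer gets")
    return [value / top[0] for value in top]


def top_n_share(buffer_gets, n=20):
    """Return the share of the top n statements in the total buffer gets."""
    total = sum(buffer_gets)
    if total <= 0:
        raise ValueError("total buffer gets must be positive")
    return sum(sorted(buffer_gets, reverse=True)[:n]) / total
```

A curve that falls off steeply after the first few statements indicates that tuning the top statements still pays off. A flat curve combined with a small top n share indicates that this approach is exhausted.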