Application Scalability in
Proactive Performance & Capacity Management
Bernhard Brinkmoeller,
SAP AGS IT Planning
What is Scalability?
How would you define scalability?
In the context of PPCM, is scalability a characteristic of the load or of the
hardware?
How would you define scalable load?
© 2013 SAP AG. All rights reserved. 3
What is Scalability?
Definition from Wikipedia
In electronics (including hardware, communication and software), scalability is the
ability of a system, network, or process to handle a growing amount of work in a
capable manner or its ability to be enlarged to accommodate that growth. …
Scalability, as a property of systems, is generally difficult to define [2] and in any
particular case it is necessary to define the specific requirements for scalability on
those dimensions that are deemed important. It is a highly significant issue in
electronics systems, databases, routers, and networking. A system whose
performance improves after adding hardware, proportionally to the capacity
added, is said to be a scalable system….
An algorithm, design, networking protocol, program, or other system is said to scale
if it is suitably efficient and practical when applied to large situations …
The general definition of scalability depends strongly on the context in which it is
used. Even in a given context it is questionable whether it is precise enough to
form the basis for defining concrete work packages for a “scalability analysis”.
It is very important that we reach a common understanding of what we want
to achieve in Proactive Performance & Capacity Management before we start.
Content
Definition of Scalability
Proactive Performance & Capacity Management according to ITIL
Scalability of load with…
… the amount of business data processed in a step
… the number of parallel processes
… the size of the DB
Scalability of service time with…
… the number of CPUs available
… the capacity of the I/O Subsystem
… Database locks
Non-scalability introduced by application server buffering
Consequences for risk assessment and quality control
Capacity Management Service According To ITIL
ITIL Service Delivery v.2.1 – Published For OGC By TSO
• According to the Information Technology Infrastructure Library (ITIL), a capacity management service consists of three sub-processes.
• The output with the highest value is obtained when the results of the sub-processes are brought together.
• While the entry points of the sub-processes differ, all of them aim at establishing a connection from the business requirements via the services (reports and transactions) to the resource (CPU, memory, disk) consumption.
What is Scalability?
Definition of Scalability in PPCM
1/3
Business Capacity Management:
Business volume for process X
Load is scalable when the resource consumption of the services necessary to run
the business process depends linearly on the (business) volume and there are no
unexpected load drivers.
Resource Capacity Management:
Service time for resource Z
Service Capacity Management:
Resource consumption for service Y
Hardware is scalable when it is capable of providing the necessary resources for the
required number of services in a given time interval without a degradation of service
times.
What is Scalability?
Definition of Scalability in PPCM
2/3
Load is scalable when the consumption of expected resources depends linearly on
the (business) volume and there are no unexpected load drivers.
Hardware is scalable when it is capable of providing the necessary resources for the
required number of services in a given time interval without a degradation of service
times.
Examples:
• The response time for processing an order should depend linearly on the number of items in the order. Signs of non-scalability are:
  • Quadratic dependence on the number of line items in the order. Remedy: use sorted tables and READ TABLE ... BINARY SEARCH in ABAP.
  • Dependence on the network latency between the front end and the server. Remedy: keep the number of communication steps so small that the network latency can be neglected.
  • Dependence on the amount of data stored in the DB. Remedy: read only new data from chronologically sorted indices.
• The throughput of order processing should depend linearly on the CPU capacity provided by the infrastructure. Signs of non-scalability are:
  • Dependence on the length of the critical path of DB locks. Remedy: avoid long critical paths for updates with a high likelihood of lock collisions.
  • I/O bottlenecks caused by high redo volume. Remedy: avoid unproductive database changes (e.g. by using “SET UPDATE TASK LOCAL”).
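The difference between a quadratic and a near-linear dependence on the number of line items can be made visible by counting comparisons. The following is a minimal sketch in Python rather than ABAP; the function names are hypothetical, and the line items are modelled as plain integers purely for illustration.

```python
def linear_lookup_cost(n_items):
    """Comparisons needed to look up every item once by scanning the table."""
    items = list(range(n_items))
    comparisons = 0
    for key in items:
        for item in items:          # sequential scan per lookup
            comparisons += 1
            if item == key:
                break
    return comparisons


def binary_lookup_cost(n_items):
    """Comparisons needed to look up every item once by binary search."""
    items = list(range(n_items))    # the table is kept sorted
    comparisons = 0
    for key in items:
        lo, hi = 0, len(items)
        while lo < hi:              # classic lower-bound binary search
            mid = (lo + hi) // 2
            comparisons += 1
            if items[mid] < key:
                lo = mid + 1
            else:
                hi = mid
    return comparisons


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(n, linear_lookup_cost(n), binary_lookup_cost(n))
```

Doubling the number of items roughly quadruples the cost of the scan variant, while the binary search variant grows only slightly faster than linearly.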
What is Scalability?
Definition of Scalability in PPCM
3/3
On a detailed level, scalability describes the relationships between:
• Volumes of all business processes supported by a system
• The consumption of the various resources provided by the system
• The service request times the system is capable of providing
A system is scalable up to the required limit when, even under high load, the
share of non-scalable contributions in the overall resource
consumption and service times remains below an acceptable limit and the
hardware can provide the required resources at peak time without
unacceptable degradation of service request times.
For very large systems the acceptable contribution of non-scalable load to the
resource consumption is typically set at about 20%. For smaller systems it is much
higher, as it is cheaper to provide more hardware.
A system is scalable when the load constituting about 80% of the resource
consumption is proven to be scalable and the hardware can provide the
required resources at peak time with a degradation of service request
times of less than 20%. (Limits are debatable.)
A scalability analysis is always restricted to the (expected) top load
contributors.
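The 80%/20% rule of thumb can be written down as a simple check. This is only a sketch: the class `LoadContributor`, the function `is_scalable` and the way degradation is measured (relative increase of the service request time at peak over a baseline) are assumptions made for illustration, and the limits are the debatable defaults named on this slide.

```python
from dataclasses import dataclass


@dataclass
class LoadContributor:
    name: str
    resource_share: float    # fraction of total resource consumption, 0..1
    proven_scalable: bool


def is_scalable(contributors, base_service_time, peak_service_time,
                min_scalable_share=0.80, max_degradation=0.20):
    """Apply the rule of thumb: >= 80% proven scalable load, < 20% degradation."""
    scalable_share = sum(c.resource_share for c in contributors
                         if c.proven_scalable)
    degradation = (peak_service_time - base_service_time) / base_service_time
    return scalable_share >= min_scalable_share and degradation < max_degradation


if __name__ == "__main__":
    load = [
        LoadContributor("order entry", 0.50, True),
        LoadContributor("billing run", 0.35, True),
        LoadContributor("custom report", 0.15, False),
    ]
    # Hypothetical service request times in milliseconds.
    print(is_scalable(load, base_service_time=100, peak_service_time=110))
```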
Content
Definition of Scalability
Proactive Performance & Capacity Management according to ITIL
Scalability of load with…
… the amount of business data processed in a step
… the number of parallel processes
… the size of the DB
Scalability of service time with…
… the number of CPUs available
… the capacity of the I/O Subsystem
… Database locks
Non-scalability introduced by application server buffering
Consequences for risk assessment and quality control
Consequences for monitoring
The size of the DB
Principal Scalability Patterns
The following scaling behavior of the load with the DB size can be observed:
1. Constant resource consumption independent of table size
All fully indexed accesses to data that have a high likelihood of touching only data blocks in the buffer are independent of the table size (the small dependency on the depth of the B-tree for the index can be neglected).
2. Decrease of the buffer hit ratio with table size
If the chance that a data block is found in the buffer decreases with the index or table size, a weak linear dependency of the resource consumption on the table size can be observed.
3. Directly proportional to the table size
If the number of data blocks that need to be read increases with the table size, a strong linear dependency of the resource consumption on the table size can be observed.
[Figure: Scaling behaviour with DB size. x-axis: DB table growth; y-axis: factor of load increase. Series: independent of table size; buffer hit ratio depends on table size; amount of data read depends on table size.]
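Why the B-tree depth can be neglected for the first pattern becomes clear from a short calculation. The sketch below assumes a hypothetical fan-out of 200 index entries per block; real values depend on key length and block size.

```python
def btree_depth(n_entries, entries_per_block=200):
    """Number of B-tree levels needed to address n_entries (integer arithmetic)."""
    depth = 1
    capacity = entries_per_block
    while capacity < n_entries:
        capacity *= entries_per_block   # each extra level multiplies the reach
        depth += 1
    return depth


if __name__ == "__main__":
    for n in (10**6, 10**9):
        print(n, btree_depth(n))
```

Under this assumption a table that grows by a factor of 1000, from a million to a billion entries, needs only one additional index level per access.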
The size of the DB
Amount of Data Read Depends on Table Size
1/3
• In the cursor cache, statements that create a load proportional to the size of the DB can be identified by a large (and growing) number of Bgets/row or Rproc/Exec.
• If Rproc/Exec is large, the most common technical cause is a SELECT ... FOR ALL ENTRIES with an empty selection table. This always needs to be checked.
• If this is not the cause, it has to be checked how the processes can be changed to reduce the number of records read.
• If Bgets/row is large, the index layout has to be checked. If the access is to a single table, correct indexing will always allow Bgets/row to be reduced to < 6.
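The checks above can be sketched as a small classification of cursor cache entries. The record layout `CursorCacheEntry` and the function `findings` are hypothetical; the Bgets/row limit of 6 is taken from the text, while the limit of 1000 rows per execution is an arbitrary assumption for illustration.

```python
from dataclasses import dataclass


@dataclass
class CursorCacheEntry:
    sql_id: str
    executions: int
    buffer_gets: int
    rows_processed: int


def findings(entry, max_bgets_per_row=6, max_rows_per_exec=1000):
    """Return the follow-up checks suggested for one cursor cache entry."""
    result = []
    rows_per_exec = (entry.rows_processed / entry.executions
                     if entry.executions else 0.0)
    bgets_per_row = (entry.buffer_gets / entry.rows_processed
                     if entry.rows_processed else float(entry.buffer_gets))
    if rows_per_exec > max_rows_per_exec:
        result.append("Rproc/Exec high: check FOR ALL ENTRIES with an empty "
                      "selection table, then the business process")
    if bgets_per_row >= max_bgets_per_row:
        result.append("Bgets/row high: check the index layout")
    return result


if __name__ == "__main__":
    suspicious = CursorCacheEntry("stmt_a", executions=10,
                                  buffer_gets=500_000, rows_processed=50_000)
    for line in findings(suspicious):
        print(line)
```

To detect growth over time, the same ratios would have to be compared between two snapshots of the cursor cache.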
The size of the DB
Amount of Data Read Depends on Table Size
2/3
• If the large number of buffer gets is seen for a join with distributed selectivity, it is not always possible to improve the situation by technical means. The most prominent and most frequently seen example of such a join is the selection of material movements, either in the standard or, as seen here, in customer coding.
• The main issue here is the distributed selectivity: the date is on MKPF and all other fields are on MSEG. In this special case the only stable solution is described in SAP Note 1598760 (FAQ: MSEG extension and MB51/MB5B redesign). The changes necessary to avoid such non-scalability are very complex, as it is necessary to change not only the coding but also the table layout.
• In many cases it is therefore not feasible to implement a solution, and most customers refuse to implement the changes. In that case, knowledge of the non-scalability can be used to estimate the largest allowed residence time before archiving that stays within acceptable performance limits.
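One way to estimate the largest allowed residence time is to extrapolate linearly from two measurements. The sketch below is an assumption, not a method prescribed here: it presumes that the runtime grows linearly with the residence time (constant data inflow), and the function name and all numbers are hypothetical.

```python
def max_residence_time(months_1, runtime_1, months_2, runtime_2, runtime_limit):
    """Extrapolate linearly to the residence time at which runtime hits the limit."""
    slope = (runtime_2 - runtime_1) / (months_2 - months_1)
    if slope <= 0:
        raise ValueError("runtime does not grow with residence time")
    intercept = runtime_1 - slope * months_1   # runtime share independent of data age
    return (runtime_limit - intercept) / slope


if __name__ == "__main__":
    # Hypothetical measurements: 40 s with 12 months of data, 70 s with 24 months.
    print(max_residence_time(12, 40.0, 24, 70.0, runtime_limit=100.0))
```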
The size of the DB
Amount of Data Read Depends on Table Size
3/3
• A nice example of non-scalable load with an ever-increasing amount of data read can be found in customer systems with long-running delivery contracts, as is typical for the automotive industry.
• Such a statement is the select from EKBE, which is the second most time-consuming statement in the snapshot of the cursor cache, with already more than a billion recorded disk reads. It is a select with specified EBELN and POSNR, so it looks quite harmless. Rproc/Exec is not that high, as it is diluted by the many accesses caused by simple purchase orders. But the huge number of disk reads triggered is suspicious.
• EKBE is the order history containing all deliveries made on behalf of a contract. With JIT this might be one delivery every 3 minutes for more than a year for each item of a contract. As more and more old data is touched, this drives up the I/O load for this access dramatically. A solution for this special issue is transaction ME87, which needs to be used regularly to summarize the order history (see SAP Note 417933 for details).
• The example shows once again how important it is to understand the business processes associated with the top resource-consuming statements in order to find relevant performance improvements.
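The order of magnitude of the EKBE example follows from simple arithmetic. The sketch below assumes round-the-clock deliveries, which the text does not state; the function name is hypothetical.

```python
def history_records_per_item(minutes_between_deliveries, days, hours_per_day=24):
    """Order history records accumulated for one contract item."""
    deliveries_per_day = hours_per_day * 60 // minutes_between_deliveries
    return deliveries_per_day * days


if __name__ == "__main__":
    # One delivery every 3 minutes for one year, delivering 24 hours a day.
    print(history_records_per_item(3, 365))
```

Under these assumptions a single contract item accumulates 175,200 history records per year, all of which are touched by an access that specifies only the document and the item.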
Example: VAPMA-VBUK
Non-scalable: runtime increases with the number of orders in the DB
• The database uses index VAPMA~Z01 to access the data; this way all entries belonging to one plant are read each time.
• The runtime of this statement will increase with the number of orders in the system.
It is necessary to change the access so that the number of open orders determines the runtime. This is most securely done by selecting from VBUK first or by introducing Oracle hints to use index VBUK~z02.
The size of the DB
Buffer hit ratio depends on table size
1/2
Less obvious than the dependencies discussed before are cases where the access to data is fully indexed and only the necessary data is read. Nevertheless, the cursor cache may contain statements among the most expensive ones that are executed in huge but justified numbers and show a rather bad ratio of disk reads to buffer gets compared to the overall buffer quality.
Very often this is caused by an access via a non-chronologically sorted index. The theory behind this is elaborated in more detail in: “Data Archiving Improves Performance — Myth or Reality?”
http://www.sdn.sap.com/irj/scn/go/portal/prtroot/docs/library/uuid/d0b0de48-0701-0010-fcb1-fb99d43920e3?QuickLink=index&overridelayout=true&5003637390373
or in more detail in “Performance Aspects of Data Archiving”
https://websmp108.sap-ag.de/~sapidb/011000358700005070382005E/DA_and_Performance_11_en.pdf
The main principle is easy to understand: if the data accessed is randomly distributed over the full width of an index or even a DB table, the buffer hit ratio will depend heavily on the ratio of index/table size to buffer size. Assuming fixed buffer sizes and growing index and table sizes, the buffer hit ratio will go down.
This is not the case when the access is concentrated on a small part of the index/table that does not grow with the DB size.
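This principle can be reproduced with a small simulation. The sketch below replays block accesses against an LRU buffer of fixed size, once spread randomly over the whole table and once concentrated on the most recent blocks. The LRU policy, the block counts and all function names are assumptions chosen for illustration; a real database buffer is more elaborate.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(accesses, buffer_blocks):
    """Replay block accesses against an LRU buffer and return the hit ratio."""
    buffer = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in buffer:
            hits += 1
            buffer.move_to_end(block)        # mark as most recently used
        else:
            buffer[block] = True
            if len(buffer) > buffer_blocks:
                buffer.popitem(last=False)   # evict least recently used block
    return hits / total if total else 0.0


def hit_ratios(table_blocks, buffer_blocks=1000, hot_blocks=500,
               n_accesses=20000, seed=1):
    """Hit ratio for access spread over the table vs. access to recent blocks."""
    rng = random.Random(seed)
    spread = [rng.randrange(table_blocks) for _ in range(n_accesses)]
    start = max(0, table_blocks - hot_blocks)
    recent = [rng.randrange(start, table_blocks) for _ in range(n_accesses)]
    return (lru_hit_ratio(spread, buffer_blocks),
            lru_hit_ratio(recent, buffer_blocks))


if __name__ == "__main__":
    for size in (2000, 4000, 8000):
        print(size, hit_ratios(size))
```

With a fixed buffer, the hit ratio of the spread access falls as the table grows, while the access concentrated on recent blocks keeps a high hit ratio regardless of table size.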
The size of the DB
Buffer hit ratio depends on table size
2/2
Data is “touched” by the DB when it resides in the same data block as data that is needed to fulfill a request. Therefore the DB load can only be scalable when old data that is no longer needed does not reside in data blocks that also contain new data needed to fulfill a request.
The pictures below show the insertion points of new data in a chronologically sorted index compared to those of an index that is not chronologically sorted. In a chronologically sorted index the amount of data touched for all business transactions remains constant and is independent of the number of entries in the table.
If the index is not chronologically sorted, this is not the case: the number of data blocks that are touched increases as the fraction of new data per index block gets smaller and smaller, until the growth of old data is stopped, for instance by data archiving.
Classifying the indices used for data access by how well they preserve chronology allows a very good estimate of the scalability of an application, even from single-user measurements.
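The effect of the insertion points can be sketched by counting how many index blocks contain at least one new entry. The model below is a simplification under stated assumptions: the index is treated as a sorted list cut into blocks of 100 entries, chronological keys are ascending integers, and non-chronological keys are random integers; all names are hypothetical.

```python
import random


def blocks_with_new_entries(old_keys, new_keys, entries_per_block=100):
    """Count index blocks that hold at least one new entry after insertion."""
    new = set(new_keys)
    merged = sorted(list(old_keys) + list(new_keys))
    return len({pos // entries_per_block
                for pos, key in enumerate(merged) if key in new})


def chronological_case(n_old, n_new):
    """Keys ascend with time, so new entries are appended at the end."""
    return blocks_with_new_entries(range(n_old), range(n_old, n_old + n_new))


def non_chronological_case(n_old, n_new, seed=1):
    """Keys are unrelated to time, so new entries land anywhere in the index."""
    keys = random.Random(seed).sample(range(10**9), n_old + n_new)
    return blocks_with_new_entries(keys[:n_old], keys[n_old:])


if __name__ == "__main__":
    for n_old in (10_000, 100_000):
        print(n_old, chronological_case(n_old, 1000),
              non_chronological_case(n_old, 1000))
```

In the chronological case the 1000 new entries always occupy the same small number of blocks, whereas in the non-chronological case the number of blocks holding new data grows with the amount of old data.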
The size of the DB
Tools to check
The most important tools to check this are the SQL trace in single-user measurements and the cursor cache after go-live.
There may be several reasons for this kind of non-scalability. In the cursor cache, statements need to be checked for a large number of Bgets/execution, a large number of Rproc/execution, and also for a worse-than-average buffer hit ratio.
In a single-user trace it is necessary to check indexing and table design, with special attention given to the explicit or implicit time constraints in the WHERE clause of each statement and how they are handled in the index.
In particular, the different buffer quality for old and new data is neglected in many table and index designs, and it very often makes the decisive difference between scalable and non-scalable load.
Practice Check:
Indices of DFKKOP
Table DFKKOP (Items in contract account document) is typically the largest and most important table of FI-CA, with billions of entries in customer systems. In the standard, 6 indices are defined for this table:
None of the indices is explicitly chronologically sorted. Specifying a time as the last field of an index (AUGDT in indices ~5 and ~6) only creates a chronological order for entries with equal VTREF and BUKRS, which does not prevent the mixture of new and old data in one block. All of the indices are implicitly chronologically sorted by the use of either a document number (ascending with time) or the clearing status (open → new; closed → old).
Note: The clearing status was explicitly chosen as the second field of all indices that do not contain a document number, in order to achieve a separation between old and new data and to enhance the scalability of access to new data with status “open”. Note also: there is no chronological order among the “closed” records for indices ~1, ~4, ~5 and ~6.
Any access to the “closed” records via one of the indices ~1, ~4, ~5 or ~6 creates a non-scalable load.
dfkkop~0 dfkkop~1 dfkkop~2 dfkkop~3 dfkkop~4 dfkkop~5 dfkkop~6
MANDT MANDT MANDT MANDT MANDT MANDT MANDT
OPBEL: Number of Contract Accts Rec. & Payable Doc.
AUGST: Clearing status
AUGBL: Clearing Document or Printed Document
ABWBL: Number of the substitute FI-CA document
AUGST: Clearing status
AUGST: Clearing status
AUGST: Clearing status
OPUPW: Repetition Item in Contract Account Document
GPART: Business partner
AUGST: Clearing status
VKONT: Contract Account Number
VTREF: Reference Specifications from Contract
ABWKT: Alternative contract account for collective bills
OPUPK: Item number in contract account document
BUKRS: Company Code
WHGRP: Repetition group
BUKRS: Company Code
BUKRS: Company Code
OPUPZ: Subitem for a Partial Clearing in Document
XMANL: Exclude Item from Dunning Run
AUGDT: Clearing date
AUGDT: Clearing date
Practice Check:
Access to DFKKOP
Insert of new records into DFKKOP
When new records are inserted into DFKKOP, the status is open. Insertion points are concentrated locally for new items.
Access to open items
The use of any of the indices guarantees local access to new items only.
Clearing run
The change of the clearing status distributes the entry points in indices ~1, ~4, ~5 and ~6 equally over the complete index range, forcing access to the complete range of these indices.
Open item list for settlement day
Expected Performance impact of HANA Migration
Insert of new records into DFKKOP
Any insert is just done into the L1-delta. While the merge will be very resource intensive, the insert itself should be fast.
Access to open items
Being column based, HANA has a disadvantage in principle here, which will result in higher access times.
Clearing run
The update of the records is again just an insert into the L1-delta.
Open item list for settlement day
While this touches only recent data (as long as the report is executed a short time after the settlement day), the amount of data that has to be read is large enough that this disadvantage is offset by the efficient access to the data in the column store.
Example: DB Cursor Cache Analysis
Resource Consumption of Top 20 SQL Statements
[Figure: Importance of top 20 resource consumers. y-axis: ratio to top contributor [%]. Series: duration, disk reads, buffer reads, rows read.]
• Efficient optimization concentrates on the largest resource consumers.
• The longer and the more extensively this approach is followed, the smaller the relative importance of the top n resource consumers compared to the rest becomes. The effect of each optimization becomes smaller and less significant for overall sizing.
• SYS2 has reached a state where this approach no longer shows any significant potential for improvement. This can be seen very clearly in the shared cursor cache analysis:
• Shown are the top 20 statements with respect to the number of buffer gets from SYS2 at time 1 and time 2, together with an example of another customer system before and after optimization.
• The slopes of the SYS2 curves are so small that the top 20 are virtually meaningless for the overall resource consumption.
[Figure: Relative cost of top resource consuming statements. x-axis: top n statements sorted by buffer gets; y-axis: # buffer gets normalized to top statement. Series: SYS1 (time 1), SYS1 (time 2), Z2L Jun 2012, Z2L Dec 2012, SYS2 (time 1), SYS2 (time 2).]
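The normalization used for such a comparison can be sketched as follows. The function names are hypothetical and the input is simply a list of buffer gets per statement; the numbers in the usage example are invented for illustration and are not taken from any of the systems shown.

```python
def normalize_to_top(buffer_gets, top_n=20):
    """Rank statements by buffer gets and scale them to the top statement."""
    ranked = sorted(buffer_gets, reverse=True)[:top_n]
    if not ranked or ranked[0] == 0:
        return []
    return [value / ranked[0] for value in ranked]


def top_n_share(buffer_gets, top_n=20):
    """Fraction of all buffer gets caused by the top n statements."""
    total = sum(buffer_gets)
    top = sum(sorted(buffer_gets, reverse=True)[:top_n])
    return top / total if total else 0.0


if __name__ == "__main__":
    invented = [900, 450, 300, 120, 80, 60, 40, 30, 10, 10]
    print(normalize_to_top(invented, top_n=5))
    print(top_n_share(invented, top_n=5))
```

A steep normalized curve indicates that a few statements dominate and are worth optimizing; a flat curve, as described for SYS2, indicates that the load is spread over many statements.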