This gem describes the architecture and implementation of a profiling module for low-overhead, real-time analysis that supports performance counter organization so that many consumers can work together in harmony. It is designed from a game engine perspective, with many of its requirements specifically pertaining to things typically found in games. At the time of this writing, the described module is in use by a commercially available game engine.
Profiling the performance of a game or engine is one of those things that everyone agrees is important, but just as often as not guesswork or quick hacks are substituted for a real game system that can gather solid data. In the long run, the time it takes to implement a clean profiling system is a wise investment. And, as with everything else, the earlier we plan for it, the easier it will be.
Profiling Basics
The basic profiling mechanism is simple: take a timestamp at the beginning of the code of interest and again at the end. Subtract the first from the second, and voila, that's how long the code took to run. We need a high-resolution counter - the Windows multimedia timer and its millisecond resolution will not cut it. If the platform is Windows on a PC, there are two high-resolution API calls we can use: QueryPerformanceCounter and QueryPerformanceFrequency. However, because the overhead of these functions is fairly high, we will roll our own, which only requires a few lines of inline assembly:
void CWin32PerfCounterMgr::GetPerfCounter(LARGE_INTEGER& iCounter)
{
    DWORD dwLow, dwHigh;
    // rdtsc returns the 64-bit time-stamp counter in EDX:EAX.
    __asm {
        rdtsc
        mov dwLow, eax
        mov dwHigh, edx
    }
    iCounter.QuadPart = ((unsigned __int64)dwHigh << 32)
                      | (unsigned __int64)dwLow;
}
To convert this number into seconds, we need to know the counter frequency. In this case it is equal to the CPU cycles per second. We can measure it once when the counters are enabled — take a time sample, sleep for at least 500ms, and then take another sample. Note that similar counters are available if the target platform is a game console.
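The calibration step above can be sketched as follows. This is a minimal illustration rather than the module's actual code: EstimateTicksPerSecond and ReadSteadyClockTicks are hypothetical names, and the tick source is passed in as a callable so that the same routine could wrap rdtsc on a PC or a console's counter. The sketch divides by the wall-clock time that actually elapsed, on the assumption that the sleep may overshoot the requested interval.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: estimates how many ticks a raw counter advances per
// second. ReadTicks is any callable that returns the current counter value.
template <typename ReadTicks>
double EstimateTicksPerSecond(ReadTicks readTicks,
                              std::chrono::milliseconds sleepTime =
                                  std::chrono::milliseconds(500)) {
    using Clock = std::chrono::steady_clock;

    const Clock::time_point wallStart = Clock::now();
    const std::uint64_t tickStart = readTicks();

    std::this_thread::sleep_for(sleepTime);

    const std::uint64_t tickEnd = readTicks();
    const Clock::time_point wallEnd = Clock::now();

    // Divide by the time that really elapsed, not the requested sleep time,
    // because sleep_for is allowed to sleep longer than asked.
    const std::chrono::duration<double> elapsed = wallEnd - wallStart;
    return static_cast<double>(tickEnd - tickStart) / elapsed.count();
}

// Stand-in tick source (an assumption for this sketch): nanoseconds read
// from steady_clock. A real build would read the CPU cycle counter instead.
inline std::uint64_t ReadSteadyClockTicks() {
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch()).count());
}
```

Dividing a tick difference by the returned frequency then yields seconds. The measurement only needs to happen once, when the counters are enabled.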
Commercially Available Tools
Performance tuning is definitely a case where choosing the right tool for the job can make all the difference. There are many time-tested commercial tools available for the PC that sample an application as it runs, then offline allow profile data to be viewed module-by-module, function-by-function, and just about any other imaginable way.
Intel® VTune™ and Metrowerks® Analysis Tools both make use of the built-in CPU hardware counters to generate post-processed profiles of runtime sections of a game. Tuning assembly code by instruction ordering or pairing prediction is definitely a strength of VTune™.
The Intel® Graphics Performance Toolkit (GPT) provides some powerful scene analysis tools. It hooks in and snoops traffic at the layer between your application and Direct3D/OpenGL. Knowing exactly what is being drawn can at times be very helpful. Changing the order or the way in which the game renders can sometimes significantly affect performance. However, the GPT is written to a specific version of DirectX, so its releases usually trail that of DirectX. Also, taking any significant scene data will slow down the application, so relying on the performance characteristics of data taken when using the GPT can be dangerous.
Statistics-gathering drivers for graphics cards and hardware counters can be invaluable. Nvidia releases special drivers and a real-time data viewing application that hooks all of the function entry points of the drivers. If the graphics driver is taking a significant percentage of CPU time, this application will allow us to look inside and break it down further. Intel® provides counters in its drivers and hardware for its i740 chip, allowing optimization for stalls all the way down to the graphics chip level.
Some of the game consoles also provide this ability. It can be very useful, as it is the only way to break down performance at this low level. It does, however, require a fair amount of knowledge about how the drivers and chips operate, and what the counters really mean.
Why Roll Our Own?
Reason one: frame-based analysis. Games typically have fairly high frame-to-frame coherency, but conditions can change drastically in a matter of seconds. Imagine a 3D shooter: a player starts facing a wall, runs down a long corridor, then ends it all in a bloody firefight with five AI-driven enemies. The game engine is running through many potentially different bottlenecks that can only really be identified with a frame-by-frame analysis. Looking at a breakdown of an accumulated sample over the entire interval gives an inaccurate view of what is really going on. Frame-based analysis allows focusing on one problem at a time.
Reason two: it can be done anytime and anywhere. At the end of a PC game development cycle, someone will probably be faced with performance problems that only manifest themselves on someone's brother's machine, on odd Tuesdays. There are typically a significant number of these types of problems. They can cost a lot of time and can very easily slip the release date. Although this type of problem is unique to PC games, console games still have to deal with the "shooting a missile in the corner of level three grinds the game to a slow crawl" types of problems. Once the problem is understood, figuring out the solution is usually the easy part. If we could walk over to that test machine and pop up a few counter groups, we would quickly nail down the culprit.
Reason three: customizability. Modern game engines are complicated. The ability to ignore every module in the engine except the one being worked on is powerful. In addition, the only person who can organize the data exactly how they want it is the engineer actually doing the work.
Profile Module Requirements
Requirement one: allow users to quickly and accurately profile the application.
Requirement two: be non-obtrusive (that is, have very low overhead). When the cost for taking samples and displaying the results becomes a significant portion of the frame time, it can actually change the application's behavior within the system. In general, slowing down the CPU will tend to hide stalls caused by graphics cards. While even a very small percentage can in some rare cases drastically change game performance, as a general rule, when the profiler is enabled, it should take less than five percent of the total CPU cycles. When disabled, it should be much less than one percent.
Requirement three: allow multiple users to work independently on their respective systems without having to worry about other engine modules.
Requirement four: when it's not needed, it should be well out of the way.
Architecture and Implementation
A performance counter manager (IPerfCounterMan) keeps track of all active and inactive counters. The counters are organized into groups of similar type (for example, model render, world render, AI, physics) that are enabled and disabled together. This supports the notion of multiple groups working independently in an easy to understand grouping concept. Groups are useful for two reasons: for quickly determining if a counter needs to be sampled, and for enabling and disabling groups of counters to be displayed. We will make use of four-character codes (FourCCs) for the group ID and full text strings for counter names.
The entire system is organized into a module with an interface to the rest of the system. The basic component is a counter that is identified by a group ID (its FourCC) and its string name. Each counter is given an integer ID on creation that uniquely identifies it. In typical usage, the game code creates counters on initialization and puts start/stop counter calls around the code to be profiled.
The basic functional unit interface for the module is as follows:
class IPerfCounterMan {
public:
    // Add a new counter (returns the ID; 0 is failure).
    int32 AddCounter(uint32 CounterGroup, const char* szCounterName);

    // Forget your counter's ID? (0 is failure.)
    int32 GetCounterID(uint32 CounterGroup, const char* szCounterName);

    // Delete the counter.
    bool DeleteCounter(uint32 CounterID);

    // Start and stop a counter.
    void StartCounter(uint32 CounterID);
    void StopCounter(uint32 CounterID);

    // Draw the counters onto the screen (to be called once
    // per frame near the end of the scene).
    void DrawCounters();
};
StopCounter calculates the elapsed time between the matching StartCounter and StopCounter calls and adds it to a running total. In DrawCounters, all the running totals are cleared. A maximum value is also maintained and is set at the end of the frame in DrawCounters. Let's assume that our engine has a debug console that accepts text commands. It is a very convenient way to enable and disable counter groups and to allow customization of the display.
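To make the bookkeeping concrete, here is a minimal sketch of the running-total and maximum logic. It is an illustration under stated assumptions, not the shipped implementation: SimplePerfCounterMan, EndFrame, FrameTotal, and MaxTotal are hypothetical names, EndFrame stands in for the non-drawing part of DrawCounters, group enabling and counter deletion are omitted, and the tick source is injected so the logic does not depend on any particular platform counter.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal manager showing only the accumulation logic.
class SimplePerfCounterMan {
public:
    using TickFn = std::function<std::uint64_t()>;

    explicit SimplePerfCounterMan(TickFn readTicks)
        : m_readTicks(std::move(readTicks)) {}

    // Returns the new counter's ID; 0 is reserved for failure.
    std::int32_t AddCounter(std::uint32_t group, const char* name) {
        if (name == nullptr) return 0;
        m_counters.push_back(Counter{group, name, 0, 0, 0});
        return static_cast<std::int32_t>(m_counters.size());  // IDs start at 1
    }

    void StartCounter(std::uint32_t id) {
        if (Counter* c = Find(id)) c->startTick = m_readTicks();
    }

    // Adds the ticks elapsed since StartCounter to this frame's running total.
    void StopCounter(std::uint32_t id) {
        if (Counter* c = Find(id)) c->frameTotal += m_readTicks() - c->startTick;
    }

    // Stand-in for the end of DrawCounters: record maxima, clear the totals.
    void EndFrame() {
        for (Counter& c : m_counters) {
            c.maxTotal = std::max(c.maxTotal, c.frameTotal);
            c.frameTotal = 0;
        }
    }

    std::uint64_t FrameTotal(std::uint32_t id) const {
        return IsValid(id) ? m_counters[id - 1].frameTotal : 0;
    }
    std::uint64_t MaxTotal(std::uint32_t id) const {
        return IsValid(id) ? m_counters[id - 1].maxTotal : 0;
    }

private:
    struct Counter {
        std::uint32_t group;       // FourCC of the owning group
        std::string name;
        std::uint64_t startTick;   // timestamp taken by StartCounter
        std::uint64_t frameTotal;  // ticks accumulated this frame
        std::uint64_t maxTotal;    // largest per-frame total seen so far
    };

    bool IsValid(std::uint32_t id) const {
        return id >= 1 && id <= m_counters.size();
    }
    Counter* Find(std::uint32_t id) {
        return IsValid(id) ? &m_counters[id - 1] : nullptr;
    }

    TickFn m_readTicks;
    std::vector<Counter> m_counters;
};
```

Because the totals accumulate across calls, a counter that is started and stopped several times in one frame reports the sum of all its intervals for that frame.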
It is very helpful to allow as much configuration in the counter display as possible. We will most likely not want to refresh the counter display every frame (updates every 30 frames should be sufficient), but depending on what is being debugged, the ability to customize the refresh time can be very handy. In addition, displaying both the current percentage and the maximum percentage since last displayed is useful.
A bar graph is a good way to display the result. It gives the consumer a quick feel for the numbers and isn't hard to code. The ability to switch from percentage to actual time (in milliseconds), display the time or percentage as text values, and auto-scale the axes is also very useful. Be careful about switching the axis scale very often, especially without some kind of warning, because it will likely just confuse people.
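Switching the display between percentage and milliseconds comes down to two small conversions. The sketch below assumes the counter frequency has already been measured and that the total ticks for the frame are known; TicksToMilliseconds and TicksToFramePercent are hypothetical helper names.

```cpp
#include <cstdint>

// Converts a tick count to milliseconds, given the measured counter frequency.
inline double TicksToMilliseconds(std::uint64_t ticks, double ticksPerSecond) {
    return static_cast<double>(ticks) * 1000.0 / ticksPerSecond;
}

// Expresses a counter's ticks as a percentage of the whole frame's ticks.
// Returns 0 for an empty frame to avoid dividing by zero.
inline double TicksToFramePercent(std::uint64_t counterTicks,
                                  std::uint64_t frameTicks) {
    if (frameTicks == 0) return 0.0;
    return 100.0 * static_cast<double>(counterTicks)
                 / static_cast<double>(frameTicks);
}
```

The same two functions can feed both the bar lengths and the optional text values, so the graph and the numbers cannot drift apart.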
Implementation Details
The interface to the performance counter manager should be flexible and easy to use.
Consumers of the profile manager will often find it easier to simply call AddCounter(...) with the full string, get the ID, and start it up all at once instead of saving the counter ID at some one-time initialization point. Providing this mechanism can help out when doing some quick profiling. However, it's not as efficient, and calling it many times in a frame will add up quickly. Also, supplying a class that can be placed at the beginning of a function that calls StartCounter in the constructor and StopCounter in the destructor (when it goes out of scope) can be a handy way to instrument the counters.
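A scope-based helper of this kind might look like the following sketch. ScopedPerfCounter is a hypothetical name, and it is written as a template over the manager type only so the example stands alone; CallLogManager is a throwaway stand-in, not part of the described module.

```cpp
#include <cstdint>

// Hypothetical RAII helper: starts the counter on construction and stops it
// when the object goes out of scope. Manager is any type that provides
// StartCounter(id) and StopCounter(id).
template <typename Manager>
class ScopedPerfCounter {
public:
    ScopedPerfCounter(Manager& manager, std::uint32_t counterID)
        : m_manager(manager), m_counterID(counterID) {
        m_manager.StartCounter(m_counterID);
    }
    ~ScopedPerfCounter() { m_manager.StopCounter(m_counterID); }

    // Copying would stop the same counter twice, so it is disabled.
    ScopedPerfCounter(const ScopedPerfCounter&) = delete;
    ScopedPerfCounter& operator=(const ScopedPerfCounter&) = delete;

private:
    Manager& m_manager;
    std::uint32_t m_counterID;
};

// Stand-in manager that only counts calls, used to keep the sketch
// self-contained.
struct CallLogManager {
    int starts = 0;
    int stops = 0;
    void StartCounter(std::uint32_t) { ++starts; }
    void StopCounter(std::uint32_t) { ++stops; }
};
```

Placing one such object at the top of a function covers every return path, including early returns, without a matching StopCounter call at each exit.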
When writing the profiling manager, it's best to provide some kind of #define macro that completely removes the profiler. When it comes down to getting peak performance out of a game, profiling code is often one of the first things to go. We need to provide macros for AddCounter, StartCounter, and StopCounter that compile out completely under an #ifdef.
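One possible shape for such macros is sketched below. The switch name PERF_COUNTERS_ENABLED and the PERF_* macro names are assumptions made for this example; the manager is passed explicitly only to keep the sketch independent of how the engine exposes its profiler.

```cpp
#include <cstdint>

// Hypothetical build switch: set PERF_COUNTERS_ENABLED to 0 to strip the
// profiler from the build. It defaults to enabled here.
#ifndef PERF_COUNTERS_ENABLED
#define PERF_COUNTERS_ENABLED 1
#endif

#if PERF_COUNTERS_ENABLED
  // Enabled: forward straight to the manager.
  #define PERF_ADD_COUNTER(man, group, name) ((man).AddCounter((group), (name)))
  #define PERF_START_COUNTER(man, id)        ((man).StartCounter(id))
  #define PERF_STOP_COUNTER(man, id)         ((man).StopCounter(id))
#else
  // Disabled: the arguments are never evaluated, so no profiling code remains.
  #define PERF_ADD_COUNTER(man, group, name) (0)
  #define PERF_START_COUNTER(man, id)        ((void)0)
  #define PERF_STOP_COUNTER(man, id)         ((void)0)
#endif
```

In the disabled configuration PERF_ADD_COUNTER evaluates to 0, which matches the interface's failure ID, so code that stores the result still compiles.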
Also, it's best to use colors for visual cues. When the counters are being displayed, it's easier to read if we use different colors on each line.
Data Analysis
Be sure to profile the release build, because it can have a very different set of bottlenecks from the debug version. If the target platform is the PC, it is also a good idea to pick two or three typical system configurations (low to high end) and profile each of them. Bottlenecks can vary greatly across system configurations.
The game should be profiled in the areas that have performance problems as well as during typical game play. We must break the problem down, try to focus on one thing at a time, and focus on the areas that will give the biggest bang for the buck. Just because a function is called the most often or takes the most CPU time doesn't mean it is the only place we should focus our efforts. Often, the only thing we can compare our cycle times with is our expectations, and realistic expectations are usually gained only through experience.
The profiler itself should also be profiled. If the act of profiling is intrusive, it changes the behavior of your game. There should be a counter around the profiler's draw routines.
Implementation Notes
The described module has been implemented across multiple platforms. However, parts of it require platform-dependent functions. The actual timestamp query and the draw functions will most likely need to be implemented in platform-dependent code, so it's best to design a level of abstraction around those functions. The described implementation uses a set of debug geometry and text (which has a platform-dependent implementation) in the draw code so that it can be platform independent.
You may need to write a macro to create your four character code values, as many compilers do not have support for them.
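A FourCC helper is only a few lines. The sketch below uses a constexpr function instead of a macro, which is a substitution on our part, and packs the first character into the lowest byte, the same layout as the Windows MAKEFOURCC macro. MakeFourCC is a hypothetical name.

```cpp
#include <cstdint>

// Packs four characters into one 32-bit group ID, first character in the
// lowest byte. The unsigned char casts avoid sign extension on platforms
// where plain char is signed.
constexpr std::uint32_t MakeFourCC(char a, char b, char c, char d) {
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(a))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(b)) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c)) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(d)) << 24);
}
```

A group ID for the physics counters could then be written as MakeFourCC('P', 'H', 'Y', 'S') and evaluated at compile time.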
This same system can be used to take long running profiles of a game server to detect problems. All the counters go through one source, so data can easily be filtered down and saved to disk.