5.2.3 Dataflow Analysis for Scaling

If you understand the individual components of a typical transaction in a service, you can scale the service with much greater precision and efficiency. Strata’s experiences building scalable Internet services for ISPs and ASPs led her to create a dataflow model for individual transactions and combine them into spreadsheets to get an overall dataflow picture, which sounds more complicated than it really is.

A dataflow model is simply a list of transactions and their dependencies, with as much information as can be acquired about resource usages for each transaction. That information might include the amount of memory used on the server hosting that transaction, the size and number of packets used in a transaction, the number of open sockets used to service a transaction, and so on.
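As a rough illustration, such a model can be kept as a list of records, one per transaction, much like rows in a spreadsheet. The sketch below is only one possible layout: the field names and all the figures are made up for the example, not taken from any real service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    """One row of a dataflow model. All field names are illustrative."""
    name: str
    server_memory_kb: int = 0      # memory used on the host serving it
    packets: int = 0               # number of packets exchanged
    avg_packet_bytes: int = 0      # typical packet size
    open_sockets: int = 0          # sockets held open while it runs
    depends_on: List[str] = field(default_factory=list)

    @property
    def wire_bytes(self) -> int:
        """Approximate bytes on the network for this transaction alone."""
        return self.packets * self.avg_packet_bytes


# Made-up figures; real ones come from measurement or vendor documentation.
model = [
    Transaction("dns_lookup", server_memory_kb=4, packets=2,
                avg_packet_bytes=120, open_sockets=1),
    Transaction("imap_login", server_memory_kb=300, packets=12,
                avg_packet_bytes=200, open_sockets=1,
                depends_on=["dns_lookup"]),
]

for t in model:
    print(f"{t.name}: {t.wire_bytes} bytes, depends on {t.depends_on}")
```

Recording the dependencies next to the resource figures is what later lets you trace a slow service transaction back to the infrastructure transaction behind it.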

In modeling an individual service transaction with dataflow, all the pieces of the transaction necessary to make it happen are included, even such pieces as Internet name lookups via DNS, in order to get a true picture of the transaction. Even things technically outside your control, such as the behavior of the root-level name servers in DNS, can affect what you’re trying to model. If a transaction bottleneck occurred in the name-lookup phase, for instance, you could internally run a caching name server, thus saving some time doing external lookups. Sites that keep and analyze web service logs or other external access logs routinely do this, as it speeds up logging. For even faster logging, sites may simply record the external host by IP address and do the name lookups in a postprocessing phase for analysis.
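The postprocessing approach might look something like the following sketch. It assumes a hypothetical log format in which each line begins with the client IP address followed by a space, and it caches results in memory so that each address is looked up only once per run.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=None)
def reverse_lookup(ip: str) -> str:
    """Return the host name for ip, or the ip itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return ip


def resolve_log_lines(lines, resolver=reverse_lookup):
    """Replace the leading IP address of each log line with a host name.

    Assumes (hypothetically) that the IP is the first space-separated field.
    """
    for line in lines:
        ip, _, rest = line.partition(" ")
        yield f"{resolver(ip)} {rest}"
```

Because the resolver is passed in as a parameter, the same function can be exercised offline with a stand-in such as `lambda ip: "host.example"` before it is pointed at real logs.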

A nice thing about a service is that it is generally transaction based. Even file sharing consists of multiple transactions as blocks are read and written across the network. The key part of dataflow modeling to remember is that service transactions almost always depend on infrastructure transactions. It’s fairly common to investigate a scaling problem with a service and discover that the service itself has a bottleneck somewhere in the infrastructure.

Once the dataflow model accurately depicts the service, you can address performance and scaling problems by finding the weakest point in the model: monitor each piece under real or simulated conditions, and see how it behaves or how it fails. For example, if your database can handle up to 100 queries per second (QPS) and you know that every access to your web site’s home page requires three database queries, you can predict that the web site will work only if there are no more than 33 hits per second. However, you also now know that if you can improve the performance of the database to 200 QPS, possibly by replicating it on a second server and dividing the queries between the two, the web site can handle twice as many hits per second, assuming that no other bottleneck is involved.
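The arithmetic behind that prediction can be written down as a small sketch. The database figures come from the example above; the `web_server` entry and its capacity are hypothetical, added only to show that the smallest limit is the one that matters.

```python
def find_bottleneck(capacity_per_second, cost_per_hit):
    """Return (component, max hits/sec) for the most limiting component.

    capacity_per_second: operations each component can handle per second
    cost_per_hit:        operations one page hit costs on each component
    """
    limits = {
        name: capacity_per_second[name] // cost_per_hit[name]
        for name in cost_per_hit
    }
    bottleneck = min(limits, key=limits.get)
    return bottleneck, limits[bottleneck]


cost = {"database": 3, "web_server": 1}  # three queries per home-page hit

# Database at 100 QPS; web server capacity is a made-up figure.
print(find_bottleneck({"database": 100, "web_server": 500}, cost))

# Database doubled to 200 QPS, for example by adding a replica.
print(find_bottleneck({"database": 200, "web_server": 500}, cost))
```

With these inputs the database stays the bottleneck in both cases, at 33 and then 66 hits per second. If the hypothetical web server could sustain only 50 hits per second, doubling the database would stop helping at that point, which is the "no other bottleneck" caveat in practice.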

Resources on the server can also be an issue. Suppose that a server provides email access via IMAP. You might know, from direct observation or from vendor documentation, that each client connected to the server requires about 300K of RAM. Looking at the logs, you can get an idea of the usage patterns of the server: how many users are on simultaneously during which parts of the day versus the total number of server users.

Knowing how many people are using the service is only part of the process. In order to analyze resources, you also should consider whether the IMAP server process loads an index file of some type, or even the whole mailbox, into memory. If so, you need to know the average size of the data that will be loaded, which can be calculated as a strict average of all the customers’ index files; as a mean or median, based on where in the size curve most index files occur; or even by adding up only the index files used during peak usage times and doing those calculations on them. Pick what seems like the most realistic case for your application. The monitoring system can be used to validate your predictions. It might also reveal unexpected things, such as the average mailbox size growing faster than expected, which could affect index size and thus performance.
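To see how much the choice of "average" matters, consider the sketch below. The 300K per-client figure is the one from the IMAP example; the index sizes and the client count are invented, and the sketch assumes the index is loaded in addition to the per-client memory, which you would need to confirm for your own server software.

```python
from statistics import mean, median

PER_CLIENT_KB = 300  # per-connection RAM from the IMAP example


def estimate_ram_kb(concurrent_clients, index_sizes_kb, summary=median):
    """Estimate server RAM, assuming each client also loads one index file.

    summary is the function used to pick a "typical" index size.
    """
    typical_index_kb = summary(index_sizes_kb)
    return concurrent_clients * (PER_CLIENT_KB + typical_index_kb)


# Made-up index sizes in KB; one very large mailbox skews the mean.
index_sizes_kb = [50, 80, 120, 200, 1500]

print(estimate_ram_kb(1000, index_sizes_kb, summary=median))
print(estimate_ram_kb(1000, index_sizes_kb, summary=mean))
```

With these numbers the median-based estimate is 420,000 KB and the mean-based one is 690,000 KB, a large enough gap that it is worth checking against what the monitoring system actually reports.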

Finally, step back and do this kind of analysis for all the steps in the dataflow. If a customer desktop makes an internal name-lookup query to find the mail server rather than caching info on where to find it, that should be included in your dataflow analysis as load on the name server. Maybe the customer is using a webmail application, in which case the customer will be using resources on a web server, whose software in turn makes an IMAP connection to the mail server. In this case, there are probably at least two name lookups per transaction, since the customer desktop will look up the webmail server, and the webmail server will look up the IMAP server. If the webmail server does local authentication and passes credentials to the IMAP server, there would be an additional name lookup, to the directory server, then a directory transaction.
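One way to keep track of these nested steps is to write down what each transaction triggers directly and then total the infrastructure load by walking the chain. The sketch below encodes the webmail case with local authentication; the transaction names are hypothetical labels, not terms from any product.

```python
from collections import Counter

# What each transaction directly triggers (names are illustrative).
DATAFLOW = {
    # The desktop looks up the webmail server, then talks to it.
    "read_mail_via_webmail": ["dns_lookup", "webmail_request"],
    # The webmail server looks up and queries the directory server,
    # then looks up and connects to the IMAP server.
    "webmail_request": ["dns_lookup", "directory_query",
                        "dns_lookup", "imap_session"],
    "dns_lookup": [],
    "directory_query": [],
    "imap_session": [],
}


def load_per_transaction(dataflow, name):
    """Count every transaction triggered, directly or indirectly, by name."""
    counts = Counter()
    for step in dataflow[name]:
        counts[step] += 1
        counts += load_per_transaction(dataflow, step)
    return counts


totals = load_per_transaction(DATAFLOW, "read_mail_via_webmail")
print(totals["dns_lookup"], totals["directory_query"], totals["imap_session"])
```

Here a single customer action costs three name lookups, one directory transaction, and one IMAP session. Multiplying those counts by the expected number of customer actions per second gives the load to plan for on each infrastructure server.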

Dataflow modeling works at all levels of scale. You can successfully design a server upgrade for a 30-person department or a multimedia services cluster for 3 million simultaneous users. It might take some traffic analysis on a sample setup, as well as vendor information, system traces, and so on, to get exact figures of the type you’d want for the huge-scale planning.

An Example of Dataflow Analysis

Strata once managed a large number of desktops accessing a set of file servers on the network. A complaint about files being slow to open was investigated, but the network was not clogged, nor were there unusual numbers of retransmits or lags in the file server statistics on the hosts serving files. Further investigation revealed that all the desktops were using the same directory server to get the server-to-file mapping when opening a file and that the directory server itself was overloaded. No one had realized that although the directory server could easily handle the number of users whose desktops mapped to it, each user was generating dozens, if not hundreds, of file-open requests to compile large jobs. When the requests per user figures were calculated and the number of simultaneous users estimated, it was then easy to see that an additional directory server was required for good performance.
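The calculation in this example comes down to multiplying requests per user by simultaneous users and comparing the result with what one directory server can handle. The sketch below uses invented figures, since the original numbers are not given.

```python
import math


def directory_servers_needed(simultaneous_users,
                             requests_per_user_per_sec,
                             server_capacity_per_sec):
    """Servers required to carry the total request rate (at least one)."""
    total = simultaneous_users * requests_per_user_per_sec
    return max(1, math.ceil(total / server_capacity_per_sec))


# All figures are hypothetical.
# Sizing by user count alone: one lookup per user per second looks fine.
print(directory_servers_needed(200, 1, 1000))

# Sizing by actual behavior: compile jobs open many files per second.
print(directory_servers_needed(200, 8, 1000))
```

With these numbers the first estimate says one server is enough, while the second shows that two are needed, the same kind of gap that went unnoticed until the per-user request rate was measured.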

5.3 Conclusion

Designing and building a service is a key part of every SA’s job. How well the SA performs this part of the job determines how easy each service is to support and maintain, how reliable it is, how well it performs, how well it meets customer requirements, and ultimately how happy the customers will be with the performance of the SA team.

You build services to provide better service to your customers, either directly by providing a service they need or indirectly by making the SA team more effective. Always keep the customers’ requirements in mind. They are ultimately the reason you are building the service.

An SA can do a lot of things to build a better service, such as building it on dedicated servers, simplifying management, monitoring the servers and the service, following company standards, and centralizing the service onto a few machines. Some ways to build a better service involve looking beyond the initial requirements into the future of upgrade and maintenance projects. Making the service as independent as possible of the machines it runs on is one key way of keeping it easier to maintain and upgrade.

Services should be as reliable as the customer requirements specify. Over time, in larger companies, you should be able to make more services fully redundant so that any one component can fail and be replaced without bringing the service down. Prioritize the order in which you make services fully redundant, based on the return on investment for your customers. You will have a better idea of which systems are the most critical only after gaining experience with them.

Rolling out the service smoothly with minimal disruption to the customers is the final, but possibly most visible, part of building a new service. Customers are likely to form their opinion of the new service on the basis of the rollout process, so it is important to do that well.

Exercises

1. List all the services that you can think of in your environment. What hardware and software make up each one? List their dependencies.

2. Select a service that you are designing or can predict needing to design in the future. What will you need to do to make it meet the recommendations in this chapter? How will you roll out the service to customers?

3. What services rely on machines that do not live in the machine room? How can you remove those dependencies?

4. What services do you monitor? How would you expand your monitoring to be more service based rather than simply machine based? Does your monitoring system open trouble tickets or page people as appropriate? If not, how difficult would it be to add that functionality?

5. Do you have a machine that has multiple services running on it? If so, how would you go about splitting it up so that each service runs on dedicated machines? What would the impact on your customers be during that process? Would this help or hurt service?

6. How do you do capacity planning? Is it satisfactory, or can you think of ways to improve it?

7. What services do you have that have full redundancy? How is that redundancy provided? Are there other services that you should add redundancy to?

8. Reread the discussion of bandwidth versus latency (Section 5.1.2). What would the mathematical formula look like for the two proposed solutions: batched requests and windowed requests?

In document The Practice of System and Network Administration 2nd Edition pdf (Page 165-170)