Chapter 2: Background and Technology Trends
2.2 Changes Since Then
2.2.6 Database Algorithms
The final objection of Boral and DeWitt was that the simple hardware implementa- tions in the database machines were not sufficient to support complex database operations such as sorts and joins. As part of the work on database machines after the “time has passed” paper, including work by Boral and DeWitt and their colleagues, many different solutions to this have been proposed and explored. The concept of a shared nothing data- base “machine” is now a well-established concept within the database community, and much work has gone into developing the parallel algorithms that make this feasible. A sur- vey paper by DeWitt and Gray [DeWitt92] discusses the trends and success of shared nothing systems, including commercial successes from Tandem and Teradata, among oth- ers. The authors point to the same objection from [Boral83] and provide compelling evi- dence that parallel database algorithms have been a success.
Chapter 4 will explore the operations in a relational database system in detail, and illustrate how they can be mapped onto an Active Disk architecture. The basic point is that there are now known algorithms that can operate efficiently in this type of architecture. In addition, many of the data-intensive applications that are becoming popular today rely much more heavily on “simple” scans and require much more brute-force searching of data, because the patterns and relationships in the data are not as well understood as they are in the more traditional relational database systems on transaction processing work- loads. Finding patterns in image databases, for example, is a much different task than in structured customer records. The next sections will discuss this change in applications and motivate the benefits to a system that can support more efficient parallel scans than today’s systems.
2.2.7 Interconnects
There are several issues raised in the analysis of the database machines that were not specifically addressed in most of the existing work, and are still issues today. The most important one is contention for network bandwidth to the front-end host. This bottleneck persists today and is a primary reason why Active Disk systems, with processing at the “edges” of the network can be successful. Many types of data processing queries reduce a large amount of “raw” data into a much smaller amount of “summary” data that answers a
particular user question and can be easily understood by the human end-user. Queries such as: How much tea did we sell to outlets in China? How much revenue would we lose if we
stopped marketing refrigerators in Alaska? How much money does this customer owe us for goods we sent him more than two months ago? The use of indices can speed up the
search for a particular item of data, but cannot reduce the amount of data that must be returned to the user in answer to a particular query. This is what led to the contention for the output channel in the early processor-per-track database machines, and it is also the bottleneck in many large storage systems today.
Instead of being limited by the bandwidth of reading data from the disk media, mod- ern systems often have limited peripheral interconnect bandwidth, as seen in the system bus column of Table 2-6. We see that many more MB/s can be read into the memory of a large collection of disks than can be delivered to a host processor.
The interconnection “networks” used between storage devices and hosts and those used among hosts have long had somewhat different characteristics. A technology survey paper by Randy Katz [Katz92] breaks the technology into three distinct areas: backplanes, channels, and networks. Where backplanes are short (about 1 m), with bandwidth over 100 MB/s, sub microsecond latencies, and highly reliable; channels are longer (small tens of meters), support up to 100 MB/s bandwidth, have latencies under 100 microseconds, and medium reliability; and networks span kilometers, sustain 1 to 15 MB/s, and have latencies in milliseconds, with the medium considered unreliable, requiring the use of expensive protocols above to ensure reliable messaging. These distinctions are no longer as true as it was at the time of this survey. The need to connect larger numbers of devices and larger numbers of hosts over larger distances, has led to the development of the much more “network-like” Fibre Channel for storage interconnects. The growing popularity of Fibre Channel for storage devices and packet-switched networks for local- and wide-area networks has clouded the boundaries of peripheral-to-host and host-to-host interconnects. Since both Fibre Channel and Fast or Gigabit Ethernet, the storage and networking inter- connect technologies of choice respectively are based on packets, switches, and run over the same fiber optic infrastructure. Why continue to artificially separate the two systems?
This is the contention of previous work at Carnegie Mellon on Network-Attached Secure Disks (NASD) [Gibson97, Gibson98] and is coming close to reality by the intro-
System System Bus Storage
Throughput Mismatch Factor Compaq ProLiant TPC-C 133 MB/s 1,410 MB/s 10.6 x Microsoft TerraServer 532 MB/s 3,240 MB/s 6.1 x Digital AlphaServer TPC-C 266 MB/s 610 MB/s 2.3 x Digital AlphaServer TPC-D 300 532 MB/s 5,210 MB/s 9.8 x
Table 2-6 Comparison of system and storage throughput in large server systems. If we estimate a modest 10 MB/s for current disk drives on sequential scans, we see that the aggregate storage bandwidth is more than twice the (theoretical) backplane bandwidth of the machine in almost every case.
duction of Storage Area Networks (SANs) in industry [Clariion99, Seagate98, StorageTek99]. Industry surveys estimate that 18% of storage will be in SANs by the end of 1999, reaching up to 70% within two years [IBM99].
Even though individual point-to-point bandwidths have increased greatly and latency has decreased significantly, the network connectivity in a distributed system will continue to be a bottleneck. It is simply too expensive to connect a large number of devices with a full crossbar network. This means that systems will need to take advantage of locality of reference in order to manage the inherent bottlenecks, but it also means that certain access patterns will always suffer from the bottlenecks among nodes. The cost comparison is clear when one compares the cost of a hierarchical switched system against a full crossbar system as the number of nodes in a system is increased. For a small number of nodes, a local switched fabric is quite effective, but as soon as the number of nodes exceeds the capacity of a single switch, the costs of maintaining a full crossbar increase rapidly. Switches must be deployed in a way that requires most of their ports to be dedi- cated to switch-to-switch connectivity, rather than to support end nodes. This greatly increases the cost of the system to the point where it becomes prohibitive to provide that level of connectivity in systems of more than one hundred nodes. This means that net- works must either be limited to the size of the largest crossbar switch currently available, or must attempt to take advantage of locality in some form or another and live with certain bottlenecks. The ability to move function as well as data throughout different parts of the system (e.g. from hosts to disks) provides additional leverage in most efficiently taking advantage of a particular, limited, network configuration.