Sunday, March 24, 2013

Parallel Database Systems: The Future of High Performance Database Processing

This paper was written by David J. DeWitt and Jim Gray back in 1992. It reviews the techniques used by parallel database systems and surveys the commercial and research systems that existed at the time.

History

- Mainframes dominated most database and transaction processing tasks.
- Parallel Machines were practically written off.
- Specialized database machines built on trendy special-purpose hardware were tried.
- Relational Data Model brought about a revolution. 
- Relational queries can be viewed as a series of operators applied to streams of data.
- Pipelined parallelism: the output of one operator is streamed as the input to the next, so the two run concurrently.
- Partitioned parallelism: The same operator is executed across multiple machines simultaneously but over different data partitions or chunks.

Parallelism Goals and Metrics
Speedup: The ratio of the time taken by a small system to the time taken by a larger system to perform the same task; linear speedup means an N-times larger system runs the job N times faster.
Scaleup: The ability of an N-times larger system to perform an N-times larger job in the same elapsed time as the original system. 
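As a quick illustration (mine, not the paper's), both metrics are just ratios of elapsed times; the function names below are invented:

```python
# Sketch of the two metrics as ratios of elapsed times (names invented).

def speedup(small_system_time, big_system_time):
    # Same job on both systems; linear speedup on an N-times larger
    # system means this ratio equals N.
    return small_system_time / big_system_time

def scaleup(small_system_small_job, big_system_big_job):
    # An N-times larger job on an N-times larger system; linear
    # scaleup keeps this ratio at 1.0.
    return small_system_small_job / big_system_big_job

print(speedup(100.0, 12.5))   # 8.0: linear speedup on an 8-node system
print(scaleup(100.0, 100.0))  # 1.0: linear scaleup
```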

There are 2 categories of scaleup:
(a) Transactional scaleup: the task consists of many small independent requests submitted by many clients and operating on a shared database. The N-times larger system serves N-times as many clients, submitting N-times as many requests against an N-times larger database. Typical of TP systems and timesharing systems.
(b) Batch scaleup: a single N-times larger task is run on an N-times larger system over an N-times larger database. Typical of database queries and scientific simulations.

Three barriers to linear speedup and linear scaleup are startup (the time needed to launch a parallel operation), interference (slowdown as each new process contends for shared resources), and skew (with higher parallelism the average size of each sub-task shrinks, but the variance can stay high, so the slowest sub-task sets the elapsed time).
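A back-of-the-envelope model (my own, illustrative only) shows why skew hurts: the job finishes only when the slowest sub-task does, so elapsed time is the startup cost plus the maximum sub-task time, not the average:

```python
# Illustrative model: elapsed time of a partitioned job is the startup
# cost plus the time of the SLOWEST sub-task, not the average one.

def elapsed(startup, subtask_times):
    return startup + max(subtask_times)

# Perfectly balanced: 100 units of work across 10 nodes.
print(elapsed(1.0, [10.0] * 10))         # 11.0

# Same total work, but one skewed sub-task dominates the elapsed time.
print(elapsed(1.0, [5.0] * 9 + [55.0]))  # 56.0
```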

Hardware Architecture
The challenge is to build an infinitely fast processor out of infinitely many processors of finite speed, and to build an infinitely large memory with infinite memory bandwidth from infinitely many storage units of finite speed. This ideal is unreachable in practice because every new node added to the system generates interference.

Possible multi-processor designs:

(a) shared-memory
(b) shared-disks
(c) shared-nothing

Shared-memory and shared-disk systems do not scale well on database applications. Interference is a major problem for shared-memory multi-processors, so most of them adopted a shared-disk architecture to leverage affinity scheduling. Even so, for shared database applications the shared-disk approach is more expensive than shared-nothing: the cost of the messages needed to reserve data and to exchange large numbers of data pages far exceeds the cost of exchanging small, high-level logical questions and answers among clients and servers.
The main advantage of shared-nothing multi-processors is that they can be scaled up to hundreds and probably thousands of processors that do not interfere with one another. They are known to deliver near-linear speedups and scaleups on complex relational queries and on online transaction processing workloads.

Parallel Dataflow Approach to SQL Software
Relational queries can be executed as a dataflow graph. This graph can use both pipelined parallelism and partitioned parallelism. Pipelined parallelism provides only limited speedup, for three reasons: relational pipelines are rarely very long; some operators (such as sort and aggregate) cannot emit any output until they have consumed all their input, so they cannot be pipelined; and the execution cost of one operator is often much greater than that of the others, so it becomes the bottleneck. Partitioning the data and then executing in parallel over the partitions provides good speedup and scaleup (the divide-and-conquer paradigm).
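As a toy illustration (mine, not the paper's), Python generators mimic a pipelined dataflow: each tuple flows from one operator to the next as soon as it is produced, while a blocking operator such as sort must first drain its entire input:

```python
# Toy pipelined dataflow: scan -> select stream tuples one at a time,
# but sort is a blocking operator: it consumes ALL of its input before
# emitting anything, which breaks the pipeline.

def scan(relation):
    for t in relation:
        yield t

def select(predicate, stream):
    for t in stream:
        if predicate(t):
            yield t

def sort(key, stream):
    return iter(sorted(stream, key=key))  # blocks: drains the whole stream

emp = [("alice", 30), ("bob", 25), ("carol", 35)]
pipeline = sort(lambda t: t[1], select(lambda t: t[1] > 26, scan(emp)))
print(list(pipeline))  # [('alice', 30), ('carol', 35)]
```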



Data partitioning techniques:
- Range partitioning: good for sequential and associative access and for range queries; can lead to data skew and execution skew.
- Round-robin: good for complete scans; bad for point queries.
- Hashing: good for sequential and associative access.

These partitioning techniques improve response time up to the point where the cost of starting a query on a node becomes a significant fraction of the actual execution time.
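A minimal sketch of the three strategies (my own; the names and boundary values are made up), each mapping a tuple to one of n nodes:

```python
# Sketch of the three declustering strategies (names/values invented).

def round_robin(i, n):
    # i = position of the tuple in insertion order. Spreads load
    # evenly, but a point query must probe every node.
    return i % n

def hash_partition(key, n):
    # A point query on the key goes to exactly one node.
    return hash(key) % n

def range_partition(key, boundaries):
    # boundaries like ["h", "p"] carve keys into clustered ranges:
    # great for range queries, but hot ranges cause data and
    # execution skew.
    for node, upper in enumerate(boundaries):
        if key <= upper:
            return node
    return len(boundaries)

print(round_robin(7, 4))                      # 3
print(hash_partition("gray", 4))              # some node in 0..3
print(range_partition("dewitt", ["h", "p"]))  # 0: "dewitt" <= "h"
```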

Parallelism Within Relational Operators: Use the existing relational operator routines over the data partitions. This way we avoid developing new "parallel" versions of the operators, though a few routines are still needed for parallelization. The first basic parallelizing operator is a merge, which combines several parallel data streams into a single sequential stream. A split operator is used to partition or replicate the stream of tuples produced by a relational operator.
The paper gives an example query that depicts how split and merge are used along with the normal operators. Since the operations are performed on different processors, there is little interference. The split and merge operators have flow control and buffering to prevent one operator from getting too far ahead in the computation.
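Here is a hypothetical sketch of the two glue operators (real implementations stream tuples with flow control; this one just buffers):

```python
# Hypothetical split/merge glue. split routes each tuple to one of
# n consumer buffers via a partitioning function; merge combines
# several producer streams back into one sequential stream.
from itertools import chain

def split(stream, partition_fn, n_consumers):
    outputs = [[] for _ in range(n_consumers)]
    for t in stream:
        outputs[partition_fn(t)].append(t)
    return outputs  # one buffered stream per downstream operator

def merge(streams):
    # Real merge operators add buffering and flow control; this
    # simply concatenates the streams.
    return chain.from_iterable(streams)

tuples = [("alice", 30), ("bob", 25), ("carol", 35)]
streams = split(tuples, lambda t: hash(t[0]) % 2, 2)
print(list(merge(streams)))  # all three tuples, regrouped by partition
```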

Specialized Parallel Relational Operators: The algorithms used to implement the relational operators can be changed to take advantage of the parallelism of the target system. For example, sort-merge join works well in a parallel dataflow environment unless there is data skew: with skew, some sort partitions may be much larger than others, a problem that does not arise in a centralized execution system. The hash join is a better choice here; it breaks one big join into many little joins. Unless the bucket sizes are badly skewed, we can expect good performance.
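A minimal sketch of that idea (my own, not Gamma's or Teradata's actual code): hash both relations on the join key into n buckets, then join bucket i of R only with bucket i of S; each little join is independent and could run on its own processor:

```python
# Sketch of a partitioned hash join: tuples of R and S that share a
# join-key value land in the same bucket pair, so the n bucket-wise
# joins are independent "little joins".

def bucketize(relation, key_fn, n):
    buckets = [[] for _ in range(n)]
    for t in relation:
        buckets[hash(key_fn(t)) % n].append(t)
    return buckets

def hash_join(r, s, r_key, s_key, n=4):
    result = []
    for rb, sb in zip(bucketize(r, r_key, n), bucketize(s, s_key, n)):
        table = {}
        for t in rb:              # build a hash table on the R bucket
            table.setdefault(r_key(t), []).append(t)
        for t in sb:              # probe with the S bucket
            for match in table.get(s_key(t), []):
                result.append(match + t)
    return result

emp = [("alice", 10), ("bob", 20)]
dept = [(10, "db"), (20, "os")]
print(hash_join(emp, dept, lambda t: t[1], lambda t: t[0]))
# [('bob', 20, 20, 'os'), ('alice', 10, 10, 'db')]
```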

Existing systems at that time

Teradata: Teradata processors are functionally divided into two groups: Interface Processors (IFPs) and Access Module Processors (AMPs). The IFPs handle communication with the host, query parsing and optimization, and coordination of the AMPs during query execution. The AMPs are responsible for executing queries. Each relation is hash partitioned over a subset of the AMPs: a hash function applied to the primary key of a tuple selects the AMP that stores it. Once a tuple arrives at an AMP, a second hash function determines the tuple's placement within its fragment of the relation. Hashing is also used to split the outputs of relational operators into intermediate relations. Join operators are executed using a parallel sort-merge algorithm.
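A hypothetical sketch of this two-level placement (the salting scheme and names are invented for illustration):

```python
# Hypothetical two-level placement: one hash of the primary key picks
# the AMP; a second, independent hash picks the tuple's slot within
# that AMP's fragment of the relation.

def place(primary_key, n_amps, slots_per_amp):
    amp = hash(("amp", primary_key)) % n_amps
    slot = hash(("slot", primary_key)) % slots_per_amp
    return amp, slot

print(place("emp#123", n_amps=32, slots_per_amp=1024))
```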

Tandem NonStop SQL: Each disk is served by a set of processes managing a large shared RAM cache, a set of locks, and log records for the data on that disk pair. Considerable effort is spent on optimizing sequential scans by prefetching large units, and by filtering and manipulating tuples with SQL predicates at these disk servers. Tandem systems demonstrate near-linear scaleup on transaction processing workloads, and near-linear speedup and scaleup on large relational queries.

Gamma: Provides round-robin, range, and hash partitioning, plus a hybrid-range partitioning that combines the best features of the hash and range strategies. Gamma uses split and merge operators to execute relational algebra operators with both partitioned parallelism and pipelining. Near-linear speedup and scaleup for relational queries have been reported.

Super Database Computer: The basic unit, called a processing module (PM), consists of one or more processors on a shared memory. Data is partitioned among the PMs by hashing. 

Database Machines and Grosch’s Law

In the 1960s, Herb Grosch observed that there is an economy of scale in computing: at that time, expensive computers were much more powerful than inexpensive ones. Grosch's law no longer applies to database and transaction processing problems. By combining hundreds or thousands of small, inexpensive commodity systems, one can build an incredibly powerful database machine for much less money than the cost of a modest mainframe. For database problems, the near-linear speedup and scaleup of these shared-nothing machines allows them to outperform current shared-memory and shared-disk mainframes.

Future Directions and Research Problems

Mixing Batch and OLTP Queries
  • Large relational queries tend to acquire many locks and tend to hold them for a relatively long time. This prevents concurrent updates to the data by simple online transactions.
  • Priority scheduling: Batch jobs have a tendency to monopolize the processor, flood the memory cache, and make large demands on the I/O subsystem.
  • Priority inversion problem: A low-priority client makes a request to a high-priority server. The server must run at high priority because it is managing critical resources. Given this, the work of the low-priority client is effectively promoted to high priority when its request is serviced by the high-priority server.

Parallel Query Optimization
  • To date, no query optimizers consider all the parallel algorithms for each operator and all the query tree organizations.
  • Data skew can lead to high variance in the size of intermediate relations, leading to both poor query plan cost estimates and sub-linear speedup.

Application Program Parallelism
  • Tools are needed to structure application programs to take advantage of parallelism inherent in these parallel systems. 
  • Library packages to facilitate explicitly parallel application programs are needed.

Physical Database Design
Database design tools are needed to choose the best among the many possible indexing and partitioning combinations, using a description of the queries that make up the workload, their frequency of execution, statistical information about the relations in the database, and a description of the processors and disks.

On-line Data Reorganization and Utilities
  • A difficult problem is how to process database utility commands while the system remains operational and the data remains available for concurrent reads and writes by others. 
  • The fundamental properties of such algorithms are that they must be online (operate without making the data unavailable), incremental (operate on parts of a large database), parallel (exploit parallel processors), and recoverable (allow the operation to be canceled, returning to the old state).
