Thursday, March 28, 2013

The Gamma Database Machine Project

This is an old database project built at the University of Wisconsin.

Key highlights of Gamma:
  1. Based on a shared-nothing architecture: memory is not shared, and communication happens only through message passing. The benefits of this approach are discussed in my earlier blog entry.
  2. Use of hash-based parallel algorithms that require no centralized control and so can scale up almost indefinitely.
  3. Use of horizontal partitioning so that large relations can be processed concurrently by multiple machines without any communication overhead.
Hardware Architecture of Gamma

Gamma Version 1.0
Seventeen processors (8 with disks, the rest diskless) were connected by an 80 megabit/sec token ring network. There were several problems with this design:
  • The benefit of a larger disk page size was offset by the cost of copying tuples from a disk page into a network packet.
  • The slow network interface and the Unibus became bottlenecks, as they could not keep up with the packet traffic from the token ring.
  • Because memory was so small (just 2 MB!), it was hard to balance the space used for performing operations, the stack space for processes, and the buffer pool.
Gamma Version 2.0
They used a 32-processor iPSC/2 hypercube from Intel. Each processor had a 386 CPU, 8 MB of memory, and a 330 MB disk drive. Custom VLSI routing modules were used to connect the nodes of the hypercube. Small messages were sent as datagrams, while large messages were sent over dedicated circuits. Gamma's operating system, NOSE, was ported to run on Intel's hypercube. While porting the Gamma software to the Intel 386, a number of bugs in the original VAX-based code were uncovered.
Software Architecture of Gamma

Gamma Storage Organizations
Relations are horizontally partitioned across all disk drives so that the database software can exploit the I/O bandwidth provided by the hardware. The query language of Gamma provides the user with three alternative declustering strategies: round-robin, hashed, and range partitioned. Clustered and non-clustered indices can be built on a partitioned relation, and the query optimizer takes the partitioning information of a relation into account in its query plans. Partitioning every relation across all the nodes in the cluster is a questionable choice, since it makes it likely that every node has to participate in every query; the distribution of a relation across the nodes could be done more intelligently. A small sketch of the three declustering strategies follows.
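
To make the three declustering strategies concrete, here is a minimal sketch. The node count, range boundaries, and function names are assumptions for illustration, not values or interfaces from Gamma.

```python
# A hedged sketch of Gamma's three declustering strategies. The node count
# and range boundaries are made up for illustration.

NUM_NODES = 8
RANGE_SPLITS = [100, 200, 300, 400, 500, 600, 700]   # hypothetical split points

def round_robin_node(insert_sequence_number):
    # Tuples go to nodes in insertion order; placement says nothing about values.
    return insert_sequence_number % NUM_NODES

def hashed_node(partitioning_value):
    # A hash of the partitioning attribute picks the storing node, so an
    # equality selection on that attribute needs to visit only one node.
    return hash(partitioning_value) % NUM_NODES

def range_node(partitioning_value):
    # Each node stores one contiguous range of the partitioning attribute, so
    # range selections on that attribute touch only the overlapping nodes.
    for node, upper in enumerate(RANGE_SPLITS):
        if partitioning_value < upper:
            return node
    return len(RANGE_SPLITS)   # the last node holds the tail of the domain

# Example: a tuple with partitioning value 250 lands on node hash(250) % 8
# under hashing, and on node 2 under range partitioning; under round-robin its
# node depends only on when it was inserted.
```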

Gamma Process Structure
At system initialization time, a UNIX daemon process for the Catalog Manager (CM) is initiated along with a set of Scheduler Processes, a set of Operator Processes, the Deadlock Detection Process, and the Recovery Process.
  • Catalog Manager: the central repository of all conceptual and internal schema information for each database. It is responsible for ensuring consistency among the copies cached by each user.
  • Query Manager: responsible for caching schema information locally, providing an interface for ad-hoc queries, and performing query parsing, optimization, and compilation.
  • Scheduler Processes: control multi-site queries and are responsible for activating the Operator Processes used to execute the nodes of a compiled query tree.
  • Operator Process: For each operator in a query tree, at least one Operator Process is employed at each processor participating in the execution of the operator.

Query Execution
For the VAX version of Gamma (i.e., 1.0), the query optimizer considered only left-deep query plans because there was not enough memory to support right-deep or bushy plans. For later versions, they considered examining right-deep and bushy query plans as well.

The query optimizer recognizes that certain queries need to be executed at only a subset of the nodes in the system. The QM connects to a scheduler process, sends it the compiled query, and waits for the query to complete execution.

Query Processing Algorithms
  1. Selection Operator: If the predicate is on the partitioning attribute of the relation and the relation is hash or range partitioned, the scheduler can direct the selection operator to a subset of the nodes. If either the relation is round-robin partitioned or the selection predicate is not on the partitioning attribute, a selection operator must be initiated on all nodes over which the relation is declustered.
  2. Join Operator: The relations to be joined are partitioned into disjoint subsets (buckets) using a hash function over the join attribute, so all tuples with the same join-attribute value fall into the same bucket. Hybrid hash join is the default join algorithm in Gamma because it provides good performance.
  3. Hybrid Hash-Join: The operation is performed in three phases (a minimal sketch follows this list):
    • Step #1: Use a hash function to partition the inner (smaller) relation, R, into N buckets. The first bucket is kept in memory as a hash table while the remaining N-1 buckets are written to disk; a partitioning split table directs each tuple to its bucket.
    • Step #2: The second relation, S, is partitioned with the hash function from step 1. Tuples of S that fall into the first bucket are immediately probed against the in-memory hash table built in step 1, while the remaining N-1 buckets are written to disk.
    • Step #3: The remaining N-1 buckets of R are joined with their corresponding buckets of S.
  4. Aggregate Operations: Handled in two steps (see the sketch after this list):
    • Each processor computes a piece of the result by calculating a partial aggregate value for each partition over the data it stores locally.
    • The processors then redistribute the partial results by hashing on the "group by" attribute, so that the partial results for each partition are collected at a single site, where the final result for that partition is computed.
  5. Update Operators: Implemented using standard techniques, except when a replace operator modifies the partitioning attribute of a tuple, in which case the split table decides the site for the modified tuple.
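
To make the hybrid hash-join phases in item 3 concrete, here is a minimal single-process sketch. The bucket count, tuple representation, and function names are assumptions; Gamma actually runs these phases in parallel across operator processes, with split tables routing tuples between nodes.

```python
from collections import defaultdict

# A minimal, single-process sketch of hybrid hash join as described above.
# The bucket count and join-key accessor are illustrative choices.

def hybrid_hash_join(R, S, key, n_buckets=4):
    """Join R (the smaller, inner relation) with S on `key`; yield matched pairs."""

    def bucket_of(t):
        return hash(t[key]) % n_buckets

    # Step 1: partition R. Bucket 0 stays in memory as a hash table,
    # buckets 1..N-1 would be written to disk (modeled here as lists).
    in_memory = defaultdict(list)              # join value -> R tuples (bucket 0)
    r_spilled = [[] for _ in range(n_buckets)]
    for r in R:
        b = bucket_of(r)
        if b == 0:
            in_memory[r[key]].append(r)
        else:
            r_spilled[b].append(r)

    # Step 2: partition S with the same hash function. Bucket-0 tuples are
    # probed immediately against the in-memory table; the rest are spilled.
    s_spilled = [[] for _ in range(n_buckets)]
    for s in S:
        b = bucket_of(s)
        if b == 0:
            for r in in_memory.get(s[key], []):
                yield (r, s)
        else:
            s_spilled[b].append(s)

    # Step 3: join each remaining R bucket with the matching S bucket,
    # building an in-memory hash table on the R bucket and probing it with S.
    for b in range(1, n_buckets):
        table = defaultdict(list)
        for r in r_spilled[b]:
            table[r[key]].append(r)
        for s in s_spilled[b]:
            for r in table.get(s[key], []):
                yield (r, s)

# Example usage with tuples represented as dicts:
# R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
# S = [{"id": 1, "y": 10}, {"id": 2, "y": 20}]
# list(hybrid_hash_join(R, S, key="id"))  -> the two matching pairs
```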
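
The two-step aggregation in item 4 can be sketched similarly. The example computes a COUNT per group; the node count and data layout are hypothetical, and Gamma performs the redistribution over the network via its split tables.

```python
from collections import Counter

# A minimal sketch of two-step parallel aggregation, illustrated with a
# COUNT(*) GROUP BY. Node count and data layout are hypothetical.

NUM_NODES = 4

def local_partial_counts(local_tuples, group_by):
    """Step 1: each processor aggregates the fragment it stores locally."""
    partial = Counter()
    for t in local_tuples:
        partial[t[group_by]] += 1
    return partial

def redistribute(partials):
    """Step 2: hash on the group-by value so each group lands at one site."""
    per_site = [Counter() for _ in range(NUM_NODES)]
    for partial in partials:
        for group, count in partial.items():
            per_site[hash(group) % NUM_NODES][group] += count
    return per_site  # each site now holds complete counts for its groups

# Example: three nodes each aggregate their own fragment, then the partial
# results for each group are combined at a single site.
# fragments = [[{"dept": "a"}, {"dept": "b"}], [{"dept": "a"}], [{"dept": "b"}]]
# partials = [local_partial_counts(f, "dept") for f in fragments]
# redistribute(partials)  -> per-site Counters with the final count per dept
```
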
Transaction and Failure Management

Concurrency Control
They use two-phase locking (2PL), multi-granularity locking, and a centralized deadlock detection algorithm. Each site has its own lock manager and local deadlock detector, which sends its wait-for graph to the centralized deadlock detector. A small sketch of the centralized detection step follows.
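
Here is a minimal sketch of that centralized detection step: each site ships its local wait-for graph to a single detector, which unions them and searches for a cycle. The graph representation and names are assumptions for illustration, not Gamma's actual structures.

```python
# A hedged sketch of centralized deadlock detection: local wait-for graphs
# (transaction -> transactions it waits on) are merged and searched for a cycle.

def merge_wait_for_graphs(local_graphs):
    """Union the wait-for graphs sent by every site's lock manager."""
    global_graph = {}
    for g in local_graphs:
        for txn, waits_on in g.items():
            global_graph.setdefault(txn, set()).update(waits_on)
    return global_graph

def find_cycle(graph):
    """Return a list of transactions forming a cycle, or None (simple DFS)."""
    visiting, done = set(), set()

    def dfs(txn, path):
        visiting.add(txn)
        path.append(txn)
        for nxt in graph.get(txn, ()):
            if nxt in visiting:                     # back edge -> deadlock
                return path[path.index(nxt):]
            if nxt not in done:
                cycle = dfs(nxt, path)
                if cycle:
                    return cycle
        visiting.discard(txn)
        done.add(txn)
        path.pop()
        return None

    for txn in list(graph):
        if txn not in done:
            cycle = dfs(txn, [])
            if cycle:
                return cycle
    return None

# Example: T1 waits on T2 at site A, T2 waits on T1 at site B.
# merged = merge_wait_for_graphs([{"T1": {"T2"}}, {"T2": {"T1"}}])
# find_cycle(merged)  -> ["T1", "T2"] (a deadlock; one victim would be aborted)
```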
