Thursday, March 28, 2013

The Gamma Database Machine Project

This is an old database project built at the University of Wisconsin.

Key highlights of Gamma:
  1. based on a shared-nothing architecture: no sharing of memory, and communication is done only by message passing. The benefits of this approach are discussed in my earlier blog entry.
  2. use of hash-based parallel algorithms which have no central bottleneck and so can scale up almost indefinitely.
  3. use of horizontal partitioning so that large relations can be processed concurrently by multiple machines without any communication overhead.
Hardware Architecture of Gamma

Gamma Version 1.0
17 processors (8 of them with disks, the rest diskless) were connected by an 80 megabit/sec token ring network. There were several problems with this design:
  • The benefits of a larger disk page size were offset by the cost of copying tuples from a disk page into a network packet.
  • The slow network interface and the Unibus became bottlenecks, as they could not keep up with the packet traffic from the token ring.
  • With only 2 MB of memory (!), it was hard to balance the space for performing operations, the stack space for processes, and the buffer pool.
Gamma Version 2.0
They used a 32-processor iPSC/2 hypercube from Intel. Every processor had a 386 CPU, 8 MB of memory, and a 330 MB disk drive. Custom VLSI routing modules were used to connect nodes in the hypercube. Small and large messages were sent as datagrams and over dedicated circuits, respectively. Gamma's operating system, NOSE, was ported to work with Intel's hypercube. While porting the Gamma software to the Intel 386, a number of bugs in the original VAX-based code were discovered.
Software Architecture of Gamma

Gamma Storage Organizations
Relations are horizontally partitioned across all disk drives so that the database software can exploit the I/O bandwidth provided by the hardware. The query language of Gamma provides the user with three alternative declustering strategies: round robin, hashed, and range partitioned. Clustered and non-clustered indices can be built on a partitioned relation, and the query optimizer takes a relation's partitioning information into account when building query plans. Partitioning every relation across all nodes in the cluster seems questionable, though: for almost every query, all nodes would have to participate. The distribution of a relation across the nodes could be done more intelligently.
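As a rough sketch of how the three declustering strategies might assign tuples to nodes (the four-node setup and function names below are illustrative, not Gamma's actual code):

```python
# Illustrative sketch of the three declustering strategies (not Gamma's code).
NUM_NODES = 4

def round_robin_node(tuple_index):
    # Spread tuples evenly across nodes regardless of their contents.
    return tuple_index % NUM_NODES

def hashed_node(partitioning_attr):
    # The same attribute value always lands on the same node.
    return hash(partitioning_attr) % NUM_NODES

def range_node(partitioning_attr, split_points=(100, 200, 300)):
    # split_points carve the attribute domain into NUM_NODES contiguous ranges.
    for node, upper in enumerate(split_points):
        if partitioning_attr < upper:
            return node
    return NUM_NODES - 1

# A selection on the partitioning attribute can be routed to a single node
# under hashed or range declustering, but must visit every node under round robin.
```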

Gamma Process Structure
At system initialization time, a UNIX daemon process for the Catalog Manager (CM) is initiated along with a set of Scheduler Processes, a set of Operator Processes, the Deadlock Detection Process, and the Recovery Process.
  • Catalog manager: central repository of all conceptual and internal schema information for each database. It is responsible for ensuring consistency among the copies cached by each user.
  • Query Manager: responsible for caching schema information locally, providing an interface for ad-hoc queries, query parsing, optimization, and compilation.
  • Scheduler Processes: Controls multi-site query. Responsible for activating the Operator Processes used to execute the nodes of a compiled query tree.
  • Operator Process: For each operator in a query tree, at least one Operator Process is employed at each processor participating in the execution of the operator.

Query Execution
For the VAX version of Gamma (i.e., 1.0), the query plans considered by the optimizer were only left-deep trees because there was not enough memory to support right-deep or bushy plans. For later versions, they considered right-deep and bushy query plans as well.

The query optimizer recognizes that certain queries need to be executed at only a subset of the nodes in the system. The QM connects to a scheduler process, sends it the compiled query, and waits for the query to complete execution.

Query Processing Algorithms
  1. Selection Operator: If the predicate is on the partitioning attribute of the relation and the relation is hash or range partitioned, the scheduler can direct the selection operator to a subset of the nodes. If either the relation is round-robin partitioned or the selection predicate is not on the partitioning attribute, a selection operator must be initiated on all nodes over which the relation is declustered.
  2. Join Operator: The relations to be joined are partitioned into disjoint subsets (buckets) using a hash function over the join attribute, so all tuples with the same join-attribute value fall into the same bucket. Hybrid hash join is the default join algorithm used by Gamma as it provided good performance.
  3. Hybrid Hash-Join: The operation is performed in 3 phases (see the sketch after this list):
    • Step #1: Use a hash function to partition the inner (smaller) relation, R, into N buckets. The first bucket is built into an in-memory hash table while the remaining buckets are written to disk. The partitioning split table performs this step.
    • Step #2: The second relation, S, is partitioned using the same hash function. Again the first bucket is kept in memory while the remaining N-1 buckets are written to disk. Tuples in the first bucket of S immediately probe the in-memory hash table built in step #1.
    • Step #3: The remaining N-1 buckets of relation R are joined with their corresponding buckets of relation S.
  4. Aggregate Operations: Handled in 2 steps:
    • Each processor computes a piece of the result by calculating a value for each of the partitions. 
    • The processors redistribute the partial results by hashing on the "group by" attribute. The result of this step is to collect the partial results for each partition at a single site so that the final result for each partition can be computed.
  5. Update Operators: Implemented using standard techniques, except when a replace operator modifies the partitioning attribute of a tuple, in which case the split table decides the site for the modified tuple.
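As a rough, single-node sketch of the three hybrid hash join phases (the bucket count, helper names, and the in-memory lists standing in for disk files are all illustrative; Gamma additionally spreads each phase across processors via split tables):

```python
# Simplified single-node sketch of hybrid hash join (illustrative only).
from collections import defaultdict

N_BUCKETS = 4  # bucket 0 stays in memory; the rest would be spilled to disk

def bucket_of(key):
    return hash(key) % N_BUCKETS

def hybrid_hash_join(R, S, r_key, s_key):
    """R is the smaller (inner) relation; r_key/s_key extract the join attribute."""
    # Step 1: partition R; bucket 0 becomes an in-memory hash table,
    # buckets 1..N-1 are "spilled" (plain lists stand in for disk files here).
    in_memory = defaultdict(list)
    r_spilled = defaultdict(list)
    for r in R:
        b = bucket_of(r_key(r))
        if b == 0:
            in_memory[r_key(r)].append(r)
        else:
            r_spilled[b].append(r)

    results = []
    # Step 2: partition S with the same hash function; bucket-0 tuples probe
    # the in-memory table immediately, the rest are spilled.
    s_spilled = defaultdict(list)
    for s in S:
        b = bucket_of(s_key(s))
        if b == 0:
            results += [(r, s) for r in in_memory.get(s_key(s), [])]
        else:
            s_spilled[b].append(s)

    # Step 3: join the remaining buckets of R and S pairwise.
    for b in range(1, N_BUCKETS):
        table = defaultdict(list)
        for r in r_spilled[b]:
            table[r_key(r)].append(r)
        results += [(r, s) for s in s_spilled[b] for r in table.get(s_key(s), [])]
    return results

# Example: join on the first field of each tuple:
#   hybrid_hash_join(R, S, r_key=lambda t: t[0], s_key=lambda t: t[0])
```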
Transaction and Failure Management

Concurrency Control
They use 2PL, multi-granularity locking, and a centralized deadlock detection algorithm. Each site has its own lock manager and local deadlock detector, which sends its wait-for graph to the centralized deadlock detector.
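A rough sketch of the centralized detection step: each site ships its local wait-for edges, and the central detector unions them and looks for a cycle (the dictionary representation and the DFS are illustrative choices, not Gamma's implementation):

```python
# Illustrative sketch: union per-site wait-for graphs and look for a cycle.
def detect_deadlock(site_graphs):
    """site_graphs: list of dicts mapping a waiting transaction to the
    set of transactions it is waiting for at that site."""
    global_graph = {}
    for g in site_graphs:
        for txn, waits_for in g.items():
            global_graph.setdefault(txn, set()).update(waits_for)

    nodes = set(global_graph)
    for waits_for in global_graph.values():
        nodes |= waits_for

    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in nodes}

    def visit(t):
        color[t] = GREY
        for u in global_graph.get(t, ()):
            if color[u] == GREY:        # back edge => cycle => deadlock
                return True
            if color[u] == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in nodes)

# Example: site 1 has T1 waiting for T2, site 2 has T2 waiting for T1:
#   detect_deadlock([{"T1": {"T2"}}, {"T2": {"T1"}}])  ->  True
```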

Wednesday, March 27, 2013

How Google Code Search Worked

This is a short summary of this article by Russ Cox.
It describes how Code Search worked when Russ Cox was working on it as an intern at Google. The implementation works well enough to index and search large code bases on a single computer.

Indexed Word Search: The key data structure used here is called a posting list or inverted index, which lists, for every possible search term, the documents that contain that term. To find documents matching all terms in a query, the posting lists of the terms are intersected. Phrase queries can be handled either by treating the phrase as an AND query to identify a set of candidate documents and then filtering out non-matching documents after loading the document bodies from disk, or by storing position information in the index entries. Storing the position information makes the index bigger but avoids loading a document from disk unless it is guaranteed to be a match.
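A toy illustration of the posting-list idea (the documents, IDs, and helper name are made up):

```python
# Toy inverted index: term -> sorted list of document IDs containing it.
index = {
    "Google": [1, 2, 3],
    "Code":   [1, 3, 4],
    "Search": [2, 3, 4],
}

def docs_matching_all(terms):
    """AND query: intersect the posting lists of every term."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(docs_matching_all(["Code", "Search"]))   # -> [3, 4]
```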

Indexed Regular Expression Search: Google Code Search did not run the regular expression over every document to answer a query. Instead of a plain word-level inverted index, they built an index of n-grams, substrings of length n. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is. The query is converted into a trigram query built from AND (and, where needed, OR) operations; this query is run against the trigram index to identify a set of candidate documents, and the full regular expression search is then run against only those documents. Regular expressions using the OR operator lead to queries with OR terms, and parenthesized sub-expressions complicate the process of turning strings into AND expressions; applied naively, these rules can produce enormous queries, so at each step some simplifications are applied to keep the computed information manageable.

Computing trigrams: The trigrams function applied to a single string is ANY (match everything) if the string has fewer than 3 characters; otherwise it is the AND of all the trigrams in the string. The trigrams function applied to a set of strings is the OR of the trigrams function applied to each of the strings.
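A minimal sketch of the trigram query for a literal query string (the real indexer handles arbitrary regular expressions; this covers only the simple literal case):

```python
def trigrams(s):
    """Return the trigram AND query for a literal string.
    Strings shorter than three characters match ANY (no filtering)."""
    if len(s) < 3:
        return "ANY"
    grams = {s[i:i + 3] for i in range(len(s) - 2)}
    return " AND ".join(sorted(grams))

print(trigrams("Google"))   # Goo AND gle AND ogl AND oog
```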

Transformations: were defined that apply at any step during the analysis, to any regular expression e; they preserve the validity of the results. The best way to apply these transformations is to use the information-saving ones immediately before applying an information-discarding one. A little more information can be squeezed from the concatenation analysis:
if e1e2 is not exact, match(e1e2) can also require trigrams(suffix(e1) × prefix(e2)).
In addition to these transformations, they also applied basic Boolean simplifications to the match query as it is constructed, e.g. “abc OR (abc AND def)” is more expensive but no more precise than “abc”.

Implementation: was done in "Go". The cindex command, when passed directories, adds them to the current index; run with no arguments, it refreshes the existing index.
The index file contains a list of paths, a list of indexed files, and then the same posting lists we saw at the beginning of this article, one for each trigram. The index is around 20% of the size of the files being indexed. To minimize I/O and take advantage of operating system caching, csearch uses mmap to map the index into memory, reading directly from the operating system's file cache.

The Web as a graph

This blog post is about a paper by R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal.

Topic search problem: 
Given a search topic, find the high-quality output pages for the query.
Authoritative pages are focused on a topic; hub pages contain links to relevant pages on the topic.

HITS algorithm: Every page p has a pair of non-negative weights <xp, yp>.
If a page is pointed to by many good hubs, its authority weight xp grows: xp is updated to the sum of the hub weights of the pages pointing to p.
The hub weight yp is updated in the same fashion, from the authority weights of the pages that p points to.
Pages with large weights represent a very dense pattern of linkage, from pages of large hub weight to pages of large authority weight. Typically, the relative ordering of hub/authority scores becomes stable after about 5 iterations.
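A compact sketch of the HITS update loop (the graph representation and the normalization step are choices made for this example, not taken from the paper):

```python
# Illustrative HITS iteration: graph maps page -> list of pages it links to.
def hits(graph, iterations=5):
    pages = set(graph) | {q for qs in graph.values() for q in qs}
    auth = {p: 1.0 for p in pages}   # x_p
    hub = {p: 1.0 for p in pages}    # y_p
    for _ in range(iterations):
        # authority weight: sum of hub weights of pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        # hub weight: sum of authority weights of pages p points to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # normalize so the weights stay bounded
        for w in (auth, hub):
            total = sum(w.values()) or 1.0
            for p in w:
                w[p] /= total
    return auth, hub
```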

Topic enumeration problem: 
Given a snapshot of web, output all communities in it.
Bipartite core Ci,j is a graph on (i+j) nodes that contains at least one Ki,j as a subgraph.
Trawling: We can find a large fraction of cyber communities by enumerating all bipartite cores of the web. Naive algorithms are not effective for doing this.

Elimination/generation paradigm: Scan the entire web graph, modify it, and extract some metadata. The residual graph is written to disk and, together with the metadata, forms the new state. During each pass, elimination operations (prune nodes whose in-degree or out-degree is too low to be part of a core) and generation operations (identify qualifying nodes and either output a core or show that one cannot exist) are interleaved. After a few passes, the benefits of elimination and generation tail off.
The algorithm runs in time linear in the size of the web graph.
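As a rough illustration of the elimination step, here is a sketch that repeatedly drops edges whose endpoints cannot take part in any (i, j) core (the in-memory edge-list representation is a simplification; the real algorithm streams the graph from disk in passes):

```python
# Illustrative elimination pass for trawling over an in-memory edge list.
def eliminate(edges, i, j):
    """edges: iterable of (src, dst) hyperlinks. Keep only edges whose
    endpoints could still belong to some K_{i,j} (i fans, j centers)."""
    edges = set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for u, v in edges:
            out_deg[u] = out_deg.get(u, 0) + 1
            in_deg[v] = in_deg.get(v, 0) + 1
        # A fan must point to at least j centers; a center must be pointed
        # to by at least i fans.
        kept = {(u, v) for (u, v) in edges if out_deg[u] >= j and in_deg[v] >= i}
        if kept == edges:
            return edges
        edges = kept
```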

Supervised classification problem: 
Given a set of predefined categories, build a system that assigns a document to one of the categories.
Using terms in the anchor text of links pointing to a document degrades the accuracy.
Better: a "statistical model based" approach. Pages on similar topics tend to be linked to each other more than unrelated ones are. This relation can be captured using the Markov Random Field (MRF) model.

Hyperclass algorithm: Instead of just considering page p for classification, also consider the pages q linked to and from it (p -> q and q -> p). If hyperlinks are used along with the text, the categorization improves.
Relaxation labeling: Category labels of linked pages and of those to be categorized are adjusted until the most probable class configuration is found.

Web graph models can serve as analytical, explanatory, and predictive tools.
3 properties of the web graph model: the rich get richer, correlated out-links, and correlated in-links.
The in-degrees and out-degrees of nodes follow an inverse polynomial (power-law) distribution.

Stuff I've seen: A system for personal information retrieval and re-use

This paper is about the SIS system designed at Microsoft Research.
As per studies, while searching, about 60-80% of the time people revisit previously seen web pages. SIS tries to take this fact into account. It creates a unified index of all the information a person has seen on his/her computer(s), which can be in any form, including local disk files, emails, web pages, calendar entries, etc. It also maintains contextual information about the content, in the form of thumbnails, meta-information about the document, and previews; this is missing in web search results. SIS provides a unified index across multiple applications that differ in data organization strategy and indexing technique. As it creates a local index from the data, it is quicker than web search systems. It also has a query refinement facility. The rich contextual cues and previews gathered from the user's activities are useful.

The system comprises 5 components: Gatherer, Filter, Tokenizer, Indexer, and Retriever. The Gatherer is responsible for gathering data from various sources. The Filter emits a character stream from the decoded data. The Tokenizer performs tokenization and carries out linguistic processing. The Indexer generates an inverted index used for searching. The Retriever is the query language used by the user to search the populated index. A SIS client must run on every user's machine. The user interface came in 2 forms: Top view and Side view. The former is a list view with filters for refining attributes in each column and is more flexible. The latter differs from the former in the positioning of the filters: they are placed on the left and revealed serially, which was easier to understand and made the results less cluttered. The system was evaluated using questionnaires and log analysis. Users were asked about their search experiences before and after using SIS. Log analysis gave insight into query patterns, user-interface interactions, and the quality of the results produced. Very few user queries involved boolean operators or phrases. Average query length was 1.59 words.
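Going back to the five-component pipeline, here is a toy sketch of how the pieces might fit together (only the component names come from the paper; the class bodies are invented for illustration):

```python
# Toy pipeline in the spirit of SIS's five components (bodies are made up).
class Gatherer:
    def collect(self):                 # pull items from files, mail, web cache...
        return ["hello world", "stuff i have seen"]

class Filter:
    def to_text(self, item):           # decode each item into a character stream
        return item

class Tokenizer:
    def tokens(self, text):            # tokenization + (very light) linguistic processing
        return text.lower().split()

class Indexer:
    def __init__(self):
        self.postings = {}
    def add(self, doc_id, tokens):     # build the inverted index
        for t in tokens:
            self.postings.setdefault(t, set()).add(doc_id)

class Retriever:
    def __init__(self, indexer):
        self.postings = indexer.postings
    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))

gatherer, flt, tok, idx = Gatherer(), Filter(), Tokenizer(), Indexer()
for i, item in enumerate(gatherer.collect()):
    idx.add(i, tok.tokens(flt.to_text(item)))
print(Retriever(idx).search("stuff"))   # -> [1]
```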

This seems to be a normal search engine, with the only add-on being that it gathers data from several sources and parses it. Since it is customized for a particular user, it makes sense to store the index locally on the machine, which gives good response times. Web search engines these days have become much smarter and are user-customized. One direct disadvantage of the system is that it will not produce results beyond what the user has already visited. The system will also increase the load on the user's machine, since it runs continuously, and over time the gathered data keeps accumulating, consuming a significant amount of the local machine's storage.

As we may think


This was an article written by Vannevar Bush in 1945. It starts with a discussion of the research advancements of the time when the article was written. With an explosion in the number of researchers and their findings, the author states that any researcher has to devote a considerable amount of time to reading all the findings of his/her field. A couple of researchers with innovative ideas could not put them to practical use (viz. Leibnitz with his calculating machine, Babbage with his arithmetical machine). Interchangeable parts, which once required skilled labor, can now be produced at low cost using machines. The article then moves to the scope for advancement in photography: from the lenses and the size of the camera to the development of the picture from the film, there was ample scope for changing the way the camera worked back then. Microphotography, involving microfilm and reduction by a linear factor, gives good results after re-enlarging. With a reduction factor of 10,000, a large encyclopaedia could be stored in a very small space on microfilm.
           
The retrieval of data from these storage media matters too. The author mentions the Voder and the Vocoder, which convert keystrokes into sound and vice versa. A stenotype coupled with a Vocoder could create a device which types when talked to. Such a device is described and its benefits to an experimenter are discussed. Arithmetic operations like addition can be performed on these media; with that, complex calculations can be done quickly, even by naive users. With increasing market needs, complex analysis and predictions are also expected to be done on these media, which can open doors to deeper insights into any data, whether economic figures or laboratory readings. This would free people from doing what they are not good at and allow them to focus on things that they are really good at. The need for analyzing data is not restricted to high-end researchers; even the common man can employ it. Any conclusion can be derived using logical circuits. Later he talks about the Memex and his prediction about how it could revolutionize various occupations. In the end, an argument is made for possible experiments that could be performed using known technologies and those discussed earlier in the article.

This article shows Vannevar Bush's vision of solving real-world problems by looking at existing solutions and technologies. His thinking was directed quite far beyond the technology present at the time he wrote the article. The merging of the stenotype and the Vocoder is the classic signature of a person who has been in the field for a long while and knows how to fit things together. It is stunning to read an article from 1945 mentioning tools for analysis that might be needed not only by people in high-end labs with scientific backgrounds but also by common people. In the idea of the memex, he knew that merely using compression and getting the data stored on compact films would not be a complete solution. Being a person with business sense rather than a pure designer, he understood that the ability to use the data and do something with it would be needed. He also evaluated his solution in terms of benefits to doctors, historians, etc., which clearly distinguishes him from a blind researcher.

Sunday, March 24, 2013

Algorithm based Fault Tolerance

Algorithm-based fault tolerance (ABFT) is an error detection, location and correction scheme which uses redundant computations within the algorithms to detect and correct errors caused by permanent or transient failures in the hardware, concurrently with normal operation.

The key idea of the ABFT technique is to encode the data at a higher level using checksum schemes and redesign algorithms to operate on the encoded data. ABFT can be used to detect and correct errors with a low overhead. ABFT is generally not applicable to all problems.

Systematic way of converting a concurrent computation into an ABFT system

In an ABFT system, the input data is encoded and the algorithm is modified to work on the encoded input. The computation steps of the algorithm are then distributed among the computation units; it is important to allocate the computational steps to different processors so that an error in one processor affects as few data elements as possible. The encoded output produced by the algorithm is decoded to check the correctness of the results.

When the modified algorithm is actually executed on a multiprocessor architecture, the overheads are required to be small in comparison to TMR (triple modular redundancy). The modified algorithm does take more time to operate on the encoded data than the original algorithm takes on the unencoded data.


ABFT approach for matrix operations

Let PEi indicate a processing element in the system.
DS(i) is the set of data items affected by PEi. The system has checks C1, C2, ..., Cp which check the correctness of the data produced by the PEs.
A check Ck has an associated error set ES(k) indicating the data elements checked by Ck.

A (g,h) check, where g is the cardinality of the error set and h is the number of errors the check can handle, is evaluated as:
  • pass (output = 0) if there is no error in its data items
  • fail (output = 1) if there are between 1 and h errors
  • unpredictable if there are more than h errors in its data items.

If A is an n*n matrix, we define a row checksum matrix Ar and a column checksum matrix Ac as follows. Let e be the n*1 vector [1 1 ... 1]T.

Ar is an n*(n+1) matrix:

Ar = [ A   Ae ]

Ac is an (n+1)*n matrix:

Ac = [ A   ]
     [ eTA ]

where Ae is the row-sum (column) vector and eTA is the column-sum (row) vector. The full checksum matrix Af is the (n+1)*(n+1) matrix:

Af = [ A     Ae   ]
     [ eTA   eTAe ]
Using either a row checksum or a column checksum matrix makes the system single-error detectable. With a full checksum matrix, the exact location of one error can be determined by intersecting the inconsistent row and column, so the system is also single-error correctable.
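As a rough numpy sketch of the full-checksum encoding and single-error location (the matrix, the injected error, and the helper names are purely illustrative):

```python
import numpy as np

def full_checksum(A):
    """Append a column of row sums and a row of column sums to A."""
    with_col = np.hstack([A, A.sum(axis=1, keepdims=True)])            # [A | Ae]
    return np.vstack([with_col, with_col.sum(axis=0, keepdims=True)])  # add eTA row

def locate_error(Af):
    """Return (row, col) of a single inconsistent row/column pair, or None."""
    data = Af[:-1, :-1]
    bad_rows = np.where(~np.isclose(data.sum(axis=1), Af[:-1, -1]))[0]
    bad_cols = np.where(~np.isclose(data.sum(axis=0), Af[-1, :-1]))[0]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return int(bad_rows[0]), int(bad_cols[0])   # intersect row and column
    return None

A = np.arange(9, dtype=float).reshape(3, 3)
Af = full_checksum(A)
Af[1, 2] += 5.0                 # inject a single transient error
print(locate_error(Af))         # -> (1, 2)
```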

Concurrency Control and Performance

This paper was authored by Rakesh Agrawal, Mike Carey and Miron Livny.
Previous work had conflicting results: blocking beats restarts, restarts beat blocking, etc.

Goal of this paper
  • Do a good job modeling the problem and its variants
  • Capture causes of previous conflicting results
  • Make recommendations based on the variables of the problem

Methodology
  Simulation study. Compare the 3 approaches:
a. Blocking (i.e. 2PL): block when a lock request is denied
b. Immediate Restart: when denied a lock, restart after a delay
c. Optimistic: restart decision taken at commit points
  Pay attention to the model of the system:
1. database system model: hardware and software model (CPUs, disks, size & granularity of DB, load control mechanism, CC algorithm)
2. user model: arrival of user tasks, nature of tasks (e.g. batch vs. interactive)
3. transaction model: logical reference string (i.e. CC schedule), physical reference string (i.e. disk block requests, CPU processing bursts)
4. Probabilistic modeling of each. They argue this is key to a performance study of a DBMS.
  logical queuing model
  physical queuing model

Measurements
a) measure throughput, mostly
b) pay attention to variance of response time, too
c) pick a DB size so that there are noticeable conflicts (else you get comparable performance)

Experiment 1: Infinite Resources
  • as many disks and CPUs as you want
  • blocking thrashes due to transactions blocking numerous times
  • restart plateaus: adaptive wait period (avg response time) before restart
  • optimistic scales logarithmically
  • standard deviation of response time under locking much lower
Experiment 2: Limited Resources (1 CPU, 2 disks)
  • Everybody thrashes
  • blocking throughput peaks at mpl 25
  • optimistic peaks at 10
  • restart peaks at 10, plateaus at 50 – as good or better than optimistic
  • at super-high mpl (200), restart beats both blocking and optimistic
  • load control is the answer here – adding it to blocking & optimistic makes them handle higher mpls
Experiment 3: Multiple Resources (5, 10, 25, 50 CPUs, 2 disks each)
  • optimistic starts to win at 25 CPUs
  • when useful disk utilization is only about 30%, system begins to behave like infinite resources
  • even better at 50
Experiment 4: Interactive Workloads
  • Add user think time.
  • makes the system appear to have more resources
  • so optimistic wins with think times 5 & 10 secs.
  • Blocking still wins for 1 second think time.

Questioning the assumptions:
Fake restarts – biases for optimistic
  • result in less conflict
  • Cost of conflict in optimistic is higher
  • issue of k > 2 transactions contending for one item
  • will have to punish k-1 of them with real restart
Write-lock acquisition
  • recall our discussion of lock upgrades and deadlock
  • blind write biases for restart (optimistic not an issue here), particularly with infinite resources
  • blocking holds write locks for a long time; waste of deadlock restart not an issue here.
  • with finite resources, blind write restarts transactions earlier (making restart look better)
Conclusions
  blocking beats restarting, unless resource utilization is low
  possible in situations of high think time
  mpl control important. admission control the typical scheme. Restartʼs adaptive load control is too clumsy, though.
  false assumptions made blocking look relatively worse

Parallel Database Systems: The Future of High Performance Database Processing

This was a paper written by David J. DeWitt and Jim Gray back in 1992. It reviews the techniques used by parallel systems and surveys commercial and research systems existing at that time.

History

- Mainframes dominated most database and transaction processing tasks.
- Parallel Machines were practically written off.
- Specialized Database Machines came up with trendy hardware.
- Relational Data Model brought about a revolution. 
- Relational queries can be viewed as a series of operators applied to streams of data.
- Pipelined parallelism: the output of one operator is streamed as the input to the next.
- Partitioned parallelism: the same operator is executed across multiple machines simultaneously, but over different data partitions or chunks.

Parallelism Goals and Metrics
Speedup: The ratio of time taken by small system to the time taken by a large system to perform the same task. 
Scaleup: The ability of an N-times larger system to perform an N-times larger job in the same elapsed time as the original system. 

There are 2 categories of scaleup:
(a) Transactional scaleup: the task consists of performing many small independent requests submitted by many clients and operating on a shared database. Here there are N-times as many clients, submitting N-times as many requests against an N-times larger database. Typical of TP systems and timesharing systems.
(b) Batch scaleup: here there is one large task to be done on an N-times larger database. Typical of database queries and scientific simulations.

3 barriers to linear speedup and linear scaleup are startup (the time to start the operation), interference (each new process competes for shared resources), and skew (with high parallelism the average size of each sub-task shrinks, but the variance can remain high).

Hardware Architecture
The challenge is to build an infinitely fast processor out of infinitely many processors of finite speed, and to build an infinitely large memory with infinite memory bandwidth from infinitely many storage units of finite speed. This is impractical due to interference generated by inclusion of every new node in the system.

Possible multi-processor designs:

(a) shared-memory
(b) shared-disks
(c) shared nothing

Shared-memory and shared-disk systems do not scale well on database applications. Interference is a major problem for shared-memory multi-processors. Most shared-memory multi-processors have adopted a shared-disk architecture to leverage affinity scheduling. For shared database applications, though, the shared-disk approach is much more expensive than shared-nothing: instead of exchanging small, high-level logical questions and answers among clients and servers, it must exchange messages to reserve data and ship large numbers of data pages.
The main advantage of shared-nothing multi-processors is that they can be scaled up to hundreds and probably thousands of processors that do not interfere with one another. They are known to deliver near-linear speedups and scaleups on complex relational queries and on online-transaction processing workloads. 

Parallel Dataflow Approach to SQL Software
Relational queries can be executed as a dataflow graph. This graph can use both pipelined parallelism and partitioned parallelism. Pipeline parallelism provides limited speedup due to 3 limitations: relational pipelines are typically short; some operators cannot be pipelined because they need all of their input before producing any output; and the execution costs of the operators in a pipeline often differ widely. Partitioning the data and then parallelizing the execution provides good scaleup and speedup (the divide and conquer paradigm).



Data partitioning technique | Feature                                                    | Drawback
range partitioning          | good for sequential and associative access, range queries | leads to data skew and execution skew
round-robin                 | good for complete scans                                    | bad for point queries
hashing                     | good for sequential and associative access                 |

These partitioning techniques improve response time only up to the point where the cost of starting a query on a node becomes a significant fraction of the actual execution time.

Parallelism Within Relational Operators: Use the existing relational operator routines over the data partitions. This way we save the effort of developing new "parallel" versions of the operators, though we still need some routines for parallelization. The first basic parallelizing operator is a merge, which can combine several parallel data streams into a single sequential stream. A split operator is used to partition or replicate the stream of tuples produced by a relational operator.
An example query in the paper depicts how split and merge are used along with the normal operators. As the operations are performed across different processors, there is less interference. The split and merge operators have flow control and buffering to prevent one operator from getting too far ahead in the computation.
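A toy rendition of split and merge (the list-based streams are a stand-in; Gamma's real operators stream tuples over the network with flow control and buffering):

```python
# Toy split operator: route each produced tuple to one of several
# downstream streams by hashing the partitioning attribute.
def split(tuples, key, n_streams):
    streams = [[] for _ in range(n_streams)]
    for t in tuples:
        streams[hash(key(t)) % n_streams].append(t)
    return streams

# Toy merge operator: combine several parallel streams into one stream.
def merge(streams):
    for s in streams:
        yield from s

rows = [("alice", 10), ("bob", 20), ("carol", 30)]
partitions = split(rows, key=lambda r: r[0], n_streams=3)
print(list(merge(partitions)))
```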

Specialized Parallel Relational Operators: The algorithms used to implement the relational operators can be improved or changed to take into account the parallelization of the target system, e.g., sort-merge join works well in a parallel dataflow environment unless there is data skew. With data skew, some sort partitions may be much larger than others; this skew problem is not present in a centralized execution system. The hash join is a better choice here: it breaks a big join into many little joins, and unless the bucket sizes are heavily skewed, we can expect good performance.

Existing systems at that time

Teradata: Teradata processors are functionally divided into two groups: Interface Processors (IFPs) and Access Module Processors (AMPs). The IFPs handle communication with the host, query parsing and optimization, and coordination of AMPs during query execution. The AMPs are responsible for executing queries. Each relation is hash partitioned over a subset of the AMPs: a hash function is applied to the primary key of the tuple to select an AMP for storage. Once a tuple arrives at an AMP, a second hash function determines the tuple's placement in its fragment of the relation. Hashing is also used to split the outputs of relational operators into intermediate relations. Join operators are executed using a parallel sort-merge algorithm.

Tandem NonStop SQL: Each disk is served by a set of processes managing a large shared RAM cache, a set of locks, and log records for the data on that disk pair. Considerable effort is spent on optimizing sequential scans by prefetching large units, and by filtering and manipulating the tuples with SQL predicates at these disk servers. Tandem systems demonstrate near-linear scaleup on transaction processing workloads, and near-linear speedup and scaleup on large relational queries.

Gamma: Provides round-robin, range, hash partitioning and hybrid-range partitioning that combines the best features of the hash and range partitioning strategies. Gamma uses split and merge operators to execute relational algebra operators using both parallelism and pipelining. Near-linear speedup and scaleup for relational queries has been reported.

Super Database Computer: The basic unit, called a processing module (PM), consists of one or more processors on a shared memory. Data is partitioned among the PMs by hashing. 

Database Machines and Grosch’s Law

In the 1960’s, Herb Grosch observed that there is an economy-of-scale in computing. At that time, expensive computers were much more powerful than inexpensive computers. Grosch’s law no longer applies to database and transaction processing problems. By combining hundreds or thousands of these small systems, one can build an incredibly powerful database machine for much less money than the cost of a modest mainframe. For database problems, the near-linear speedup and scaleup of these shared-nothing machines allows them to outperform current shared-memory and shared disk mainframes.

Future Directions and Research Problems

Mixing Batch and OLTP Queries
  • Large relational queries tend to acquire many locks and to hold them for a relatively long time. This prevents simple online transactions from concurrently updating the data.
  • Priority scheduling: Batch jobs have a tendency to monopolize the processor, flood the memory cache, and make large demands on the I/O subsystem.
  • Priority inversion problem: A low-priority client makes a request to a high priority server. The server must run at high priority because it is managing critical resources. Given this, the work of the low priority client is effectively promoted to high priority when the low priority request is serviced by the high-priority server.

Parallel Query Optimization
  • To date, no query optimizers consider all the parallel algorithms for each operator and all the query tree organizations.
  • Data skew can lead to high variance in the size of intermediate relations, leading to both poor query plan cost estimates and sub-linear speedup.

Application Program Parallelism
  • Tools are needed to structure application programs to take advantage of parallelism inherent in these parallel systems. 
  • Library packages to facilitate explicitly parallel application programs are needed.

Physical Database Design
Database design tools are needed to pick the best of the many possible indexing and partitioning combinations, given a description of the queries comprising the workload, their frequencies of execution, statistical information about the relations in the database, and a description of the processors and disks.

On-line Data Reorganization and Utilities
  • A difficult problem is how to process database utility commands while the system remains operational and the data remains available for concurrent reads and writes by others. 
  • The fundamental properties of such algorithms are that they must be online (operate without making data unavailable), incremental (operate on parts of a large database), parallel (exploit parallel processors), and recoverable (allow the operation to be canceled and return to the old state).