Thursday, March 28, 2013

The Gamma Database Machine Project

This is an old database project built at the University of Wisconsin.

Key highlights of Gamma:
  1. It is based on a shared-nothing architecture: no sharing of memory, with communication done only by message passing. The benefits of this design are discussed in my earlier blog entry.
  2. It uses hash-based parallel algorithms which need no centralized control and so can scale up almost indefinitely.
  3. It uses horizontal partitioning so that large relations can be processed concurrently by multiple machines without any communication overhead.
Hardware Architecture of Gamma

Gamma Version 1.0
17 processors (8 with disks, the rest diskless) were connected by an 80 megabit/sec token ring network. There were several problems with this design:
  • The benefits of a larger disk page size were offset by the cost of copying tuples from a disk page into a network packet.
  • The slow network interface and the Unibus became bottlenecks, as they could not keep up with the packet traffic from the token ring.
  • With very little memory (just 2 MB!), separately balancing space for performing operations, stack space for processes, and the buffer pool was hard.
Gamma Version 2.0
They used a 32-processor iPSC/2 hypercube from Intel. Every processor had a 386 CPU, 8 MB of memory, and a 330 MB disk drive. Custom VLSI routing modules were used to connect the nodes in the hypercube. Small messages were sent as datagrams, while large messages used dedicated circuits. Gamma's operating system, NOSE, was ported to Intel's hypercube. While porting the Gamma software to the Intel 386, a number of bugs in the original VAX-based code were discovered.
Software Architecture of Gamma

Gamma Storage Organizations
Relations are horizontally partitioned across all disk drives so that the database software can exploit the I/O bandwidth provided by the hardware. The query language of Gamma provides the user with three alternative declustering strategies: round robin, hashed, and range partitioned. Clustered and non-clustered indices are provided on a partitioned relation, and the query optimizer takes the partitioning information of a relation into account in its query plans. Partitioning every relation across all the nodes in the cluster is a questionable idea, though: it makes it likely that every query will involve every node. The distribution of a relation across the nodes could be chosen more intelligently.
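To make the three strategies concrete, here is a minimal sketch of how a tuple's home node might be chosen under each one. This is my own illustration, not Gamma code; the node count, range boundaries, and function names are all invented.

    # Illustrative sketch of Gamma's three declustering strategies.
    # Not Gamma code: NUM_NODES and RANGE_BOUNDARIES are invented.

    NUM_NODES = 8
    RANGE_BOUNDARIES = [100, 200, 300, 400, 500, 600, 700]  # hypothetical split points

    def round_robin_node(tuple_seq_no):
        # Tuples go to nodes in insertion order, spreading load evenly.
        return tuple_seq_no % NUM_NODES

    def hashed_node(partitioning_attr):
        # A hash of the partitioning attribute picks the node, so an
        # exact-match selection on that attribute touches one node.
        return hash(partitioning_attr) % NUM_NODES

    def range_node(partitioning_attr):
        # Contiguous ranges stay together, so a range selection on the
        # attribute touches only the nodes holding the relevant ranges.
        for node, upper in enumerate(RANGE_BOUNDARIES):
            if partitioning_attr < upper:
                return node
        return NUM_NODES - 1

The trade-off sketched here is exactly why the optimizer cares about partitioning: hashed and range placement let it prune nodes for selective queries, while round robin never can.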

Gamma Process Structure
At system initialization time, a UNIX daemon process for the Catalog Manager (CM) is initiated along with a set of Scheduler Processes, a set of Operator Processes, the Deadlock Detection Process, and the Recovery Process.
  • Catalog Manager: the central repository of all conceptual and internal schema information for each database. It is responsible for ensuring consistency among the copies cached by each user.
  • Query Manager: responsible for caching schema information locally, providing an interface for ad-hoc queries, query parsing, optimization, and compilation.
  • Scheduler Processes: control multi-site queries. Responsible for activating the Operator Processes used to execute the nodes of a compiled query tree.
  • Operator Process: For each operator in a query tree, at least one Operator Process is employed at each processor participating in the execution of the operator.

Query Execution
For the VAX version of Gamma (i.e., 1.0), the query plans considered by the optimizer were only left-deep trees, because there was not enough memory to support right-deep or bushy plans. For later versions, they considered right-deep and bushy query plans as well.

The query optimizer recognizes that certain queries need to be executed at only a subset of the nodes in the system. The QM connects to a scheduler process, sends it the compiled query, and waits for the query to complete execution.

Query Processing Algorithms
  1. Selection Operator: If the predicate is on the partitioning attribute of the relation and the relation is hash or range partitioned, the scheduler can direct the selection operator to a subset of the nodes. If either the relation is round-robin partitioned or the selection predicate is not on the partitioning attribute, a selection operator must be initiated on all nodes over which the relation is declustered.
  2. Join Operator: The relations to be joined are partitioned into disjoint subsets (buckets) using a hash function over the join attribute. All the tuples with the same value of the join attribute fall in the same bucket. Hybrid hash join is the default join algorithm used by Gamma, as it provided good performance.
  3. Hybrid Hash-Join: The operation is performed in 3 phases (see the sketch after this list):
    • Step #1: Use a hash function to partition the inner (smaller) relation, R, into N buckets. The first bucket is held in memory as a hash table, while the remaining N-1 buckets are written to disk. A partitioning split table drives this step.
    • Step #2: The second relation, S, is partitioned using the hash function from step #1. Tuples falling in the first bucket immediately probe the in-memory hash table built in step #1, while the remaining N-1 buckets are written to disk.
    • Step #3: The remaining N-1 buckets of relation R are joined with their respective buckets of relation S.
  4. Aggregate Operations: Handled in 2 steps:
    • Each processor computes a piece of the result by calculating a value for each of the partitions. 
    • The processors redistribute the partial results by hashing on the "group by" attribute. The result of this step is to collect the partial results for each partition at a single site so that the final result for each partition can be computed.
  5. Update Operators: Implemented using standard techniques, except when a replace operator modifies the partitioning attribute of a tuple, in which case the split table decides the new site for the modified tuple.
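Here is a minimal single-node sketch of the three hybrid hash join phases. It is my own illustration, not Gamma's code: Gamma runs the split tables and joins in parallel across processors, which this sketch omits, and bucket 0 stands in for the bucket kept in memory.

    # Single-node sketch of hybrid hash join; an illustration, not
    # Gamma's implementation. R and S are lists of (join_key, payload)
    # pairs; N is the number of buckets. Spilled buckets are plain
    # lists here, standing in for disk files.

    from collections import defaultdict

    def hybrid_hash_join(R, S, N):
        bucket = lambda key: hash(key) % N

        # Phase 1: partition inner relation R. Bucket 0 stays in memory
        # as a hash table; buckets 1..N-1 would be written to disk.
        hash_table = defaultdict(list)
        r_spill = [[] for _ in range(N)]
        for key, payload in R:
            b = bucket(key)
            if b == 0:
                hash_table[key].append(payload)
            else:
                r_spill[b].append((key, payload))

        # Phase 2: partition S with the same hash function. Tuples in
        # bucket 0 probe the in-memory table immediately; the rest spill.
        results = []
        s_spill = [[] for _ in range(N)]
        for key, payload in S:
            b = bucket(key)
            if b == 0:
                for r_payload in hash_table.get(key, []):
                    results.append((key, r_payload, payload))
            else:
                s_spill[b].append((key, payload))

        # Phase 3: join the N-1 spilled bucket pairs one at a time, by
        # building a table on the R bucket and probing with the S bucket.
        for b in range(1, N):
            table = defaultdict(list)
            for key, payload in r_spill[b]:
                table[key].append(payload)
            for key, payload in s_spill[b]:
                for r_payload in table.get(key, []):
                    results.append((key, r_payload, payload))
        return results

    R = [(1, "r1"), (2, "r2"), (2, "r2b")]
    S = [(2, "s1"), (3, "s2")]
    print(hybrid_hash_join(R, S, 4))  # [(2, 'r2', 's1'), (2, 'r2b', 's1')]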
Transaction and Failure Management

Concurrency Control
Gamma uses 2PL, multi-granularity locking, and a centralized deadlock detection algorithm. Each site has its own lock manager and local deadlock detector, which sends its waits-for graph to the centralized deadlock detector.
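As a toy illustration of the centralized detector's job (the actual Gamma protocol details are not shown here): each site ships its local waits-for graph, and the detector unions them and searches the combined graph for a cycle.

    # Toy sketch of centralized deadlock detection: union the per-site
    # waits-for graphs and search the combined graph for a cycle.
    # Illustration only.

    def find_deadlock(site_graphs):
        # Each site graph maps a waiting transaction to the set of
        # transactions it waits for.
        combined = {}
        for g in site_graphs:
            for txn, waits_for in g.items():
                combined.setdefault(txn, set()).update(waits_for)

        visited, on_stack = set(), set()

        def has_cycle(txn):
            visited.add(txn)
            on_stack.add(txn)
            for other in combined.get(txn, ()):
                if other in on_stack:
                    return True
                if other not in visited and has_cycle(other):
                    return True
            on_stack.discard(txn)
            return False

        return any(has_cycle(t) for t in list(combined) if t not in visited)

    # Example: T1 waits for T2 at site A; T2 waits for T1 at site B.
    print(find_deadlock([{"T1": {"T2"}}, {"T2": {"T1"}}]))  # True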

Wednesday, March 27, 2013

How Google Code Search Worked

This is a short summary of this article by Russ Cox.
It describes how Code Search worked back when Russ was working on it as an intern at Google. The implementation works well enough to index and search large code bases on a single computer.

Indexed Word Search: The key data structure used here is called a posting list or inverted index, which lists, for every possible search term, the documents that contain that term. To find documents matching all terms in a query, the posting lists of the terms are intersected. Another way is to treat a phrase as an AND query over its words to identify a set of candidate documents, and then filter out non-matching documents after loading the document bodies from disk. Storing position information in the index entries makes the index bigger but avoids loading a document from disk unless it is guaranteed to be a match.
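As a small illustration (not Code Search's actual data structures), intersecting two sorted posting lists of document IDs looks like this:

    # Sketch of intersecting two sorted posting lists (document IDs).
    # Illustration only.

    def intersect(list_a, list_b):
        i, j, out = 0, 0, []
        while i < len(list_a) and j < len(list_b):
            if list_a[i] == list_b[j]:
                out.append(list_a[i])
                i += 1
                j += 1
            elif list_a[i] < list_b[j]:
                i += 1
            else:
                j += 1
        return out

    # Documents containing both query words:
    print(intersect([1, 4, 7, 9], [2, 4, 9, 12]))  # [4, 9]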

Indexed Regular Expression Search: Google Code Search did not answer queries by running the regular expression over every document. Instead of merely using a 1-gram (single-word) inverted index, they built an index of n-grams, substrings of length n. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is. The query phrase is converted into trigrams joined by AND operations. This query is run against the trigram index to identify a set of candidate documents, and then the full regular expression search is run against only those documents. A regular expression using the OR operator leads to a query with OR terms, and parenthesized sub-expressions complicate the process of turning strings into AND expressions. Applied naively, these rules can give huge queries, so at each step they applied simplifications to keep the computed information manageable.

Computing trigrams: The trigrams function applied to a single string is ANY (matching anything) if the string has fewer than 3 characters, else it is the AND of all the trigrams in the string. The trigrams function applied to a set of strings is the OR of the trigrams function applied to each of the strings.
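A minimal sketch of the single-string rule, my own illustration rather than the actual codesearch implementation:

    # Sketch of the trigrams rule for a literal string: fewer than three
    # characters yields no useful constraint (ANY); otherwise the query
    # is the AND of every trigram in the string. Illustration only.

    def trigrams_query(s):
        if len(s) < 3:
            return "ANY"  # too short to narrow the candidate set
        return " AND ".join(s[i:i + 3] for i in range(len(s) - 2))

    print(trigrams_query("Google"))  # Goo AND oog AND ogl AND gle
    print(trigrams_query("Go"))      # ANY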

Transformations were defined that can apply at any step during the analysis, to any regular expression e, and that preserve the validity of the results. The best way to apply these transformations is to use the information-saving transformations immediately before applying an information-discarding transformation. A little more information can be squeezed from the concatenation analysis:
if e1e2 is not exact, match(e1e2) can also require trigrams(suffix(e1) × prefix(e2)).
In addition to these transformations, they also applied basic Boolean simplifications to the match query as it is constructed, e.g. "abc OR (abc AND def)" is more expensive but no more precise than "abc".

Implementation: The implementation was done in Go. If the cindex command is passed directories, it updates the current index with those directories; otherwise it refreshes the existing index.
The index file contains a list of paths, a list of indexed files, and then the same posting lists we saw at the beginning of this article, one for each trigram. The index is around 20% of the size of the files being indexed. To minimize I/O and take advantage of operating system caching, csearch uses mmap to map the index into memory and in doing so reads directly from the operating system's file cache.

The Web as a graph

This blog post is about a paper by R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal.

Topic search problem: 
Given a search topic, find the high-quality pages for the query.
Authoritative pages are focused on a topic; hub pages contain links to relevant pages on the topic.

HITS algorithm: Every page has a pair of non-negative weights <xp, yp>.
If a page is pointed to by many good hubs, its authority weight xp is increased (it is set to the sum of the hub weights of the pages pointing to it).
The hub weight yp is updated in the same fashion, from the authority weights of the pages it points to.
Pages with large weights represent a very dense pattern of linkage, from pages of large hub weight to pages of large authority weight. Typically, the relative ordering of hub/authority scores becomes stable after about 5 iterations.
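A compact sketch of the HITS iteration (my own illustration; links maps each page to the pages it points to, and the normalization is one common variant):

    # Compact sketch of the HITS iteration; illustration only.

    def hits(links, iterations=5):
        pages = set(links) | {q for qs in links.values() for q in qs}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # Authority update: xp = sum of hub weights of pages pointing to p.
            auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                    for p in pages}
            # Hub update: yp = sum of authority weights of pages p points to.
            hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
            # Normalize so the weights stay bounded.
            for w in (auth, hub):
                norm = sum(v * v for v in w.values()) ** 0.5 or 1.0
                for p in w:
                    w[p] /= norm
        return auth, hub

    links = {"a": ["b", "c"], "b": ["c"], "d": ["c"]}
    auth, hub = hits(links)  # "c" ends up the top authority, "a" the top hub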

Topic enumeration problem: 
Given a snapshot of the web, output all the communities in it.
A bipartite core Ci,j is a graph on (i+j) nodes that contains at least one complete bipartite graph Ki,j as a subgraph.
Trawling: We can find a large fraction of cyber-communities by enumerating all the bipartite cores of the web. Naive algorithms are not effective for doing this.

Elimination/generation paradigm: Scan the entire web graph, modify it, and extract some metadata. The residual graph is written to disk and, along with the metadata, forms the new state. During each pass, elimination operations (pruning edges at nodes of too-low degree) and generation operations (identifying qualifying nodes and outputting a core, or showing that none can exist there) are interleaved. After a few passes, the benefits of elimination and generation tail off.
The algorithm runs in time linear in the size of the web graph; a sketch of the elimination side follows.
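This is my own in-memory illustration; the real algorithm streams the graph from disk and interleaves generation steps. A node can be one of the i "fans" of a Ki,j only if its out-degree is at least j, and one of the j "centers" only if its in-degree is at least i, so edges violating these bounds are pruned repeatedly.

    # Rough in-memory sketch of elimination passes when trawling for
    # K_{i,j} cores. Illustration only.

    def prune(edges, i, j):
        # edges: a set of (src, dst) pairs.
        while True:
            out_deg, in_deg = {}, {}
            for src, dst in edges:
                out_deg[src] = out_deg.get(src, 0) + 1
                in_deg[dst] = in_deg.get(dst, 0) + 1
            # A fan needs out-degree >= j; a center needs in-degree >= i.
            kept = {(s, d) for (s, d) in edges
                    if out_deg[s] >= j and in_deg[d] >= i}
            if kept == edges:
                return kept  # fixed point: nothing left to eliminate
            edges = kept

    edges = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"),
             ("f2", "c2"), ("f3", "c1")}
    print(prune(edges, i=2, j=2))  # the K_{2,2} on {f1,f2} x {c1,c2} survives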

Supervised classification problem: 
Given a set of predefined categories, build a system that assigns a document to one of the categories.
Using terms in the anchor text of links pointing to a document degrades the accuracy.
Better: a statistical-model-based approach. Pages on similar topics tend to be linked to each other more than unrelated pages are. This relation can be captured using the Markov Random Field (MRF) model.

Hyperclass algorithm: Instead of considering just page p for classification, also consider the pages q linked with it (p -> q and q -> p). If hyperlinks are used along with the text, the categorization improves.
Relaxation labeling: Category labels of linked pages and of those to be categorized are adjusted until the most probable class configuration is found.

Web graph models can serve as analytical, explanatory, and predictive tools.
Three properties of the web graph model: the rich get richer, correlated out-links, and correlated in-links.
The in-degrees and out-degrees of nodes follow an inverse polynomial (power law) distribution.

Stuff I've seen: A system for personal information retrieval and re-use

This paper is about the SIS system designed at Microsoft Research.
As per studies, while searching, about 60-80% of the time people revisit previously seen web pages. SIS tries to take this fact into account. It creates a unified index of all the information a person has seen on his/her computer(s), in any form: local disk files, emails, web pages, calendar entries, etc. It also maintains contextual information about the content, in the form of thumbnails, meta-information about the document, and previews; this is missing from web search results. SIS provides a unified index across multiple applications that differ in data organization strategy and indexing technique. As it creates a local index from the data, it is quicker than web search systems, and it has a query refinement facility too. The rich contextual cues and previews gathered from the user's activities are useful.

The system comprises 5 components: Gatherer, Filter, Tokenizer, Indexer, and Retriever. The Gatherer is responsible for gathering data from various sources. The Filter emits a character stream from the decoded data. The Tokenizer performs tokenization and carries out linguistic processing. The Indexer generates an inverted index used for searching. The Retriever provides the language the user employs to search the data via the populated index. An SIS client must run on every user's machine. The user interface came in 2 forms: Top view and Side view. The former is a list view with filters for refining attributes in each column and is more flexible. The latter differs in the positioning of the filters: they are placed at the left and revealed serially, which was easier to understand and gave less cluttered results. The system was evaluated using questionnaires and log analysis. Users were asked about their search experiences before and after using SIS. Log analysis gave an insight into the query patterns, user interface interactions, and the quality of the results produced. They also monitored the interaction patterns of the user interface. Very few user queries involved boolean operators or phrases. The average query length was 1.59 words.
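As a toy sketch of how the five components might fit together (all interfaces here are invented for illustration; the paper does not give this code):

    # Toy sketch of the SIS pipeline: Gatherer -> Filter -> Tokenizer ->
    # Indexer, with the Retriever querying the inverted index. All
    # interfaces here are invented for illustration.

    from collections import defaultdict

    def gather(sources):
        # Gatherer: pull raw items (files, emails, pages) from each source.
        for source in sources:
            yield from source

    def filter_text(raw_item):
        # Filter: decode the item into a plain character stream.
        return raw_item.lower()

    def tokenize(text):
        # Tokenizer: split the stream into terms (linguistic processing elided).
        return text.split()

    def build_index(sources):
        # Indexer: build an inverted index from term to document IDs.
        index = defaultdict(set)
        for doc_id, raw in enumerate(gather(sources)):
            for term in tokenize(filter_text(raw)):
                index[term].add(doc_id)
        return index

    def retrieve(index, query):
        # Retriever: intersect the posting sets of the query terms.
        sets = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*sets) if sets else set()

    index = build_index([["Trip report draft", "Budget for March"]])
    print(retrieve(index, "march budget"))  # {1}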

This seems to be a normal search engine, with the main add-on being that it gathers data from several sources and parses it. As it is customized for a particular user, it made sense to have the index stored locally on the machine, which gave good response times. Web search engines these days have become much smarter and are user-customized. One direct disadvantage of the system that I see is that it cannot produce results for content the user has never visited. The system will also increase the load on the user's machine, as it runs continuously. And over time, the gathered data will keep accumulating, consuming a significant amount of the local machine's storage.

As we may think


This was an article written by Vannevar Bush in 1945. It starts with a discussion of the research advancements of the time when the article was written. With an explosion in the number of researchers and their findings, the author states that any researcher has to devote a considerable amount of time to reading all the findings of his/her field. A couple of researchers with innovative ideas could not put them to practical use (viz. Leibnitz with the calculating machine, Babbage with the arithmetical machine). Interchangeable parts, which once required skilled labor, can now be produced at low cost using machines. The article then moves to the scope for advancement in photography. From the lenses and the size of the camera to the development of the picture from the film, there was ample scope for changing the way the camera worked back then. Microphotography, involving microfilm and reduction by a linear factor, gives good results after re-enlarging. With a reduction factor as great as 10,000, a big encyclopedia could be stored in a very small space using microfilm.
           
The retrieval of data from these storage media matters too. The author mentions the Voder and the Vocoder, which convert mechanical strokes into sound and vice versa. A stenotype coupled with a Vocoder could create a device that types when talked to. Such a device is described, and its benefits to an experimenter are discussed. Arithmetic operations like addition can be performed on these media; with that, complex calculations can be done quickly, even by naive users. With increasing market needs, complex analysis and predictions are also expected to be done on these media, which can open doors to deeper insights into any data, whether economic figures or laboratory readings. This would free people from doing what they are not good at and allow them to focus on things they are really good at. This need for analyzing data is not restricted to high-end researchers; even the common man can employ it. Any conclusion can be derived using logical circuits. Later he talks about the memex and his prediction of how it could revolutionize various occupations. In the end, an argument is made for possible experiments that could be performed using known technologies and those discussed earlier in the article.

This article shows Vannevar Bush's vision of solving real-world problems by looking at existing solutions and technologies. His thinking reached quite far beyond the technology present at the time he wrote the article. The merging of the stenotype and the Vocoder is the classic signature of a person who has been around the field for a long while and knows how to fit things together. It is stunning to read an article dated 1945 that mentions tools for analysis that might be needed not only by people in high-end labs with scientific backgrounds but by common people. In the idea for the memex, he knew that merely using compression and getting the data stored on compact film would not be a complete solution. Being a person with business sense, a real designer, he understood that the ability to use the data and do something with it would be needed. He also evaluated his solution in terms of benefits to doctors, historians, etc., which clearly distinguishes him from a blind researcher.

Sunday, March 24, 2013

Algorithm-based Fault Tolerance

Algorithm-based fault tolerance (ABFT) is an error detection, location, and correction scheme which uses redundant computations within the algorithms to detect and correct errors caused by permanent or transient failures in the hardware, concurrently with normal operation.

The key idea of the ABFT technique is to encode the data at a higher level using checksum schemes and redesign algorithms to operate on the encoded data. ABFT can be used to detect and correct errors with a low overhead. However, ABFT is not a general technique; it does not apply to all problems.

Systematic way of converting a concurrent computation into an ABFT system

In an ABFT system, the input data is encoded and the algorithm is modified to work on the encoded input. The computation steps of the algorithm are then distributed among the computation units. It is important to allocate the computational steps to different processors so that an error in one processor affects as few data elements as possible. The encoded output produced by the algorithm is decoded to check the correctness of the results.

When the modified algorithm is actually executed on a multiprocessor architecture, the overheads are required to be minimal in comparison to TMR (triple modular redundancy). The modified algorithm does take more time to operate on the encoded data than the original algorithm takes on the raw data.


ABFT approach for matrix operations

Let PEi denote a processing element in the system.
DS(i) is the set of data items affected by PEi. The system has checks C1, C2, ..., Cp which check the correctness of the data produced by the PEs.
A check Ck has an associated error set ES(k) indicating the data elements checked by Ck.

A (g,h) check, where g is the cardinality of the error set, is evaluated as:
  • pass (output = 0) if there are no errors in the data items;
  • fail (output = 1) if there are between 1 and h errors;
  • unpredictable if there are more than h errors in the system.

If A is an n*n matrix, we define the row checksum matrix (Ar) and the column checksum matrix (Ac) as follows.
Ar is an n*(n+1) matrix: Ar = [A  Ae]
where e is an n*1 vector of ones, e = [1 1 ... 1]T.

Ac is an (n+1)*n matrix:

    Ac = [  A  ]
         [ eTA ]

where eTA is the column summation (row) vector. The full checksum matrix Af is the (n+1)*(n+1) matrix:

    Af = [  A     Ae  ]
         [ eTA   eTAe ]


The exact location of a single error can be determined by intersecting the inconsistent row and column. Using either the row checksum or the column checksum matrix makes the system 1-error detectable; a full checksum matrix additionally lets a single error be located (and hence corrected), as in the sketch below.
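A small sketch of locating a single error in a full checksum matrix, following the standard scheme (my own illustration, with an arbitrary floating-point tolerance): recompute the row and column sums and intersect the inconsistent row and column.

    # Sketch of single-error location in a full checksum matrix:
    # recompute the row and column sums and intersect the inconsistent
    # row and column. Illustration only; the 1e-9 tolerance is an
    # arbitrary choice.

    def locate_error(Af):
        # Af is (n+1) x (n+1): the last column holds row sums and the
        # last row holds column sums.
        n = len(Af) - 1
        bad_rows = [r for r in range(n)
                    if abs(sum(Af[r][:n]) - Af[r][n]) > 1e-9]
        bad_cols = [c for c in range(n)
                    if abs(sum(Af[r][c] for r in range(n)) - Af[n][c]) > 1e-9]
        if len(bad_rows) == 1 and len(bad_cols) == 1:
            return bad_rows[0], bad_cols[0]  # single data error, located
        return None  # no error, or more errors than one check can locate

    Af = [[1.0, 2.0, 3.0],
          [3.0, 4.0, 7.0],
          [4.0, 6.0, 10.0]]   # full checksum matrix of A = [[1,2],[3,4]]
    Af[0][1] = 9.0            # inject a single error
    print(locate_error(Af))   # (0, 1)

Once located, the bad element can be repaired from its row checksum (the row sum minus the other elements of the row), which is what makes the full checksum scheme single-error correcting.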

Concurrency Control and Performance

This paper was authored by Rakesh Agrawal, Mike Carey, and Miron Livny.
Previous work had conflicting results: blocking beats restarts, restarts beat blocking, etc.

Goal of this paper
  • Do a good job modeling the problem and its variants
  • Capture the causes of the previous conflicting results
  • Make recommendations based on the variables of the problem

Methodology
  Simulation study. Compare the 3 approaches (a sketch of each policy's reaction to a conflict appears below):
  a. Blocking (i.e., 2PL): block when a lock request is denied
  b. Immediate Restart: when denied a lock, abort and restart after a delay
  c. Optimistic: the restart decision is taken at commit points
  Pay attention to the model of the system:
  1. database system model: hardware and software model (CPUs, disks, size & granularity of DB, load control mechanism, CC algorithm)
  2. user model: arrival of user tasks, nature of tasks (e.g. batch vs. interactive)
  3. transaction model: logical reference string (i.e. CC schedule), physical reference string (i.e. disk block requests, CPU processing bursts)
  4. probabilistic modeling of each; they argue this is key to a performance study of a DBMS
  The simulator combines a logical queuing model and a physical queuing model.
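To make the three policies concrete, here is a tiny sketch of how each reacts to a data conflict (my own illustration, far simpler than the paper's simulation model):

    # Tiny sketch of the three concurrency control policies' reactions
    # to a conflict. Illustration only, not the paper's simulation model.

    def on_conflict(policy, avg_response_time=1.0):
        if policy == "blocking":
            # 2PL: wait in the lock queue until the holder releases.
            return "blocked until lock granted"
        if policy == "immediate_restart":
            # Abort now, retry after an adaptive delay (the paper waits
            # roughly one average response time before restarting).
            return f"aborted; retry after {avg_response_time:.1f}s"
        if policy == "optimistic":
            # No blocking during execution; conflicts are detected by
            # validation at commit, and the loser restarts then.
            return "keep running; validate (and maybe restart) at commit"

    for p in ("blocking", "immediate_restart", "optimistic"):
        print(p, "->", on_conflict(p))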

Measurements
a) measure throughput, mostly
b) pay attention to the variance of response time, too
c) pick a DB size so that there are noticeable conflicts (else the approaches perform comparably)

Experiment 1: Infinite Resources
  • as many disks and CPUs as you want
  • blocking thrashes due to transactions blocking numerous times
  • restart plateaus: adaptive wait period (avg response time) before restart
  • optimistic scales logarithmically
  • standard deviation of response time under locking much lower
Experiment 2: Limited Resources (1 CPU, 2 disks)
  • Everybody thrashes
  • blocking throughput peaks at an mpl (multiprogramming level) of 25
  • optimistic peaks at 10
  • restart peaks at 10, plateaus at 50 – as good or better than optimistic
  • at super-high mpl (200), restart beats both blocking and optimistic
  • load control is the answer here – adding it to blocking & optimistic makes them handle higher mpls
Experiment 3: Multiple Resources (5, 10, 25, 50 CPUs, 2 disks each)
  • optimistic starts to win at 25 CPUs
  • when useful disk utilization is only about 30%, system begins to behave like infinite resources
  • even better at 50
Experiment 4: Interactive Workloads
  • Add user think time.
  • makes the system appear to have more resources
  • so optimistic wins with think times 5 & 10 secs.
  • Blocking still wins for 1 second think time.

Questioning the assumptions:
Fake restarts – biases for optimistic
  • result in less conflict
  • Cost of conflict in optimistic is higher
  • issue of k > 2 transactions contending for one item
  • will have to punish k-1 of them with real restart
Write-lock acquisition
  • recall our discussion of lock upgrades and deadlock
  • blind write biases for restart (optimistic not an issue here), particularly with infinite resources
  • blocking holds write locks for a long time; waste of deadlock restart not an issue here.
  • with finite resources, blind write restarts transactions earlier (making restart look better)
Conclusions
  • blocking beats restarting, unless resource utilization is low
  • low utilization is possible in situations of high think time
  • mpl control is important; admission control is the typical scheme, though Restart's adaptive load control is too clumsy
  • false assumptions made blocking look relatively worse