This is a short summary of this article by Russ Cox.
It describes how Code Search worked when the Russ was an working on it as an intern at Google. This implementation works well enough to index and search large code bases on a single computer.
Indexed Word Search: The key data structure used here is called a posting list or inverted index, which lists, for every possible search term, the documents that contain that term. For finding documents matching all terms in a query, intersection of all documents for both the words was employed. Other way is to treat phrases as AND queries to identify a set of candidate documents and then filter out non-matching documents after loading the document bodies from disk. Storing the position information in the index entries makes the index bigger but avoids loading a document from disk unless it is guaranteed to be a match.
Indexed Regular Expression Search: Google code search did not use regex expressions to answer the queries. Instead of merely using 1 gram inverted index, they had build an index of n-grams, substrings of length n. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is. The query phrase is converted into trigrams with AND operations. This query is ran against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents. A regular expression using the OR operator lead to a query with OR terms, and parenthesized sub-expressions complicate the process of turning strings into AND expressions. Regex rules gave a huge results set. At each step, they applied some simplifications to keep the computed information manageable.
Computing trigrams: The trigrams function applied to a single string can be any, if the string has <3 characters, else AND of all the trigrams in the string. The trigrams function applied to a set of strings is the OR of the trigrams function applied to each of the strings.
Transformations : were defined that apply at any step during the analysis, to any regular expression e. These preserve the validity of the results. The best way to apply these transformation is to use the information-saving transformations immediately before applying an information-discarding transformation. A little more information can be squeezed from the concatenation analysis:
if e1e2 is not exact, match(e1e2) can also require trigrams(suffix(e1) × prefix(e2)).
In addition to these transformations, they also applied basic Boolean simplifications to the match query as it is constructed eg. “abc OR (abc AND def)” is more expensive but no more precise than “abc”.
Implementation: was done in "Go". The cindex command if passed directories, it will update the current index with these directories, else it will update the existing index.
The index file contains a list of paths, a list of indexed files, and then the same posting lists we saw at the beginning of this article, one for each trigram. Index is around 20% of the size of the files being indexed. To minimize I/O and take advantage of operating system caching, csearch uses mmap to map the index into memory and in doing so read directly from the operating system's file cache.
It describes how Code Search worked when the Russ was an working on it as an intern at Google. This implementation works well enough to index and search large code bases on a single computer.
Indexed Word Search: The key data structure used here is called a posting list or inverted index, which lists, for every possible search term, the documents that contain that term. For finding documents matching all terms in a query, intersection of all documents for both the words was employed. Other way is to treat phrases as AND queries to identify a set of candidate documents and then filter out non-matching documents after loading the document bodies from disk. Storing the position information in the index entries makes the index bigger but avoids loading a document from disk unless it is guaranteed to be a match.
Indexed Regular Expression Search: Google code search did not use regex expressions to answer the queries. Instead of merely using 1 gram inverted index, they had build an index of n-grams, substrings of length n. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is. The query phrase is converted into trigrams with AND operations. This query is ran against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents. A regular expression using the OR operator lead to a query with OR terms, and parenthesized sub-expressions complicate the process of turning strings into AND expressions. Regex rules gave a huge results set. At each step, they applied some simplifications to keep the computed information manageable.
Computing trigrams: The trigrams function applied to a single string can be any, if the string has <3 characters, else AND of all the trigrams in the string. The trigrams function applied to a set of strings is the OR of the trigrams function applied to each of the strings.
Transformations : were defined that apply at any step during the analysis, to any regular expression e. These preserve the validity of the results. The best way to apply these transformation is to use the information-saving transformations immediately before applying an information-discarding transformation. A little more information can be squeezed from the concatenation analysis:
if e1e2 is not exact, match(e1e2) can also require trigrams(suffix(e1) × prefix(e2)).
In addition to these transformations, they also applied basic Boolean simplifications to the match query as it is constructed eg. “abc OR (abc AND def)” is more expensive but no more precise than “abc”.
Implementation: was done in "Go". The cindex command if passed directories, it will update the current index with these directories, else it will update the existing index.
The index file contains a list of paths, a list of indexed files, and then the same posting lists we saw at the beginning of this article, one for each trigram. Index is around 20% of the size of the files being indexed. To minimize I/O and take advantage of operating system caching, csearch uses mmap to map the index into memory and in doing so read directly from the operating system's file cache.
No comments:
Post a Comment