Quickly Generating Billion-Record Synthetic Databases

  • Jim Gray,
  • Prakash Sundaresan,
  • Susanne Englert,
  • Ken Baclawski,
  • Peter J. Weinberger

Published by Association for Computing Machinery, Inc.


Evaluating database system performance often requires generating synthetic databases: ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular, it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrently with base-table generation. (4) Modifications of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors and a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems.
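To make technique (2) concrete, here is a minimal sketch, not the paper's code, of how a multiplicative congruential generator yields a dense unique uniform sequence: with P prime and G a primitive root of P, the recurrence x' = (G * x) mod P visits every integer in [1, P-1] exactly once, so choosing P just above the table size N and discarding the few values above N produces each key in 1..N exactly once, in a scrambled order. The constants N = 10, P = 11, and G = 2 below are toy assumptions; for a billion-row table one would instead pick a prime just above 10^9 and one of its primitive roots.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t N = 10;  /* desired key range 1..N (toy value) */
    const uint64_t P = 11;  /* prime slightly larger than N */
    const uint64_t G = 2;   /* a primitive root of P */
    uint64_t x = G;         /* seed: any value in [1, P-1] */
    uint64_t produced = 0;

    /* The cycle G, G^2 mod P, ..., G^(P-1) mod P covers all of
     * [1, P-1] exactly once; values above N are simply skipped. */
    while (produced < N) {
        if (x <= N) {
            printf("%llu\n", (unsigned long long)x);
            produced++;
        }
        x = (x * G) % P;    /* next element of the cycle */
    }
    return 0;
}
```

This prints 2, 4, 8, 5, 10, 9, 7, 3, 6, 1: a permutation of 1..10 generated in constant space, which is why the technique scales to billion-record tables and, partitioned by range, parallelizes across processors.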