DataFiller - generate random data from database schema

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
ARGUMENTS
TUTORIAL

DIRECTIVES IN COMMENTS
A SIMPLE LIBRARY EXAMPLE
IMPROVING GENERATED VALUES
DISCUSSION
ADVANCED FEATURES
TUTORIAL CONCLUSION

DIRECTIVES AND DATA GENERATORS

DATA GENERATORS
GLOBAL DIRECTIVES
TABLE DIRECTIVES
ATTRIBUTE DIRECTIVES

EXAMPLES

PGBENCH SCHEMA
COMICS SCHEMA

WARNINGS, BUGS, FEATURES AND GRUMBLING
LICENSE
DOWNLOAD
VERSIONS
KEYWORDS
SEE ALSO

NAME

datafiller - generate random data from database schema extended with directives

SYNOPSIS

datafiller [--help --man ...] [schema.sql ...] > data.sql

DESCRIPTION

This script generates random data from a database schema enriched with simple directives in SQL comments to drive 29 data generators which cover typical data types and their combination. Reasonable defaults are provided, especially based on key and type constraints, so that few directives should be necessary. The minimum setup is to specify the relative size of tables with directive mult so that data generation can be scaled.

See the TUTORIAL section below. Also, run with --validate=comics or --validate=library and look at the output for didactic examples.

OPTIONS

--debug or -D

Set debug mode. Repeat for more.

Default is no debug.

--drop

Drop tables before recreating them. This implies option --filter, otherwise there would be no table to fill.

Default is not to.

--encoding=enc or -e enc

Set this encoding for input and output files.

Default is no explicit encoding.

--filter or -f, reverse with --no-filter

Work as a filter, i.e. send the schema input script to stdout and then the generated data. This is convenient to pipe the result of the script directly for execution to the database command.

Default is to only ouput generated data.

--help or -h

Show basic help.

--man or -m

Show full man page based on POD. Yes, the perl thing:-)

--null RATE or -n RATE

Probability to generate a null value for nullable attributes.

Default is 0.01, which can be overriden by the null directive at the schema level, or per-attributes provided null rate.

--offset OFFSET or -O OFFSET

Set default offset for integer generators on primary keys. This is useful to extend the already existing content of a database for larger tests.

Default is 1, which can be overriden by the offset directive at the schema level, or per-attribute provided offset.

--pod COMMAND

Override pod conversion command used by option --man.

Default is 'pod2usage -verbose 3'.

--quiet or -q

Generate less verbose SQL output.

Default is to generate one echo when starting to fill a table.

--seed SEED or -S SEED

Seed overall random generation with provided string.

Default uses OS supplied randomness or current time.

--size SIZE

Set overall scaling. The size is combined with the mult directive value on a table to compute the actual number of tuples to generate in each table.

Default is 100, which can be overriden with the size directive at the schema level.

--target (postgresql|mysql) or -t ...

Target database engine. MySQL support is really experimental.

Default is to target PostgreSQL.

--test='directives...'

Run tests for any generator with some directives. If the directives start with !, show an histogram. If the directives start with -, show results on one-line. This is convenient for unit testing, or to check what would be the output of a set of directives.

Examples: --size=100 --test='int size=10 mangle' would show 100 integers between 1 and 10 drawn uniformly. --test='!bool rate=0.3' may show False: 69.32%, True: 30.68%, stating the rate at which True and False were seen during the test.

Directive type can be used within --test=... to provide the target SQL type when testing.

--transaction or -T

Use a global transaction.

Default is not to.

--tries=NUM

How hard to try to satisfy a compound unique constraint before giving up on a given tuple.

Default is 10.

--truncate

Delete table contents before filling.

Default is not to.

--type=CUSTOM

Add a custom type. The default generator for this type will rely on a macro of the same name, if it exists. This option can be repeated to add more custom types. See also the type directive at the schema level.

--validate=(unit|internal|comics|library|pgbench)

Output validation test cases. All but the unit tests can be processed with psql.

  sh> datafiller --validate=internal | psql

This option sets --filter automatically, although it can be forced back with --no-filter.

Default is to process argument files or standard input.

--version, -v or -V

Show script version.

ARGUMENTS

Files containing SQL schema definitions. Standard input is processed if empty.

TUTORIAL

This tutorial introduces how to use DataFiller to fill a PostgreSQL database for testing functionalities and performances.

DIRECTIVES IN COMMENTS

The starting point of the script to generate test data is the SQL schema of the database taken from a file. It includes important information that will be used to generate data: attribute types, uniqueness, not-null-ness, foreign keys... The idea is to augment the schema with directives in comments so as to provide additional hints about data generation.

A datafiller directive is a special SQL comment recognized by the script, with a df marker at the beginning. A directive must appear after the object about which it is applied, either directly after the object declaration, in which case the object is implicit, or much later, in which case the object must be explicitely referenced:

  -- this directive sets the default overall size
    -- df: size=10
  -- this directive defines a macro named "fn"
    -- df fn: word=/path/to/file-containing-words
  -- this directive applies to table "Foo"
  CREATE TABLE Foo( -- df: mult=10.0
    -- this directive applies to attribute "fid"
    fid SERIAL -- df: offset=1000
    -- use defined macro, choose "stuff" from the list of words
  , stuff TEXT NOT NULL -- df: use=fn
  );
  -- ... much later
  -- this directive applies to attribute "fid" in table "Foo"
  -- df T=Foo A=fid: null=0.8

A SIMPLE LIBRARY EXAMPLE

Let us start with a simple example involving a library where readers borrow books. Our schema is defined in file library.sql as follows:

  
  CREATE TABLE Book( 
    bid SERIAL PRIMARY KEY,
    title TEXT NOT NULL, 
    isbn ISBN13 NOT NULL 
  );

  CREATE TABLE Reader( 
    rid SERIAL PRIMARY KEY,
    firstname TEXT NOT NULL, 
    lastname TEXT NOT NULL, 
    born DATE NOT NULL, 
    gender BOOLEAN NOT NULL, 
    phone TEXT 
  );

  CREATE TABLE Borrow( 
    borrowed TIMESTAMP NOT NULL, 
    rid INTEGER NOT NULL REFERENCES Reader,
    bid INTEGER NOT NULL REFERENCES Book, 
    PRIMARY KEY(bid) -- a book is borrowed once at a time!
  );

The first and only information you really need to provide is the relative or absolute size of relations. For scaling, the best way is to specify a relative size multiplier with the mult directive on each table, which will be multiplied by the size option to compute the actual size of data to generate in each table. Let us say we want 100 books in stock per reader, with 1.5 borrowed books per reader on average:

  CREATE TABLE Book( -- df: mult=100.0
  ...
  CREATE TABLE Borrow( --df: mult=1.5

The default multiplier is 1.0, it does not need to be set on Reader. Then you can generate a data set with:

  sh> datafiller --size=1000 library.sql > library_test_data.sql

Note that explicit constraints are enforced on the generated data, so that foreign keys in *Borrow* reference existing *Books* and *Readers*. However the script cannot guess implicit constraints, thus if an attribute is not declared NOT NULL, then some NULL values will be generated. If an attribute is not unique, then the generated values will probably not be unique.

IMPROVING GENERATED VALUES

In the above generated data, some attributes may not reflect the reality one would expect from a library. Changing the default with per-attribute directives will help improve this first result.

First, book titles are all quite short, looking like title_number, with some collisions. Indeed the default is to generate strings with a common prefix based on the attribute name and a number drawn uniformly from the expected number of tuples. We can change to texts composed of between 1 and 7 English words taken from a dictionary:

  title TEXT NOT NULL
  -- df English: word=/etc/dictionaries-common/words
  -- df: text=English length=4 lenvar=3

Also, we may have undesirable collisions on the ISBN attribute, because the default size of the set is the number of tuples in the table. We can extend the size at the attribute level so as to avoid this issue:

  isbn ISBN13 NOT NULL -- df: size=1000000000

If we now look at readers, the result can also be improved. First, we can decide to keep the prefix and number form, but make the statistics more in line with what you can find. Let us draw from 1000 firstnames, most frequent 3%, and 10000 lastnames, most frequent 1%:

  firstname TEXT NOT NULL,
     -- df: sub=power prefix=fn size=1000 rate=0.03
  lastname TEXT NOT NULL,
     -- df: sub=power prefix=ln size=10000 rate=0.01

The default generated dates are a few days around now, which does not make much sense for our readers' birth dates. Let us set a range of birth dates.

  birth DATE NOT NULL, -- df: start=1923-01-01 end=2010-01-01

Most readers from our library are female: we can adjust the rate so that 25% of readers are male, instead of the default 50%.

  gender BOOLEAN NOT NULL, -- df: rate=0.25

Phone numbers also have a prefix_number structure, which does not really look like a phone number. Let us draw a string of 10 digits, and adjust the nullable rate so that 1% of phone numbers are not known. We also set the size manually to avoid too many collisions, but we could have chosen to keep them as is, as some readers do share phone numbers.

  phone TEXT
    -- these directives could be on a single line
    -- df: chars='0-9' length=10 lenvar=0
    -- df: null=0.01 size=1000000

The last table is about currently borrowed books. The timestamps are around now, we are going to spread them on a period of 50 days, that is 24 * 60 * 50 = 72000 minutes (precision is 60 seconds).

  borrowed TIMESTAMP NOT NULL -- df: size=72000 prec=60

Because of the primary key constraint, the borrowed books are the first ones. Let us mangle the result so that referenced book numbers are scattered.

  bid INTEGER REFERENCES Book -- df: mangle

Now we can generate improved data for our one thousand readers library, and fill it directly to our library database:

  sh> datafiller --size=1000 --filter library.sql | psql library

Our test database is ready. If we want more users and books, we only need to adjust the size option. Let us query our test data:

  -- show firstname distribution
  SELECT firtname, COUNT(*) AS cnt FROM Reader
  GROUP BY firstname ORDER BY cnt DESC LIMIT 3;
    -- fn_1_... | 33
    -- fn_2_... | 15
    -- fn_3_... | 12
  -- compute gender rate
  SELECT AVG(gender::INT) FROM Reader;
    -- 0.246

DISCUSSION

We could go on improving the generated data so that it is more realistic. For instance, we could skew the borrowed timestamp so that there are less old borrowings, or skew the book number so that old books (lower numbers) are less often borrowed, or choose firtnames and lastnames from actual lists.

When to stop improving is not obvious: On the one hand, real data may show particular distributions which impact the application behavior and performance, thus it may be important to reflect that in the test data. On the other hand, if nothing is really done about readers, then maybe the only relevant information is the average length of firstnames and lastnames because of the storage implications, and that's it.

ADVANCED FEATURES

Special generators allow to combine or synchronize other generators.

Let us consider an new email attribute for our Readers, for which we want to generate gmail or yahoo addresses. We can use a pattern generator for this purpose:

  email TEXT NOT NULL CHECK(email LIKE '%@%')
    -- df: pattern='[a-z]{3,8}\.[a-z]{3,8}@(gmail|yahoo)\.com'

The pattern sets 3 to 8 lower case characters, followed by a dot, followed by 3 to 8 characters again, followed by either gmail or yahoo domains. With this approach, everything is chosen uniformly: letters in first and last names appear 1/26, their six possible sizes between 3 to 8 are equiprobable, and each domain is drawn on average by half generated email addresses.

In order to get more control about these distributions, we could rely on the chars or alt generators which can be skewed or weighted, as illustrated in the next example.

Let us now consider an ip address for the library network. We want that 20% comes from librarian subnet '10.1.0.0/16' and 80% from reader subnet '10.2.0.0/16'. This can be achieved with directive alt which tells to make a weighted choice between macro-defined generators:

  -- define two macros
  -- df librarian: inet='10.1.0.0/16'
  -- df reader: inet='10.2.0.0/16'
  ip INET NOT NULL
  -- df: alt=reader:8,librarian:2
  -- This would do as well: --df: alt=reader:4,librarian

Let us finally consider a log table for data coming from a proxy, which stores correlated ethernet and ip addresses, that is ethernet address always get the same ip addess, and we have about 1000 distinct hosts:

     -- df distinct: int size=1000
  ,  mac MACADDR NOT NULL -- df share=distinct
  ,  ip INET NOT NULL -- df share=distinct inet='10.0.0.0/8'

Because of the share directive, the same mac and ip will be generated together in a pool of 1000 pairs of addresses.

TUTORIAL CONCLUSION

There are many more directives to drive data generation, from simple type oriented generators to advanced combinators. See the documentation and examples below.

For very application-specific constraints that would not fit any generator, it is also possible to apply updates to modify the generated data afterwards.

DIRECTIVES AND DATA GENERATORS

Directives drive the data sizes and the underlying data generators. They must appear in SQL comments after the object on which they apply, although possibly on the same line, introduced by '-- df: '.

  CREATE TABLE Stuff(         -- df: mult=2.0
    id SERIAL PRIMARY KEY,    -- df: step=19
    data TEXT UNIQUE NOT NULL -- df: prefix=st length=30 lenvar=3
  );

In the above example, with option --size=1000, 2000 tuples will be generated (2.0*1000) with id 1+(i*19)%2000 and unique text data of length about 30+-3 prefixed with st. The sequence for id will be restarted at 2001.

The default size is the number of tuples of the containing table. This implies many collisions for a uniform generator.

DATA GENERATORS

There are 29 data generators which are selected by the attribute type or possibly directives. Most generators are subject to the null directive which drives the probability of a NULL value. There is also a special shared generator.

Generators are triggered by using directives of their names. If none is specified, a default is chosen based on the attribute type.

alt generator: This generator aggregates other generators by choosing one. The list of sub-generators must be specified as a list of comma-separated weighted datafiller macros provided by directive alt, see below. These generator definition macros must contain an explicit directive to select the underlying generator.
array generator: This generator generates an SQL array from another generator. The sub-generator is specified as a macro name with the array directive. It takes into account the length, lenvar, lenmin and lenmax directives.
bit generator: This generator handles BIT and VARBIT data to store sequences of bits. It takes into account the length, lenvar, lenmin and lenmax directives.
blob generator: This is for blob types, such as PostgreSQL's BYTEA. It uses an int generator internally to drive its extent. It takes into account the length, lenvar, lenmin and lenmax directives. This generator does not support UNIQUE, but uniqueness is very likely if the blob length is significant and the size is large.
bool generator: This generator is used for the boolean type. It is subject to the rate directive.
cat generator: This generator aggregates other generators by concatenating their textual output. The list of sub-generators must be specified as a list of comma-separated datafiller macros provided by a cat directive, see below.
chars generator: This alternate generator for text types generates random string of characters. It is triggered by the chars directive. In addition to the underlying int generator which allows to select values, another int generator is used to build words from the provided list of characters, The cgen directives is the name of a macro which specifies the int generator parameters for the random character selection. It also takes into account the length, lenvar, lenmin and lenmax directives. This generator does not support UNIQUE.
const generator: This generator provides a constant text value. It is driven by the const directive.
count generator: This generator provides a simple counter. It takes into account directives start, step and format.
date generator: This generator is used for the date type. It uses an int generator internally to drive its extent. Its internal working is subject to directives start, end and prec.
ean generator: This is for International Article Number (EAN!) generation. It uses an int generator internally, so the number of distinct numbers can be adjusted with directive size. It takes into account the length and prefix directives. Default is to generate EAN-13 numbers. This generator does not support UNIQUE, but uniqueness is very likely.
file generator: Inline file contents. The mandatory list of files is specified with directive files. See also directive mode.
float generator: This generator is used for floating point types. The directive allows to specify the sub-generator to use, see the float directive below. Its configuration also relies on directives alpha and beta. It does not support UNIQUE, but uniqueness is very likely.
inet generator: This is for internet ip types, such as PostgreSQL's INET and CIDR. It uses an int generator internally to drive its extent. It takes into account the network directive to specify the target network. Handling IPv6 networks requires module netaddr.
int generator: This generator is used directly for integer types, and indirectly by other generators such as text, word and date. Its internal working is subject to directives: sub, size (or mult), offset, shift, step, xor and mangle.
interval generator: This generator is used for the time interval type. It uses the int generator internally to drive its extent. See also the unit directive.
isnull generator: Generates the special NULL value.
luhn generator: This is for numbers which use a Luhn's algorithm checksum, such as bank card numbers. It uses an int generator internally, so the number of distinct numbers can be adjusted with directive size. It takes into account the length and prefix directives. Default length is 16, default prefix is empty. This generator does not support UNIQUE, but uniqueness is very likely.
mac generator: This is for MAC addresses, such as PostgreSQL's MACADDR. It uses an int generator internally, so the number of generated addresses can be adjusted with directive size.
pattern generator: This alternative generator for text types generates text based on a regular expression provided with the pattern directive. It uses internally the alt, cat, repeat, const and chars generators.
reduce generator: This generator applies the reduction operation specified by directive op to generators specified with reduce as a comma-separated list of macros.
repeat generator: This generator aggregates the repetition of another generator. The repeated generator is specified in a macro with a repeat directive, and the number of repetitions relies on the extent directive. It uses an int generator internally to drive the number of repetitions, so it can be skewed by specifying a subtype with the sub directive.
string generator: This generator is used by default for text types. This is a good generator for filling stuff without much ado. It takes into account prefix, and the length can be specified with length, lenvar lenmin and lenmax directives. The generated text is of length length +- lenvar, or between lenmin and lenmax. For CHAR(n) and VARCHAR(n) text types, automatic defaults are set.
text generator: This aggregate generator generates aggregates of words drawn from any other generator specified by a macro in directive text. It takes into account directives separator for separator (default ), prefix and suffix (default empty). It also takes into account the length, lenvar, lenmin and lenmax directives which handle the number of words to generate. This generator does not support UNIQUE, but uniqueness is very likely for a text with a significant length drawn from a dictionary.
timestamp generator: This generator is used for the timestamp type. It is similar to the date generator but at a finer granularity. The tz directive allows to specify the target timezone.
tuple generator: This aggregate generator generates composite types. The list of sub-generators must be specified with tuple as a list of comma-seperated macros.
uuid generator: This generator is used for the UUID type. It is really a pattern generator with a predefined pattern.
value generator: This generator uses per-tuple values from another generator specified as a macro name in the value directive. If the same value is specified more than once in a tuple, the exact same value is generated.
word generator: This alternate generator for text types is triggered by the word directive. It uses int generator to select words from a list or a file. This generator handles UNIQUE if enough words are provided.

GLOBAL DIRECTIVES

A directive macro can be defined and then used later by inserting its name between the introductory df and the :. The specified directives are stored in the macro and can be reused later. For instance, macros words, mangle cfr and cen can be defined as:

  --df words: word=/etc/dictionaries-common/words sub=power alpha=1.7
  --df mix: offset=10000 step=17 shift=3
  --df cfr: sub=scale alpha=6.7
  --df cen: sub=scale alpha=5.9

Then they can be used in any datafiller directive with use=...:

  --df: use=words use=mix
  --df: use=mix

Or possibly for chars generators with cgen=...:

  --df: cgen=cfr chars='esaitnru...'

There are four predefined macros: cfr and cen define skewed integer generators with the above parameters. french, english define chars generators which tries to mimic the character frequency of these languages.

The size, offset, null, seed and directives can be defined at the schema level to override from the SQL script the default size multiplier, primary key offset, null rate or seed. However, they are ignored if the corresponding options are set.

The type directive at the schema level allows to add custom types, similarly the --type option above.

TABLE DIRECTIVES

mult=float: Size multiplier for scaling, that is computing the number of tuples to generate. This directive is exclusive from size.
nogen: Do not generate data for this table.
null: Set defaut null rate for this table.
size=int: Use this size, so there is no scaling with the --size option and mult directive. This directive is exclusive from mult.
skip=float: Skip (that is generate but do not insert) some tuples with this probability. Useful to create some holes in data. Tables with a non-zero skip cannot be referenced.

ATTRIBUTE DIRECTIVES

A specific generator can be specified by using its name in the directives, otherwise a default is provided based on the attribute SQL type. Possible generators are: alt array bit blob bool cat chars const count date ean file float inet int interval isnull luhn mac pattern reduce repeat string text timestamp tuple uuid value word. See also the sub directive to select a sub-generator to control skewness.

alt=some:2,weighted,macros:2

List of macros, possibly weighted (default weight is 1) defining the generators to be used by an alt generator. These macros must include an explicit directive to select a generator.

array=macro

Name of the macro defining an array built upon this generator for the array generator. The macro must include an explicit directive to select a generator.

cat=list,of,macros

List of macros defining the generators to be used by a cat generator. These macros must include an explicit directive to select a generator.

chars='0123456789A-Z\n' cgen=macro

The chars directive triggers the chars generator described above. Directive chars provides a list of characters which are used to build words, possibly including character intervals with '-'. A leading '-' in the list means the dash character as is. Characters can be escaped in octal (e.g. \041 for !) or in hexadecimal (e.g. \x3D for =). Unicode escapes are also supported: eg \u20ac for the Euro symbol and \U0001D11E for the G-clef. Also special escaped characters are: null \0 (ASCII 0), bell \a (7), backspace \b (8), formfeed \f (12), newline \n (10), carriage return \r (13), tab \t (9) and vertical tab \v (11). The macro name specified in directive cgen is used to setup the character selection random generator.

For exemple:

  ...
  -- df skewed: sub=power rate=0.3
  , stuff TEXT -- df: chars='a-f' sub=uniform size=23 cgen=skewed

The text is chosen uniformly in a list of 23 words, each word being built from characters 'abcdef' with the skewed generator described in the corresponding macro definition on the line above.

const=str

Specify the constant text to generate for the const generator. The constant text can contains the escaped characters described with the chars directive above.

extent=int or extent=int-int

Specify the extent of the repetition for the repeat generator. Default is 1, that is not to repeat.

files=str

Path-separated patterns for the list for files used by the file generator. For instance to specify image files in the ./img UN*X subdirectory:

  files='./img/*.png:./img/*.jpg:./img/*.gif'

float=str

The random sub-generators for floats are those provided by Python's random:

beta: Beta distribution, alpha and beta must be >0.
exp: Exponential distribution with mean 1.0 / alpha
gamma: Gamma distribution, alpha and beta must be >0.
gauss: Gaussian distribution with mean alpha and stdev beta.
log: Log normal distribution, see normal.
norm: Normal distribution with mean alpha and stdev beta.
pareto: Pareto distribution with shape alpha.
uniform: Uniform distribution between alpha and beta. This is the default distribution.
vonmises: Circular data distribution, with mean angle alpha in radians and concentration beta.
weibull: Weibull distribution with scale alpha and shape beta.

format=str

Format output for the count generator. Default is d. For instance, setting 08X displays the counter as 0-padded 8-digits uppercase hexadecimal.

inet=str

Use to specify in which IPv4 or IPv6 network to generate addresses. For instance, inet=10.2.14.0/24 chooses ip addresses between 10.2.14.1 and 10.2.14.254, that is network and broadcast addresses are not generated. Similarily, inet=fe80::/112 chooses addresses between fe80::1 and fe80::ffff. The default subnet limit is 24 for IPv4 and 64 for IPv6. A leading , adds the network address, a leading . adds the broadcast address, and a leading ; adds both, thus inet=';10.2.14.0/24' chooses ip addresses between 10.2.14.0 and 10.2.14.255.

length=int lenvar=int lenmin=int lenmax=int

Specify length either as length and variation or length bounds, for generated characters of string data, number of words of text data or blob.

mangle

Whether to automatically choose random shift, step and xor for an int generator.

mode=str

Mode for handling files for the file generator. The value is either blob for binaries or text for text file in the current encoding. Default is to use the binary format, as it is safer to do so.

mult=float

Use this multiplier to compute the generator size. This directive is exclusive from size.

nogen

Do not generate data for this attribute, so it will get its default value.

null=float

Probability of generating a null value for this attribute. This applies to all generators.

offset=int shift=int step=int

Various parameters for generated integers. The generated integer is offset+(shift+step*i)%size. step must not be a divider of size, it is ignored and replaced with 1 if so.

Defaults: offset is 1, shift is 0, step is 1.

op=(+|*|min|max|cat)

Reduction operation for reduce generator.

pattern=str

Provide the regular expression for the pattern generator.

They can involve character sequences like calvin, character escape sequences (octal, hexadecimal, unicode, special) as in directive chars above, character classes like [a-z] and [^a-z] (exclusion), character classes shortcuts like . \d \h \H \s and \w which stand for [ -~] [0-9] [0-9a-f] [0-9A-F] [ \f\n\r\t\v] and [0-9a-zA-Z_] respectively, as well as POSIX character classes within [:...:], for instance [:alnum:] for [0-9A-Za-z] or [:upper:] for [A-Z].

Alternations are specified with |, for instance (hello|world). Repetitions can be specified after an object with {3,8}, which stands for repeat between 3 and 8 times. The question mark ? is a shortcut for {0,1}, the star sign * for {0,8} and the plus sign + for {1,8}.

For instance:

  stuff TEXT NOT NULL -- df: pattern='[a-z]{3,5} ?(!!|\.{3}).*'

means 3 to 5 lower case letters, maybe followed by a space, followed by either 2 bangs or 3 dots, and ending with any non special character.

The special [:GEN ...:] syntax allow to embedded a generator within the generated pattern, possibly including specific directives. For instance the following would generate unique email addresses because of the embedded counter:

  email TEXT UNIQUE NOT NULL
    -- df: pattern='\w+\.[:count format=X:]@somewhere\.org'

prefix=str

Prefix for string ean luhn and text generators.

rate=float

For the bool generator, rate of generating True vs False. Must be in [0, 1]. Default is 0.5.

For the int generator, rate of generating value 0 for generators power and scale.

repeat=macro

Macro which defines the generator to repeat for the repeat generator. See also the extent directive.

seed=str

Set default global seed from the schema level. This can be overriden by option --seed. Default is to used the default random generator seed, usually relying on OS supplied randomness or the current time.

separator=str

Word separator for text generator. Default is (space).

share=macro

Specify the name of a macro which defines an int generator used for synchronizing other generators. If several generators share the same macro, their values within a tuple are correlated between tuples.

size=int

Number of underlying values to generate or draw from, depending on the generator. For keys (primary, foreign, unique) , this is necessarily the corresponding number of tuples. This directive is exclusive from mult.

start=date/time , end=date/time, prec=int

For the date and timestamp generators, issue from start up to end at precision prec. Precision is in days for dates and seconds for timestamp. Default is to set end to current date/time and prec to 1 day for dates et 60 seconds for timestamps. If both start and end are specified, the underlying size is adjusted.

For example, to draw from about 100 years of dates ending on January 19, 2038:

  -- df: end=2038-01-19 size=36525

sub=SUGENERATOR

For integer generators, use this underlying sub-type generator.

The integer sub-types also applies to all generators which inherit from the int generator, namely blob date ean file inet interval luhn mac string text timestamp repeat and word.

The sub-generators for integers are:

serial

This is really a counter which generates distinct integers, depending on offset, shift, step and xor.

uniform

Generates uniform random number integers between offset and offset+size-1. This is the default.

serand

Generate integers based on serial up to size, then use uniform. Useful to fill foreign keys.

power with parameter alpha or rate

Use probability to this alpha power. When rate is specified, compute alpha so that value 0 is drawn at the specified rate. Uniform is similar to power with alpha=1.0, or rate=1.0/size The higher alpha, the more skewed towards 0.

Example distribution with --test='!int sub=power rate=0.3 size=10':

  value     0   1   2   3   4   5   6   7   8   9
  percent  30  13  10   9   8   7   6   6   5   5

scale with parameter alpha or rate

Another form of skewing. The probability of increasing values drawn is less steep at the beginning compared to power, thus the probability of values at the end is lower.

Example distribution with --test='!int sub=scale rate=0.3 size=10':

  value     0   1   2   3   4   5   6   7   8   9
  percent  30  19  12   9   7   6   5   4   3   2

suffix=str

Suffix for text generator. Default is empty.

text=macro

Macro which defines the word provided generator for the text generator.

type=str

At the schema level, add a custom type which will be recognized as such by the schema parser. At the attribute level, use the generator for this type.

unit=str

The unit directive specifies the unit of the generated intervals. Possible values include s m h d mon y. Default is s, i.e. seconds.

word=file or word=:list,of,words

The word directive triggers the word generator described above, or is used as a source for words by the text generator. Use provided word list or lines of file to generate data. The default size is the size of the word list.

If the file contents is ordered by word frequency, and the int generator is skewed (see sub), the first words can be made to occur more frequently.

xor=int

The xor directive adds a non-linear xor stage for the int generator.

EXAMPLES

The first example is taken from pgbench. The second example is a didactic schema to illustrate directives. See also the library example in the tutorial above. As these schemas are embedded into this script, they can be invoked directly with the --test option:

  sh> datafiller --test=pgbench -T --size=10 | psql pgbench
  sh> datafiller --test=comics -T --size=10 | psql comics
  sh> datafiller --test=library -T --size=1000 | psql library

PGBENCH SCHEMA

This schema is taken from the TCP-B benchmark. Each Branch has Tellers and Accounts attached to it. The History records operations performed when the benchmark is run.

  -- TPC-B example adapted from pgbench

  -- df regress: int sub=power alpha=1.5
  -- df: size=1

  CREATE TABLE pgbench_branches( -- df: mult=1.0
    bid SERIAL PRIMARY KEY,
    bbalance INTEGER NOT NULL,   -- df: size=100000000 use=regress
    filler CHAR(88) NOT NULL
  );

  CREATE TABLE pgbench_tellers(  -- df: mult=10.0
    tid SERIAL PRIMARY KEY,
    bid INTEGER NOT NULL REFERENCES pgbench_branches,
    tbalance INTEGER NOT NULL,   -- df: size=100000 use=regress
    filler CHAR(84) NOT NULL
  );

  CREATE TABLE pgbench_accounts( -- df: mult=100000.0
    aid BIGSERIAL PRIMARY KEY,
    bid INTEGER NOT NULL REFERENCES pgbench_branches,
    abalance INTEGER NOT NULL,   -- df: offset=-1000 size=100000 use=regress
    filler CHAR(84) NOT NULL
  );

  CREATE TABLE pgbench_history(  -- df: nogen
    tid INTEGER NOT NULL REFERENCES pgbench_tellers,
    bid INTEGER NOT NULL REFERENCES pgbench_branches,
    aid BIGINT NOT NULL REFERENCES pgbench_accounts,
    delta INTEGER NOT NULL,
    mtime TIMESTAMP NOT NULL,
    filler CHAR(22)
    -- UNIQUE (tid, bid, aid, mtime)
  );

The integer *balance figures are generated with a skewed generator defined in macro regress. The negative offset setting on abalance will help generate negative values, and the regress skewed generator will make small values more likely.

If this is put in a tpc-b.sql file, then working test data can be generated with:

  sh> datafiller -f -T --size=10 tpc-b.sql | psql bench

COMICS SCHEMA

This schema models Comics books written in a Language and published by a Publisher. Each book can have several Authors through Written. The Inventory tells on which shelf books are stored. Some of the books may be missing and a few may be available twice or more.

  -- Comics didactic example.

  -- Set default scale to 10 tuples for one unit (1.0).
  -- This can be overwritten with the size option.
  -- df: size=10

  -- This relation is not re-generated.
  -- However the size will be used when generating foreign keys to this table.
  CREATE TABLE Language( --df: nogen size=2
    lid SERIAL PRIMARY KEY,
    lang TEXT UNIQUE NOT NULL
  );

  INSERT INTO Language(lid, lang) VALUES
    (1, 'French'),
    (2, 'English')
  ;

  -- Define a char generator for names:
  -- df chfr: int sub=scale rate=0.17
  -- df name: chars='esaitnrulodcpmvqfbghjxyzwk' cgen=chfr length=8 lenvar=3

  CREATE TABLE Author( --df: mult=1.0
    aid SERIAL PRIMARY KEY,
    -- There are 200 firstnames do draw from, most frequent 2%.
    -- In the 19th century John & Mary were given to >5% of the US population,
    -- and rates reached 20% in England in 1800. Currently firstnames are much
    -- more diverse, most frequent at about 1%.
    firstname TEXT NOT NULL,
      -- df: use=name size=200 sub=scale rate=0.02
    -- There are 10000 lastnames to draw from, most frequent 8/1000 (eg Smith)
    lastname TEXT NOT NULL,
      -- df: use=name size=10000 sub=scale rate=0.008
    -- Choose dates in the 20th century
    birth DATE NOT NULL,     -- df: start=1901-01-01 end=2000-12-31
    -- We assume that no two authors of the same name are born on the same day
    UNIQUE(firstname, lastname, birth)
  );

  -- On average, about 10 authors per publisher (1.0/0.1)
  CREATE TABLE Publisher( -- df: mult=0.1
    pid SERIAL PRIMARY KEY,
    pname TEXT UNIQUE NOT NULL -- df: prefix=pub length=12 lenvar=4
  );

  -- On average, about 15.1 books per author (15.1/1.0)
  CREATE TABLE Comics( -- df: mult=15.1
    cid SERIAL PRIMARY KEY,
    title TEXT NOT NULL,     -- df: use=name length=20 lenvar=12
    published DATE NOT NULL, -- df: start=1945-01-01 end=2013-06-25
    -- The biased scale generator is set for 95% 1 and 5% for others.
    -- There are 2 language values because of size=2 on table Language.
    lid INTEGER NOT NULL REFERENCES Language, -- df: int sub=scale rate=0.95
    pid INTEGER NOT NULL REFERENCES Publisher,
    -- A publisher does not publish a title twice in a day
    UNIQUE(title, published, pid)
  );

  -- Most books are stored once in the inventory.
  -- About 1% of the books are not in the inventory (skip).
  -- Some may be there twice or more (15.2 > 15.1 on Comics).
  CREATE TABLE Inventory( -- df: mult=15.2 skip=0.01
    iid SERIAL PRIMARY KEY,
    cid INTEGER NOT NULL REFERENCES Comics, -- df: sub=serand
    shelf INTEGER NOT NULL -- df: size=20
  );

  -- on average, about 2.2 authors per comics (33.2/15.1)
  CREATE TABLE Written( -- df: mult=33.2
    -- serand => at least one per author and one per comics, then random
    cid INTEGER NOT NULL REFERENCES Comics, -- df: sub=serand mangle
    aid INTEGER NOT NULL REFERENCES Author, -- df: sub=serand mangle
    PRIMARY KEY(cid, aid)
  );

WARNINGS, BUGS, FEATURES AND GRUMBLING

All software has bug, this is a software, hence it has bugs.

If you find one, please sent a report, or even better a patch that fixes it!

BEWARE, you may loose your hairs or your friends by using this script. Do not use this script on a production database and without dumping your data somewhere safe beforehand.

There is no SQL parser, table and attributes are analysed with optimistic regular expressions.

Foreign keys cannot reference compound keys.

Handling of quoted identifiers is partial and may not work at all.

Beware that unique constraint checks for big data generation may require a lot of memory.

The script works more or less with Python versions 2.6, 2.7, 3.2, 3.3 and 3.4. However Python compatibility between major or even minor versions is not the language's forte. The script tries to remain agnostic with respect to this issue at the price of some hacks. Sometimes it fails. Complain to Python's developers who do not care about script stability, and in some way do not respect other people work. See discussion in PEP387, which in my opinion is not always implemented.

Python is quite paradoxical, as it can achieve forward compatibility (from __future__ import ...) but often fails miserably at backward compatibility. Guido, this is the other way around! The useful feature is to make newer versions handle older scripts.

LICENSE

This is free software, both inexpensive and with sources.

The GNU General Public License v3 applies, see http://www.gnu.org/copyleft/gpl.html for details.

The summary is: you get as much as you paid for, and I am not responsible for anything.

If you are happy with this software, feel free to send me a postcard saying so! See my web page for current address http://www.coelho.net/.

If you have ideas or patches to improve this script, send them to me!

DOWNLOAD

This is datafiller version 2.0.0 (r792 on 2014-03-23).

Latest version and online documentation should be available from http://www.coelho.net/datafiller.html.

Download script: https://www.cri.ensmp.fr/people/coelho/datafiller.

VERSIONS

History of versions:

version 2.0.0 (r792 on 2014-03-23)
version 1.1.5 (r360 on 2013-12-03): Improved user and code documentations. Improved validation. Add --no-filter option for some tests. Add some support for INET, CIDR and MACADDR types.
version 1.1.4 (r326 on 2013-11-30): Improved documentation, in particular add a TUTORIAL section and reorder some other sections. Add some pydoc in the code. Memoize enum regular expression. Raise an error if no generator can be created for an attribute, instead of silently ignoring it. Fix a bug which could generate primary key violations when handling some unique constraints.
version 1.1.3 (r288 on 2013-11-29): Improved documentation, including a new KEYWORDS section. Check consistency of parameters in float generator. Add some partial support for ALTER TABLE. Multiple --debug increases verbosity, especially in the parser stage. Support settings directives out of declarations. Better support of PostgreSQL integer types. Add separate VERSIONS section. One fix for python 3.3...
version 1.1.2 (r268 on 2013-06-30): Improved and simplified code, better comments and validation. Various hacks for python 2 & 3 compatibility. Make validations stop on errors. Check that lenvar is less than length. Fixes for length and lenvar overriding in string generator.
version 1.1.1 (r250 on 2013-06-29): Minor fix to the documentation.
version 1.1.0 (r248 on 2013-06-29): Improved documentation, code and comments. Add --test option for demonstration and checks, including an embedded validation. Add --validate option as a shortcut for script validation. Add seed, skip, mangle and nomangle directives. Add null directive on tables. Add nogen and type directives on attributes. Accept size 0. Change alpha behavior under float sub=scale so that the higher alpha, the more skewed towards 0. Add alternative simpler rate specification for scale and power integer generators. Deduce size when both start and end are specified for the date and timestamp generators. Add tz directive for the timestamp generator. Add float, interval and blob generators. Add some support for user-defined enum types. Add --tries option to control the effort to satisfy UNIQUE constraints. Add support for non integer foreign keys. Remove square and cube integer generators. Change macro definition syntax so as to be more intuitive. Add -V option for short version. Some bug fixes.
version 1.0.0 (r128 on 2013-06-16): Initial distribution.

KEYWORDS

Automatically populate, fill PostgreSQL database with test data. Generate test data with a data generation tool (data generator) for PostgreSQL. Import test data into PostgreSQL.