Relational

The Gretel Trainer Relational sub-package provides interfaces for processing multiple tables from relational databases and data warehouses. You may use a Connector to connect automatically to supported databases, or provide your own CSVs and define the relationships manually.

Please see https://docs.gretel.ai/reference/relational to get started.

The primary interfaces are documented below.

Connectors

This module provides the “Connector” class which allows for reading from and writing to databases and data warehouses. This class can handle metadata and table extraction automatically. When this is done with the “Connector.extract” method, a “RelationalData” instance is provided which you can then use with the “MultiTable” class to process data with Gretel Transforms, Classify, Synthetics, or a combination of both.

class gretel_trainer.relational.connectors.Connector(engine: Engine)

Wraps connections to relational databases and data warehouses.

Parameters:

engine (sqlalchemy.engine.base.Engine) –

A SQLAlchemy engine configured to connect to some database. A variety of helper functions exist to assist with creating engines for some popular databases, but these should not be considered exhaustive. You may need to install additional dialect/adapter packages via pip, such as psycopg2 for connecting to postgres.

For more detail, see the SQLAlchemy docs: https://docs.sqlalchemy.org/en/20/core/engines.html
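
For example, a minimal sketch (assuming a local PostgreSQL database and that the psycopg2 driver is installed; the connection URL below is hypothetical):

from sqlalchemy import create_engine

from gretel_trainer.relational.connectors import Connector

# Hypothetical connection URL; substitute your own host, credentials, and database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")
connector = Connector(engine)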

extract(only: set[str] | None = None, ignore: set[str] | None = None, schema: str | None = None, config: ExtractorConfig | None = None, storage_dir: str = 'relational_source') RelationalData

Extracts table data and relationships from the database. All arguments are optional; a usage sketch follows the parameter list below.

Parameters:
  • only – Only extract these table names; cannot be used with ignore

  • ignore – Skip extracting these table names; cannot be used with only

  • schema – An optional schema name that is passed through to SQLAlchemy, may only be used with certain dialects.

  • config – An optional extraction config. This config can be used to only include specific tables, ignore specific tables, and configure subsetting. Please see the ExtractorConfig docs for more details.

  • storage_dir – The output directory where extracted data is stored.
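
A minimal usage sketch, assuming a connector built as above and hypothetical table names:

# Extract only two hypothetical tables and write the flat files to a local directory.
relational_data = connector.extract(
    only={"users", "orders"},
    storage_dir="extracted_data",
)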

classmethod from_conn_str(conn_str: str) Connector

Alternate constructor that creates a Connector instance directly from a connection string.

Parameters:

conn_str – A full connection string for the target database.
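
For example (the connection string below is hypothetical):

connector = Connector.from_conn_str("postgresql://user:password@localhost:5432/mydb")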

Core

This module exposes the “RelationalData” class to users, which allows the processing of relational databases and data warehouses with Gretel.ai.

When using a “Connector” or a “TableExtractor” instance to automatically connect to a database, a “RelationalData” instance will be created for you that contains all of the learned metadata.

If you are processing relational tables manually, with your own CSVs, you will need to create a “RelationalData” instance and populate it yourself.

Please see the specific docs for the “RelationalData” class on how to do this.

class gretel_trainer.relational.core.ForeignKey(table_name: 'str', columns: 'list[str]', parent_table_name: 'str', parent_columns: 'list[str]')
exception gretel_trainer.relational.core.MultiTableException
class gretel_trainer.relational.core.RelationalData(directory: str | Path | None = None, source_data_handler: SourceDataHandler | None = None)

Stores information about multiple tables and their relationships. This object can be created without any arguments; you then rely on the instance methods to add tables and define key relationships.

Example:

rel_data = RelationalData()
rel_data.add_table(...)
rel_data.add_table(...)
rel_data.add_foreign_key_constraint(...)

See the specific method docstrings for details on each method.

add_foreign_key_constraint(*, table: str, constrained_columns: list[str], referred_table: str, referred_columns: list[str]) None

Add a foreign key relationship between two tables.

Parameters:
  • table – The table name that contains the foreign key.

  • constrained_columns – The column name(s) defining a relationship to the referred_table (the parent table).

  • referred_table – The table name that the foreign key in table refers to (the parent table).

  • referred_columns – The column name(s) in the parent table that the constrained_columns point to.
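
A sketch using hypothetical tables, where orders.user_id refers to users.id:

rel_data.add_foreign_key_constraint(
    table="orders",
    constrained_columns=["user_id"],
    referred_table="users",
    referred_columns=["id"],
)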

add_table(*, name: str, primary_key: str | list[str] | None, data: DataFrame | str | Path) None

Add a table. The primary key can be None (if one is not defined on the table), a string column name (most common), or a list of multiple string column names (composite key).

This call MAY result in multiple tables getting “registered,” specifically if the table includes nested JSON data.
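
A sketch using hypothetical table names and local CSV paths:

rel_data.add_table(name="users", primary_key="id", data="users.csv")
rel_data.add_table(name="orders", primary_key="id", data="orders.csv")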

property foreign_key_cycles: list[list[str]]

Returns lists of tables that have cyclic foreign key relationships.

get_ancestors(table: str) list[str]

Like get_parents, but applied recursively: parent tables, their parents, and so on, until there are no more ancestors to add.

get_descendants(table: str) list[str]

Given a table name, recursively return all tables whose foreign keys reference this table’s primary key, along with the descendants of those tables.

get_modelable_table_names(table: str) list[str]

Returns a list of MODELABLE table names connected to the provided table. If the provided table is the source of invented tables, returns the modelable invented tables created from it. If the provided table is itself modelable, returns that table name back. Otherwise returns an empty list.

get_parents(table: str) list[str]

Given a table name, return the table names that are referred to by the foreign keys in this table.

get_primary_key(table: str) list[str]

Return the list of columns defining the primary key for a table. It may be a single column or multiple columns (composite key).

get_table_columns(table: str) list[str]

Return the column names for a provided table name.

get_table_data(table: str, usecols: list[str] | None = None) DataFrame

Return the table contents for a given table name as a DataFrame.

get_table_row_count(table: str) int

Return the number of rows in the table.

property is_empty: bool

Return a bool to indicate if the RelationalData contains any table information.

list_all_tables(scope: Scope = Scope.MODELABLE) list[str]

Returns a list of table names belonging to the provided Scope. See “Scope” enum documentation for details. By default, returns tables that can be submitted as jobs to Gretel (i.e. that are MODELABLE).

list_tables_parents_before_children() list[str]

Returns a list of all tables with the guarantee that a parent table appears before any of its children. No other guarantees about order are made, e.g. the following (and others) are all valid outputs: [p1, p2, c1, c2] or [p2, c2, p1, c1] or [p2, p1, c1, c2] etc.

remove_foreign_key_constraint(table: str, constrained_columns: list[str]) None

Remove an existing foreign key.

restore(tableset: dict[str, pandas.core.frame.DataFrame]) dict[str, pandas.core.frame.DataFrame]

Restores a given tableset (presumably output from some MultiTable workflow, e.g. transforms or synthetics) to its original shape (specifically, “re-nests” any JSON that had been expanded out).

Users should rely on MultiTable calling this internally when appropriate and not need to do so themselves.

set_primary_key(*, table: str, primary_key: str | list[str] | None) None

(Re)set the primary key on an existing table. If the table does not yet exist in the instance’s collection, add it via add_table.

update_table_data(table: str, data: DataFrame | str | Path) None

Set new data (a DataFrame, or a path to a CSV file) as the table data for a given table name.

class gretel_trainer.relational.core.Scope(value)

Various non-mutually-exclusive sets of tables known to the system

ALL = 'all'

Every known table (all user-supplied tables, all invented tables)

EVALUATABLE = 'evaluatable'

A subset of MODELABLE that additionally omits invented child tables (but includes invented root tables)

INVENTED = 'invented'

Includes all tables invented from un-modelable user source tables

MODELABLE = 'modelable'

Includes flat source tables and all invented tables, omits source tables that led to invented tables

PUBLIC = 'public'

Includes all user-supplied tables, omits invented tables
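
For example, to list only the user-supplied tables (a sketch assuming rel_data is a populated RelationalData instance):

from gretel_trainer.relational.core import Scope

# Omits any invented tables, returning only user-supplied table names.
public_tables = rel_data.list_all_tables(scope=Scope.PUBLIC)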

class gretel_trainer.relational.core.SourceDataHandler(*args, **kwargs)
put_invented_table_source(name: str, data: DataFrame) str

Ensures invented table data exists in a preferred, accessible internal location. Returns a string handle to the data that can be used with smart_open.

put_source(name: str, data: DataFrame | str | Path) str

Ensures source data exists in a preferred, accessible internal location. Returns a string handle to the data that can be used with smart_open.

resolve_data_location(data: str | Path) str

Returns a string handle that can be used with smart_open to read source data.

class gretel_trainer.relational.core.TableMetadata(primary_key: 'list[str]', source: 'str', columns: 'list[str]', invented_table_metadata: 'Optional[InventedTableMetadata]' = None, producer_metadata: 'Optional[ProducerMetadata]' = None, safe_ancestral_seed_columns: 'Optional[set[str]]' = None)

Multi Table

This module provides the “MultiTable” class to users. This allows you to take extracted data from a database or data warehouse, and process it with Gretel using Transforms, Classify, and Synthetics.

class gretel_trainer.relational.multi_table.MultiTable(relational_data: RelationalData, *, strategy: str = 'independent', project_display_name: str | None = None, project: Project | None = None, refresh_interval: int | None = None, backup: Backup | None = None, output_handler: OutputHandler | None = None, session: ClientConfig | None = None)

Relational data support for the Trainer SDK

Parameters:
  • relational_data (RelationalData) – Core data structure representing the source tables and their relationships.

  • strategy (str, optional) – The strategy to use for synthetics. Supports “independent” (default) and “ancestral”.

  • project_display_name (str, optional) – Display name in the console for a new Gretel project holding models and artifacts. Defaults to “multi-table”. Conflicts with project.

  • project (Project, optional) – Existing project to use for models and artifacts. Conflicts with project_display_name.

  • refresh_interval (int, optional) – Frequency in seconds to poll Gretel Cloud for job statuses. Must be at least 30. Defaults to 60 (1m).

  • backup (Backup, optional) – Should not be supplied manually; instead use the restore classmethod.
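
A minimal construction sketch, assuming relational_data was produced by Connector.extract (or built manually); the display name is hypothetical:

from gretel_trainer.relational.multi_table import MultiTable

mt = MultiTable(
    relational_data,
    strategy="independent",
    project_display_name="my-relational-project",  # hypothetical display name
)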

generate(record_size_ratio: float = 1.0, preserve_tables: list[str] | None = None, identifier: str | None = None, resume: bool = False) None

Sample synthetic data from trained models. Tables that did not train successfully will be omitted from the output dictionary. Tables listed in preserve_tables may differ from source tables in foreign key columns, to ensure joining to parent tables (which may have been synthesized) continues to work properly.

Parameters:
  • record_size_ratio (float, optional) – Ratio by which to upsample the real-world data size. Defaults to 1.0.

  • preserve_tables (list[str], optional) – List of tables to skip sampling and leave (mostly) identical to source.

  • identifier (str, optional) – Unique string identifying a specific call to this method. Defaults to synthetics_ + current timestamp.

  • resume (bool, optional) – Set to True when restoring from a backup to complete a previous, interrupted run.
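
For example, a sketch that doubles the output size while leaving one hypothetical reference table untouched (assumes synthetics models have already been trained via train_synthetics):

mt.generate(
    record_size_ratio=2.0,
    preserve_tables=["lookup_codes"],  # hypothetical table to leave (mostly) as-is
)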

classmethod restore(backup_file: str, session: ClientConfig | None = None) MultiTable

Create a MultiTable instance from a backup file.

retrain_tables(tables: dict[str, Union[pandas.core.frame.DataFrame, str, pathlib.Path]]) None

Provide updated table data and retrain. This method overwrites the table data in the RelationalData instance. It should be used when initial training fails and source data needs to be altered, but progress on other tables can be left as-is.

run_transforms(identifier: str | None = None, in_place: bool = False, data: dict[str, Union[pandas.core.frame.DataFrame, str, pathlib.Path]] | None = None, encode_keys: bool = False) None

Run pre-trained Gretel Transforms models on relational table data.

Parameters:
  • identifier – Unique string identifying a specific call to this method. Defaults to transforms_ + current timestamp.

  • in_place – If True, overwrites source data in all locations (internal Python state, local working directory, project artifact archive). Used for transforms->synthetics workflows.

  • data – If supplied, runs only the supplied data through the corresponding transforms models. Otherwise runs source data through all existing transforms models.

  • encode_keys – If set, primary and foreign keys will be replaced with label encoded variants. This can add an additional level of privacy at the cost of referential integrity between transformed and original data.
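
A sketch of a transforms run that also label-encodes keys (assumes transforms models already exist for the tables in this MultiTable instance):

# Source data for every table with a transforms model is run through that model;
# primary and foreign keys are replaced with label-encoded variants.
mt.run_transforms(encode_keys=True)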

train_synthetics(*, config: str | Path | dict | None = None, table_specific_configs: dict[str, Union[str, pathlib.Path, dict]] | None = None, only: set[str] | None = None, ignore: set[str] | None = None) None

Train synthetic data models for the tables in the tableset, optionally scoped by either only or ignore.
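
A sketch that trains synthetics models for every table except one hypothetical audit table:

# "audit_log" is a hypothetical table name; all other tables in the tableset are trained.
mt.train_synthetics(ignore={"audit_log"})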

Extractor

Extract database or data warehouse SQL tables to flat files with optional subsetting.

class gretel_trainer.relational.extractor.ExtractorConfig(target_row_count: float = -1.0, sample_mode: SampleMode = SampleMode.CONTIGUOUS, only: set[str] | None = None, ignore: set[str] | None = None, schema: None = None)

Configuration class for extracting tables from a remote database. An instance of this class should be passed as a param to the “TableExtractor” constructor.

property empty_table: bool

Returns True if the config is set to only extract column names.

property entire_table: bool

Returns True if the config is set to extract entire tables from the remote database.

ignore: set[str] | None = None

Ignore these tables during extraction. Cannot be used with only.

only: set[str] | None = None

Only extract these tables. Cannot be used with ignore.

sample_mode: SampleMode = 'contiguous'

The method used to sample records from tables whose primary keys are not referenced by any other tables. We call these “leaf” tables because, in a graph representation, they do not have any children.

The default mode is to sample contiguously based on how the specific database/data warehouse supports it. This essentially does a ‘SELECT * FROM table LIMIT <N>’ based on the provided target_row_count.

If using “random” sampling, the extractor will attempt to select leaf table rows randomly, however different dialects have different support for this. If the “random” sampling fails, the extractor will fall back to the “contiguous” method.

schema: None = None

Limit scope to a specific schema. This is a pass-through param to SQLAlchemy and is not supported by all dialects.

target_row_count: float = -1.0

The target number of rows (or ratio of rows) to sample. This will be used as the sample target for “leaf” tables, i.e. tables whose primary keys are not referenced by any other tables. If this number is >= 1, that number of rows will be sampled; if the value is between 0 and 1, it is treated as a percentage of the total number of rows. A value of 0 will extract only column headers, and -1 will extract entire tables.

The default value, -1, implies that full tables should be extracted.
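
A configuration sketch that subsets leaf tables to roughly 10% of their rows; the table name is hypothetical, and it is assumed that SampleMode exposes a RANDOM member corresponding to the “random” sampling described above:

from gretel_trainer.relational.extractor import ExtractorConfig, SampleMode

config = ExtractorConfig(
    target_row_count=0.1,           # sample ~10% of rows from each leaf table
    sample_mode=SampleMode.RANDOM,  # assumed member; falls back to contiguous if unsupported
    ignore={"audit_log"},           # hypothetical table to skip entirely
)
relational_data = connector.extract(config=config)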

class gretel_trainer.relational.extractor.SampleMode(value)

An enumeration of sampling modes used during extraction; see ExtractorConfig.sample_mode for the available behaviors.

exception gretel_trainer.relational.extractor.TableExtractorError
class gretel_trainer.relational.extractor.TableMetadata(original_row_count: int, sampled_row_count: int, column_count: int)

Contains information about an extracted table.