SQL is, and will probably remain, the main workhorse behind any ETL (and especially the ELT flavour of ETL) tool. Automating SQL generation has arguably always been the biggest obstacle in building an ideal ETL tool - i.e. one that is completely metadata-driven, has a small footprint, supports multiple platforms on a single code base... and, naturally, is capable of generating complex SQL in an easy and flexible manner, with no rocket scientists required nearby. While SQL stands for Structured Query Language, ironically the language itself is not too well 'structured', and the abundance of vendor dialects and extensions does not help either.

Attempts to build an SQL generator supporting the full feature list of the SQL language have generally fallen into one of two camps: one trying to create a graphical click-and-pick interface encompassing the syntax of every single SQL construct, the other designing an even higher-level language or model to describe SQL itself, a kind of meta-SQL. The first approach usually ends up limited to simple SQL statements, suited mostly to SELECTs while struggling with UPDATEs and INSERTs, and tied to a single vendor dialect.

The second approach tends to drown in the complexity of SQL itself. In theory, one could decompose all SQL statements into series of binary expressions, store them away, and (re)assemble them into SQL statements again as needed, driven by the syntax of a particular SQL dialect. In practice, this approach usually fails to produce anything usable, mostly because SQL is too loosely defined (considering all those vendors and dialects), and trying to cover everything results in a system too cumbersome for anyone to use. The result would probably be an order of magnitude more complex to use than just hand-coding SQL statements, even with several parallel code bases. And that is exactly what developers would do: invent a way to bypass the abstraction layer and hand-code SQL directly.

Enter Template-Based SQL Generation. Based on our experience (and tons of ETL code written), we have extracted a set of SQL 'patterns' common to ETL (ELT) tasks. The patterns are converted into templates for processing by a template engine (e.g. Apache Velocity), each one realizing a separate SQL fragment, a full SQL statement or a complete sequence of commands implementing a complex process. The template engine merges templates and mappings into executable SQL statements, so instead of going as deep as full decomposition we only separate out the mapping (structure) and template (process) parts of SQL. This limits us to a set of predefined templates, but anyone can add new ones or customize the existing ones.
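
To make this concrete, here is a minimal sketch of what such a template might look like in Velocity's VTL. It is an illustration only - the variable names ($target, $source, $mappings) are assumptions made for this example, not the actual conventions of the XDTL libraries:

## hypothetical INSERT..SELECT template ($target, $source, $mappings
## are illustrative names, injected into the context by the engine)
INSERT INTO $target (
#foreach($m in $mappings)
#if($foreach.index > 0),#end$m.tgt
#end
)
SELECT
#foreach($m in $mappings)
#if($foreach.index > 0),#end$m.src AS $m.tgt
#end
FROM
$source
WHERE
...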

The important thing here is that templates are generic and can be used with multiple different mappings/data structures; the mappings are generic as well and can be used with multiple different patterns/templates. The template engine 'instantiates' mappings and templates into executable SQL statement 'instances', which brings us closer to an OO mindset. The number of tables joined, the number of columns selected, the number of WHERE conditions etc. is arbitrary and driven by the contents of the mappings alone, i.e. well-designed templates are transparent to the level of complexity of the mappings. The same template will produce quite different SQL statements driven by minor changes in the mappings.
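
The mapping half could then be as simple as a handful of values placed into the template context. The shape below is again hypothetical, using #set for brevity (in practice the mappings would come from metadata rather than being hard-coded):

## hypothetical mapping for the template above
#set($target = "network")
#set($source = "call c LEFT OUTER JOIN network n ON n.network_id = c.network_id")
#set($mappings = [
{"tgt": "caller", "src": "c.cust_a"},
{"tgt": "receiver", "src": "c.cust_b"},
{"tgt": "calls_no", "src": "c.calls"}
])

Adding a fourth entry to $mappings would add a column to both the INSERT and SELECT lists without touching the template.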

As an example, a 'basic' template-driven INSERT..SELECT statement might look like this:

INSERT INTO network (
caller
,receiver
,calls_no
)
SELECT
c.cust_a AS caller
,c.cust_b AS receiver
,c.calls AS calls_no
FROM
call c
LEFT OUTER JOIN
network n ON n.network_id = c.network_id
WHERE
...

Indicating that three consecutive mappings are actually to be treated as one complex statement with subqueries would change the generated SQL to:

INSERT INTO network (
caller
,receiver
,calls_no
)
SELECT
c.cust_a AS caller
,c.cust_b AS receiver
,c.calls AS calls_no
FROM
(SELECT
DISTINCT a.cust_id AS cust_a
,b.cust_id AS cust_b
,c.call_type::integer AS type
,c.call_length::integer AS length
,c.call_date::date AS date
FROM
(SELECT
DISTINCT r.call_type::integer AS call_type
,r.call_length::integer AS call_length
,r.call_date::date AS call_date
FROM
raw_cdr r
...

On the other hand, we might prefer cascading INSERTs through temporary tables for performance reasons, which would morph the SQL into:

SELECT
DISTINCT r.call_type::integer AS call_type
,r.call_length::integer AS call_length
,r.call_date::date AS call_date
INTO TEMP TABLE cdr
FROM
raw_cdr r
WHERE
... 

Selecting Oracle as the target platform would switch the same template over to Oracle syntax, producing:

CREATE TABLE cdr AS
SELECT
DISTINCT r.call_type AS call_type
,r.call_length AS call_length
,r.call_date AS call_date
FROM
raw_cdr r
WHERE
... 
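
Under the hood, such a dialect switch can be as simple as a conditional inside the template. The sketch below is once more illustrative - the $dialect variable and its values are assumptions, not actual XDTL conventions:

## hypothetical dialect switch inside a staging template
#if($dialect == "oracle")
CREATE TABLE $target AS
#end
SELECT
DISTINCT
#foreach($m in $mappings)
#if($foreach.index > 0),#end$m.src AS $m.tgt
#end
#if($dialect != "oracle")
INTO TEMP TABLE $target
#end
FROM
$source
WHERE
...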

To accomplish all (or at least a lot) of this, we have (so far) assembled two template 'libraries'. The MMX XDTL Basic SQL Library covers a wide range of 'building blocks' for implementing complex data processing command chains: basic INSERT..SELECT, complex INSERT..SELECT with subqueries, cascaded (staged) INSERT..SELECT, UPDATE..FROM etc. The MMX XDTL Basic ELT Library includes more complex multi-step patterns used in typical ELT scenarios focused on single-table synchronisation: Full Replace, Incremental Load, Upsert, History Load etc. These pattern libraries serve as reference templates and are easily customizable to fit the unique characteristics of a specific use case.
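
To give a flavour of what such a pattern expands to: an Upsert typically generates an UPDATE..FROM for the rows that already exist in the target, followed by an INSERT..SELECT for the rows that do not - roughly like this (the customer/customer_stage tables and their columns are made up for illustration):

UPDATE customer c SET
name = s.name
,status = s.status
FROM customer_stage s
WHERE c.cust_id = s.cust_id;

INSERT INTO customer (
cust_id
,name
,status
)
SELECT
s.cust_id
,s.name
,s.status
FROM
customer_stage s
LEFT OUTER JOIN
customer c ON c.cust_id = s.cust_id
WHERE
c.cust_id IS NULL;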