mmx metadata framework
...the DNA of your data
The MMX metadata framework is a lightweight implementation of the OMG Meta Object Facility built on relational database technology. The MMX framework is based on three general concepts:
Metamodel | The MMX Metamodel provides a storage mechanism for various knowledge models. The data model underlying the metadata framework is more abstract in nature than metadata models in general; it consists of only a few abstract entities.
Access layer | Object-oriented methods can be exploited, using inheritance to derive the whole data access layer from a small set of primitives created in SQL. The MMX Metadata Framework provides several diverse methods of data access to fulfill different requirements.
Generic transformation | A large part of the relationships between different objects in a metadata model are too complex to be described through simple static relations. Instead, a universal data transformation concept is put to use, enabling the definition of transformations, mappings and transitions of any complexity.

XDTL: An Architectural Perspective

October 11, 2009 21:13 by marx

XDTL stands for eXtensible Data Transformation Language (see the previous post). It is an XML-based language for describing data transformations, most often used for loading data into data warehouses or for building complex data processing jobs consisting of a series of data manipulations. The XDTL language definition (an XML Schema) has its namespace defined as follows:

xmlns:xdtl="http://xdtl.org/xdtl"
xsi:schemaLocation="http://xdtl.org/xdtl xdtl.xsd" 

Note: The schema is non-normative and only provided as means to validate XDTL instances and, as such, is naturally subject to change as the language evolves.
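For illustration, these declarations sit on the root element of an XDTL document. A minimal skeleton might look as follows (the package root element is an assumption based on the package concept described in the XDTL post below; the normative vocabulary is whatever the schema defines):

<?xml version="1.0" encoding="UTF-8"?>
<xdtl:package xmlns:xdtl="http://xdtl.org/xdtl"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://xdtl.org/xdtl xdtl.xsd">
    <!-- variables, connections and tasks go here -->
</xdtl:package>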

(1) Runtime Engine interpreting XDTL scripts. XDTL is just a language used to build the scripts describing data transformations, so it needs an execution mechanism to (pre)process and run those scripts. An XDTL engine (interpreter) assembles the scripts, mappings and templates into a series of executable commands basically consisting of file and database operations and runs them. There can be more than one XDTL runtime, each one designed for its own purpose and implementing a specific subset of the language definition. An XDTL runtime could also be embedded into another system to provide the low-level plumbing for an application that has to accomplish some ELT functions internally. 

(2) Mappings, stored either in the MMX Repository or directly inside an XDTL script. The mappings concept is based on the ideas laid out in [1]. Mappings express the structural dependencies and data dependencies between data 'sources' and 'targets' during different stages of a transformation process and "...describe all data movement-related dependencies as a set of Mapping instances. One instance represents a "single-stage" link between a set of source Entity instances and a set of target Entity instances where every entity plays only a single role, either source or target." [1] There are three basic types of mapping instances: Filter-Mapping (corresponding to SQL's WHERE and HAVING clauses), Join-Mapping (JOINs) and Aggregate-Mapping (GROUP BYs).

Implementation of the mappings concept in XDTL involves a set of four collections: Sources (the source tables), Target (the target table), Columns (column mappings, accompanied by attributes such as IsJoinKey and IsUpdateableColumn) and Conditions (conditions used in JOIN, WHERE and HAVING clauses). Mappings are either imported from the MMX Metadata Repository in XML format during execution, included from an external file or defined explicitly in the XDTL script. An arbitrary number of mappings can be cascaded to express transformations of very high complexity. Storing mappings in the Repository opens up endless opportunities for reusing the same information in various other applications, e.g. Impact Analysis or Data Quality tools.
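For illustration, a mapping serialized in XML might carry these four collections roughly as follows. This is a hypothetical sketch, not the actual MMX Repository export format; the attribute names simply mirror the properties the template below refers to ($src.Name, $src.Alias, $cond.Id, $cond.Expr, etc.):

<mapping>
    <target name="call"/>
    <sources>
        <source id="1" name="cdr" alias="c"/>
        <source id="2" name="customer" alias="a"/>
        <source id="3" name="customer" alias="b"/>
    </sources>
    <columns>
        <column source="a.cust_id" target="cust_a"/>
        <column source="call_type" target="type" type="integer"/>
    </columns>
    <conditions>
        <!-- id links a join condition to the source with that index; id 0 marks a WHERE filter -->
        <condition id="2" expr="c.phone_a = a.phone_no"/>
        <condition id="3" expr="c.phone_b = b.phone_no"/>
        <condition id="0" expr="c.phone_a IS NOT NULL"/>
    </conditions>
</mapping>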

(3) SQL templates turned into executable SQL statements. Since XDTL is an ELT language, the SQL statement is its single most important functional part. The big question with SQL automation is: how far do you want to go in substituting SQL code with something more abstract? In theory, you could decompose all your SQL statements into series of binary expressions, store them away, and reassemble them into SQL statements as needed, driven by the syntax of one particular SQL dialect. In practice, this approach usually fails to produce anything useful, mostly because SQL is too loosely defined (considering all those vendors and dialects) and trying to cover everything results in a system too cumbersome for anyone to use. The result would be a sort of 'metaSQL' describing 'real SQL' that is probably an order of magnitude harder to maintain than hand-coded statements, even with several parallel code bases. And that is exactly what developers would do: invent a mechanism to bypass the abstract layer and hand-code SQL directly.

(4) Template Engine. Based on our experience (and tons of ETL code written), we have extracted a set of SQL 'patterns' common to ETL (ELT) tasks. The patterns are converted into templates for processing by a template engine (Velocity [2], in particular). The template engine merges patterns and mappings into executable SQL statements. So instead of going as deep as full decomposition, we only separate the mapping (structure) and template (process) parts of SQL. This limits us to a set of predefined templates, but new ones can always be added. The templates are generic and can be used with multiple different mappings/data structures; the mappings are generic as well and can be used with multiple different patterns/templates. The template engine 'instantiates' mappings and templates to create executable SQL statement 'instances'.

As an example, provided with proper mappings, this simple template 

...
#macro( insertClause $Cols )
#foreach( $col in $Cols )
 $col.Target#if( $velocityHasNext ),#end
#end
)
#end
 
#macro( selectClause $Cols )
#foreach( $col in $Cols )
 $col.Source#if( $col.Type )::$col.Type#end AS $col.Target
#if( $velocityHasNext ) ,#end
#end
#end
 
#macro( fromClause $Srcs $Conds )
## the first source opens the FROM clause; every further source is
## joined in via the condition carrying its index ($cond.Id)
#foreach( $src in $Srcs )
#set( $srcIdx = $velocityCount )
#if( $srcIdx == 1 )
 $src.Name $src.Alias
#else
#foreach( $cond in $Conds )
#if( $cond.Id == $srcIdx )
JOIN $src.Name $src.Alias
 ON $cond.Expr
#end
#end
#end
#end
#end
 
#macro( whereClause $Conds )
## conditions with Id 0 are filter conditions rather than join conditions
#set( $first = true )
#foreach( $cond in $Conds )
#if( $cond.Id == 0 )
#if( !$first )AND #end
 $cond.Expr
#set( $first = false )
#end
#end
#end
 
## generic insert-select statement
#set( $tgt = $Target )
INSERT INTO $tgt.Name (
#insertClause( $Columns )
SELECT 
#selectClause( $Columns )
FROM 
#fromClause( $Sources $Conditions )
WHERE 
#whereClause( $Conditions )

...

would produce the following SQL statement:

INSERT INTO call (
 cust_a, cust_b, type, length, date)
SELECT
 a.cust_id AS cust_a
 , b.cust_id AS cust_b
 , call_type::integer AS type
 , call_length::integer AS length
 , call_date::date AS date
FROM
 cdr c
JOIN customer a
 ON c.phone_a = a.phone_no
JOIN customer b
 ON c.phone_b = b.phone_no
WHERE
 c.phone_a IS NOT NULL

This pretty simple template can actually produce a lot of quite different (and much more complex) SQL statements, all following the same basic pattern (an INSERT ... SELECT over a multi-table join), which is probably one of the most frequent patterns in ELT processes. Of course, in an ideal world, a runtime engine would also have zero footprint and zero overhead, and support multiple platforms and multiple SQL dialects...

[1] Stöhr, T.; Müller, R.; Rahm, E.: An Integrative and Uniform Model for Metadata Management in Data Warehousing Environments, 1999. 

[2] http://velocity.apache.org/ 

XDTL (eXtensible Data Transformation Language)

April 12, 2009 21:30 by marx

Traditional ETL (Extract Transform Load) tools broadly used in Data Warehouse environments tend to have two common deficiencies:

- emphasis on graphical user interface (and lack of a more efficient code interface) makes the design process slow and inflexible;

- dedicated ETL server generally means one extra hop for the data being transferred, which might be unacceptable considering today's data loads.

Enter XDTL (eXtensible Data Transformation Language). XDTL is an XML-based language designed for describing data transformations from one database/storage to another. XDTL syntax is defined in an XML Schema document (http://xdtl.org/xdtl). The XML Schema of XDTL has semantic annotations linking it to the XDTL ontology model.

XDTL documents are interpreted by an XDTL Runtime Engine. XDTL and its Runtime Engine are built not from the perspective of a slick IDE or a cool engine, but of an efficient language for describing data transformations. The goal is to produce a lightweight ETL development/runtime environment that handles most common requirements with better efficiency than traditional jack-of-all-trades tools. The XDTL Runtime Engine is currently under development for both .NET and Linux environments. The XDTL language is free for anyone to use.

XDTL documents are stored in well-formed XML files that can be validated with the XDTL XML Schema. A package is the primary unit of execution that can be addressed by the runtime engine. A single XDTL document can contain several packages. Every package has a unique identifier in the form of a URI. There are also special non-executable packages (libraries) that serve as containers of tasks callable by other packages. A package contains an arbitrary number of Variables, an unordered collection of Connections and an ordered collection of Tasks.

Variables are name-value pairs mostly used for parameterization and are accessible by transformations. Connections are used to define data sources and targets used in transformations. Connections can refer to database resources (tables, views, result sets), text files in various formats (CSV, fixed format, Excel files) or Internet resources in tabular format.

Tasks are the smallest units of execution that have a unique identifier. A task is an ordered collection of Transformations that move, create or change data; there is at least one transformation in a task. Tasks can be parameterized in case one or several of their transformations have parameters; in that case all the parameters should have default values defined as package variables.
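To make this structure concrete, here is a minimal sketch of a package with one variable, one connection and one task. The element and attribute names are hypothetical; the normative vocabulary is defined by the XDTL schema:

<?xml version="1.0" encoding="UTF-8"?>
<xdtl:package xmlns:xdtl="http://xdtl.org/xdtl"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://xdtl.org/xdtl xdtl.xsd"
    id="http://example.org/packages/load_calls">
    <!-- name-value pair, also usable as a task parameter default -->
    <variable name="schema" value="staging"/>
    <!-- a data source or target: database, text file or Internet resource -->
    <connection name="dwh" type="database"/>
    <!-- the smallest addressable unit of execution -->
    <task id="load_call_facts">
        <!-- at least one transformation per task -->
        <transformation source="cdr" target="call" connection="dwh"/>
    </task>
</xdtl:package>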

What sets XDTL apart from traditional ETL tools? 
 
- while ETL tools in general focus on the graphical IDE and the entry-level user, the needs of professional users are not addressed and they have to struggle with an inefficient workflow. XDTL relies on XML as its development vehicle, making it easy to generate data transformation documents automatically or with the XML tools of your choice.

- as data amounts grow, the paradigm shifts from ETL to ELT, where the bulk of the transformations take place inside the (target) database. Therefore most of the fancy features provided by heavyweight ETL tools are rarely or never used, and the main workhorse is SQL. However, there is very little to boost the productivity of SQL generation and reuse. XDTL addresses this with metadata-based mappings and transformations served from a metadata repository, by using transformation templates instead of SQL generation, and by capturing the typical scenarios in task libraries for easy reuse.

- most of the heavyweight tools try to address every single conceivable problem, which makes even trivial tasks obscure and overly complex. They also aim to support every single database product, even if the chances of ever encountering most of them are almost zero. XDTL focuses on the most frequent scenarios and mainstream brands and puts the emphasis on productivity and efficiency.

- XDTL takes advantage of the general-purpose metadata repository of the MMX Framework, targeting a broad range of metadata-related activities and not locking the user into an ETL-specific and ETL-only repository from <insert your ETL tool vendor>.

Generic Transformations

September 29, 2008 17:12 by marx

Applications might require more complex relationships between abstract objects than are possible with simple static relationships (e.g. detailed information about the conversion from one data element to another). These requirements are fulfilled by implementing a concept of 'generic transformation', enabling the definition of transformations, mappings and transitions of any complexity and exploiting recursion to flatten complex transformations. A limited set of transformation types ('templates') is defined at the user-interface level to simplify the creation and management of these transformations.

Transformations are decomposed and stored as hierarchies of binary operations based on basic operators (arithmetic operators, predicates, functions), compound operators (templates built from basic operators) and extended relational algebra. Such decomposition enables storage of transformations of arbitrary complexity, making it possible to capture everything from arithmetic operations and simple mappings to complex statistical methods and lengthy ETL procedures in SQL.
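As a purely illustrative sketch, a simple derivation such as call_cost = call_length * rate + fee could be decomposed into a hierarchy of binary operations and serialized from the repository along these lines (hypothetical XML; the actual repository structures are not shown here):

<transformation target="call_cost">
    <!-- root node: a binary addition operator -->
    <operation operator="+">
        <!-- nested binary operation: recursion handles arbitrary depth -->
        <operation operator="*">
            <operand ref="call_length"/>
            <operand ref="rate"/>
        </operation>
        <operand ref="fee"/>
    </operation>
</transformation>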

Having semantic information (ontologies), business rules and technical metadata (data structures) together with transformation metadata, and relationships linking them all together, makes it possible to achieve functionality that would not be possible with separate metadata subsystems: metadata-driven processing environments, full impact analysis, data-storage-aware business dictionaries, etc.