mmx metadata framework
...the DNA of your data
MMX metadata framework is a lightweight implementation of the OMG Meta Object Facility built on relational database technology. The MMX framework is based on three general concepts:
Metamodel | The MMX Metamodel provides a storage mechanism for various knowledge models. The data model underlying the metadata framework is more abstract in nature than metadata models in general; it consists of only a few abstract entities.
Access layer | Object-oriented methods can be exploited, using inheritance to derive the whole data access layer from a small set of primitives created in SQL. The MMX Metadata Framework provides several diverse methods of data access to fulfill different requirements.
Generic transformation | A large part of the relationships between different objects in a metadata model are too complex to be described through simple static relations. Instead, a universal data transformation concept is put to use, enabling the definition of transformations, mappings and transitions of any complexity.
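
To give a rough idea of what 'only a few abstract entities' can mean in practice, here is a purely illustrative sketch; the table, column and view names below are invented for this example and do not represent the actual MMX schema:

-- Illustrative only: a hypothetical generic metamodel core (not the actual MMX tables).
CREATE TABLE md_object_type (
  object_type_id  integer PRIMARY KEY,
  type_name       varchar(100) NOT NULL        -- e.g. 'Table', 'Column', 'Business Term'
);

CREATE TABLE md_object (
  object_id       integer PRIMARY KEY,
  object_type_id  integer NOT NULL REFERENCES md_object_type,
  object_name     varchar(255) NOT NULL
);

CREATE TABLE md_property (
  object_id       integer NOT NULL REFERENCES md_object,
  property_name   varchar(100) NOT NULL,
  property_value  varchar(4000),
  PRIMARY KEY (object_id, property_name)
);

CREATE TABLE md_relation (
  from_object_id  integer NOT NULL REFERENCES md_object,
  to_object_id    integer NOT NULL REFERENCES md_object,
  relation_type   varchar(100) NOT NULL
);

-- A stand-in for one access-layer 'primitive': objects of any type with their properties.
CREATE VIEW md_object_properties AS
SELECT o.object_id, t.type_name, o.object_name, p.property_name, p.property_value
FROM md_object o
JOIN md_object_type t ON t.object_type_id = o.object_type_id
LEFT OUTER JOIN md_property p ON p.object_id = o.object_id;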

MMX Wiktionary: A Wiki With An Attitude

November 17, 2009 14:45 by kalle

MMX Wiktionary is a web-based collaborative application built on top of the MMX Metadata Framework to provide a semantic, wiki-like user interface for metadata creation and management. The main idea behind MMX Wiktionary is a structured, metamodel-driven, universal metadata repository combined with a wiki user interface. This combination lets users see and feel complicated metadata structures as conventional pages, without losing the formalization required by the defined metamodel. At the same time, there is nothing to stop you from using the Wiktionary for loosely formalized content creation, such as document management, through a predefined open schema/metamodel approach when needed. While it may seem easier to start without modeling, we do not see that as promising from an organizational metadata perspective. Our moderately modeled approach brings guided metadata creation to every end user in an intuitive and simplified form, and we do not sacrifice the semantics encoded in the metamodel on the journey towards simplification and usability. Content-dependent classification of pages, hierarchy management, and extraction and linking of named relations and properties during text creation are some examples of this mashup of usability and semantics.

The editor user interface is one of the biggest challenges in our wiki initiative. To avoid the usual wiki markup mess we use a WYSIWYG editor for rich content formatting and directed metadata creation. The editor is meant for end users who grew up with the mark-and-click editing style and do not know or remember how text creation was "programmed" in the WordStar or WordPerfect environment, and who do not have extensive "writing in Wikipedia" experience. The created text is parsed during saving and stored in the metadata repository in the structured form defined by the model. Rich formatting is stored in the text body using basic HTML markup, which is interpreted by the browser and the editor during reading and writing. Defined properties and created links are extracted from the text and stored in the metadata structures as property values or relations between objects. In addition to the saved text, the markup is the connection mechanism between the text and the stored properties and relations, giving the captured metadata a layout and presentation dimension while preserving its structure and machine processability.
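
As a rough, hypothetical illustration of the save step (the tables and values below are invented for this sketch and do not show the actual repository structures), a page whose text links the business term 'Customer' and carries a 'Steward' property might end up stored along these lines:

-- Hypothetical sketch only: a saved page, its extracted property and its extracted relation.
-- The body keeps basic HTML markup; extracted metadata is stored in separate structures.
INSERT INTO wiki_page (page_id, title, body_html)
VALUES (101, 'Order', '<p>An <b>Order</b> is placed by a <a href="/term/Customer">Customer</a>.</p>');

-- A property captured in the editor and extracted from the text on save.
INSERT INTO page_property (page_id, property_name, property_value)
VALUES (101, 'Steward', 'Sales Operations');

-- The in-text link stored as a named relation between two objects (102 = the 'Customer' page).
INSERT INTO page_relation (from_page_id, to_page_id, relation_type)
VALUES (101, 102, 'refersToTerm');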

Some of the keywords and topics in our Wiktionary initiative that keep us busy: 

  • wiki style ui and wysiwyg editor
  • usability and semantics, integration of user interface and metadata
  • community driven content tagging for business glossary creation
  • page templates and metadata driven layout
  • history and versioning
  • discussion forum and commentaries
  • import and export

XDTL: Template-Based SQL Generation

November 6, 2009 22:21 by marx

SQL is, and probably will remain, the main workhorse behind any ETL (and especially the ELT flavour of ETL) tool. Automating SQL generation has arguably always been the biggest obstacle in building an ideal ETL tool - i.e. one that is completely metadata-driven, has a small footprint, supports multiple platforms on a single code base... and, naturally, is capable of generating complex SQL in an easy and flexible manner, with no rocket scientists required nearby. While SQL stands for Structured Query Language, ironically the language itself is not too well 'structured', and the abundance of vendor dialects and extensions does not help either. 

Attempts to build an SQL generator supporting the full feature list of the SQL language have generally fallen into one of two camps: one tries to create a graphical click-and-pick interface that would encompass the syntax of every single SQL construct, the other designs an even more high-level language or model to describe SQL itself, a kind of meta-SQL. The first approach usually ends up limited to simple SQL statements, appropriate mostly for SELECTs while struggling with UPDATEs and INSERTs, and tied to a single vendor dialect.  

The second approach tends to drown in the complexity of SQL itself. In theory, one could decompose all SQL statements into series of binary expressions, store them away, and (re)assemble them into SQL statements again as needed, driven by the syntax of a particular SQL dialect. In practice, however, this approach usually fails to produce something usable, mostly because SQL is too loosely defined (considering all those vendors and dialects) and trying to cover everything just results in a system too cumbersome for anyone to use. The result would probably be an order of magnitude more complex to use than just hand-coding SQL statements, even with several parallel code bases. And that's exactly what the developers would do: invent a method to bypass the abstract layer and hand-code SQL directly.

Enter Template-Based SQL Generation. Based on our experience (and tons of ETL code written) we have extracted a set of SQL 'patterns' common to ETL (ELT) tasks. The patterns are converted into templates for processing by a template engine (e.g. Apache Velocity), each one realizing a separate SQL fragment, a full SQL statement or a complete sequence of commands implementing a complex process. The template engine merges patterns and mappings into executable SQL statements; instead of going as deep as full decomposition, we only separate and extract the mapping (structure) and template (process) parts of SQL. This limits us to a set of predefined templates, but anyone can add new ones or customize the existing ones.

The important thing about this is: templates are generic and can be used with multiple different mappings/data structures, and the mappings are generic as well and can be used with multiple different patterns/templates. The template engine 'instantiates' mappings and templates to create executable SQL statement 'instances', which brings us closer to an OO mindset. The number of tables joined, the number of columns selected, the number of WHERE conditions etc. is arbitrary and driven by the contents of the mappings only, i.e. well-designed templates are transparent to the level of complexity of the mappings. The same template will produce quite different SQL statements driven by minor changes in the mappings.

As an example, a 'basic' template-driven INSERT..SELECT statement might look like this:

INSERT INTO network (
caller
,receiver
,calls_no
)
SELECT
c.cust_a AS caller
,c.cust_b AS receiver
,c.calls AS calls_no
FROM
call c
LEFT OUTER JOIN
network n ON n.network_id = c.network_id
WHERE
...

Indicating that three consecutive mappings are actually to be treated as one complex statement with subqueries would change the generated SQL to:

INSERT INTO network (
caller
,receiver
,calls_no
)
SELECT
c.cust_a AS caller
,c.cust_b AS receiver
,c.calls AS calls_no
FROM
(SELECT
DISTINCT a.cust_id AS cust_a
,b.cust_id AS cust_b
,c.call_type::integer AS type
,c.call_length::integer AS length
,c.call_date::date AS date
FROM
(SELECT
DISTINCT r.call_type::integer AS call_type
,r.call_length::integer AS call_length
,r.call_date::date AS call_date
FROM
raw_cdr r
...

On the other hand, we might prefer cascading INSERTs through temporary tables for performance reasons, which would morph the SQL into:

SELECT
DISTINCT r.call_type::integer AS call_type
,r.call_length::integer AS call_length
,r.call_date::date AS call_date
INTO TEMP TABLE cdr
FROM
raw_cdr r
WHERE
... 

Selecting Oracle as the target platform would switch the same template over to Oracle syntax producing:

CREATE TABLE cdr AS
SELECT
DISTINCT r.call_type AS call_type
,r.call_length AS call_length
,r.call_date AS call_date
FROM
raw_cdr r
WHERE
... 

To accomplish all (or at least a lot) of this we have (so far) assembled two template 'libraries'. The MMX XDTL Basic SQL Library covers a wide range of 'building blocks' for implementing complex data processing command chains: basic INSERT..SELECT, complex INSERT..SELECT with subqueries, cascaded (staged) INSERT..SELECT, UPDATE..FROM etc. The MMX XDTL Basic ELT Library includes more complex multi-step patterns used in typical ELT scenarios focusing on single-table synchronisation: Full Replace, Incremental Load, Upsert, History Load etc. These pattern libraries serve as reference templates and are easily customizable to fit the unique characteristics of a specific use case. 
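
As a rough illustration only (the staging table stg_network is invented for this sketch, and the actual library templates are not reproduced here), an Upsert over the network table from the examples above could expand into a two-step UPDATE..FROM plus INSERT..SELECT sequence along these lines:

-- Step 1: update target rows that already exist (PostgreSQL-style UPDATE..FROM).
UPDATE network n
SET calls_no = s.calls_no
FROM stg_network s
WHERE n.caller = s.caller
  AND n.receiver = s.receiver;

-- Step 2: insert source rows that are not present in the target yet.
INSERT INTO network (caller, receiver, calls_no)
SELECT s.caller, s.receiver, s.calls_no
FROM stg_network s
WHERE NOT EXISTS (
  SELECT 1
  FROM network n
  WHERE n.caller = s.caller
    AND n.receiver = s.receiver
);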

XDTL: An Architectural Perspective

October 11, 2009 21:13 by marx

XDTL stands for eXtensible Data Transformation Language (see the previous post). It is an XML-based language for describing the data transformations most often used in loading data into data warehouses, or in building complex data processing tasks consisting of series of data manipulations. The XDTL language definition (XML Schema) has its namespace defined here: 

xmlns:xdtl="http://xdtl.org/xdtl"
xsi:schemaLocation="http://xdtl.org/xdtl xdtl.xsd" 

Note: The schema is non-normative and only provided as a means to validate XDTL instances; as such, it is naturally subject to change as the language evolves.

(1) Runtime Engine interpreting XDTL scripts. XDTL is just a language used to build the scripts describing data transformations, so it needs an execution mechanism to (pre)process and run those scripts. An XDTL engine (interpreter) assembles the scripts, mappings and templates into a series of executable commands basically consisting of file and database operations and runs them. There can be more than one XDTL runtime, each one designed for its own purpose and implementing a specific subset of the language definition. An XDTL runtime could also be embedded into another system to provide the low-level plumbing for an application that has to accomplish some ELT functions internally. 

(2) Mappings, stored either in the MMX Repository or directly inside an XDTL script. The mappings concept is based on the ideas laid out in [1]. Mappings express the structural and data dependencies between data 'sources' and 'targets' during different stages of a transformation process and "...describe all data movement-related dependencies as a set of Mapping instances. One instance represents a "single-stage" link between a set of source Entity instances and a set of target Entity instances where every entity plays only a single role, either source or target." [1] There are three basic types of mapping instances: Filter-Mapping (corresponding to SQL's WHERE and HAVING clauses), Join-Mapping (JOINs) and Aggregate-Mapping (GROUP BYs).
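
As a quick, made-up SQL illustration of the three mapping types (the tables and columns here are invented for the sketch, not taken from the repository):

SELECT   c.region, SUM(c.calls) AS calls_total
FROM     call c
JOIN     customer a ON c.cust_a = a.cust_id      -- Join-Mapping
WHERE    c.call_date >= DATE '2009-01-01'        -- Filter-Mapping (WHERE)
GROUP BY c.region                                -- Aggregate-Mapping
HAVING   SUM(c.calls) > 1000;                    -- Filter-Mapping (HAVING)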

The implementation of the mappings concept in XDTL involves a set of four collections: Sources (the source tables), Target (the target table), Columns (column mappings accompanied by attributes like IsJoinKey, IsUpdateableColumn etc.) and Conditions (conditions used in JOIN, WHERE and HAVING clauses). Mappings are either imported from the MMX Metadata Repository in XML format during execution, included from an external file, or explicitly defined in the XDTL script. An arbitrary number of mappings can be cascaded to express transformations of very high complexity. Storing mappings in the Repository opens up endless opportunities for using the same information in various other applications, e.g. Impact Analysis or Data Quality tools.

(3) SQL templates turned into executable SQL statements. XDTL being an ELT language, the SQL statement is its single most important functional part. The big question with SQL automation is: how far do you want to go in substituting SQL code with something more abstract? In theory, you could decompose all your SQL statements into series of binary expressions, store them away, and assemble them into SQL statements again as needed, driven by the syntax of one particular SQL dialect. However, this approach usually fails to produce something useful - mostly because SQL is too loosely defined (considering all those vendors and dialects) and trying to cover everything just results in a system too cumbersome for anyone to use. The result would be a sort of 'metaSQL' describing 'real SQL' that is probably an order of magnitude more complex to maintain than hand-coded statements, even with several parallel code bases. And that's exactly what the developers would do: invent a mechanism to bypass the abstract layer and hand-code SQL directly.

(4) Template Engine. Based on our experience (and tons of ETL code written) we have extracted a set of SQL 'patterns' common to ETL (ELT) tasks. The patterns are converted into templates for processing by a template engine (Velocity [2] in particular). The template engine merges patterns and mappings into executable SQL statements. So instead of going as deep as full decomposition, we only separate the mapping (structure) and template (process) parts of SQL. This limits us to a set of predefined templates, but new ones can always be added. The templates are generic and can be used with multiple different mappings/data structures; the mappings are generic as well and can be used in multiple different patterns/templates. The template engine 'instantiates' mappings and templates to create executable SQL statement 'instances'.

As an example, provided with the proper mappings, this simple template 

...
#macro( insertClause $Cols )
#foreach ($col in $Cols )
 $col.Target#if( $velocityHasNext ),#end
#end
)
#end
 
#macro( selectClause $Cols )
#foreach ( $col in $Cols )
 $col.Source#if( $col.Type )::$col.Type#end AS $col.Target
#if( $velocityHasNext ) ,#end
#end
#end
 
#macro( fromClause $Srcs $Conds )
#foreach ( $cond in $Conds )
#foreach ( $src in $Srcs )
#if( $cond.Id == $velocityCount )
 $src.Name $src.Alias ON $cond.Expr
#if( $velocityHasNext ) 
JOIN#end
#end
#end
#end
#end
 
#macro( whereClause $Conds )
#foreach ( $cond in $Conds )
#if( $velocityCount > 1 )AND #end
 $cond.Expr
#end
#end
 
## generic insert-select statement
#set ($tgt = $Target)
INSERT INTO $tgt.Name (
#insertClause ( $Columns )
SELECT 
#selectClause ( $Columns )
FROM 
#fromClause ( $Sources $Conditions )
WHERE 
#whereClause ( $Conditions )

...

would produce the following SQL statement:

INSERT INTO call (
 cust_a, cust_b, type, length, date)
SELECT
 a.cust_id AS cust_a
 , b.cust_id AS cust_b
 , call_type::integer AS type
 , call_length::integer AS length
 , call_date::date AS date
FROM
 cdr c
JOIN customer a
 ON c.phone_a = a.phone_no
JOIN customer b
 ON c.phone_b = b.phone_no
WHERE
 c.phone_a IS NOT NULL

This pretty simple template can actually produce a lot of quite different (and much more complex) SQL statements, all following the same basic pattern (INSERT..SELECT from a multiple-table join), which is probably one of the most frequent patterns in ELT processes. Of course, in an ideal world, a runtime engine would also have zero footprint and zero overhead, and support multiple platforms and multiple SQL dialects...

[1] Stöhr, T.; Müller, R.; Rahm, E.: An Integrative and Uniform Model for Metadata Management in Data Warehousing Environments, 1999. 

[2] http://velocity.apache.org/