
XDTL: Handling JSON and XML, Part 1

March 19, 2012 10:39 by mmx

Semi-structured data has increasingly taken over the role of a universal container for data exchange. JSON, XML and other formats now dominate an area that once belonged to fixed-format text files and CSV files. However, while CSV import and export is a no-brainer and a standard part of any data integration tool, JSON/XML can pose a real problem. Because the structure of a JSON/XML document is dynamic and sometimes quite complex, the traditional transformation approaches ETL tools use for structured data often fail here.

XDTL takes a different approach. As always, instead of trying to handle any conceivable semi-structured data of arbitrary complexity with an extremely sophisticated toolset, XDTL focuses on a handful of the most important and most frequent scenarios, with simple and lightweight tools. Three different scenarios are considered:

1. The documents are small but have a complex structure. The structure is known beforehand and is more or less static. This scenario mostly applies to feeds from the JSON/XML APIs that are popular in today's web environment, and is suited to building a connector (adapter) for a specific web application.

2. The XML data is medium to large but has a simple hierarchical structure, usually no more than 2-3 levels deep. The structure may or may not be known beforehand. This scenario fits medium-sized XML files where XML is used in place of the more traditional text (CSV) format.

3. The document is very big, probably with a complex structure that never changes. This scenario would most probably require an XDTL extension (a library task written in XDTL). XDTL provides the means to create such an extension, which could take advantage of a variety of facilities, e.g. XSLT transformations, database platform-specific bulk load utilities and additional scripting, encapsulating and hiding all the complexity from the user.

To handle the first scenario, the xdtl:decode command is introduced. xdtl:decode takes a URI (referencing a data feed) or a document name as the source and decodes it into internal Java structures (Array, HashMap) that can then be easily shredded with for loops and .get('<name>') methods and either converted into INSERT statements or a series of flat files following the structure of the data.

Here's a simple code fragment of decoding JSON: 

<xdtl:decode source="https://gdata.youtube.com/feeds/api/standardfeeds/most_viewed?alt=json" target="root" type="JSON"/>
<xdtl:log msg="${root.get('feed').get('title').get('$t')}"/>
<xdtl:log msg="${root.get('feed').get('updated').get('$t')}"/>
<xdtl:for item="entry" rowset="${root.get('feed').get('entry')}">
    <xdtl:log msg="${entry.get('id').get('$t')}"/>
    <xdtl:log msg="${entry.get('title').get('$t')}"/>
</xdtl:for>

That'll do it. One xdtl:decode command and a couple of nested loops scrape the relevant data fields; the rest is up to the designer. There is no need to create XPath patterns or XSLT transformations. The same example can be elaborated to decode XML data:

<xdtl:decode source="https://gdata.youtube.com/feeds/api/standardfeeds/most_viewed" target="feed" type="XML"/>
<xdtl:log msg="${feed.get('title').get('textvalue')}"/>
<xdtl:log msg="${feed.get('updated')}"/>
<xdtl:for item="entry" rowset="${feed.get('entry')}">
    <xdtl:log msg="${entry.get('id')}"/>
    <xdtl:log msg="${entry.get('title').get('textvalue')}"/>
</xdtl:for>

Internally, XML is represented as Java Array and HashMap objects. XML element text values are exposed as "textvalue" properties, and attributes are accessible by name.
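To make the mapping concrete, here is an illustrative sketch (the XML fragment is hypothetical; the attribute access pattern is the same one used with link.get('type') in the example below):

<!-- given a feed element like <title type="text">Most viewed</title>, the element becomes a HashMap: -->
<xdtl:log msg="${feed.get('title').get('textvalue')}"/> <!-- element text: Most viewed -->
<xdtl:log msg="${feed.get('title').get('type')}"/>      <!-- attribute value: text -->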

Finally, here's a more complete example shredding the same YouTube feed into INSERT statements. Four tables (FEED, LINK, ENTRY and CATEGORY) are filled with data, complete with parent-child foreign keys.

<xdtl:decode source="https://gdata.youtube.com/feeds/api/standardfeeds/most_viewed" target="feed" type="XML"/>

<xdtl:query source="INSERT INTO FEED VALUES ('${feed.get('id')}', '${feed.get('updated')}', '${feed.get('title').get('textvalue')}', '${feed.get('author').get('name')}');"/>

<xdtl:for item="link" rowset="${feed.get('link')}">
    <xdtl:query source="INSERT INTO LINK VALUES ('${feed.get('id')}', '${link.get('type')}', '${link.get('href')}');"/>
</xdtl:for>

<xdtl:for item="entry" rowset="${feed.get('entry')}">
    <xdtl:query source="INSERT INTO ENTRY VALUES ('${feed.get('id')}', '${entry.get('id')}', '${entry.get('published')}', '${entry.get('title').get('textvalue')}');"/>
    <xdtl:for item="category" rowset="${entry.get('category')}">
        <xdtl:query source="INSERT INTO CATEGORY VALUES ('${entry.get('id')}', '${category.get('scheme')}', '${category.get('term')}');"/>
    </xdtl:for>
</xdtl:for>

Actually, there is nothing to stop the script from shredding every single data field out of the XML feed with this method; it only takes a couple more xdtl:for loops. And last but definitely not least, the decoded data can serve as the source for successive calls to other APIs with xdtl:decode inside a loop, enabling recursive traversal of several related APIs.
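As a sketch of that last idea, an xdtl:decode nested inside an xdtl:for turns each decoded field into a parameter of the next call (the second feed URI and its structure are hypothetical):

<xdtl:decode source="https://gdata.youtube.com/feeds/api/standardfeeds/most_viewed" target="feed" type="XML"/>
<xdtl:for item="entry" rowset="${feed.get('entry')}">
    <!-- hypothetical related API, parameterized with a field from the outer feed -->
    <xdtl:decode source="https://gdata.youtube.com/feeds/api/videos/${entry.get('id')}/related" target="related" type="XML"/>
    <xdtl:for item="rel" rowset="${related.get('entry')}">
        <xdtl:log msg="${rel.get('title').get('textvalue')}"/>
    </xdtl:for>
</xdtl:for>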



The X Is For eXtensibility

September 11, 2011 19:24 by mmx

XDTL stands for eXtensible Data Transformation Language. Extensibility here means that new language elements can be added easily, without changing the XML schema that defines the core XDTL language. These extensions can, for example, be coded in XDTL and stored as XDTL packages, with task names identifying the extension elements. XDTL Runtime expects to find the extension element libraries in the directories listed in the extensions.path parameter of the xdtlrt.xml configuration file; the path list is scanned sequentially until a task with a matching name is found.

During package execution an extension is provided with a full copy of the current context. An extension gets access to every variable value present in the calling context, as well as to its own attribute values, which are converted into variables named after the attributes. From the extension's point of view all these values are 'read-write', but only those passed as variables retain their values after the extension element finishes. So, when passing values to an extension, variables can be seen as 'globals' that return values, and extension element attributes as 'locals' that get discarded.
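To illustrate, consider a call to the "show" extension used in the examples below; only the attribute-to-variable rule just described is assumed, and the extension body is a hypothetical sketch:

<!-- caller: the 'text' attribute becomes a variable named 'text' inside the extension -->
<ext:show text="sometext"/>

<!-- inside the 'show' extension task (hypothetical body) -->
<xdtl:log msg="${text}"/> <!-- prints 'sometext'; the value is discarded when the extension finishes -->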

The XDTL syntax definition (an XML schema) includes an any element that allows the XDTL language to be extended with elements not specified directly in the schema. The any element in XDTL is defined as

<xs:any namespace="##other" processContents="lax"/>

##other means that only elements from a namespace other than the namespace of the parent element are allowed. In other words, when the parser sees an unknown element it will not complain, but will assume it could be defined in some other schema. This prevents ambiguity in the XML Schema (Unique Particle Attribution). Setting the processContents attribute of an any element to "lax" states that if that 'other' schema cannot be obtained, the parser will not generate an error.
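For context, such a declaration typically sits inside a content model along these lines (a minimal sketch, not the actual XDTL schema):

<xs:complexType name="taskType">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
        <!-- ...the core XDTL elements are listed here... -->
        <xs:any namespace="##other" processContents="lax"/>
    </xs:choice>
</xs:complexType>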

 

So how does this work? We assume that our main script references an external XML schema whose elements are qualified with the prefix 'ext':

xmlns:ext="http://xdtl.org/xdtl-ext"

In this external schema a single element, "show", with a single attribute, "text", is defined. Such a schema could look roughly like this (a sketch, not necessarily the actual xdtl-ext schema):
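<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://xdtl.org/xdtl-ext"
           elementFormDefault="qualified">
    <xs:element name="show">
        <xs:complexType>
            <xs:attribute name="text" type="xs:string"/>
        </xs:complexType>
    </xs:element>
</xs:schema>

Here are some examples of what works and what doesn't.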

<ext:show text="sometext"/> works, as the external namespace with element "show" is referenced by the prefix 'ext'.

<show xmlns="http://xdtl.org/xdtl-ext" text="sometext"/> also works, as the namespace reference is 'embedded' in the "show" element.

<show text="sometext"/> does not validate, as the parser looks for the element "show" in the current schema (the error message "Invalid content was found starting with element 'show'" is produced).

<ext:show nottext="sometext"/> does not validate either (Attribute 'nottext' is not allowed to appear in element 'ext:show').

<ext:notshow text="sometext"/> validates but still does not work! Since the processContents attribute of the any element is "lax", the parser ignores the fact that no definition for the element is found. However, the XDTL Runtime complains, as it cannot find the element definition in the extension path list.

 

What if we wanted to use extensions without an accompanying XML schema? For that we remove the reference to the external schema from the script header and run the examples once again.

<ext:show text="sometext"/> no longer validates, as the prefix 'ext' is not defined. The same applies to all the other examples with a prefix in front of the extension element.

<show text="sometext"/> would not validate either, as the parser looks for the extension element in the current schema.

<show xmlns="http://xdtl.org/xdtl-ext" text="sometext"/>, however, validates and works! Although the parser cannot find the schema, it does not complain, thanks to the "lax" processContents attribute. As long as XDTL Runtime is able to find the library package containing the extension in the path list, everything is fine; otherwise it gives a Command 'Extension' failed error.

 

So here's the summary. Extended elements (commands) can be well-defined (having their syntax definitions in the form of an XML schema) or undefined (just the package, no XML schema), as the transformation designer sees fit. In the former case, extended elements are validated exactly as the core language elements are; in the latter case they pass without validation. If an undefined, non-validated extension element is executed and its invocation does not match the extension, a run-time error is generated.

 



Moving Data in XDTL: Wooden, Painted Red

August 4, 2011 00:37 by mmx

Transferring data from one place to another is the cornerstone of almost every data transformation job. XDTL has a very flexible and feature-laden fetch command for that job. The fetch element has the following attributes:

source - a query against the source database. source can contain either an explicit query or a variable containing the query.

target - the target table or output text file. Specifying the destination connection attribute indicates that target is a table. In case target is a text file, all the common text file attributes (delimiter, quote, null, encoding) also apply and can optionally be present.

overwrite - overwrite="1" causes an automatic TRUNCATE command against the target table.

rowset - specifies an XDTL variable (internally a Java array) to store the result set of the source query. rowset and target are mutually exclusive and must not be present in the same fetch element.

connection - the connection of the source query.

destination - the connection of the target table.

 

The most obvious use case for fetch is transferring data from a database table to a text file. The following xdtl command fetches the data specified by the query (source attribute) into a CSV file (target attribute) with ; as the delimiter and " as the quote symbol (the defaults; they can be overridden):

<xdtl:fetch source="SELECT col, col, ... FROM table WHERE cond" target="textfile" connection="$conn"/> 
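Should the defaults not fit, the common text file attributes listed above can be set explicitly (a sketch; the attribute names are those listed above, the values merely illustrative):

<xdtl:fetch source="SELECT col, col, ... FROM table WHERE cond" target="textfile" delimiter="," quote="'" encoding="UTF-8" connection="$conn"/>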

In case we need to process the results of the fetch operation further inside the xdtl package, fetch can store the result set in a variable (technically a Java array): 

<xdtl:fetch source="SELECT col, col, ... FROM table WHERE cond" rowset="myrowset" connection="$conn"/>
<xdtl:for item="myrow" rowset="$myrowset">
    <!-- do something here; columns must be addressed with Java array syntax: ${myrow[0]}, ${myrow[1]}, etc. -->
</xdtl:for>

We can go even further and let the read command - normally used to read data from text file(s) into database table(s) - treat this rowset variable as its input and store the data into a table with a compliant structure, i.e. one matching the rowset by position (type="ROWSET" tells read that the input data is coming from a Java array instead of a text file):

<xdtl:fetch source="SELECT * FROM sourcetable WHERE cond" rowset="myrowset" connection="$conn"/>
<xdtl:read source="$myrowset" target="targettable" type="ROWSET" overwrite="1" connection="$dest"/>

Finally, in case we have no need to process the data inside the xdtl package, the fetch command achieves the behavior of a typical DataPump task in most graphical ETL tools in just one step (here the second connection attribute, destination, is used to specify the target connection):

<xdtl:fetch source="SELECT * FROM sourcetable WHERE cond" target="targettable" connection="$conn" destination="$dest"/> 

Caution! fetch can transfer data between different servers, possibly from different vendors. It does automatic datatype conversions within reasonable limits (*), but all 'weird', non-standard data types (**) need to be taken care of in the source query statement. Source and target data structures are assumed to be identical; this requirement must be handled in the source query by presenting the data in a format and structure suitable for the target table. And, most important: data columns are matched by position, not by column names!
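For example, an exotic source type can be normalized in the source query itself before fetch ever sees it (a sketch; the table and column names are illustrative, and the exact CAST syntax depends on the source platform):

<xdtl:fetch source="SELECT id, CAST(geo_point AS VARCHAR(100)), CAST(doc AS VARCHAR(4000)) FROM sourcetable WHERE cond" target="targettable" connection="$conn" destination="$dest"/>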

Taking all this into consideration, all the basic data transfer tasks can be handled with ease. For more complex transformations, extensions or customized library tasks should be used. Naturally, fetch works best for reasonable data volumes; for large data other approaches (e.g. bulk load tools) should be employed.

(*) Datatype conversion is done in JDBC via Java Primitive Data Types.

(**)