XML Importer overview (a.k.a. TCP 2.0)

Authors:

Rob van Maris (Finalist IT Group)

 

Erik Visser (Finalist IT Group)

Date:

December 24th, 2001

Introduction

The goal of the XML Importer is to extend MMBase with new XML import facilities that support bulk import of data from different sources (e.g. third parties).

This document gives an overview of the XML Importer that is contributed to the open source distribution of MMBase. The XML Importer is largely based on an implementation that was built and tested for the VPRO.

Before the XML Importer project, one efficient way to bulk import data was by means of XML-defined transactions, using the vocabulary defined by the Temporary Cloud Project (TCP). While these semantics are sufficient to populate empty MMBase tables with new objects, they are very limited in other situations.


Import scenarios supported by TCP semantics

MMBase supports two import scenarios through TCP semantics. The simplest scenario for bulk data input is adding new object graphs:
- Create a transaction context.
- Create new objects within the transaction context.
- Create new relations between these objects.
- Commit the transaction.
As a result the new objects and relations are added to MMBase. If similar objects were already present in MMBase, this would result in duplicates.

To demonstrate how this translates to TCP semantics, the following example adds two new objects (of type "movies" and "persons") with one relation ("director"):

<transactions>
  <create>

    <createObject id="m1" type="movies">
      <setField name="title">Psycho</setField>
      <setField name="year">1960</setField>
    </createObject>

    <createObject id="p1" type="persons">
      <setField name="firstname">Alfred</setField>
      <setField name="lastname">Hitchcock</setField>
    </createObject>

    <createRelation type="director" source="m1" destination="p1"/>
  </create>
</transactions>

The second and slightly more advanced TCP scenario adds new object graphs, involving existing MMBase objects as well:
- Create a transaction context.
- Create new objects within the transaction context.
- Access existing MMBase objects; this creates copies within the context.
- Create new relations between the objects.
- Commit the transaction.
This results in both new objects and new relations, involving both new and existing objects.

The disadvantage of the latter scenario is that the MMBase objects involved have to be explicitly identified by their MMBase-id. Because of this, we cannot define such a transaction without prior inspection of existing MMBase objects.

This example demonstrates how an existing MMBase object can be accessed within a transaction, to set its fields to new values:

<accessObject id="p1" mmbaseId="12345">
  <setField name="firstname">Alfred</setField>
  <setField name="lastname">Hitchcock</setField>
</accessObject>

TCP 2.0: a new sophisticated import scenario: Find and Merge

XML Importer introduces a more sophisticated scenario that:
- presents a number of objects and relations that should be present in MMBase without previous knowledge of which actual objects and relations are already present in MMBase.
- uses object (sub) graphs instead of the mmbaseId.

Formally speaking: we present MMBase with the object (sub)graphs that we want to be in MMBase. This approach focuses on the desired result, instead of detailing the steps to be taken. It shifts the burden of detailing all the steps to the side of MMBase.

To see how to accomplish this, let's look at an example. We want to add the same objects ("persons" and "movies") and relation ("director") as in the previous examples, but following the proposed scenario, which avoids duplicates by taking into account the objects already present in MMBase.

1 - Create transaction context.
2 - Create the objects within the transaction context (further called "input objects").
3 - Create the relation between these objects.
4 - Look for a "movies" object in MMBase that is similar to the input movies object. If such an object is found, access it within the context (further called "access object"), and have it replace the original input object. The MMBase object then becomes the new source of the relation.
5 - Look for a "persons" object in MMBase that is similar to the input persons object. If such an object is found, access it within the context, and have it replace the original input object. The MMBase object then becomes the new destination of the relation.
6 - Look for a similar relation in MMBase. If such a relation is found, delete the original input relation.
7 - Commit the transaction.

This transaction will result in both objects and the relation being present in MMBase, regardless of what was present before the transaction, and without duplicates being created. This is the behaviour we are looking for.
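
The steps above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class FindAndMergeSketch and its Node and Relation types are hypothetical in-memory stand-ins, not MMBase classes; "similar" is simplified to "same type and equal fields"; and the transaction context and commit are not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for MMBase; none of these names are MMBase API.
class FindAndMergeSketch {

    /** An object with a type and a set of string fields. */
    static final class Node {
        final String type;
        final Map<String, String> fields = new HashMap<>();

        Node(String type, String... keyValues) {
            this.type = type;
            for (int i = 0; i + 1 < keyValues.length; i += 2) {
                fields.put(keyValues[i], keyValues[i + 1]);
            }
        }
    }

    /** A typed relation between two stored nodes. */
    static final class Relation {
        final String type;
        final Node source;
        final Node destination;

        Relation(String type, Node source, Node destination) {
            this.type = type;
            this.source = source;
            this.destination = destination;
        }
    }

    final List<Node> nodes = new ArrayList<>();
    final List<Relation> relations = new ArrayList<>();

    /** Steps 4 and 5: find a stored node similar to the input, or return null. */
    Node findSimilar(Node input) {
        for (Node stored : nodes) {
            // Assumption: "similar" means same type and identical field values.
            if (stored.type.equals(input.type) && stored.fields.equals(input.fields)) {
                return stored;
            }
        }
        return null;
    }

    /** Replaces the input node by a similar stored node, or stores the input node. */
    private Node resolve(Node input) {
        Node existing = findSimilar(input);
        if (existing != null) {
            return existing;
        }
        nodes.add(input);
        return input;
    }

    /** Imports two objects and their relation without creating duplicates. */
    void importRelation(Node source, String relationType, Node destination) {
        Node s = resolve(source);
        Node d = resolve(destination);
        // Step 6: drop the input relation if a similar relation already exists.
        for (Relation r : relations) {
            if (r.type.equals(relationType) && r.source == s && r.destination == d) {
                return;
            }
        }
        relations.add(new Relation(relationType, s, d));
    }

    public static void main(String[] args) {
        FindAndMergeSketch cloud = new FindAndMergeSketch();
        // Importing the same graph twice must not create duplicates.
        for (int i = 0; i < 2; i++) {
            cloud.importRelation(
                new Node("movies", "title", "Psycho", "year", "1960"),
                "director",
                new Node("persons", "firstname", "Alfred", "lastname", "Hitchcock"));
        }
        System.out.println(cloud.nodes.size() + " nodes, " + cloud.relations.size() + " relation(s)");
        // prints "2 nodes, 1 relation(s)"
    }
}
```

Running the import twice leaves the store unchanged the second time, which is the behaviour the scenario aims for.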

Find and Merge scenario details

Note that the steps presented in the previous paragraph are actually very straightforward, and can easily be formalized to cover the general case of many objects and relations. Also note that the results depend on a notion of object similarity, so let's look into what we really mean by that (having avoided using the word "equality").

Object similarity

The notion of two MMBase objects, candidate or present, being considered "similar" is not as straightforward as it may seem.

One definition might be to consider them similar if all their fields are equal. But in practice, this definition might not be restrictive enough in some cases, while being too restrictive in others. For instance, we can think of situations where it is only acceptable to consider objects equal if the objects they have relations with are also equal (i.e. movies should not only have the same title, but also the same director).

On the other hand, data entry by hand into text fields may introduce all kinds of minor errors, resulting in objects that have different field values while being meant to represent the same object. In this case object similarity may translate into "similar enough", using fuzzy-logic comparison. Clearly, if we want to bulk import data while relying heavily on a concept of object similarity, we must be able to specify our own criteria for finding a similar object.
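
To make the difference between strict and "similar enough" criteria concrete, here is a minimal Java sketch. Everything in it is an assumption made for illustration: the Criterion interface and the exact and fuzzy factories are hypothetical names, not the SimilarObjectFinder interface of the XML Importer, and edit distance on normalized strings is just one possible fuzzy comparison.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch of pluggable similarity criteria; not the XML Importer API.
class SimilaritySketch {

    /** Decides whether two objects, given as field maps, are similar. */
    interface Criterion {
        boolean similar(Map<String, String> a, Map<String, String> b);
    }

    /** Strict criterion: the named fields must be exactly equal. */
    static Criterion exact(final String... fieldNames) {
        return (a, b) -> {
            for (String field : fieldNames) {
                if (!Objects.equals(a.get(field), b.get(field))) {
                    return false;
                }
            }
            return true;
        };
    }

    /** Fuzzy criterion: each named field may differ by at most maxDistance edits. */
    static Criterion fuzzy(final int maxDistance, final String... fieldNames) {
        return (a, b) -> {
            for (String field : fieldNames) {
                int distance = editDistance(normalize(a.get(field)), normalize(b.get(field)));
                if (distance > maxDistance) {
                    return false;
                }
            }
            return true;
        };
    }

    /** Ignores case and surrounding whitespace; null becomes the empty string. */
    static String normalize(String value) {
        return value == null ? "" : value.trim().toLowerCase(Locale.ROOT);
    }

    /** Levenshtein edit distance, computed row by row. */
    static int editDistance(String s, String t) {
        int[] previous = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            int[] current = new int[t.length() + 1];
            current[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(Math.min(current[j - 1] + 1, previous[j] + 1),
                                      previous[j - 1] + cost);
            }
            previous = current;
        }
        return previous[t.length()];
    }
}
```

With these definitions, a person entered as "Hitchcok " is rejected by exact("lastname") but accepted by fuzzy(1, "lastname") when compared with "Hitchcock".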

Merging objects

In our example we replace our input object by the object already present in MMBase. This is not always desirable. For example the input object may contain a person's birthdate, while this field is empty in the MMBase object. In that case we might want the birthdate value to be copied to the existing MMBase object as well. Generalizing this approach we could say that we create a new object based on two objects that are both replaced by it - we will refer to this as "merging" objects. For objects to be merged, we have to specify how the fields and relations of the resulting object are set, based on the original objects/relations. Note that we may choose to drop relations that involved one of the original objects.
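
As an illustration of one possible merge rule, the Java sketch below keeps the values of the existing MMBase object and copies a value from the input object only where the existing field is empty or missing. The class MergeSketch and its merge method are hypothetical and only cover fields; how relations are retained or dropped is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical field-merge rule; not the ObjectMerger interface of the XML Importer.
class MergeSketch {

    /**
     * Returns a new field map based on the existing object.
     * A value from the input object is used only when the existing
     * object has no value (null or empty) for that field.
     */
    static Map<String, String> merge(Map<String, String> existing, Map<String, String> input) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, String> entry : input.entrySet()) {
            String current = result.get(entry.getKey());
            if (current == null || current.isEmpty()) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> existing = new LinkedHashMap<>();
        existing.put("lastname", "Hitchcock");
        existing.put("birthdate", "");

        Map<String, String> input = new LinkedHashMap<>();
        input.put("lastname", "HITCHCOCK");
        input.put("birthdate", "1899-08-13");

        // The existing lastname is kept; the empty birthdate is filled in.
        System.out.println(merge(existing, input));
    }
}
```

Other rules (input always wins, or a per-field choice) fit the same shape, which is why the merge behaviour has to be specifiable by the user.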

Unreferenced objects

Since relations can be dropped when objects are merged, input objects may lose all their relations with other objects. Some types of objects are only of interest because of the relations they have with other objects. We have to be able to specify that we don't want these objects added to MMBase when they have become "unreferenced", i.e. not related to other objects.
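
A minimal Java sketch of this pruning step follows. It assumes each input object carries a boolean flag saying whether it should be dropped once unreferenced (this document introduces such a flag further on as disposeWhenNotReferenced); the class UnreferencedSketch and its types are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical pruning of unreferenced input objects; not MMBase code.
class UnreferencedSketch {

    /** A relation between two input objects, identified by their ids. */
    static final class Relation {
        final String source;
        final String destination;

        Relation(String source, String destination) {
            this.source = source;
            this.destination = destination;
        }
    }

    /**
     * Returns the ids of the input objects that stay in the transaction.
     * An object is dropped only if it is flagged for disposal AND no
     * remaining relation references it.
     */
    static Set<String> retained(Map<String, Boolean> disposeFlags, List<Relation> remainingRelations) {
        Set<String> referenced = new HashSet<>();
        for (Relation relation : remainingRelations) {
            referenced.add(relation.source);
            referenced.add(relation.destination);
        }
        Set<String> kept = new LinkedHashSet<>();
        for (Map.Entry<String, Boolean> entry : disposeFlags.entrySet()) {
            boolean dispose = entry.getValue();
            if (referenced.contains(entry.getKey()) || !dispose) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

For example, with flags {m1=false, p1=true, p2=true} and a single remaining relation from m1 to p1, the result contains m1 and p1 but not p2.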

Feedback

Processing of a transaction may fail, either because more than one similar object is found for an input object or because an error occurs. This does not stop the whole import: the current transaction is cancelled and the Importer continues with the next transaction. All transactions without duplicates or errors are committed to MMBase.

If more than one similar object is found for an input object, the complete transaction is appended to a report file.
In the next stage duplicates_transactions.XML is processed. The user has to be consulted to decide which merge result is preferred.

Example: suppose there is an input object A and two similar objects are found (B and C). The user is then shown on screen (probably in a JSP page) the original input object (A) and, for every similar object, the merge result: (A+B) and (A+C). The user has to select which merge result is preferred. Processing of this corrected transaction can continue in a next processing cycle.

All other kinds of errors, e.g. a syntax error (XML not conforming to the DTD), an object field not found or an object not found, are handled as follows: the transaction processing is cancelled, an entry is written to a report file and the full transaction is written to a file (e.g. error_transactions.XML). The user can consult the report file afterward to review the transactions that went wrong. The report file will contain all information necessary to correct the problems and give these transactions a second try.
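
The control flow described in this section (cancel the failing transaction, report it, continue with the next) can be sketched as follows. The Transaction interface, the tx helper and the in-memory report list are hypothetical simplifications; the real Importer writes to report files instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical import loop that survives failing transactions; not MMBase code.
class ImportLoopSketch {

    /** A unit of work that either completes or throws. */
    interface Transaction {
        String id();
        void execute() throws Exception;
    }

    /** Helper for the demo: a transaction that fails when failure is not null. */
    static Transaction tx(final String id, final String failure) {
        return new Transaction() {
            @Override
            public String id() {
                return id;
            }

            @Override
            public void execute() throws Exception {
                if (failure != null) {
                    throw new IllegalStateException(failure);
                }
            }
        };
    }

    /**
     * Runs all transactions in order. A failing transaction is recorded in
     * the report and skipped; the ids of the successful ones are returned.
     */
    static List<String> run(List<Transaction> transactions, List<String> report) {
        List<String> committed = new ArrayList<>();
        for (Transaction transaction : transactions) {
            try {
                transaction.execute();
                committed.add(transaction.id());
            } catch (Exception e) {
                // Cancel only this transaction and keep going.
                report.add(transaction.id() + ": " + e.getMessage());
            }
        }
        return committed;
    }

    public static void main(String[] args) {
        List<Transaction> transactions = new ArrayList<>();
        transactions.add(tx("t1", null));
        transactions.add(tx("t2", "more than one similar object found"));
        transactions.add(tx("t3", null));

        List<String> report = new ArrayList<>();
        System.out.println("committed: " + run(transactions, report));
        System.out.println("report: " + report);
    }
}
```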

Extensions to the TCP-semantics: TCP 2.0

To implement the Find and Merge scenario, an extension of the TCP semantics is necessary. These extended semantics allow us to instruct the Transaction Handler to carry out the tasks.

new object operator: mergeObjects (objectType)

For all input objects within the transaction context of the type specified by objectType, perform the following actions:
1 - Look for similar objects in MMBase and elsewhere in the transaction context (see similarObjectFinder).
2 - If a similar object is found, access that object, and merge it with the input object (see objectMerger). This may affect the relations of the objects as well.
3 - For all relations of the merged object, check whether it duplicates a relation already in MMBase. In that case, drop the relation from the transaction context. (Note: relations are considered to be duplicates if both are at least of the same type and reference the same (candidate or existing) MMBase objects, see ObjectMerger).
4 - For all relations that are dropped, check whether this causes an input object to become unreferenced. In that case drop the input object from the transaction context if required by the user.

objectFinder

- class that implements the SimilarObjectFinder interface (see below); this provides methods to search both the persistent cloud and the transaction for similar objects.
- objectFinder parameters - parameters to be passed to the SimilarObjectFinder instance (support for parametrized implementations of similarObjectFinder).

objectMerger

- class that implements the ObjectMerger interface (see below); this provides methods to merge two objects. It also determines which relations of the merged objects are retained and which are dropped.
- objectMerger parameters - parameters to be passed to the objectMerger instance (support for parametrized implementations of ObjectMerger).

new parameter for createObject: disposeWhenNotReferenced

This sets whether a new object is to be dropped when it becomes unreferenced.

new parameter for transactions: reportFile

Specifies a file to use to report transactions that failed.

SimilarObjectFinder interface

methods:

ObjectMerger interface

methods:

TransactionHandler enhancements (performance and usability)

Further enhancements will make TCP functionality easier to use than it is now (due to its SCAN heritage).

Performance

The current TransactionHandler uses a DOM-parser to parse the transactions from XML. TCP2.0 will replace this with a SAX-parser to improve performance and to reduce memory usage, especially when importing large files (e.g. tens of MB).
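
The benefit of SAX is that elements are handled as they are read, so the document never has to be held in memory as a tree. The sketch below uses the standard Java SAX API (javax.xml.parsers and org.xml.sax) to count elements in a transactions document; the class SaxCountSketch is a hypothetical example and not part of the TransactionHandler.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example of streaming XML handling with SAX.
class SaxCountSketch {

    /** Counts the elements with the given name without building a DOM tree. */
    static int countElements(String xml, final String elementName) {
        final int[] count = {0};
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attributes) {
                        // Called once per start tag, in document order.
                        if (elementName.equals(qName)) {
                            count[0]++;
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml =
            "<transactions><create>"
            + "<createObject id=\"m1\" type=\"movies\"/>"
            + "<createObject id=\"p1\" type=\"persons\"/>"
            + "<createRelation type=\"director\" source=\"m1\" destination=\"p1\"/>"
            + "</create></transactions>";
        System.out.println(countElements(xml, "createObject")); // prints 2
    }
}
```

A real handler would build and execute each transaction in the endElement callback for create or open, so that only one transaction at a time is kept in memory.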

Usability

The only way to use the current TransactionHandler is by including the transaction code in a SCAN page. TCP2.0 will extend this with two mechanisms that enable the TransactionHandler to be accessed more directly.

- XML-files: A method will be provided that reads and executes transactions directly from a valid XML-file.

- Programmatically: TCP2.0 will be implemented in a number of classes/methods that mirror the syntax used in XML. These provide a new interface to TCP, giving direct access to its functionality without the need to translate these to XML first.

This example performs the same actions as the first XML-example above:

Transaction transaction = Transaction.createTransaction(uti, null, false, 60);
// Create the movie object.
TmpObject m1 = transaction.createObject("m1", "movies", false);
m1.setField("title", "Psycho");
m1.setField("year", "1960");
// Create the person object.
TmpObject p1 = transaction.createObject("p1", "persons", false);
p1.setField("firstname", "Alfred");
p1.setField("lastname", "Hitchcock");
// Relate the movie (source) to the person (destination).
transaction.createRelation(null, "director", "m1", "p1");
transaction.commit();

Without going into the details of the interfaces, it is easy to see that this matches the XML syntax very closely. In this way TCP2.0 provides an easy-to-use interface for transactions.

TCP 2.0 Syntax

The complete syntax for the XML-compliant TCP2.0 transaction language is presented here. See also the Transactions.dtd. TCP2.0 is an extended version of the TCP. See TCP project for details.

The TCP 2.0 language is hierarchical. There is one 'Transactions context', which can contain several 'Transaction contexts', each of which can contain several 'Object contexts' or 'Object merge contexts'. Within an 'Object context' fields can be defined. Within an 'Object merge context' parameters can be defined.

(The names 'Transactions context' and 'Transaction contexts' might lead to some confusion. We are tied to those names because TCP 2.0 has to be backwards compatible with TCP.)

Transactions context

The TCP2.0 code might be embedded in some other code (SCAN or some other language). So the first tag indicates that what follows is TCP2.0: the transactions tag.

<transactions [ exceptionPage="ex-page-def" ] [ key = "password" ] [ reportFile="report.txt" ]>
</transactions>

Note, all symbols are part of the definition except "[" and "]" which denote optional elements.

The parameter exceptionPage specifies a (s)html page that is shown whenever an error occurs handling the transactions. This can either be a syntax error or it can be an error resulting from an erroneous operation. The key parameter is used to access servers that require a password for transactions. This facility gives extra security.

Transaction contexts

Within the 'Transactions context' one can specify zero or more 'Transaction contexts'. There are two 'Transaction contexts' in which no objects can be manipulated. These contexts only affect the transaction itself.

<commit id="id"/>
<delete id="id"/>

For object manipulation there are two other 'Transaction contexts'.

<create [ id ="id"] [ commit="true"(default) / "false" ] [ timeOut="number"] >
</create>

<open id="id" [ commit="true"(default) / "false" ] >
</open>

Within these two transaction context types:
- objects can be manipulated (TCP)
- objects can be merged (TCP 2.0)
For details see the next sections.

Object contexts

Within a 'Transaction context' one can specify zero or more 'Object contexts'. Although a relation might logically differ from an object, for MMBase a relation is just another object.

There are six different 'Object context' types. Two of them only affect the object as a whole (no fields can be manipulated).

<deleteObject id="id"/>

<markObjectDelete mmbaseId="mmbaseId" [ deleteRelations="true" / "false"(default) ] />

For object manipulation (in fact: manipulation of the fields within the object) there are four other object context types.

<accessObject mmbaseId="mmbaseId" [ id ="id" ] >
</accessObject>

<createObject type="MMbase-type" [ id ="id" ] [ disposeWhenNotReferenced="true" / "false"(default) ] >
</createObject>

<openObject id ="id" >
</openObject>

<createRelation type="MMbase-type" source="id" destination="id" [ id ="id" ] >
</createRelation>

Within these four object context types, fields can be set for the object. This has the following syntax.

<setField name="name-of-field" [ url ="url" ] > field-value
</setField>

Object Merge contexts

Within a 'Transaction context' one can specify zero or more 'Object Merge contexts'. In fact this is just another 'Object context'. There is only one type of 'Object Merge context'.

<mergeObjects type="MMbase-type" >
</mergeObjects>

Within an Object Merge context you have to specify one ObjectFinder and one ObjectMerger (in this exact order).

<objectFinder class="ObjectFinder class" >
</objectFinder>

<objectMerger class="ObjectMerger class" >
</objectMerger>

Within both ObjectFinder and ObjectMerger parameters can be specified:

<param name="name-of-field" value="value-of-field" />

TCP 2.0 dtd

<?xml version="1.0" encoding="UTF-8"?> <!-- dec. 1st. 2001 -->

<!ELEMENT transactions (create | open | commit | delete)* >
<!ATTLIST transactions exceptionPage CDATA #IMPLIED>
<!ATTLIST transactions reportFile CDATA #IMPLIED> <!-- TCP2.0 -->
<!ATTLIST transactions key CDATA #IMPLIED>

<!ELEMENT create ((createObject | createRelation | openObject | accessObject | deleteObject | markObjectDelete)*, mergeObject*, mergeObjects*) > <!-- TCP2.0 added mergeObjects* -->
<!ATTLIST create id CDATA #IMPLIED>
<!ATTLIST create commit (true | false) "true">
<!ATTLIST create timeOut CDATA "60">

<!ELEMENT open ((createObject | createRelation | openObject | accessObject | deleteObject | markObjectDelete)*, mergeObject*, mergeObjects*) > <!-- TCP2.0 added mergeObjects* -->
<!ATTLIST open id CDATA #REQUIRED>
<!ATTLIST open commit (true | false) "true">

<!ELEMENT commit EMPTY >
<!ATTLIST commit id CDATA #REQUIRED>

<!ELEMENT delete EMPTY >
<!ATTLIST delete id CDATA #REQUIRED>

<!-- OBJECTS -->
<!ELEMENT createObject (setField*)>
<!ATTLIST createObject id CDATA #IMPLIED>
<!ATTLIST createObject type CDATA #REQUIRED>
<!ATTLIST createObject disposeWhenNotReferenced (true | false) "false"> <!-- TCP2.0 -->

<!ELEMENT createRelation (setField*)>
<!ATTLIST createRelation id CDATA #IMPLIED>
<!ATTLIST createRelation type CDATA #REQUIRED>
<!ATTLIST createRelation source CDATA #REQUIRED>
<!ATTLIST createRelation destination CDATA #REQUIRED>

<!ELEMENT openObject (setField*)>
<!ATTLIST openObject id CDATA #REQUIRED>

<!ELEMENT deleteObject EMPTY >
<!ATTLIST deleteObject id CDATA #REQUIRED>

<!ELEMENT accessObject (setField*)>
<!ATTLIST accessObject mmbaseId CDATA #REQUIRED>
<!ATTLIST accessObject id CDATA  #IMPLIED>

<!ELEMENT markObjectDelete EMPTY >
<!ATTLIST markObjectDelete mmbaseId CDATA #REQUIRED>
<!ATTLIST markObjectDelete deleteRelations (true | false) "false">

<!ELEMENT mergeObject (objectMerger) > <!-- TCP2.0 -->
<!ATTLIST mergeObject from CDATA #REQUIRED>
<!ATTLIST mergeObject to CDATA #REQUIRED>

<!ELEMENT mergeObjects (objectMatcher, objectMerger) > <!-- TCP2.0 -->
<!ATTLIST mergeObjects type CDATA #REQUIRED > <!-- TCP2.0 -->

<!ELEMENT objectMatcher (param*) > <!-- TCP2.0 -->
<!ATTLIST objectMatcher class CDATA "org.mmbase.module.tcp.match.NodeMatcher" > <!-- TCP2.0 -->

<!ELEMENT objectMerger (param*) > <!-- TCP2.0 -->
<!ATTLIST objectMerger class CDATA "org.mmbase.module.tcp" > <!-- TCP2.0 -->

<!-- FIELDS -->
<!ELEMENT setField (#PCDATA) >
<!ATTLIST setField name CDATA #REQUIRED>
<!ATTLIST setField url CDATA #IMPLIED>

<!-- PARAMETERS --> <!-- TCP2.0 -->
<!ELEMENT param EMPTY> <!-- TCP2.0 -->
<!ATTLIST param name CDATA #REQUIRED> <!-- TCP2.0 -->
<!ATTLIST param value CDATA #REQUIRED> <!-- TCP2.0 -->
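
As a rough illustration of how such a DTD can be used to reject malformed transactions before processing, the Java sketch below validates a document against a DTD with the standard JAXP API. It is a sketch under assumptions: the class DtdValidationSketch is hypothetical, only a small subset of the declarations above is embedded (as an internal DOCTYPE subset, to keep the example self-contained), and validation problems are simply collected as strings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical DTD validation of a transactions document using JAXP.
class DtdValidationSketch {

    // Assumption: a reduced subset of the DTD, enough for the example documents.
    static final String DTD_SUBSET =
        "<!ELEMENT transactions (create)*>"
        + "<!ELEMENT create (createObject)*>"
        + "<!ELEMENT createObject (setField*)>"
        + "<!ATTLIST createObject id CDATA #IMPLIED>"
        + "<!ATTLIST createObject type CDATA #REQUIRED>"
        + "<!ELEMENT setField (#PCDATA)>"
        + "<!ATTLIST setField name CDATA #REQUIRED>";

    /** Returns the validation problems found in the given document body. */
    static List<String> validate(String body) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch.
                }

                @Override
                public void error(SAXParseException e) {
                    // Validity errors are recoverable: record and continue.
                    problems.add(e.getMessage());
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // not well-formed: stop parsing
                }
            });
            String document = "<!DOCTYPE transactions [" + DTD_SUBSET + "]>" + body;
            builder.parse(new InputSource(new StringReader(document)));
        } catch (Exception e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        String valid = "<transactions><create><createObject type=\"movies\">"
            + "<setField name=\"title\">Psycho</setField>"
            + "</createObject></create></transactions>";
        // The required type attribute is missing here.
        String invalid = "<transactions><create><createObject id=\"m1\">"
            + "</createObject></create></transactions>";
        System.out.println("valid document problems: " + validate(valid).size());
        System.out.println("invalid document has problems: " + !validate(invalid).isEmpty());
    }
}
```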