Fusion ETL Server Database Mapping

From Metadata Technology Wiki
Jump to navigation Jump to search


Overview

The ETL Server requires a mapping configuration for each Dataflow that requires a link to the local database. The mappings are created manually in JSON format, in the dbmappings directory of the ETL Server Directory.

There are two styles of mapping supported, single table and double table. Single table mapping links a single Dataflow to one Database table for all of the data. Double table mapping links the Dataflow to a Series table containing all the Dimension values, and an Observation table containing the observation values and observation attributes over time.

The mapping expects there to be a column for each Dimension, however it is possible to fix Dimenison values if the value is always the same for every series or observation value. If column names match the Id of the Dimenison or Attribute from the DSD then no further mapping is required. If the names differ then a should be provided of column name to DSD component Identifier.

Mapping files are read on Tomcat server startup, and can be refreshed by re-starting the service.

HealthMapping.png

Mapping File Contents

The mapping file is in JSON format. Each mapping is maintained in its own file, the name of the file does not matter. The files must be created under the dbmappings folder of the ETL Server Directory.

  • ConnectionId is a reference to the Id of the Database Connection the mapping uses.
  • Dataflow provides the reference to the Dataflow that is being mapped.
  • ObsTable provides the name of the database table that the data will be extracted from.
  • SeriesTable is optional, and only required if using Double table Mapping.
{
   "ConnectionId": "MY_SQL",
   "Dataflow": {
       "Agency": "WB",
       "Id": "WDI_EDUCATION",
       "Version": "1.0"
   },
   "ObsTable": "wb_wdi_education_1_0_obs",
   "SeriesTable": "wb_wdi_education_1_0_key"
}


Fixed Value

If there are any values which never change for any series / observation value, then these can be fixed in the mapping configuration file. These are provided as a key/value pair under the FixedValues section.

As these values are fixed, there does not need to be a database column for these Dimensions or Attributes

{
   "ConnectionId": "MY_SQL",
   "Dataflow": {
       "Agency": "WB",
       "Id": "WDI_HEALTH",
       "Version": "1.0"
   },
   "ObsTable": "health_obs_fixed_val",
   "FixedValues": {
       "REF_AREA": "DZA",
       "FREQ": "A",
       "UNIT_MULT": "0"
   }
}

Mappping Time Periods

There are two concepts for time, one is the actual time period formatted in a way which conforms to the SDMX specification for time. This is formatted as a string, and provides information on both time and frequency, based on the way the string is formatted (YYYY represents annual data whereas YYYY-Q1 is quarterly data). This is a required piece of information, and then column is required.

The second concept for time is a datestamp which is used to indicate when the observation was updated. This is used to support delta extractions, which query for all data updated since the last sucessful publication time. A column for the last updated time is not required, but without this information delta publications will not be possible.

The SDMX time column has a default name of TIME_PERIOD as this is the fixed Identifier in a SDMX Data Structure Definition for the Time Dimension.

The Last Updated column has a default name of FR_LAST_UPDATED_OBS and must be mapped if the column name does not match this value.

Single Table Design

The single table design expects there to be a column for each Dimension unless the Dimenison is fixed, this includes the Time Dimension. There must also be a column for the observation value. There does not have to be a column for Attributes as long as the Attributes are optional, as the Fusion Registry will reject any publications that do not contain all the required data.

If the column name matches the Id of the Component in the DSD then no additional mapping is required. If the column name does not match the DSD then the column name can be associated with the DSD Component.

{
   "ConnectionId": "MY_SQL",
   "Dataflow": {...},
   "ObsTable": "health_sdmx_csv_obs_time",
   "ColumnMappings": [
       {
           "Column": "MY_FREQ",
           "ComponentId": "FREQ"
       },
       {
           "Column": "MY_SERIES",
           "ComponentId": "SERIES"
       },
       {
           "Column": "OBS_DATE",
           "ComponentId": "FR_UPDATED_DATE_OBS"
       }
   ]
}

Double Table Design

The double table design expects there to be a column for each Dimension in the Series Table unless the Dimenison is fixed. The Observation table should include the TIME_PERIOD Dimenison and the OBS_VALUE Measure. Any Series Attributes should go in the Series Table (if required) and any Observation Attributes should go in the Observation Table (if required). The same rules for Attributes apply, FUsion ETL does not require that there is a column for an Attribute but Fusion Registry will reject any data publication that does not provide data for Attributes that are defined as Mandatory.

The Series Table must have a primary key which links to the Observation table via its Foreign Key, the default column names for these are ID (primary key in series table) and FR_KEY_ID (foreign key in obs table). If these differ in the database an alternative can be provided through the mapping file. The mapping file can provide alternative names for any Dimension or Attributes Ids as well if the column name does not match the Id of the Component.

The mapping definition does not need to differentiate between Dimensions, Series Attribute, or Observation Attributes in the ColumnMappings section, the ETL server uses the Data Structure Definition to look these details up, and expects Dimensions and Series Attributes to be in the Series Table and Obsevation Attributes in the Observation table.

{
   "ConnectionId": "MY_SQL",
   "Dataflow": {...},
   "ObsTable": "health_sdmx_obs",
   "SeriesTable": "health_sdmx_series",
   "ColumnMappings": [
       {
           "Column": "ID",
           "ComponentId": "MY_ID"
       },
       {
           "Column": "FK",
           "ComponentId": "FR_KEY_ID"
       },
       {
           "Column": "MY_FREQ",
           "ComponentId": "FREQ"
       },
       {
           "Column": "MY_SERIES",
           "ComponentId": "SERIES"
       },
       {
           "Column": "OBS_DATE",
           "ComponentId": "FR_UPDATED_DATE_OBS"
       }
   ]
}