Data Reporting Template

From Fusion Registry Wiki
Revision as of 06:50, 2 March 2020 by Mnelson (talk | contribs) (Floating Point Errors)
Jump to navigation Jump to search

Overview

A Data Reporting Template is a metadata driven Excel Form to enable Data Providers to easily report data. The form is Metadata Driven as the definition of each Worksheet takes into account:

  • Dataset (Dataflow)
  • Which Dimensions are placed in the Rows
  • Which Dimensions are placed in the Columns
  • Whether to output the headings and values as Ids or Names (e.g FREQ/A or Frequency/Annual)
  • Whether to exclude fixed Dimension values, or place them in the Header
  • How to report Attribute values such as Observation Confidentiality, Status or Pre-Break values
  • Constraints are used to determine the universe of data, each form can contain a diffent universe of data depending on which Data Provider the form is generated for
  • Mathmatical Validation Rules can be used to inject formual and checking tables into the Worksheet

A Reporting Template definition is created in the Fusion Registry and describes the contents and layout of one Excel Workbook. Each workbook contains at least one worksheet which is backed by the Dataflow that the data collection is for. A workbook may contian multiple worksheets, enabling one excel file to be used to submit data for multiple datasets. The Fusion Registry provides both a web User Interface and web services to generate the Excel Workbook for a Reporting Template for a specific Data Provider. The contents of the Excel workbook are dynamically obtained based on who the Data Provider is. This enables a single data collection template to be set up for multiple data providers, where each provider's workbook contains only the worksheets, and data cells they need to report data for.

As of Fusion Registry 10.1.0 it is also possible to use a Reporting Template as a way to describe a data output format. This enables a user to query for data from the Fusion Registry web service, and the Fusion Registry will construct the output excel file according to the layout of the reporting template.

Report template

Defining a Data Reporting Template

Overview

A Reporting Template has the same high level properties as every other structure type in the Fusion Registry – this includes an owning Agency, who is responsible for maintaining the template, an Id and a Version which help uniquely identify the template. The Report Template Name is multilingual and is displayed on the main worksheet on the generated workbook.

Rt main.png

Excel Worksheets

The Excel workbook is generated with a main worksheet, which contains the Report Template Name, the Data Provider details that the workbook was generated for, and if the report template was generated for a single period, there is an editable cell for the reporting organisation to fill in the period. This enables the reporting organisation to reuse the same workbook for different periods, by simply changing the period on the main worksheet.

Rt org period.png

All other worksheets in the workbook are linked to a Dataflow (one Dataflow per worksheet).

Rt worksheets.png

Floating Point Errors

It is important to note early on that Excel can suffer from number precision errors. For a Reporting Template the following example nicely shows the problem:

Rt floating point.png

Each cell is calculated from the first cell, moving across it's simply +0.1 and down is +1. This results in floating point issues, in Java if you set up a simple loop like the following:

double d = 1;
while(d < 2) {
 d += 0.1;
 System.out.println(d);
}

Then the output quickly shows this problem:

1.1
1.2000000000000002
1.3000000000000003
1.4000000000000004

The same is true in Excel, as you can see here, in the underlying XML for the worksheet, excel stores the information with these errors:

Rt excel xml.png

When reading the Excel file back in, the system reads the numbers from the underlying XML, and this may not be the same as what the user sees in the Excel worksheet. Floating point errors of this type only occur as a result of a calculated value. The Fusion Registry resolves this issue by allowing the maximum number of decimal places to be specified per worksheet definition. This way the Fusion Registry will round the numbers to the level of precision required.

Defining the table layout

A table layout is constructed by choosing which Dimensions of your data structure will appear in the table rows, and columns. The order of the Dimensions can also be defined. The choices here will impact how the resulting table looks.

Rt table layout.png

It is possible to display only the Ids of the Dimensions / Codes, for example outputting UK as oppose to United Kingdom. Codes will always be output in the order of the Codes in the underlying Codelist, unless a Reporting Hierarchy is used for the dimension. It is important when designing a reporting template to ensure that the total table size does not grow too large. The table size can be calculated by multiplying all the possible values for each Dimension. For example a table with the following Dimensions:

FREQUENCY: A, M
REF_AREA: UK, FR, DE
AGE_RANGE: 0, 1, 2, 3, 4
SEX: M, F

Would result in a table with 60 cells (2 * 3 * 5 * 2). Data Structure definitions are typically much larger then this, so the possible table size can run into extremely high numbers.

There are two ways to keep table size down.

  1. Use content constraints to restrict dimension values based on the dataset and/or data provider. For example the Reference Area dimension may contain over 200 possible countries, but if the data provider is UK the country can be restricted to UK.
  2. Set a Dimension as a Variable (not a header or a row). A use case for a variable dimension would be for a collection which include a reporting country and counterpart country, and where the possible counterpart countries are unknown for each reporting country. By setting counterpart country to a variable a table can be built with counterpart country as a drop down on either the main worksheet or on the worksheet itself
Rt varaible main.png

Fixed Values

When an Excel file is generated for a specific Data Provider, some Dimensions may result in having fixed values. For example if the reporting country Dimension is restricted to one country for a data provider. In this case the fixed value will be in the header section of the worksheet. However, it is possible to configure the worksheet to exclude any number of Dimensions from the worksheet if they have a fixed value. The value will still be read back in when the Fusion Registry imports/converts the file, but it won’t appear on the worksheet.

Hierarchies

It is possible to use the Hierarchy set in a Codelist, or link a Dimension to a Hierarchical Codelist if one exists for the Dimension’s Codelist.

If the hierarchy is for a header dimension, and the hierarchy includes codes which are not part of the reporting universe but has child codes that are, then the code will be used as a means to group the child codes. Rt hierarchy header.png

If the hierarchy is for a row dimension, then the labels will appear in the same order as the hierarchy, and will include a small indentation

Rt hierarchy row.png

A Hierarchy can also be used for data validation, to ensure the reported members of the hierarchy add up to the parent’s reported value. This can be achieved by creating a Validation Rule in a Validation Scheme.

Attribute Placement

Observation Attributes can also be reported alongside the reported data. There are a number of options for how attributes are reported, and different options may be selected for different attribute. The options are


Observation Attribute Placement Options
Placement Option Description
Exclude from Report The attribute is not output in the worksheet, and attribute value will not be read in by the Fusion Registry when processing the worksheet
Fix Reported Value The attribute is not output in the worksheet. The attribute value will be fixed to a pre-defined value (defined in the Report Template definition) when read in by the Fusion Registry.
Include in Excel Table The attribute will be output in the excel table, so for each observation cell, there will be an additional cell for the user to report the attribute value
Include in separate table The attribute values are reported in the same worksheet as the observation table’s sheet. There is a duplicate of the observation table built under the observation table, enabling data to be reported for all the attributes set to this placement value.
Include in separate worksheet The attribute will be reported in a new worksheet. The table structure will be the same as the observation table.
Conditional on reported values The attribute value is not reported by the user, but generated based on the reported values. For example, Observation Status could be set to missing if there is no observation reported. Another example is setting the status to Break Value if the observation pre-break attribute has a reported value against it for the observation.
Use colour to represent value This option is only available for one Observation Attribute, and the attribute must be a coded attribute (i.e allowed values are defined by a Codelist). It is possible to define one or more codes from the Codelist to a colour. For example, the Observation Confidentiality attribute could have Confidential set to RED, Not for Publication set to Yellow, and Free set to Green. This option will build a legend of Attribute to Colour in the header section of the worksheet. The data reporter can then use Excel’s style formatter to copy and paste the colour onto the cells. When using this option it is also possible to set both a general default colour to use, and series specific colours. For example the default ‘Free’ value can be set on all cells, except those for a specific currency/country or combination of, where a different confidentiality can be set.
Rt obs colour.png     Rt obs attr sametable.png
Showing Observation attribute Values reported using colour     Showing attributes in the observation table

Defining the Table Contents

The Table's Contents, or the 'Universe of Data' depends on three aspects

  1. The enumerated list of values defined by the Codelist for each Dimensions (excluding the Frequency and Time Dimension).
  2. The restrictions applied to these lists, based on the Dataflow or Data Provider (Content Constraint)
  3. The Dimensions that are set to be Variable in the Report Templates definition.

The first aspect is based on the Data Structure Definition used by the Dataflow. The full universe can be calculated by multiplying the length of each Codelist for each Dimension defined by the Data Structure Definition. This Universe can be quite large, and generally needs further restictions by applying the 2nd aspect, the Content Constraints. Note the Frequency Dimenison is excluded from this calculation because when the Excel workbook is generated, it is generated for a specific Time Period or Range. The Time Period is defined using SDMX Date formatting, for example a template generated for 2002-2004 would fix the Frequency for the excel workbook to Annual.

The second aspect helps to tailor a worksheet to a specific collection or Data Provider. Take for example a Data Structure with 4 Dimensions with the given number of possible values; Frequency (8), Domain(20), Reference Area(240), and Indicator (65k). The total reporting universe if 20 * 240 * 65k is 312million cells (we do not include Frequency in the calculation). The Data Structure may be re-used across multiple Dataflows; Central Government Debt, Trade, Balance of Payments, etc. Each Dataflow only collects data for a subset of Indicators, and fixes the Domain. In this instance there would be a Content Constraint defined to restrict Domain to a single value for the Trade Dataflow, and restrict the Indicators to only the ones which are collected for Trade say 30. The universe of data is now only 1 * 240 * 20 or 4,800 observation cells. The next layer of restrictions can be applied to either the Provision Agreement, or Data Provider. When the collection is from UK organisations, the Reference Area Dimension only allows the UK to be reported. Now the excel file will only contain 20 cells for the United Kingdom to complete for the Trade collection.

Content Constraints can get quite complicated, for example partial series can be defined for both inclusion or exclusion, as such it is possible for an excel worksheet to contain cells which should not have observations reported against. In this instance the cell will be hashed out, and non-editable.

Rt protected cells.png

Mathematical Validation Rules

An excel workbook can contain formula in certain cells to pre-calculate values. A workbook may also contain a checking table to ensure reported values are consistent with each other, or meet other criteria such as lying within an expected range.

Calculations in a workbook come from two sources of information:

  1. Restrictions imposed on the Primary Measure at the level of the Data Structure Definition
  2. Validation rules defined in a Validation Scheme

Restrictions on the Data Structure Definition

The Data Structure Definition defines the expected text format for the Primary Measure – or in simpler terms, it defines the allowed content for the reported Observation values. By default, the allowed content is any text with no restrictions imposed, meaning any reported value is valid. It is possible to restrict this value to a more tightly defined data type, such as an Integer, Double, or even non-numerical types such as Boolean or Alpha Numeric. The excel workbook does not validate text formatting, however the workbook can validate the following restrictions on numerical formats:

  1. Max allowed value
  2. Min allowed value

If a reported value is outside of the min or max value, the cell will be identified in the error check table and check summary worksheet. If the Min or Max value is dependent on a particular Dimension (e.g max value changes for each Country) then Validation Rules can be used instead. If both a global Min is required, with Dimension specific use cases, then both the Data Structure and Validation Rules can be used, and both rules will be applied.

Validation Scheme Rules

Any calculations defined by Validation Schemes which are applicable to a worksheet can be included in the excel worksheet. Validation Rules fall into two categories:

  1. Those that result in a single value (equality rules), for example EUR=FR+DE
  2. Those that result in a range of possible values, for example EUR > FR + DE

For the equality rules, the equality formula will be built into the relevant cell(s) of the worksheet.

Rt cell formula.png

Equality rules can also be built by linking a validation rule to a Code Hierarchy, where the Hierarchy is used to ensure the totals equal the sum of their parts. When a validation expression is not using the equality operator, the formula will not be built into the observation table as it is not possible to calculate an absolute result, only a range of values. However, both equality checks and range checks are built into the checking table which is placed under the observation table.

Checking Table

[img]

A checking table is an optional table which can be included on a worksheet (the inclusion is defined by the Agency who creates the Reporting Template).

A checking table reports errors in the observation values of the excel workbook based on both the Data Structure’s min/max value of the Primary Measure, and any validation rules relevant to the data in the worksheet.

For equality errors, the checking table will report the difference between the expected value, and the reported value. For example, if EUR=DE+FR and the user reports 10 for DE and 10 for FR then the expected value for EUR is 20.

If the user reports 25 for EUR, the checking table will report 5 for the cell EUR, as the reported value is 5 more than the expected value. For range errors, the checking table will report the Identifier of the rule that has been broken. This Identifier is set in the Validation Rule of the Validation Scheme.

When a checking table is included in a worksheet, a checking summary worksheet is also included in the workbook. The checking summary provides a breakdown of validation rules that have passed or failed per worksheet. For range checks, it also provides the formula that was used for each rule.

Formula Inclusion / Exclusion Rules

A Validation Rule is built up of an output, and one or more inputs. For example:
EUR = FR + DE

In the simple case where all the operators a plus (+), the output (EUR) must exist in the worksheet for the rule to be included, so too must at least one of the inputs (FR or DE). If FR or DE is not present in the worksheet, zero will be assumed.

If the rule contains operators other than the plus (+) operator, the output and all the inputs must be present in the worksheet for the rule to be included. For example:
EUR = FR * DE

EUR, FR, and DE must be present in the worksheet for the rule to be included. If the ‘default to zero’ logic were assumed, this would result in any reported value, other than zero, for EUR to fail the rule.

Workbook Locale

Every label in the Excel workbook is derived from the Metadata in the Fusion Registry, this includes concept name, code names, and the report template name. As Fusion Registry supports multilingual labels for all metadata, the labels will be written in the language specified when the report is built. If built from the web user interface, this will default to the locale of the web browser, or the language chosen from the Fusion Registry UI. If built from the web service, the locale can be passed as a parameter, or the standard HTTP Accept-Language header can be used. When the Excel workbook is opened, the Excel locale configurations will be used as set by the user. When the workbook is read back in by Fusion Registry, numerical values will be resolve in as follows: 1 . numbers with exponents e.g 1.2E2 will be resolved as 1200 2 . numbers with decimal separators will always be converted to use the period separator ‘.’ as the decimal point.

Creating an Excel Workbook

An Excel Workbook is created from a Report Template definition. In order to create a workbook two pieces of information are required by the Fusion Registry: 1 . Who is the Data Provider? 2 . What is the reporting period?

Data Provider for Workbook

In order to set up a Data Provider for a Workbook, a Provision Agreement must be created, linking the Data Provider to a Dataflow. If a workbook contains multiple worksheets, only the worksheets that the Data Provider has a Provision Agreement for will be output in the workbook. If a Data Provider does not have any Provision Agreements for any of the worksheets, the workbook will not be built.

Reporting Period

A workbook can either be built for a single reporting period, for example 2008, or a range for example 2008 to 2012. If the workbook is created for a single period, the main worksheet will include the period in one of the worksheet cells. The cell will be editable, allowing the data reporter to change the value if they wish. If the workbook is created for a range of periods, the periods will be part of the observation table, with the placement of the Time Period based on the Report Template tables and rows definition. [rt_multiple_periods.png]

Protecting Workbooks

It is recommended to protect worksheets to prevent users from making change to the workbook which prevent it being read back in by the Fusion Registry. The Fusion Registry supports a password to be set per Agency. For example Agency ‘WB’ can set a password of 1234 for its workbooks, and ESTAT can set a password of ‘abcd’. Passwords can be set in the Admin section of the Fusion Registry, or via the Fusion Registry web service. A protected workbook will only allow the user to fill in observation and attribute cells, and the report period on the main worksheet if the workbook is created for a single period.

Reading / processing a Workbook

The Fusion Registry is able to read any excel workbook that it has created from a Report Template. The data in the workbook can be read in for data validation, transformation, or import into one of the Fusion Registries data stores using the usual mechanisms. The workbook contains the required metadata to read it back in, even if the report template definition has been subsequently changed or deleted since the workbook’s initial creation. As long as the Fusion Registry contains the same Provision Agreement and Dataflow that the worksheet references, and the Data Structure Definition has not had any changes to the number of, or order of Dimensions, then the worksheet will be read back in.

Populating a Workbook from a Dataset

A Data Provider can to export a Workbook for a Report Template and have it prepopulated with data. This can be achieved either by using the Fusion Registry Data Transformation services in order to 'inject' the data from a dataset in a different format into their workbook. The Data Transformation Service needs to know what the desired output data format is, in this case it is set to TODO.

Alternatively, if the Fusion Registry is being used as a Data Store (or virtual store) on the Data Provider’s data, then the Data Provider can simply ask for a Reporting Template for a given time period with existing data already included.

Report Templates as a mechanism to describe an output data format

A Reporting Template can be used as a mechanism to describe an output format for data export / dissemination.

Unlike the Reporting Template for data collection, as an export format there does not need to be a Data Provider or Provision Agreement present in order for the template to be built. The other difference is when using the template for output, there will be no main worksheet included and data validation rules will not be included.

Another key difference between output formats versus as a collection workbook, is that the Universe of Data is derived from the data being exported, as oppose to the DSD and Content Constraints. If the Reporting Template defines any Dimensions as variable Dimensions, these will be incorporated back into the observation table, as a row Dimension. The exported excel workbook will only contain the worksheet(s) for the datasets that were written to the export.

The excel workbook will be protected as usual if a password has been set for the Agency who owns the Reporting Template. The Fusion Registry is able to re-read excel workbooks that have been exported as an output data format, so it is possible to modify the observation values for loading back in for validation/transformation/import purposes.

To query for data and output the result formatted according to a Reporting Template, use the standard REST API, with the HTTP Accept-Header of [todo] or use the format parameter [todo].

To transform a dataset into an output format, use the standard data transformation service, using the HTTP Accept-Header of [todo] – or if transforming using the web user interface, select the Report Template Output data format.

When the Fusion Registry receives a data query for a Dataflow and the output format is [todo], the Fusion Registry will lookup which Reporting Template contains a worksheet defined for the Dataflow. If it cannot find any Reporting Templates, then a [todo] error is reported.

The Fusion Data Browser is able to offer Reporting Templates as an export format for Dataflows that have had templates set up for them.