Difference between revisions of "Validate data"
Revision as of 09:01, 3 November 2020
Overview
To Validate Data you need to have the following structures in place.
Preparation
Data Provider
A Data Provider is an Organisation Type. When a Provision Agreement is created, both a Dataflow and a Data Provider must be present. An example Data Provider is shown below.
Provision Agreement
A Provision Agreement (PA) is the union of a Dataflow with a Data Provider: it defines that the Data Provider is allowed to provide data for the Dataflow. Data is always reported by a Data Provider against the PA. You can read more about Provision Agreements in this article. An example Provision Agreement is shown below.
Dataflow
A Dataflow is a structure on which data is collected and disseminated. A Dataflow references a Data Structure Definition (DSD) which is used as the underlying template to which the data must conform. You can read more about Dataflows in this article. An example Dataflow is shown below.
Load Data
Once all the elements are in place as described above, the next step is to load the data, which is done via the Convert option on the Data Menu.
Data can be loaded from a file or via a URL (for example, from Metadata Technology's Fusion Registry Demo site).
To validate successfully, the data must adhere to the SDMX standard in terms of format and must also conform to what has been defined in the Data Structure.
You can read more about Data Structures in this article.
You can read more about how to create a simple Data Structure in this article.
Supported formats are:
- SDMX_2.1-Generic
- SDMX-V2.0-Compact
- SDMX-EDI
- SDMX-JSON
- SDMX-V2.0-Generic
To see this process in action, you can watch this video.
Validate Data
Click Load Data to start the validation process as explained in the image below.
Validation Scheme
What is a Validation Scheme?
Validation Schemes define one or more validation rules which can be executed against a Dataflow at the data validation stage of a data load. Each validation rule consists of either a mathematical expression or a link to an aggregation hierarchy which is used to create an expression. This validation goes beyond syntactic and semantic validation of the dataset: it checks that the values supplied in the dataset conform to specific business rules. For example, a rule could require that a particular field has a value less than 100, or that the total value reported equals the sum of a set of other observation values.
A Validation Scheme must be assigned against a single Dataflow and may consist of one or many validation rules. A single Validation Rule consists of:
- an ID and name
- an optional description
- a type: either a custom expression or a hierarchic expression
- a result type (either numerical or code type) and a value (e.g. 100 or GDP)
- an equality operator (one of the following mathematical operators: =, <>, <, <=, >, >=)
- an expression (e.g. [EUR]+[FR])
The rules in a Validation Scheme apply to all datasets submitted against the Dataflow to which the Validation Scheme is linked.
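The parts of a Validation Rule listed above can be pictured as a simple record. The following Python sketch is only an illustration of that list: the class name `ValidationRule`, its field names, and the string values used for the type fields are assumptions made for this example and are not the Registry's own data model or API.

```python
from dataclasses import dataclass
from typing import Optional

# The six operators named in the list above.
OPERATORS = {"=", "<>", "<", "<=", ">", ">="}


@dataclass
class ValidationRule:
    """Hypothetical in-memory model of one rule; all names are illustrative."""
    rule_id: str
    name: str
    rule_type: str      # assumed labels: "custom" or "hierarchic"
    result_type: str    # assumed labels: "numerical" or "code"
    result_value: str   # e.g. "100" or "GDP"
    operator: str       # one of OPERATORS
    expression: str     # e.g. "[EUR]+[FR]"
    description: Optional[str] = None  # the description is optional

    def __post_init__(self):
        # Reject values that fall outside the options listed in the text.
        if self.rule_type not in ("custom", "hierarchic"):
            raise ValueError(f"unknown rule type: {self.rule_type}")
        if self.result_type not in ("numerical", "code"):
            raise ValueError(f"unknown result type: {self.result_type}")
        if self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator}")


# Example: a custom rule stating that [EUR]+[FR] must equal the code GDP.
rule = ValidationRule(
    rule_id="R1",
    name="GDP total",
    rule_type="custom",
    result_type="code",
    result_value="GDP",
    operator="=",
    expression="[EUR]+[FR]",
)
```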
How are Rules Applied?
A validation rule operates on a single dimension. For example, a rule to calculate Total from the inputs Males and Females would look like the following:
[T] = [M] + [F]
Note: the syntax used in a validation scheme puts code IDs into square brackets.
This rule is applied to every series where all other parts of the series key match, so for the following series there would be two matches to this rule: one for employment and one for unemployment.
For a validation rule to be executed, there must be data reported for the output and for at least one of the inputs. If data are missing for any of the inputs, they are treated as zero values. In the following example, only one rule is matched, and there is only one input (Male).
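The matching behaviour described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the Registry's implementation: the function name `check_total_rule`, the dimension IDs `INDICATOR` and `SEX`, the indicator codes, and all observation values are invented for the example. Comparing with `math.isclose` is also an assumption; the source does not say how equality is evaluated.

```python
import math
from collections import defaultdict


def check_total_rule(observations, dimension, output, inputs):
    """Check [output] = sum of [inputs] on one dimension.

    observations: list of (series_key, value), where series_key maps
    dimension IDs to code IDs. Returns a dict that maps the remaining
    key parts to True/False for every group where the rule is executed.
    """
    # Group values by all key parts except the dimension the rule works on.
    groups = defaultdict(dict)
    for key, value in observations:
        rest = tuple(sorted((d, c) for d, c in key.items() if d != dimension))
        groups[rest][key[dimension]] = value

    results = {}
    for rest, by_code in groups.items():
        has_output = output in by_code
        has_input = any(code in by_code for code in inputs)
        if not (has_output and has_input):
            continue  # rule is not executed for this group
        # Missing inputs are treated as zero.
        total = sum(by_code.get(code, 0) for code in inputs)
        results[rest] = math.isclose(by_code[output], total)
    return results


# Invented data: employment is complete, unemployment lacks Females,
# and the third indicator has no Total, so the rule is skipped for it.
observations = [
    ({"INDICATOR": "EMP", "SEX": "T"}, 100),
    ({"INDICATOR": "EMP", "SEX": "M"}, 60),
    ({"INDICATOR": "EMP", "SEX": "F"}, 40),
    ({"INDICATOR": "UNE", "SEX": "T"}, 30),
    ({"INDICATOR": "UNE", "SEX": "M"}, 20),
    ({"INDICATOR": "INA", "SEX": "M"}, 5),
]
results = check_total_rule(observations, "SEX", "T", ["M", "F"])
```

With this data the rule passes for the employment group, fails for the unemployment group (30 against 20 + 0), and is not executed for the third group.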
There are two types of validation rules. The first type uses a custom written expression, as described above. The second type references a Hierarchy in the Registry, and the Hierarchy is used as the basis for an Aggregation expression. For example, the following image shows a hierarchy of countries against theoretical reported values, illustrating how a hierarchy is used to validate a dataset.
A hierarchy can be applied to any dimension that uses the same Codelist as the Hierarchy. When values are read from the data file, the values at each level of the hierarchy are summed to check that they are consistent with the parent value. Any missing values are treated as zero.
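The hierarchy-based check can be sketched in the same way. The Python below is an illustration only: the function name `check_hierarchy`, the country and region codes, and the values are invented, and skipping parents that have no reported value or no reported children is an assumption carried over from the behaviour of custom rules described earlier.

```python
def check_hierarchy(hierarchy, values):
    """Compare each reported parent with the sum of its children.

    hierarchy: maps a parent code to the list of its child codes.
    values: maps a code to its reported value.
    Returns a list of (parent, reported, computed) for inconsistent parents.
    """
    failures = []
    for parent, children in hierarchy.items():
        if parent not in values:
            continue  # assumption: nothing to validate without a parent value
        if not any(child in values for child in children):
            continue  # assumption: at least one child must be reported
        # Missing children are treated as zero.
        computed = sum(values.get(child, 0) for child in children)
        if values[parent] != computed:
            failures.append((parent, values[parent], computed))
    return failures


# Invented two-level hierarchy and values; CN is not reported.
hierarchy = {
    "WORLD": ["EUROPE", "ASIA"],
    "EUROPE": ["FR", "DE"],
    "ASIA": ["JP", "CN"],
}
values = {
    "WORLD": 100, "EUROPE": 60, "ASIA": 40,
    "FR": 35, "DE": 25, "JP": 30,
}
failures = check_hierarchy(hierarchy, values)
```

Here only the `ASIA` node is inconsistent: its reported value is 40, while its children sum to 30 because the missing `CN` value counts as zero. The exact integer comparison is a simplification; real data with decimals would need a tolerance.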
Note: the Registry only checks data in the submitted file and does not cross-check against any persisted data when validating. For example, if the totals are already stored in a Registry database and you then submit a Dataset containing the values that make up those totals, the Registry will not validate the values in the file against the stored totals.
About this Tutorial
This tutorial describes the manual steps required to create a Validation Scheme. Your Registry must already be populated with structures that support this process (such as Data Structure Definitions and Dataflows).