Data Validation Web Service


Overview

Entry Point: /ws/public/data/validate
Access: Public (default). Configurable to Private
HTTP Method: POST
Accepts: CSV, XLSX, SDMX-ML, SDMX-EDI (any format for which there is a Data Reader)
Compression: Zip files are supported. When loading from a URL, gzip responses are also supported
Content-Type:

1. multipart/form-data (if attaching a file). The attached file must be in the field named uploadFile

2. application/text or application/xml (if submitting data in the body of the POST)

Response Format: application/json
Response Statuses:

200 - Validation could be performed

400 - Validation could not be performed (either an unreadable dataset, or an unresolvable reference to a required structure)

401 - Unauthorized (if access has been restricted)

500 - Server Error
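
As an illustration of the entry point described above, the following sketch builds (but does not send) a POST request that carries the dataset in the request body. The base URL is a hypothetical placeholder, and the helper name build_validation_request is introduced here only for illustration; attaching a file via multipart/form-data is not covered.

```python
import urllib.request

# Hypothetical server location; replace with your own Fusion Registry base URL.
BASE_URL = "http://localhost:8080/FusionRegistry"
ENTRY_POINT = "/ws/public/data/validate"


def build_validation_request(data: bytes,
                             content_type: str = "application/xml") -> urllib.request.Request:
    """Build a POST request with the dataset in the body (nothing is sent)."""
    return urllib.request.Request(
        url=BASE_URL + ENTRY_POINT,
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )


# Placeholder payload; a real call would pass the bytes of an SDMX-ML or CSV file.
request = build_validation_request(b"<placeholder/>")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would submit it; the response body is then expected to be the JSON report described in the Response section.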

Data Formats

See Data Load Formats

HTTP Headers

HTTP Header Purpose Allowed Values
Accept

(Since v9.8)

Optional. Specifies the data format in which the valid or invalid datasets are output.

Note: From Fusion Registry 11.8.0, the format defaults to the input format if it is not specified. Previous versions defaulted to SDMX Structure Specific 2.1.

Note: This header is only used if Inc-Valid or Inc-Invalid is set to true.

Allowed values: See Accept Formats
Data-Format

Informs the server when the data is in CSV format.

Allowed values: csv;delimiter=[delimiter]

Where [delimiter] is one of:

  • comma
  • tab
  • semicolon
  • space
Structure

Optional. Provides the structure to validate the data against.

This information may also be present in the header of the dataset. If this header is provided, its value overrides the value in the dataset.

Fusion Registry v10.4.7 onwards accepts a comma-separated list of URNs if the loaded file contains multiple datasets. If multiple URNs are provided, each URN must identify a Dataflow or Provision Agreement, and the datasets must identify, at a minimum, the Data Structure needed to read the file.

Allowed values: Valid SDMX URN for a Provision Agreement, Dataflow, or Data Structure Definition
Inc-Metrics

(Since v9.8)

Optional. Includes metrics on the validation, adding extra detail to the validation report.

Allowed values: Boolean (true/false)
Inc-Valid

(Since v9.8)

Optional. Instructs the service to include a dataset with all the valid series and observations in the response.

Because the result contains a separate file for the dataset, the response is either a multipart/mixed message with a boundary per file or, if the Zip header is set to true, a single zip file.

The file is called ValidData, with a file extension based on the output format.

Allowed values: Boolean (true/false)
Inc-Invalid

(Since v9.8)

Optional. Instructs the service to include a dataset with all the invalid series and observations in the response.

Because the result contains a separate file for the dataset, the response is either a multipart/mixed message with a boundary per file or, if the Zip header is set to true, a single zip file.

The file is called InvalidData, with a file extension based on the output format.

Allowed values: Boolean (true/false)
Zip

(Since v9.8)

Optional. Compresses the output as a zip file. If used in conjunction with Inc-Valid or Inc-Invalid, the zip will contain multiple files.

Allowed values: Boolean (true/false)
Prior-Data-Dependent

(Since v9.8)

Optional. Allows data to be validated under the assumption that other data will provide missing information. If this value is set to true, the "Mandatory Observations" and "Valid Calculations" validators are not used when validating the data.

Default value is false.

Allowed values: Boolean (true/false)
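
The header rules above can be sketched as a small helper that assembles the request headers and predicts the kind of response to expect. This is only an illustration: the function names, the choice to omit headers that are false, and the URN separator with no surrounding whitespace are assumptions, not behaviour confirmed by the service.

```python
ALLOWED_DELIMITERS = {"comma", "tab", "semicolon", "space"}
# When several URNs are sent, each must be a Dataflow or Provision Agreement.
MULTI_URN_ARTEFACTS = {"Dataflow", "ProvisionAgreement"}


def build_headers(csv_delimiter=None, structure_urns=(), inc_metrics=False,
                  inc_valid=False, inc_invalid=False, zip_output=False,
                  prior_data_dependent=False):
    """Return a dict of HTTP headers for the validation request."""
    headers = {}
    if csv_delimiter is not None:
        if csv_delimiter not in ALLOWED_DELIMITERS:
            raise ValueError(f"unsupported delimiter: {csv_delimiter!r}")
        headers["Data-Format"] = f"csv;delimiter={csv_delimiter}"
    if len(structure_urns) > 1:
        for urn in structure_urns:
            # The artefact class is the last dotted segment before '='.
            artefact = urn.split("=", 1)[0].rsplit(".", 1)[-1]
            if artefact not in MULTI_URN_ARTEFACTS:
                raise ValueError(f"not allowed in a multi-URN list: {urn}")
    if structure_urns:
        # Assumption: URNs are joined with a bare comma.
        headers["Structure"] = ",".join(structure_urns)
    flags = {
        "Inc-Metrics": inc_metrics,
        "Inc-Valid": inc_valid,
        "Inc-Invalid": inc_invalid,
        "Zip": zip_output,
        "Prior-Data-Dependent": prior_data_dependent,
    }
    for name, enabled in flags.items():
        if enabled:
            headers[name] = "true"
    return headers


def expected_response_kind(headers):
    """Predict the response packaging described for Inc-Valid, Inc-Invalid and Zip."""
    if headers.get("Zip") == "true":
        return "zip"
    if headers.get("Inc-Valid") == "true" or headers.get("Inc-Invalid") == "true":
        return "multipart/mixed"
    return "application/json"


headers = build_headers(
    csv_delimiter="semicolon",
    structure_urns=["urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=OECD:AGRIC_OUTLOOK_2011_2020(1.0)"],
    inc_valid=True,
)
print(headers)
print(expected_response_kind(headers))
```

Checking the URN artefact type on the client side is optional; it simply surfaces a likely rejection before the request is sent.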

Response

The validation output contains both human-readable error descriptions and machine-processable locations of the errors within the dataset. The location in the dataset is described as a key or observation locator in the format A:UK:M:2008, where each component is a Dimension value and the components are separated by colons. If the error position is Observation, the last part of the key is the observation time period.
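
A locator of this form can be split into its parts as in the sketch below. The function name parse_locator and the returned dictionary layout are assumptions made for illustration, and the sketch assumes the time period itself contains no colon.

```python
def parse_locator(key: str, position: str) -> dict:
    """Split a locator such as 'A:UK:M:2008' into dimension values and period."""
    parts = key.split(":")
    if position == "Observation":
        # For observation-level errors the last component is the time period.
        return {"dimensions": parts[:-1], "period": parts[-1]}
    return {"dimensions": parts, "period": None}


print(parse_locator("A:UK:M:2008", "Observation"))
print(parse_locator("AFR:BT:AA", "Series"))
```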

Since Fusion Registry 10.5.4, error code values are output, which provide a unique code for each type of error. For a list of the error codes, refer to error codes.

Three types of output can be produced, all sharing a common structure:

  • unable to parse the input (returns HTTP 400)
  • able to parse the input, but it references an invalid data structure (returns HTTP 200)
  • parsed the input and returned output, which may contain validation errors (returns HTTP 200)

Below are examples of each.

Valid Dataset

{
 "Meta": {
   "RequestTime": 1564410081711,
   "Duration": 43
 },
 "FileFormat": "Structure Specific (Compact) v2.1",
 "Prepared": "2019-07-29T10:23:01",
 "SenderId": "FR_DEMO",
 "DataSetId": null,
 "Status": "Complete",
 "Errors": false,
 "Datasets": [
   {
     "DSD": "urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=OECD:HIGH_AGLINK_2011(1.0)",
     "Dataflow": "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=OECD:AGRIC_OUTLOOK_2011_2020(1.0)",
     "DataProvider": "urn:sdmx:org.sdmx.infomodel.base.DataProvider=METATECH:DATA_PROVIDERS(1.0).METATECH",
     "ProvisionAgreement": "urn:sdmx:org.sdmx.infomodel.registry.ProvisionAgreement=OECD:OECD_AGRIC_OUTLOOK(1.0)",
     "KeysCount": 2,
     "ObsCount": 62,
     "GroupsCount": 0,
     "Errors": false,
     "ReportedPeriods": {
       "A": {
         "Name": "Annual",
         "StartPeriod": "1990",
         "EndPeriod": "2020"
       }
     }
   }
 ],
 "PreventsConversion": false,
 "PreventsPublication": false
}


Dataset with Errors

{
 "Meta": {
   "RequestTime": 1564401209760,
   "Duration": 34
 },
 "InvalidData": {
   "Datasets": [
     {
       "Structure": "urn:sdmx:org.sdmx.infomodel.registry.ProvisionAgreement=OECD:OECD_AGRIC_OUTLOOK(1.0)",
       "Series": 2,
       "Observations": 61,
       "Groups": 0
     }
   ]
 },
 "ValidData": {
   "Datasets": [
     {
       "Structure": "urn:sdmx:org.sdmx.infomodel.registry.ProvisionAgreement=OECD:OECD_AGRIC_OUTLOOK(1.0)",
       "Series": 2,
       "Observations": 32,
       "Groups": 0
     }
   ]
 },
 "FileFormat": "Structure Specific (Compact) v2.1",
 "Prepared": "2019-07-29T10:23:01",
 "SenderId": "FR_DEMO",
 "DataSetId": null,
 "Status": "Complete",
 "Errors": true,
 "Datasets": [
   {
     "DSD": "urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=OECD:HIGH_AGLINK_2011(1.0)",
     "Dataflow": "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=OECD:AGRIC_OUTLOOK_2011_2020(1.0)",
     "DataProvider": "urn:sdmx:org.sdmx.infomodel.base.DataProvider=METATECH:DATA_PROVIDERS(1.0).METATECH",
     "ProvisionAgreement": "urn:sdmx:org.sdmx.infomodel.registry.ProvisionAgreement=OECD:OECD_AGRIC_OUTLOOK(1.0)",
     "KeysCount": 3,
     "ObsCount": 93,
     "GroupsCount": 0,
     "ReportedPeriods": {
       "A": {
         "Name": "Annual",
         "StartPeriod": "1990",
         "EndPeriod": "2020"
       }
     },
     "Errors": true,
     "ValidationReport": [
       {
         "Type": "Constraint",
         "Errors": [
           {
             "ErrorCode": "REG-201-250",
             "Message": "Disallowed Dimension Value: REF_AREA=AFR",
             "Dataset": 0,
             "ComponentId": "REF_AREA",
             "ReportedValue": "AFR",
             "Position": "Series",
             "Keys": ["AFR:BT:AA"]
           }
         ]
       },
       {
         "Type": "Representation",
         "Errors": [
           {
             "ErrorCode": "REG-201-200",
             "Message": "Dimension 'VARIABLE' is reporting value 'AA' which is not a valid representation in referenced Codelist 'OECD:CL_HIGH_AGLINK_2011_VARIABLE(1.0)'",
             "Dataset": 0,
             "Position": "Series",
             "ComponentId": "VARIABLE",
             "ReportedValue": "AA",
             "Keys": ["AFR:BT:AA"]
           },
           {
             "ErrorCode": "REG-201-201",
             "Message": "Error in Primary Measure 'OBS_VALUE': Reported value 'XXX' is not of expected type 'Double'",
             "Dataset": 0,
             "ComponentId": "OBS_VALUE",
             "ReportedValue": "XXX",
             "Position": "Observation",
             "Keys": ["AFR:BT:IM:2010"]
           }
         ]
       },
       {
         "Type": "FormatSpecific",
         "Errors": [
           {
             "Message": "Unexpected attribute 'ASD' for element 'StructureSpecificData/DataSet/Series/Obs'",
             "Dataset": 0,
             "Position": "Dataset"
           }
         ]
       }
     ]
   }
 ],
 "PreventsConversion": false,
 "PreventsPublication": true
}

Note that the first three elements (‘Meta’, ‘InvalidData’, ‘ValidData’) are present in the report if Inc-Metrics is set to true. Setting Inc-Valid and Inc-Invalid to true enables the report to include the metrics for the valid and invalid data.
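
As a sketch of how the metrics elements might be consumed, the snippet below totals the observation counts per section. The sample is an abbreviated copy of the report above, and the helper name count_observations is an assumption; a section that is absent (for example when Inc-Metrics is not set) is treated as zero.

```python
import json

# Abbreviated copy of the metrics elements from the report above.
report = json.loads("""
{
  "InvalidData": {"Datasets": [{"Series": 2, "Observations": 61, "Groups": 0}]},
  "ValidData": {"Datasets": [{"Series": 2, "Observations": 32, "Groups": 0}]}
}
""")


def count_observations(report: dict, section: str) -> int:
    """Sum the Observations metric over all datasets in one metrics section."""
    datasets = report.get(section, {}).get("Datasets", [])
    return sum(dataset.get("Observations", 0) for dataset in datasets)


print("valid:", count_observations(report, "ValidData"))
print("invalid:", count_observations(report, "InvalidData"))
```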

Note also that each error has a Type, which is the category of error that caused the validator to fail. For a list of all validators, see the following section on Validators. The error Position is one of Dataset, Series, Observation, or Group.
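
The nesting of datasets, error types and individual errors can be flattened as in the sketch below. The sample is an abbreviated copy of the report above; the helper name iter_errors and the flattened field names are assumptions. ErrorCode is read with a default because the FormatSpecific error in the example carries no code.

```python
import json

# Abbreviated copy of the "Dataset with Errors" report above.
sample = json.loads("""
{"Datasets": [{"ValidationReport": [
  {"Type": "Constraint", "Errors": [
    {"ErrorCode": "REG-201-250",
     "Message": "Disallowed Dimension Value: REF_AREA=AFR",
     "Position": "Series", "Keys": ["AFR:BT:AA"]}]},
  {"Type": "FormatSpecific", "Errors": [
    {"Message": "Unexpected attribute 'ASD' for element 'StructureSpecificData/DataSet/Series/Obs'",
     "Position": "Dataset"}]}
]}]}
""")


def iter_errors(report: dict):
    """Yield one flat record per error across all datasets and error types."""
    for dataset in report.get("Datasets", []):
        for group in dataset.get("ValidationReport", []):
            for error in group.get("Errors", []):
                yield {
                    "type": group.get("Type"),
                    "code": error.get("ErrorCode"),  # may be missing
                    "position": error.get("Position"),
                    "keys": error.get("Keys", []),
                    "message": error.get("Message"),
                }


errors = list(iter_errors(sample))
for record in errors:
    print(record["type"], record["code"], record["position"], record["keys"])
```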

PreventsConversion and PreventsPublication indicate the severity of the errors. The system administrator can configure in the Fusion Registry which errors prevent conversion and publication.

Dataset with an Unresolvable Dataset Reference

{
  "FileFormat": "Generic v2.1",
  "MimeType": "application/xml",
  "Status": "InvalidRef",
  "Errors": true,
  "Datasets": [{"Dataflow": "urn:sdmx:org.sdmx.infomodel.datastructure.Dataflow=BIS:INVALID_DATAFLOW(1.0)"}]
}

Errors has a value of true, the Status is InvalidRef, and Datasets provides the reported reference which could not be resolved.

Dataset which could not be read

{
  "Status": "Error",
  "Errors": true,
  "Error": "Unexpected '<' character in element (missing closing '>'?)\r\n at [row,col {unknown-source}]: [17,3]"
}

This error is reported when the Fusion Registry is unable to determine what type of data the dataset contains, and so is unable to process the dataset for validation.
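
The report shapes shown in this section can be told apart by their Status and Errors fields, as in the sketch below. The function name and the returned labels are assumptions; only the Status values Complete, InvalidRef and Error appear in the examples, so any other value is treated here like a completed validation.

```python
def classify_report(report: dict) -> str:
    """Map a validation report to one of four illustrative outcome labels."""
    status = report.get("Status")
    if status == "Error":
        return "unreadable"            # dataset could not be read
    if status == "InvalidRef":
        return "unresolved-reference"  # structure reference not resolvable
    if report.get("Errors"):
        return "validated-with-errors"
    return "valid"


# Abbreviated copies of the example reports above.
print(classify_report({"Status": "Complete", "Errors": False}))
print(classify_report({"Status": "Complete", "Errors": True}))
print(classify_report({"Status": "InvalidRef", "Errors": True}))
print(classify_report({"Status": "Error", "Errors": True}))
```

The sketch keys off the report body rather than the HTTP status code, so the same logic applies however the response was retrieved.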