EdgeServerCompiler

From Metadata Technology Wiki

Fusion Edge Compiler

Overview

The Fusion Edge Compiler is a command line client written in Java that runs on Windows and UNIX operating systems. Its responsibility is to compile SDMX data, structure, and metadata files for dissemination by the Fusion Edge Server. The Fusion Edge Compiler provides three functions:


  1. To pull content from SDMX web services (for example, Fusion Registry web services) in order to populate a local file system of content to publish
  2. To compile content in the local file system to create a new ‘environment’ which can be consumed by the Fusion Edge Server
  3. To publish the environment to an Amazon S3 bucket from which distributed Fusion Edge Servers can take their content, if configured to do so

The second function, compile, is the main function of the compiler. The other two functions can be performed manually if required; the Fusion Edge Compiler provides them so that the full data extract, transform, and load process can be automated.

General Arguments

The command line client provides three scripts: pull, compile, and publish. Each script has a UNIX (.sh) file and a Windows (.bat) file. Each script can take a number of command line arguments. The following arguments are common to all scripts:

  1. The properties file (prop argument). Each script can read one or more properties files. A properties file is a JSON file containing the same configuration options that can be passed directly to the script as command line arguments, so a script can read its arguments from a file, from the command line, or both. When both are provided, the script merges the arguments from the properties file with the command line arguments. If a configuration option is passed as a command line argument and also appears in the properties file, the command line argument takes precedence. Because command line arguments and properties files can be used in conjunction with one another, every argument is marked as optional. However, this document notes which arguments are required and must exist as either a command line argument or a properties file argument.
  2. Ledger location (lgr argument). This must be the root location of the ledger. It can be provided as a path to a folder on a file system, as the http(s) URL of the root folder if the ledger is hosted on a web service, or prefixed with s3:bucketname if Amazon S3 is used as the file store.
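The precedence rule described above can be illustrated with a minimal Python sketch. It is not the compiler's own code. The mapping between command line names and properties file keys is an assumption inferred from the names used in this document, and the rule that a later properties file overrides an earlier one is also an assumption.

```python
# Assumed correspondence between command line arguments and properties file
# keys, inferred from the names in this document (not confirmed by the tool).
CLI_TO_PROPERTY = {
    "lgr": "Ledger",
    "tgt": "TgtDir",
    "api": "SdmxAPI",
    "upd": "UpdatedAfter",
    "usr": "Username",
    "pwd": "Password",
}


def merge_options(properties, cli_args):
    """Merge parsed properties files, then overlay command line arguments."""
    merged = {}
    for props in properties:
        # Assumption: a later properties file overrides an earlier one.
        merged.update(props)
    for name, value in cli_args.items():
        # Command line arguments take precedence over properties file values.
        merged[CLI_TO_PROPERTY.get(name, name)] = value
    return merged


file_options = {"Ledger": "s3:mybucket", "TgtDir": "/home/compiler/target"}
cli_options = {"tgt": "/home/compiler/other"}
effective = merge_options([file_options], cli_options)
print(effective)
```

In this sketch the target directory comes from the command line, while the ledger location still comes from the properties file.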


Pull Content

buildFileSystem.sh (UNIX) or buildFileSystem.bat (Windows)

The Fusion Edge Compiler queries an SDMX web service for structural metadata, data, and reference metadata content based on what it has been requested to pull. It can work against a Fusion Registry web service as well as any other SDMX web service that complies with the SDMX specification.

The Fusion Edge Compiler pulls the content to build a target directory of files in the correct structure for the compile process to operate.

Command Line Arguments

Note: Arguments can be provided by the command line, or alternatively via a JSON properties file, or a mix of both.

Argument | Example | Description
prop | -prop "/home/props.json" | A reference to one or more properties files (separated by a space)
api | -api "https://yourorg.org/sdmx" | The URL of the web service to pull the content from
apict | -apict 500 | Connect timeout to the API in seconds (default 200)
apirt | -apirt 500 | Read timeout from the API in seconds (default 200)
apiua | -apiua "EdgeCL" | User Agent sent in the HTTP request header to the API
tgt | -tgt "/home/compiler/target" | The target directory to write the files and folders to
lgr | -lgr "s3:mybucket" | The location of the current live ledger. If this is provided, the last compile time of the ledger is used as the updated after time when pulling data
df | -df "ACY:DF_ID(1.0)" "ACY:DF2(1.0)" | A reference to one or more Dataflows to pull data for (separated by a space). If Dimension filters are to be applied to the Dataflow, the properties file should be used. The keyword all can be used to pull data for all Dataflows. A Dataflow argument will include both the data and structural metadata (Dataflow plus all descendants) in the output.
str | -str "Codelist=ACY:CL_FREQ(1.0),ACY_CL_AGE(1.0) CategoryScheme=ACY:*(*)" or -str all | A list of structures to include in the output, in addition to those that are included automatically based on the Dataflows included in the output. The structure and all descendants of that structure will be included in the output. The * character is used as a wildcard for the Agency, Id, and Version parameters. All structures can be obtained using the all keyword as the class type.
upd | -upd "2020-01-30T00:00.00" | Pull data that was updated after this time. This is applied by using the updatedAfter web service query parameter against the target web service
replace | -replace | If present, all the files in the target directory will be deleted before the pull is run
metadata | -metadata | If present, the pull process will query for all Reference Metadata and include this in the output
zip | -zip | If present, the output files will all be in zip format
usr | -usr "myusername" | Username to authenticate with the REST API. If using the Fusion Registry, it should correspond to a user account in the Fusion Registry
pwd | -pwd "mypassword" | Password to authenticate with the REST API
s3rgn | -s3rgn "us-east-1" | Amazon S3 region (required if the Ledger is hosted on Amazon S3)
s3sec | -s3sec "azxzcvbnm" | Amazon S3 Secret (required if the Ledger is hosted on Amazon S3)
s3acc | -s3acc "azxzcvbnm" | Amazon S3 Access Key (required if the Ledger is hosted on Amazon S3)
tmp | -tmp /tmp | Temporary directory to use for transient files. If not provided, the java.io.tmpdir JVM variable is used, usually defaulting to the user tmp directory
rmtmp | -rmtmp | If present, all files in the tmp directory will be deleted before start
report | -report | If present, the report will be written to a file called report.json in the target (tgt) directory
h | -h | Display the help information


Properties File

If providing configuration parameters via a properties file, the file must be in JSON format and have the following structure:

{
   "Ledger" : 	"s3:mybucket",
   "TgtDir" :  	"/home/compiler/target",
   "SdmxAPI" : 	"https://demo.metadatatechnology.com/FusionRegistry/ws/public/sdmxapi/rest",
   "UpdatedAfter" :	 "2010",
   "Username" : 	"myuser",
   "Password" : 	"pwd",
   "AllData" : 	true,
   "FullReplace" : 	true,
   "Zip" : 		true,
   "Metadata" : 	true,
   "S3Region":	"us-east-1",
   "S3SecretKey":	"azxasdasfcvbn",
   "S3AccessKey":	"sxcvbnmu",
   "SubCubes":{
      "ECB:EXR(1.0)" : {
         "SubCube1" : {
            "Include" : {
               "FREQ":["A","M"],
               "REF_AREA":["UK"]
            }
         }
      },
      "WB:POVERTY(1.0)":{ }
   },
   "Structures":{
      "Codelist": ["ECB,EXR,1.0"],
      "HierarchicalCodelist": ["ECB", "BIS"],
      "all": ["SDMX"]
   }
}

Note: The properties file supports the same information as the command line arguments. In addition, it supports the definition of sub-cubes of data and of additional structures.
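Because the properties file must be valid JSON, a missing or trailing comma in a hand-edited file is an easy mistake to make. The following Python sketch checks the file content before it is passed to a script. The function name load_properties is hypothetical and not part of the Fusion Edge Compiler.

```python
import json


def load_properties(text):
    """Parse properties file content, reporting where a syntax error occurs."""
    try:
        options = json.loads(text)
    except json.JSONDecodeError as error:
        raise ValueError(
            f"Invalid JSON at line {error.lineno}, column {error.colno}: {error.msg}"
        ) from error
    if not isinstance(options, dict):
        raise ValueError("A properties file must contain a JSON object")
    return options


valid = load_properties('{"TgtDir": "/home/compiler/target", "Zip": true}')
print(valid["TgtDir"])
```

A trailing comma before a closing brace, or a missing comma between two entries, is reported with its line and column rather than failing later inside the script.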


Sub Cubes

The SubCubes section of the properties file can be used to define Dimension filters for each Dataflow. If the df argument on the command line defines a single Dataflow to export, and the properties file contains sub-cube definitions for multiple Dataflows, the compiler honours the df argument and exports data only for that single Dataflow. It will, however, apply the sub-cube filters for that Dataflow if they are present in the properties file.

A sub-cube with no filters, as shown in the WB:POVERTY example above, will result in the full dataset being exported.
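One plausible reading of the sub-cube filters is sketched below in Python. The semantics are assumptions, not confirmed behaviour: a series is selected when, for at least one sub-cube, every filtered Dimension holds one of the listed values, and a Dataflow with no sub-cubes selects everything. The CURRENCY Dimension in the sample series is hypothetical.

```python
def series_matches(series_key, sub_cubes):
    """Return True if the series is selected by at least one sub-cube."""
    if not sub_cubes:
        # No sub-cubes defined: the full dataset is exported.
        return True
    for sub_cube in sub_cubes.values():
        include = sub_cube.get("Include", {})
        # Every filtered Dimension must hold one of the allowed values.
        if all(series_key.get(dim) in allowed for dim, allowed in include.items()):
            return True
    return False


exr_sub_cubes = {"SubCube1": {"Include": {"FREQ": ["A", "M"], "REF_AREA": ["UK"]}}}
annual_uk = {"FREQ": "A", "REF_AREA": "UK", "CURRENCY": "USD"}
daily_uk = {"FREQ": "D", "REF_AREA": "UK", "CURRENCY": "USD"}
print(series_matches(annual_uk, exr_sub_cubes))
print(series_matches(daily_uk, exr_sub_cubes))
```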

When the compiler is compiling data for all Dataflows (AllData : true or -df "all" as a command line argument) it will still use the sub-cube definitions if they exist, to filter the Dataflow contents.

Structures

The Structures section of the properties file defines which structural metadata should be included in the outputs. Note that when outputting data for a Dataflow, the Dataflow and all descendants (DSD, Codelist, Concept Scheme, Agency Scheme) are automatically included in the structure metadata that is generated and do not need to be explicitly specified. The Structures section provides the means to include additional structures that are not directly related to the Dataflow. If exporting structures only, this section must be present. The arguments are the structure type (this is the same as the path parameter on the REST API, for example Codelist). Each structure type takes an array of structure filters in the format AgencyId,Id,Version, where each part is optional and an absent part means all. The keyword all can be used as a structure type to indicate all structures, and it can also take filters for agency, id, and version.

Note: Whenever a structure is included in the export, all the descendants of that structure will be included automatically. For example, if a Hierarchical Codelist is included in the export, the related Codelists and Agency Scheme(s) will also be included, without having to be explicitly mentioned.
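The AgencyId,Id,Version filter format can be illustrated with a short Python sketch. It only shows how an absent part is read as all. The function name parse_structure_filter is hypothetical, and the compiler's own parsing may differ.

```python
def parse_structure_filter(text):
    """Split an 'AgencyId,Id,Version' filter, treating absent parts as 'all'."""
    parts = [part.strip() for part in text.split(",")]
    # Pad to three parts so that "ECB" behaves like "ECB,,".
    parts += [""] * (3 - len(parts))
    agency, structure_id, version = (part or "all" for part in parts[:3])
    return {"AgencyId": agency, "Id": structure_id, "Version": version}


print(parse_structure_filter("ECB,EXR,1.0"))
print(parse_structure_filter("ECB"))
```

Under this reading, the filter "ECB" from the example properties file selects every Id and every Version maintained by the ECB agency.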

Report

A JSON report is written to System.out (standard output) or, if the -report argument is present, to a file called report.json in the target (tgt) directory. The report can be used to determine which settings were used, and which datasets were successfully obtained from the API.

An example report is given below (all durations are in milliseconds):

{
	"Header": {
		"Prepared": "2021-09-07T08:37:32.104Z",
		"API": "https://server/ws/public/sdmxapi/rest",
		"Ledger": null
	},
	"RESTSettings": {
		"ConnectTimeout": 700,
		"ReadTimeout": 500,
		"UserAgent": "FusionEdgeCL"
	},
	"ExplicitStructures": [],
	"Datasetstructures": ["Dataflow=all:all(all)"],
	"Datasets": {
		"ECB:EXR(1.0)": {
			"all": {
				"Success": true,
				"Duration": 9808
			}
		},
		"ECB:TRD(1.0)": {
			"all": {
				"Success": false,
				"Duration": 673
			}
		}
	},
	"Duration": 20724
}
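A report of this shape can be checked automatically after a pull. The Python sketch below lists the datasets whose Success flag is false. It assumes only the keys visible in the example report; the meaning of the inner key (all in the example) is not stated in this document, so it is reported back unchanged.

```python
def failed_datasets(report):
    """Return (dataflow, inner key) pairs for entries that did not succeed."""
    failures = []
    for dataflow, entries in report.get("Datasets", {}).items():
        for inner_key, outcome in entries.items():
            if not outcome.get("Success", False):
                failures.append((dataflow, inner_key))
    return failures


# Trimmed copy of the example report, written as a Python dictionary.
example_report = {
    "Datasets": {
        "ECB:EXR(1.0)": {"all": {"Success": True, "Duration": 9808}},
        "ECB:TRD(1.0)": {"all": {"Success": False, "Duration": 673}},
    },
    "Duration": 20724,
}
print(failed_datasets(example_report))
```

When the report is written to report.json, it can be loaded with json.load and passed to the same function.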

Compile Content

compileFileSystem.sh (UNIX) or compileFileSystem.bat (Windows)

The compile script reads the files from the source directory and compiles them into the target directory. The result of the compile process is a complete environment that can be published to the Fusion Edge Server. If the target directory does not exist, it will be created. If it does exist, it may already contain files, for example from previous compilations. The compile process will remove any files from the target directory that are not required in the new environment.

Command Line Arguments

Note: Arguments can be provided by the command line, or alternatively via a JSON properties file, or a mix of both.

Argument | Example | Description
prop | -prop "/home/props.json" | A reference to one or more properties files (separated by a space)
src | -src "/home/compiler/source" | The source directory that contains the files to be compiled (this is the tgt directory in the build file system script)
tgt | -tgt "/home/compiler/compiled" | The target directory to write the compiled environment to
lgr | -lgr "s3:mybucket" | The location of the current live ledger. This is required for the following purposes: (1) The last compile time of the ledger is used to determine which files to include in the compile, by comparing the timestamp on each file with the last compile time. (2) The ledger is used to build the next ledger entry and to determine the version number given to the new environment. (3) If running an update compile (not a full replace), any live environment files that are required and unchanged in the new environment are pulled from the live environment. (4) If updating a dataset, the live dataset is pulled from the live environment as the base dataset to update with the new series / observations.
sgn | -sgn "my_signature" | A secret signature to sign the generated files with. The same signature should be used for all compilations. The Fusion Edge Server must be given the same secret signature via its properties file so that it can verify that the content in an environment was not corrupted or tampered with.
liv | -liv "2020-01-30T00:00.00" | The go live time. This is used if the new environment is to go live at a particular point in time. The Fusion Edge Server will not make the environment live until this timestamp.
f | -f | If present, a full recompile of all the files in the src directory will be performed. Note that this will still merge datasets with those read from the ledger, unless the rd argument is also present
rd | -rd | If present, any data files read in will replace any existing datasets read from the ledger. If a dataset is not in the ledger, or if no ledger is provided, the dataset will be created in the output
s3rgn | -s3rgn "us-east-1" | Amazon S3 region (required if the Ledger is hosted on Amazon S3)
s3sec | -s3sec "azxzcvbnm" | Amazon S3 Secret (required if the Ledger is hosted on Amazon S3)
s3acc | -s3acc "azxzcvbnm" | Amazon S3 Access Key (required if the Ledger is hosted on Amazon S3)
tmp | -tmp /tmp | Temporary directory to use for transient files. If not provided, the java.io.tmpdir JVM variable is used, usually defaulting to the user tmp directory
rmtmp | -rmtmp | If present, all files in the tmp directory will be deleted before start
h | -h | Display the help information

Properties File

If providing configuration parameters via a properties file, the file must be in JSON format and have the following structure:

{
   "Ledger" : 	"s3:mybucket",
   "SrcDir" :  	"/home/compiler/source",
   "TgtDir" :  	"/home/compiler/compiled",
   "ForceRebuild" :  	false,
   "ReplaceData" :  	false,
   "Signature" : 	"myuser",
   "LiveTime" : 	"2020-01-30T00:00.00",
   "S3Region":	"us-east-1",
   "S3SecretKey":	"azxasdasfcvbn",
   "S3AccessKey":	"sxcvbnmu"
}

Publish Content

publishContent.sh (UNIX) or publishContent.bat (Windows)

The Publish Content script is used to move an environment from the local file system to Amazon S3. If the environment is hosted elsewhere, for example a private web server, or a file system local to the Fusion Edge Server, then the environment must be moved via some other means.

The Publish Content script only moves the files in the built environment that do not already exist in Amazon S3. The ledger file is always moved last, as this is the file that the Fusion Edge Server polls to check for updates. By moving this file last, the publish workflow ensures that all the files required to build the environment are in place before an update is attempted.

Note: If moving an environment manually, it is possible to replicate the behaviour of this script by moving only the files that have changed. The ledger_indexes folder contains a description of all the files that were created as part of the new version (files are never modified, only created or removed).
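When moving an environment manually, the ordering rule (new files first, ledger file last) can be sketched as follows in Python. The ledger file name ledger.json and the other file names are placeholders, since this document does not name the ledger file. The sketch only plans the order and performs no transfer.

```python
def plan_upload(local_files, remote_files, ledger_name="ledger.json"):
    """Order the files to move: missing files first, the ledger file last."""
    pending = [
        name
        for name in sorted(local_files)
        if name not in remote_files and name != ledger_name
    ]
    if ledger_name in local_files:
        # The ledger is always moved, and always after everything else.
        pending.append(ledger_name)
    return pending


local = {"ledger.json", "data/a.bin", "data/b.bin"}
remote = {"ledger.json", "data/a.bin"}
print(plan_upload(local, remote))
```

Moving the ledger last means a server polling the ledger never sees a new version before the files it refers to are in place.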

Command Line Arguments

Note: Arguments can be provided by the command line, or alternatively via a JSON properties file, or a mix of both.


Argument | Example | Description
prop | -prop "/home/props.json" | A reference to one or more properties files (separated by a space)
src | -src "/home/compiler/compiled" | The source directory that contains the compiled content (this is the tgt folder in the compile process)
lgr | -lgr "s3:mybucket" | The Amazon S3 bucket location
s3rgn | -s3rgn "us-east-1" | Amazon S3 region (required if the Ledger is hosted on Amazon S3)
s3sec | -s3sec "azxzcvbnm" | Amazon S3 Secret (required if the Ledger is hosted on Amazon S3)
s3acc | -s3acc "azxzcvbnm" | Amazon S3 Access Key (required if the Ledger is hosted on Amazon S3)
h | -h | Display the help information
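The three scripts can be chained to automate the pull, compile, and publish steps. The Python sketch below only builds the argument lists, using arguments documented on this page. The script location (current directory), the directory names, and the sample values are placeholders. The Amazon S3 arguments (s3rgn, s3sec, s3acc) are omitted and would need to be added, or supplied via a properties file, when the ledger is hosted on Amazon S3.

```python
def pipeline_commands(api, dataflow, source_dir, compiled_dir, ledger, signature):
    """Build the pull, compile, and publish command lines in execution order."""
    pull = ["./buildFileSystem.sh", "-api", api, "-tgt", source_dir,
            "-df", dataflow, "-lgr", ledger]
    compile_step = ["./compileFileSystem.sh", "-src", source_dir,
                    "-tgt", compiled_dir, "-lgr", ledger, "-sgn", signature]
    publish = ["./publishContent.sh", "-src", compiled_dir, "-lgr", ledger]
    return [pull, compile_step, publish]


commands = pipeline_commands(
    api="https://yourorg.org/sdmx",
    dataflow="all",
    source_dir="/home/compiler/source",
    compiled_dir="/home/compiler/compiled",
    ledger="s3:mybucket",
    signature="my_signature",
)
for command in commands:
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) so that a failing step stops the chain before the next one starts.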

Worked Examples

Compile New Dataset into Environment

Description: The Edge Server is running against an environment with datasets published to it. A new dataset is ready for publication, for a Dataflow which is not part of the current live environment. The file system has been built and contains the dataset and associated structure file for the data. The remainder of the files in the file system are either not present, or are currently in the live environment from a previous compile.

The desired output is to merge the new dataset and related structures into the current live environment.

Compiler Arguments:

compileFileSystem.bat -src "MyFiles" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password"

Explanation:

The default behaviour of the compilation process is to merge in any new information from the local file system into the live environment. A copy of the live environment is downloaded from the -lgr location, and this forms the basis for the new environment which is put into the -tgt folder. The structures read in from the local file system are merged into the copy of the live environment, and the new dataset is compiled and also included in the copy of the live environment. This live environment copy is written into the CompiledEnvironment folder, as designated by the -tgt argument. All other files in the LiveEnvironment remain unchanged in the new CompiledEnvironment, so when the CompiledEnvironment is deployed, the Edge Server will contain all the same information as before, plus the new dataset.

Update or Add Observations or Series into Existing Dataset

Description: The Edge Server is running against an environment with datasets published to it. There is some new data for publication for one of the existing datasets. The new data contains some observations which already exist in the live environment and are to be overwritten. It also contains some new observations which do not yet exist.

The desired output is to merge the information from the data on the file system into the current live environment.

Compiler Arguments:

compileFileSystem.bat -src "MyFiles" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password"

Explanation:

The default behaviour of the compilation process is to merge any new information from the local file system into the live environment. A copy of the live environment is downloaded from the -lgr location, and this forms the basis for the new environment which is put into the -tgt folder. The data file in the MyFiles folder can either contain all the data (existing plus new data) or contain only the deltas (just the new observations and those being edited). Either way, the compiler detects that this dataset already exists in the live environment and reads it in. It then reads the data from the file system and merges the contents into the copy from the live environment, before writing it back out into the CompiledEnvironment folder, as designated by the -tgt argument. All other files in the LiveEnvironment remain unchanged in the CompiledEnvironment, so when the new environment is deployed, the Edge Server will contain all the same information as before, plus the new and edited observations.
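The difference between the default merge and a replace (the rd argument) can be pictured with a small Python model. This illustrates the described behaviour only and is not the compiler's implementation. Observations are modelled as a dictionary keyed by a hypothetical (series key, period) pair.

```python
def compile_dataset(live, incoming, replace=False):
    """Combine a live dataset with incoming observations."""
    if replace or live is None:
        # Replace mode, or no live dataset: the incoming data is used as-is.
        return dict(incoming)
    merged = dict(live)
    # Merge mode: existing observations are overwritten, new ones are added.
    merged.update(incoming)
    return merged


live = {("A.UK", "2019"): 1.0, ("A.UK", "2020"): 2.0}
delta = {("A.UK", "2020"): 2.5, ("A.UK", "2021"): 3.0}
print(compile_dataset(live, delta))
print(compile_dataset(live, delta, replace=True))
```

In merge mode the 2019 observation survives, which is why deleting observations requires a replace with the full dataset rather than a delta.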

Delete Observations or Series in Existing Dataset

Description: The Edge Server is running against an environment with datasets published to it. One of the datasets has observations or series that need to be removed.

The desired output is to republish only the dataset that needs the data removed.

Compiler Arguments:

buildFileSystem.bat -api "https://yourorg.org/sdmx" -tgt "/home/compiler/replacedata" -df "ACY:DF_ID(1.0)"

compileFileSystem.bat -src "/home/compiler/replacedata" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password" -rd

Explanation:

The only way to delete observations and series from a dataset is to recompile the full dataset (minus the observations that are no longer required). The first step is to ensure the full dataset is in the file system, ready for compile. This example shows how to pull the full dataset from a web service API using the buildFileSystem script. However, it is possible to manually create or edit a dataset in the file system, in which case the buildFileSystem step can be skipped.

buildFileSystem.bat is used to pull the full dataset from the web service, and place it in the target directory. There is no need to provide a ledger location, as the pull request should NOT include information about pulling data updated after a given timestamp. The df argument provides the reference to the dataset to pull.

compileFileSystem.bat is used to compile the file in the replacedata folder. The reference to the ledger is used so that a copy of the existing environment can be taken, and the environment version can be updated accordingly. The rd argument is used to ensure that the dataset being compiled into the environment is NOT merged into the existing dataset, but used as a replacement of the dataset in the existing environment.

Multiple Compiles without impacting Last Updated in Delta Mode

Description: When building the file system of data to compile, it is possible to pull only the data from the API which was modified since the last compile. This is achieved by passing the location of the current live ledger to the buildFileSystem script, as the ledger contains the timestamp of the last compile. In this use case, the requirement is to perform more than one compile to build the next environment, without impacting the ability to pull the delta. A concrete example: a user wants to delete some observations from a dataset, which is achieved by compiling a full replace. They then want to pull the deltas of the remaining datasets and compile these, and they want to ensure that the timestamp of the delta is not impacted by the previous delete operation.

There are a couple of ways this can be achieved, as explained below:

Technique 1:

The first technique is to use two ledger locations. The buildFileSystem script always references the live ledger. This is the location of the current live environment, and it contains the timestamp of when this environment was built. The first time compileFileSystem is used (for example to compile in the full replacement of the datasets) it outputs a new environment to the local file system, for example /home/compiled/. The second time compileFileSystem is used (for example to merge in the deltas) the lgr location is given as the output of the previous compile (which can also be the location of the target folder for the new compile), for example /home/compiled/.

This technique essentially uses the live location to always know when the last 'real' update was, and the local compiled folder as the location into which each new piece of information is compiled. A worked example is shown below:

Build Replace Dataset File System

buildFileSystem.bat -api "https://yourorg.org/sdmx" -tgt "/home/compiler/replacedata" -df "ACY:DF_ID(1.0)"

Compile Replace Dataset into live Environment

compileFileSystem.bat -src "/home/compiler/replacedata" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password" -rd

Build Delta Dataset File System (all Dataflows)

buildFileSystem.bat -api "https://yourorg.org/sdmx" -tgt "/home/compiler/deltadata" -df "all" -lgr "LiveEnvironment"

Compile Delta Datasets into previous Compile

compileFileSystem.bat -src "/home/compiler/deltadata" -tgt "CompiledEnvironment" -lgr "CompiledEnvironment" -sgn "password" -rd

Technique 2:

The second technique is to always run the buildFileSystem command with an explicit date for updatedAfter, without referencing the ledger at all. This ensures that the data query has a known timestamp from which to pull the data.

buildFileSystem.bat -api "https://yourorg.org/sdmx" -tgt "/home/compiler/deltadata" -df "all" -upd "2020-01-30T00:00.00"

Explanation:

The buildFileSystem script does not reference the ledger. Instead, it is passed the explicit upd argument for Updated After.

Ignore file timestamps in Filesystem

Description: The default behaviour of the compilation process is to read the timestamp from the ledger (if provided) and use this as a filter on all files in the source file system. Files whose last updated timestamps are before the timestamp on the ledger are deemed to have already been compiled, and as such are ignored by the compilation process. So even if the source file system contains 100 files, if only 1 of those files was modified since the last compile, only that 1 file will be included in the next compilation process. As the default behaviour of the compiler is to take a copy of the live ledger files and merge into it, the compiler does not need to re-read files it has already processed.

It is possible for files to have older timestamps than those in the ledger even if they were not in any previous compilation process, for example if they were copied into the file system from another location on the file system. It is possible to 'touch' the files in order to update their timestamps. Alternatively, the compiler provides an argument that can be used to ignore all timestamps and read all files in the source file system.

Compiler Arguments:

compileFileSystem.bat -src "MyFiles" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password" -f

Explanation:

The source files are read from the MyFiles directory, to be compiled into the CompiledEnvironment directory. The LiveEnvironment is a copy of the last CompiledEnvironment which was published to production. Ordinarily the compiler would only scan the MyFiles directory for files updated since the LiveEnvironment was built. The -f argument tells the compiler to read all the files in the MyFiles directory, regardless of timestamp. All files are merged into the LiveEnvironment.
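The timestamp filter described in this example can be approximated with the Python sketch below. It is a model of the described behaviour and not the compiler's code. The function name files_to_compile and the use of an epoch timestamp for the last compile time are assumptions.

```python
import os


def files_to_compile(source_dir, last_compile_epoch, force=False):
    """List source files modified after the last compile, or all if forced."""
    selected = []
    for root, _dirs, names in os.walk(source_dir):
        for name in names:
            path = os.path.join(root, name)
            # force mirrors the -f argument: timestamps are ignored.
            if force or os.path.getmtime(path) > last_compile_epoch:
                selected.append(path)
    return sorted(selected)
```

A file copied in with an old modification time is skipped unless force is set or the file is touched first, which matches the two options given in the description.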

Replace dataset in Live Environment

Description: The default compile process is to always merge new data from the file system into the live environment. In this use case, the data file in the file system is to be used as a full replacement of the data in the live environment.

Compiler Arguments:

compileFileSystem.bat -src "MyFiles" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password" -rd

Explanation:

The source files are read from the MyFiles directory, to be compiled into the CompiledEnvironment directory. In this example the compiler will only read files whose timestamps are later than the build timestamp in the LiveEnvironment, and as such the MyFiles directory may contain only the dataset to update if required.

Ordinarily the compiler would read the data file and merge it into the dataset read from the LiveEnvironment. The presence of the -rd argument tells the compiler to discard the dataset in the LiveEnvironment and replace it with the one read from the file system. All dataset files whose timestamps are later than the timestamp on the LiveEnvironment will be replaced; all other datasets in the LiveEnvironment will remain unchanged.

Remove datasets in Live Environment

Description: The compile process always merges new information into the live environment; it does not remove information. This can be overcome at the dataset level by using the -rd argument, which replaces datasets in the live environment rather than merging into them. This argument does not delete datasets in the live environment. In order to remove datasets, a full replace should be performed.

Compiler Arguments:

compileFileSystem.bat -src "MyFiles" -tgt "CompiledEnvironment" -lgr "LiveEnvironment" -sgn "password" -fr

Explanation:

The full replace argument reads in all files from the local file system, regardless of timestamp. No merge is performed into the live environment; all datasets are replaced with those read in from the file system. Any datasets in the live environment which are not in the local file system are not included in the built environment.