EdgeServer

From Fusion Registry Wiki
Revision as of 02:07, 26 August 2022 by Mnelson (talk | contribs) (Fusion Edge Application Suite)

Fusion Edge

Overview

The Fusion Edge Server is a Java web application. Its responsibility is to host SDMX web services for the dissemination of data and related structural and reference metadata. The web services support all Data Formats supported by Fusion Registry.

The purpose of the Fusion Edge Server is to support public or internal dissemination of data and/or metadata, either via clients who use the web service API directly, or via software/web user interfaces which convert the information into a graphical display for data discovery, retrieval, and display.

The data and metadata retrieval Web Services offered by Fusion Edge Server are the same as those of Fusion Registry, and as such the two applications can be used interchangeably for dissemination of information. Example applications which make use of these web services are:

  1. Fusion Data Browser to provide a web user interface for the discovery, display and export of data.
  2. FXLData which provides data connectivity to Microsoft Excel
  3. Tableau Web Data Connector providing data connectivity to Tableau
  4. SDMX Connectors third party library which provides connectivity to multiple applications including R, Matlab, Excel
  5. Fusion Data Portal, which can automate data ingestion into a local database from remote web services

Fusion Edge Application Suite

The Fusion Edge Server suite consists of two applications:

  1. Fusion Edge Compiler. This is used to compile SDMX structure and data files into a format which can be read by the Fusion Edge Server. The output of the Fusion Edge Compiler is an Environment.
  2. Fusion Edge Server. This is a web application responsible for reading the Environment into memory, and exposing the information via its SDMX web services.

An [Example Architecture] shows how the Fusion Edge Compiler can be used in conjunction with the Fusion Registry in order to obtain data and structures, compile the content, and publish to a load balanced system on a public cloud service.

Design

The Fusion Edge Server reads into memory a pre-compiled Environment. An Environment is built from a collection of SDMX structure and data files, and it is this information which is made available via the public web services of Fusion Edge Server. The Fusion Edge Compiler is used to build the Environment. It does so by reading files on a file system and converting these into read only stores of information which can be rapidly ingested by Fusion Edge Server. The input of the Fusion Edge Compiler is SDMX files; the output is a collection of files which comprise the Environment.

Publishing new content to the Fusion Edge Server is simply a case of moving the Environment to a location that can be read by Fusion Edge Server. Fusion Edge Server can be configured to automatically poll for updates (dynamic mode), or it can be configured to only read an Environment on application startup (static mode).

Features

Performance and Durability

The Fusion Edge Server is built to be highly performant, even under extreme load from multiple concurrent users. It uses a purpose built in-memory time series database, augmented with a custom caching layer, both of which have been tuned to work with the SDMX data and metadata model. The caching layer, for instance, knows when two data queries refer to the same data sub-cube even if the queries are expressed differently in the URL. Query optimisation similarly ensures that multiple concurrent queries for the same data are executed only once, directing the subsequent requests to the cached data once it has been built.
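The idea that two differently written queries can resolve to the same data sub-cube can be illustrated with a minimal sketch. It assumes the SDMX REST series-key syntax, in which dimensions are separated by `.` and alternative codes within a dimension by `+`. The names `canonical_key` and `cached_query` are hypothetical, and this is not a description of how Fusion Edge Server implements its caching layer.

```python
def canonical_key(key: str) -> str:
    """Return a normalised form of an SDMX REST series key.

    Codes inside each dimension are de-duplicated and sorted, so keys
    that select the same sub-cube map to the same cache key.
    """
    normalised = []
    for dimension in key.split("."):
        # An empty dimension is a wildcard and stays empty.
        codes = sorted({code for code in dimension.split("+") if code})
        normalised.append("+".join(codes))
    return ".".join(normalised)


# Hypothetical in-memory cache keyed by the canonical form.
cache = {}


def cached_query(key: str, run_query):
    """Execute run_query once per distinct sub-cube, then reuse the result."""
    canonical = canonical_key(key)
    if canonical not in cache:
        cache[canonical] = run_query(canonical)
    return cache[canonical]


calls = []
first = cached_query("A.UK+FR.X", lambda k: calls.append(k) or f"result for {k}")
second = cached_query("A.FR+UK.X", lambda k: calls.append(k) or f"result for {k}")
```

Both calls share one cache entry, so the underlying query function runs only once.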

The Fusion Edge Server loads the pre-compiled content into memory on application startup. An Edge Server hosting 18 datasets, with 700k series and 50 million observations, will start up and be ready to serve queries in as little as 20 seconds. The content can be served from a central shared location, such as Amazon S3 or a private web server, allowing multiple Fusion Edge Server instances to obtain their content from the same location. When the content is updated in the central location, each Fusion Edge Server can automatically update its local content. The content can even be set to be made available at a specific date and time, meaning each Fusion Edge Server can make the content live at exactly the same time.

This architecture has no single point of failure, making the Fusion Edge Server the perfect solution for horizontal scaling.

Immutable and Secure

Compiled Environments are digitally signed. If they are modified in any way, the modification will be detected by Fusion Edge Server and the Environment will be rejected. Once Fusion Edge Server has loaded an Environment into memory it cannot be modified, as the in-memory objects are immutable. Fusion Edge Server only provides read only web services. There are no external interfaces to support modification of any kind to either the configuration of the application or the information it holds. Fusion Edge Server is not exposed to SQL injection as it does not use SQL internally. All data stores are purpose built to efficiently process SDMX queries, and are also built so that the content is immutable.

This makes Fusion Edge Server the perfect solution for public data dissemination where security is a priority.

Interchangeable and Loosely Coupled

Fusion Edge Server is built off the same core code as Fusion Registry, and as such it offers the same data and structure formats. When Fusion Registry is upgraded to include new formats, or to expand the web service functionality, these upgrades by default go into the next release of Fusion Edge Server. However, the Fusion Edge Server has been designed such that it does not need a Fusion Registry to run. All the Fusion Edge Server requires is that it is fed with valid SDMX structure files and valid SDMX data files (if data dissemination is a requirement).

This design means Fusion Edge Server can be used to disseminate information from any system which is able to export SDMX files, including but not limited to Fusion Registry.

Scheduled Release of Data (embargo)

Data can be given a specific date and time for release. The Fusion Edge Server can also be set to prepare Environments for release before the embargo time. For example, if an Environment is scheduled for release at 12:00 noon, and the Environment is moved to a secure Amazon S3 file system at 11:50, the Fusion Edge Server can be configured to pull these files into its local file system 5 minutes before go live, to eliminate any risk of network lag when pulling the files. The Fusion Edge Server can also be set to pre-load the Environment into memory 30 seconds before go live. At that point the Environment is ready for dissemination but disconnected from any process which can get to the data, so it is still fully secure. The Fusion Edge Server is then able to make the Environment go live by simply swapping the old Environment with the new, making it go live with millisecond precision.

Global collection of servers managed from a single location

Fusion Edge Server is given a location to read an Environment for dissemination. This location may be on the local file system, or it could be a URL to a collection of files hosted on a web service. In addition it could be a folder hosted on Amazon S3. When the Environment is placed in a location which can be accessed by more than one Fusion Edge Server, it is possible to update all Fusion Edge Servers by simply updating one central Environment. This makes it possible to host and manage content in Fusion Edge Servers hosted all over the world from a single location. Combining this feature with Embargo makes it possible to update all Fusion Edge Servers at exactly the same time from a single file system.

Audited Events

Each Fusion Edge Server can be configured to persist a log of events. The log is in JSON format and is broken down in such a way that it is easy to process in order to determine metrics such as where the queries are coming from, which datasets are popular, which output formats are popular, and which browsers (or other agents) are popular.

Automated Data Pipelines

The Fusion Edge Server can be deployed such that data can be moved from internal systems into the edge via scripts which provide a fully automated solution.


Publishing Content to the Fusion Edge Server

Compiling Source Data

Content is published to the Fusion Edge Server by compiling datasets, structure files, and reference metadata files that are present in a local file system. The compilation process is run using the Fusion Edge Compiler. The Fusion Edge Compiler is told the root folder to look in and it expects to find the following folder structure under the root folder:

|- data
|-- [agency id]
|---- [dataflow id]
|------ [dataflow version] (data files are placed in this folder)
|- structure (structure files are placed in this folder)
|- metadata (metadata files are placed in this folder)

Here agency id, dataflow id, and dataflow version are specific to the Dataflows that the data are for. The content can be in any SDMX format and each folder can contain multiple files. The compiler will merge the information where required.

Note: The Fusion Edge Compiler can build this local file system automatically from content pulled from compliant SDMX web services such as those provided by Fusion Registry. Information is provided later about how this is achieved.

An example folder/file content is given below:

|- data
|-- WB
|---- POVERTY
|------ 1.0
|-------- PovertyData.zip
|-------- PovertyUpdate.xml
|---- EDUCATION
|------ 1.0
|-------- EduData_1990_2010.json
|-------- EduData2010_2020.xml
|- structure
|-- corestructures.zip
|-- categories.xml
|-- msds.xml
|- metadata
|-- metadataset1.zip
|-- metadataset2.zip

The files in the file system must be in SDMX format, and may be individually zipped. Each folder may contain multiple files. The compilation process will combine all the files in each folder to create a consolidated output. For example a dataflow folder may contain multiple dataset instances with different series or time periods, the output will be a single compiled dataset instance built from all the dataset files.
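To make the expected source layout concrete, the following minimal sketch walks a root folder and groups data files by agency id, dataflow id, and dataflow version. It only illustrates the folder convention described above; the function name `find_data_files` is hypothetical and is not part of the Fusion Edge Compiler.

```python
import tempfile
from pathlib import Path


def find_data_files(root: Path) -> dict:
    """Map (agency, dataflow, version) to the sorted data file names found."""
    data_root = root / "data"
    found = {}
    # Expected layout: data/<agency id>/<dataflow id>/<dataflow version>/<file>
    for path in sorted(data_root.glob("*/*/*/*")):
        if path.is_file():
            agency, dataflow, version = path.relative_to(data_root).parts[:3]
            found.setdefault((agency, dataflow, version), []).append(path.name)
    return found


# Build a small throwaway tree that follows the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    folder = root / "data" / "WB" / "POVERTY" / "1.0"
    folder.mkdir(parents=True)
    (folder / "PovertyData.zip").touch()
    (folder / "PovertyUpdate.xml").touch()
    (root / "structure").mkdir()
    (root / "metadata").mkdir()
    layout = find_data_files(root)

print(layout)
```

Each key in the result corresponds to one dataflow folder, whose files the compiler would consolidate into a single compiled dataset.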

Full Replace vs Updates

The compiler can be run in full replace mode, in which case it will compile all files in the source directory. Alternatively the compiler can be run in update mode. In update mode it will only compile files which have been updated after a specific point in time. The compiler will compare the timestamp on the file with the update after time it is given to determine whether to include the file in the compiled output or not.
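The timestamp comparison used in update mode can be sketched as follows. This is a minimal illustration that assumes the file modification time is the timestamp being compared and that the comparison is strict; neither detail is confirmed by this page, and `files_updated_after` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def files_updated_after(folder: Path, update_after: float) -> list:
    """Return names of files modified later than update_after.

    update_after is expressed in seconds since the Unix epoch.
    """
    return sorted(
        path.name
        for path in folder.iterdir()
        if path.is_file() and path.stat().st_mtime > update_after
    )


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    old_file, new_file = folder / "old.xml", folder / "new.xml"
    old_file.touch()
    new_file.touch()
    os.utime(old_file, (1_000, 1_000))  # modified before the cut-off
    os.utime(new_file, (3_000, 3_000))  # modified after the cut-off
    selected = files_updated_after(folder, update_after=2_000)

print(selected)
```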

When the compiler is run it can be given the location of the central ledger, this should be the location of the ledger used in the production environment. The Fusion Edge Compiler uses the ledger information to know what the current live environment looks like, and to know what the current version of the live environment is, and when it was last built. It will use this information in update mode to know what files to include in the next compile (it will only compile files that have been modified since the last compile time). In addition, as the central ledger contains the location of the live compiled datasets, the Fusion Edge Compiler is able to download the current live dataset in order to apply any changes, if new series or observations have been updated since the last compile. Providing the central ledger location is essential if running the Fusion Edge Server in dynamic mode (discussed later) as the Fusion Edge Compiler is able to create the new live environment by merging the current live environment with any changes, and it is able to update the central ledger correctly ready for the next release.

Compilation Output

When the compilation process is run, the compiler will generate a target folder in the location specified. The compiler will create a number of folders under the target folder and in each folder it will write content compiled from the source folders. It will also create a file in the root of the target folder called ledger.json, which contains the environment information, when it was created, when it should go live and what the version is. If the target folder is not empty, the compiler will remove any files which are not part of the new environment.

The final target folder will always contain the complete environment once compilation is complete. This means the target folder, in its entirety can be published to a test Fusion Edge Server instance, as it contains the exact environment which is to become the next live environment.

Each environment is versioned in the central ledger. The first time the compiler is run the version is 1.0.0, and on subsequent compilations the version will increase based on the level of change since the last compilation. The target folder contains a ledger_indexes folder, with a file that contains the information about what content is in the release for the environment. The ledger index ensures that the Fusion Edge Server only pulls the files that are required in the environment; any additional files that may exist in the same folders will be ignored.

Versioning

The result of a compilation is a new environment for the Fusion Edge Server. The new environment may contain files that remain unchanged since the last compilation (if running in update mode and nothing was modified), files may have been removed, and new files may have been added. The ledger file contains an entry for the new environment, including the time the compilation started, the go live time, and the version. Timestamps are provided as Epoch time in milliseconds. The version is given as a 3 part version syntax, starting at version 1.0.0. The three parts are known as the major version, minor version, and patch version. The patch version is updated if the compiler was run in update mode and the only change was to one or more datasets. The minor version is updated if the structural metadata changed since the last compilation. This could for example be due to new time series requiring new classifications which were not in previous environments. The major version is updated if there are new datasets for Dataflows that previously did not exist or had no data. The major version is also updated if new reference metadata are released.

Example
Initial Release: Version 1.0.0
Modify a dataset: Version 1.0.1
Modify another dataset: Version 1.0.2
Export a new Codelist: Version 1.1.0
Add a new dataflow with data: Version 2.0.0
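The versioning rules can be expressed as a short function. This is a sketch under stated assumptions: the flag names are hypothetical, the lower parts are reset to zero when a higher part increases (consistent with the example above), and a major change is assumed to take precedence when several kinds of change occur together.

```python
def next_version(current: str, *, new_dataflow_data: bool = False,
                 new_reference_metadata: bool = False,
                 structures_changed: bool = False) -> str:
    """Return the next environment version for the given kind of change."""
    major, minor, patch = (int(part) for part in current.split("."))
    if new_dataflow_data or new_reference_metadata:
        # New data for a previously empty Dataflow, or new reference metadata.
        return f"{major + 1}.0.0"
    if structures_changed:
        # Structural metadata changed since the last compilation.
        return f"{major}.{minor + 1}.0"
    # Only existing datasets were modified.
    return f"{major}.{minor}.{patch + 1}"


# Replay the example sequence, starting from the initial release.
version = "1.0.0"
history = [version]
for change in ({}, {}, {"structures_changed": True}, {"new_dataflow_data": True}):
    version = next_version(version, **change)
    history.append(version)

print(history)
```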

Publishing Content

The Fusion Edge Server can be run in one of two modes, static mode, and dynamic mode. In both modes it must be given access to the Environment so it can load it into memory.


Static Mode

In static mode, the Fusion Edge Server loads the environment into memory on application startup. The environment can only be updated by restarting the web server. In static mode, the environment folder is zipped to a file named node.zip. The zip file is placed in the root folder of the Fusion Edge Server home folder. The Fusion Edge Server will read the node.zip file on startup.

Dynamic Mode

In dynamic mode, the environment is placed at a location that the Fusion Edge Server can read (File System, URL, or Amazon S3). The environment must not be zipped, and the folder structure must not be changed from the compiled output. The folder that contains the environment may contain additional files and folders, for example files from previous environments. These will not be read by the Fusion Edge Server, as the ledger and corresponding ledger index tell the Fusion Edge Server which files are part of the environment.

In terms of where to place the environment, one of three options is supported:

  1. Environment content can be placed on a private web server and made accessible as a URL. In this instance the Fusion Edge Server is given the URL to the root folder, for example https://mydomain.org/subfolder/edge-conent. The Fusion Edge Server will look for the ledger.json file under this URL and then read the ledger index file. For example it will look for the following files if the latest environment is version 1.0.0:

https://mydomain.org/subfolder/edge-conent/ledger.json
https://mydomain.org/subfolder/edge-conent/ledger_indexes/1.0.0

  2. Environment content is placed on a file system accessible by the Fusion Edge Server. In this instance the Fusion Edge Server is given the path to the root folder, for example /home/edge/live-environment
  3. Environment content is uploaded to Amazon S3. In this instance the Fusion Edge Server is given the name of the S3 bucket which the environment was published to. It also requires the AWS region, secret key and access key so it can access the content securely.
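For the URL option, the two lookups described above (the ledger, then the ledger index for the latest version) can be derived from the root URL. The sketch below only builds the URLs and does not fetch them; `ledger_urls` is a hypothetical helper name.

```python
def ledger_urls(root_url: str, version: str) -> tuple:
    """Return the ledger URL and the ledger index URL for one version."""
    root = root_url.rstrip("/")
    return (f"{root}/ledger.json", f"{root}/ledger_indexes/{version}")


ledger, index = ledger_urls("https://mydomain.org/subfolder/edge-conent", "1.0.0")
print(ledger)
print(index)
```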

Amazon S3 is a good choice of ledger location, as it provides a central secure location for files and the Fusion Edge Compiler is able to publish content to Amazon S3 automatically by running the publish command. For the other two options, the environment files must be moved using another process (not provided by the Fusion Edge Compiler). If moving an environment, ensure the ledger.json file is moved last, as it is this file which tells the Fusion Edge Server that it needs to update its content. If the ledger.json file is moved before the content is moved, the Fusion Edge Server will fail to load the new environment.
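When the environment is moved by a separate process, the ordering rule can be enforced in the copy script itself. The following is a minimal sketch for a file system target; `publish_environment` is a hypothetical helper, not a Fusion Edge Compiler command, and the demo files are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def publish_environment(source: Path, target: Path) -> list:
    """Copy an environment folder, writing ledger.json last.

    Returns the relative paths in the order they were copied.
    """
    ledger = source / "ledger.json"
    files = sorted(path for path in source.rglob("*") if path.is_file())
    ordered = [path for path in files if path != ledger]
    if ledger.exists():
        ordered.append(ledger)  # the ledger signals the update, so it goes last
    copied = []
    for path in ordered:
        relative = path.relative_to(source)
        destination = target / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination)
        copied.append(relative.as_posix())
    return copied


with tempfile.TemporaryDirectory() as tmp:
    source, target = Path(tmp) / "compiled", Path(tmp) / "live"
    (source / "ledger_indexes").mkdir(parents=True)
    (source / "ledger.json").write_text("{}")
    (source / "ledger_indexes" / "1.0.0").write_text("placeholder")
    order = publish_environment(source, target)
    ledger_copied = (target / "ledger.json").exists()

print(order)
```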

Signing Content

To ensure content is not corrupt or tampered with, the Fusion Edge Compiler will give each file a name which is generated from a hash of the file contents, coupled with a secret key. The secret key is provided by the user at compilation time, and should always be the same for each compilation. The Fusion Edge Server is given the same secret key as part of its configuration. When the Fusion Edge Server loads the environment files it will also create a hash of the file contents and couple it with the same secret key, to ensure it matches the file name. If it does not match, the environment will not be loaded as it indicates either the content was corrupted or tampered with after it was created.
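The principle of deriving a file name from the file contents and a shared secret can be sketched with a keyed hash. This page does not state which algorithm or naming scheme the Fusion Edge Compiler uses, so HMAC-SHA256 and the hexadecimal name below are assumptions chosen purely for illustration, as are the function names.

```python
import hashlib
import hmac


def signed_name(content: bytes, secret_key: bytes) -> str:
    """Derive a file name from the content and the secret key (assumed HMAC-SHA256)."""
    return hmac.new(secret_key, content, hashlib.sha256).hexdigest()


def verify(file_name: str, content: bytes, secret_key: bytes) -> bool:
    """Recompute the name and compare it with the name the file was given."""
    expected = signed_name(content, secret_key)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, file_name)


name = signed_name(b"compiled dataset bytes", b"shared-secret")
print(verify(name, b"compiled dataset bytes", b"shared-secret"))
print(verify(name, b"tampered bytes", b"shared-secret"))
```

A file whose contents change, or a server configured with a different key, produces a different name, so the check fails and the environment would be rejected.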

Specify Go Live Time

The compilation process can take an optional go live time. This ensures the Fusion Edge Server will not make the environment live until that point in time. The content can be made accessible to the Fusion Edge Server earlier, but it will not be made live until the specified time. The Fusion Edge Server can be configured to pre-load the environment into memory before the go live time, which ensures the environment is released exactly on schedule. The Fusion Edge Server can also be configured to pull the environment from a URL or S3 into its local file system prior to go live, to remove any risk of network latency delaying the go live time. An example configuration is to allow the Edge Server to pre-download an environment 5 minutes prior to go live, and pre-load it into memory 30 seconds prior to go live.
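The example configuration translates into a simple timeline. The sketch below computes the pre-download and pre-load times from a go live time, and converts a time into Epoch milliseconds as used for ledger timestamps. The function names and the date are hypothetical, and the real lead times are set in the Fusion Edge Server configuration rather than in code like this.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def release_schedule(go_live: datetime,
                     download_lead: timedelta = timedelta(minutes=5),
                     preload_lead: timedelta = timedelta(seconds=30)) -> dict:
    """Return when to pre-download, pre-load, and go live."""
    return {
        "download": go_live - download_lead,
        "preload": go_live - preload_lead,
        "go_live": go_live,
    }


def epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to Epoch time in milliseconds."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


go_live = datetime(2022, 8, 26, 12, 0, tzinfo=timezone.utc)
schedule = release_schedule(go_live)
for step, moment in schedule.items():
    print(step, moment.isoformat(), epoch_millis(moment))
```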

Generating content to publish

The Fusion Edge Compiler expects a specific folder structure which contains the files to compile. The folder and file content can be created manually, by copying structure files and data files into the correct location. However, if a Fusion Registry instance or another SDMX compliant web service is available, the file system can be generated automatically using the Fusion Edge Compiler. The Fusion Edge Compiler queries the SDMX web service for datasets, metadatasets and corresponding structural metadata in order to build the file system. The Fusion Edge Compiler is given the configuration of what to include in the output, including which Dataflows to publish data for and which structures to include. The Fusion Edge Compiler will ensure that whatever content it exports, the corresponding structural metadata will be included, and the structural metadata will be complete. For example, if a dataset is generated in the output, the corresponding Dataflow and all descendants of the Dataflow will be output in the structural metadata file. The descendants of a Dataflow include the Data Structure Definition, Codelists, Concept Schemes, and Agency Scheme: everything that is required to read the data. The Fusion Edge Compiler can be configured to output additional structures, for example Category Schemes, Hierarchical Codelists, or any other structure that is available via the web service. The Fusion Edge Compiler can also be configured to include Reference Metadata, in which case all reference metadata is included in the output, along with the corresponding Metadata Structure Definitions and metadata targets, and all descendant structures of these.

When running the extract process from the SDMX web service, the Fusion Edge Compiler can be configured to only include datasets updated after a specific point in time, in which case it will query for data using the updatedAfter query parameter. The Fusion Edge Compiler will not delete datasets in the target folder if it is running in update mode. It will only create new dataset files with the updated series and observations present. The Fusion Edge Compiler can be given the location of the ledger when running an extract process, in which case it will use the last compile time as the last update date for the extract process.
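An incremental data query of this kind can be sketched as a URL builder. The path shape follows the SDMX REST data query convention (`/data/{flow}/{key}` with the `updatedAfter` parameter); the base URL is a placeholder, `data_query_url` is a hypothetical name, and the exact requests issued by the Fusion Edge Compiler are not documented on this page.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode


def data_query_url(base_url: str, agency: str, dataflow: str, version: str,
                   key: str = "all", updated_after: datetime = None) -> str:
    """Build an SDMX REST data query URL, optionally restricted by updatedAfter."""
    flow_ref = ",".join(quote(part, safe="") for part in (agency, dataflow, version))
    # '.' separates dimensions and '+' separates codes, so both stay unescaped.
    url = f"{base_url.rstrip('/')}/data/{flow_ref}/{quote(key, safe='.+')}"
    if updated_after is not None:
        url += "?" + urlencode({"updatedAfter": updated_after.isoformat()})
    return url


last_compile = datetime(2022, 8, 26, 2, 7, tzinfo=timezone.utc)
full_url = data_query_url("https://example.org/sdmx/rest/", "WB", "POVERTY", "1.0")
update_url = data_query_url("https://example.org/sdmx/rest", "WB", "POVERTY", "1.0",
                            updated_after=last_compile)
print(full_url)
print(update_url)
```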

Restricting content to be published

The Fusion Edge Compiler can be configured to export sub-cubes for particular Dataflows by providing Dimension filters for the given Dataflow. For example it can be configured to only publish UK and French data for a particular Dataflow. In addition, if using Fusion Registry as the source of the data and structural metadata, the Fusion Registry security rules can be used to restrict content so that it can never be extracted by the Fusion Edge Compiler.

The Fusion Edge Compiler can query the SDMX web service either as a public user (no authentication provided) or using HTTP Basic authentication (username and password), which is the mechanism used by the Fusion Registry for authenticating users. In this way, the Fusion Registry can create a Data Consumer user account for the Fusion Edge Compiler to use, and then content can be restricted accordingly.
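The difference between a public query and an authenticated one is only the presence of the HTTP Basic `Authorization` header. The sketch below prepares both kinds of request without sending them; the URL and credentials are placeholders, and `build_request` is a hypothetical helper rather than part of the Fusion Edge Compiler.

```python
import base64
import urllib.request


def build_request(url: str, username: str = None,
                  password: str = None) -> urllib.request.Request:
    """Prepare a request, adding HTTP Basic credentials when a username is given."""
    request = urllib.request.Request(url)
    if username is not None:
        credentials = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request


url = "https://example.org/sdmx/rest/data/WB,POVERTY,1.0/all"
public_request = build_request(url)
consumer_request = build_request(url, "edge-compiler", "placeholder-password")
print(public_request.has_header("Authorization"))
print(consumer_request.has_header("Authorization"))
```

Basic authentication sends the credentials in a reversible encoding, so it should only be used over HTTPS.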