EdgeServer

From Fusion Registry Wiki
Revision as of 09:55, 25 August 2022 by Mnelson (talk | contribs) (Deployment Architecture)

Fusion Edge

Overview

The Fusion Edge Server is a Java web application whose responsibility is to host SDMX web services for the dissemination of data and related structural and reference metadata. The web services provided by the Fusion Edge Server can be used by multiple applications, including but not limited to:

  1. Fusion Data Browser to provide a web user interface for the discovery, display and export of data.
  2. FXLData which provides data connectivity to Microsoft Excel
  3. Tableau Web Data Connector providing data connectivity to Tableau
  4. SDMX Connectors third party library which provides connectivity to multiple applications including R, Matlab, Excel
  5. Custom public data dissemination web sites and ‘data portals’ as run by most national statistics offices, central banks and international organisations
  6. Data / metadata API services for data consumers who need direct access

In addition, consumers can access the web services directly in order to set up automated data pipelines pulling data on demand.

Performance and Durability

The Fusion Edge Server is built to be highly performant, even under extreme load from multiple concurrent users. It uses a purpose built in-memory time series database, augmented with a custom caching layer, both of which have been tuned to work with the SDMX data and metadata model. The caching layer, for instance, knows when two data queries refer to the same data sub-cube even if the queries are expressed differently in the URL. Query optimisation similarly ensures that multiple concurrent queries for the same data are executed only once, directing the subsequent requests to the cached data once it has been built.

The Fusion Edge Server loads pre-compiled content into memory on application startup; an Edge Server hosting 18 datasets, with 700k series and 50 million observations, will start up and be ready to serve queries in as little as 20 seconds. The content can be served from a central shared location, such as Amazon S3 or a private web server, allowing multiple Fusion Edge Server instances to obtain their content from the same location. When the content is updated in the central location, each Fusion Edge Server can automatically update its local content. The content can even be set to become available at a specific date and time, meaning every Fusion Edge Server can make the content live at exactly the same moment.

This architecture has no single point of failure, making the Fusion Edge Server an ideal candidate for horizontal scaling.

Security

It is important to note that all content in the Fusion Edge Server is public: as long as a client has been given access to the web service, it can query any structural metadata, data, or reference metadata without restriction. The Fusion Edge Server does not provide user security or authentication services. It is possible, however, to host multiple Fusion Edge Server environments with different content, for example a public dissemination environment and a private internal dissemination environment. This can be managed through the Fusion Edge Compiler configuration, which defines what content is exported into each environment. Content can also be controlled in the Fusion Registry by creating multiple user accounts: in one export configuration the Fusion Edge Compiler can act as a public user with no authentication, while in another it can supply a username and password to authenticate itself, and the Fusion Registry will filter content accordingly.

The Fusion Edge Server has no user interface for maintaining content, and therefore no web services for editing content. All content loaded into the Fusion Edge Server is immutable: it cannot be modified once loaded, only replaced. It is not open to attacks such as SQL injection as it does not use SQL; all data stores are custom built specifically to serve immutable content. The only way to get content into the Fusion Edge Server is via a central Ledger, which can be hosted on the local file system, a secure web server, or Amazon S3. The central Ledger provides tamper protection by signing all content with a secret key known only to the Fusion Edge Server and the Fusion Edge Compiler, which compiles the content for publication. It is not possible to tamper with the content of the Ledger without the secret key: a new entry manually added to the Ledger will be rejected by the Fusion Edge Server, as will a compiled file that has been manually edited.

Deployment Architecture

Example Architecture

The Fusion Edge Server environment consists of:

  1. The web application (Fusion Edge Server). This is responsible for hosting the data and making it available via web services.
  2. The compiler (Fusion Edge Compiler). This is used to compile the content to be published to the Fusion Edge Server. It manages a central ledger, and provides updates to the ledger and related indexes as part of the compilation process.

Publishing Content to the Fusion Edge Server

Compiling Source Data

Content is published to the Fusion Edge Server by compiling datasets, structure files, and reference metadata files that are present in a local file system. The compilation process is run using the Fusion Edge Compiler. The Fusion Edge Compiler is told the root folder to look in and it expects to find the following folder structure under the root folder:

|- data
|-- [agency id]
|---- [dataflow id]
|------ [dataflow version]   (data files are placed in this folder)
|- structure                 (structure files are placed in this folder)
|- metadata                  (metadata files are placed in this folder)

Where agency id, dataflow id, and dataflow version are specific to the Dataflows that the data are for. The content can be in any SDMX format, and each folder can contain multiple files; the compiler will merge the information where required.

Note: The Fusion Edge Compiler can build this local file system automatically from content pulled from compliant SDMX web services such as those provided by Fusion Registry. Information is provided later about how this is achieved.

An example folder/file content is given below:

|- data
|-- WB
|---- POVERTY
|------ 1.0
|-------- PovertyData.zip
|-------- PovertyUpdate.xml
|---- EDUCATION
|------ 1.0
|-------- EduData_1990_2010.json
|-------- EduData2010_2020.xml
|- structure
|-- corestructures.zip
|-- categories.xml
|-- msds.xml
|- metadata
|-- metadataset1.zip
|-- metadataset2.zip

The files in the file system must be in SDMX format, and may be individually zipped. Each folder may contain multiple files; the compilation process will combine all the files in each folder to create a consolidated output. For example, a dataflow folder may contain multiple dataset instances with different series or time periods, and the output will be a single compiled dataset instance built from all the dataset files.
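The expected source layout can be scaffolded with a short script before files are copied in. A minimal sketch (the helper function is illustrative and not part of the Fusion Edge toolset; folder names follow the example above):

```python
import tempfile
from pathlib import Path

def scaffold_source(root: Path, agency: str, dataflow: str, version: str) -> None:
    """Create the folder layout the Fusion Edge Compiler expects under root."""
    # data/[agency id]/[dataflow id]/[dataflow version] holds the data files
    (root / "data" / agency / dataflow / version).mkdir(parents=True, exist_ok=True)
    # structure and metadata folders sit directly under the root
    (root / "structure").mkdir(parents=True, exist_ok=True)
    (root / "metadata").mkdir(parents=True, exist_ok=True)

# Scaffold the WB / POVERTY / 1.0 branch from the example above
root = Path(tempfile.mkdtemp())
scaffold_source(root, "WB", "POVERTY", "1.0")
```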

Full Replace vs Updates

The compiler can be run in full replace mode, in which case it will compile all files in the source directory. Alternatively the compiler can be run in update mode, in which case it will only compile files which have been updated after a specific point in time. The compiler compares the timestamp on each file with the updated-after time it is given to determine whether to include the file in the compiled output.
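The update-mode check described above amounts to a file-timestamp comparison. A minimal sketch of that selection logic (the function is illustrative, not part of the Fusion Edge Compiler; ledger timestamps are epoch milliseconds, as noted later, while file mtimes are in seconds):

```python
from pathlib import Path

def files_to_compile(source_root: str, updated_after_epoch_ms: int) -> list[str]:
    """Return files whose modification time is later than the updated-after
    time, mirroring the check the compiler makes in update mode."""
    cutoff = updated_after_epoch_ms / 1000.0  # compare in seconds
    selected = []
    for path in Path(source_root).rglob("*"):
        if path.is_file() and path.stat().st_mtime > cutoff:
            selected.append(str(path))
    return selected
```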

When the compiler is run it can be given the location of the central ledger; this should be the location of the ledger used in the production environment. The Fusion Edge Compiler uses the ledger information to determine what the current live environment looks like, what its current version is, and when it was last built. In update mode it uses this information to decide which files to include in the next compile (only files that have been modified since the last compile time). In addition, as the central ledger contains the location of the live compiled datasets, the Fusion Edge Compiler is able to download the current live dataset in order to apply any changes if new series or observations have been updated since the last compile. Providing the central ledger location is essential when running the Fusion Edge Server in dynamic mode (discussed later), as the Fusion Edge Compiler is then able to create the new live environment by merging the current live environment with any changes, and to update the central ledger correctly, ready for the next release.

Compilation Output

When the compilation process is run, the compiler will generate a target folder in the location specified. The compiler will create a number of folders under the target folder and in each folder it will write content compiled from the source folders. It will also create a file in the root of the target folder called ledger.json, which contains the environment information, when it was created, when it should go live and what the version is. If the target folder is not empty, the compiler will remove any files which are not part of the new environment.

The final target folder will always contain the complete environment once compilation is complete. This means the target folder, in its entirety, can be published to a test Fusion Edge Server instance, as it contains the exact environment which is to become the next live environment.

Each environment is versioned in the central ledger. The first time the compiler is run the version is 1.0.0; on subsequent compilations the version will increase based on the level of change since the last compilation. The target folder contains a ledger_indexes folder, with a file that describes what content is in the release for the environment. The ledger index ensures that the Fusion Edge Server only pulls the files required for the environment; any additional files that may exist in the same folders will be ignored.

Versioning

The result of a compilation is a new environment for the Fusion Edge Server. The new environment may contain files that remain unchanged since the last compilation (if running in update mode and nothing was modified), files may have been removed, and new files may have been added. The ledger file contains an entry for the new environment, including the time the compilation started, the go live time, and the version. Timestamps are provided as Epoch time in milliseconds. The version is given in a 3-part version syntax, starting at version 1.0.0. The three parts are known as the major version, minor version, and patch version. The patch version is updated if the compiler was run in update mode and the only change was to one or more datasets. The minor version is updated if the structural metadata changed since the last compilation; this could, for example, be due to new time series requiring new classifications which were not in previous environments. The major version is updated if there are new datasets for Dataflows that previously did not exist or had no data. The major version is also updated if new reference metadata are released.

Example:

  Initial Release                 Version 1.0.0
  Modify a dataset                Version 1.0.1
  Modify another dataset          Version 1.0.2
  Export a new Codelist           Version 1.1.0
  Add a new dataflow with data    Version 2.0.0
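The major/minor/patch rules can be expressed as a short decision function. A minimal sketch (the function and flag names are illustrative, not compiler options), walking through the example sequence:

```python
def next_version(current: str, data_only: bool, structures_changed: bool,
                 new_dataflow_or_ref_metadata: bool) -> str:
    """Apply the versioning rules described above to a 3-part version string."""
    major, minor, patch = (int(p) for p in current.split("."))
    if new_dataflow_or_ref_metadata:   # new data for a new Dataflow, or new reference metadata
        return f"{major + 1}.0.0"
    if structures_changed:             # structural metadata changed since last compilation
        return f"{major}.{minor + 1}.0"
    if data_only:                      # update mode, only datasets changed
        return f"{major}.{minor}.{patch + 1}"
    return current

# Walking through the example sequence:
v = "1.0.0"                               # initial release
v = next_version(v, True, False, False)   # modify a dataset -> 1.0.1
v = next_version(v, True, False, False)   # modify another dataset -> 1.0.2
v = next_version(v, False, True, False)   # export a new Codelist -> 1.1.0
v = next_version(v, False, False, True)   # new dataflow with data -> 2.0.0
```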

Publishing Content

The Fusion Edge Server can be run in one of two modes: static mode and dynamic mode. In both modes it must be given access to the environment so it can load it into memory.


Static Mode

In static mode, the Fusion Edge Server loads the environment into memory on application startup. The environment can only be updated by restarting the web server. In static mode, the environment folder is zipped to a file named node.zip. The zip file is placed in the root folder of the Fusion Edge Server home folder. The Fusion Edge Server will read the node.zip file on startup.

Dynamic Mode

In dynamic mode, the environment is placed at a location that the Fusion Edge Server can read (File System, URL, or Amazon S3). The environment must not be zipped, and the folder structure must not be changed from the compiled output. The folder that contains the environment may contain additional files and folders; for example, files from previous environments may be present. These will not be read by the Fusion Edge Server, as the ledger and corresponding ledger index tell the Fusion Edge Server which files are part of the environment.

In terms of where to place the environment, one of three options is supported:

  1. Environment content can be placed on a private web server and made accessible via a URL. In this instance the Fusion Edge Server is given the URL to the root folder, for example https://mydomain.org/subfolder/edge-content. The Fusion Edge Server will look for the ledger.json file under this URL and then read the ledger index file. For example, it will look for the following files if the latest environment is version 1.0.0:

     https://mydomain.org/subfolder/edge-content/ledger.json
     https://mydomain.org/subfolder/edge-content/ledger_indexes/1.0.0

  2. Environment content is placed on a file system accessible by the Fusion Edge Server. In this instance the Fusion Edge Server is given the path to the root folder, for example /home/edge/live-environment
  3. Environment content is uploaded to Amazon S3. In this instance the Fusion Edge Server is given the name of the S3 bucket which the environment was published to. It also requires the AWS region, secret key and access key so it can access the content securely.

Amazon S3 is a good choice of ledger location, as it provides a central secure location for files, and the Fusion Edge Compiler is able to publish content to Amazon S3 automatically by running the publish command. For the other two options, the environment files must be moved using another process (not provided by the Fusion Edge Compiler). When moving an environment, ensure the ledger.json file is moved last, as it is this file which tells the Fusion Edge Server that it needs to update its content. If the ledger.json file is moved before the content, the Fusion Edge Server will fail to load the new environment.
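For the URL option, the two lookups the Fusion Edge Server performs against the environment root can be illustrated with a short sketch (the helper function is hypothetical; the URL shapes follow the version 1.0.0 example above):

```python
def ledger_urls(base_url: str, version: str) -> tuple[str, str]:
    """Build the two URLs read from a URL-hosted environment: the ledger
    itself, then the ledger index for the given environment version."""
    return (f"{base_url}/ledger.json",
            f"{base_url}/ledger_indexes/{version}")

# For the example root URL and latest environment version 1.0.0:
ledger_url, index_url = ledger_urls(
    "https://mydomain.org/subfolder/edge-content", "1.0.0")
```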

Signing Content

To ensure content is not corrupted or tampered with, the Fusion Edge Compiler gives each file a name generated from a hash of the file contents, coupled with a secret key. The secret key is provided by the user at compilation time, and should always be the same for each compilation. The Fusion Edge Server is given the same secret key as part of its configuration. When the Fusion Edge Server loads the environment files it will also create a hash of the file contents, coupled with the same secret key, and check that it matches the file name. If it does not match, the environment will not be loaded, as this indicates the content was either corrupted or tampered with after it was created.
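A keyed hash of this kind is commonly implemented as an HMAC. A minimal sketch of the naming-and-verification round trip, assuming HMAC-SHA-256 (the actual algorithm and name format used by the Fusion Edge toolset are not specified here):

```python
import hashlib
import hmac

def content_name(payload: bytes, secret_key: bytes) -> str:
    """Derive a file name from a keyed hash of the file contents
    (HMAC-SHA-256 here; the real scheme may differ)."""
    return hmac.new(secret_key, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, file_name: str, secret_key: bytes) -> bool:
    """Re-hash on load and compare with the file name, as the Edge Server
    does; a mismatch means corruption or tampering."""
    return hmac.compare_digest(content_name(payload, secret_key), file_name)
```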

Specify Go Live Time

The compilation process can take an optional go live time; this ensures the Fusion Edge Server will not make the environment live until that specific point in time. The content can be made accessible to the Fusion Edge Server beforehand, but it will not be made live until the specified time. The Fusion Edge Server can be configured to pre-load the environment into memory before the go live time, ensuring the environment is released exactly on schedule. It can also be configured to pull the environment from a URL or S3 into its local file system prior to go live, removing any risk of network latency delaying the go live time. An example configuration is to allow the Edge Server to pre-download an environment 5 minutes prior to go live, and pre-load it into memory 30 seconds prior to go live.
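Since ledger timestamps are epoch milliseconds, the lead times in that example configuration reduce to simple arithmetic. A sketch (the function and defaults are illustrative, not actual Edge Server configuration keys):

```python
def schedule(go_live_epoch_ms: int,
             download_lead_s: int = 300,
             preload_lead_s: int = 30) -> tuple[int, int]:
    """Compute when to pre-download and pre-load an environment, given the
    go-live time in epoch milliseconds. Defaults match the example above:
    download 5 minutes before go live, load into memory 30 seconds before."""
    download_at = go_live_epoch_ms - download_lead_s * 1000
    preload_at = go_live_epoch_ms - preload_lead_s * 1000
    return download_at, preload_at
```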

Generating content to publish

The Fusion Edge Compiler expects a specific folder structure which contains the files to compile. The folder and file content can be created manually by copying structure files and data files into the correct location. However, if running a Fusion Registry instance, or with access to an SDMX compliant web service, the file system can be generated automatically by the Fusion Edge Compiler. The Fusion Edge Compiler queries the SDMX web service for datasets, metadatasets and corresponding structural metadata in order to build the file system. It is given a configuration of what to include in the output, including which Dataflows to publish data for and which structures to include.

The Fusion Edge Compiler will ensure that, whatever content it exports, the corresponding structural metadata is included and complete. For example, if a dataset is generated in the output, the corresponding Dataflow and all descendants of the Dataflow will be output in the structural metadata file. The descendants of a Dataflow include the Data Structure Definition, Codelists, Concept Schemes, and Agency Scheme: everything that is required to read the data. The Fusion Edge Compiler can be configured to output additional structures, for example Category Schemes, Hierarchical Codelists, or any other structure that is available via the web service. It can also be configured to include Reference Metadata, in which case all reference metadata is included in the output, along with the corresponding Metadata Structure Definitions and metadata targets, and all descendant structures of these.

When running the extract process from the SDMX web service, the Fusion Edge Compiler can be configured to only include datasets updated after a specific point in time, in which case it will query for data using the updatedAfter query parameter. The Fusion Edge Compiler will not delete datasets in the target folder when running in update mode; it will only create new dataset files containing the updated series and observations. The Fusion Edge Compiler can be given the location of the ledger when running an extract process; it will use the last compile time as the updated-after date for the extract.
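An updatedAfter query of this kind follows the SDMX 2.1 REST data query syntax. A minimal sketch of building such a query URL (the base URL and dataflow reference are placeholders; updatedAfter takes an ISO 8601 timestamp):

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def data_query(base_url: str, agency: str, flow_id: str, version: str,
               updated_after: datetime) -> str:
    """Build an SDMX 2.1 REST data query restricted to content updated after
    a point in time, via the standard updatedAfter parameter."""
    flow_ref = f"{agency},{flow_id},{version}"  # e.g. WB,POVERTY,1.0
    stamp = updated_after.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = urlencode({"updatedAfter": stamp})
    return f"{base_url}/data/{flow_ref}/all?{params}"

url = data_query("https://example.org/sdmx", "WB", "POVERTY", "1.0",
                 datetime(2022, 1, 1, tzinfo=timezone.utc))
```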

Restricting content to be published

The Fusion Edge Compiler can be configured to export sub-cubes for particular Dataflows by providing Dimension filters for the given Dataflow. For example, it can be configured to only publish UK and French data for a particular Dataflow. In addition, if using Fusion Registry as the source of data and structural metadata, Fusion Registry security rules can be used to restrict content so that it can never be extracted by the Fusion Edge Compiler.

The Fusion Edge Compiler can query the SDMX web service either as a public user (no authentication provided) or using HTTP Basic authentication (username and password), which is the mechanism used by the Fusion Registry for authenticating users. In this way, the Fusion Registry can create a Data Consumer user account for the Fusion Edge Compiler to use, and content can be restricted accordingly.
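HTTP Basic authentication is simply a base64-encoded username:password pair in the Authorization header. A minimal sketch of attaching such a header to a request (the URL and credentials are placeholders):

```python
import base64
from urllib.request import Request

def authenticated_request(url: str, username: str, password: str) -> Request:
    """Attach an HTTP Basic Authorization header, the mechanism the Fusion
    Registry uses to authenticate a Data Consumer account."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req

req = authenticated_request("https://example.org/sdmx/data/all", "edge", "pw")
```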