Integrate with Azure Data Factory
Guide to Integration with Azure Data Factory
Network Effects of Combining Data
Today we are working with more disparate data sources than ever before. When we talk about Big Data solutions, external data sources as well as your own internal data are important in order to perform advanced analytics to drive competitive advantage.
More organisations are realising the importance of external data which is accessible through APIs and that can be downloaded into cloud storage solutions where it can be combined with their own internal data.
The process of collecting data is called ingestion and there are a number of cloud managed services that can help with this process.
ELT over ETL
Typically, we want to ingest the data in native format and store in the Data Warehouse of our choice so that we can return to the entirety of the data at a later date to perform particular transformations and analysis for a particular business objective
Therefore, in cloud and advanced analytic scenarios you could be more inclined to adopt an Extract, Load and Transform (ELT) model where you’re purely extracting and loading the data in native form.
Azure Data Factory
ADF orchestrates and automates the movement, transformation and analysis of data through the cloud. ADF does not perform the transformation or analytical work but provides the pipelines to move data through from one service to another.
The body of the document details the steps in integrating DoordaHost with Microsoft Azure Data Factory (ADF) in preparation for transformation and/or analytics. It assumes the reader is familiar with ADF and a Data Factory object has already been created to host the pipeline(s).
Doorda Data Integration with Azure Data Factory
The diagram below depicts the logical components involved in building an integration pipeline for Doorda.
The Linked Service Object provides connection information to a data source as well as compute resources. In this guide we will create a Linked Service Object to define a connection to DoordaHost.
The Dataset object defines the inputs and outputs for a pipeline. In this guide we will create a dataset for the information copied from DoordaHost (input).
An activity is a process that performs computations on the data such as transformations, analytics. In this guide we use the out of the box COPY activity to copy data from DoordaHost to produce records/files to a data sink of your choice.
Activities are rolled up into a pipeline and is the object used to schedule and monitor a particular pipeline.
Process and Architecture
It is important to understand the business objective first in order to identify the data sources within the Doorda Data Catalog(s) that can help with this objective. This is the starting point to setting up the appropriate ADF pipeline architecture.
Configuring Azure Portal Copy Data tool
This section walks through the configuration of the Copy Data tool made available inside the Azure Portal. This is the minimum setup required to copy data from DoordaHost to your repository in native format.
Inside Home > Data factories, create a new or select an existing Data Factory object. Select the Author & Monitortile to start the ADF (UI) application on a separate browser tab.
Next, select the Copy data activity from the “Let’s get started” page. This will take you to the Copy Data wizard which is detailed in the next section.
Copy Data Tool
Configure the task cadence or task schedule for this pipeline.
It is recommended the following approach is adopted when copying data from DoordaHost:
- Execute an ADF pipeline to copy all data from the targeted Doorda Ledger Catalog. This is to bootstrap your targeted system of record with all available data. This only has to be run once.
- Schedule a separate ADF pipeline to run on a regular basis to query for record updates on the targeted Doorda Ledger Catalog. This can be achieved by filtering records by date_added column and only copying records with a date after the last successful scheduled copy activity.
Source Connection (Linked Service)
DoordaHost is built on top of Presto, a highly scalable, distributed SQL query engine optimised for Big Data. ADF provides a Presto connector with support for the copy activity which negates the need for any bespoke scripting. Configure the source data store to use the Presto connector and populate the requisite fields as shown in the figure on the right.
If it is disabled, it is recommended the Interactive Authoring is enabled to allow the AutoResolveIntegrationRuntime to test the validity of the Presto Connection configuration. This can be enabled by clicking on the information icon next to the “Interactive authoring disabled” and following the “Edit interactive authoring” link.
The Presto configuration can be tested for correctness by clicking the “Test Connection” at the bottom of the configuration page.
Configure the source dataset with the desired Doorda table(s) or alternatively create a bespoke query. A bespoke query will be required for regular scheduled jobs to filter records by date_added column as described in the Properties section.
The full list of Doorda tables associated with the configured Catalog are presented on screen.
Destination Connection (Linked Service)
Select the data sink of your choice and the appropriate dataset configuration for this sink. There are many options to choose from SQL/NoSQL repositories to various file formats. Please refer to Azure Documentation for further information. In this example we have chosen Azure Data Lake Storage Gen2 which simply copies the data to blob storage in Parquet format.
Once all configured, the pipeline is ready to be deployed. Depending on the properties of the pipeline, this will either run immediately (most appropriate for bootstrapping your system of record with data from the Doorda Catalog) or on a regular schedule (most appropriate for acquiring new records from the Doorda Catalog).