Migration Story: Moving High Scale Data and Compute from AWS to Azure

  • October 9, 2017

Background

We recently worked with Emedgene, a company that participated in the Microsoft Partners Program, on developing a next-generation genomics intelligence platform that incorporates advanced artificial intelligence technologies to streamline the interpretation and evidence presentation process. With this platform, healthcare providers will be able to deliver individualized care to more patients through improved diagnostic yield.

Emedgene continues to grow and, given their positive collaboration experience with Microsoft engineers, decided to migrate their solution from AWS to Azure with support from Microsoft. This code story demonstrates how to migrate the compute resources to Azure, transfer more than 100 TB of stored data, and handle application secrets without embedding the Azure SDK in the application code.

Architecture & Migration

A key part of Emedgene’s architecture is the provisioning of new EC2 Spot Instances to execute computationally heavy analytics processes that take large sets of genomics data in S3 as input. The metadata for each analytics job is enqueued for processing by the EC2 instances, and the number of instances varies dynamically with the number of messages in the queue.

Compute: EC2 instances are provisioned by a separate EC2 instance that monitors a Redis queue for additional jobs.

Data: The genomic datasets can comprise over one million individual files with a cumulative size that can exceed 100 TB. To keep analytics fast, Emedgene copies the relevant sets of files from S3 to each instance’s attached disks whenever an instance is provisioned, gaining higher throughput and lower network latency between the compute and the data. In Azure, we copy these files from Azure Data Lake Store to the VM’s attached disks in the same way.
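
As a rough illustration (not Emedgene’s actual startup script), the per-instance copy step can be performed with the Azure CLI’s Data Lake Store commands; the account name, dataset path, and mount point below are placeholders:

    #!/bin/bash
    # Hypothetical startup step: pull a genomics dataset from ADLS onto the
    # VM's attached data disk (assumed to be mounted at /datadisk).
    ADLS_ACCOUNT="emedgene-adls"         # placeholder ADLS account name
    DATASET_PATH="/datasets/sample-set"  # placeholder folder in ADLS
    LOCAL_DIR="/datadisk/input"

    mkdir -p "$LOCAL_DIR"
    # Recursively download the dataset folder to the attached disk.
    az dls fs download --account "$ADLS_ACCOUNT" \
        --source-path "$DATASET_PATH" \
        --destination-path "$LOCAL_DIR" \
        --overwrite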

To support native scalability without adding another application module (the solution in AWS required Redis queues and additional EC2 instances), we used Virtual Machine Scale Sets (VMSS). VMSS enables us to monitor an Azure Service Bus queue for messages and provision new instances when the queue reaches a certain threshold. Once the application finishes its task, it invokes a script (Self Destroy Instance) that deletes its own VM instance from the scale set. The script can be invoked in a Docker container for maximum flexibility in the deployment process.
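
The Self Destroy Instance script itself is not shown in this post; the sketch below outlines one way to implement it with the Azure CLI and the Azure Instance Metadata Service, assuming the resource group, scale set name, and service principal credentials are passed to the container as environment variables:

    #!/bin/bash
    # Hypothetical "Self Destroy Instance" script: remove the current VM
    # instance from the scale set once the analytics job has finished.

    # Ask the Azure Instance Metadata Service for this instance's name,
    # e.g. "myvmss_3"; the suffix after the underscore is the instance ID.
    INSTANCE_NAME=$(curl -s -H Metadata:true \
        "http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text")
    INSTANCE_ID=${INSTANCE_NAME##*_}

    # Authenticate with a service principal (credentials injected as env vars).
    az login --service-principal -u "$AZURE_CLIENT_ID" \
        -p "$AZURE_CLIENT_SECRET" --tenant "$AZURE_TENANT_ID"

    # Delete this instance from the scale set.
    az vmss delete-instances \
        --resource-group "$RESOURCE_GROUP" \
        --name "$VMSS_NAME" \
        --instance-ids "$INSTANCE_ID"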

Note: We considered working with Azure Batch low-priority VMs, but scaling based on an Azure Service Bus queue and using custom VM images were not fully supported.

The DevOps flow

The Continuous Integration / Continuous Deployment (CI/CD) process is managed with Jenkins. While Jenkins provides a lot of flexibility, we needed a way to provision and manage Azure resources in the pipeline. To do this, we used Azure CLI 2.0, but we also needed to be able to propagate results, such as names and paths, from each command to the next.

For example, a provisioning command returns a JSON document describing the new resource. We want to take the “name” property from that output and propagate it to another command, since its value is generated dynamically.
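
The JSON output shown in the original post is not reproduced here. Outside of Jenkins, the same idea can be sketched with the CLI’s --query flag (JMESPath), capturing a single property into a variable for the next command; the resource names below are placeholders:

    # Create an image and capture the dynamically generated "name" property
    # from the command's JSON output.
    IMAGE_NAME=$(az image create --resource-group my-rg \
        --name "app-image-$BUILD_NUMBER" --source my-vm \
        --query name --output tsv)

    # Propagate the captured name to the next command in the pipeline.
    az vmss create --resource-group my-rg --name my-vmss \
        --image "$IMAGE_NAME" \
        --admin-username azureuser --generate-ssh-keys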

To do this, we created the Azure CLI Jenkins Plugin. The following steps describe how to provision a new VM, create an image from that VM, and create a Virtual Machine Scale Set (VMSS) from that image.

  1. Using the Environment Injector plugin, inject environment variables into Jenkins
  2. Provision a new VM with an attached disk
  3. SSH into the VM using Jenkins SSH Plugin
  4. Deploy the application using a simple BASH script
  5. Copy the data from Azure Data Lake Store to the attached disk so the application gets maximum read/write performance
  6. Deprovision the VM
  7. Make an image from the VM
  8. Delete the VM and associated resources
  9. Create or update the VMSS with the current VM image

The Azure CLI commands in this flow take the injected environment variables as parameters, as in the sketch below.
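
The original image is not available in this excerpt; the commands below approximate steps 2 and 6–9 with Azure CLI 2.0, using placeholder environment variables of the kind injected in step 1:

    # 2. Provision a new VM with an attached data disk.
    az vm create --resource-group "$RESOURCE_GROUP" --name "$VM_NAME" \
        --image UbuntuLTS --admin-username "$ADMIN_USER" --generate-ssh-keys \
        --data-disk-sizes-gb 1023

    # (Steps 3-5: SSH in, deploy the application, and copy the data from
    # Azure Data Lake Store to the attached disk -- omitted here.)

    # 6. Deprovision the VM (run 'sudo waagent -deprovision+user' inside it
    #    first), then deallocate and generalize it.
    az vm deallocate --resource-group "$RESOURCE_GROUP" --name "$VM_NAME"
    az vm generalize --resource-group "$RESOURCE_GROUP" --name "$VM_NAME"

    # 7. Make an image from the VM.
    az image create --resource-group "$RESOURCE_GROUP" \
        --name "$IMAGE_NAME" --source "$VM_NAME"

    # 8. Delete the VM.
    az vm delete --resource-group "$RESOURCE_GROUP" --name "$VM_NAME" --yes

    # 9. Create (or update) the VMSS from the captured image.
    az vmss create --resource-group "$RESOURCE_GROUP" --name "$VMSS_NAME" \
        --image "$IMAGE_NAME" \
        --admin-username "$ADMIN_USER" --generate-ssh-keys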

Transferring the Data

Emedgene provides its customers the option of supplying data on either Azure or AWS S3. For Azure, Emedgene decided that they wanted to store their data in Azure Data Lake Store (ADLS), which enables the capture of data of any size, type, or ingestion speed. To achieve this, they needed to transfer more than 100 TB of customer data securely from S3 to ADLS using Azure Data Factory. Azure Data Factory allows users to create a workflow that can ingest data from both on-premises and cloud data stores, then transform or process that data using existing compute services.

During the migration process, Emedgene faced a challenge involving their need to periodically pull data from their customers’ S3 buckets. While Azure Data Factory can be used for a full migration from S3, it only supports incremental data copy from external data sources by datestamp if the data store is properly structured.

Since AWS does not provide a way of enforcing a rigid store hierarchy, Emedgene needed a mechanism to support incremental copy for improperly organized data stores. To resolve this problem, we created a Docker container for incremental data copy from S3 to ADLS. This service enables Emedgene to copy new data incrementally from S3 to ADLS by datestamp without a dependency on data store structure.
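
The container’s implementation is not included in this excerpt. As a rough sketch of the underlying idea (not the container’s actual code), the AWS CLI can list objects modified after a given datestamp and the Azure CLI can push each one into ADLS; the bucket, account, and cutoff below are placeholders:

    #!/bin/bash
    # Hypothetical incremental copy: objects modified since $SINCE are copied
    # from an S3 bucket into an Azure Data Lake Store account.
    BUCKET="customer-genomics"      # placeholder S3 bucket
    ADLS_ACCOUNT="emedgene-adls"    # placeholder ADLS account
    SINCE="2017-10-01T00:00:00Z"    # last successful sync, tracked elsewhere

    # List keys whose LastModified is after the cutoff.
    aws s3api list-objects-v2 --bucket "$BUCKET" \
        --query "Contents[?LastModified>='${SINCE}'].Key" --output text |
    tr '\t' '\n' |
    while read -r KEY; do
        # Stage the object locally, then upload it to the same path in ADLS.
        TMP_FILE="/tmp/$(basename "$KEY")"
        aws s3 cp "s3://${BUCKET}/${KEY}" "$TMP_FILE"
        az dls fs upload --account "$ADLS_ACCOUNT" \
            --source-path "$TMP_FILE" \
            --destination-path "/${KEY}" --overwrite
        rm -f "$TMP_FILE"
    done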

Application Secrets

Emedgene’s microservices architecture rests upon a large number of interdependent internal and external services. Emedgene needed a secure and centralized way to manage access to and between these services.  Azure Key Vault provides key and secret management which allowed Emedgene to generate and manage secure access tokens for their services.

However, Emedgene wanted to be able to easily query Azure Key Vault locally using their Azure service principal credentials without having to parse a response object. This approach presented a couple of challenges. While the Azure REST API supports service principal authentication, the client-level APIs require credential authentication for querying Key Vault. In addition, certain Key Vault management features, such as retrieving a list of all key vaults, are not supported by the REST API.

To support Emedgene’s request, we created a scalable Docker container for querying Azure Key Vault for secrets using Azure service principal credentials. Once the container service is running, users can choose either to retrieve a secret from a specific vault or to search all vaults for a given secret.
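
The container’s own API is not shown in this excerpt; as a minimal sketch of the equivalent operations, the Azure CLI can log in with the same service principal credentials and query vaults and secrets directly. The vault and secret names below are placeholders:

    # Authenticate with service principal credentials.
    az login --service-principal -u "$AZURE_CLIENT_ID" \
        -p "$AZURE_CLIENT_SECRET" --tenant "$AZURE_TENANT_ID"

    # Retrieve a specific secret's value from a known vault.
    az keyvault secret show --vault-name my-vault --name db-password \
        --query value --output tsv

    # Or search every vault the principal can access for that secret name.
    for VAULT in $(az keyvault list --query "[].name" --output tsv); do
        az keyvault secret show --vault-name "$VAULT" --name db-password \
            --query value --output tsv 2>/dev/null
    done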

Code

Conclusions

Not all Azure services have direct 1:1 parity with AWS. When migrating from AWS to Azure, three important questions often arise:

  1. What is the most efficient path to move my data from AWS?
  2. How can I achieve the same or better compute functionality upon migration?
  3. How do I manage access to my newly migrated services?

Our solution addresses these concerns by providing better native compute scalability, efficient data transfer support, and access management through the Azure Key Vault.

Opportunities for Reuse

The solution outlined in this code story is adaptable to any workload that requires:

  • Continuous transfer of a large amount of data from S3 to Azure Data Lake
  • Changing the Jenkins DevOps process from AWS resources to Azure resources
  • Handling Key Vault secrets without embedding the Key Vault API in your code
