Welcome to a new style of blogging, called the High-level overview (HLO) series. In these blogs, I will describe the problem, which usually is something I came across recently, followed by a high-level solution overview of how it could be solved. The goal is to get you to dig deeper into individual components that make this high-level solution possible.
Very recently, I got a call from one of our architects who was tasked with comparing different clouds and presenting a solution to his customer. While the customer was going forward with another solution, they were interested in finding out how an equivalent would be built out in AWS. The goal was to have a replication environment for the customer’s on-premises SQL Server that could be failed over to. Moreover, the customer wanted to be able to stream data out of the SQL environment into an Elastic MapReduce cluster for data analytics. The customer was also concerned about storing large amounts of data in an archive and being able to retrieve it when needed. Needless to say, all the connectivity needed to be secure.
In summary –
OBJ.001 – Need to have a functional disaster recovery environment for their data.
OBJ.002 – Need to have an effective way to do data processing and modeling while keeping costs low.
OBJ.003 – Need to have an archival methodology to fulfill long-term storage requirements.
OBJ.004 – Ensure data in transit is secure.
FR.001 – A busy SQL Server needs replication, backup, and archival to ensure availability, disaster recovery, and long-term storage.
FR.002 – Data from the SQL Server needs to be pushed into Elastic MapReduce (EMR) for data processing and modeling.
FR.003 – Data from EMR needs to be archived for long-term storage purposes.
FR.004 – Secure connectivity for all services over which data would traverse from the customer’s on-prem environment to a cloud provider.
Below is a high-level illustration of the current deployment as I understood it.
The Options –
Given the customer objectives and functional requirements, AWS provides multiple products that help us define a solution to satisfy the customer’s use case. Let’s look at these products individually.
1. Amazon RDS – Amazon’s Relational Database Service (RDS) provides a scalable DBaaS offering with the ability to migrate and replicate SQL databases.
2. Amazon S3 – Amazon’s object storage offering, Amazon S3 (Simple Storage Service), allows you to store large volumes of data with virtually unlimited capacity.
3. Amazon Glacier – Amazon Glacier is Amazon’s archival storage offering that lets you archive PB-scale data at very low cost. Glacier can also automatically archive data stored on Amazon S3 using lifecycle policies.
4. AWS Kinesis – Amazon Kinesis allows you to collect, process, and analyze real-time data streams. Data can be analyzed using Kinesis Data Analytics, which allows the use of standard SQL queries. You can also push a Kinesis data stream to a stream processing framework like Amazon Elastic MapReduce (EMR), which supports Apache Hadoop, Spark, and other big-data frameworks.
5. AWS Lambda – AWS Lambda is Amazon’s serverless technology that allows you to build small, purpose-built functions.
6. AWS SNS – AWS Simple Notification Service (SNS) allows you to trigger notifications or even AWS Lambda functions based on pre-defined events.
7. AWS VPN Gateway – Allows you to create secure connections between sites.
8. AWS Storage Gateway – Allows you to deploy a virtual machine instance with different storage options on-premises. This virtual machine replicates the data it stores to an Amazon S3 bucket.
9. AWS Snowball – AWS’s solution for migrating large amounts of data using cold migration techniques.
10. AWS Direct Connect – A cost-effective private network connection from your on-premises environment to an AWS datacenter for migrating large data sets. It can also be used to keep network traffic on private links rather than the internet.
Let’s connect the dots, slowly.
With AWS VPN Gateway, a customer can securely connect their on-prem environments to AWS regions. This is crucial and fulfills FR.004, which requires all data to be secure in transit. It is important to remember that there is a default limit of 5 VPN gateways per region; this limit can be raised by reaching out to AWS Support. Alternatively, the customer could use AWS Direct Connect, which may be the better option in this scenario provided the customer’s data center is close to an AWS Partner Network (APN) provider. AWS Direct Connect offers consistently high bandwidth (up to 10 Gbps) and a private connection into your VPC on AWS, meaning traffic does not traverse the internet. Direct Connect is also ideal for real-time streaming data and can be used to seamlessly extend the customer’s network into AWS.
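To make the site-to-site VPN piece concrete, here is a minimal sketch of the request parameters it boils down to. The public IP, ASN, and names are placeholders I made up for illustration; the snippet only assembles the parameters locally, and the commented lines show where the real boto3 calls (`create_customer_gateway`, `create_vpn_gateway`, `create_vpn_connection`) would go.

```python
# Sketch: parameters for a site-to-site VPN between on-prem and a VPC.
# The IP and ASN below are placeholders, not real customer values.

customer_gateway = {
    "Type": "ipsec.1",             # IPsec is the supported VPN type
    "PublicIp": "203.0.113.10",    # on-prem router's public IP (placeholder)
    "BgpAsn": 65000,               # on-prem BGP ASN (placeholder)
}

vpn_gateway = {"Type": "ipsec.1"}  # virtual private gateway on the AWS side

# With boto3 this would be roughly:
#   ec2 = boto3.client("ec2")
#   cgw = ec2.create_customer_gateway(**customer_gateway)
#   vgw = ec2.create_vpn_gateway(**vpn_gateway)
#   ec2.create_vpn_connection(
#       Type="ipsec.1",
#       CustomerGatewayId=cgw["CustomerGateway"]["CustomerGatewayId"],
#       VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"])

print(customer_gateway["PublicIp"])
```

The virtual private gateway must also be attached to the VPC and the route tables updated before traffic flows, which I have left out for brevity.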
The customer currently has large sets of data that need to be migrated to AWS. While Direct Connect allows for high-bandwidth transfers, it can get very expensive. Amazon offers AWS Snowball to help transfer cold data into the AWS cloud. The process is simple: once you put in a request, AWS ships a secure Snowball appliance to your data center, which you connect to your environment and copy your data onto. Once done, you ship the device back to AWS. All data on the device is encrypted and secure. AWS also offers Snowball Edge, which packs more compute into the device, allowing you to access your data using a local EC2 instance. AWS Snowball has a capacity of 50 TB to 80 TB, while the Edge device goes up to 100 TB.
For PB-scale data migration, AWS offers Snowmobile: an 18-wheeler truck carrying a shipping container that functions as a mobile data center. The container is transported to your data center and connected to power and network. Once done, you can copy PB-scale data to it before Amazon picks it back up.
For a simpler way to transfer data, AWS offers storage gateways. A storage gateway is a lightweight virtual machine that can be deployed in your environment and configured with your AWS account. The gateway uses a local disk and exposes it as an iSCSI drive accessible to other virtual machines. Any data stored on this drive is then replicated to your Amazon S3 bucket. Storage gateways can be configured for hot, cold, and cached data, so you have a variety of options depending on your use case. Downloading and deploying the storage gateway appliance is free of cost, so it has a low barrier to entry and is ideal for file transfers.
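As a rough sketch of what carving out such an iSCSI volume looks like, the parameters below feed the Storage Gateway `create_cached_iscsi_volume` API. Every ARN, size, IP, and name here is a placeholder; the snippet only builds the request locally.

```python
# Sketch: parameters for a cached iSCSI volume on a storage gateway.
# All identifiers below are placeholders for illustration.

GIB = 1024 ** 3

cached_volume = {
    "GatewayARN": "arn:aws:storagegateway:us-east-1:123456789012:gateway/sgw-12345678",
    "VolumeSizeInBytes": 500 * GIB,       # exposed to local VMs as an iSCSI LUN
    "TargetName": "sql-backup-volume",    # becomes part of the iSCSI target IQN
    "NetworkInterfaceId": "10.0.0.25",    # gateway VM's local IP (placeholder)
    "ClientToken": "sql-backup-volume-1", # idempotency token for retries
}

# With boto3 this would be roughly:
#   sgw = boto3.client("storagegateway")
#   sgw.create_cached_iscsi_volume(**cached_volume)
```

In cached mode the gateway keeps frequently used blocks on the local disk while the full volume lives in S3, which fits the "low barrier to entry" point above.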
Storage and Archival
AWS’s S3 (Simple Storage Service) is an object data store that offers unlimited storage for files. S3, like any object storage, is accessible over HTTP and HTTPS and stores data securely in AWS’s datacenters. The data is replicated locally but can also be replicated across regions (cross-region replication) to increase availability. You can even serve these files directly from S3 into your application. An interesting feature of S3 is the ability to attach lifecycle policies to the files stored in “S3 buckets”. You can set a lifecycle policy to archive all files after a set amount of time, and S3 will move them over to Amazon Glacier – Amazon’s low-cost, long-term archival solution.
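A minimal sketch of such a lifecycle policy is below. The bucket name and prefix are assumptions for the example; the rule simply says "transition objects under `backups/` to Glacier after 90 days", and the commented boto3 call shows how it would be applied.

```python
# Sketch: an S3 lifecycle configuration that archives objects to Glacier
# after 90 days. Bucket name and prefix are placeholders.

lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": "backups/"},   # only archive this prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER"}
            ],
        }
    ]
}

# Applied with boto3 it would be roughly:
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",
#       LifecycleConfiguration=lifecycle_config)
```

Additional rules can expire the archived objects entirely after a retention period, which is worth pairing with the customer's long-term storage requirement (OBJ.003).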
AWS’s Simple Notification Service (SNS) can be configured to alert the customer based on custom or pre-set triggers. You will find SNS being used in almost everything in AWS. For instance, when you create a new AWS account, the first thing to do is to create an SNS billing alert to ensure that you don’t exceed billing thresholds. SNS can also trigger, or be triggered by, other AWS services such as Lambda functions (serverless).
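The billing alert is really a CloudWatch alarm on the `EstimatedCharges` metric with an SNS topic as its action. The sketch below assembles the alarm parameters; the topic ARN and threshold are placeholders, and only the commented boto3 call would touch AWS.

```python
# Sketch: a CloudWatch billing alarm that notifies an SNS topic when
# estimated charges cross a threshold. ARN and threshold are placeholders.

billing_alarm = {
    "AlarmName": "billing-over-100-usd",
    "Namespace": "AWS/Billing",
    "MetricName": "EstimatedCharges",
    "Dimensions": [{"Name": "Currency", "Value": "USD"}],
    "Statistic": "Maximum",
    "Period": 21600,                 # billing metrics update a few times a day
    "EvaluationPeriods": 1,
    "Threshold": 100.0,              # alert past $100 (placeholder)
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
}

# Billing metrics live in us-east-1, so with boto3:
#   cw = boto3.client("cloudwatch", region_name="us-east-1")
#   cw.put_metric_alarm(**billing_alarm)
```

Note that billing metrics must first be enabled in the account's billing preferences before the alarm has data to evaluate.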
AWS Lambda is Amazon’s serverless technology, which allows you to run small, purpose-focused functions based on events or triggers. You can trigger a Lambda function to perform a certain task. For example, I can have SNS ensure that billing does not exceed $100 per day; if it does, an event trigger can be sent to a Lambda function that immediately shuts down my instances to save on billing costs.
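A sketch of such a Lambda handler is below. The message schema (`{"instance_ids": [...]}`) is an assumption I made for the example, not a standard AWS payload; the SNS envelope (`Records[].Sns.Message`) is what Lambda actually receives from an SNS trigger. The handler takes an optional EC2 client so it can be dry-run locally without touching AWS.

```python
import json

# Sketch of a Lambda handler wired to the billing SNS topic. The message
# body schema ({"instance_ids": [...]}) is an assumed convention.

def handler(event, context=None, ec2=None):
    """Stop the EC2 instances named in the SNS message(s)."""
    stopped = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        ids = message.get("instance_ids", [])
        if ec2 is not None:          # in a real Lambda, pass boto3.client("ec2")
            ec2.stop_instances(InstanceIds=ids)
        stopped.extend(ids)
    return stopped

# Local dry run with a fake SNS event (no AWS calls are made):
event = {"Records": [{"Sns": {"Message": json.dumps(
    {"instance_ids": ["i-0abc123", "i-0def456"]})}}]}
print(handler(event))  # -> ['i-0abc123', 'i-0def456']
```

Injecting the client rather than constructing it inside the handler also makes the function trivial to unit test.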
AWS Kinesis is a solution for real-time data stream analysis. Real-time data can be collected and analyzed using Kinesis Data Analytics – helpful for this customer because it allows using regular SQL queries to analyze data. This data can also be pushed to a stream processing framework such as Amazon EMR for big-data analysis before being archived.
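On the producer side, getting SQL change data into Kinesis comes down to shaping each changed row into a `put_record` call. The stream name and record schema below are assumptions for illustration; partitioning by the row's primary key keeps updates for the same key on the same shard, in order.

```python
import json

# Sketch: shaping a changed SQL row into a Kinesis record. The stream
# name and payload schema are made up for this example.

def make_kinesis_record(table, row, key_column="id"):
    """Build the kwargs for kinesis.put_record() from a changed row."""
    return {
        "StreamName": "sql-change-stream",                    # placeholder
        "Data": json.dumps({"table": table, "row": row}).encode("utf-8"),
        "PartitionKey": str(row[key_column]),                 # shard by primary key
    }

record = make_kinesis_record("orders", {"id": 42, "total": 19.99})
# A producer would then call:
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**record)
print(record["PartitionKey"])
```

From there, Kinesis Data Analytics or an EMR consumer can read the same stream without the producer changing at all.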
AWS RDS (Relational Database Service) offers a managed database environment that can be readily consumed. You simply deploy database instances and pick a database engine; engines such as Oracle and SQL Server are supported. This fulfills the customer’s need to migrate a SQL database to a remote instance for disaster recovery purposes. AWS RDS allows SQL replication, with changed data replicated from your primary database. You can even have read-only RDS instances and perform disaster recovery tests to fulfill your business continuity plan (BCP).
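For the read-only replica piece, the sketch below shows the parameters behind the RDS `create_db_instance_read_replica` API. All identifiers are placeholders, and note the caveat: read replica support varies by engine (RDS for SQL Server has historically leaned on Multi-AZ and native backup/restore instead), so treat this as illustrative rather than a confirmed fit for this customer's engine.

```python
# Sketch: parameters for an RDS read replica for DR / read scaling.
# Instance identifiers and class are placeholders; engine support for
# read replicas varies, so verify against the chosen engine first.

read_replica = {
    "DBInstanceIdentifier": "sql-replica-1",        # the new replica
    "SourceDBInstanceIdentifier": "sql-primary",    # existing primary
    "DBInstanceClass": "db.m5.large",               # placeholder size
}

# With boto3 this would be roughly:
#   rds = boto3.client("rds")
#   rds.create_db_instance_read_replica(**read_replica)
```

A replica like this can serve DR tests and read traffic, and can be promoted to a standalone instance during a failover exercise.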
Putting it all together
I encourage you to read more about the different solutions discussed in this blog post. Feel free to comment.
Some important links