The AWS analytics group of services has a lot of members. Many of them are among Amazon's newer offerings, but they are well worth using, both for your professional development and for learning more about your enterprise environment.
Query Data in S3 using SQL. Store your data in S3, define a schema on top of the data set, and run queries against it. The UI is not great at the moment, but this serverless query service (Amazon Athena) gets analytical results back to you quickly and is a way to avoid building out a data warehouse for your needs. Better yet, it comes without all of the typical setup.
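To make the schema-on-S3 idea concrete, here is a minimal sketch of the Athena workflow: define an external table over files already sitting in S3, then query them with standard SQL. The bucket, database, table, and column names below are hypothetical examples, not anything from this article.

```python
# Hypothetical Athena DDL: the table definition points at an S3 prefix;
# no data is loaded or moved, Athena reads the files in place.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders (
    order_id  string,
    amount    double,
    order_ts  timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/orders/'
"""

# A standard SQL query against that schema.
query = "SELECT order_id, amount FROM sales_db.orders WHERE amount > 100 LIMIT 10"

# With boto3 (the AWS SDK for Python) installed and credentials configured,
# each statement is submitted with start_query_execution, e.g.:
#
#   import boto3
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=query,
#       ResultConfiguration={"OutputLocation": "s3://example-bucket/results/"},
#   )
print(query)
```

Results land as files in the S3 output location you specify, so downstream tools can pick them up without any warehouse in between.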
Managed Search Service. Amazon CloudSearch provides a way for you to upload data or documents, index them, and serve search requests for that data via HTTP. It is flexible and lets you define the indexes yourself, so you can upload almost any document format or data style and let the service handle search requests.
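The upload side works in batches of add and delete operations expressed as JSON. Here is a sketch of such a batch, assuming a hypothetical domain and hypothetical field names; the endpoint shape in the comment is the documented pattern, with placeholders for the domain-specific parts.

```python
import json

# A CloudSearch document batch: an array of operations, each with an id,
# a type ("add" or "delete"), and for adds a map of indexed fields.
# The ids and field names here are made-up examples.
batch = [
    {"type": "add", "id": "doc-1",
     "fields": {"title": "Quarterly report", "year": 2019}},
    {"type": "delete", "id": "doc-9"},
]
payload = json.dumps(batch)

# The batch is POSTed to the domain's document endpoint, e.g.:
#   POST https://doc-<domain>.<region>.cloudsearch.amazonaws.com/2013-01-01/documents/batch
#   Content-Type: application/json
print(payload)
```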
Hosted Hadoop Framework. Amazon EMR allows you to spin up a Spark or Hadoop cluster on top of your S3 data lake quickly. It takes care of the headaches of building those environments yourself, and because it scales with demand, it is a cost-effective solution for your data science needs that avoids over-buying resources.
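As a sketch of how little setup "spin up a Spark cluster" involves, here is the shape of the parameters for boto3's `run_job_flow` call, which launches a cluster. The cluster name, instance types and counts, and log bucket are hypothetical; the two default role names are the ones EMR documents.

```python
# Hypothetical parameters for launching a small Spark cluster on EMR.
cluster_params = {
    "Name": "analytics-sandbox",
    "ReleaseLabel": "emr-5.29.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Let the cluster terminate itself when its work is done.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    "LogUri": "s3://example-bucket/emr-logs/",
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With boto3 installed and credentials configured:
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(**cluster_params)
```

Auto-terminating clusters like this one are the usual way to keep costs down: the cluster exists only while the job runs, and the data stays in S3.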
Run and Scale Elasticsearch Clusters. Elasticsearch is a popular open-source search and analytics engine with a broad range of uses, including log file research, stream data analysis, application monitoring, and more. Amazon Elasticsearch Service is quick and easy to set up, so you can dive right into the analysis part of your work. The fully managed service has an API and CLI, as you would expect, so you can automate it to fit your needs.
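Searches against a managed domain are plain HTTP requests carrying the Elasticsearch query DSL. Here is a sketch of a request body for the log-research use case mentioned above; the index name (`logs`) and field name (`message`) are hypothetical.

```python
import json

# A standard Elasticsearch query-DSL body: find the five most relevant
# log entries whose message field matches "timeout".
search_body = {
    "query": {
        "match": {"message": "timeout"},
    },
    "size": 5,
}
request = json.dumps(search_body)

# Against a managed domain this body is POSTed to the search endpoint, e.g.:
#   curl -H 'Content-Type: application/json' \
#        -d "$request" https://<domain-endpoint>/logs/_search
print(request)
```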
Work with Real-time Streaming Data. Amazon Kinesis provides a way to analyze video and data streams in real time. We covered it in the media episode, so we will not spend time on it here.
Fully Managed Apache Kafka Service. I must admit that this is not an area where I am solid, so it is best to use Amazon's own words.
“Amazon Managed Streaming for Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. Amazon MSK provides the control-plane operations and lets you use Apache Kafka data-plane operations, such as those for producing and consuming data. It runs open-source versions of Apache Kafka. This means existing applications, tooling, and plugins from partners and the Apache Kafka community are supported without requiring changes to application code. This release of Amazon MSK supports Apache Kafka version 1.1.1.”
This service provides fast, simple, and cost-effective data warehousing. If you have wondered whether there is a fully managed data warehouse solution out there, here is your answer. Redshift is fully managed, scales up to petabytes, and incorporates the security and administration tools you have come to expect from AWS. There are some excellent how-to guides and tutorials to help you get started and maybe even better understand data warehouses in general.
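To give a flavor of getting started, here is the standard Redshift loading pattern: create a table, then use the `COPY` command to load data from S3 in parallel. The table, bucket, and IAM role names are hypothetical examples.

```python
# Hypothetical Redshift SQL, held as strings here since running it
# requires a live cluster and credentials.
create_sql = """
CREATE TABLE orders (
    order_id  varchar(32),
    amount    decimal(10,2),
    order_ts  timestamp
);
"""

# COPY pulls the files straight out of S3 into the table, using an IAM
# role for access rather than embedded keys.
copy_sql = """
COPY orders
FROM 's3://example-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV;
"""
print(copy_sql)
```

`COPY` is worth learning early: it is dramatically faster than row-by-row inserts because Redshift splits the load across the cluster's nodes.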
This is a fast business analytics service, also known as a fully managed BI solution. Amazon QuickSight is what you would expect from a BI tool, which means it requires setup and forethought to position your data. Although it is a robust service, expect to spend at least a few hours getting going.
Next is an orchestration service for periodic, data-driven workflows. Yes, those are their words, not mine. AWS Data Pipeline is a web service that helps you reliably move data between different AWS compute and storage services, and its scope includes on-premises data sources as well. You can schedule moving all of your enterprise data to the proper destinations, translating and manipulating it at scale along the way. Once you have a lot of data spread across services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR, this becomes critical. While it is not of much use early on, it is essential to running an enterprise.
This service helps you prepare and load data. AWS Glue is a fully managed ETL (extract, transform, and load) solution that makes it easy to prepare and load your data no matter the end goal. You can create and run an ETL job with a few clicks in the AWS Management Console. I have not used it beyond simple tests, but it may be your best answer to ETL needs. When you already store your data on AWS, why not try it out? It catalogs your data and makes it easy to dive right into the ETL process.
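Glue jobs themselves are typically PySpark scripts generated against the data catalog, which is hard to show without the service. As a plain-Python stand-in, this sketch shows the same extract-transform-load shape on a small in-memory CSV; the field names and the filter rule are made-up examples.

```python
import csv
import io

# Extract: read raw records (here from an in-memory CSV instead of S3).
raw = io.StringIO("order_id,amount\nA-1,19.99\nA-2,250.00\n")
records = list(csv.DictReader(raw))

# Transform: normalize types and filter down to large orders.
big_orders = [
    {"order_id": r["order_id"], "amount": float(r["amount"])}
    for r in records
    if float(r["amount"]) > 100
]

# Load: write the cleaned records to a target (here another CSV buffer,
# standing in for a warehouse table or S3 prefix).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["order_id", "amount"])
writer.writeheader()
writer.writerows(big_orders)
print(out.getvalue())
```

Glue's value-add over hand-rolled scripts like this is the catalog: it crawls your stores, infers the schemas, and keeps the "extract" step pointed at the right data as it changes.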
This is advertised as a way to build a secure data lake in days, and I find it hard to argue against that claim. We have already seen how well AWS handles storing and cataloging (even indexing) data, so it makes sense that AWS Lake Formation extends from those solutions. Data lakes are still a relatively new concept, so you may want to check the latest news and how-tos at this link.