
AWS re:Invent 2022 Mega Recap

12.15.2022

This month our geospatial team traveled to Las Vegas to attend AWS re:Invent and talk about FilmDrop—our cloud-native, geospatial processing suite. Over the course of the week, we had a chance to meet tons of great people and spend time in talks and labs learning as much as we could. We’ve put together a mega list of our favorite sessions and announcements from re:Invent 2022:

SageMaker

  • SageMaker now includes geospatial capabilities. You can access pre-trained models for geospatial data, use imagery from Planet Labs, and bring semi-structured or unstructured data into ML/analytics workflows. You can run operations like raster band math, land cover segmentation, and classifying cloudy vs. cloud-free imagery. There is also integration with Foursquare Studio for visualizing and interacting with data (a short sketch of the API follows this list).
  • SageMaker Shared Spaces provides mechanisms to collaborate with other researchers in notebooks. Shared Spaces uses EFS to provide shared data and directories, and it also supports live collaboration within a notebook, akin to editing a Google Doc with collaborators.
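
To give a sense of what the new geospatial capabilities look like from code, here is a minimal sketch using boto3’s sagemaker-geospatial client to list the raster data collections available for querying. The region and the exact response field names are assumptions on our part; treat it as illustrative rather than authoritative.

```python
# Minimal sketch: list the raster data collections exposed by SageMaker's
# geospatial capabilities. Assumes the boto3 "sagemaker-geospatial" client;
# the region and response field names are illustrative.
import boto3

geo = boto3.client("sagemaker-geospatial", region_name="us-west-2")

# Each collection (e.g., Sentinel-2, Landsat) can then be queried or used as
# input to processing jobs like band math, cloud masking, or segmentation.
collections = geo.list_raster_data_collections()
for summary in collections.get("RasterDataCollectionSummaries", []):
    print(summary.get("Name"), summary.get("Arn"))
```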

OpenSearch

  • OpenSearch Serverless is a new option for running OpenSearch, a fork of Elasticsearch, that scales up and down automatically as ingest and search loads change. The service has an ingenious design that separates ingest and search into separate “scalable compute units”. The underlying Lucene shards are created by Ingest and stored in S3, while Search scales separately from Ingest based on the current number of queries, pulling shards from S3 and caching them in memory to fulfill user requests. This approach could provide huge cost savings for organizations with medium to large clusters that can cost thousands to tens of thousands of dollars a month (a short sketch of creating a collection follows this list).
  • OpenSearch Data Prepper (fully managed data ingestion for observability and security analytics)
  • OpenSearch now gives you insights into log patterns, can detect unexpected outliers, and can detect patterns based on recurring event signatures
  • AWS PrivateLink support for secure communications
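
As a concrete example of the serverless option, here is a minimal sketch of creating a collection with boto3’s opensearchserverless client. The collection name is made up, and a real deployment also needs encryption, network, and data-access policies configured; consider this illustrative only.

```python
# Minimal sketch: create an OpenSearch Serverless collection. Ingest and search
# scale as separate compute units behind this single collection endpoint.
# The collection name is a placeholder; security policies are omitted for brevity.
import boto3

aoss = boto3.client("opensearchserverless", region_name="us-west-2")

response = aoss.create_collection(
    name="filmdrop-metadata",  # hypothetical collection name
    type="SEARCH",             # search-optimized (vs. time-series)
    description="Example serverless collection",
)

# The response describes the new collection (id, ARN, creation status).
print(response)
```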

Sessions

  • WPS302-R: Identifying improper payments with analytics and ML (Builder Session). This session walked through building a model for a highly imbalanced dataset: synthetically balancing the data and splitting it into training/validation sets, working in JupyterHub, oversampling fraud events with the Synthetic Minority Over-sampling Technique (SMOTE), and training the model with the SageMaker XGBoost container (see the first sketch after this list).
  • Intelligently Automating Cloud Operations:
    • AWS Health Service Data: This service lets you pull in health data in real time for the services you are using.
    • AWS Fault Injection Simulator – Lets you run reliability tests and failure simulations natively, instead of figuring out how to do so with third-party tools.
    • AWS DevOps Guru – Uses ML to monitor your infrastructure, surface performance indicators, and show how to automatically remedy non-performant configurations. One example was enabling the service so that, if a few service-outage status codes appear, traffic rolls over to another region.
  • Elasticsearch: drive speed, scale, and relevance: This talk focused on optimizations across different parts of Elasticsearch. One tool mentioned was eland, which lets you create DataFrames backed by indexes and import Hugging Face or other NLP models into the cluster (see the second sketch after this list).
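
The first sketch below illustrates the imbalanced-data workflow from WPS302-R using the open-source imbalanced-learn and xgboost packages locally, rather than the SageMaker XGBoost container the session used; the dataset is synthetic and the parameters are illustrative.

```python
# Minimal sketch of the WPS302-R approach: hold out validation data, oversample
# the rare class with SMOTE, then train an XGBoost classifier.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for payment records: roughly 1% "fraud" events.
X, y = make_classification(n_samples=10_000, n_features=20, weights=[0.99], random_state=42)

# Split first so the validation set keeps the real class imbalance,
# then balance only the training split with SMOTE.
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier(eval_metric="logloss")
model.fit(X_bal, y_bal)
print("validation accuracy:", model.score(X_val, y_val))
```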
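
The second sketch shows eland’s DataFrame interface against an Elasticsearch index; the cluster URL and index name are placeholders.

```python
# Minimal sketch: an Elasticsearch-backed DataFrame with eland. Operations are
# pushed down to the cluster rather than pulling all documents client-side.
# The cluster URL and index name below are placeholders.
import eland as ed

df = ed.DataFrame("http://localhost:9200", es_index_pattern="logs-sample")

print(df.shape)    # counts computed on the cluster
print(df.head())   # only a small page of documents is fetched
```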

Everything Else

  • AWS is investing in a zero-ETL future. New functionality in this area includes Aurora zero-ETL integration with Redshift and Redshift auto-copy from S3. Amazon Redshift integration for Apache Spark makes it easier and faster for customers to run Apache Spark applications on Redshift data using AWS analytics and machine learning services. There is no need to move any data; you can run Spark queries on Redshift data from EMR, Glue, and SageMaker within seconds (see the first sketch after this list).
  • How to speed up Lambda functions: Lambda SnapStart virtually eliminates cold starts by restoring your Lambda function from a snapshot instead of running the usual initialization process. You can also increase memory size and use provisioned and reserved concurrency (see the second sketch after this list).
  • Cloud data analytics: Talks on building better analytical pipelines. One key takeaway was using AWS EMR, which offers some of Glue’s capabilities at a lower price point, especially for hosted notebook integration and shared notebooks.
  • Sustainability in the cloud: maintaining AWS workloads that use the cloud’s limited shared resources efficiently while reducing overall CO2 emissions. There were multiple sessions covering different aspects, such as sustainability KPIs, architectural best practices, efficient hardware like Graviton, and programming languages like Rust. Several sessions focused on Rust specifically, citing its ability to perform efficiently and use less memory than many other languages, in addition to the safety guarantees the language provides.
  • Security Lake is a new data lake service that centralizes security-related data like VPC Flow Logs, CloudTrail, Security Hub, and any other custom data. One of the interesting features is that it stores everything in the new Open Cybersecurity Schema Framework (OCSF) as Parquet files in S3. The OCSF format allows integration with many other sources and analysis tools inside and outside AWS. For example, you can hook up your own data sources that provide OCSF and Security Lake will ingest and store them, and you can use tools like Athena, OpenSearch, or Datadog to analyze the data (see the third sketch after this list).
  • Amazon Verified Permissions: Looks to do what Cognito does for authentication but for authorization. It provides “scalable, fine-grained permissions management” for custom applications. This could remove a lot of custom code for applications that have to manage permissions for user actions within an application. Permissions are expressed in the Cedar policy language that is reminiscent of IAM permissions.
  • Compute on orbit. There are so many opportunities for running ultra-low-latency processing pipelines and analytics against data that for one reason or another you don’t want to or can’t downlink.
  • Delegated Administrator for AWS Organizations
  • AWS DataZone
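
The first sketch below shows what reading Redshift data from Spark can look like in PySpark, assuming the open-source spark-redshift community connector that the new EMR/Glue integration builds on; the JDBC URL, table, S3 temp directory, IAM role, and column names are all placeholders.

```python
# Minimal PySpark sketch: read a Redshift table into a Spark DataFrame, assuming
# the spark-redshift community connector. Every identifier below is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-spark-example").getOrCreate()

df = (
    spark.read.format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://example-cluster.abc123.us-west-2.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.scenes")                                # hypothetical table
    .option("tempdir", "s3://example-bucket/redshift-temp/")           # staging area for unload
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/example")  # role Redshift assumes
    .load()
)

df.groupBy("platform").count().show()  # "platform" is a hypothetical column
```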
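
The second sketch enables SnapStart on an existing function with boto3; the function name is a placeholder, and SnapStart only takes effect on published versions, so a version is published after the configuration change.

```python
# Minimal sketch: turn on Lambda SnapStart so published versions are restored
# from a snapshot instead of running full initialization on a cold start.
# The function name is a placeholder.
import boto3

lam = boto3.client("lambda", region_name="us-east-1")

lam.update_function_configuration(
    FunctionName="example-function",
    SnapStart={"ApplyOn": "PublishedVersions"},
)

# Wait for the configuration update to finish, then publish the version
# that will be snapshotted.
lam.get_waiter("function_updated_v2").wait(FunctionName="example-function")
lam.publish_version(FunctionName="example-function")
```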
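
The third sketch queries Security Lake data through Athena with boto3; the database, table, and results bucket names are placeholders we made up for illustration.

```python
# Minimal sketch: run an Athena query against Security Lake's OCSF/Parquet tables.
# Database, table, and output bucket are placeholders.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT *
FROM example_security_lake_db.example_vpc_flow_table
LIMIT 10
"""

response = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```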

Photo by Ameer Basheer on Unsplash