Notes: (AWS re:Invent 2020 DAT310) Deep Dive on Amazon Timestream

Abstract

In recent years, TSDBs (Time Series Databases) have gradually been split out into their own category because of their specialized workloads; they fit IoT applications and DevOps/App analytics scenarios, and the corresponding AWS product is Amazon Timestream. Built on AWS's strengths in distributed compute and storage, Amazon Timestream offers a serverless, highly scalable architecture, which makes its underlying design quite intriguing.

From this short session, I think there are three key takeaways:

  1. The use cases where a Time Series Database fits, and the strengths/advantages of Amazon Timestream.
  2. Restructuring the data you write based on the pricing rules (how to go from $25 to $0.78).
  3. Best practices for querying data.

It is not an especially deep dive, but it is a good 30-minute quick introduction to the underlying architecture, and a quick watch for anyone currently comparing TSDB options.



Topic

Deep Dive on Amazon Timestream

Speaker

  • Tony Gibbs, AWS Speaker (Principal Database Solution Architect, AWS)

Content

Overview: Deep dive on Amazon Timestream

  • Introducing Amazon Timestream
  • Architectural concepts and terminology
  • Data storage and ingestion
  • Query processing
  • Additional resources

Introducing Amazon Timestream

Time-series use cases

  • IoT applications
    • Collect motion or temperature data from the device sensors, interpolate to identify the time ranges without motion, or alert consumers to take actions such as turning off the lights to save energy.
  • DevOps analysis
    • Collect and analyze performance and health metrics such as CPU/memory utilization, network data, and IOPS to monitor health and optimize instance usage.
  • App analysis
    • Easily store and analyze clickstream data at scale to understand the customer journey - the user activity across your applications over a period of time.

Building with time-series data is challenging

  • Relational databases
    • ❌ Inefficient at processing time-series data
    • ❌ Data management issues with rigid schema
    • ❌ Limited integrations for ML, analytics, and data collection
  • Existing time-series solutions
    • ❌ Difficult to scale for large volumes of data
    • ❌ Minimal data lifecycle management
    • ❌ Real-time and historical data are decoupled

Amazon Timestream

Fast, scalable, and serverless time-series database

  • Serverless and easy to use
    • No servers to manage or instances to provision; software patches, indexes, and database optimizations are handled automatically
  • Performance at scale
    • Capable of ingesting trillions of events daily; the adaptive SQL query engine provides rapid point-in-time queries with its in-memory store, and fast analytical queries through its magnetic store
  • Purpose-built for time-series data
    • Built-in analytics using standard SQL with added interpolation and smoothing functions to identify trends, patterns, and anomalies
  • Secure from the ground up
    • All data is encrypted in flight and at rest using AWS Key Management Service (AWS KMS) with customer-managed keys (CMK)

Architectural concepts and terminology

  • Continuous releases
    • No maintenance or downtime
    • Serverless architecture

Terminology and concepts: Tables

  • Encrypted container that holds records
  • No data definition or columns are specified at creation
  • Time-based data retention policies for controlling data lifecycle within storage tiers

Terminology and concepts: Storage tiers

Two storage tiers: in-memory and magnetic

  • Retention periods are required for both tiers at table creation
  • Retention periods can be modified after table creation
  • In-memory store can range from 1 hour to 1 year, and magnetic store from 1 day to 200 years
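
As a concrete illustration, here is a minimal sketch of setting both retention periods at table creation with boto3. The database/table names and retention values are hypothetical, not taken from the talk:

```python
import boto3

write_client = boto3.client("timestream-write", region_name="us-east-1")

# Hypothetical names and retention values, purely for illustration.
write_client.create_table(
    DatabaseName="devops",
    TableName="host_metrics",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 24,    # in-memory tier: 1 hour to 1 year
        "MagneticStoreRetentionPeriodInDays": 365,  # magnetic tier: 1 day to 200 years
    },
)
```

Both retention values can be adjusted later with update_table, matching the point above that retention periods can be modified after table creation.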

Amazon Timestream architecture

  • Decoupled architecture
    • Highly available - 99.99% SLA
    • Independently scalable ingestion, storage, and SQL processing
  • High throughput auto-scaling ingestion
    • Data is replicated across multiple Availability Zones
    • Automatic data deduplication handling
    • No need to provision or configure write I/O
  • Multiple tiers of storage
    • Scalable to petabytes and beyond
    • In-memory store is designed for fast point-in-time queries
    • Magnetic store is designed for high performance analytics queries and low cost long-term storage
  • Scalable SQL query engine
    • Adaptive query engine is capable of querying data across multiple data tiers
    • No indexes to configure and no provisioning required

Terminology and concepts: Dimensions

Dimensions are a set of attributes that uniquely describe a measurement

  • Each table allows up to 128 unique dimensions
  • All dimensions are represented as varchars
  • Dimensions are dynamically added to the table during ingestion

Terminology and concepts: Measures

Each Amazon Timestream record contains a single measurement comprised of a name and value

  • Each table supports up to 1,024 unique measure names
  • Measurement values support boolean, bigint, double, and varchar
  • Measures are dynamically added to the table during ingestion

Terminology and concepts: Time series

A sequence of records represented as data points over a time interval for a given measurement

  • A time series is a set of timestamp and measure value pairs that have the same dimension name, value, and measure name

Example: Time series in Amazon Timestream
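
As a hypothetical illustration (the actual slide example may differ): all records that share the dimensions region=us-east-1 and host=host-1 and the measure name cpu_utilization form a single time series.

```
time                     region     host    measure_name     measure_value::double
2020-12-01 19:00:00.000  us-east-1  host-1  cpu_utilization  35.1
2020-12-01 19:00:05.000  us-east-1  host-1  cpu_utilization  38.4
2020-12-01 19:00:10.000  us-east-1  host-1  cpu_utilization  45.0
```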

Terminology and concepts: Data modeling

  • In a traditional relational database, we would create a wide table or use dimension and fact tables to model the data.
  • Amazon Timestream represents data with a single measure per record.

Characteristics of Amazon Timestream data

  • All records require a timestamp, one or more dimensions, a measurement name, and a measurement value.
  • Records cannot be deleted or updated.
    • Records are only removed when they reach the retention limit within the magnetic tier.
    • Choice of first or last writer wins semantics for handling duplicates.
  • Multiple measures are logically represented as multiple individual records.
    • One measure per record.
  • Automatically scales to handle high throughput, real-time data ingestion.

Data storage and ingestion

Data ingestion: Connectivity

  • Data is written using the AWS SDK
    • Java, Python, Golang, Node.js, .NET, etc.
    • AWS CLI
  • Adapters and plugins
    • AWS IoT Core
    • Amazon Kinesis Data Analytics for Apache Flink connector (GitHub)
    • Telegraf connector (GitHub)

Data ingestion: Pricing

  • $0.50 per 1 million writes of 1KB (in the us-east-1, us-east-2, and us-west-2 Regions)

Scenario

  • Send 100 different measurements
  • Assume that measurements are sent every 5 seconds
  • Assume on average each record is 110 bytes

Example: Data ingestion using Python (1)
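
A minimal sketch of the naive approach this slide illustrates, assuming hypothetical database, table, and dimension names (the exact code on the slide may differ): every measurement is sent in its own write_records request.

```python
import time
import boto3

write_client = boto3.client("timestream-write", region_name="us-east-1")

dimensions = [
    {"Name": "region", "Value": "us-east-1"},
    {"Name": "host", "Value": "host-1"},
]

# Hypothetical stand-in for the 100 measurements collected every 5 seconds.
measurements = {"cpu_utilization": 35.1, "memory_utilization": 52.7}  # ... up to 100 entries

now_ms = str(int(time.time() * 1000))

# Naive approach: one write_records request per measurement.
for name, value in measurements.items():
    write_client.write_records(
        DatabaseName="devops",
        TableName="host_metrics",
        Records=[{
            "Dimensions": dimensions,
            "MeasureName": name,
            "MeasureValue": str(value),
            "MeasureValueType": "DOUBLE",
            "Time": now_ms,
            "TimeUnit": "MILLISECONDS",
        }],
    )
```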

Data ingestion: Calculating pricing (1)
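
Working through the scenario above (assuming a 30-day month; the slide's exact figures may differ slightly): each request carries a single ~110-byte record but is billed at the 1KB minimum, so 100 measurements every 5 seconds is 100 × (86,400 / 5) = 1,728,000 writes per day, or about 51.8 million 1KB writes per month. At $0.50 per million writes, that is roughly $25.92/month, in line with the ~$25 figure mentioned in the abstract.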

Example: Data ingestion using Python (2)

Put all measurements into one list (records).
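
A sketch of the batched version, reusing the client, dimensions, measurements, and timestamp from the previous sketch: all records go into a single write_records request, which accepts up to 100 records.

```python
# Build one record per measurement, then send them all in a single request.
records = [
    {
        "Dimensions": dimensions,
        "MeasureName": name,
        "MeasureValue": str(value),
        "MeasureValueType": "DOUBLE",
        "Time": now_ms,
        "TimeUnit": "MILLISECONDS",
    }
    for name, value in measurements.items()
]

write_client.write_records(
    DatabaseName="devops",
    TableName="host_metrics",
    Records=records,
)
```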

Data ingestion: Calculating pricing (2)
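
With batching, each request carries 100 × ~110 bytes ≈ 11KB of records, billed as 11 write units of 1KB. That is 11 × 17,280 = 190,080 writes per day, about 5.7 million per month, or roughly $2.85/month (again assuming a 30-day month; the slide's exact figure may differ).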

Example: Data ingestion using Python (3)

Move the attributes that are common to all records into common attributes.
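
A sketch of the final version: the dimensions, timestamp, and measure value type that are identical across the batch move into CommonAttributes, so they are transmitted (and billed) once per request instead of once per record.

```python
# Attributes shared by every record in the batch are sent only once.
common_attributes = {
    "Dimensions": dimensions,
    "MeasureValueType": "DOUBLE",
    "Time": now_ms,
    "TimeUnit": "MILLISECONDS",
}

# Each record now carries only what actually varies: the measure name and value.
records = [
    {"MeasureName": name, "MeasureValue": str(value)}
    for name, value in measurements.items()
]

write_client.write_records(
    DatabaseName="devops",
    TableName="host_metrics",
    CommonAttributes=common_attributes,
    Records=records,
)
```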

Data ingestion: Calculating pricing (3)
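
With common attributes, the repeated dimension and timestamp bytes drop out of every record. If the per-record payload shrinks to roughly 30 bytes (an assumption to make the arithmetic concrete), each request is about 3KB, billed as 3 write units: 3 × 17,280 × 30 ≈ 1.56 million writes per month, or roughly $0.78/month, which lines up with the figure quoted in the abstract.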

Data ingestion: Pricing recap
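
Putting the three calculations together (approximate figures, assuming a 30-day month):

  • One record per request: ~51.8M 1KB writes/month ≈ $25.92
  • Batching 100 records per request: ~5.7M writes/month ≈ $2.85
  • Batching + common attributes: ~1.56M writes/month ≈ $0.78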

Storage: Memory and magnetic stores

  • In-memory tier
    • Handles the ingestion of all data
    • The timestamp associated with a record must fall within the in-memory tier's retention window
    • Automatically handles data deduplication
    • Optimized for latency sensitive point-in-time queries
    • $0.036/GB/hour (pricing in us-east-1, us-east-2, us-west-2 regions)
  • Magnetic disk tier
    • Optimized for high performance analytical queries
    • Cost effective for long-term storage
    • $0.03/GB/month (pricing in us-east-1, us-east-2, us-west-2 regions)

Best practices: Data storage and ingestion

  • Use record batching
    • A single write_records request can write a batch of up to 100 records
    • Each write request has a minimum charge of 1KB
  • Use common attributes
    • This removes the need to redundantly send dimensional data for each record
  • Make measure and dimension names only as long as necessary
    • There are ingestion and storage costs for user-defined dimension and measure names
  • Configure the in-memory tier retention only as long as necessary to accommodate late-arriving data
    • The in-memory tier is optimized for queries that access narrow windows of time
  • The magnetic tier is better optimized for analytics queries
    • The magnetic tier is cost-optimized to store data indefinitely

Query processing

Query processing: SQL and connectivity

  • (Mostly) ANSI-2003 SQL for querying
    • Time-series, interpolation, and gap-filling functions
    • 250+ scalar, aggregate, and windowing functions
    • Pricing is $0.01/GB of data scanned (pricing in us-east-1, us-east-2, us-west-2 regions)
  • Data is queried using the AWS SDK or AWS CLI
    • Java, Python, Node.js, .NET, etc.
    • JDBC Driver
    • Amazon QuickSight support
    • Grafana (Open Source Edition)

Best practices: Query processing

  • Queries should have a predicate on the measure_name
  • Queries should have a predicate on time
  • Most queries should have a predicate on one or more dimensions
  • Predicates on time, measure_name, and dimensions can reduce data scan charges by leveraging range-restricted scans
  • Queries using a GROUP BY clause will perform faster if the first grouping dimension has a high cardinality
  • Only select the dimensions that are necessary; reading unnecessary columns can impact both performance and cost
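
A minimal sketch of a query that follows these practices, using the hypothetical table, dimension, and measure names from the ingestion sketches: it puts predicates on time, measure_name, and a dimension, and selects only the columns it needs.

```python
import boto3

query_client = boto3.client("timestream-query", region_name="us-east-1")

# Predicates on measure_name, time, and the host dimension restrict the scan;
# only the needed columns are selected.
query = """
SELECT host,
       bin(time, 1m) AS binned_time,
       avg(measure_value::double) AS avg_cpu
FROM "devops"."host_metrics"
WHERE measure_name = 'cpu_utilization'
  AND time > ago(15m)
  AND host = 'host-1'
GROUP BY host, bin(time, 1m)
ORDER BY binned_time
"""

response = query_client.query(QueryString=query)
for row in response["Rows"]:
    print(row["Data"])
```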

Additional resources
