Briefing
The Amazon Kinesis product line is divided into four main services:
- Kinesis Video Streams (KVS),
- Kinesis Data Streams (KDS),
- Kinesis Data Firehose (KDF), and
- Kinesis Data Analytics (KDA) 👉 Amazon Managed Service for Apache Flink.
One-sentence descriptions of each service, quoted from the official documentation:
- Amazon Kinesis Video Streams
- Capture, process, and store video streams for analytics and machine learning.
- Amazon Kinesis Data Streams
- Build custom applications that analyze data streams using popular stream-processing frameworks.
- Amazon Kinesis Data Firehose
- Load data streams into AWS data stores.
- Amazon Kinesis Data Analytics 👉 Amazon Managed Service for Apache Flink
- Process and analyze streaming data using SQL or Java.
History
From past to present, tracing the threads of its development.
Terms
Below is a list of the terms that appear throughout, with each term's full name, definition, and documentation source noted.
Amazon Kinesis Data Streams (KDS)
(Amazon Kinesis Data Streams High-Level Architecture)
Kinesis Data Stream
- A Kinesis data stream is a set of shards.
- Each shard has a sequence of data records.
- Each data record has a sequence number that is assigned by Kinesis Data Streams.
Data Record
- A data record is the unit of data stored in a Kinesis data stream.
- Data records are composed of a sequence number, a partition key, and a data blob, which is an immutable sequence of bytes.
- A data blob can be up to 1 MB.
Retention Period
- The retention period is the length of time that data records are accessible after they are added to the stream.
- Default: 24 hours
- Minimum: 24 hours (lower it with the DecreaseStreamRetentionPeriod operation)
- Maximum: 8760 hours (365 days) (raise it with the IncreaseStreamRetentionPeriod operation)
- Additional charges apply for streams with a retention period set to more than 24 hours.
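The retention operations above can also be called programmatically. A minimal sketch with boto3 (the stream name "Foo" and the 48-hour value are illustrative; a real call requires AWS credentials):

```python
def extend_retention(client, stream_name: str, hours: int) -> None:
    """Raise the stream's retention period; KDS accepts 24-8760 hours."""
    if not 24 <= hours <= 8760:
        raise ValueError("retention must be between 24 and 8760 hours")
    client.increase_stream_retention_period(
        StreamName=stream_name, RetentionPeriodHours=hours
    )

# Usage against real AWS (requires boto3 and configured credentials):
#   import boto3
#   extend_retention(boto3.client("kinesis"), "Foo", 48)
```

The client is passed in rather than created inside the function, which keeps the sketch testable without touching AWS.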
Producer
- Producers put data records into Amazon Kinesis Data Streams.
Consumer
- Consumers get data records from Amazon Kinesis Data Streams.
- Consumers == Amazon Kinesis Data Streams Application (KDS App).
Amazon Kinesis Data Streams Application (KDS App)
- Two types:
- Shared fan-out consumers
- Enhanced fan-out consumers
- Using:
- AWS Lambda
- Kinesis Data Analytics 👉 Amazon Managed Service for Apache Flink
- Kinesis Data Firehose
- Kinesis Client Library
- The output of a KDS app can be input for another stream.
- Docs: Reading Data from Amazon Kinesis Data Streams - Amazon Kinesis Data Streams
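Of the consumer options above, AWS Lambda is the simplest to sketch. A minimal handler following the documented Kinesis event structure (the payload processing is illustrative):

```python
import base64

def handler(event, context):
    """Decode each Kinesis record's base64 payload and return the strings."""
    decoded = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(payload.decode("utf-8"))
    return decoded
```

Lambda polls the shard and invokes the handler with a batch of records; the `kinesis.data` field arrives base64-encoded, just like the CLI output shown later.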
Shard
- A shard is a uniquely identified sequence of data records in a stream.
- A stream is composed of one or more shards, each of which provides a fixed unit of capacity.
- Each shard can support
- up to 5 transactions per second for reads,
- up to a maximum total data read rate of 2 MB per second,
- up to 1,000 records per second for writes,
- up to a maximum total data write rate of 1 MB per second (including partition keys).
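The per-shard limits above determine how many shards a stream needs. A back-of-the-envelope sketch (the 2 MB/s read limit applies per shard for shared fan-out consumers):

```python
import math

def min_shards(write_mb_per_s: float, write_records_per_s: float,
               read_mb_per_s: float) -> int:
    """Smallest shard count satisfying all three per-shard limits."""
    return max(
        math.ceil(write_mb_per_s / 1.0),        # 1 MB/s writes per shard
        math.ceil(write_records_per_s / 1000.0),  # 1,000 records/s per shard
        math.ceil(read_mb_per_s / 2.0),          # 2 MB/s reads per shard
        1,
    )
```

For example, 5 MB/s of writes at 3,000 records/s with 8 MB/s of reads needs `max(5, 3, 4) = 5` shards; the write bandwidth is the binding limit.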
Partition Key
- A partition key is used to group data by shard within a stream.
- Partition keys are Unicode strings, with a maximum length limit of 256 characters for each key.
- An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards.
- When an application puts data into a stream, it must specify a partition key.
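The MD5 mapping described above can be reproduced locally. A sketch with two hypothetical shards splitting the 128-bit hash-key space in half (real shard ranges come from DescribeStream):

```python
import hashlib

def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5, as KDS does."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

def shard_for_key(partition_key: str, shard_ranges) -> int:
    """Return the index of the shard whose hash-key range covers the key."""
    h = hash_key(partition_key)
    for i, (start, end) in enumerate(shard_ranges):
        if start <= h <= end:
            return i
    raise ValueError("no shard covers this hash key")

# Two illustrative shards covering the full 128-bit range:
HALF = 2 ** 127
RANGES = [(0, HALF - 1), (HALF, 2 ** 128 - 1)]
```

Because the mapping is deterministic, records sharing a partition key always land on the same shard, which is what preserves their ordering.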
Sequence Number
- Each data record has a sequence number that is unique per partition-key within its shard.
- Sequence number is assigned by KDS.
- Sequence numbers cannot be used as indexes to sets of data within the same stream. To logically separate sets of data, use partition keys or create a separate stream for each dataset.
Amazon Kinesis Data Firehose (KDF)
(Amazon Kinesis Data Firehose High-Level Architecture)
Kinesis Data Firehose Delivery Stream
- The underlying entity of Kinesis Data Firehose.
- You use Kinesis Data Firehose by creating a Kinesis Data Firehose delivery stream and then sending data to it.
Data Record
- The data of interest that your data producer sends to a Kinesis Data Firehose delivery stream.
- A record can be as large as 1,000 KB.
Producer
- Producers send records to Kinesis Data Firehose delivery streams.
Buffer Size and Buffer Interval
- Kinesis Data Firehose buffers incoming streaming data to a certain size or for a certain period of time before delivering it to destinations.
- Buffer Size is in MBs and
- Buffer Interval is in seconds.
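The flush-on-size-or-interval behavior can be modeled locally. A minimal sketch (thresholds and the sink callback are illustrative, not Firehose's API):

```python
import time

class Buffer:
    """Batch records; flush when either the size or time threshold is hit."""

    def __init__(self, max_bytes: int, max_seconds: float, sink):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.sink = sink          # called with the batched records on flush
        self.records = []
        self.size = 0
        self.opened = time.monotonic()

    def put(self, record: bytes) -> None:
        self.records.append(record)
        self.size += len(record)
        if (self.size >= self.max_bytes
                or time.monotonic() - self.opened >= self.max_seconds):
            self.flush()

    def flush(self) -> None:
        if self.records:
            self.sink(self.records)
        self.records, self.size = [], 0
        self.opened = time.monotonic()
```

Whichever condition is satisfied first triggers delivery, which is why Firehose is "near real time": even a trickle of data is delivered once the buffer interval elapses.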
Amazon Kinesis Data Analytics (KDA) 👉 Amazon Managed Service for Apache Flink
Use Cases
Common scenarios:
Amazon Kinesis Data Streams (KDS)
- Accelerated log and data feed intake and processing
- Real-time metrics and reporting
- Real-time data analytics
- Complex stream processing (using Directed Acyclic Graphs (DAGs))
Amazon Kinesis Data Analytics (KDA) 👉 Amazon Managed Service for Apache Flink
Amazon Kinesis Data Analytics lets you quickly author SQL code to continuously read, process, and store data in near real time.
- Generate time-series analytics
- Feed real-time dashboards
- Create real-time metrics
Frequently Used Information
KDS Getting Started
Docs: Perform Basic Kinesis Data Stream Operations Using the AWS CLI - Amazon Kinesis Data Streams
- Step 1: Create a Stream
- Create:
aws kinesis create-stream --stream-name Foo --shard-count 1
- Check:
aws kinesis describe-stream-summary --stream-name Foo
- CREATING -> ACTIVE
- Double check:
aws kinesis list-streams
- Step 2: Put a Record
aws kinesis put-record --stream-name Foo --partition-key 123 --data testdata
- Step 3: Get the Record
- Step 3.1: GetShardIterator
- Before you can get data from the stream you need to obtain the shard iterator for the shard you are interested in.
aws kinesis get-shard-iterator --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON --stream-name Foo
- Step 3.2: GetRecords
aws kinesis get-records --shard-iterator AAAAAAAAAAHSywljv0zEgPX4NyKdZ5wryMzP9yALs8NeKbUjp1IxtZs1Sp+KEd9I6AJ9ZG4lNR1EMi+9Md/nHvtLyxpfhEzYvkTZ4D9DQVz/mBYWRO6OTZRKnW9gd+efGN2aHFdkH1rJl4BL9Wyrk+ghYG22D2T1Da2EyNSH1+LAbK33gQweTJADBdyMwlo5r6PqcP2dzhg=
- Data needs Base64 decoding.
- UNIX-like OS:
SHARD_ITERATOR=$(aws kinesis get-shard-iterator --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON --stream-name Foo --query 'ShardIterator')
aws kinesis get-records --shard-iterator $SHARD_ITERATOR
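The `Data` field in the get-records response still needs decoding. A sketch in Python, fed a JSON snippet mimicking the CLI output for the "testdata" record put in Step 2 (iterator fields omitted for brevity):

```python
import base64
import json

# Shape mimics `aws kinesis get-records` output:
raw = '{"Records": [{"PartitionKey": "123", "Data": "dGVzdGRhdGE="}]}'
response = json.loads(raw)
for record in response["Records"]:
    print(base64.b64decode(record["Data"]).decode("utf-8"))  # prints: testdata
```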
- Step 4: Clean Up
aws kinesis delete-stream --stream-name Foo
aws kinesis describe-stream-summary --stream-name Foo
Deep Dive
The general direction is clear, but watch out for pitfalls along the way.
Concurrency - Producer/Consumer Pattern
- Lecture: Concurrency—Producer/Consumer Pattern and Thread Pools
- The Producer Consumer Pattern described in Java Concurrency and Multithreading Tutorial by Jakob Jenkov.
- Producer–consumer problem - Wikipedia
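The pattern those links describe can be sketched in a few lines with Python's thread-safe queue (a minimal illustration, not Kinesis-specific code):

```python
import queue
import threading

def producer(q: queue.Queue, items) -> None:
    """Put items on the queue, then a None sentinel to signal completion."""
    for item in items:
        q.put(item)
    q.put(None)

def consumer(q: queue.Queue, out: list) -> None:
    """Drain the queue until the sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        out.append(item)

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q, [1, 2, 3]))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [1, 2, 3]
```

Kinesis applies the same decoupling at service scale: producers write to shards at their own rate, and consumers read independently via shard iterators.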
Amazon Kinesis Data Streams (KDS) vs. Amazon SQS
Comparison table.
Product | Amazon Kinesis Data Stream | Amazon SQS |
---|---|---|
Essential | Streaming service | Messaging service |
Use Cases | - Log and event data collection - Real-time analytics - Mobile app event data feed - IoT data feed | - Application integration - Decouple microservices - Scale jobs to multiple workers |
Scaling | Manually increase/decrease shards | AWS fully-managed |
Durability | Records are deleted when the retention period expires. | Messages are deleted by the consumer, or when the retention period expires. |
Default Retention Period | 24 hours | 4 days |
Min Retention Period | 24 hours | 60 seconds |
Max Retention Period | 8760 hours (365 days) | 1,209,600 seconds (14 days) |
Message Order | Ordered within the shard | Standard: best-effort ordering; FIFO: ordered within the message group |
Message Size | 1 MB | 256 KB |
Message Replay | Yes | No |
Message Consume | Consumers can read the same data records | One message can be consumed by only one consumer at a time |
Amazon Kinesis Data Streams (KDS) vs. Amazon Kinesis Data Firehose (KDF)
Comparison table.
Product | Amazon Kinesis Data Stream | Amazon Kinesis Data Firehose |
---|---|---|
Essential | Streaming service | Delivering streaming data to destinations |
Scaling | Manually increase/decrease shards | AWS fully-managed |
How real time? | Real time (200ms for classic, 70ms for enhanced fan-out) | Near real time (lowest buffer time is 60 seconds) |
Message Storage | 24~8760 hours | No data storage |
Message Replay | Yes | No |
Message Producer | Using KPL, SDK (API), Kinesis Agent (on Amazon Linux or RHEL) | Kinesis Data Streams, Kinesis Agent, SDK, CloudWatch Logs, CloudWatch Events, AWS IoT |
Message Consumer | Using KCL, SDK (API), AWS Lambda, Kinesis Data Analytics, Kinesis Data Firehose. Can have multiple consumers. | Managed by Kinesis Data Firehose (the destination is fixed when you create the Kinesis Data Firehose Delivery Stream). |
Message Transformation | (Handled by your consumers.) | Serverless data transformations with AWS Lambda |
Message Destination | Multiple destinations. (Handled by your consumers.) | Amazon S3, Amazon Redshift, Amazon ES, Splunk, and any custom HTTP endpoint, or HTTP endpoints owned by supported third-party service providers. |
Reference
Examples
- Tutorial: Using AWS Lambda with Amazon Kinesis Data Streams (Official docs)
- AWS Streaming Data Solution for Amazon Kinesis (Official reference design)
- Combining content moderation services with graph databases & analytics to reduce community toxicity, 2023-09-14, by Andrew Thomas