(Figure: the beauty of an aurora. Image source.)
Yesterday I saw a tweet mentioning that Pokémon had migrated its database to AWS Aurora PostgreSQL and DynamoDB. Looking closer, it turned out to be a talk from AWS re:Invent 2019 (GAM304). While I was there I skipped almost all of the game-related sessions, so I'm catching up on this one now.
Some passages in the notes below are observations and additions I made while (or after) watching. The point of these notes is to improve my own learning efficiency and to keep up the habit of sharing.
Outline
AWS re:Invent 2019: Migrating the live Pokémon database to Aurora PostgreSQL (GAM304)
- 影片出處:https://youtu.be/2eEKuK5eOC4
- Speakers
- Chris Finch, Sr. SA Game Tech Evangelist, AWS
- Jeff Webb, Development Manager, The Pokémon Company International
- David Williams, Sr. DevOps Engineer, The Pokémon Company International
- Agenda
- Introduction
- A brief history
- The challenge
- The solution
- The results (if you just want the summary and want to save time, skip straight to this section)
Introduction
- The Pokémon Company International (TPCi)
- Subsidiary of The Pokémon Company (TPC)
- Manages Pokémon property outside of Asia
- Scopes
- Brand management
- Localization
- Trading card game
- Marketing
- Licensing
- PokemonCenter.com
- Engineering
A brief history
- Before Pokémon GO (pre-2016)
- All the consoles are managed by TPC.
- Pokemon.com was the focus of the small tech team and included:
- Pokemon.com
- Marketing and user engagement
- Pokémon TV
- Organized Play
- Trading card game league/tournament management
- Pokémon Trainer Club
- User registration
- User profile management
- Authentication
- Millions of accounts
- Used by Pokemon.com and a few smaller digital products
- Pokémon Trainer Club Service (PTCS)
- Purpose: User registration and login system
- COPPA
- GDPR
- Size: Into the hundreds of millions
- Usage: Millions of logins a day
- Pokemon.com
- Preparing for Pokémon GO
- Lift & shift from co-lo to AWS in Spring 2016.
- Split PTC data out to NoSQL DB in preparation for GO.
- Pokémon GO launched (July 2016)
- And everything changed
- 10x growth of PTCS users in 6 months
- 100% increase in PTCS users by end of 2017
- Additional 50% increase in users by end of 2018
- Service and DB performance was good
The challenge
- Service stability issues: 2017/2018
- Service and DB availability was not good
- Downtime: 137 hours (down or degraded) in a six-month period (roughly 3.17% of total time)
- Licensing and infrastructure costs increasing
- 300 nodes required to support
- Engineering time
- Full-time support from 1-2 resources
- Business drivers for change
- Instability of the DB platform was impacting customer and visitor experience.
- Future project success required achieving new goals
- Stabilize our services & infrastructure for reduced downtime & customer impact
- Reduce operational overhead of managing the DB platform
- Reduce costs associated with our DB platform
- Infrastructure drivers for change
- Oversized EC2 instances
- Duplicated infrastructure
- Maintain multiple redundant copies of data and indexes
- Operational overhead
- Routine activities no longer routine
- Unexpected behaviors: Amazon EC2 restarts, capacity, etc.
- DB backups
- Patching/upgrading became unmanageable
- Data tier architecture
- Hundreds of DB instances across several roles
- All deployed with Amazon EC2 Auto Scaling groups
- One datastore for all data
The solution
- Design goals
- Leverage managed services
- Optimize resource utilization for our core business
- Use appropriate datastore for data
- Event data
- User data
- Configuration/TTL data
- High availability, stability, performance
- Reduce cost & right-size infrastructure
- Only want to use and pay for what we need
- Choosing Amazon Aurora PostgreSQL
- Amazon DynamoDB
- Pros
- Tier 1 AWS service
- Multi-region
- Highly scalable
- Easier lift from JSON?
- Cons
- Encryption (at the time)
- Internal expertise
- Aurora MySQL
- Pros
- Internal expertise
- Encryption
- Feature rich
- Cons
- Nascent tech
- Multi-region
- JSON to relational ETL
- Aurora PostgreSQL
- Pros
- Internal expertise
- Encryption
- DB internals
- Native support for JSON
- Cons
- Feature lag
- Multi-region
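One of the Aurora PostgreSQL pros above was native JSON support, which matters when user profiles already live as JSON documents in a NoSQL store. A minimal sketch of the idea, with purely hypothetical field names (not the actual PTCS schema): frequently queried fields become indexed relational columns, while the schema-less remainder stays as a JSON payload (what PostgreSQL would hold in a `jsonb` column).

```python
import json

# Hypothetical NoSQL user document; field names are illustrative only.
doc = {
    "user_id": "u-123",
    "email": "ash@example.com",
    "activated": True,
    "preferences": {"locale": "en-US", "newsletter": False},
}

def to_row(doc):
    """Split a JSON document into indexed relational columns plus a
    JSON payload column (what PostgreSQL would store as jsonb)."""
    return {
        "user_id": doc["user_id"],
        "email": doc["email"],
        "activated": doc["activated"],
        # Nested, schema-less parts stay as JSON; PostgreSQL can still
        # query inside a jsonb column with its JSON operators.
        "profile": json.dumps(doc["preferences"], sort_keys=True),
    }

row = to_row(doc)
```

This is why the ETL later in the talk could be "pretty easy": the lift keeps the document shape largely intact instead of forcing a full JSON-to-relational decomposition (the con listed against Aurora MySQL).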
- Data-driven approach
- Acceptance criteria
- Authentication: 2k/sec
- User signups: 60+/sec
- Admin bulk queries
- Testing
- User generation: 200m
- Performance test suite: Burst, soak testing
- Iterate
- Rework schema
- Rerun tests
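The iterate loop above (rework schema, rerun tests) implies each burst or soak run is checked against the same acceptance numbers. A small sketch of that gate, using the thresholds from the slide (2k authentications/sec, 60+ signups/sec); the harness structure is an assumption, not TPCi's actual test suite:

```python
# Acceptance thresholds from the talk. Keys are illustrative names.
CRITERIA = {"auth_per_sec": 2000, "signup_per_sec": 60}

def failed_criteria(observed):
    """Return the criteria a test run failed; empty dict means it passed."""
    return {k: v for k, v in observed.items()
            if k in CRITERIA and v < CRITERIA[k]}

# Burst test: short window at high rate. Soak test: sustained rate
# over a long window. The same gate applies to each run's results.
burst_run = {"auth_per_sec": 2450, "signup_per_sec": 75}
soak_run  = {"auth_per_sec": 1800, "signup_per_sec": 90}

burst_failures = failed_criteria(burst_run)   # passes
soak_failures = failed_criteria(soak_run)     # auth below threshold
```

A failing run (like `soak_run` here) would trigger the "rework schema, rerun tests" iteration the slide describes.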
- The migration plan
- An iterative approach
- Improve cache stability
- Migrate out TTL and configuration data
- Stream event data
- Each phase should independently deliver value
- Migration phases
- It looks like everything was originally self-hosted on EC2; the migration phases below are marked (1) (2) (3)…
- Application tier –> (2) move configuration data out to DynamoDB: auth config, TTL tables –> (3) Amazon Kinesis Data Streams feeds event data into analytics storage (S3); S3 is cheap compared with maintaining EC2 machines.
- PTC instances
- Auth instances
- Batch/Async instances
- Data tier –> (4) consolidate the Data/Query/Index nodes into profile data stored in Aurora PostgreSQL
- Data nodes
- Query nodes
- Index nodes
- Cache nodes –> (1) switch to ElastiCache Memcached
- Planning for Aurora PostgreSQL
- AWS Professional Services to bridge the knowledge gap
- Validate our schema design and provide feedback
- Advice on how to tune DB parameter groups
- Tools for planning, monitoring, and tuning
- Aurora PostgreSQL cluster design
- PTC instances –> Cluster
- Auth instances –> Login
- Batch/Async instances –> Admin/bulk
- The migration: Extract-Transform-Load (ETL)
- Split into a NoSQL live cluster, a NoSQL backup cluster, and a NoSQL extraction cluster.
- Transform & load
- “Pretty easy”
- Abandon users that had never activated
- Minor data changes, nothing structural
- Extract
- Leverage NoSQL cluster architecture
- Map-reduce to find users not extracted
- Extraction process marks users as such in the backup cluster
- Any user changes in production would overwrite changes in backup cluster
- Test
- 11m user multi-cluster test setup
- Dozens of test runs
- Test cases - inactive users, user updates
- ~2% of documents were not overwritten
- And iterate
- User profile change documents
- Third cluster
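The extract step above is essentially an idempotent sweep: find users not yet marked as extracted, hand them to the ETL, mark them in the backup cluster, and let any production change overwrite the mark so the user gets re-extracted on the next pass. A sketch of that loop, with in-memory dicts standing in for the backup cluster and an illustrative `extracted` flag:

```python
# In-memory stand-in for the NoSQL backup cluster; the "extracted"
# field name is an assumption for illustration.

def extract_pass(backup_cluster):
    """Collect users not yet marked as extracted and mark them.
    Rerunning the pass only picks up users whose documents were
    overwritten (changed in production) since the last pass."""
    batch = []
    for user_id, doc in backup_cluster.items():
        if not doc.get("extracted"):
            batch.append(doc)
            doc["extracted"] = True  # mark in the backup cluster
    return batch

backup = {
    "u1": {"user_id": "u1", "email": "a@example.com"},
    "u2": {"user_id": "u2", "email": "b@example.com"},
}
first_batch = extract_pass(backup)    # picks up u1 and u2

# A production change replicates into the backup cluster, replacing
# u2's document and thereby clearing the extraction mark:
backup["u2"] = {"user_id": "u2", "email": "b+new@example.com"}
second_batch = extract_pass(backup)   # picks up only the changed u2
```

This overwrite-clears-the-mark behavior is also where the "~2% of documents were not overwritten" test finding above came from, which the third cluster for user profile change documents addressed.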
- Migration Day
- First, stop PTCS profile maintenance activities.
- Auth cannot stop: auth (login) traffic keeps hitting NoSQL, while extraction keeps syncing it into Aurora PostgreSQL.
- Once Aurora PostgreSQL is fully in sync, switch auth from NoSQL over to Aurora PostgreSQL.
- Testers come in to test PTCS, now pointed at Aurora PostgreSQL.
- Finally, shut down NoSQL.
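The cutover in those steps amounts to a single switch of the auth write path once Aurora has caught up. A minimal sketch of that routing, with toy stand-in store classes (not real database clients):

```python
class Store:
    """Stand-in for a datastore; records writes for inspection."""
    def __init__(self, name):
        self.name = name
        self.writes = []
    def write(self, record):
        self.writes.append(record)

class AuthRouter:
    """Routes auth traffic; flipping `target` is the cutover."""
    def __init__(self, nosql, aurora):
        self.nosql, self.aurora = nosql, aurora
        self.target = nosql        # before cutover: NoSQL stays live
    def cutover(self):
        self.target = self.aurora  # after Aurora is confirmed in sync
    def login(self, user_id):
        self.target.write(("login", user_id))

nosql, aurora = Store("nosql"), Store("aurora")
router = AuthRouter(nosql, aurora)
router.login("u1")   # lands in NoSQL; the extractor syncs it over
router.cutover()     # Aurora confirmed fully in sync
router.login("u2")   # lands in Aurora
```

Because logins never stop, the switch has to be this cheap; that is what made "no authentication downtime" in the next section achievable.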
- How did it go?
- The good
- No authentication downtime
- Plan worked as expected
- The bad
- Patches
- Some underperforming queries
- The ugly
- Nothing
- 95% of users experienced no impact
- Performance was good and consistent
The results
- Design goals revisited
- Leverage managed services (checked)
- Use appropriate datastore for data (checked)
- High availability, stability, performance (checked)
- Reduce cost & right-size infrastructure (checked)
- Overall value
- Technology
- Old platform: 3rd party NoSQL
- New platform: Aurora, DynamoDB, S3
- Benefits: Independent scaling / managed
- Infra/Licensing
- Old: ~300 / Costly
- New: 10–20 / ~$0
- Benefits: $3.5–4.5 million/year savings
- Dedicated resources
- Old: 1.5 dev/engineer
- New: None
- Benefits: 1.5 dev/engineer savings (they're still around, just reassigned to other tasks :p)
- Stability
- Old: 137 hours (6 months)
- New: 0
- Benefits: Customer experience, priceless (you could actually compute a dollar figure for the business value here, but "priceless" gets more applause (kidding XDD))
- Project retrospective
- What went well
- An agile approach to the project & problems we encountered
- Leveraging the experts: AWS DB Professional Services (Critical point! Ask for help!)
- Segmenting the data and how it was handled
- Key learnings
- Deeper understanding of our data and services
- Prior solution was more expensive (money and people) than we realized
- Tech org moving forward
- Next phase
- Monitor & optimize the platform and our services
- Understand upcoming needs & changes to our data
- Scope our advanced analytics requirements
- New architectural tenets
- Define and use common design patterns and components
- Simplify toolset, preferring managed services
- Documentation must exist, be well-defined
- Data standards & practices
- Use data to make decisions
- The whole room cheered at the end; this session was a blast! XDD