(Figure: the beauty of an aurora. Image source.)
Yesterday I saw a tweet mentioning that Pokémon had migrated its database to AWS Aurora PostgreSQL and DynamoDB. Looking closer, it turned out to be a talk from AWS re:Invent 2019 (GAM304). While I was there I skipped almost all of the game-related sessions, so I'm catching up on this one now.
Some passages in the notes below are observations and additions I made while (or after) watching. The point of these notes is to improve my own learning efficiency and to keep up the habit of sharing.
Outline
AWS re:Invent 2019: Migrating the live Pokémon database to Aurora PostgreSQL (GAM304)
- 影片出處:https://youtu.be/2eEKuK5eOC4
- Speakers
- Chris Finch, Sr. SA Game Tech Evangelist, AWS
- Jeff Webb, Development Manager, The Pokémon Company International
- David Williams, Sr. DevOps Engineer, The Pokémon Company International
- Agenda
- Introduction
- A brief history
- The challenge
- The solution
- The results (if you just want the summary and want to save time, skip straight to this section)
Introduction
- The Pokémon Company International (TPCi)
- Subsidiary of The Pokémon Company (TPC)
- Manages Pokémon property outside of Asia
- Scopes
- Brand management
- Localization
- Trading card game
- Marketing
- Licensing
- PokemonCenter.com
- Engineering
A brief history
- Before Pokémon GO (pre-2016)
- All the consoles are managed by TPC.
- Pokemon.com was the focus of the small tech team and included:
- Pokemon.com
- Marketing and user engagement
- Pokémon TV
- Organized Play
- Trading card game league/tournament management
- Pokémon Trainer Club
- User registration
- User profile management
- Authentication
- Millions of accounts
- Used by Pokemon.com and a few smaller digital products
- Pokémon Trainer Club Service (PTCS)
- Purpose: User registration and login system
- COPPA
- GDPR
- Size: Into the hundreds of millions
- Usage: Millions of logins a day
- Pokemon.com
- Preparing for Pokémon GO
- Lift & shift from co-lo to AWS in Spring 2016.
- Split PTC data out to NoSQL DB in preparation for GO.
- Pokémon GO launched (July 2016)
- And everything changed
- 10x growth of PTCS users in 6 months
- 100% increase in PTCS users by end of 2017
- Additional 50% increase in users by end of 2018
- Service and DB performance was good
The challenge
- Service stability issues: 2017/2018
- Service and DB availability was not good
- Downtime: 137 hours (down or degraded) in a six-month period (roughly 3.17% of total time)
- Licensing and infrastructure costs increasing
- 300 nodes required to support
- Engineering time
- Full-time support from 1-2 resources
- Business drivers for change
- Instability of the DB platform was impacting customer and visitor experience.
- Future project success required achieving new goals
- Stabilize our services & infrastructure for reduced downtime & customer impact
- Reduce operational overhead of managing the DB platform
- Reduce costs associated with our DB platform
- Infrastructure drivers for change
- Oversized EC2 instances
- Duplicated infrastructure
- Maintain multiple redundant copies of data and indexes
- Operational overhead
- Routine activities no longer routine
- Unexpected behaviors: Amazon EC2 restarts, capacity, etc.
- DB backups
- Patching/upgrading became unmanageable
- Data tier architecture
- Hundreds of DB instances across several roles
- All deployed with Amazon EC2 Auto Scaling groups
- One datastore for all data
The solution
- Design goals
- Leverage managed services
- Optimize resource utilization for our core business
- Use appropriate datastore for data
- Event data
- User data
- Configuration/TTL data
- High availability, stability, performance
- Reduce cost & right-size infrastructure
- Only want to use and pay for what we need
- Choosing Amazon Aurora PostgreSQL
- Amazon DynamoDB
- Pros
- Tier 1 AWS service
- Multi-region
- Highly scalable
- Easier lift from JSON?
- Cons
- Encryption (at the time)
- Internal expertise
- Aurora MySQL
- Pros
- Internal expertise
- Encryption
- Feature rich
- Cons
- Nascent tech
- Multi-region
- JSON to relational ETL
- Aurora PostgreSQL
- Pros
- Internal expertise
- Encryption
- DB internals
- Native support for JSON
- Cons
- Feature lag
- Multi-region
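One of the Aurora PostgreSQL pros above was native JSON support, which matters when user profiles already live as JSON documents in a NoSQL store. A minimal sketch of the idea, with purely hypothetical field names (not the actual PTCS schema): frequently queried fields become indexed relational columns, while the schema-less remainder stays as a JSON payload (what PostgreSQL would hold in a `jsonb` column).

```python
import json

# Hypothetical NoSQL user document; field names are illustrative only.
doc = {
    "user_id": "u-123",
    "email": "ash@example.com",
    "activated": True,
    "preferences": {"locale": "en-US", "newsletter": False},
}

def to_row(doc):
    """Split a JSON document into indexed relational columns plus a
    JSON payload column (what PostgreSQL would store as jsonb)."""
    return {
        "user_id": doc["user_id"],
        "email": doc["email"],
        "activated": doc["activated"],
        # Nested, schema-less parts stay as JSON; PostgreSQL can still
        # query inside a jsonb column with its JSON operators.
        "profile": json.dumps(doc["preferences"], sort_keys=True),
    }

row = to_row(doc)
```

This is why the ETL later in the talk could be "pretty easy": the lift keeps the document shape largely intact instead of forcing a full JSON-to-relational decomposition (the con listed against Aurora MySQL).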
- Data-driven approach
- Acceptance criteria
- Authentication: 2k/sec
- User signups: 60+/sec
- Admin bulk queries
- Testing
- User generation: 200m
- Performance test suite: Burst, soak testing
- Iterate
- Rework schema
- Rerun tests
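The iterate loop above (rework schema, rerun tests) implies each burst or soak run is checked against the same acceptance numbers. A small sketch of that gate, using the thresholds from the slide (2k authentications/sec, 60+ signups/sec); the harness structure is an assumption, not TPCi's actual test suite:

```python
# Acceptance thresholds from the talk. Keys are illustrative names.
CRITERIA = {"auth_per_sec": 2000, "signup_per_sec": 60}

def failed_criteria(observed):
    """Return the criteria a test run failed; empty dict means it passed."""
    return {k: v for k, v in observed.items()
            if k in CRITERIA and v < CRITERIA[k]}

# Burst test: short window at high rate. Soak test: sustained rate
# over a long window. The same gate applies to each run's results.
burst_run = {"auth_per_sec": 2450, "signup_per_sec": 75}
soak_run  = {"auth_per_sec": 1800, "signup_per_sec": 90}

burst_failures = failed_criteria(burst_run)   # passes
soak_failures = failed_criteria(soak_run)     # auth below threshold
```

A failing run (like `soak_run` here) would trigger the "rework schema, rerun tests" iteration the slide describes.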
- The migration plan
- An iterative approach
- Improve cache stability
- Migrate out TTL and configuration data
- Stream event data
- Each phase should independently deliver value
- Migration phases
- It looks like everything was originally self-hosted on EC2; the migration phases below are marked (1) (2) (3)…
- Application tier –> (2) move configuration data out to DynamoDB: auth config, TTL tables –> (3) Amazon Kinesis Data Streams feeds event data into analytics storage (S3); S3 is cheap compared with maintaining EC2 machines.
- PTC instances
- Auth instances
- Batch/Async instances
- Data tier –> (4) consolidate the Data/Query/Index nodes into profile data stored in Aurora PostgreSQL
- Data nodes
- Query nodes
- Index nodes
- Cache nodes –> (1) switch to ElastiCache Memcached
- Planning for Aurora PostgreSQL
- AWS Professional Services to bridge the knowledge gap
- Validate our schema design and provide feedback
- Advice on how to tune DB parameter groups
- Tools for planning, monitoring, and tuning
- Aurora PostgreSQL cluster design
- PTC instances –> Cluster
- Auth instances –> Login
- Batch/Async instances –> Admin/bulk
- The migration: Extract-Transform-Load (ETL)
- Split into a NoSQL live cluster, a NoSQL backup cluster, and a NoSQL extraction cluster.
- Transform & load
- “Pretty easy”
- Abandon users that had never activated
- Minor data changes, nothing structural
- Extract
- Leverage NoSQL cluster architecture
- Map-reduce to find users not extracted
- Extraction process marks users as such in the backup cluster
- Any user changes in production would overwrite changes in backup cluster
- Test
- 11m user multi-cluster test setup
- Dozens of test runs
- Test cases - inactive users, user updates
- ~2% of documents were not overwritten
- And iterate
- User profile change documents
- Third cluster
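The extract step above is essentially an idempotent sweep: find users not yet marked as extracted, hand them to the ETL, mark them in the backup cluster, and let any production change overwrite the mark so the user gets re-extracted on the next pass. A sketch of that loop, with in-memory dicts standing in for the backup cluster and an illustrative `extracted` flag:

```python
# In-memory stand-in for the NoSQL backup cluster; the "extracted"
# field name is an assumption for illustration.

def extract_pass(backup_cluster):
    """Collect users not yet marked as extracted and mark them.
    Rerunning the pass only picks up users whose documents were
    overwritten (changed in production) since the last pass."""
    batch = []
    for user_id, doc in backup_cluster.items():
        if not doc.get("extracted"):
            batch.append(doc)
            doc["extracted"] = True  # mark in the backup cluster
    return batch

backup = {
    "u1": {"user_id": "u1", "email": "a@example.com"},
    "u2": {"user_id": "u2", "email": "b@example.com"},
}
first_batch = extract_pass(backup)    # picks up u1 and u2

# A production change replicates into the backup cluster, replacing
# u2's document and thereby clearing the extraction mark:
backup["u2"] = {"user_id": "u2", "email": "b+new@example.com"}
second_batch = extract_pass(backup)   # picks up only the changed u2
```

This overwrite-clears-the-mark behavior is also where the "~2% of documents were not overwritten" test finding above came from, which the third cluster for user profile change documents addressed.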
- Migration Day
- First, stop PTCS profile maintenance activities.
- Auth cannot stop: auth (login) traffic keeps hitting NoSQL, while extraction keeps syncing it into Aurora PostgreSQL.
- Once Aurora PostgreSQL is fully in sync, switch auth from NoSQL over to Aurora PostgreSQL.
- Testers come in to test PTCS, now pointed at Aurora PostgreSQL.
- Finally, shut down NoSQL.
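The cutover in those steps amounts to a single switch of the auth write path once Aurora has caught up. A minimal sketch of that routing, with toy stand-in store classes (not real database clients):

```python
class Store:
    """Stand-in for a datastore; records writes for inspection."""
    def __init__(self, name):
        self.name = name
        self.writes = []
    def write(self, record):
        self.writes.append(record)

class AuthRouter:
    """Routes auth traffic; flipping `target` is the cutover."""
    def __init__(self, nosql, aurora):
        self.nosql, self.aurora = nosql, aurora
        self.target = nosql        # before cutover: NoSQL stays live
    def cutover(self):
        self.target = self.aurora  # after Aurora is confirmed in sync
    def login(self, user_id):
        self.target.write(("login", user_id))

nosql, aurora = Store("nosql"), Store("aurora")
router = AuthRouter(nosql, aurora)
router.login("u1")   # lands in NoSQL; the extractor syncs it over
router.cutover()     # Aurora confirmed fully in sync
router.login("u2")   # lands in Aurora
```

Because logins never stop, the switch has to be this cheap; that is what made "no authentication downtime" in the next section achievable.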
- How did it go?
- The good
- No authentication downtime
- Plan worked as expected
- The bad
- Patches
- Some underperforming queries
- The ugly
- Nothing
- 95% of users experienced no impact
- Performance was good and consistent
The results
- Design goals revisited
- Leverage managed services (checked)
- Use appropriate datastore for data (checked)
- High availability, stability, performance (checked)
- Reduce cost & right-size infrastructure (checked)
- Overall value
- Technology
- Old platform: 3rd party NoSQL
- New platform: Aurora, DynamoDB, S3
- Benefits: Independent scaling / managed
- Infra/Licensing
- Old: ~300 / Costly
- New: 10–20 / ~$0
- Benefits: $3.5–4.5 million/year savings
- Dedicated resources
- Old: 1.5 dev/engineer
- New: None
- Benefits: 1.5 dev/engineer savings (they're still around, just reassigned to other tasks :p)
- Stability
- Old: 137 hours (6 months)
- New: 0
- Benefits: Customer experience, priceless (you could actually compute a dollar figure for the business value here, but "priceless" gets more applause (kidding XDD))
- Project retrospective
- What went well
- An agile approach to the project & problems we encountered
- Leveraging the experts: AWS DB Professional Services (Critical point! Ask for help!)
- Segmenting the data and how it was handled
- Key learnings
- Deeper understanding of our data and services
- Prior solution was more expensive (money and people) than we realized
- Tech org moving forward
- Next phase
- Monitor & optimize the platform and our services
- Understand upcoming needs & changes to our data
- Scope our advanced analytics requirements
- New architectural tenets
- Define and use common design patterns and components
- Simplify toolset, preferring managed services
- Documentation must exist, be well-defined
- Data standards & practices
- Use data to make decisions
- The whole room cheered at the end; this session was a blast! XDD