🧠 TOP 100 SYSTEM DESIGN Q&A FOR TPM (FAANG)

🔹 SECTION 1: SYSTEM DESIGN FUNDAMENTALS (1–15)

1. What is the TPM’s role in system design?

Answer:
A TPM aligns business goals, architecture decisions, delivery timelines, and risk management. The role is to facilitate decisions, not to design the code.

2. How deep should a TPM go in system design?

Answer:
Deep enough to understand architecture trade-offs, scalability limits, and failure modes, without being the primary designer.

3. How do you start a system design discussion?

Answer:
Clarify goals, users, scale, constraints, success metrics, and non-functional requirements.

4. What are non-functional requirements?

Answer:
Qualities that describe how well a system operates rather than what it does: scalability, availability, latency, reliability, security, compliance, and maintainability.

5. How do you think about scalability?

Answer:
By identifying bottlenecks, keeping components stateless so they can scale horizontally, and planning capacity ahead of demand.

6. What is high availability?

Answer:
Designing systems to minimize downtime using redundancy, failover, and fault isolation.

7. What’s the difference between scalability and performance?

Answer:
Performance is speed at current load; scalability is handling increased load gracefully.

8. What is fault tolerance?

Answer:
The ability of a system to continue functioning despite component failures.

9. What is eventual consistency?

Answer:
A model where replicas may briefly disagree after a write but converge to the same value once updates stop.

10. How do you manage trade-offs?

Answer:
By explicitly discussing cost, latency, reliability, and complexity impacts.

11. What is horizontal vs vertical scaling?

Answer:
Horizontal adds more machines; vertical increases machine capacity.

12. What are SLIs, SLOs, and SLAs?

Answer:
SLIs are the measured reliability metrics, SLOs are the internal targets for those metrics, and SLAs are the customer-facing commitments built on them.
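
To make the three terms concrete, here is a minimal sketch that computes an availability SLI from request counts and the downtime an availability target allows. The function names, the sample numbers, and the 30-day window are illustrative assumptions, not part of any standard API.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return good_requests / total_requests


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime, in minutes, that an availability SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60


sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
meets_slo = sli >= 0.999  # hypothetical SLO: 99.9% of requests succeed
budget = error_budget_minutes(0.999)  # roughly 43 minutes per 30 days
```

An SLA would typically be set looser than the SLO, so the team has room to react before a customer commitment is breached.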

13. Why is observability important?

Answer:
It enables early detection, diagnosis, and recovery from issues.

14. What is back-pressure?

Answer:
Mechanisms that signal upstream producers to slow down, or that reject excess work, so a system is not overwhelmed by traffic.
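
One simple form of back-pressure is a bounded queue that refuses new work when full, so the caller has to slow down or shed load. The sketch below uses Python's standard `queue` module; the helper name `try_enqueue` and the queue size are illustrative assumptions.

```python
import queue


def try_enqueue(q: queue.Queue, item) -> bool:
    """Accept the item if there is room; otherwise tell the caller to back off."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # caller should retry later, slow down, or drop the work


work_queue = queue.Queue(maxsize=2)  # the bound is what creates back-pressure
accepted = [try_enqueue(work_queue, job) for job in ("a", "b", "c")]
```

With a capacity of two, the third job is rejected rather than allowed to pile up in memory.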

15. What is graceful degradation?

Answer:
Reducing functionality instead of total failure during overload.

🔹 SECTION 2: ARCHITECTURE & COMPONENT DESIGN (16–30)

16. Monolith vs Microservices?

Answer:
Monoliths simplify early development; microservices improve scalability and team autonomy but add complexity.

17. When should you use microservices?

Answer:
At scale, with independent teams, clear boundaries, and operational maturity.

18. What is an API gateway?

Answer:
A centralized entry point handling routing, auth, rate limiting, and monitoring.

19. How do you design for loose coupling?

Answer:
Clear interfaces, async communication, and contract-first design.

20. Sync vs async communication?

Answer:
Sync for real-time needs; async for resilience and scalability.

21. What is idempotency?

Answer:
Ensuring that repeating the same request has the same effect as making it once, so retries are safe.
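
A common way to get this property is an idempotency key supplied by the client. The sketch below is a toy in-memory version; `PaymentService` and its fields are hypothetical names, and a real system would keep the key-to-result mapping in durable storage.

```python
class PaymentService:
    """Toy service that charges at most once per idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored response
        self.charges_made = 0   # counts real side effects

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the stored response without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("order-42", 1999)
retry_result = service.charge("order-42", 1999)  # e.g. client retried after a timeout
```

The retried call returns the original response, and only one charge is recorded.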

22. What is the circuit breaker pattern?

Answer:
Preventing cascading failures by stopping calls to unhealthy services.
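
The idea can be sketched as a small wrapper that counts failures, rejects calls while "open", and allows a trial call after a cool-down. This is a minimal single-threaded sketch; the class name, thresholds, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of tying up threads and connections on a dependency that is already struggling.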

23. What is caching and where do you use it?

Answer:
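Storing frequently accessed data closer to where it is needed to reduce latency and backend load; typical layers are the client, the CDN, the application, and the database.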

24. Cache invalidation strategies?

Answer:
TTL expiry and explicit invalidation on change; write-through and write-back policies keep the cache in step with the store on writes.
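
Two of these strategies, TTL expiry and explicit invalidation, fit in a few lines. The sketch below is an in-memory illustration only; `TTLCache` and the injectable `clock` are hypothetical names, not a library API.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expiry time)

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # TTL expiry: stale entries are dropped on read
            return None
        return value

    def invalidate(self, key):
        # Explicit invalidation: call this when the source of truth changes.
        self._store.pop(key, None)
```

A short TTL bounds how stale a read can be; explicit invalidation removes staleness sooner but requires every writer to remember to call it.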

25. Stateless vs stateful services?

Answer:
Stateless services scale more easily; stateful require careful coordination.

26. How do you handle configuration?

Answer:
Centralized config services with versioning and rollback.

27. What is feature flagging?

Answer:
Toggling features at runtime, which decouples deployment from release and allows gradual rollout or a quick disable.
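
A percentage rollout is one way a flag separates shipping code from exposing it. The sketch below assigns each user to a stable bucket by hashing; the function name `is_enabled` and the flag name are hypothetical, and real flag systems add targeting rules and remote configuration.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent


# The code is deployed for everyone, but only a share of users sees the feature.
exposed = [u for u in range(1000) if is_enabled("new-checkout", f"user-{u}", 50)]
```

Because the bucket depends only on the flag and the user, a user who has the feature at 10% keeps it as the rollout widens to 50%.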

28. What is service discovery?

Answer:
Dynamic lookup of service endpoints.

29. How do you manage schema evolution?

Answer:
Backward compatibility and versioning.

30. What is API versioning?

Answer:
Supporting multiple API contracts during transition periods.

🔹 SECTION 3: DATA & STORAGE DESIGN (31–45)

31. SQL vs NoSQL?

Answer:
SQL for strong consistency and relations; NoSQL for scale and flexibility.

32. When do you shard databases?

Answer:
When single-node capacity becomes a bottleneck.

33. What is replication?

Answer:
Maintaining copies of data for availability and read scalability.

34. Leader-follower replication?

Answer:
Writes go to the leader, which replicates them to followers; followers can serve reads, which may be slightly stale.

35. What is data partitioning?

Answer:
Splitting data across nodes based on keys.

36. How do you choose partition keys?

Answer:
Choose keys that spread data and traffic evenly and that match the dominant access patterns.
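
The distribution half of that answer can be checked with a quick experiment: hash candidate keys and count how many land on each partition. This is a minimal sketch assuming simple hash-modulo partitioning; `partition_for` and the `user-N` key format are illustrative.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Count how 10,000 hypothetical user IDs spread across 4 partitions.
counts = Counter(partition_for(f"user-{i}", 4) for i in range(10_000))
```

A high-cardinality key such as a user ID spreads evenly; a low-cardinality key such as country would concentrate traffic on a few partitions regardless of the hash.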

37. What is the CAP theorem?

Answer:
When a network partition occurs, a distributed system must choose between consistency and availability; it cannot guarantee both.

38. What is eventual vs strong consistency?

Answer:
Eventual consistency favors availability and scale; strong consistency guarantees that every read sees the latest write.

39. How do you handle hot partitions?

Answer:
Re-sharding, caching, or load redistribution.

40. What is data denormalization?

Answer:
Optimizing read performance by duplicating data.

41. What is indexing?

Answer:
Speeding up queries at the cost of storage and write overhead.

42. How do you manage data migrations?

Answer:
Backward-compatible changes and phased rollouts.

43. How do you ensure data integrity?

Answer:
Constraints, validation, and monitoring.

44. What is eventual data reconciliation?

Answer:
Resolving inconsistencies over time.

45. How do you handle large-scale analytics?

Answer:
Separate OLTP and OLAP systems.

🔹 SECTION 4: SCALABILITY, PERFORMANCE & RELIABILITY (46–60)

46. How do you handle traffic spikes?

Answer:
Auto-scaling, caching, and rate limiting.

47. What is load balancing?

Answer:
Distributing traffic across instances.

48. How do you reduce latency?

Answer:
CDNs, caching, and geographic distribution.

49. What is a CDN?

Answer:
A content delivery network: geographically distributed servers that cache and serve content closer to users.

50. How do you design for global users?

Answer:
Multi-region deployments and data locality.

51. What is failover?

Answer:
Switching to backup systems on failure.

52. Active-active vs active-passive?

Answer:
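Active-active serves traffic from all sites at once, which improves availability but adds complexity; active-passive keeps a standby that takes over on failure.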

53. How do you test reliability?

Answer:
Chaos testing and fault injection.

54. What is disaster recovery?

Answer:
Restoring service after catastrophic failure.

55. RTO vs RPO?

Answer:
RTO is the maximum acceptable time to restore service; RPO is the maximum acceptable data loss, measured as a window of time.

56. What is throttling?

Answer:
Limiting request rates to protect systems.

57. How do you manage retries?

Answer:
Exponential backoff with jitter, a cap on attempts, and idempotent operations.
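
A retry loop with exponential backoff can be sketched as follows. The version here adds random jitter and a delay cap, which are common additions rather than something the answer above specifies; the function name and default values are illustrative assumptions.

```python
import random
import time


def retry(fn, max_attempts=4, base=0.1, cap=5.0, sleep=time.sleep, rng=random.random):
    """Call fn until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with full jitter: wait a random time
            # between 0 and min(cap, base * 2**attempt) seconds.
            sleep(rng() * min(cap, base * 2 ** attempt))
```

Retrying is only safe when the operation is idempotent; otherwise a request that succeeded but timed out on the way back could be applied twice.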

58. What causes cascading failures?

Answer:
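Tightly coupled dependencies without timeouts or isolation, and uncontrolled retries that amplify load on a service that is already failing.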

59. How do you monitor system health?

Answer:
Metrics, logs, traces, alerts.

60. What is SRE’s role?

Answer:
Reliability engineering through automation and metrics.

🔹 SECTION 5: SECURITY & COMPLIANCE (61–70)

61. How do you design secure systems?

Answer:
Defense in depth and least privilege.

62. Authentication vs Authorization?

Answer:
Identity verification vs access control.

63. What is OAuth?

Answer:
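A delegated authorization framework that lets an application access resources on a user's behalf without handling the user's password.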

64. How do you protect APIs?

Answer:
Auth, rate limiting, and monitoring.

65. What is encryption at rest and in transit?

Answer:
Protecting data stored and during transmission.

66. How do you manage secrets?

Answer:
Centralized secret management systems.

67. What is zero trust?

Answer:
Never trust, always verify.

68. How do you ensure compliance?

Answer:
Policy enforcement, audits, and controls.

69. What is PII?

Answer:
Personally identifiable information requiring protection.

70. How do you handle security incidents?

Answer:
Detection, containment, communication, remediation.

🔹 SECTION 6: DELIVERY, OPERATIONS & TPM JUDGMENT (71–85)

71. How do you manage system dependencies?

Answer:
Explicit ownership, contracts, and monitoring.

72. How do you manage design reviews?

Answer:
Facilitate trade-offs and decision clarity.

73. How do you prevent late-stage surprises?

Answer:
Early risk identification and readiness checks.

74. How do you align architecture with roadmap?

Answer:
Continuous alignment between design and delivery milestones.

75. How do you handle technical debt?

Answer:
Make it visible and prioritize intentionally.

76. How do you manage platform migrations?

Answer:
Phased rollout with backward compatibility.

77. How do you ensure operational readiness?

Answer:
Runbooks, monitoring, and on-call readiness.

78. How do you handle breaking changes?

Answer:
Versioning and migration plans.

79. How do you drive decision-making?

Answer:
Clarify options, risks, and deadlines.

80. How do you balance speed vs stability?

Answer:
Risk-based delivery and guardrails.

81. How do you evaluate build vs buy?

Answer:
Cost, control, speed, and long-term scalability.

82. How do you manage cross-org technical alignment?

Answer:
Shared principles and governance forums.

83. How do you reduce operational load?

Answer:
Automation and simplification.

84. How do you measure system success?

Answer:
User impact, reliability, and scalability.

85. How do you handle post-incident reviews?

Answer:
Blameless learning and systemic fixes.

🔹 SECTION 7: FAANG-STYLE SYSTEM DESIGN SCENARIOS (86–100)

86. Design a URL shortening service (TPM view)

Answer:
Focus on scale, latency, storage, and operational risks.
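
On the storage and scale side, a frequent starting point is to map a numeric ID to a short base-62 code. The sketch below assumes IDs come from some counter or ID generator, which is outside its scope; the alphabet order and function names are illustrative choices.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode(n: int) -> str:
    """Convert a non-negative integer ID to a base-62 short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Convert a base-62 short code back to its integer ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Seven characters cover 62**7 IDs, about 3.5 trillion, which gives a quick capacity estimate to sanity-check against expected URL creation rates.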

87. Design a notification system

Answer:
Async processing, retries, and user preferences.

88. Design a metrics collection system

Answer:
High write throughput and aggregation pipelines.

89. Design a file storage system

Answer:
Chunking, replication, and metadata management.

90. Design a messaging system

Answer:
Ordering, delivery guarantees, and scale.

91. Design a search system

Answer:
Indexing, ranking, and freshness.

92. Design a recommendation platform

Answer:
Offline training and online serving separation.

93. Design a logging system

Answer:
High ingestion, retention, and querying.

94. Design an API rate limiter

Answer:
Token bucket or leaky bucket algorithms.
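
A token bucket can be sketched in a few lines: tokens refill at a fixed rate up to a capacity, and each request spends one. This is a single-process illustration; the class name and the injectable `clock` are assumptions, and a production limiter would usually keep its counters in a shared store.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, up to capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the limit: caller should reject or delay the request
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput; a leaky bucket instead smooths output to a constant rate.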

95. Design a global payment system

Answer:
Consistency, security, and fault tolerance.

96. Design a real-time collaboration system

Answer:
Conflict resolution and low latency.

97. Design a feature flag system

Answer:
Fast reads, consistency, and rollout safety.

98. Design a CI/CD system

Answer:
Automation, rollback, and observability.

99. Design a monitoring system

Answer:
Metrics, alerts, and dashboards.

100. How do TPMs evaluate system design success?

Answer:
By whether the architecture enables scale, reliability, and predictable delivery.