TOP 100 SYSTEM DESIGN Q&A FOR TPM (FAANG)
🔹 SECTION 1: SYSTEM DESIGN FUNDAMENTALS (1–15)
1. What is the TPM’s role in system design?
Answer:
A TPM ensures alignment between business goals, architecture decisions, delivery timelines, and risk management—facilitating decisions, not designing code.
2. How deep should a TPM go in system design?
Answer:
Deep enough to understand architecture trade-offs, scalability limits, and failure modes, without being the primary designer.
3. How do you start a system design discussion?
Answer:
Clarify goals, users, scale, constraints, success metrics, and non-functional requirements.
4. What are non-functional requirements?
Answer:
Scalability, availability, latency, reliability, security, compliance, and maintainability.
5. How do you think about scalability?
Answer:
By identifying bottlenecks, stateless components, horizontal scaling, and capacity planning.
6. What is high availability?
Answer:
Designing systems to minimize downtime using redundancy, failover, and fault isolation.
7. What’s the difference between scalability and performance?
Answer:
Performance is speed at current load; scalability is handling increased load gracefully.
8. What is fault tolerance?
Answer:
The ability of a system to continue functioning despite component failures.
9. What is eventual consistency?
Answer:
A model in which replicas may temporarily disagree, but if no new writes occur, all replicas converge to the same value over time.
10. How do you manage trade-offs?
Answer:
By explicitly discussing cost, latency, reliability, and complexity impacts.
11. What is horizontal vs vertical scaling?
Answer:
Horizontal adds more machines; vertical increases machine capacity.
12. What are SLIs, SLOs, and SLAs?
Answer:
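An SLI is the measured reliability metric (e.g. the fraction of successful requests); an SLO is the internal target for that metric (e.g. 99.9% over 30 days); an SLA is the customer-facing commitment, usually with contractual consequences if it is missed.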
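The gap between an SLO and the measured SLI is often tracked as an "error budget". A minimal sketch, assuming an availability SLI defined as the fraction of successful requests; the function name `error_budget` and the numbers are illustrative, not taken from any particular tool:

```python
def error_budget(total_requests: int, failed_requests: int, slo_target: float) -> dict:
    """Compare a measured availability SLI against an SLO target and report the error budget."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    sli = (total_requests - failed_requests) / total_requests  # fraction of successful requests
    allowed_failures = round(total_requests * (1 - slo_target))  # the error budget, in requests
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "allowed_failures": allowed_failures,
        "remaining_budget": allowed_failures - failed_requests,
    }


# Hypothetical month: 1,000,000 requests, 400 failed, against a 99.9% SLO.
report = error_budget(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(report["sli"], report["remaining_budget"])  # 0.9996 600
```

A 99.9% SLO over one million requests allows 1,000 failures; with 400 used, 600 remain, which is a concrete input for "ship vs. stabilize" conversations.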
13. Why is observability important?
Answer:
It enables early detection, diagnosis, and recovery from issues.
14. What is back-pressure?
Answer:
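Mechanisms that let an overloaded component signal upstream producers to slow down (or reject excess work) instead of buffering without limit and being overwhelmed.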
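One common form of back-pressure is a bounded buffer that refuses new work when full, pushing the decision (wait, retry later, or shed load) back to the producer. A minimal sketch using Python's standard `queue.Queue`; the `submit` helper and the tiny `maxsize` are hypothetical choices for illustration:

```python
import queue

# Bounded buffer between a producer and a slower consumer.
# maxsize=3 is illustrative; real systems size it to memory and latency budgets.
work_queue: "queue.Queue[int]" = queue.Queue(maxsize=3)


def submit(item: int) -> bool:
    """Try to enqueue; return False (telling the caller to back off) when the buffer is full."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False


accepted = [submit(i) for i in range(5)]
print(accepted)  # [True, True, True, False, False]
```

Rejecting early keeps memory and queueing delay bounded, which is usually preferable to an unbounded buffer that hides overload until it fails.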
15. What is graceful degradation?
Answer:
Reducing functionality instead of total failure during overload.
🔹 SECTION 2: ARCHITECTURE & COMPONENT DESIGN (16–30)
16. Monolith vs Microservices?
Answer:
Monoliths simplify early development; microservices improve scalability and team autonomy but add complexity.
17. When should you use microservices?
Answer:
At scale, with independent teams, clear boundaries, and operational maturity.
18. What is an API gateway?
Answer:
A centralized entry point handling routing, auth, rate limiting, and monitoring.
19. How do you design for loose coupling?
Answer:
Clear interfaces, async communication, and contract-first design.
20. Sync vs async communication?
Answer:
Sync for real-time needs; async for resilience and scalability.
21. What is idempotency?
Answer:
Ensuring that repeating the same request (e.g. a client retry after a timeout) has the same effect as sending it once.
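A frequent way to make a non-idempotent operation (like a charge) safe to retry is a client-supplied idempotency key. A minimal in-memory sketch; `PaymentService`, `charge`, and the key format are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical service that deduplicates requests by a client-supplied idempotency key."""
    balance: int = 100
    _results: dict = field(default_factory=dict)  # idempotency key -> stored response

    def charge(self, idempotency_key: str, amount: int) -> dict:
        # A retry with an already-seen key returns the original response instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance -= amount
        response = {"status": "charged", "amount": amount, "balance": self.balance}
        self._results[idempotency_key] = response
        return response


svc = PaymentService()
first = svc.charge("req-123", 30)
retry = svc.charge("req-123", 30)  # e.g. the client timed out and retried
print(svc.balance)  # 70, charged once
```

In production the key store would need to be durable and the check-then-write step atomic; this sketch shows only the deduplication logic.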
22. What is the circuit breaker pattern?
Answer:
Preventing cascading failures by stopping calls to unhealthy services.
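The breaker's states can be sketched as follows, assuming the common closed / open / half-open model; class and parameter names are illustrative, and the clock is injectable so the behavior can be tested without waiting:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without contacting the dependency."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; goes half-open after `reset_timeout`."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # let a trial call through to probe recovery
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # Trip on reaching the threshold, or re-open if a half-open trial call fails.
            if self.failures >= self.failure_threshold or self.opened_at is not None:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

This is deliberately simplified: it is not thread-safe, and in half-open state it lets every call through rather than exactly one probe.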
23. What is caching and where do you use it?
Answer:
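Storing frequently accessed data in a faster layer closer to the consumer to reduce latency and backend load; common places are the client, a CDN, the application tier (e.g. an in-memory cache), and in front of the database.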
24. Cache invalidation strategies?
Answer:
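TTL-based expiry, explicit invalidation when the source data changes, and write policies such as write-through or write-back that keep the cache and the store in sync.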
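Two of these strategies, TTL expiry and explicit invalidation, fit in a few lines. A minimal in-process sketch (the `TTLCache` class is hypothetical; write-through and write-back policies are not shown), with an injectable clock for testing:

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Entries expire after `ttl` seconds or when explicitly invalidated."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._data: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expiry time)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (value, self.clock() + self.ttl)

    def invalidate(self, key: str) -> None:
        # Explicit invalidation, e.g. called right after the source of truth is updated.
        self._data.pop(key, None)
```

TTL bounds how stale data can get without any coordination; explicit invalidation gives fresher reads but requires every writer to remember to call it.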
25. Stateless vs stateful services?
Answer:
Stateless services scale more easily; stateful require careful coordination.
26. How do you handle configuration?
Answer:
Centralized config services with versioning and rollback.
27. What is feature flagging?
Answer:
Runtime switches that let code ship disabled and be turned on later or for a subset of users, decoupling deployment from release.
28. What is service discovery?
Answer:
Dynamic lookup of service endpoints.
29. How do you manage schema evolution?
Answer:
Backward compatibility and versioning.
30. What is API versioning?
Answer:
Supporting multiple API contracts during transition periods.
🔹 SECTION 3: DATA & STORAGE DESIGN (31–45)
31. SQL vs NoSQL?
Answer:
SQL for strong consistency and relations; NoSQL for scale and flexibility.
32. When do you shard databases?
Answer:
When single-node capacity becomes a bottleneck.
33. What is replication?
Answer:
Maintaining copies of data for availability and read scalability.
34. Leader-follower replication?
Answer:
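All writes go to the leader, which replicates them to followers; followers serve reads, which may be slightly stale if replication is asynchronous.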
35. What is data partitioning?
Answer:
Splitting data across nodes based on keys.
36. How do you choose partition keys?
Answer:
Even distribution and access patterns.
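A quick way to sanity-check a candidate key for even distribution is to hash a sample of real keys and count how many land on each partition. A minimal sketch of hash partitioning; `partition_for` and the `user-N` keys are hypothetical:

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Check how evenly a candidate key spreads across 4 partitions.
keys = [f"user-{i}" for i in range(10_000)]
counts = Counter(partition_for(k, 4) for k in keys)
print(sorted(counts.items()))
```

Plain modulo hashing remaps most keys when the partition count changes; consistent hashing is the usual alternative when partitions are added or removed often. Even hashing also does not help if a single key receives most of the traffic.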
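37. What is the CAP theorem?
Answer:
When a network partition occurs, a distributed system must choose between consistency (every read sees the latest write) and availability (every request gets a response); it cannot guarantee both.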
38. What is eventual vs strong consistency?
Answer:
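Strong consistency guarantees that every read sees the latest committed write; eventual consistency relaxes that guarantee in exchange for higher availability, lower latency, and easier scaling.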
39. How do you handle hot partitions?
Answer:
Re-sharding, caching, or load redistribution.
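One load-redistribution technique for a single hot key is "salting": appending a small random suffix on writes so the key spreads over several partitions, at the cost of fanning out reads. A minimal sketch; `NUM_SALTS`, the helper names, and the `#` separator are hypothetical choices:

```python
import random
from typing import List

NUM_SALTS = 8  # illustrative: how many sub-keys one hot key is spread across


def salted_write_key(hot_key: str, rng: random.Random) -> str:
    """Each write picks a random suffix so the hot key lands on several partitions."""
    return f"{hot_key}#{rng.randrange(NUM_SALTS)}"


def salted_read_keys(hot_key: str) -> List[str]:
    """Reads must query every suffix and merge the results."""
    return [f"{hot_key}#{i}" for i in range(NUM_SALTS)]


rng = random.Random(7)
write_keys = {salted_write_key("celebrity-123", rng) for _ in range(1000)}
print(sorted(write_keys))
```

The trade-off is explicit: write load drops per partition by up to a factor of `NUM_SALTS`, while every read of that key becomes `NUM_SALTS` lookups.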
40. What is data denormalization?
Answer:
Optimizing read performance by duplicating data.
41. What is indexing?
Answer:
Speeding up queries at the cost of storage and write overhead.
42. How do you manage data migrations?
Answer:
Backward-compatible changes and phased rollouts.
43. How do you ensure data integrity?
Answer:
Constraints, validation, and monitoring.
44. What is eventual data reconciliation?
Answer:
Periodically comparing data across systems (e.g. by record counts or checksums) and repairing mismatches that accumulated over time.
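A reconciliation job typically compares records by key and by a content fingerprint. A minimal sketch, assuming both systems can be read into dictionaries keyed by record ID; `reconcile`, `row_checksum`, and the sample data are hypothetical:

```python
import hashlib
import json
from typing import Dict, List, Tuple


def row_checksum(row: dict) -> str:
    """Stable fingerprint of a record (sorted keys so field order does not matter)."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()


def reconcile(source: Dict[str, dict],
              replica: Dict[str, dict]) -> Tuple[List[str], List[str], List[str]]:
    """Return (missing_in_replica, extra_in_replica, mismatched) record IDs."""
    missing = sorted(source.keys() - replica.keys())
    extra = sorted(replica.keys() - source.keys())
    mismatched = sorted(
        k for k in source.keys() & replica.keys()
        if row_checksum(source[k]) != row_checksum(replica[k])
    )
    return missing, extra, mismatched


orders_db = {"o1": {"amount": 10}, "o2": {"amount": 25}, "o3": {"amount": 5}}
warehouse = {"o1": {"amount": 10}, "o2": {"amount": 20}, "o4": {"amount": 7}}
print(reconcile(orders_db, warehouse))  # (['o3'], ['o4'], ['o2'])
```

At scale the same idea is usually applied per partition or per time window, comparing aggregate checksums first and drilling into individual rows only where they differ.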
45. How do you handle large-scale analytics?
Answer:
Separate OLTP and OLAP systems.
🔹 SECTION 4: SCALABILITY, PERFORMANCE & RELIABILITY (46–60)
46. How do you handle traffic spikes?
Answer:
Auto-scaling, caching, and rate limiting.
47. What is load balancing?
Answer:
Distributing traffic across instances.
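The simplest distribution policy is round robin. A minimal sketch; `RoundRobinBalancer` and the backend addresses are hypothetical, and real load balancers add health checks and often weight by capacity or current load:

```python
import itertools
from typing import Iterator, List


class RoundRobinBalancer:
    """Hands out backends in rotation; a real pool would also drop unhealthy ones."""

    def __init__(self, backends: List[str]) -> None:
        if not backends:
            raise ValueError("need at least one backend")
        self._cycle: Iterator[str] = itertools.cycle(backends)

    def next_backend(self) -> str:
        return next(self._cycle)


lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
picks = [lb.next_backend() for _ in range(5)]
print(picks)  # ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1', '10.0.0.2']
```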
48. How do you reduce latency?
Answer:
CDNs, caching, and geographic distribution.
49. What is a CDN?
Answer:
A content delivery network: geographically distributed edge servers that cache and serve content closer to users.
50. How do you design for global users?
Answer:
Multi-region deployments and data locality.
51. What is failover?
Answer:
Switching to backup systems on failure.
52. Active-active vs active-passive?
Answer:
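Active-active serves traffic from all sites at once; active-passive keeps a standby that takes over only on failure. Active-active improves availability and resource use but adds complexity, such as keeping data consistent across sites.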
53. How do you test reliability?
Answer:
Chaos testing and fault injection.
54. What is disaster recovery?
Answer:
Restoring service after catastrophic failure.
55. RTO vs RPO?
Answer:
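RTO (recovery time objective) is the maximum acceptable time to restore service; RPO (recovery point objective) is the maximum acceptable data loss, measured as time since the last recoverable data point.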
56. What is throttling?
Answer:
Limiting request rates to protect systems.
57. How do you manage retries?
Answer:
Exponential backoff with jitter, a cap on total attempts, and idempotent operations so retries are safe.
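A minimal retry helper, assuming the "full jitter" variant (a random delay between zero and a capped exponential bound), which spreads out retries so clients do not retry in lockstep. The names and default parameters are illustrative; `sleep` and `rng` are injectable for testing:

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def retry(fn: Callable[[], T], max_attempts: int = 5, base: float = 0.1, cap: float = 10.0,
          sleep: Callable[[float], None] = time.sleep,
          rng: Optional[random.Random] = None) -> T:
    """Call fn, retrying on exception with capped exponential backoff.

    Only safe when fn is idempotent: a call that appeared to fail may have partly succeeded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error to the caller
            sleep(backoff_delay(attempt, base, cap, rng))
    raise RuntimeError("unreachable")
```

The attempt cap matters as much as the backoff: unbounded retries across many clients are a classic trigger for the cascading failures described in the next question.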
58. What causes cascading failures?
Answer:
Uncontrolled dependencies and retries.
59. How do you monitor system health?
Answer:
Metrics, logs, traces, and alerts.
60. What is SRE’s role?
Answer:
Reliability engineering through automation and metrics.
🔹 SECTION 5: SECURITY & COMPLIANCE (61–70)
61. How do you design secure systems?
Answer:
Defense in depth and least privilege.
62. Authentication vs Authorization?
Answer:
Identity verification vs access control.
63. What is OAuth?
Answer:
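A delegated authorization framework that lets a user grant an application limited access to their resources without sharing their credentials.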
64. How do you protect APIs?
Answer:
Auth, rate limiting, and monitoring.
65. What is encryption at rest and in transit?
Answer:
Protecting data stored and during transmission.
66. How do you manage secrets?
Answer:
Centralized secret management systems.
67. What is zero trust?
Answer:
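No implicit trust based on network location: every request is authenticated, authorized, and continuously verified ("never trust, always verify").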
68. How do you ensure compliance?
Answer:
Policy enforcement, audits, and controls.
69. What is PII?
Answer:
Personally identifiable information requiring protection.
70. How do you handle security incidents?
Answer:
Detection, containment, communication, and remediation.
🔹 SECTION 6: DELIVERY, OPERATIONS & TPM JUDGMENT (71–85)
71. How do you manage system dependencies?
Answer:
Explicit ownership, contracts, and monitoring.
72. How do you manage design reviews?
Answer:
Facilitate trade-offs and decision clarity.
73. How do you prevent late-stage surprises?
Answer:
Early risk identification and readiness checks.
74. How do you align architecture with roadmap?
Answer:
Continuous alignment between design and delivery milestones.
75. How do you handle technical debt?
Answer:
Make it visible and prioritize intentionally.
76. How do you manage platform migrations?
Answer:
Phased rollout with backward compatibility.
77. How do you ensure operational readiness?
Answer:
Runbooks, monitoring, and on-call readiness.
78. How do you handle breaking changes?
Answer:
Versioning and migration plans.
79. How do you drive decision-making?
Answer:
Clarify options, risks, and deadlines.
80. How do you balance speed vs stability?
Answer:
Risk-based delivery and guardrails.
81. How do you evaluate build vs buy?
Answer:
Cost, control, speed, and long-term scalability.
82. How do you manage cross-org technical alignment?
Answer:
Shared principles and governance forums.
83. How do you reduce operational load?
Answer:
Automation and simplification.
84. How do you measure system success?
Answer:
User impact, reliability, and scalability.
85. How do you handle post-incident reviews?
Answer:
Blameless learning and systemic fixes.
🔹 SECTION 7: FAANG-STYLE SYSTEM DESIGN SCENARIOS (86–100)
86. Design a URL shortening service (TPM view)
Answer:
Focus on scale, latency, storage, and operational risks.
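For the scale and storage discussion, it helps to know how short codes are usually produced. A minimal sketch, assuming each URL gets a unique numeric ID (from a counter or ID service, not shown) that is base62-encoded; the function names are hypothetical:

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 characters


def encode_base62(n: int) -> str:
    """Turn a unique non-negative numeric ID into a short code."""
    if n < 0:
        raise ValueError("id must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code: str) -> int:
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


print(encode_base62(125))  # 21
print(62 ** 7)  # 3521614606208 IDs fit in 7 characters
```

Seven characters cover roughly 3.5 trillion IDs, which frames capacity planning. One operational risk worth flagging: sequential IDs produce guessable codes.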
87. Design a notification system
Answer:
Async processing, retries, and user preferences.
88. Design a metrics collection system
Answer:
High write throughput and aggregation pipelines.
89. Design a file storage system
Answer:
Chunking, replication, and metadata management.
90. Design a messaging system
Answer:
Ordering, delivery guarantees, and scale.
91. Design a search system
Answer:
Indexing, ranking, and freshness.
92. Design a recommendation platform
Answer:
Offline training and online serving separation.
93. Design a logging system
Answer:
High ingestion, retention, and querying.
94. Design an API rate limiter
Answer:
Token bucket or leaky bucket algorithms.
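A minimal single-process token bucket, as a sketch: the bucket holds up to `capacity` tokens, refills at `rate` tokens per second, and each request spends one. The class name and parameters are illustrative; a distributed limiter would keep this state per client in a shared store, which is not shown:

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Capacity sets the tolerated burst and rate sets the sustained throughput, so the two can be tuned independently.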
95. Design a global payment system
Answer:
Consistency, security, and fault tolerance.
96. Design a real-time collaboration system
Answer:
Conflict resolution and low latency.
97. Design a feature flag system
Answer:
Fast reads, consistency, and rollout safety.
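Rollout safety often relies on deterministic percentage bucketing: the same user always gets the same answer, and raising the percentage only adds users. A minimal sketch; `in_rollout`, the flag name, and the 10,000-bucket granularity are hypothetical:

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percentage: float) -> bool:
    """Deterministically place a user in one of 10,000 buckets for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999, i.e. 0.01% granularity
    return bucket < percentage * 100


users = [f"user-{i}" for i in range(10_000)]
enabled = sum(in_rollout("new-checkout", u, 10) for u in users)
print(enabled)  # roughly 1,000 of 10,000 users
```

Including the flag name in the hash keeps different flags from always enabling the same cohort of users first.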
98. Design a CI/CD system
Answer:
Automation, rollback, and observability.
99. Design a monitoring system
Answer:
Metrics, alerts, and dashboards.
100. How do TPMs evaluate system design success?
Answer:
When architecture enables scale, reliability, and predictable delivery.