Valenx Press · 12 min read
Amazon SRE Capacity Planning Interview: A Real Case Study from AWS
The candidates who prepare the most often perform the worst in Amazon SRE capacity planning interviews. I sat in a debrief in 2019 where a candidate with a PhD in operations research and five years at a top CDN company failed the loop. He had memorized Little’s Law, could recite Erlang C formulas from memory, and built a beautiful spreadsheet model. The hiring manager’s single comment in the doc: “No ownership, no customer obsession, no hire.” The candidate who replaced him—a former network tech with an associate’s degree—aced the same loop by describing how he had once stayed awake for 36 hours to prevent a Black Friday overload at a regional retailer. The bar raiser voted yes before the candidate even finished his story.
What Does Amazon Actually Test in SRE Capacity Planning Interviews?
Amazon tests whether you can own a problem that has no clean answer, not whether you can calculate the right answer.
In the Q2 2022 loop that I debriefed for an AWS EC2 placement team, the candidate received a classic: “Design capacity planning for a new EC2 instance family launch in a single AWS region.” The prompt was intentionally underspecified. No traffic estimates. No SLA definitions. No failure mode constraints. The interviewers who passed this loop did not produce the most mathematically elegant solutions. They produced the most paranoid ones.
The first counter-intuitive truth is this: Amazon’s capacity planning interview is not a math test disguised as an interview. It is a judgment test that uses math as a prop.
I watched a senior engineer in the Elastic Compute Cloud org fail this exact prompt by delivering a flawless queuing theory analysis. He calculated server counts, derived peak-to-average ratios, and proposed a graceful degradation curve. The bar raiser’s feedback: “Never once asked who the customer was, never questioned why the region, never pushed back on the timeline. He would have built the wrong thing beautifully.” The hire/no-hire vote split 2-3, and the bar raiser’s no carried.
The engineer who received the offer for that same requisition spent the first eight minutes of his 50-minute session asking clarifying questions. Not trivial ones. He asked whether this instance family was intended for bursty ML training workloads or sustained web serving. He asked whether the region had existing reserved capacity commitments that would constrain physical buildout. He asked what “launch” meant—GA to all customers, or limited beta with account allow-list? Each question telegraphed something specific: he understood that capacity planning at Amazon is a negotiation between engineering constraints and business commitments, not a pure optimization problem.
How Does the AWS SRE Interview Loop Structure Capacity Planning Questions?
The loop contains two technical rounds that test capacity planning, not one, and they test different failure modes.
Most candidates assume the system design round covers capacity planning and the behavioral round covers leadership principles. This is wrong. In AWS SRE loops, both technical rounds embed capacity planning scenarios, and both evaluate how you integrate technical depth with ownership and customer obsession.
In a 2023 debrief for a DynamoDB storage team, the first technical round presented a scenario: “Your service is seeing 15% week-over-week growth in request volume. The current headroom buffer is 30%. You have six weeks until predicted exhaustion. Walk me through your next 36 hours.” The candidate who passed did not open with calculations. He opened with: “I need to verify if that 15% growth is organic or if we have a metrics pipeline double-counting requests. I saw that exact failure mode in 2021.” This was not a scripted answer. It was a signal that he had lived through ambiguity and owned the aftermath.
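The same instinct applies to the prompt's own figures. Here is a minimal sketch of the sanity check, assuming growth compounds week over week and trying two possible definitions of "headroom" (both readings are my assumptions; the prompt does not say which one it means). Under either reading the arithmetic lands at roughly two to three weeks, not six, which is exactly the kind of gap worth raising before any modeling starts.

```python
import math


def weeks_to_exhaustion(weekly_growth: float, growth_room: float) -> float:
    """Weeks until demand has grown by `growth_room` (a fraction of current
    demand), assuming growth compounds week over week."""
    return math.log(1.0 + growth_room) / math.log(1.0 + weekly_growth)


WEEKLY_GROWTH = 0.15  # from the prompt
HEADROOM = 0.30       # from the prompt

# Reading A (assumption): headroom is spare capacity relative to current demand.
weeks_a = weeks_to_exhaustion(WEEKLY_GROWTH, HEADROOM)

# Reading B (assumption): headroom is the unused share of total capacity,
# so demand can grow by 1 / (1 - 0.30) - 1 before hitting the ceiling.
weeks_b = weeks_to_exhaustion(WEEKLY_GROWTH, 1.0 / (1.0 - HEADROOM) - 1.0)

# Weekly growth rate that would take six weeks to consume reading-A headroom.
implied_growth = (1.0 + HEADROOM) ** (1.0 / 6.0) - 1.0

print(f"Reading A: {weeks_a:.1f} weeks")
print(f"Reading B: {weeks_b:.1f} weeks")
print(f"Growth implied by a six-week runway: {implied_growth:.1%} per week")
```

If the six-week figure is right, then either the growth rate or the headroom definition is not what it appears to be, and that is the question to ask out loud.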
The second technical round, typically with a principal engineer or senior manager, presents a scenario with no numerical anchor at all. “We’re launching in Jakarta. How much capacity do we need?” The candidates who fail here reach for population statistics, GDP per capita, or regional cloud adoption curves. The candidates who pass start with: “What service, what customer segment, and what’s the minimum viable launch commitment?” They treat the interviewer as a stakeholder whose requirements are half-formed, not as an oracle with hidden numbers.
The principle at work is what Amazon calls “disagree and commit.” The interview simulates the actual job: you will receive incomplete requirements from product managers, you will push back, and you will still need to deliver. The interviewers are not testing whether you can solve a problem with missing information. They are testing whether you will solve it blindly, without first challenging what is missing.
What Specific Numbers and Timelines Appear in Real AWS Capacity Planning Cases?
Real cases use specific, non-rounded figures that reflect actual AWS operational reality, and you should reference comparable precision.
In a 2021 loop for the S3 index team, the case involved these specifics: 2.3 million requests per second at peak, a requirement to maintain p99 latency below 100ms, and a constraint that any capacity addition required 14 days for physical server delivery plus 3 days for software provisioning. The candidate was expected to identify that this 17-day combined lead time meant automated reorder triggers had to fire at 21 days of remaining headroom, which leaves a 4-day margin, not at exhaustion.
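The lead-time arithmetic in that case can be sketched in a few lines. This is an illustration only: the `LeadTime` class and `should_reorder` helper are hypothetical names of mine, and the only figures taken from the case are the 14-day delivery, the 3-day provisioning, and the 21-day trigger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeadTime:
    """Time between placing an order and having usable capacity."""
    delivery_days: int
    provisioning_days: int

    @property
    def total_days(self) -> int:
        return self.delivery_days + self.provisioning_days


def should_reorder(days_of_headroom: float, trigger_days: int) -> bool:
    """Fire the reorder once remaining headroom falls to the trigger."""
    return days_of_headroom <= trigger_days


lead = LeadTime(delivery_days=14, provisioning_days=3)
TRIGGER_DAYS = 21

# Slack left if the order is placed exactly at the trigger and nothing slips.
safety_margin_days = TRIGGER_DAYS - lead.total_days

print(f"Total lead time: {lead.total_days} days")
print(f"Safety margin at a {TRIGGER_DAYS}-day trigger: {safety_margin_days} days")
print(f"Reorder at 25 days of headroom? {should_reorder(25, TRIGGER_DAYS)}")
print(f"Reorder at 20 days of headroom? {should_reorder(20, TRIGGER_DAYS)}")
```

The point of writing it down is that the margin becomes an explicit number someone has to defend, rather than a side effect of picking a round trigger.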
The salary context for this role, confirmed across multiple offer negotiations I participated in, was approximately $165,000 base, $35,000 first-year sign-on, and 35 RSUs for an L5 SRE in the Bay Area. L6 positions in the same specialty ranged from $185,000 to $210,000 base with 55-75 RSUs. These figures vary by region and market conditions, but the precision matters for your negotiation positioning.
Timeline expectations: candidates typically complete the loop in 4-6 weeks from first recruiter screen to offer, with the on-site or virtual on-site comprising 5 rounds of 45-60 minutes each. The capacity planning scenarios appear in rounds 2 and 4, with the bar raiser in round 3 or 5. Two of the five interviewers will be from outside the immediate hiring team, a deliberate mechanism to reduce team-specific bias.
The second counter-intuitive truth: the numbers in the case are less important than how you question the numbers. In the S3 index debrief, the hiring manager noted that the successful candidate had paused to ask: “Is that 2.3 million measured at the edge or at the origin? Those diverge by 18% in our current dashboards.” This detail was not in the case. It was brought from his operational memory. The signal was not that he knew the number. The signal was that he knew numbers lie.
How Do Bar Raisers Evaluate Capacity Planning Answers?
Bar raisers look for ownership signals in how you handle the inevitable moment where the case breaks, not in your smooth execution.
Every capacity planning case at Amazon is designed to break. The interviewer will introduce a constraint that invalidates your previous approach. The classic pivot: “Midway through your plan, the region experiences a fiber cut that reduces available capacity by 40%.” The candidate who continues optimizing their original model has failed. The candidate who says “I stop planning and activate the incident response. Capacity planning just became incident management” has understood the purpose of the exercise.
In a 2020 debrief that I still reference in hiring committee training, a candidate for the CloudFront edge team received a capacity case involving cache storage planning. He built a reasonable model, then the interviewer added: “The procurement team just informed you that SSD costs increased 30% and your budget is fixed.” The candidate paused, then said: “I need to call the product manager and explain that our launch timeline or our performance guarantee has to change. Neither technical option works without business tradeoff.” The bar raiser wrote in her notes: “Demonstrates ownership by escalating appropriately rather than hiding behind technical purity.”
This is the third counter-intuitive truth: the correct answer is often to refuse to answer as framed. Amazon’s leadership principle “Have Backbone; Disagree and Commit” is tested most directly when the case demands that you reject its premises.
The bar raiser’s specific evaluation rubric, reconstructed from multiple debriefs, scores: (1) Did the candidate identify missing requirements? (2) Did the candidate prioritize customer impact over technical elegance? (3) Did the candidate demonstrate ownership when the scenario shifted? (4) Did the candidate show knowledge of how AWS actually operates, or only generic cloud concepts? The fifth dimension, often decisive, is whether the candidate’s questions revealed operational scar tissue—evidence that they had actually lived through similar failures.
What Does a Winning Capacity Planning Response Actually Sound Like?
Winning responses follow a specific structure that mirrors Amazon’s operational rhythm, not academic problem-solving.
The script I have heard work repeatedly follows this pattern: “Before I model anything, I need to understand the customer commitment and the business context. [Ask 2-3 clarifying questions]. With that, my approach would be: first, establish the demand signal and its confidence interval; second, model the supply constraints including the longest-lead-time component; third, define the trigger points where we commit capital before having perfect information; fourth, build the feedback loop so the plan self-corrects. The specific numbers matter less than the guardrails and who makes the call when we exceed them.”
The closing clause about “who makes the call” is the critical phrase. It signals that you understand capacity planning is a decision rights problem, not merely a forecasting problem.
In a 2022 debrief for the Lambda compute team, the successful candidate used this exact framing, then added: “The 14-day hardware delivery is our constraint. So I want the automatic trigger at 21 days of headroom, but I also want a manual review at 45 days. The automatic trigger prevents panic. The manual review prevents complacency.” The hiring manager, in the debrief, described this as “exactly how we think about it.”
The mistake is to present a finished plan. The winning move is to present a decision framework with explicit trigger points, escalation paths, and known unknowns.
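A decision framework of this kind can be written down as explicit thresholds plus an owner for each one. The sketch below uses the 21-day and 45-day trigger points from the Lambda example; the `Action` names and the owners in `DECISION_OWNER` are hypothetical placeholders of mine, not anything stated in the debrief.

```python
from enum import Enum


class Action(Enum):
    MONITOR = "monitor"
    MANUAL_REVIEW = "manual review"
    AUTO_REORDER = "automatic reorder"


MANUAL_REVIEW_DAYS = 45  # from the Lambda example
AUTO_REORDER_DAYS = 21   # from the Lambda example

# Hypothetical owners: the point is that each threshold names who decides.
DECISION_OWNER = {
    Action.MONITOR: "on-call SRE",
    Action.MANUAL_REVIEW: "capacity planning lead",
    Action.AUTO_REORDER: "automation, with the service owner notified",
}


def next_action(days_of_headroom: float) -> Action:
    """Map remaining headroom to the action its threshold requires."""
    if days_of_headroom <= AUTO_REORDER_DAYS:
        return Action.AUTO_REORDER
    if days_of_headroom <= MANUAL_REVIEW_DAYS:
        return Action.MANUAL_REVIEW
    return Action.MONITOR


for days in (60, 40, 18):
    action = next_action(days)
    print(f"{days} days of headroom: {action.value} ({DECISION_OWNER[action]})")
```

In an interview you would say this rather than code it, but the structure is the same: thresholds, actions, and a named decision-maker for each.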
Preparation Checklist
- Map at least two real operational incidents from your career to the Amazon leadership principles, with specific customer impact metrics and your personal ownership moments
- Practice verbalizing capacity estimations without completing calculations, forcing yourself to state ranges and confidence intervals: “Somewhere between 10,000 and 50,000 instances, with low confidence until I verify the request pattern”
- Work through a structured preparation system (the PM Interview Playbook covers Amazon’s operational excellence and capacity planning loops with real debrief examples from AWS infrastructure teams)
- Memorize three specific AWS service operational details (actual instance families, actual regional launch patterns, actual published architecture patterns) to demonstrate domain fluency rather than generic cloud knowledge
- Record yourself answering a capacity case in exactly 45 minutes, then review for: time spent clarifying versus solving, moments where you could have escalated ownership, and any point where you accepted the interviewer’s framing without questioning it
Mistakes to Avoid
BAD: “I would calculate the peak QPS, then multiply by redundancy factor, then divide by server capacity.”
GOOD: “I would first validate whether peak QPS is the right demand signal, or if we have a weekly pattern where a lower sustained rate actually drives our storage capacity constraint differently.”
BAD: Accepting the case constraints without pushing back. “With a 30% growth rate and six weeks to launch, I need X servers.”
GOOD: “A 30% growth rate with six weeks to launch suggests either the launch was poorly timed or the growth metric is misleading. I’d verify the growth calculation before committing capital.”
BAD: Treating capacity planning as a forecasting exercise divorced from business commitment. “My model predicts we need 500 servers.”
GOOD: “My model produces a range. The 500-server point estimate assumes 95th percentile demand. At 80th percentile, we need 320. The business needs to choose the cost of over-provisioning against the risk of customer-impacting throttling.”
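The range-based answer in the last GOOD example can be sketched as a percentile calculation. Everything numeric here is hypothetical: the demand samples and the per-server throughput are invented so that the output reproduces the 320 and 500 figures quoted above, and the nearest-rank percentile is just one of several reasonable definitions.

```python
import math


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def servers_needed(demand_qps: int, per_server_qps: int) -> int:
    return math.ceil(demand_qps / per_server_qps)


# Hypothetical daily peak demand, in thousands of QPS (20 observations).
DAILY_PEAK_KQPS = [180, 200, 210, 220, 230, 240, 250, 260, 270, 280,
                   290, 295, 300, 305, 310, 320, 380, 450, 500, 640]
PER_SERVER_QPS = 1_000  # hypothetical sustainable throughput per server

samples_qps = [k * 1_000 for k in DAILY_PEAK_KQPS]

for pct in (80, 95):
    demand = percentile(samples_qps, pct)
    count = servers_needed(demand, PER_SERVER_QPS)
    print(f"p{pct} demand {demand:,} QPS needs {count} servers")
```

The gap between the two rows is the business decision: the cost of the extra servers against the risk of throttling customers on the days that land above the lower percentile.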
FAQ
Why do candidates with strong operations backgrounds fail Amazon SRE capacity planning interviews?
The failure pattern is not lack of technical knowledge but misplaced technical purity. Candidates from traditional infrastructure roles often optimize for efficiency metrics that Amazon explicitly deprioritizes in favor of customer experience. I debriefed a candidate from a major financial exchange who calculated optimal server utilization at 78%. The bar raiser asked why 78%, and the candidate explained it minimized total cost of ownership. The correct Amazon answer would have been: “It depends on the tail latency requirement for the customer workload, and I’d rather over-provision than miss an SLA.” The candidate was declined. The problem was not his answer. It was his judgment signal.
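To see why a utilization target cannot be chosen without the latency requirement, here is a toy model. It assumes a single-server M/M/1 queue, where response time is exponentially distributed with rate μ(1 − ρ), so the 99th percentile is ln(100) / (μ(1 − ρ)). Real fleets are multi-server and far less tidy, and the 200 requests per second service rate is a hypothetical figure of mine, so treat this as intuition only.

```python
import math


def mm1_p99_latency_ms(service_rate_per_s: float, utilization: float) -> float:
    """99th-percentile response time of an M/M/1 queue, in milliseconds.

    Response time in M/M/1 is exponential with rate mu * (1 - rho),
    so the p99 is ln(100) / (mu * (1 - rho)).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1000.0 * math.log(100) / (service_rate_per_s * (1.0 - utilization))


SERVICE_RATE = 200.0  # hypothetical: one server completes 200 requests/s

for rho in (0.50, 0.78, 0.90):
    p99 = mm1_p99_latency_ms(SERVICE_RATE, rho)
    print(f"utilization {rho:.0%}: p99 is about {p99:.0f} ms")
```

In this toy model the tail latency more than doubles between 50% and 78% utilization. Whether that is acceptable depends entirely on what the customer workload was promised, which is the question the candidate never asked.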
How important are specific AWS service details versus general capacity planning theory?
General theory is table stakes; specific service details are the differentiator. In a 2023 loop for an EBS team, the candidate who mentioned that gp3 volumes let you provision IOPS and throughput independently of volume size, and that this changed capacity planning from the gp2 model, demonstrated current operational knowledge. The candidate who treated “block storage” as a generic abstraction was rated “lacks AWS depth” by the hiring manager. The insight is not that you must know every service. It is that you must know at least one service deeply enough to discuss how its specific constraints shape capacity decisions.
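The gp2-versus-gp3 point is easy to make concrete. The sketch below uses the baseline figures as I understand them from the published EBS documentation: gp2 baseline IOPS scale at 3 per GiB with a floor of 100 and a ceiling of 16,000, while gp3 starts at 3,000 IOPS regardless of size. Treat those constants as assumptions to verify against the current docs; the helper names are mine.

```python
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP3_BASELINE_IOPS = 3_000  # independent of volume size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline IOPS are a function of provisioned size."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest gp2 volume (GiB) whose baseline meets the IOPS target."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("target exceeds the gp2 baseline ceiling")
    if target_iops <= GP2_MIN_IOPS:
        return 1
    return -(-target_iops // GP2_IOPS_PER_GIB)  # integer ceiling


data_gib = 500      # hypothetical: how much data we actually store
target_iops = 9_000  # hypothetical: what the workload needs

print(f"gp2 baseline at {data_gib} GiB: {gp2_baseline_iops(data_gib)} IOPS")
print(f"gp2 size needed for {target_iops} IOPS: {gp2_size_for_iops(target_iops)} GiB")
print(f"gp3 baseline at any size: {GP3_BASELINE_IOPS} IOPS, with more provisioned separately")
```

Under the gp2 model the capacity plan has to buy storage to get performance; under gp3 the two become separate line items, which changes both the forecast and the cost conversation.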
What should I do if the interviewer introduces a constraint that makes my entire approach invalid?
This is the designed break point, and your response determines the interview outcome. In a 2021 debrief, a candidate for the CloudWatch metrics team had built an elegant time-series forecasting model. The interviewer then stated: “The CFO has frozen all capital expenditure for 90 days.” The candidate who passed paused, visibly recalibrated, and said: “Then my job changes from capacity planning to capacity triage. I need to identify which customer commitments are at risk, which can be renegotiated, and what technical debt we accept to free existing capacity.” The bar raiser’s note: “Demonstrates adaptability and customer focus under constraint.” The candidate who argued for an exception to the freeze was declined.