Cold email deliverability score: decide what segments to scale
Build a cold email deliverability score from bounce rate, complaint rate, and reply rate, so you can spot safe segments to scale and pause risky ones.

What a segment-level deliverability score solves
Deliverability is simple: do your emails reach a place where a real person can read them? Best case is the inbox. Worse is spam. Worst is no delivery at all because the address is bad, the domain rejects you, or you hit a block.
Most teams get misled because they stare at one number and ignore the mix of signals behind it. A low bounce rate can look “healthy” while complaints quietly rise and hurt your sender reputation. A high reply rate can feel like proof you should scale, even if a big chunk of your volume is landing in spam and only the most motivated people are finding you.
A deliverability score is useful when it forces the right question: is this slice of outbound helping our inbox placement, or quietly dragging it down?
The missing piece is segmentation. Results are rarely uniform across your whole list. One slice might be clean and responsive, while another is full of old data or people who hate your angle. The score should be calculated per segment you can actually act on, like:
- List source (for example, Apollo vs referrals)
- Recipient domain type (corporate domains vs free inbox providers)
- Audience group (founders vs SDRs)
- Region (US vs EU)
- Offer angle (demo ask vs content-first)
Once you score segments separately, you stop making “average” decisions that hurt you.
This score isn’t for reporting. It’s a decision tool with three outcomes: scale, fix, or stop.
- Scale segments that deliver, get replies, and stay clean.
- Fix segments that could work but have a specific issue (often list quality).
- Stop segments that are likely to damage deliverability even if they occasionally reply.
Example: your “US SaaS founders” segment gets steady replies with almost no complaints, while your “all sales titles” segment gets similar replies but more complaints and more bounces. Without segment scoring, you might scale both. With it, you scale founders and clean, reframe, or pause the sales-titles segment.
How to define segments you can actually act on
A segment is only useful if one clear change would improve it. If you can’t name the action you’d take based on the score, it’s not a useful segment.
Start simple. One-rule segments are easier to understand and fix. “Prospects from Apollo” is actionable: if it scores poorly, you can tighten filters, validate emails, or change the source. Multi-rule segments are fine when they represent a real go-to-market motion, but avoid stacking rules just to make the number look better. That usually creates noise.
Good segment cuts are the ones that lead to obvious decisions, such as:
- Lead source
- Industry
- Seniority
- Domain type
- Offer angle
Pick 2 to 4 segment types and stick with them for a month. If you define 20 segments, you won’t have the time to fix any of them.
Minimum volume (so you don’t act on noise)
Numbers need enough volume to be trusted. Don’t judge a segment on a tiny sample where a couple of bounces or one complaint can swing the rate.
A practical minimum before you act:
- At least 300 delivered emails in the segment
- Or at least 10 total replies if reply rate is part of your decision
Recalculate on a schedule that matches your sending speed. Weekly is a solid default: it smooths daily swings and still catches problems early. Daily recalculation only makes sense if you send a lot per segment and you will actually pause or adjust within 24 hours.
One important detail: define segments by who you targeted, not by how they replied. The score should help you fix targeting and sending health, not just chase last week’s responses.
The three inputs: bounces, complaints, replies
A segment-level score works only if the inputs are simple, reliable, and tied to what hurts deliverability. Three metrics do most of the work: bounce rate, complaint rate, and reply rate.
Bounce rate (delivery failure)
Bounces tell you whether you’re reaching real inboxes. A high bounce rate is usually not a copy problem. It points to list quality issues like bad enrichment, old data, or risky domains.
As a gut-check for cold outbound:
- Under 2% is usually fine.
- 2% to 5% is a warning sign that the segment data is getting messy.
- Over 5% is a stop-and-fix signal before you scale.
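As a quick sketch, these thresholds can be turned into a simple check (the function name is just for illustration; rates are decimals, so 2% is 0.02):

```python
def bounce_band(bounce_rate: float) -> str:
    """Classify a segment's hard-bounce rate (decimal, so 2% = 0.02)."""
    if bounce_rate < 0.02:
        return "fine"           # under 2%: usually fine
    if bounce_rate <= 0.05:
        return "warning"        # 2-5%: segment data is getting messy
    return "stop-and-fix"       # over 5%: fix before you scale
```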
Even if you get replies, bounces can hurt sender reputation fast, especially when the segment is large.
Complaint rate (spam complaints)
Complaints are small in number but heavy in impact. One person hitting “Report spam” can hurt more than ten people ignoring you.
For most teams, the goal is simple: keep complaints close to zero. If a segment produces repeated complaints, treat it as a red flag even if other metrics look good.
Reply rate (deliverability plus relevance)
Replies are a useful combined signal. You only get replies if emails arrive and the message fits the audience.
But read reply rate carefully. A segment with a 6% reply rate that’s mostly “not interested” might be deliverable, just mismatched. If you can separate real replies from auto-responses (like out-of-office), do it. Otherwise, reply rate gets inflated and stops being comparable.
Where unsubscribes fit
Unsubscribes are best used as a guardrail, not part of the score. Track unsubscribe rate separately and set a hard cap. If it breaks the cap, don’t scale that segment until you adjust targeting and messaging.
A simple scoring model you can explain in one minute
A segment-level score should be boring: one number on a 0-100 scale that stays consistent over time. If the formula keeps changing, you’ll end up chasing the score instead of fixing the underlying problem.
This model keeps the priorities straight: complaints get the biggest penalty, bounces get a strong penalty, replies add a limited boost.
The score (0-100)
Use rates as decimals (so 2% is 0.02):
score = 100
- 800 * bounce_rate
- 2000 * complaint_rate
+ 100 * reply_rate
Clamp score to 0-100.
Why these weights?
- Complaints are rare but serious, so the penalty is steep.
- Bounces matter because they can quickly drag down reputation, and they’re often fixable by cleaning the list.
- Replies are positive, but they shouldn’t “buy” your way out of complaint risk.
Before you calculate, keep definitions consistent:
- Bounce rate: hard bounces only (invalid mailbox), not soft bounces
- Complaint rate: spam complaints, not unsubscribes
- Reply rate: replies from delivered emails, ideally excluding out-of-office
Guardrails and small-sample protection
Even a decent-looking score can hide a real problem if the segment is small. Add a few rules so you don’t scale on noise:
- Auto-fail: if complaint rate > 0.30%, set score to 0 and pause the segment
- Bounce cap: if bounce rate > 5%, cap score at 30 until you fix the list
- Minimum volume: only score segments with at least 300 delivered emails in the window
- Reply boost cap: limit the reply boost to +15 points so one chatty segment can’t mask risk
Example: complaints 0.10% (0.001), bounces 1.5% (0.015), replies 6% (0.06).
Score = 100 - 800(0.015) - 2000(0.001) + 100(0.06) = 100 - 12 - 2 + 6 = 92.
If complaints jumped to 0.4% (0.004), the segment would auto-fail regardless of replies.
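The formula and guardrails above can be sketched in Python (function and parameter names are illustrative; rates are decimals):

```python
def deliverability_score(bounce_rate, complaint_rate, reply_rate,
                         delivered, min_delivered=300):
    """Segment score on a 0-100 scale. Rates are decimals (2% = 0.02)."""
    # Minimum volume: don't score thin segments at all.
    if delivered < min_delivered:
        return None
    # Auto-fail: complaint rate above 0.30% zeroes the segment.
    if complaint_rate > 0.003:
        return 0.0
    score = (100
             - 800 * bounce_rate
             - 2000 * complaint_rate
             + min(100 * reply_rate, 15))  # reply boost capped at +15
    # Bounce cap: over 5% bounces caps the score at 30 until the list is fixed.
    if bounce_rate > 0.05:
        score = min(score, 30.0)
    # Clamp to the 0-100 scale.
    return max(0.0, min(100.0, score))
```

With the example numbers above (bounces 1.5%, complaints 0.10%, replies 6%, 1,000 delivered) this returns 92; bumping complaints to 0.4% returns 0 regardless of replies.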
Step-by-step: calculate the score for each segment
Pick one time window and stick to it. For most teams, the last 7 days catches problems quickly, while 14 days smooths out noise. Whatever you choose, keep it consistent so the score is comparable week to week.
For each segment, pull raw counts from the same window:
- Sent
- Delivered (sent minus hard bounces)
- Hard bounces
- Spam complaints
- Human replies
- Unsubscribes (tracked separately)
Then calculate the rates:
- Bounce rate = hard bounces / sent
- Complaint rate = spam complaints / delivered
- Reply rate = human replies / delivered
On replies: exclude auto replies like out-of-office if you can. They inflate reply rate but don’t mean your message is welcome.
Apply the formula, then label each segment with a simple next step:
- Scale: score 80-100 and stable or improving
- Watch: score 60-79 or flat trend
- Fix: score 40-59 or dropping week over week
- Stop: score below 40, or any complaint spike you can’t explain
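The counts-to-label steps above can be sketched like this (a simplified version: pick a band by score, then demote one band if the trend is falling; names are illustrative):

```python
def rates(sent, hard_bounces, complaints, human_replies):
    """Turn raw window counts into the three input rates."""
    delivered = sent - hard_bounces
    return {
        "bounce_rate": hard_bounces / sent,
        "complaint_rate": complaints / delivered,
        "reply_rate": human_replies / delivered,
    }

def label(score, falling=False):
    """Map a 0-100 score (and a falling trend) to a next step."""
    bands = ["Stop", "Fix", "Watch", "Scale"]
    if score >= 80:
        i = 3
    elif score >= 60:
        i = 2
    elif score >= 40:
        i = 1
    else:
        i = 0
    if falling and i > 0:
        i -= 1  # a dropping week-over-week trend demotes one band
    return bands[i]
```

The demotion rule is one way to encode the point below: a high score with a falling trend is treated more cautiously than the raw number suggests.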
Don’t treat the latest number as the whole story. Track the trend for each segment (this week, last week, two weeks ago). A segment at 72 and rising is often safer to scale than one at 82 that’s falling fast.
How to use the score to decide what to scale
A score is only useful if it leads to consistent actions. Treat your segment score like a traffic light: green segments earn more volume, yellow segments get watched and tested carefully, and red segments get fixed before you send another batch.
Keep rules stable so you don’t renegotiate every week:
- 80 to 100 (scale): increase volume in small steps and keep targeting and copy mostly the same.
- 60 to 79 (watch): hold volume steady and run one low-risk test (list quality, message fit, or offer).
- Below 60 (fix or stop): stop scaling, fix the root cause (and pause the segment entirely if it drops below 40), and only resume once the score climbs.
When a segment is in the scale band, avoid doubling volume overnight. Increase gradually (for example, 10% to 20% per week) and re-check after each change. Sudden spikes can raise bounces and complaints, which can drag down all your mailboxes.
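A gradual ramp can be sketched as a simple plan (a 15% weekly step, inside the 10% to 20% range above; names are illustrative):

```python
def ramp_plan(weekly_volume, weeks=4, growth=0.15):
    """Plan gradual weekly volume increases instead of doubling overnight."""
    plan = []
    for _ in range(weeks):
        weekly_volume = round(weekly_volume * (1 + growth))
        plan.append(weekly_volume)
    return plan
```

Starting from 1,000 sends a week, this yields 1150, 1322, 1520, 1748 over four weeks; re-check the score after each step before applying the next.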
Some signals should override the score and trigger an immediate pause:
- A sudden jump in hard bounces usually means the list is stale or too broad.
- A rise in complaint rate is even more serious and should be treated as an emergency brake.
Use the score to choose which experiments are safe. Strong segments are where you can test bolder ideas because you have a cushion. Weak segments are where you should focus on risk reduction first.
Example: scoring segments to choose a scaling winner
Imagine you run outbound for one product, but you have two list sources (Apollo and event attendees), three industries (SaaS, marketing agencies, healthcare), and two offers (a short teardown or a relevant case study). You don’t want to scale everything. You want one or two clear winners.
After the first 1,000 sends each (same sending setup, similar subject length, same mailbox pool), you calculate the score per segment:
| Segment (source + industry + offer) | Bounce | Complaints | Replies | What it means |
|---|---|---|---|---|
| Event + SaaS + Teardown | 1.2% | 0.02% | 3.8% | Strong scale candidate |
| Apollo + SaaS + Teardown | 2.8% | 0.04% | 2.4% | Clean list, then scale |
| Event + Agencies + Case study | 1.0% | 0.03% | 1.6% | Safe, but offer may be weak |
| Apollo + Agencies + Teardown | 4.6% | 0.08% | 1.9% | Fix bounces first |
| Event + Healthcare + Case study | 1.4% | 0.06% | 0.9% | Deliverable, not persuasive |
| Apollo + Healthcare + Teardown | 6.2% | 0.12% | 0.6% | Stop and rebuild segment |
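Running the table's rates through the scoring formula from earlier makes the ranking concrete (a sketch; the 5% bounce cap applies to the last row):

```python
def score(b, c, r):
    """100 - 800*bounce - 2000*complaint + reply boost (capped at +15)."""
    s = 100 - 800 * b - 2000 * c + min(100 * r, 15)
    if b > 0.05:                 # bounce cap: over 5% caps the score at 30
        s = min(s, 30)
    return max(0, min(100, round(s, 1)))

segments = {
    "Event + SaaS + Teardown":         (0.012, 0.0002, 0.038),
    "Apollo + SaaS + Teardown":        (0.028, 0.0004, 0.024),
    "Event + Agencies + Case study":   (0.010, 0.0003, 0.016),
    "Apollo + Agencies + Teardown":    (0.046, 0.0008, 0.019),
    "Event + Healthcare + Case study": (0.014, 0.0006, 0.009),
    "Apollo + Healthcare + Teardown":  (0.062, 0.0012, 0.006),
}
scores = {name: score(*x) for name, x in segments.items()}
```

That puts the event-sourced SaaS segment on top (about 94), leaves the Apollo SaaS segment just under the scale band (about 79), and caps Apollo healthcare at 30 because its bounce rate exceeds 5%.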
Two things stand out:
- The event list is consistently healthier (lower bounces and complaints).
- Healthcare isn’t responding to the teardown offer, and the Apollo version is actively risky.
Actions based on this:
- Scale “Event + SaaS + Teardown” first by adding volume slowly, and keep it separate so it doesn’t hide problems elsewhere.
- For “Apollo + SaaS,” clean the list before adding volume (remove risky titles, validate emails, tighten company filters), then rerun a small batch.
- For “Event + Healthcare,” keep deliverability steady but change the offer and message since the issue is replies.
- For “Apollo + Healthcare,” pause and fix targeting and list hygiene before you burn sender reputation.
Common mistakes that make the score useless
A segment score only helps if it stays consistent and points to a real cause. Most failures come from mixing signals that mean different things, or trusting data that’s too thin.
Common ways teams break the score:
- Treating reply rate as the only signal. Replies can drop because of the offer, timing, or targeting. Deliverability problems often show up faster in bounces and complaints.
- Mixing brand-new mailboxes with warmed ones. New senders behave differently and usually need lower volume. If you average them together, the score explains nothing.
- Trusting tiny samples. Set a minimum volume before scoring. Otherwise one complaint can cause random decisions.
- Changing the weights every week. If the formula moves, you can’t compare this week to last week.
- Ignoring sending domains or mailbox groups. Deliverability is often domain-specific. One bad domain can drag down a combined view and make you pause the wrong segment.
If a segment “improves” only because you reclassified replies or changed definitions, that’s not deliverability. Keep definitions stable so the score reflects sender health, not shifting labels.
Quick checks before you scale a segment
Before you increase volume, take five minutes to sanity-check the segment. A score can look great on paper, but one hidden issue can hurt inbox placement when you scale.
Scan these signals in the same scoring window (for example, the last 7 days), then compare to the previous window:
- Bounce type mix: If bounces are mostly hard (bad address, domain doesn’t exist), scaling will only create more damage.
- Any complaints at all: If complaints appeared in the latest window but not the one before, treat the segment as unstable and hold off.
- Reply trend: Stable is good, rising is better. If replies are drifting down week over week, scaling often makes the drop worse.
- Recent changes: New segment rules, a new list source, or new copy can reset performance. Wait for enough sends to judge it, or roll back.
- Sudden volume shifts: If you doubled sends in the last 48 hours, the score may be lagging behind reality. Ramp in steps.
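These checks can be sketched as a small pre-scale gate (field names are illustrative; `cur` and `prev` hold the current and previous window's metrics):

```python
def prescale_flags(cur, prev):
    """Return reasons to hold off scaling; an empty list means clear to ramp."""
    flags = []
    if cur["complaints"] > 0 and prev["complaints"] == 0:
        flags.append("complaints appeared in the latest window")
    if cur["reply_rate"] < prev["reply_rate"]:
        flags.append("reply rate drifting down week over week")
    if cur["sent"] > 2 * prev["sent"]:
        flags.append("volume more than doubled; the score may lag reality")
    if cur["hard_bounce_rate"] > prev["hard_bounce_rate"]:
        flags.append("hard bounces rising; check list quality")
    return flags
```

Any non-empty result is a reason to hold volume steady for a window before scaling.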
Example: your “US agencies” segment scores well, but volume jumped after adding a new prospect source and hard bounces rose. Even if the score still looks healthy, scaling now is risky. Fix list quality first, then re-score after a clean window.
Next steps: make the score part of your weekly routine
Treat your deliverability score like a weekly health check. Pick one day and time, review the same segments every week, and make one or two changes you can track.
Protect list quality first. If a segment shows higher bounces or complaints, don’t push through it with more volume. Pause, remove obvious invalids (role accounts, bad patterns, old exports), and be careful with sources you can’t verify.
Keep volume changes gradual. Ramp per domain and per mailbox, not just total sends, so you don’t spike bounces or complaints.
A simple weekly routine:
- Recalculate the score for each segment using the last 7 days (or last 1,000 sends).
- Stop sends to any segment that crosses your bounce or complaint red lines.
- Narrow targeting or rewrite the offer for low-reply segments that are otherwise clean.
- Scale only one winning segment per week and cap the increase.
- Log what changed (list source, copy, offer, volume) so you can learn.
If you want this to be easy to maintain, it helps when bounces, complaints, and reply types are visible in one place. LeadTrain (leadtrain.app) is built around that workflow, combining sending setup, warm-up, sequences, and reply classification so you can review segments quickly and act without jumping between tools.
FAQ
Why is a segment-level deliverability score better than one overall score?
A segment-level score stops you from making “average” decisions. Instead of one blended number, you see which specific slice of outreach is helping inbox placement and which slice is quietly adding bounces or complaints.
How do I pick segments that are actually actionable?
Start with segments where one clear action could improve results, like lead source, domain type, industry, seniority, or offer angle. If you can’t name what you’d change after seeing the score, the segment isn’t useful.
What’s the minimum volume I should have before trusting a segment score?
As a practical baseline, wait for at least 300 delivered emails in a segment, or 10 total replies if replies are part of your decision. Smaller samples swing too easily from one complaint or a couple of bounces.
What exactly should count as bounce rate, complaint rate, and reply rate?
Use hard bounces only for bounce rate, count spam complaints for complaint rate, and measure reply rate from delivered emails. Keep definitions stable so week-to-week changes reflect reality, not shifting labels.
What’s a simple scoring formula I can use right away?
Use a simple 0–100 formula that punishes complaints the most, penalizes bounces strongly, and gives replies a limited boost. The point is consistent decisions over time, not a perfect number.
Should unsubscribes be part of the deliverability score?
They’re best as a separate guardrail. Track unsubscribe rate on its own and set a hard cap; if a segment crosses it, don’t scale until you tighten targeting or adjust the message.
How do I use the score to decide “scale, fix, or stop”?
Scale segments that are clean and stable, fix segments that show a specific issue (often list quality or mismatch), and stop segments that keep producing complaints or high bounces. Treat the score like a traffic light, not a report.
Can a segment be “good” on replies but still too risky to scale?
Yes, and you should pause quickly. A practical rule is to auto-fail a segment if complaint rate goes above 0.30%, even if replies look good, because complaints can damage reputation fast.
How fast can I safely increase volume on a high-scoring segment?
Ramp volume in small steps (around 10% to 20% per week) and re-check after each change. Sudden spikes can increase bounces and complaints and drag down deliverability across your mailboxes.
What are the most common mistakes that make segment scoring useless?
The most common mistakes are trusting tiny samples, treating reply rate as the only signal, mixing new mailboxes with warmed ones, changing weights every week, and ignoring domain or mailbox group differences. Any of those can make the score misleading.