A/B Testing

A/B testing in NimbusOS is built on top of the standard campaign model. Every campaign can split traffic across multiple variants, measure outcomes with statistical confidence, and automatically promote the winner. The platform enforces a minimum sample size and a confidence threshold before declaring a winner, which prevents the false-positive problem of calling a winner too early on noisy data. This article covers the test setup, the variant model, the winner selection logic, and the patterns that produce real learnings.

What You Can Test

A/B testing supports four test dimensions.

Subject line. Two or more subject variants. Body is identical. Most common test type.

Opener line. Two or more first-line variants. Subject and rest of body identical. Useful for testing personalization approaches.

CTA. Different calls to action at the end of the email. Meeting ask vs question vs resource offer.

Send time. Same content, different send windows. Useful when timing matters but copy is fixed.

You can combine dimensions, but each additional dimension at least doubles the sample size needed for significance. Most teams test one dimension at a time.

The ABTest Model

A test is an ABTest object with:

campaign_id - the campaign this test is attached to
test_type - subject, opener, cta, send_time, or multi
metric - the outcome metric being optimized (open rate, reply rate, positive reply rate)
variants - list of ABTestVariant objects
minimum_sample_size - per variant. Default 100.
confidence_threshold - default 0.95
max_duration_hours - default 168 (7 days)
auto_select_winner - default true
auto_allocate_to_winner - default true

Each ABTestVariant has:

content_reference - subject, opener, or template variant
traffic_allocation - percentage of sends directed to this variant
Per-variant counters: sends, delivered, opened, clicked, replied, positive_replies

Traffic allocations across variants must sum to 100.
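
The two models above can be sketched as Python dataclasses. This is an illustration of the documented fields and defaults, not the platform's actual schema: the name field, the TEST_TYPES set, and the validate() method are assumptions added for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Test types listed in the article.
TEST_TYPES = {"subject", "opener", "cta", "send_time", "multi"}


@dataclass
class ABTestVariant:
    name: str                  # assumed field; the article only says variants are named
    content_reference: str     # subject, opener, or template variant
    traffic_allocation: float  # percentage of sends, 0 to 100
    # Per-variant counters
    sends: int = 0
    delivered: int = 0
    opened: int = 0
    clicked: int = 0
    replied: int = 0
    positive_replies: int = 0


@dataclass
class ABTest:
    campaign_id: str
    test_type: str
    metric: str
    variants: List[ABTestVariant] = field(default_factory=list)
    minimum_sample_size: int = 100      # per variant
    confidence_threshold: float = 0.95
    max_duration_hours: int = 168       # 7 days
    auto_select_winner: bool = True
    auto_allocate_to_winner: bool = True

    def validate(self) -> None:
        """Hypothetical helper: check the rules stated in the article."""
        if self.test_type not in TEST_TYPES:
            raise ValueError(f"unknown test_type: {self.test_type}")
        if len(self.variants) < 2:
            raise ValueError("a test needs two or more variants")
        total = sum(v.traffic_allocation for v in self.variants)
        if abs(total - 100.0) > 1e-6:
            raise ValueError(f"traffic allocations sum to {total}, expected 100")
```

A 70/30 split passes validation; a 70/20 split raises ValueError because the allocations do not sum to 100.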

Setting Up a Test

From the Campaign detail, click Add A/B Test. Pick the test type. Add two or more variants.

Variant naming. Use descriptive names. A name like subject_direct_ask beats variant_b because reports and notifications reference the name.

Initial allocation. Even split across variants is the default. Uneven splits (70/30, 80/20) are useful when you have a known-good default and are testing a risky variant; the allocation caps exposure to the experimental variant.

Minimum sample size. Default 100 per variant is the absolute floor. For reply rate optimization (the metric that matters most and is noisiest), consider 500 per variant. Subject line open rate is usually decisive at 200 per variant.
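
For a rough sense of how sample size depends on the lift you hope to detect, the sketch below uses the textbook normal-approximation formula for comparing two proportions. It is not how NimbusOS sets or checks minimum_sample_size; the function name and the 80 percent power default are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float,
                            relative_lift: float,
                            confidence: float = 0.95,
                            power: float = 0.80) -> int:
    """Approximate sends per variant needed to detect a relative lift.

    Two-sided test on two proportions, normal approximation.
    `power` is an assumed parameter; the article does not mention it.
    """
    if not 0 < baseline_rate < 1 or relative_lift <= 0:
        raise ValueError("baseline_rate must be in (0, 1) and relative_lift > 0")
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if p2 >= 1:
        raise ValueError("lifted rate must stay below 1")
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

The required sample grows quickly as the baseline rate or the target lift shrinks, which is why reply rate tests need more sends than open rate tests.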

Metric. Open rate is easier to move and is a fine early-stage test metric. Reply rate is harder to move and is what you actually want to optimize for. Positive reply rate is the hardest and most valuable; use only on high-volume campaigns.

How Variants Are Rendered

At send time, the sequence engine checks whether the step has an A/B test attached. If yes, it picks a variant using weighted random selection (aligned with traffic allocation). The variant's content overrides the step's default content for this single send.
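
Weighted random selection can be sketched with the standard library. The dictionary shape and the pick_variant name are assumptions for the example; only the idea of weighting by traffic_allocation comes from the article.

```python
import random
from typing import Dict, List, Optional


def pick_variant(variants: List[Dict], rng: Optional[random.Random] = None) -> str:
    """Pick one variant name, weighted by its traffic_allocation."""
    if rng is None:
        rng = random.Random()
    names = [v["name"] for v in variants]
    weights = [v["traffic_allocation"] for v in variants]
    # random.choices draws with replacement using relative weights.
    return rng.choices(names, weights=weights, k=1)[0]


variants = [
    {"name": "known_good_default", "traffic_allocation": 70},
    {"name": "risky_variant", "traffic_allocation": 30},
]
chosen = pick_variant(variants)
```

Over many sends the observed split converges on the configured allocation, but any short run of sends can deviate from it.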

Per-contact consistency: a contact is assigned to one variant and stays on it across follow-up steps, so a contact who received variant A in step 1 receives variant A's content in step 2 as well. This is called consistent bucketing and is enforced by a contact-to-variant mapping cached per campaign.
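
A minimal sketch of consistent bucketing, assuming an in-memory dictionary stands in for the per-campaign cache. The ConsistentBucketer class and its method names are hypothetical.

```python
import random
from typing import Dict, Optional, Tuple


class ConsistentBucketer:
    """Assign each contact to a variant once, then always return that variant."""

    def __init__(self, allocations: Dict[str, float], seed: Optional[int] = None):
        self._names = list(allocations)
        self._weights = list(allocations.values())
        self._rng = random.Random(seed)
        # (campaign_id, contact_id) -> variant name; stands in for the cache.
        self._mapping: Dict[Tuple[str, str], str] = {}

    def variant_for(self, campaign_id: str, contact_id: str) -> str:
        key = (campaign_id, contact_id)
        if key not in self._mapping:
            # First send to this contact: weighted random pick, then remember it.
            self._mapping[key] = self._rng.choices(
                self._names, weights=self._weights, k=1
            )[0]
        return self._mapping[key]
```

Every follow-up step calls variant_for with the same key and gets the same answer, so a contact never crosses between variants.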

The ABTestInsight Model

As the test runs, the platform computes statistical inference and writes results to ABTestInsight:

p_value
effect_size - relative lift of winner over control
confidence_interval - typically 95 percent
statistical_test - chi-square or t-test, depending on metric
winner_variant_id - null until confidence threshold is crossed
status - running, winner_detected, concluded, inconclusive

The insight updates every 30 minutes during the test.
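
To make p_value and effect_size concrete, here is a sketch for rate metrics using a two-proportion z-test. For two variants this is equivalent to a chi-square test on the 2x2 table without continuity correction. The platform's exact computation is not documented here, so treat the function names and the choice of test as assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(success_a: int, n_a: int,
                           success_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # both rates are 0 or 1; no evidence of a difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Relative lift of a variant over control (0.10 means 10 percent)."""
    return (variant_rate - control_rate) / control_rate


# Example: control opened 40 of 200, variant opened 80 of 200.
p = two_proportion_p_value(40, 200, 80, 200)
lift = relative_lift(40 / 200, 80 / 200)
```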

Winner Selection Logic

A winner is declared when all of the following hold:

Each variant has sent at least minimum_sample_size messages.
The highest-performing variant has a p-value below 1 - confidence_threshold (0.05 at the default threshold of 0.95).
The relative lift is at least 10 percent (a floor that prevents calling a 0.5 percent lift a winner).
No variant is currently tied at the confidence threshold (prevents flip-flop winners during a tight test).
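
The four conditions above can be sketched as a single check. The VariantResult shape is hypothetical, the p-value and lift are assumed to be precomputed against control, and the tie rule is one possible reading: more than one variant clearing both bars counts as a tie.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariantResult:
    variant_id: str
    sends: int
    p_value: float        # versus control; assumed precomputed
    relative_lift: float  # versus control; 0.12 means 12 percent


def select_winner(results: List[VariantResult],
                  minimum_sample_size: int = 100,
                  confidence_threshold: float = 0.95,
                  min_relative_lift: float = 0.10) -> Optional[str]:
    """Return the winning variant_id, or None if no winner can be declared."""
    # 1. Every variant must have reached the minimum sample size.
    if any(r.sends < minimum_sample_size for r in results):
        return None
    alpha = 1 - confidence_threshold
    # 2 and 3. Significance and the lift floor.
    candidates = [
        r for r in results
        if r.p_value < alpha and r.relative_lift >= min_relative_lift
    ]
    # 4. Assumed tie rule: exactly one variant may clear both bars.
    if len(candidates) != 1:
        return None
    return candidates[0].variant_id
```

Returning None leaves winner_variant_id null, which matches the insight model until the next evaluation.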

When a winner is declared and auto_allocate_to_winner=true, traffic shifts to 100 percent on the winner within the next send cycle. The loser variants are marked concluded. Existing contacts already enrolled in a loser variant continue to see it for consistency.

Inconclusive Outcomes

Not every test produces a clear winner. If max_duration_hours is reached without crossing the confidence threshold, the test is marked inconclusive. The original traffic split continues unless you manually pick a winner.
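
The timeout rule can be sketched as a small status function. The resolve_status name and its arguments are assumptions; the status strings come from the ABTestInsight model, and the concluded status is left out because the article does not say when a whole test moves to it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def resolve_status(started_at: datetime,
                   now: datetime,
                   winner_variant_id: Optional[str],
                   max_duration_hours: int = 168) -> str:
    """Map test state to one of the documented status values."""
    if winner_variant_id is not None:
        return "winner_detected"
    if now - started_at >= timedelta(hours=max_duration_hours):
        # Time ran out without crossing the confidence threshold.
        return "inconclusive"
    return "running"


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
status = resolve_status(start, start + timedelta(days=8), None)
```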

Inconclusive does not mean useless. It usually means the variants are too similar and a different test dimension would be more productive.

Test Duration

Typical durations by metric:

Open rate test with 200 sends per variant: 2 to 4 days.
Reply rate test with 500 sends per variant: 5 to 10 days.
Positive reply rate test with 1,000 sends per variant: 14 to 21 days.

Setting max_duration_hours shorter than the realistic duration produces inconclusive outcomes repeatedly. Err longer and let auto_select_winner fire whenever confidence is reached.
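
A back-of-envelope check for whether max_duration_hours leaves enough time, assuming a steady daily send volume. The sends_per_day input and the function name are hypothetical. The estimate only covers the time to send; opens and replies arrive later, so real tests take longer than this figure.

```python
from math import ceil
from typing import List


def hours_to_fill_sample(minimum_sample_size: int,
                         sends_per_day: float,
                         allocations: List[float]) -> int:
    """Hours until the variant with the smallest allocation reaches the minimum."""
    slowest = min(allocations)  # percentage, e.g. 20 for a 20 percent share
    total_sends_needed = minimum_sample_size * 100 / slowest
    return ceil(total_sends_needed / sends_per_day * 24)


even_split = hours_to_fill_sample(500, 200, [50, 50])    # 120 hours
uneven_split = hours_to_fill_sample(500, 200, [80, 20])  # 300 hours
```

At 200 sends per day, the even split fits inside the 168-hour default, while the 80/20 split does not, because the smaller arm fills slowly.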

Multi-Variant Testing (A/B/C)

Three-way and four-way tests are supported. The total sample size scales with the number of variants: a three-way test at 100 per variant needs 300 total sends, compared with 200 for a two-way test.

Beyond four variants, the sample size quickly exceeds what a single campaign can produce. Consider splitting into sequential tests.

The Copy Standards Filter in A/B

Every variant passes through the Copy Standards filter at save. The filter blocks em dashes, banned AI phrases, and other anti-patterns. A variant that fails the filter cannot be activated; fix the variant first.
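
A toy version of such a filter, to show the shape of the check. The real Copy Standards rule set is not listed in this article, so the banned phrases below are placeholders and the function name is hypothetical.

```python
from typing import List

EM_DASH = "\u2014"
# Placeholder phrases for illustration; not the platform's actual list.
BANNED_PHRASES = ("i hope this email finds you well", "delve into")


def copy_standards_violations(text: str) -> List[str]:
    """Return a list of violations; an empty list means the variant passes."""
    violations = []
    if EM_DASH in text:
        violations.append("em dash")
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            violations.append(f"banned phrase: {phrase}")
    return violations


def can_activate(text: str) -> bool:
    return not copy_standards_violations(text)
```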

This is how the platform prevents a bad variant from leaking into production.

Common A/B Testing Mistakes

Four patterns that waste test cycles.

Testing on too small a sample

A test with 20 sends per variant is not a test; it is noise. The platform warns but does not block. Respect the minimum sample size.

Testing too many things at once

Changing subject, opener, and CTA in one variant makes it impossible to attribute the outcome to a single change. Test one dimension at a time.

Stopping early at the first signal

"Variant B is ahead after 50 sends, let's call it." Almost always wrong. The lift at 50 sends is usually not stable at 500 sends. Wait for the minimum sample size and the confidence threshold.

Not running new tests after a winner

The winner tells you what works now. What works now will not work in 6 months. Cold outreach copy fatigue is real. Treat each winner as the start of the next test, not the end.

Cross-Campaign Test Insights

The Growth Brain aggregates A/B test insights across campaigns and workspaces (anonymized). Patterns that emerge:

"Short subject lines (under 7 words) beat long subject lines in tier A campaigns 70 percent of the time."

"Openers referencing a specific mutual connection outperform generic compliments 2x in enterprise campaigns."

"Send times between 8 and 10 AM recipient local time outperform 2 to 4 PM for first sends."

These cross-workspace signals are surfaced in the Growth Recommendations feed. They inform your next test design.

Test Archive

Every concluded test is archived with full data. The archive is queryable from the A/B Testing dashboard. Useful for:

Institutional knowledge (what have we already tested)
Re-running old winners to check for fatigue
Training new team members on what works in your context

Archives are retained for 24 months by default.

Troubleshooting

"Test has been running for a week with no winner"

Either the variants are too similar or the sample size target is too high. Check the insight detail for current p-value and lift. If lift is under 5 percent, your variants are not meaningfully different; design a bolder variant. If lift is high but p-value is not crossing, volume is the bottleneck; let it run longer or increase send rate.
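
The triage above can be written as a small helper that reads the current p-value and lift from the insight detail. The function name and return labels are hypothetical, and since the article does not define "high" lift, the sketch treats anything at or above 5 percent as volume-bound.

```python
def diagnose_stalled_test(p_value: float,
                          relative_lift: float,
                          confidence_threshold: float = 0.95) -> str:
    """Suggest why a long-running test has no winner yet."""
    alpha = 1 - confidence_threshold
    if p_value < alpha:
        # Significance is already reached; something else is blocking,
        # such as the lift floor or a tie between variants.
        return "check_lift_floor_or_tie"
    if abs(relative_lift) < 0.05:
        return "variants_too_similar"  # design a bolder variant
    return "volume_bottleneck"         # run longer or increase send rate


suggestion = diagnose_stalled_test(p_value=0.20, relative_lift=0.18)
```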

"Winner was declared but traffic did not shift"

The most likely cause is that auto_allocate_to_winner is set to false. Check the test configuration.

"Same contact is seeing both variants"

Consistent bucketing is not being applied. It is enforced by default, so this indicates a bug. File a support ticket with the test ID.

"Test ended inconclusive but I saw a clear difference"

The observed lift may be real, but the sample was too small to establish statistical confidence. Re-run the test with a larger sample size, or accept that the difference is small and pick based on qualitative judgment.

Frequently Asked Questions

Can I A/B test send time?

Yes. Configure variants with different send_time_mode or send_hour values. Use open rate or reply rate as the metric. This is useful only on campaigns with enough volume to be decisive on timing.

Does A/B testing work with personalization variables?

Yes. Variables render per variant per contact. You can test "with personalized opener" vs "without personalized opener" to measure the actual lift the personalization engine produces.

Can I run a multivariate test (full factorial)?

Not directly. NimbusOS supports multi-variant tests (A/B/C, A/B/C/D) but not full factorial designs. The sample size requirements for full multivariate exceed what most campaigns produce.

Is there a minimum effect size to detect?

Default 10 percent relative lift. You can lower to 5 percent at the cost of longer tests, or raise to 20 percent to avoid declaring marginal winners.

What to Read Next

Useful next pages after this one: Campaign Analytics for the real-time view of test variants in an active campaign, Email Templates for the template variant model, and Reply Intelligence for the reply classification that drives positive reply rate metrics.