Splitstream
A/B testing and experimentation API with sticky bucketing, mutex experiment groups, Sample Ratio Mismatch detection, guardrail metrics, an Octane carve-out for the hot path, and a Bayesian stats engine whose decision rule is empirically calibrated — not asserted.

Overview
A portfolio build shaped for growth and experimentation teams. Pennant decides whether a feature ships; Splitstream decides which variant a user sees and whether the variant moved the metric. The pair (via the @philiprehberger/growth umbrella) is aimed at the data-aware buyer weighing self-hosting against Statsig / Eppo / GrowthBook.
The headline move — calibration as a deliverable, not a claim:
A common-but-incorrect intuition about Bayesian A/B testing: optional stopping is "free" under a posterior threshold. It isn't. Set posterior_threshold=0.95, min_sample=1,000, peek every 15 minutes — the realised false-positive rate runs 50-60%, not 5%. The always-valid-inference literature (Howard, Ramdas, e-values) has been saying so for a decade.
Before shipping a single endpoint, I ran 360,000 null-effect simulations across a 36-cell grid of (posterior_threshold × min_sample × cadence). The plan's draft defaults landed at 66% empirical FP. The shipped tuple — 0.995 / 20,000 / 240 min — lands at 4.77% empirical FP. That tuple is the v0.1 default every new experiment inherits. The simulation lives at stats/notebooks/00_calibration.py; the JSON is committed at stats/calibration_results.json; the docs site renders the table inline at /docs/concepts/bayesian-inference; the /stats-playground page exposes the sliders for prospects to verify the table in their browser.
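The effect of peeking can be reproduced at toy scale. The sketch below is not the code in stats/notebooks/00_calibration.py; it is a minimal illustration under several assumptions of mine: a two-sided decision rule, a normal approximation to P(B>A) under Beta(1, 1) priors, a 10% base conversion rate, and a fixed number of units per arm between peeks. The function and parameter names are hypothetical.

```python
import math
import random


def prob_b_beats_a(succ_a, n_a, succ_b, n_b):
    """Normal approximation to P(rate_B > rate_A) under Beta(1, 1) priors."""
    def moments(succ, n):
        alpha, beta = 1 + succ, 1 + n - succ
        total = alpha + beta
        # Mean and variance of a Beta(alpha, beta) posterior.
        return alpha / total, alpha * beta / (total * total * (total + 1))

    mean_a, var_a = moments(succ_a, n_a)
    mean_b, var_b = moments(succ_b, n_b)
    z = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def false_positive_rate(threshold, min_sample, per_peek, peeks, runs,
                        rate=0.1, seed=0):
    """Share of null-effect experiments that declare a winner at any peek."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        succ_a = succ_b = n = 0
        for _ in range(peeks):
            # Both arms draw from the same rate, so any "winner" is a false positive.
            succ_a += sum(rng.random() < rate for _ in range(per_peek))
            succ_b += sum(rng.random() < rate for _ in range(per_peek))
            n += per_peek
            if n < min_sample:
                continue
            p = prob_b_beats_a(succ_a, n, succ_b, n)
            if p > threshold or p < 1 - threshold:
                false_positives += 1
                break
    return false_positives / runs


if __name__ == "__main__":
    # Same total sample per arm (4,000); only the number of looks differs.
    print("40 peeks:", false_positive_rate(0.95, 100, 100, 40, runs=200))
    print("1 look:  ", false_positive_rate(0.95, 4000, 4000, 1, runs=200))
```

The absolute numbers from a toy like this will not match the 36-cell grid, which uses its own traffic model and sample sizes. The point is the direction: holding the threshold and the total sample fixed, adding looks raises the realised false-positive rate.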
One bucketing function, four implementations, byte-for-byte:
Every SDK (TypeScript, React, PHP-Laravel, Python-Django+FastAPI) plus the Laravel-internal server evaluator runs the same 10-row stats/corpus/bucketing.json in CI, including the non-ASCII identifier 新規ユーザー and a decomposed café that is canonically equivalent to the precomposed form. Unicode NFC normalisation has to land in every port, or returning users get re-bucketed when the same identifier arrives in a different normalisation form. Drift in any implementation fails all SDK CI jobs simultaneously. The same discipline applies to the stats engine corpus (stats/corpus/stats.json, 37 rows of conjugate updates + quantiles + chi-squared + Monte-Carlo P(B>A) against scipy reference values).
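A minimal sketch of why normalisation has to happen before hashing. SHA-256 and NFC come from the description above; everything else here is an assumption for illustration, not the shipped function: the "experiment:unit" input format, the use of the first 8 digest bytes, and the 10,000-bucket space.

```python
import hashlib
import unicodedata


def bucket(experiment_key: str, unit_id: str, buckets: int = 10_000) -> int:
    """Map a unit to a bucket in [0, buckets). Hypothetical input format."""
    # Normalise first, so canonically equivalent strings hash identically.
    normalised = unicodedata.normalize("NFC", unit_id)
    payload = f"{experiment_key}:{normalised}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Interpret the first 8 bytes as a big-endian unsigned integer.
    return int.from_bytes(digest[:8], "big") % buckets


if __name__ == "__main__":
    composed = "caf\u00e9"      # é as one code point
    decomposed = "cafe\u0301"   # e followed by a combining acute accent
    print(composed == decomposed)  # the raw strings differ
    print(bucket("checkout-v2", composed) == bucket("checkout-v2", decomposed))
```

Without the normalize call, the two spellings of café would encode to different bytes and hash to unrelated buckets, which is the kind of drift the shared corpus is meant to catch.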
Sticky-forever assignment + mutex groups:
The first /v1/assign call writes a row to a PostgreSQL table hash-partitioned by experiment_id into 16 children. PK = (experiment_id, unit_id) makes sticky lookup a single index seek. Per-partition secondary index on unit_id keeps the SDK reconciliation endpoint cheap. Mutex group enforcement uses SELECT FOR UPDATE so two simultaneous first-calls for the same unit converge to one claim — the other returns mutex_holdout. Force-rebucket atomically deletes the assignment AND sets superseded_at on every exposure event for the pair; the analysis worker filters tombstoned rows so the unit's pre-rebucket data doesn't contaminate the post-rebucket arm.
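The mutex-group race can be illustrated without a database. The sketch below is an in-memory analogue, not the Laravel implementation: a threading.Lock stands in for the row lock taken by SELECT FOR UPDATE, and a dict stands in for the claims table. The class and method names are hypothetical, and it assumes the race is between two experiments in the same group claiming the same unit.

```python
import threading


class MutexGroup:
    """In-memory stand-in for a mutex group guarded by a row lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}  # unit_id -> experiment_id that holds the unit

    def claim(self, unit_id, experiment_id):
        # The critical section mirrors "lock the row, then read or write it".
        with self._lock:
            holder = self._claims.setdefault(unit_id, experiment_id)
        return "assigned" if holder == experiment_id else "mutex_holdout"


if __name__ == "__main__":
    group = MutexGroup()
    results = []
    threads = [
        threading.Thread(target=lambda e=e: results.append(group.claim("user-1", e)))
        for e in ("exp-a", "exp-b")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sorted(results))
```

Whichever thread enters the critical section first keeps the claim, and repeating the winner's call returns the same answer, which is the sticky property the primary key provides in the real table.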
Hot-path Octane carve-out:
/v1/assign, /v1/events, /v1/exposures run on Laravel Octane (RoadRunner) behind an Apache mod_proxy split. The 20 ms p99 cache-hit and 30 ms p99 ingest targets are not achievable with per-request PHP-FPM workers under burst; Octane's sticky-worker model is what makes the budget hold. Filament admin + the management API route through Apache → php8.3-fpm, where the volume is low. PgBouncer in transaction-pooling mode fronts PostgreSQL for the PHP-FPM routes; Octane workers connect directly with persistent connections.
Buffered event ingest:
Synchronous endpoint, asynchronous persistence. POST /v1/events validates, deduplicates via Redis SETNX (30-day TTL covers the late-event acceptance window), pushes to a Redis list, returns 202 Accepted. A supervisord-managed worker drains via COPY ... FROM STDIN; on COPY failure a per-row INSERT fallback localises the bad row. Backpressure 429 with Retry-After when the buffer exceeds 100k. Late events 7–30 days old land in late_event_count; older than 30 days are rejected with 412.
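The accept path described above can be sketched as a small decision function. This is an in-memory illustration, not the Laravel controller: a set stands in for the Redis SETNX keys and a deque for the Redis list. The status codes and thresholds come from the text; the order of the checks, the 202 response for duplicates, and all names are assumptions of mine.

```python
from collections import deque
from datetime import datetime, timedelta, timezone

LATE_AFTER = timedelta(days=7)     # events at least this old count as late
REJECT_AFTER = timedelta(days=30)  # events older than this are rejected


class IngestBuffer:
    def __init__(self, limit=100_000):
        self.limit = limit
        self.seen = set()      # stands in for Redis SETNX dedup keys
        self.queue = deque()   # stands in for the Redis list
        self.late_event_count = 0

    def accept(self, event_id, occurred_at, now):
        """Return the HTTP status the endpoint would send for this event."""
        age = now - occurred_at
        if age > REJECT_AFTER:
            return 412  # outside the late-event acceptance window
        if len(self.queue) > self.limit:
            return 429  # backpressure; the real endpoint adds Retry-After
        if event_id in self.seen:
            return 202  # assumed: duplicates are acknowledged but not re-queued
        self.seen.add(event_id)
        if age >= LATE_AFTER:
            self.late_event_count += 1
        self.queue.append((event_id, occurred_at))
        return 202


if __name__ == "__main__":
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    buf = IngestBuffer()
    print(buf.accept("e1", now, now))
    print(buf.accept("e1", now, now), len(buf.queue))
    print(buf.accept("e2", now - timedelta(days=31), now))
```

In the real pipeline the dedup keys expire after 30 days, which matches the rejection window: once a key has expired, a replay of that event would already be too old to accept.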
Filament admin + Vue 3 chart island:
Filament v5 at /admin with resources for experiments (hypothesis-required form + inline variants Repeater + calibrated-defaults helper text), metric definitions (with JSON-Path subset validator for safe property_path extraction), mutex groups, API keys. A custom ViewResults page renders the calibrated tuple card + SRM warning + variants table + the methodology audit (peek count, late-event count, weights-changed flag) + a Vue 3 chart island that renders the Beta posterior distribution per variant. The island is Vite-built, 106 KB gzipped (under the 150 KB plan budget, asserted in CI by admin-vue/scripts/check-bundle-size.mjs).
SDKs (five shipped, all corpus-guarded):
- TypeScript core: sticky-cache + buffered tracking + offline-resilient fallback + pure-TS SHA-256 so bucket() stays sync in both Node and the browser.
- React adapter: useExperiment hook with per-experiment subscription so a variant change to one experiment doesn't re-render the others.
- PHP / Laravel: auto-registered service provider, @experiment(checkout-v2,treatment) Blade directive, session-backed sticky cache.
- Python / Django / FastAPI: per-request client attached to request.splitstream / yielded via FastAPI Depends, closes-and-flushes on response.
- @philiprehberger/growth umbrella: 150-LOC wrapper that reads the Pennant kill-switch flag, gates the Splitstream assign call, exposes a unified useExperiment. Same five-line code sample on both docs sites.
Docs site:
Next.js 16 + Tailwind 4. Statically rendered routes for the marketing surface; /reference embeds Scalar's API doc with try-it. /bucketing is an interactive visualizer — paste sample identifiers, watch the cumulative distribution converge. /stats-playground is the calibration story interactively — move sliders and watch the realised FP rate change. /with-pennant carries the joint quickstart for the growth umbrella.
Stack:
- Laravel 13, PHP 8.3, Laravel Octane (RoadRunner), PostgreSQL 16, PgBouncer (transaction pooling), Redis, Filament v5, Apache MPM event + php8.3-fpm
- Vue 3 (Vite + vue-chartjs) for the chart island
- Next.js 16 + Tailwind 4 + Scalar for the docs site
- TypeScript, React, PHP-Laravel, Python (Django + FastAPI), @philiprehberger/growth umbrella
- markrogoyski/math-php (with normal-approx fallback for α + β > 5000) for the stats engine; Python scipy/numpy as a one-off subprocess for calibration + corpus generation
What it proves:
Same person hand-authored the OpenAPI spec, ran the 360k-experiment calibration simulation, wrote the Laravel controllers + bucketing service + mutex-group race protection + buffered ingest worker + stats engine, ported the bucketing function to TypeScript / PHP / Python (corpus-guarded), built the Vue 3 chart island under Vite, scaffolded the Next.js docs site with the Scalar try-it + interactive bucketing visualizer + live stats playground, wired the growth umbrella for the Pennant pair, and configured the EC2 deploy with its Apache + Octane + PgBouncer + supervisord shape. The case for hiring me to build a calibrated experimentation platform instead of bolting metrics onto a feature flag system and hoping the math holds.
Results
- Bayesian decision rule empirically calibrated: 360k null-effect simulations, plan’s draft 66% empirical FP → shipped 4.77%
- 47 tests across five implementations (PHPUnit corpus + TypeScript Vitest + PHP SDK PHPUnit + Python pytest + Vue island Vitest)
- Five SDKs from one OpenAPI 3.1 spec: TypeScript / React / PHP-Laravel / Python (Django + FastAPI) / @philiprehberger/growth umbrella
- 10-row bucketing corpus + 37-row stats engine corpus round-tripped through every implementation in CI, with byte-for-byte parity including Unicode NFC normalisation
- Vue 3 chart island 106 KB gzipped vs the 150 KB plan budget, asserted by check-bundle-size.mjs at every build