Distill
AI-assisted CRM data cleanup — HubSpot duplicate detection via embeddings + clustering, surfaced in a human-review queue, with reversible writes and a full audit log.

Overview
Most "AI dedup" tools on the market are either a $500/month SaaS that runs once a week or a CSV-only batch job that doesn't talk to HubSpot. Distill is a demo of the real plumbing (clustering, review UX, reversible writes, the merge log) running against your live CRM in under five minutes.
What rule-based dedup misses (and Distill catches):
- jane@acme.com vs jane.doe@acme.com vs j.doe@acme.com: same person, three records.
- Acme Inc vs Acme, Inc. vs ACME: same company, three records.
- +1-415-555-0100 vs 4155550100 vs (415) 555-0100: same phone, three formats.
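A minimal sketch of how the phone and company variants above could be collapsed to a shared key before blocking. The function names, the suffix list, and the default country code are assumptions for illustration, not Distill's actual PhoneNormalizer. The email variants have no obvious shared normalized form, which is why they are left to similarity scoring.

```python
import re

# Hypothetical suffix list; a real one would be longer and locale-aware.
_COMPANY_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Reduce a phone string to digits, prefixed with a country code."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: a bare 10-digit number belongs to the default country.
    if len(digits) == 10:
        digits = default_country + digits
    return digits


def normalize_company(raw: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", "", raw.lower()).split()
    return " ".join(t for t in tokens if t not in _COMPANY_SUFFIXES)


phone_keys = {normalize_phone(p) for p in
              ["+1-415-555-0100", "4155550100", "(415) 555-0100"]}
company_keys = {normalize_company(c) for c in
                ["Acme Inc", "Acme, Inc.", "ACME"]}
```

Each set ends up with a single element, so all three variants would land in the same block.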
The pipeline:
Blocking (cheap predicate groupings: domain, phone, name+company) avoids the O(N²) all-pairs comparison entirely. Within each block, pairwise scoring blends cosine similarity on text-embedding-3-small embeddings with rapidfuzz edit distance on (name, email-local-part, company), combined as a weighted sum. HDBSCAN finds the clusters from those scores. Reviewers walk the queue in a Filament admin: a side-by-side diff card with column-wise highlighting, an association rollup, and confirm / skip / reject keyboard shortcuts.
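The blocking and scoring steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the worker's code: `difflib.SequenceMatcher` stands in for rapidfuzz, the three-element vectors stand in for text-embedding-3-small output, the 0.6 embedding weight is arbitrary, and the HDBSCAN step is omitted.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuzzy(a, b):
    """String similarity in [0, 1]; stdlib stand-in for rapidfuzz."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_by(records, key):
    """Group records by a cheap key so only same-block pairs get scored."""
    blocks = defaultdict(list)
    for rec in records:
        k = key(rec)
        if k:
            blocks[k].append(rec)
    return blocks


def pair_score(a, b, w_embed=0.6):
    """Weighted sum of embedding cosine and mean fuzzy field similarity."""
    fields = ("name", "local_part", "company")
    fuzzy_mean = sum(fuzzy(a[f], b[f]) for f in fields) / len(fields)
    return w_embed * cosine(a["vec"], b["vec"]) + (1 - w_embed) * fuzzy_mean


records = [
    {"id": 1, "name": "Jane Doe", "local_part": "jane", "company": "Acme",
     "domain": "acme.com", "vec": [0.90, 0.10, 0.0]},
    {"id": 2, "name": "Jane Doe", "local_part": "jane.doe", "company": "Acme",
     "domain": "acme.com", "vec": [0.88, 0.12, 0.0]},
    {"id": 3, "name": "Bob Ray", "local_part": "bob", "company": "Globex",
     "domain": "globex.com", "vec": [0.0, 0.20, 0.9]},
]

scored = {}
for domain, members in block_by(records, lambda r: r["domain"]).items():
    for a, b in combinations(members, 2):
        scored[(a["id"], b["id"])] = pair_score(a, b)
```

Only the pair inside the acme.com block is ever scored; record 3 sits alone in its block and costs nothing. A clustering step would then take the scored pairs as input.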
Reversible writes are non-negotiable:
Every merge persists the full pre-merge snapshot of every involved record before calling HubSpot's merge API. Undo recreates the merged contacts via batch-create + re-associates owned deals via the Associations API. The known limitation — associations created against the winner after the merge cannot be perfectly restored — is surfaced inline in the undo confirmation modal, not buried in a footer.
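A sketch of the snapshot-then-merge ordering and the undo path, using an in-memory stand-in rather than HubSpot's API. `FakeCrm`, `merge_with_snapshot`, and `undo` are hypothetical names, and the new-ID behavior on recreate is an assumption of the sketch.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class FakeCrm:
    """In-memory stand-in for the CRM; it does not call HubSpot."""
    contacts: dict = field(default_factory=dict)  # contact id -> properties
    deals: dict = field(default_factory=dict)     # deal id -> contact id
    next_id: int = 100

    def merge(self, winner, loser):
        # The loser's deals move to the winner, then the loser disappears.
        for deal, owner in self.deals.items():
            if owner == loser:
                self.deals[deal] = winner
        del self.contacts[loser]

    def create(self, props):
        # Assumption: a recreated contact gets a fresh id.
        self.next_id += 1
        self.contacts[self.next_id] = dict(props)
        return self.next_id


def merge_with_snapshot(crm, log, winner, loser):
    """Append the full pre-merge state to the log, then perform the merge."""
    involved = (winner, loser)
    log.append({
        "winner": winner,
        "records": {c: copy.deepcopy(crm.contacts[c]) for c in involved},
        "deals": {d: c for d, c in crm.deals.items() if c in involved},
    })
    crm.merge(winner, loser)
    return len(log) - 1


def undo(crm, log, entry_id):
    """Recreate merged-away contacts and re-point the deals they owned."""
    snap = log[entry_id]
    winner = snap["winner"]
    id_map = {winner: winner}
    for old_id, props in snap["records"].items():
        if old_id != winner:
            id_map[old_id] = crm.create(props)
    crm.contacts[winner] = copy.deepcopy(snap["records"][winner])
    for deal, old_owner in snap["deals"].items():
        crm.deals[deal] = id_map[old_owner]
    return id_map


crm = FakeCrm(
    contacts={1: {"email": "jane@acme.com"}, 2: {"email": "j.doe@acme.com"}},
    deals={"d1": 1, "d2": 2},
)
log = []
entry = merge_with_snapshot(crm, log, winner=1, loser=2)
crm.deals["d3"] = 1  # association created against the winner after the merge
id_map = undo(crm, log, entry)
```

Deal `d3` is not in the snapshot, so undo leaves it on the winner. That is the same limitation the undo modal warns about.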
Read-only by default. No silent auto-merges:
Every workspace starts with writes_enabled = false. The merge button is hidden in the UI until an admin explicitly flips the toggle with an "I understand Distill will write merges back to HubSpot when I confirm them in the review queue" confirmation. The flip is logged to the audit trail. Auto-merge is off by default; when enabled it only fires above a configurable similarity threshold (0.98 default) and is daily-capped (50 default).
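The guard logic described above could look roughly like this. The class and field names are hypothetical, and treating a score exactly at the threshold as eligible is an assumption; only the defaults (writes off, 0.98 threshold, cap of 50 per day) come from the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class WorkspaceSettings:
    writes_enabled: bool = False        # read-only by default
    auto_merge_enabled: bool = False    # auto-merge off by default
    auto_merge_threshold: float = 0.98
    auto_merge_daily_cap: int = 50


class WriteGuardSketch:
    """Decides whether a merge may be written; illustrative only."""

    def __init__(self, settings: WorkspaceSettings):
        self.settings = settings
        self._day = None
        self._count = 0

    def can_manual_merge(self) -> bool:
        return self.settings.writes_enabled

    def try_auto_merge(self, similarity: float, today: date) -> bool:
        s = self.settings
        if not (s.writes_enabled and s.auto_merge_enabled):
            return False
        # Assumption: a score exactly at the threshold is allowed.
        if similarity < s.auto_merge_threshold:
            return False
        if today != self._day:
            # First attempt of a new day resets the counter.
            self._day = today
            self._count = 0
        if self._count >= s.auto_merge_daily_cap:
            return False
        self._count += 1
        return True


locked = WriteGuardSketch(WorkspaceSettings())
opened = WriteGuardSketch(WorkspaceSettings(
    writes_enabled=True, auto_merge_enabled=True, auto_merge_daily_cap=2))
```

With the cap lowered to 2 for demonstration, the third qualifying merge on the same day is refused and the counter resets on the next date.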
Reject-pair memory:
Once a reviewer rejects a pair as "different people / intentional duplicate / needs more info / other," Distill remembers it permanently. Re-clustering after a new sync invalidates stale pending clusters via a version stamp but preserves human verdicts — the queue doesn't loop.
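One way to get this behavior is an order-independent pair key plus a version stamp on pending items. The sketch below works at the pair level for brevity; `pair_key`, `ReviewQueueSketch`, and the reason slugs are hypothetical names, not Distill's schema.

```python
from dataclasses import dataclass, field

REASONS = {"different_people", "intentional_duplicate",
           "needs_more_info", "other"}


def pair_key(a: str, b: str) -> str:
    """Same key regardless of the order the two ids arrive in."""
    lo, hi = sorted((a, b))
    return f"{lo}:{hi}"


@dataclass
class ReviewQueueSketch:
    version: int = 0
    rejected: dict = field(default_factory=dict)  # pair key -> reason
    pending: list = field(default_factory=list)   # (version, pair key)

    def reject(self, a: str, b: str, reason: str) -> None:
        if reason not in REASONS:
            raise ValueError(f"unknown reason: {reason}")
        self.rejected[pair_key(a, b)] = reason

    def recluster(self, candidate_pairs) -> None:
        """Bump the version stamp and enqueue only never-rejected pairs."""
        self.version += 1
        for a, b in candidate_pairs:
            key = pair_key(a, b)
            if key not in self.rejected:
                self.pending.append((self.version, key))

    def open_items(self) -> list:
        """Pending pairs from the current version that are not rejected."""
        return [key for v, key in self.pending
                if v == self.version and key not in self.rejected]


queue = ReviewQueueSketch()
queue.recluster([("c2", "c1"), ("c3", "c4")])
first_pass = queue.open_items()
queue.reject("c1", "c2", "different_people")
queue.recluster([("c1", "c2"), ("c5", "c6")])
second_pass = queue.open_items()
```

After the second sync, the rejected pair is not re-proposed, and the untouched `c3:c4` item from the older version is treated as stale because the new run did not produce it.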
Stack:
- Laravel 13 + Filament 5 (review queue, audit log, HubSpot connection management)
- FastAPI Python 3.12 cluster worker (embed, score, HDBSCAN; supervisord-managed)
- PostgreSQL 16 + pgvector 0.8 (HNSW on the embedding column for in-cluster similarity refinement)
- Next.js 16 + Scalar (docs, marketing, 1,200-contact live demo, OpenAPI reference)
- Apache + php-fpm; PM2 for the docs site; supervisord for the cluster worker
Quality:
81 PHPUnit tests / 275 assertions in 1.4s — covers HealthCheck, ApiKey scope + token, PhoneNormalizer, WriteGuard auto-merge guards, OAuth PKCE, EmbeddingText, ClusterRejection pair key, DemoFixtureGenerator, Block/Cluster/Merge/SyncRun constants + transitions, HubSpotConnection expiry buffer, WorkspaceMembership capabilities, ContactMirror merged-flag, route auth, DemoController, ClusterController, SyncController, MergeController, AuditController, HubSpot webhook signature verification, WorkspaceScope cross-tenant isolation. Plus 11 vitest tests for the TypeScript SDK, 3 pytest tests for the Python SDK, 8 pytest tests for the FastAPI worker, Playwright spec covering the load-bearing demo controls, and a k6 nightly perf gate with p95 SLOs on the clustering endpoints.
What it proves:
The same person hand-authored the OpenAPI spec, wrote the Laravel controllers and Filament admin, built the HubSpot OAuth + PKCE flow with rate-limit-aware backfill, designed the embedding + blocking + clustering pipeline, wrote the merge writer and the undo orchestrator with the association-recreation caveat surfaced honestly, built the Next.js docs and the 1,200-contact live demo, packaged the read-only TS + Python SDKs, and shipped all three processes live with atomic releases. That is the case for hiring me to ship a complete HubSpot-integrated dedup product instead of a CSV uploader stitched together with a fuzzy-match library.
Results
81 PHPUnit tests / 275 assertions in 1.4s: every API surface + HubSpot OAuth + writes-guard + merge writer + undo orchestrator + webhook signature + cross-tenant isolation
11 vitest tests for @philiprehberger/distill SDK; 3 pytest tests for distill-crm (pydantic + httpx); 8 pytest tests for the FastAPI cluster worker
Read-only by default: workspaces start writes-disabled; the merge button stays hidden until an admin explicitly enables writes; the toggle is audit-logged
Reject-pair memory: once a reviewer rejects a pair, Distill never re-proposes it; versioned re-clustering supersedes stale pendings but preserves human verdicts
Reversible merges: every merge persists the full pre-merge snapshot; undo recreates contacts via batch-create + Associations API; post-merge-association limitation surfaced inline in the undo modal