Distill

AI-assisted CRM data cleanup — HubSpot duplicate detection via embeddings + clustering, surfaced in a human-review queue, with reversible writes and a full audit log.

Laravel 13, PHP 8.3, Filament 5, PostgreSQL 16, pgvector 0.8, FastAPI, Python 3.12, HDBSCAN, rapidfuzz, OpenAI text-embedding-3-small, Next.js 16, React 19, TypeScript, Tailwind 4, OpenAPI 3.1, Scalar, Redis 7, Apache, supervisord, Playwright, k6

Overview

Most "AI dedup" tools on the market are either a $500/month SaaS that runs once a week or a CSV-only batch job that doesn't talk to HubSpot. Distill demonstrates the real plumbing (clustering, review UX, reversible writes, the merge log) running against your live CRM in under five minutes.

What rule-based dedup misses (and Distill catches):

  • jane@acme.com vs jane.doe@acme.com vs j.doe@acme.com — same person, three records.
  • Acme Inc vs Acme, Inc. vs ACME — same company, three records.
  • +1-415-555-0100 vs 4155550100 vs (415) 555-0100 — same phone, three formats.
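
A minimal sketch of why the phone and company variants above collapse once they are normalized. The function names, the default country code, and the suffix list are assumptions made for illustration, not Distill's PhoneNormalizer. The email variants are a different case: no normalization rule maps jane@ to j.doe@, which is where fuzzy and embedding scoring come in.

```python
import re

# Hypothetical suffix list; a real normalizer would cover more legal forms.
COMPANY_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Reduce a phone string to '+' followed by digits only."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: a 10-digit number is a national number missing its country code.
    if len(digits) == 10:
        digits = default_country_code + digits
    return "+" + digits


def normalize_company(raw: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", "", raw.lower()).split()
    while tokens and tokens[-1] in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


phones = {normalize_phone(p) for p in ["+1-415-555-0100", "4155550100", "(415) 555-0100"]}
companies = {normalize_company(c) for c in ["Acme Inc", "Acme, Inc.", "ACME"]}
```

Both sets end up with a single element, so the three spellings of each value land in the same block.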

The pipeline:

Blocking (cheap predicate groupings: domain, phone, name+company) avoids the O(N²) all-pairs comparison: only records that share a block are compared. Within each block, the pairwise score is a weighted sum of cosine similarity on text-embedding-3-small embeddings and rapidfuzz edit-distance similarity on (name, email local part, company). HDBSCAN finds the clusters. Reviewers walk the queue in a Filament admin: a side-by-side diff card with column-wise highlighting, an association rollup, and confirm / skip / reject keyboard shortcuts.
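
The blocking and scoring steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the worker's code: the standard library's difflib.SequenceMatcher stands in for rapidfuzz, the embeddings are toy two-dimensional vectors rather than text-embedding-3-small output, the 0.6 embedding weight is invented, and the HDBSCAN step is left out.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_keys(contact: dict) -> list[str]:
    """Cheap predicates; contacts sharing any key land in the same block."""
    keys = []
    email = contact.get("email", "")
    if "@" in email:
        keys.append("domain:" + email.split("@", 1)[1].lower())
    if contact.get("phone"):
        keys.append("phone:" + contact["phone"])
    return keys


def candidate_pairs(contacts: list[dict]) -> set[tuple[str, str]]:
    """Only pairs inside a shared block are ever scored."""
    blocks = defaultdict(list)
    for contact in contacts:
        for key in block_keys(contact):
            blocks[key].append(contact["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def fuzzy(a: str, b: str) -> float:
    # Stand-in for a rapidfuzz ratio; returns a similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_score(a: dict, b: dict, w_embed: float = 0.6) -> float:
    """Weighted sum of embedding cosine and mean lexical similarity."""
    fields = [
        (a["name"], b["name"]),
        (a["email"].split("@", 1)[0], b["email"].split("@", 1)[0]),
        (a["company"], b["company"]),
    ]
    lexical = sum(fuzzy(x, y) for x, y in fields) / len(fields)
    return w_embed * cosine(a["embedding"], b["embedding"]) + (1 - w_embed) * lexical
```

With blocking in place, the number of scored pairs grows with block sizes rather than with the square of the whole contact list, which is what keeps embedding comparisons affordable.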

Reversible writes are non-negotiable:

Every merge persists the full pre-merge snapshot of every involved record before calling HubSpot's merge API. Undo recreates the merged contacts via batch-create + re-associates owned deals via the Associations API. The known limitation — associations created against the winner after the merge cannot be perfectly restored — is surfaced inline in the undo confirmation modal, not buried in a footer.
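
A rough sketch of the snapshot-then-merge ordering and of where the undo limitation comes from. FakeCrm is a hypothetical in-memory stand-in, not HubSpot's client API, and the sketch reuses the loser's original id on undo purely for simplicity.

```python
import copy


class FakeCrm:
    """Hypothetical in-memory CRM: id -> {"props": {...}, "deal_ids": [...]}."""

    def __init__(self, contacts: dict):
        self.contacts = contacts

    def merge(self, winner_id: str, loser_id: str) -> None:
        loser = self.contacts.pop(loser_id)
        self.contacts[winner_id]["deal_ids"] += loser["deal_ids"]

    def create(self, contact_id: str, record: dict) -> None:
        self.contacts[contact_id] = record


def merge_with_snapshot(crm: FakeCrm, log: list, winner_id: str, loser_id: str) -> int:
    """Persist the full pre-merge state BEFORE calling the merge."""
    snapshot = {cid: copy.deepcopy(crm.contacts[cid]) for cid in (winner_id, loser_id)}
    log.append({"winner": winner_id, "loser": loser_id, "snapshot": snapshot})
    crm.merge(winner_id, loser_id)
    return len(log) - 1


def undo_merge(crm: FakeCrm, log: list, entry_index: int) -> list:
    """Restore both records; return deals that could not be attributed."""
    entry = log[entry_index]
    snap = entry["snapshot"]
    known = {d for record in snap.values() for d in record["deal_ids"]}
    # Deals attached to the winner after the merge have no pre-merge owner,
    # so they stay on the winner and are reported back to the caller.
    post_merge = [d for d in crm.contacts[entry["winner"]]["deal_ids"] if d not in known]
    crm.create(entry["loser"], copy.deepcopy(snap[entry["loser"]]))
    restored = copy.deepcopy(snap[entry["winner"]])
    restored["deal_ids"] = restored["deal_ids"] + post_merge
    crm.contacts[entry["winner"]] = restored
    return post_merge
```

The list returned by undo_merge is the kind of information an undo confirmation would need to show: associations the snapshot cannot account for.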

Read-only by default. No silent auto-merges:

Every workspace starts with writes_enabled = false. The merge button is hidden in the UI until an admin explicitly flips the toggle with an "I understand Distill will write merges back to HubSpot when I confirm them in the review queue" confirmation. The flip is logged to the audit trail. Auto-merge is off by default; when enabled it only fires above a configurable similarity threshold (0.98 default) and is daily-capped (50 default).
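
The guard logic described above can be sketched as follows. The actual WriteGuard is part of the Laravel app; this Python version only illustrates the rules, and every field name other than writes_enabled is an assumption. Treating the threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class WorkspaceSettings:
    writes_enabled: bool = False        # read-only until an admin opts in
    auto_merge_enabled: bool = False    # off by default
    auto_merge_threshold: float = 0.98  # default from the text above
    auto_merge_daily_cap: int = 50      # default from the text above


def can_manual_merge(settings: WorkspaceSettings) -> bool:
    """Manual merges only need the workspace-level write toggle."""
    return settings.writes_enabled


def can_auto_merge(settings: WorkspaceSettings, similarity: float, auto_merges_today: int) -> bool:
    """Auto-merge needs both toggles, a high score, and remaining daily budget."""
    return (
        settings.writes_enabled
        and settings.auto_merge_enabled
        and similarity >= settings.auto_merge_threshold  # assumption: inclusive
        and auto_merges_today < settings.auto_merge_daily_cap
    )
```

Keeping all four conditions in one function means a single place decides whether an unattended write may happen.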

Reject-pair memory:

Once a reviewer rejects a pair as "different people / intentional duplicate / needs more info / other," Distill remembers it permanently. Re-clustering after a new sync invalidates stale pending clusters via a version stamp but preserves human verdicts — the queue doesn't loop.
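
One way to sketch the reject memory and the version stamp, simplified to pairs rather than whole clusters. The class and method names are hypothetical; the one idea taken from the project is an order-independent pair key, so that rejecting (A, B) also covers (B, A).

```python
def pair_key(id_a: str, id_b: str) -> str:
    """Order-independent key: (a, b) and (b, a) map to the same string."""
    low, high = sorted((id_a, id_b))
    return f"{low}:{high}"


class ReviewQueue:
    def __init__(self) -> None:
        self.version = 0
        self.pending: list[tuple[int, str]] = []  # (version stamp, pair key)
        self.rejected: dict[str, str] = {}        # pair key -> reviewer reason

    def reject(self, id_a: str, id_b: str, reason: str) -> None:
        self.rejected[pair_key(id_a, id_b)] = reason

    def recluster(self, proposed_pairs: list[tuple[str, str]]) -> None:
        """Bump the version; older pendings become stale, verdicts survive."""
        self.version += 1
        for id_a, id_b in proposed_pairs:
            key = pair_key(id_a, id_b)
            if key not in self.rejected:
                self.pending.append((self.version, key))

    def active(self) -> list[str]:
        """Pairs from the current version that no reviewer has rejected."""
        return [
            key for version, key in self.pending
            if version == self.version and key not in self.rejected
        ]
```

Because rejections live outside the versioned pending list, a new sync can replace every pending proposal without ever resurfacing a pair a human has already ruled on.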

Stack:

  • Laravel 13 + Filament 5 (review queue, audit log, HubSpot connection management)
  • FastAPI Python 3.12 cluster worker (embed, score, HDBSCAN; supervisord-managed)
  • PostgreSQL 16 + pgvector 0.8 (HNSW on the embedding column for in-cluster similarity refinement)
  • Next.js 16 + Scalar (docs, marketing, 1,200-contact live demo, OpenAPI reference)
  • Apache + php-fpm; PM2 for the docs site; supervisord for the cluster worker

Quality:

81 PHPUnit tests / 275 assertions in 1.4s, covering HealthCheck, ApiKey scope + token, PhoneNormalizer, WriteGuard auto-merge guards, OAuth PKCE, EmbeddingText, ClusterRejection pair key, DemoFixtureGenerator, Block/Cluster/Merge/SyncRun constants + transitions, HubSpotConnection expiry buffer, WorkspaceMembership capabilities, ContactMirror merged-flag, route auth, DemoController, ClusterController, SyncController, MergeController, AuditController, HubSpot webhook signature verification, and WorkspaceScope cross-tenant isolation. On top of that: 11 vitest tests for the TypeScript SDK, 3 pytest tests for the Python SDK, 8 pytest tests for the FastAPI worker, a Playwright spec covering the load-bearing demo controls, and a nightly k6 perf gate with p95 SLOs on the clustering endpoints.

What it proves:

The same person hand-authored the OpenAPI spec, wrote the Laravel controllers and Filament admin, built the HubSpot OAuth + PKCE flow with rate-limit-aware backfill, designed the embedding + blocking + clustering pipeline, wrote the merge writer and the undo orchestrator with the association-recreation caveat surfaced honestly, built the Next.js docs and the 1,200-contact live demo, packaged the read-only TS + Python SDKs, and shipped all three processes live with atomic releases. That is the case for hiring me to ship a complete HubSpot-integrated dedup product instead of a CSV uploader stitched together with a fuzzy-match library.

Results

  • 81 PHPUnit / 275 assertions in 1.4s — every API surface + HubSpot OAuth + writes-guard + merge writer + undo orchestrator + webhook signature + cross-tenant isolation

  • 11 vitest tests for @philiprehberger/distill SDK; 3 pytest tests for distill-crm (pydantic + httpx); 8 pytest tests for the FastAPI cluster worker

  • Read-only by default: workspaces start with writes disabled; the merge button stays hidden until an admin explicitly enables writes; the toggle is audit-logged

  • Reject-pair memory — once a reviewer rejects a pair, Distill never re-proposes it; versioned re-clustering supersedes stale pendings but preserves human verdicts

  • Reversible merges — every merge persists the full pre-merge snapshot; undo recreates contacts via batch-create + Associations API; post-merge-association limitation surfaced inline in the undo modal
