Methodology

How we score. What we don't measure.

An assessment is only as honest as the rubric it's measured against. This is the long version · what we measure, why we measure it, and what we deliberately leave out.

01

Signal · what we measure

Six dimensions per skill: code quality, debugging methodology, testing instinct, error handling, reasoning depth, communication. Plus per-session work-style: ownership, craftsmanship, first principles, pragmatism.

Each dimension has a 0–10 rubric defined per role (Senior FE has a different bar than Mid Backend). Scores are deterministic given the rubric · same submission, same score.
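As a minimal sketch of what "deterministic given the rubric" means in practice · the role names, dimension keys, and bar values below are illustrative placeholders, not the real rubrics:

```python
# The six dimensions described above; key names are our own shorthand.
DIMENSIONS = [
    "code_quality", "debugging", "testing",
    "error_handling", "reasoning", "communication",
]

# Hypothetical per-role bars on the 0-10 scale. Real rubrics define a full
# band description per score; here we keep only a minimum bar per dimension.
RUBRICS = {
    "senior_frontend": {"code_quality": 7, "debugging": 7, "testing": 6,
                        "error_handling": 6, "reasoning": 7, "communication": 6},
    "mid_backend":     {"code_quality": 5, "debugging": 5, "testing": 5,
                        "error_handling": 5, "reasoning": 5, "communication": 4},
}

def score(submission_scores: dict, role: str) -> dict:
    """Compare a submission's 0-10 dimension scores against one role's bar.
    Pure function of its inputs: same submission, same rubric, same score."""
    bar = RUBRICS[role]
    return {dim: {"score": submission_scores[dim],
                  "bar": bar[dim],
                  "meets_bar": submission_scores[dim] >= bar[dim]}
            for dim in DIMENSIONS}

result = score(
    {"code_quality": 8, "debugging": 7, "testing": 5,
     "error_handling": 6, "reasoning": 8, "communication": 7},
    role="senior_frontend",
)
print(result["testing"])  # testing score 5 sits below the senior bar of 6
```

The same raw scores evaluated against `mid_backend` would clear every bar · which is the whole point of scoring per role rather than on one universal scale.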

02

Calibration · what we're calibrated against

Initial role rubrics are anchored on widely published seniority ladders from FAANG and other major tech companies (Google leveling, the Meta engineer ladder, Amazon SDE tiers, Stripe operating principles) and a small set of design-partner conversations · refined each quarter against scoring drift and external review.

Validation in progress. We are currently running blind inter-rater studies with senior engineers grading the same submissions; we will publish the κ scores and any predictive-validity signal once design partners have made enough hires for the comparison to mean something. We don't want to ship a number that hasn't been earned.
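The κ we will publish is Cohen's kappa · observed agreement between two graders, corrected for the agreement you'd expect by chance. A self-contained sketch of the computation (the grader labels are made-up data, not study results):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from each rater's marginal
    label frequencies."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Two graders bucketing the same ten submissions into score bands.
a = ["high", "high", "mid", "mid", "mid", "low", "low", "high", "mid", "low"]
b = ["high", "mid",  "mid", "mid", "low", "low", "low", "high", "mid", "high"]
print(round(cohen_kappa(a, b), 3))  # → 0.545
```

Here the graders agree on 7 of 10 submissions (p_o = 0.7), but chance alone predicts 0.34, so κ lands at 0.545 · which is why raw percent agreement alone isn't the number worth shipping.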

03

Evidence · what every score links to

Reports cite specific moments. "Bug Hunt 8.5 · fixed the cleanup leak at line 49 in 4m32s · explained the React render cycle in the 14:32 transcript". Hiring teams (and candidates!) can replay any score's underlying evidence.

If a number can't be cited, it doesn't make it into the report. "Vibes" never get a number.

Tier ladder

Junior · Mid · Senior · Staff — same scenarios, different rubrics.

Tier sets the depth of the prompt and the bar of the rubric. We score the same submission against whichever tier you declared at intake — the report shows declared vs. detected side-by-side.
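A minimal sketch of the declared-vs-detected comparison · the tier cutoffs below are invented for illustration, and real rubrics are per-role rather than a single number per tier:

```python
# Hypothetical minimum bar per tier on the 0-10 dimension scale.
TIER_BARS = {"junior": 3, "mid": 5, "senior": 7, "staff": 8}
TIER_ORDER = ["junior", "mid", "senior", "staff"]

def detect_tier(dimension_scores):
    """Detected tier = highest tier whose bar every dimension clears."""
    detected = None
    for tier in TIER_ORDER:
        if all(s >= TIER_BARS[tier] for s in dimension_scores.values()):
            detected = tier
    return detected

scores = {"code_quality": 8, "debugging": 7, "testing": 7,
          "error_handling": 7, "reasoning": 8, "communication": 7}
declared = "staff"
detected = detect_tier(scores)
print(f"declared={declared} detected={detected}")  # → declared=staff detected=senior
```

Declared tier picks the rubric the score is reported against; the detected tier appears alongside it so a mismatch in either direction is visible rather than hidden.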

[Figure: Junior-to-Staff seniority ladder rendered as ascending bars]

What we don't measure

Things our score is not.

Calibration honesty matters more than coverage. Here's what's deliberately out of scope.

[Figure: Three score distributions showing honest, calibrated scoring bands]
  • Cultural or values fit

    We measure how someone thinks about work, not whether they'd "fit" your team. That's your call, not an algorithm's.

  • Speed alone

    We score quality and reasoning under a soft time budget. A fast wrong answer doesn't beat a slower right one.

  • Memorization recall

    If a question can be solved by remembering a Stack Overflow answer, we don't ask it. Reasoning over recall, always.

  • Personality scores or psychometrics

    We don't claim to predict who you'll like working with. Big Five, MBTI, etc. don't appear anywhere in the report.