ELO Ratings Questions

18/09/2025 3 min Episode 223

Episode Synopsis

Key Argument

- Thesis: Using ELO for AI agent evaluation = measuring noise
- Problem: Wrong evaluators, wrong metrics, wrong assumptions
- Solution: Quantitative assessment frameworks

The Comparison (00:00-02:00)

Chess ELO
- FIDE arbiters: 120hr training
- Binary outcome: win/loss
- Test-retest: r=0.95
- Cohen's κ=0.92

AI Agent ELO
- Random users: Google engineer? CS student? 10-year-old?
- Undefined dimensions: accuracy? style? speed?
- Test-retest: r=0.31 (coin flip)
- Cohen's κ=0.42

Cognitive Bias Cascade (02:00-03:30)

- Anchoring: 34% rating variance in first 3 seconds
- Confirmation: 78% selective attention to preferred features
- Dunning-Kruger: d=1.24 effect size
- Result: Circular preferences (A>B>C>A)

The Quantitative Alternative (03:30-05:00)

Objective Metrics
- McCabe complexity ≤20
- Test coverage ≥80%
- Big O notation comparison
- Self-admitted technical debt
- Reliability: r=0.91 vs r=0.42
- Effect size: d=2.18

Dream Scenario vs Reality (05:00-06:00)

Dream
- World's best engineers
- Annotated metrics
- Standardized criteria

Reality
- Random internet users
- No expertise verification
- Subjective preferences

Key Statistics

Metric                   Chess     AI Agents
Inter-rater reliability  κ=0.92    κ=0.42
Test-retest              r=0.95    r=0.31
Temporal drift           ±10 pts   ±150 pts
Hurst exponent           0.89      0.31

Takeaways

- Stop: Using preference votes as quality metrics
- Start: Automated complexity analysis
- ROI: 4.7 months to break even

Citations Mentioned

- Kapoor et al. (2025): "AI agents that matter" - κ=0.42 finding
- Santos et al. (2022): Technical Debt Grading validation
- Regan & Haworth (2011): Chess arbiter reliability κ=0.92
- Chapman & Johnson (2002): 34% anchoring effect

Quotable Moments

- "You can't rate chess with basketball fans"
- "0.31 reliability? That's a coin flip with extra steps"
- "Every preference vote is a data crime"
- "The psychometrics are screaming"

Resources

- Technical Debt Grading (TDG) Framework
- PMAT (Pragmatic AI Labs MCP Agent Toolkit)
- McCabe Complexity Calculator
- Cohen's Kappa Calculator
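
Computing Cohen's κ yourself

The κ figures quoted in these notes are Cohen's kappa, which corrects the raw agreement between two raters for the agreement they would reach by chance. The sketch below is one minimal way to compute it, assuming two raters who each vote on the same list of head-to-head comparisons. The function name `cohens_kappa` and the sample votes are hypothetical illustrations, not data or tooling from the episode.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so report full agreement by convention here.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical votes: which agent ("A" or "B") each rater preferred
# in ten head-to-head comparisons.
votes_rater_1 = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
votes_rater_2 = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]

kappa = cohens_kappa(votes_rater_1, votes_rater_2)
print(f"kappa = {kappa:.2f}")  # 80% raw agreement, 50% expected by chance
```

In this made-up sample the raters agree on 8 of 10 votes, but half of that agreement is expected by chance, so κ comes out near 0.6 rather than 0.8. That gap between raw agreement and κ is why the notes report κ rather than a simple agreement percentage.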