With AI models clobbering every benchmark, it's time for human evaluation