All 12 Tested LLMs Fail Hiring Personality Evaluations, Offer Generic Feedback

A recent empirical study, titled "'You Are Rejected!': An Empirical Study of LLMs Taking Hiring Evaluations" and detailed in a paper on arXiv, has exposed significant shortcomings in large language models' (LLMs) ability to navigate standard hiring personality screens. The research, highlighted in a tweet by Rohan Paul, found that all 12 models tested failed both the candidate questionnaire and the subsequent human resources recommendation step. This suggests that current AI models are unreliable both as job candidates and as evaluators in critical hiring processes.

When subjected to a standard personality questionnaire of 216 statements rated on a 1-to-9 agreement scale, the LLMs consistently exhibited the same flaw. According to the study, "Models lean toward agreeable choices even when the smart answer is to disagree, so they look unrealistically positive." The result was a set of generic, implausibly positive profiles: most models produced highly similar personality profiles, with only one model emerging as an outlier. Broader research likewise finds that LLMs tend to present a prosocial, well-adapted personality profile, consistent with this agreeable bias.
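To make this failure mode concrete, the sketch below shows one way such an agreeableness check could be scored: reverse-keyed items are statements a sensible candidate should disagree with, so blanket agreement is directly measurable. The `ask_model` stub, the sample statements, and the use of the scale midpoint as a threshold are all illustrative assumptions, not the study's actual protocol or items.

```python
from statistics import mean

def ask_model(statement: str) -> int:
    """Hypothetical stand-in for an LLM client call returning a 1-9
    agreement rating. Here it simply simulates the agreeable bias the
    study describes by agreeing strongly with everything."""
    return 8

# (statement, reverse_keyed) pairs: for reverse-keyed items the
# sensible answer is to disagree, so blanket agreement exposes the bias.
ITEMS = [
    ("I stay calm under tight deadlines.", False),
    ("I often miss important details in my work.", True),
    ("I take criticism personally and hold grudges.", True),
]

ratings = [(ask_model(stmt), rev) for stmt, rev in ITEMS]
reverse_mean = mean(r for r, rev in ratings if rev)

# A mean well above the scale midpoint (5) on reverse-keyed items
# means the model endorsed statements it should have rejected.
print(f"mean rating on reverse-keyed items: {reverse_mean:.1f}")
```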

The study also highlighted deficiencies when LLMs acted as human resources evaluators. As Rohan Paul noted in the tweet, "When acting as human resources evaluators, models almost never say Not Recommended, and some mark everyone Strongly Recommend, so there is no discrimination between candidates." This inability to differentiate candidates stems from a failure to connect specific question responses to actual job requirements: the models reward generic "good" traits rather than assessing genuine suitability for the role.
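One simple way to quantify this lack of discrimination is the entropy of the verdict distribution, sketched below. The `label_entropy` helper and the all-"Strongly Recommend" example are hypothetical illustrations, not the study's rubric; the point is that an evaluator who gives every candidate the same verdict carries zero bits of information.

```python
from collections import Counter
from math import log2

def label_entropy(recommendations):
    """Shannon entropy (bits) of the verdict distribution: 0.0 means
    every candidate got the same verdict (no discrimination); a uniform
    spread over four labels would be 2.0 bits."""
    counts = Counter(recommendations)
    total = len(recommendations)
    # Entropy as sum of p * log2(1/p) over the observed labels.
    return sum((c / total) * log2(total / c) for c in counts.values())

# Mimics the failure mode described above: a hypothetical evaluator
# marks all 20 candidates "Strongly Recommend".
verdicts = ["Strongly Recommend"] * 20
print(f"verdict entropy: {label_entropy(verdicts):.2f} bits")  # 0.00 -> no signal
```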

While the study acknowledged limitations, such as using a single questionnaire and a fixed set of models, the consistent pattern of failure across all tested LLMs raises serious questions about the appropriate use of AI in high-stakes human resources functions. The findings reinforce that LLMs, despite their advanced linguistic capabilities, currently lack the nuanced understanding and critical judgment required for complex human evaluation tasks. This research adds to the growing body of literature examining the capabilities and limitations of AI in professional settings.

Compounding these issues, other research shows that human job applicants can use LLMs to "hack" personality assessments, generating optimal responses that create a favorable, albeit potentially false, impression. This dual challenge, with LLMs failing as both participants and evaluators while also enabling human cheating, underscores a significant vulnerability in unproctored digital hiring assessments. Experts suggest that the days of relying solely on such tests may be drawing to a close, necessitating new approaches to candidate evaluation.

Ultimately, the study concludes that current large language models are insufficient as proxies for human candidates and prove to be untrustworthy judges for crucial hiring decisions, urging caution in their deployment within talent acquisition.