Evaluating Human-Like Qualities in Language Models

Authors

  • Ruijie Liu

DOI:

https://doi.org/10.62051/yygprz73

Keywords:

Conversational AI, Empathy, Language Models, Benchmark.

Abstract

This paper investigates the human-like communication abilities of modern language models, comparing several open-source and proprietary systems. As LLMs are increasingly deployed in socially interactive roles, ranging from digital companions to mental health support tools, their ability to engage users naturally and expressively has become a critical yet underexplored dimension of evaluation. Traditional benchmarks tend to emphasize accuracy or reasoning and fail to capture the nuanced, subjective traits that define human conversation. To address this gap, seven LLMs were tested in both short and sustained dialogues and evaluated by five human raters using a multi-trait rubric covering five human-oriented communication traits: naturalness, empathy, creativity, adaptability, and humor/personality. Results show significant variation across systems, with some models matching or exceeding human performance in specific areas; LLaMA 3.2 emerged as a standout, occasionally outperforming human responses in personality and creativity. These findings suggest that conversational quality may depend more on tuning and stylistic freedom than on model scale alone.
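
The abstract does not state how the five raters' rubric scores were combined, so the short Python sketch below illustrates one plausible aggregation: averaging each rubric trait across raters for a single model. The trait names follow the abstract; the 1-5 scale, the aggregate_ratings function, and the sample scores are illustrative assumptions rather than details taken from the paper.

import statistics

# Rubric traits named in the abstract.
TRAITS = ["naturalness", "empathy", "creativity", "adaptability", "humor_personality"]

def aggregate_ratings(ratings):
    # Average each trait across raters; `ratings` is a list of
    # per-rater dicts mapping trait name -> score (assumed 1-5 scale).
    return {trait: statistics.mean(r[trait] for r in ratings) for trait in TRAITS}

# Illustrative scores from five hypothetical raters for one model.
example_ratings = [
    {"naturalness": 4, "empathy": 3, "creativity": 5, "adaptability": 4, "humor_personality": 4},
    {"naturalness": 5, "empathy": 4, "creativity": 4, "adaptability": 3, "humor_personality": 5},
    {"naturalness": 4, "empathy": 4, "creativity": 4, "adaptability": 4, "humor_personality": 3},
    {"naturalness": 3, "empathy": 3, "creativity": 5, "adaptability": 4, "humor_personality": 4},
    {"naturalness": 4, "empathy": 5, "creativity": 4, "adaptability": 3, "humor_personality": 4},
]

print(aggregate_ratings(example_ratings))

Under this assumed scheme, each model receives one mean score per trait, which can then be compared against the human-response baseline described in the abstract.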

Published

19-08-2025

How to Cite

Liu, R. (2025) “Evaluating Human-Like Qualities in Language Models”, Transactions on Computer Science and Intelligent Systems Research, 10, pp. 78–85. doi:10.62051/yygprz73.