Evaluating the Usability of AI Chatbots

Introduction

As AI chatbots become embedded into everyday workflows, their usability depends not only on speed but also on visibility, interaction patterns, and perceived response quality.

This study evaluated the usability of three major AI chatbots: ChatGPT, Microsoft Copilot, and Google Gemini.

Through controlled usability testing and interviews with 15 participants, I examined how interface design features influence:

⏱️ Task efficiency

📲 Interaction behavior

🔎 Feature discoverability

☑️ User satisfaction

This paper was written for submission to the International Journal of Recent Trends in Human Computer Interaction (IJHCI).

▶️ Research Problem:

Most existing chatbot research focuses on:

⏱️ Task efficiency

📍 Response accuracy

📈 Performance metrics

But less research examines:

❇️ How interface design elements (icons, visibility, accessibility) influence actual user behavior and satisfaction.

Additionally, traditional usability frameworks were developed for graphical user interfaces, not conversational AI.

▶️ Research Question:

How do interface visibility, accessibility, and response quality influence user satisfaction and behavior across AI chatbot platforms?

Methodology

▶️ Participants:

15 adults (ages 21–65) with varied experience with AI chatbots

After removing one extreme outlier, the final analysis included 14 participants.

▶️ Study Design:

Within-subjects repeated measures design

I used Latin Square counterbalancing to control for order effects

Each participant completed two tasks on each of the three chatbots:

  • Solve a personal or work-related problem
  • Copy the chatbot’s response
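The counterbalancing scheme above can be sketched in a few lines. The rotation below is illustrative only; it shows how each chatbot can appear in each position exactly once across participant orders, not the study's actual assignment sheet:

```python
# Illustrative sketch: a 3x3 Latin square of chatbot orders,
# so each chatbot appears in each position exactly once.
# (Placeholder logic, not the study's actual assignment.)

CHATBOTS = ["ChatGPT", "Copilot", "Gemini"]

def latin_square(items):
    """Each row is one participant order; row i is the list rotated by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

orders = latin_square(CHATBOTS)
for i, order in enumerate(orders, start=1):
    print(f"Order {i}: {' -> '.join(order)}")
```

Note that with an odd number of conditions, a simple rotation balances position but not immediate carryover; studies often combine forward and reversed orders to address that.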

▶️ Data Collected:

🔢 Quantitative

⏱️ Task Completion Times in Seconds

📈 Descriptive Statistics of Completion Times (in Seconds), Computed in JASP

🔠 Qualitative

🕹️ User Interactions with Icons vs. Keyboard & Mouse to Submit & Copy

⭐ Participants’ Average Rating for Each Chatbot

  • ChatGPT - 4.4 out of 5
  • Gemini - 3.9 out of 5
  • Copilot - 3.6 out of 5
🗣️ Participants’ Verbal Feedback on Each Chatbot, Categorized
  • 12 out of 14 participants mentioned satisfaction with ChatGPT's response quality and/or length
  • 3 out of 14 participants mentioned satisfaction with Copilot's response quality and/or length
  • 2 out of 14 participants mentioned satisfaction with Gemini's response quality and/or length

Findings

1️⃣ No Significant Task Efficiency Differences:

Repeated Measures ANOVA showed:

  • No significant difference in task completion time between chatbots
  • Efficiency alone did not explain user preference

This challenges the assumption that speed = usability.
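For readers curious about the mechanics, a one-way repeated-measures F statistic partitions variance into condition, subject, and error components. The sketch below uses fabricated placeholder times, not the study's data:

```python
# Sketch of a one-way repeated-measures ANOVA F statistic.
# The times below are fabricated placeholders, NOT the study's data.

def rm_anova_f(data):
    """data: dict of condition -> list of per-participant scores
    (same participant order in every list).
    Returns (F, df_between, df_error)."""
    conds = list(data.values())
    k, n = len(conds), len(conds[0])
    grand = sum(sum(c) for c in conds) / (k * n)
    cond_means = [sum(c) / n for c in conds]
    subj_means = [sum(c[i] for c in conds) / k for i in range(n)]
    # Partition total variance into condition, subject, and error parts.
    ss_total = sum((x - grand) ** 2 for c in conds for x in c)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_error = ss_total - ss_cond - ss_subj
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f = (ss_cond / df_cond) / (ss_error / df_error)
    return f, df_cond, df_error

times = {  # seconds; placeholder values only
    "ChatGPT": [41, 55, 38, 60, 47],
    "Copilot": [44, 58, 40, 57, 50],
    "Gemini":  [43, 52, 42, 61, 49],
}
print(rm_anova_f(times))
```

A tool like JASP (as used in this study) also reports the p-value and sphericity corrections; this sketch produces only the F statistic and degrees of freedom.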

2️⃣ Users Avoided Interface Icons:

▶️ While Submitting Prompts:

64–79% of participants pressed Enter/Return instead of clicking the submit icon

▶️ While Copying Responses:

  • 79% manually copied text (via keyboard or mouse) in ChatGPT
  • 100% manually copied text in Copilot and Gemini

Despite visible icons, users overwhelmingly preferred keyboard shortcuts and manual selection.

3️⃣ Icon Visibility Affected Behavior:

▶️ Interface differences:
  • Gemini nests Copy under a hidden menu
  • Copilot and Gemini icons only appear on hover
  • ChatGPT icons are always visible

Users interacted with visible features slightly more often, suggesting:

🌟 Discoverability directly influences feature adoption.

4️⃣ Satisfaction Differences Emerged:

Even though efficiency was similar:

  • 85.7% of participants preferred ChatGPT’s responses
  • ChatGPT received the highest ease-of-use ratings

This suggests:

🌟 Perceived response quality influences usability ratings more than speed alone.

Recommendations

Based on findings, AI chatbot interfaces should:

🔹 Make high-frequency actions always visible

Avoid hover-only or nested menus for critical actions.

🔹 Support both novices and power users

Keyboard shortcuts are heavily used; they should be preserved and made discoverable.

🔹 Prioritize response quality

Users equate better responses with better usability.

🔹 Increase transparency

Consistent icon placement and system feedback build trust.

Next Steps

With a larger sample size and more complex tasks, future research could:

  • Test chatbot usability in professional workflows
  • Study long-term use patterns
  • Examine discoverability interventions
