Description:
This academic research paper investigates how four advanced Large Language Models (LLMs)—ChatGPT-4, Claude 3, Gemini 2.5 Pro, and LLaMA 3.1—perform on 50 real-world short-form marketing tasks. The study applies a dual evaluation framework: (1) LLM-as-Judge, in which an AI model scores the outputs, and (2) Human-as-Judge, in which expert marketers rate the content on clarity, creativity, relevance, and emotional tone. The findings highlight both the promise and the current limitations of LLMs in replicating a human brand voice, particularly in emotionally sensitive messaging. A minimal sketch of what the LLM-as-Judge pathway could look like in practice is given below.
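The sketch below illustrates one possible implementation of such a scoring loop: a judge model is prompted to rate each candidate output from 1 to 5 on the four rubric dimensions, and the scores are averaged across the 50 task prompts. The rubric wording, prompt template, and the call_judge_model stub are illustrative assumptions, not the study's actual code or judge model.

```python
# Illustrative sketch of an LLM-as-Judge scoring loop (assumed structure; the
# paper's actual prompts, rubric wording, and judge model are not specified here).
import json
from statistics import mean

CRITERIA = ["clarity", "creativity", "relevance", "emotional_tone"]

JUDGE_TEMPLATE = (
    "You are an expert marketing reviewer. Rate the following copy from 1-5 "
    "on each criterion: {criteria}.\n"
    "Task prompt: {task}\nCandidate output: {output}\n"
    'Respond as JSON, e.g. {{"clarity": 4, ...}}.'
)

def call_judge_model(prompt: str) -> str:
    """Stub for the judge LLM call; replace with a real API client."""
    # Placeholder scores so the sketch runs end to end.
    return json.dumps({c: 3 for c in CRITERIA})

def judge(task: str, output: str) -> dict:
    """Score one model output on the four rubric dimensions."""
    raw = call_judge_model(
        JUDGE_TEMPLATE.format(criteria=", ".join(CRITERIA), task=task, output=output)
    )
    return json.loads(raw)

def aggregate(scores: list[dict]) -> dict:
    """Average each criterion across all evaluated task prompts."""
    return {c: mean(s[c] for s in scores) for c in CRITERIA}

if __name__ == "__main__":
    # Hypothetical task/output pair for demonstration only.
    tasks = [("Write a 20-word tagline for a budget airline.", "Fly more, pay less.")]
    print(aggregate([judge(t, o) for t, o in tasks]))
```

The Human-as-Judge pathway would replace call_judge_model with expert marketer ratings collected on the same four-point rubric, allowing the two score sets to be compared directly.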
Abstract:
As Large Language Models (LLMs) become increasingly integrated into marketing workflows, evaluating their real-world performance is crucial. This study assesses four top-performing LLMs using a two-pathway framework—LLM-as-Judge and Human-as-Judge—on 50 marketing prompts. While LLMs show strong fluency and alignment in many tasks, they struggle with emotional nuance and tone in high-sensitivity contexts. The study proposes a multidimensional framework for evaluating LLMs in marketing and offers practical recommendations for hybrid content strategies.
Review Status:
This report was reviewed and formally evaluated as part of the MBAR 661 course at University Canada West (Spring 2025 term). The assigned evaluator concluded that the report was “presentable/defensible with minor revisions.”