Evaluating the Performance of Large Language Models in Marketing
By: Maria Fernanda Rodriguez Tamez
MBAR 661: Academic Research Project (ONS-SPRING25-04)
Mohsen Ghodrat
University Canada West

Presentation Overview
1. Why This Research Matters
2. What Makes a Good Marketing Message
3. Proposed Framework
4. Methodology
5. Evaluation Design
6. Participants
7. Question Design
8. Process Flow
9. LLM-as-Judge Results
10. Human-as-Judge Results
11. What LLMs Do Well & What They Miss
12. What This Means for Marketers
13. Limitations & Future Directions
14. Conclusions

Who wrote this headline?
"Run, don't scroll. Everything is 30% off—yes, everything."

Why This Research Matters
Marketing is not just what you say; it's how, when, and why you say it.
LLMs can generate content at scale. But can they create good marketing content?

What Makes a Good Marketing Message
Clarity and Structure
Emotional Tone
Creativity
Brand Voice and Credibility

Evaluation Framework
7Ps of Marketing – Product, Price, Place, Promotion, People, Process, Physical Evidence. Each prompt maps to one of these categories.
Objective-Based Functional Framework – Evaluates messages on clarity, emotional tone, persuasive value, and strategic alignment.

Methodology

Evaluation Design
First Trial:
6 human experts
Evaluated 50 marketing questions
Each question had 5 anonymized answers (GPT-4, Claude, Gemini, LLaMA, Human)
Second Trial:
11 human participants
Evaluated a sample of 10 of the same 50 questions
Same models, same human benchmark

Participants in the Evaluation Process
LLMs are advanced AI models trained on massive text datasets. They vary in size, with some having billions of parameters. Each model has different training methods and architecture. Performance is judged by output quality: clarity, accuracy, and tone.
GPT-4 (OpenAI) – ~1 trillion parameters
Claude 3 (Anthropic) – ~200–300 billion parameters
Gemini 1.5 (Google) – ~500+ billion parameters
LLaMA (Meta) – ~70 billion parameters
Human Expert – written by a marketer classmate

Question Design
Promotion (Q1–Q10): Flash sales, product blurbs, CTAs
Product (Q11–Q16): USPs, product comparisons
Price (Q17–Q21): Communicating value and offers
Place (Q22–Q26): Local pickup, delivery messaging
People (Q27–Q31): Apologies, inclusive tone
Process (Q32–Q36): Return policies, customer journey
Physical Evidence (Q37–Q41): Packaging and visual brand cues
Purpose (Q42–Q50): Sustainability, DEI, authenticity

Process Flow

LLM-as-Judge Findings
GPT-4 showed a strong preference for Claude's responses; it selected its own responses only twice, and human-written ones just once.
Agreement among all LLMs occurred in only 14% of cases, suggesting inconsistency (a code sketch of this agreement tally follows the findings slides).
Gemini's selections were the most closely aligned with human preferences.
LLaMA had the lowest alignment, especially on emotionally or ethically nuanced prompts.

Human-as-Judge Findings
Selection rates across all prompts:
GPT-4: 22%
Claude: 19.6%
Gemini: 19.2%
LLaMA: 20.6%
Human: 18.7%
The human response stood out in only one prompt (Q5: Apology).

The Human-Likeness Effect
Judges often couldn't distinguish human vs. LLM responses. Why?
"Honestly, I couldn't tell which one was human."
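Before turning to the implications for marketers, the sketch below illustrates the mechanics behind the evaluation described above: the five candidate answers per question are anonymized before judging, and the recorded picks are tallied into the LLM agreement rate and the selection shares reported on the findings slides. This is a minimal, illustrative sketch only; the function names, answer labels, and sample data are hypothetical placeholders and do not reproduce the study's actual pipeline.

# Minimal sketch of the blind-judging setup and the tallies behind the
# agreement-rate and selection-share figures. All names and sample data
# below are hypothetical placeholders, not the study's pipeline.
import random
from collections import Counter

SOURCES = ["GPT-4", "Claude", "Gemini", "LLaMA", "Human"]

def anonymize(answers):
    """Shuffle {source: text} so a judge sees answers A-E with no authorship cues."""
    items = list(answers.items())
    random.shuffle(items)
    blinded = [(label, text) for label, (_, text) in zip("ABCDE", items)]
    key = {label: source for label, (source, _) in zip("ABCDE", items)}
    return blinded, key

def full_agreement_rate(judge_picks):
    """Share of questions on which every judge chose the same source."""
    agreed = sum(1 for picks in judge_picks.values() if len(set(picks.values())) == 1)
    return agreed / len(judge_picks)

def selection_shares(judge_picks):
    """Percentage of all votes won by each source across questions and judges."""
    votes = Counter(pick for picks in judge_picks.values() for pick in picks.values())
    total = sum(votes.values())
    return {source: round(100 * votes[source] / total, 1) for source in SOURCES}

# Hypothetical example: three questions, four LLM judges each.
llm_picks = {
    "Q1": {"GPT-4": "Claude", "Claude": "Claude", "Gemini": "GPT-4", "LLaMA": "LLaMA"},
    "Q2": {"GPT-4": "Claude", "Claude": "Claude", "Gemini": "Claude", "LLaMA": "Claude"},
    "Q3": {"GPT-4": "Claude", "Claude": "GPT-4", "Gemini": "Human", "LLaMA": "Gemini"},
}
print("LLM full-agreement rate:", f"{full_agreement_rate(llm_picks):.0%}")
print("Selection shares:", selection_shares(llm_picks))

With picks recorded in this shape, the same tallies apply directly to the human judges in the second trial.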
What This Means for Marketers
Use GPT for fast, scalable content (promo, email, CTA)
Use Claude for tone-sensitive writing (apologies, values)
Always keep human oversight for brand voice and recovery messaging
LLMs are assistants, not brand guardians

Limitations & Future Directions
No senior human expert was included as a benchmark.
Most participants were not native English speakers.
Only four LLMs were tested; more could be included for broader comparison.
Demographic diversity of participants was limited.
Ethical and inclusivity angles (e.g., Indigenous, EEDI) were lightly touched on but not deeply explored.
Some ethical themes were present in prompts but not systematically evaluated.
A larger set of prompts could strengthen generalizability.

Conclusions
LLMs perform strongly in clarity, structure, and speed
They still struggle with empathy, nuance, and trust-building
The framework bridges technical and strategic marketing evaluation
Human + AI = strongest future collaboration

References
Anthropic. (2024). Claude 3 family models. https://www.anthropic.com/index/introducing-claude
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2022). On the opportunities and risks of foundation models. Stanford Center for Research on Foundation Models. https://arxiv.org/abs/2108.07258
Federiakin, M. (2024). Evaluating LLMs beyond benchmarks: Toward human-centric metrics. Journal of AI Ethics & Applications, 11(2), 41–56.
Google DeepMind. (2024). Gemini 1.5 technical overview. https://deepmind.google/technologies/gemini/
HELM Project Contributors. (2022). Holistic evaluation of language models. Center for Research on Foundation Models. https://crfm.stanford.edu/helm/latest/
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv. https://arxiv.org/abs/2211.09110
Meta AI. (2024). LLaMA 3: Open foundation models. https://ai.meta.com/llama
OpenAI. (2024). GPT-4 technical report. https://openai.com/research/gpt-4
Rodriguez Tamez, M. F. (2025). Evaluating the performance of large language models in marketing scenarios (MBA thesis, University Canada West).
Spajić, M. (2023). Artificial empathy? The limits of AI in emotional branding. Marketing & Tech Review, 7(1), 12–20.