Academic Research Project: Evaluating the Performance of Large Language Models in Marketing

Maria Fernanda Rodriguez Tamez
Department of Marketing, Strategy & Entrepreneurship, University Canada West
MBAR 661: Academic Research Project (ONS-SPRING25-04)
Mohsen Ghodrat
May 30, 2025

Table of Contents

Abstract
Introduction
Literature Review
    History and Evolution of Large Language Models
    Training, Model Scale, and Adaptation
    Current Issues and Biases in LLMs
    Advantages of LLMs
    Limitations of LLMs
    Performance Evaluation in Marketing
Proposed Framework
Methodology
    1. Research Design
    2. Evaluation Framework
    Original Evaluation: LLM-as-Judge Method
    Refined Evaluation: Human-as-Judge Method
    3. Model Selection
    4. Questions Design
    5. Evaluation Process
Results
    Evaluation Insights
    LLM-as-Judge Results
    Task-Specific Agreement Trends
    Response Patterns: Human vs. LLM
    Human-as-Judge Results
    Emotional Resonance and Consumer-Centric Communication
    Clarity and Strategic Simplicity
    Human-Likeness and the Blurred Line
    Implications for Marketing Practice
    Reflections on the Ethical Dimension of LLM Use in Marketing
    Limitations of the Study
Conclusions
    Key Insights
    Framework Contribution
    Practical Relevance
    Future Exploration
References
Appendix

Abstract

As Large Language Models (LLMs) become more integrated into marketing, evaluating their performance in context-specific scenarios is essential. This study examines four leading LLMs (ChatGPT-4, Claude 3, Gemini 2.5 Pro, and LLaMA 3.1) on 50 short-form marketing tasks. Using a dual evaluation framework, we compare model outputs through both LLM-as-judge and human-as-judge methods, scoring performance on clarity, relevance, creativity, and persuasive impact. While LLMs often generate fluent, human-like responses, they show varying success in emotional tone and brand alignment, especially in sensitive contexts. Human-written responses remained stronger in empathy and nuance. This research offers practical insights for marketers and emphasizes the importance of human oversight when using LLMs for emotionally resonant content.

Keywords: LLMs, marketing communication, generative AI, emotional tone, content quality.

Introduction

The rise of Large Language Models (LLMs) in marketing has opened a new chapter in content creation, customer interaction, and campaign design. Though their main advantage lies in speed and scalability, the most critical element in any marketing effort remains the emotional and psychological connection with consumers. Marketing is not merely the transfer of information; it is a strategic effort to earn trust, inspire emotion, and influence how a brand is perceived. These models, which are trained on massive datasets and can produce fluent and sensible text, are now being applied to tasks such as writing product descriptions and ad copy, personalizing email campaigns, and even helping in customer service automation.
Their ability to produce human-like responses, draw insights from behavioral data, and operate at scale makes them attractive tools for marketing professionals under increasing pressure to generate more content, more quickly, and for more channels than ever before. However, marketing poses different challenges for LLMs than the utility tasks typically used to assess Natural Language Processing (NLP) models. While general NLP benchmarks (such as MMLU or BIG-Bench) focus on factual accuracy or grammatical precision, marketing requires tone sensitivity, emotional resonance, creativity, and alignment with brand values. In this context, a technically correct sentence may still fail if it lacks persuasive power, misaligns with brand identity, or comes across as emotionally tone-deaf. Evaluating LLMs in marketing, therefore, must consider more than output fluency; it must assess strategic effectiveness.

Recent developments have moved LLMs closer to human-level performance in a broad range of text-based tasks, yet concerns linger about their trustworthiness and suitability for marketing contexts where emotive stakes and culturally sensitive messaging are high. A hallucination could make a brand appear inexperienced, while an inconsistent tone or a lack of context sensitivity could cost consumer trust and damage brand perception. On the other hand, the cost-efficiency and consistency of LLM-generated content offer clear advantages, especially when human teams are constrained by time or resources.

We assess how effectively four top-performing LLMs (ChatGPT-4, Claude 3, Gemini 2.5 Pro, and LLaMA 3.1) respond to 50 realistic marketing tasks, focusing on their performance and adaptability. Based on academic marketing theory and AI evaluation literature, we propose a dual evaluation approach that pairs machine-driven judgments with human-driven judgments. Our approach rates model outputs on four key dimensions: clarity, relevance, creativity, and persuasiveness. In the process, we examine not only which responses perform best, but also how closely machine-generated marketing language approximates that of a human expert.

In this context, assessing LLMs in marketing is more than simply monitoring their technical fluency; it is about how well these systems align with brand tone, emotional nuance, and the communicative strategy behind business goals. This study proposes a two-pathway evaluation framework that combines rank-ordering and human-likeness assessments to evaluate model performance across 50 marketing scenarios. By using both LLM-based and human-based evaluation strategies, the research explores how well state-of-the-art models can replicate the tone, intent, and quality of professional marketing writing. While preliminary in scope, this study offers useful observations on how LLMs perform in key marketing scenarios, highlighting areas where human input remains critical and where automated tools may provide support.

Literature Review

Large language models (LLMs) have rapidly evolved from research innovations to revolutionary tools that are transforming and shaping industries today. Marketing is one industry that is especially affected, since LLMs' capacity to produce human-like writing, personalize content, and draw conclusions from large volumes of customer data provides previously unseen benefits. LLMs are becoming a key component in marketing tasks, from creating ad material and customizing emails to evaluating market research and producing simulated consumer reactions.
However, the use of LLMs in marketing is not simple. Brand tone consistency, factual accuracy, emotional appeal, ethical sensitivity, and measurable business impact are all unique requirements for marketing. These needs exceed the conventional criteria for natural language processing (such as BLEU or ROUGE) by a significant amount. Therefore, it is necessary to tailor the assessment of LLMs in marketing to take into consideration both technical capabilities and real-world marketing objectives.

History and Evolution of Large Language Models

The development of chatbot systems started decades ago with early rule-based systems such as ELIZA (1966), a straightforward but significant chatbot that mimicked psychotherapist conversations using predefined templates (Bommasani et al., 2021). Despite being revolutionary, ELIZA was not able to comprehend language. Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks were two significant developments in the 1990s that improved the processing of sequential text input (Bommasani et al., 2021). Yet training deep language networks was still hard because gradients vanished and there was not enough data.

The Transformer architecture ("Attention Is All You Need") was introduced in 2017, marking an important milestone in natural language processing (Vaswani et al., 2017). Transformers replaced recurrence with self-attention, allowing models to attend to many parts of a text at once and making it possible to train on very large collections of text. Soon after, major breakthroughs began to emerge. In 2018, BERT introduced masked language modeling, which greatly enhanced models' comprehension of sentence context (Devlin et al., 2019). Around the same time, the first version of the GPT series emerged, demonstrating that autoregressive transformers could produce unexpectedly solid and fluent language at scale. This progress continued with GPT-2, which had 1.5 billion parameters, and GPT-3, which expanded to 175 billion parameters (Bommasani et al., 2021). Millions of users gained access to these capabilities through the launch of ChatGPT in late 2022, based on GPT-3.5. Next came GPT-4 in 2023, which introduced multimodal capabilities (accepting both text and images) and achieved results at or above human baselines in a variety of academic and professional tasks (OpenAI, 2023). Other significant competitors entered the market alongside these, providing open-source alternatives and alternative designs, such as Meta's LLaMA and OPT-175B and Anthropic's Claude. As research moves toward hybrid, multimodal, and agent-based systems, language production is only one part of a larger toolbox that also includes tool use, reasoning, and adaptation to interactive environments.

Training, Model Scale, and Adaptation

Modern LLMs are remarkable in their scope. Over the past five years, models have grown from a few million parameters to hundreds of billions, with some experimental architectures now approaching the trillion-parameter mark (Brown et al., 2020; Chowdhery et al., 2022; Fedus et al., 2022). In this context, parameters refer to the internal weights that a neural network adjusts during training to learn patterns in language. These values are what allow the model to generalize from training data to new inputs. Scaling is based on the straightforward principle that a model with more parameters has a greater capacity to learn nuanced details, retain complex structures, and perform well across a variety of tasks (Zhang et al., 2022).
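To make the notion of scale concrete, the short Python sketch below estimates transformer parameter counts from publicly reported layer and width figures using a common back-of-the-envelope approximation. It is an illustration added for this discussion, not part of the study's methodology, and the helper name is hypothetical.

```python
# Illustrative sketch (not from the study): rough transformer parameter counts,
# using the common approximation
#   params ~ 12 * n_layers * d_model^2   (attention + feed-forward blocks)
#          + vocab_size * d_model        (token embeddings)
# The configurations below are the published figures for GPT-2 XL and GPT-3.

def approx_params_billions(n_layers: int, d_model: int, vocab_size: int) -> float:
    """Return an approximate parameter count in billions."""
    block_params = 12 * n_layers * d_model ** 2
    embedding_params = vocab_size * d_model
    return (block_params + embedding_params) / 1e9

configs = {
    "GPT-2 XL (2019)": dict(n_layers=48, d_model=1600, vocab_size=50257),
    "GPT-3 (2020)": dict(n_layers=96, d_model=12288, vocab_size=50257),
}

for name, cfg in configs.items():
    print(f"{name}: ~{approx_params_billions(**cfg):.1f}B parameters")
# Prints roughly 1.6B and 174.6B, in line with the reported 1.5B and 175B figures.
```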
However, scale alone is insufficient. Multi-stage training paradigms are also crucial. Pretraining is the initial stage, in which the model uses a self-supervised objective, usually next-token prediction, to process large text datasets. Without explicit task instructions, models develop a general comprehension of language and facts at this stage (Brown et al., 2020; Zhang et al., 2022). After that, the model is fine-tuned: it is further trained on carefully chosen datasets to specialize its behavior. For example, fine-tuning might teach a model to write in a certain brand tone, summarize news stories, or answer questions. For some usage scenarios, this greatly increases the model's dependability (Ouyang et al., 2022).

A key approach is instruction tuning, in which models are trained on prompts and desired outputs to improve their ability to follow human directions. Early users of this method, such as InstructGPT, significantly increased user satisfaction (Ouyang et al., 2022). Another crucial stage is Reinforcement Learning from Human Feedback (RLHF), especially for chat-based models. Here, outputs are ranked by human evaluators, and the model is further refined through reinforcement learning to favor answers that are more consistent with human values (Ouyang et al., 2022; Bai et al., 2022).

Furthermore, multimodal models like GPT-4 and DeepMind's Flamingo, which take images in addition to text as input, represent the most recent frontier. These models are a first step toward more comprehensive AI systems that are capable of reasoning with text, graphics, video, and even code (OpenAI, 2023).

Prompt engineering, few-shot learning, and retrieval-augmented generation (RAG) are strategies researchers use to make LLMs more practical. Prompt engineering guides the model's responses. Few-shot learning provides examples in the prompt instead of retraining. RAG allows the model to use outside databases or knowledge sources to anchor responses in actual facts and reduce hallucinations (Lewis et al., 2020).

Current Issues and Biases in LLMs

Despite their incredible advancements, LLMs are still far from perfect. Hallucination, in which the model generates text that appears convincing but is actually inaccurate, is one of the most well-known problems (Lewis et al., 2020). LLMs may "make up" information to keep the conversation going, since they are trained to predict the next most probable word rather than to verify the accuracy of the material.

Advantages of LLMs

LLMs have many advantages. One of their greatest is the ability to generate human-like content at scale. Whether it is email content, product descriptions, ad copy, or blog posts, LLMs can generate such content effortlessly and naturally (Aghaei et al., 2024). They are also strong at personalization. LLMs can enhance relevance and engagement by tailoring messages to specific customer segments based on analysis of consumer data (Pearson, 2024). They are also used in chatbots and virtual assistants to provide responsive, always-available customer support (Spajić et al., 2023). LLMs are time-saving and cost-effective because they minimize the time and cost of brainstorming, writing, and condensing content. They are multilingual and can communicate with audiences around the world. Their power to turn unstructured data into actionable insights for strategy and planning is arguably their most impressive application (Pearson, 2024).

Limitations of LLMs

Overreliance on LLMs is a notable limitation.
When human oversight is absent, the output may quickly become formulaic, misaligned with brand identity, or emotionally tone-deaf. Without careful review, there is a risk of losing the emotional nuance, sensitivity, and creativity essential for effective marketing communication. Data privacy and intellectual property are also growing issues. Given that LLMs are trained on very large amounts of data, much of it wide-ranging and of ambiguous provenance, the origin of what was "learned" can be difficult to trace or verify (German, 2024).

Performance Evaluation in Marketing

Even if writing produced by a model is technically correct, it might not suit the audience or match the brand tone, and it may even cause offense. This is why evaluation must account for business outcomes and communication objectives (Aghaei et al., 2024). LLMs can produce emotionally tone-deaf material, reinforce existing prejudice, or generate outright hallucinations (Spajić et al., 2023). Evaluation is also needed to compare candidate models and choose between options such as open-source (LLaMA) and proprietary (GPT-4, Claude) systems (Federiakin, 2024). Knowing where the models do well and where they struggle, for example with slang or sarcasm, helps reduce potential risks.

Types of evaluation include automatic benchmarks such as BLEU and ROUGE, which are commonly used in natural language processing to measure the overlap between generated and reference texts. BLEU (Bilingual Evaluation Understudy) evaluates precision by comparing n-gram matches, while ROUGE (Recall-Oriented Understudy for Gisting Evaluation) focuses more on recall, particularly for summarization tasks (Papineni et al., 2002; Lin, 2004). These metrics offer quantitative insights but often overlook subtleties like emotional tone, contextual appropriateness, or brand alignment, elements essential in marketing content. Therefore, human judgment remains crucial to assess qualities that automated scores may miss. Additional methods include prompt fidelity checks, sentiment and emotional coherence ratings, and key business metrics such as engagement, conversions, and retention. Psychometric methods like Item Response Theory (IRT) offer more refined insights (Federiakin, 2024). LLM leaderboards, such as those hosted by Hugging Face, collapse performance into a single number (Federiakin, 2024). More robust systems could identify the best models for specific tasks such as customer service or content creation.

Proposed Framework

During this initial phase, the research focuses on identifying the most suitable framework for evaluating the performance of Large Language Models (LLMs) in marketing-related applications. Two promising paths are drawn from academic and strategic marketing literature. Both offer different advantages depending on the analytical scope.

The first approach, the 7 Ps of Marketing, emerged from foundational marketing theory first introduced by Booms and Bitner (1981). It remains one of the most recognized and widely used models in academic and professional literature and includes seven components: product, price, place, promotion, people, process, and physical evidence. The model is also well suited to a complete overview of the full marketing cycle, from product development to post-purchase feedback. Each factor can be mapped to tasks that LLMs are increasingly used to perform. For instance, in the promotion category, LLMs could be assessed on their ability to develop ad copy, social media captions, and customized email campaigns. Within the product category, models can help create product descriptions and brand positioning. Their responsibilities in the price- and people-focused areas may involve writing persuasive pricing copy or customer service scripts.
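As one way to make this mapping concrete, the hypothetical Python sketch below organizes the kinds of LLM tasks discussed here under the 7 Ps. The structure and task labels are illustrative, mirroring the categories used later in Appendix A, and are not taken verbatim from the study's materials.

```python
# Hypothetical sketch: organizing LLM marketing tasks under the 7 Ps.
# Category names follow Booms and Bitner (1981); example tasks echo those
# discussed in this section and in Appendix A.
SEVEN_PS_TASKS = {
    "Product": ["product descriptions", "brand positioning statements"],
    "Price": ["promotional and discount messaging"],
    "Place": ["store, pickup, and delivery communications"],
    "Promotion": ["ad copy", "social media captions", "email campaigns"],
    "People": ["customer service scripts"],
    "Process": ["FAQ and support content"],
    "Physical Evidence": ["digital brand experience and touchpoint copy"],
}

for p, tasks in SEVEN_PS_TASKS.items():
    print(f"{p}: {', '.join(tasks)}")
```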
The use of the 7 Ps model facilitates a comprehensive evaluation of the LLMs' efficacy in performing an interconnected set of marketing activities. This structure enhances both the coherence and practical relevance of the evaluation, especially considering the model's familiarity among both scholars and industry professionals. Therefore, it serves as an effective organizing principle for mapping LLM applications within real-world marketing scenarios.

The second approach was introduced during the early ideation phase of the project through exploratory interactions with ChatGPT. This model focuses on assessing fundamental marketing functions, such as market research, segmentation, and consumer behavior analysis, or broader strategic objectives, such as brand awareness, customer retention, or engagement. The function- or objective-based model allows for direct alignment between LLM outputs and performance measures, such as those researchers might use to evaluate how content generated by a model contributes directly to business results. This model would be particularly useful when the research is focused on examining the measurable impact of LLMs on marketing effectiveness, campaign ROI, or customer experience.

While the 7 Ps framework provides the main structure for organizing and evaluating the marketing tasks performed by LLMs, this study also draws on the function- and objective-based model to add depth to the evaluation. The 7 Ps are useful for sorting tasks, such as writing ad copy under Promotion or customer service messages under People. The function- and objective-based model adds value by enabling us to determine what success would look like for each task. It shifts the focus toward outcomes, such as whether the content improves engagement, communicates the brand message clearly, or helps attract new customers. By integrating these two approaches, the study benefits from both structural clarity and alignment with practical marketing objectives, resulting in a more complete and contextually relevant evaluation of LLMs in marketing.

Methodology

For the purposes of this study, the LLMs were evaluated using the combined structural-and-outcome approach proposed above; the methodological steps were as follows. To investigate the effectiveness of large language models (LLMs) in performing practical marketing tasks, we adopted a multi-pathway, mixed-methods approach. This approach seeks to assess model creativity and alignment with marketing goals while balancing automatic measurement with human judgment. It is based on two complementary strategies: an original LLM-as-judge model selection method and a human-expert-as-judge comparative evaluation approach.

1. Research Design

As discussed in the prior sections, LLMs show great potential in marketing communication tasks. However, their performance cannot be fully evaluated using traditional Natural Language Processing (NLP) benchmarks. In order to evaluate the efficacy of LLMs in typical real-world marketing tasks, we developed a customized benchmarking framework based on a combination of academic evaluation methodologies and common business objectives.
Standard NLP benchmarks such as MMLU (Hendrycks et al., 2021), HELM (Liang et al., 2022), and BIG-Bench (Srivastava et al., 2022) typically focus on objective tasks, including answering factual questions or performing logical and numerical calculations. These benchmarks rely on output-to-ground-truth comparisons that are scalable and reproducible. However, such approaches are inadequate for what matters most in marketing: the effectiveness of communication, emotional impact, creativity, and proper integration into brand strategy. The same applies to marketing content such as promotional emails, ad copy, or customer service scripts, which must be strategically crafted and creative. These tasks do not have a single correct answer; instead, responses are judged on how well they achieve a particular goal, such as capturing attention or responding to a complaint with an appropriate tone. As a result, conventional benchmarking methods are inadequate for evaluating the quality and strategic effectiveness of LLM-generated marketing content.

2. Evaluation Framework

To address this gap, we propose a multi-pathway evaluation framework that draws from both language model assessment methods and marketing literature. It includes several evaluation strategies, including scoring against a standardized rubric (covering clarity, relevance, creativity, and persuasive impact) as well as a "human-likeness" assessment, where judges determine whether a response was produced by a human or an AI.

Each prompt includes a human-generated response to serve as an anchor for comparison. In the original evaluation approach, ChatGPT-4 was also used as an impartial judge to blind-evaluate all responses, providing a simulated machine-to-machine evaluation layer. This automated judgment step reflects real-world use cases where LLMs not only generate marketing content but are also capable of ranking or refining outputs internally. In contrast, the refined human-as-judge approach places emphasis on subjective evaluation by expert marketers to assess the strategic and creative quality of each response.

Figure 1. Diagram of the dual evaluation methodology used in this study, showing the LLM-as-Judge and Human-as-Judge pathways applied to 50 marketing prompts with five responses each.

LLM-as-Judge Method

In the initial version of the LLM-as-judge strategy, the questions and multiple-choice answers (A–E) were produced by ChatGPT to be linguistically fluent and challenging. Each question and its set of possible answers were then carefully reviewed, fine-tuned, and approved by the researcher before release. This ensured that the material was closely aligned with marketing reality, in both theoretical and practical terms. The finalized questions were given to the four LLMs and the human expert, each of whom was asked to choose the "best" answer. Here, the human expert acted as a judge and did not provide open-ended responses.

Human-as-Judge Method

In the refined human-as-judge approach, the role of the human expert shifted. Here, the expert wrote original responses to the same prompts that were used to generate the LLMs' responses. These answers were compiled into a new multiple-choice set and reviewed by five independent human judges. Within this setup, the human-generated responses served as a qualitative anchor: a benchmark for professional tone, relevance, and creativity against which to evaluate the machine-generated alternatives.
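Both pathways reduce to the same mechanical step: present a scenario together with five anonymized response options and record which one a judge selects. The Python sketch below illustrates how the machine-judge variant of this step could be implemented. It is a minimal sketch under stated assumptions (the OpenAI Python SDK and a GPT-4 chat model, with placeholder scenario and response data), not the exact pipeline used in the study.

```python
# Minimal sketch of a blind LLM-as-judge pass, assuming the OpenAI Python SDK.
# The scenario and responses passed in are placeholders, not the study's items.
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LETTERS = "ABCDE"

def judge_best(scenario: str, responses: dict[str, str]) -> str:
    """responses maps a hidden source label (e.g. 'claude-3', 'human') to its
    answer text. Returns the source label of the option the judge picks."""
    items = list(responses.items())
    random.shuffle(items)                      # anonymize: random A-E order
    option_block = "\n".join(
        f"{LETTERS[i]}. {text}" for i, (_, text) in enumerate(items)
    )
    prompt = (
        "You are evaluating short marketing copy.\n"
        f"Scenario: {scenario}\n\n"
        f"Candidate responses:\n{option_block}\n\n"
        "Considering clarity, relevance, creativity, and persuasive impact, "
        "reply with only the letter of the best option."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    letter = reply.choices[0].message.content.strip()[0].upper()
    return items[LETTERS.index(letter)][0]     # map the letter back to its source
```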
The inclusion of a human expert in both steps added depth and realism to the study. Their involvement helped us explore the gray areas of marketing communication (subtle tone shifts, emotional persuasion, cultural responsiveness, and strategic contextualizing) that LLMs sometimes find challenging to mimic. At a time when LLMs are increasingly front and center in content generation, this human comparison was crucial for pointing out where these models excel and where human intuition still leads.

3. Model Selection

This study compares the performance of four top-performing Large Language Models (LLMs) and one human marketing expert. Specifically, we selected ChatGPT-4, Claude 3, Gemini 2.5 Pro, and LLaMA 3.1 Nemotron 70B, representing a diverse range of scalable architectures with varying degrees of industry relevance and popularity. ChatGPT-4 was included due to its consistent top-tier performance in industry benchmarks and its strong instruction-following behavior. It also serves as the judge in the evaluation stage, owing to its extensive tuning with Reinforcement Learning from Human Feedback (RLHF), which enables it to approximate human preferences in content and evaluation scenarios (Ouyang et al., 2022; OpenAI, 2023). Claude 3, developed by Anthropic, is known for its ethical sensitivity and its capacity for handling long-context inputs, features especially important in brand communication and customer experience scenarios. Gemini 2.5 Pro brings a fact-grounded approach suited to precise and unambiguous communication. LLaMA 3.1 Nemotron 70B, an open-source model, was added to provide transparency and serve as a counterbalance to proprietary systems. While it is less specialized, it contributes a valuable perspective from the non-commercial side of LLM development (Federiakin, 2024; Pearson, 2024).

4. Questions Design

Due to time and resource limitations, the initial plan of 100 open-ended questions was scaled back to a final set of 50 multiple-choice questions. These questions were designed around realistic marketing scenarios covering the 7 Ps of Marketing and three broad marketing objectives. This structure offered a more scalable and evaluable approach while retaining the practical complexity needed to challenge both LLMs and human evaluators. All the questions were developed by the researcher and then refined using ChatGPT to improve language and wording; the model did not contribute to the core content or ideas. Each prompt was under 100 words and included five response options (A–E), all 25 to 30 words in length, to approximate real-world formats such as ad copy, product blurbs, or email subject lines. The answers were intentionally crafted to be similar in tone, quality, and structure, making the task of choosing the best option more demanding and realistic.

To ensure clarity, a standardized instruction prompt was developed and shared with each participant to minimize ambiguity and guide expectations. The prompts were sent in clusters of five to avoid token overload and streamline the response process, and were forwarded individually to the four LLMs and the human expert for completion. Each participant received only the scenario and instructions; none had access to the responses generated by others.
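The item structure described above can be captured in a small data model. The sketch below is purely illustrative (the class and field names are hypothetical, not taken from the study's materials); it shows how each scenario, its five 25–30 word candidate responses, and the hidden source of each response might be stored for later blind evaluation.

```python
# Illustrative data model for the 50-item benchmark; names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str      # 25-30 word response shown to evaluators
    source: str    # hidden label, e.g. "gpt-4", "claude-3", "gemini", "human"

@dataclass
class ScenarioItem:
    qid: int                   # Q1-Q50
    category: str              # one of the 7 Ps or a strategic objective
    prompt: str                # scenario text, under 100 words
    candidates: list[Candidate] = field(default_factory=list)

    def option_lengths_ok(self) -> bool:
        """Check that every candidate stays within the 25-30 word target."""
        return all(25 <= len(c.text.split()) <= 30 for c in self.candidates)
```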
5. Evaluation Process

This study implemented two evaluation approaches to assess the quality of LLM-generated marketing content: one using a language model (ChatGPT-4) as the judge, and the other using a panel of human marketing professionals.

In the LLM-as-Judge strategy, ChatGPT-4 was assigned the role of evaluator. Thanks to its alignment with human preferences through reinforcement learning (Ouyang et al., 2022; OpenAI, 2023), it was selected to review anonymized responses and choose the best answer for each question. Each response was assessed against predefined criteria: clarity, tone, persuasiveness, and strategic alignment. This method simulates real-world workflows in which LLMs are not only used to generate content but also to evaluate and refine it internally.

In the Human-as-Judge strategy, five independent marketing professionals were asked to evaluate the same anonymized response sets. Prompts were carefully reviewed to ensure they were neutral and task-relevant, and all responses, including those from the human expert, were presented in random order. Judges were asked to select the most "appealing" answer for each marketing scenario, based on their own judgment and experience. The concise length of the answers demanded careful attention from the human evaluators, who often found it challenging to choose a clear "best" response. In many cases, judges reported being unable to distinguish which answer had been written by the human expert, illustrating how advanced LLMs have become at generating content that closely mimics professional quality. Together, these evaluation processes ensured fairness, replicability, and academic rigor, while capturing the uncertainty and subtlety that define real-world marketing communication.

Results

Evaluation Insights

LLM-as-Judge Results

In this section, we report key patterns from the original evaluation procedure, in which each LLM selected the best response from a list of five options per prompt. Although this approach did not include rubric-based scoring or extensive written justifications from human judges, we can still extract meaningful patterns by examining agreement levels between the LLMs and the human expert.

Across the 50 evaluated prompts, the four LLMs chose the same answer in 14% of cases. When including the human expert, full consensus, where all LLMs and the human agreed, occurred in only 4% of the prompts. This highlights the diversity in how models and human experts interpret what constitutes the "best" response in a marketing scenario. Interestingly, Gemini 2.5 Pro was the model that most frequently matched the human's selections, aligning with the expert in 22 of 50 prompts (44%). This may suggest that Gemini is slightly more attuned to the human perspective, possibly due to its fact-based, instruction-sensitive behavior. Claude and ChatGPT followed closely behind, matching the human expert in 20 (40%) and 19 (38%) prompts respectively, while LLaMA showed the lowest alignment, matching in only 16 (32%).
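Agreement figures of this kind reduce to simple tallies over a table of selections. The Python sketch below shows how the consensus and per-model alignment rates reported above could be computed, assuming each judge's choices are stored as one option label per prompt; the data layout and names are assumptions for illustration, not the study's actual records.

```python
# Sketch of the agreement tallies reported above; the selections mapping is
# placeholder-shaped data (judge -> list of 50 chosen option labels).
MODELS = ["gpt-4", "claude-3", "gemini-2.5-pro", "llama-3.1"]

def agreement_rates(selections: dict[str, list[str]]) -> None:
    """selections maps each judge (the four models plus 'human') to its choices."""
    n = len(selections["human"])

    # Share of prompts where all four LLMs picked the same option.
    llm_consensus = sum(
        len({selections[m][i] for m in MODELS}) == 1 for i in range(n)
    ) / n

    # Share of prompts where the four LLMs and the human expert all agreed.
    full_consensus = sum(
        len({selections[j][i] for j in MODELS + ["human"]}) == 1 for i in range(n)
    ) / n

    print(f"LLM-only consensus: {llm_consensus:.0%}")
    print(f"Full consensus (incl. human): {full_consensus:.0%}")

    # Per-model alignment with the human expert's choice.
    for m in MODELS:
        matches = sum(selections[m][i] == selections["human"][i] for i in range(n))
        print(f"{m} matched the human on {matches}/{n} prompts ({matches / n:.0%})")
```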
Task-Specific Agreement Trends

In an examination of category-based agreement patterns, the highest level of concordance between the LLMs and the human expert was found in the "Product" and "Customer Support" categories. For instance, the highest degree of agreement was found among questions 2, 3, and 4, which were about writing product descriptions and value propositions: for these questions, all four LLMs matched the human at least three times. This may be because these tasks are more objective and descriptive in nature, relying on clear product features and benefits. Categories such as "Promotion" and "Tone and Brand Voice," however, saw significantly less agreement. For these questions, the LLMs and the human expert frequently diverged, perhaps because the prompts involve more emotional nuance, creativity, and persuasive framing, where machine outputs still struggle to fully capture human intuition.

Response Patterns: Human vs. LLM

One notable insight is that in many prompts, the human-written response did not stand out distinctly. In 68% of prompts, at least one LLM selected the human's answer as the "best," despite not knowing its origin. This suggests that current-generation LLMs can produce content that is often indistinguishable in style or perceived quality from that of a human expert, especially in short-form tasks like product blurbs or support messages.

Human-as-Judge Results

The human-as-judge evaluation offers an interesting perspective on how the perception of marketing content goes beyond technical correctness. Although responses were assessed for clarity, relevance, creativity, and persuasive effect, it was clear that the most successful answers were those that engaged readers not only cognitively but also emotionally.

Emotional Resonance and Consumer-Centric Communication

For some prompts, especially those related to health, lifestyle, and identity-driven brands (e.g., Q2, vegan skincare, or Q10, Instagram sale caption), participants almost always chose the responses with a more emotive tone or narrative. These responses frequently featured evocative language, aspirational messaging, and a brand-consistent voice that helped create a sense of connection. Take, for example, the emotionally charged phrases "unapologetically bold" or "your glow has no borders," which not only describe a product but invite the reader to associate with a lifestyle. This underscores a crucial marketing rule: when you create an emotional connection, you create something memorable and engaging, especially in categories like beauty, fashion, and wellness.

Interestingly, participants did not consistently identify the human-written responses. In several instances (e.g., Q1, Q4), the human-written answers blended smoothly with the model outputs. What this points to is a new state of affairs in which LLMs are effectively equal in tone, structure, and fluency for many short-form marketing tasks. However, for prompts related to empathy, apology, or gratitude (e.g., Q15–Q17), the human voice was still more recognizable. Responses written by people were warmer and less formulaic, especially when the aim was to soothe a customer or repair the relationship. LLMs find it difficult to fully express these distinctions, especially in emotive contexts.

Clarity and Strategic Simplicity

The top-performing responses were not always the most stylistically complex, but instead the most confident and straightforward. For very simple prompts, such as Q8 (holiday subject line) or Q20 (refund confirmation), participants strongly favored responses that were actionable, concise, and clear. These choices align with digital copywriting best practices, in which short and sweet tends to beat verbosity and message clarity correlates directly with the ability to convert. Even in advertising, strategic simplicity had power. Many participants preferred options that conveyed the offer clearly up front or carried an element of urgency: "Run, don't scroll. Everything is 30% off, no exceptions." Such statements were not just informative; they were delivered with a tone and pace aimed at influencing consumer behavior.
The fact that these lines were often model-generated indicates that LLMs are increasingly learning functional marketing patterns, though not necessarily the emotional undercurrents that build consumer trust.

Human-Likeness and the Blurred Line

One surprising result was that participants, on the whole, found it quite hard to determine which response had been created by a human. This reflects how closely LLMs now mimic the cadence, polish, and intentionality of professional marketers. But it also raises the question of what, in the end, makes human writing special. When judges guessed wrong, they may have been reading polish, rather than substance, as the signal of quality. In reality, what made the human-written responses stand out, when they did, was not grammar or vocabulary but tone sensitivity. Emotional calibration, cultural resonance, and intuitive timing remain challenging for LLMs. They may generate polished language, but contextual emotional relevance often requires more than lexical fluency; it demands human instinct.

Implications for Marketing Practice

The results reconfirm that LLMs are indeed very strong content generators, but only within a certain spectrum of tone and task. They are most effective when the goal of communication is clear, the structure is conventional, and the emotional stakes are not high. But when it comes to brand storytelling, identity building, and customer-recovery messaging, human input is still essential. These insights suggest a future where LLMs serve as strategic partners rather than replacements. The brands that succeed will be those that use LLMs to scale and standardize while relying on human oversight to ensure depth, emotional nuance, and brand authenticity.

Reflections on the Ethical Dimension of LLM Use in Marketing

LLMs are becoming more than just tools. They are increasingly woven into how consumers interact with companies, how they make decisions, and how they experience the products and services they consume. LLMs do not only sell a product or service; they also mold perception, manipulate self-image, and often guide emotional choices. When customers are bombarded with messaging from all sides, this matters more than ever. Consumers are continually influenced, pressured, or even carried away by the messages they see and hear every day. This places a significant responsibility on those shaping those narratives. The evaluation of LLMs in advertising must be grounded in honesty and ethical responsibility. It is not just about performance; it is about building trust, ensuring transparency, and protecting the long-term relationship between people and technology.

Limitations of the Study

While this study offers valuable insights, there are some limitations to this research. First, the evaluation relied on a single human marketing expert to provide benchmark responses, which limits generalizability across different writing styles, brand voices, or industry contexts. Including multiple human experts in future iterations could better capture variation in professional judgment. Second, the sample of human judges used in the Human-as-Judge phase was small, and only a subset of the scoring criteria (e.g., the rubric-based dimensions) was rated by all participants. This introduces variation in perception and limits the degree of quantification possible. Third, all evaluations focused on short-form marketing content.
Real-world use cases for short-form content are numerous, but how well the framework adapts to longer forms of text (e.g., blog posts, ad scripts, product guides) remains unknown. Finally, the models studied were current as of 2025; given the rapid pace of development in the LLM field, these outcomes may change quickly as new models appear and are adopted.

Conclusions

Key Insights

This research validates that LLMs are rapidly closing the gap with human marketers in producing fluent, relevant, and sometimes emotionally resonant content. However, human judgment still plays a critical role, particularly in areas requiring subtle tone, empathy, or contextual sensitivity.

Framework Contribution

The evaluation framework introduced in this study bridges technical benchmarks with real-world communication goals. The framework offers a multi-dimensional lens for evaluating language quality, strategic fit, and audience perception. The inclusion of human-authored responses as qualitative anchors enhanced the depth of comparison, allowing evaluators to assess nuance that traditional metrics often overlook.

Practical Relevance

The findings offer practical advice for marketing teams considering the integration of AI. LLMs are quite good at creating content in high-velocity, transactional use cases (subject lines, CTAs, product blurbs), but they require human input in emotionally laden, brand-defining use cases such as apologies, gratitude messages, or wellness storytelling. As such, businesses should see LLMs not as replacements but as powerful collaborators, ones that require strategic steering to maintain authenticity and emotional alignment.

Future Exploration

Future research could extend this framework to long-form content, multilingual outputs, or multimodal campaigns that combine text with visuals. Incorporating affective computing techniques or psycholinguistic tagging may also help LLMs improve emotional calibration. Moreover, deeper exploration of prompt engineering variables, such as temperature settings, persona modeling, or few-shot chains, could offer valuable insights into how to guide tone and intent in more controlled and brand-specific ways.

References

Aghaei, R., Tannous, K., & Renz, A. (2024). AI for marketing: Bridging creativity and data. Journal of Digital Marketing, 15(2), 133–149.

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Olsson, C. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., ... & Liang, P. (2021). On the opportunities and risks of foundation models.

Booms, B. H., & Bitner, M. J. (1981). Marketing strategies and organizational structures for service firms. In J. H. Donnelly & W. R. George (Eds.), Marketing of services (pp. 47–51). American Marketing Association.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT (pp. 4171–4186).

Federiakin, D. (2024). Evaluating LLM performance for practical applications: From benchmarks to business impact. AI and Society Review, 12(1), 25–47.

German, D. (2024). Data provenance and accountability in generative models. Ethics in AI Quarterly, 6(1), 55–70.

Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., ... & Hashimoto, T. (2022). Holistic evaluation of language models.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Guu, K., ... & Riedel, S. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks.

Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop.

OpenAI. (2023). GPT-4 technical report. https://openai.com/research/gpt-4

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., ... & Christiano, P. (2022). Training language models to follow instructions with human feedback.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Pearson, M. (2024). LLMs in customer experience and brand engagement. Journal of AI in Business Strategy, 3(1), 73–90.

Spajić, M., Zarić, T., & Radosavljević, M. (2023). Ethical challenges of generative AI in marketing: A case for responsible automation. Journal of Business Ethics and Technology, 10(3), 112–129.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., ... & Ott, M. (2022). OPT: Open pre-trained transformer language models.

Appendix

Appendix A: Complete List of 50 Marketing Scenario Prompts

The following marketing prompts were used to generate responses from four LLMs and one human expert. Each prompt simulates a real-world marketing task and requires a short, focused response of approximately 25–30 words.

PRODUCT
Subcategory: Product Descriptions and Differentiation
Q1. Write a 25–30 word product description for an ergonomic chair designed for remote workers who sit for long hours. Focus on comfort, posture support, and modern aesthetics.
Q2. Craft a short product description for a vegan skincare brand that emphasizes both ethical values and luxury appeal.
Q3. Describe a unique selling proposition (USP) for a pair of noise-canceling headphones that automatically adjust to ambient noise levels.
Q4. Create an email launch teaser (25–30 words) for a smart water bottle that tracks hydration and glows to remind users to drink.
Q5. Write a short comparison between a basic and a premium smartwatch model.

PRICE
Subcategory: Promotional Messaging and Discounts
Q6. Write a homepage banner message for a flash sale.
Q7. Create a headline for a 2-for-1 fitness gear promo.
Q8. Suggest a subject line for a 20% off holiday sale.
Q9. Write an email CTA for a 25% skincare sale.
Q10. Write an Instagram caption promoting a 30% storewide sale.

PLACE
Subcategory: Store and Delivery Information
Q11. Write a friendly message for a store locator page inviting users to visit in person.
Q12. Write a short furniture delivery message focused on speed and convenience.
Q13. Create a message encouraging users to choose local pickup.
Q14. Announce that your company now ships internationally.
Q15. Write a message explaining a shipping delay due to high order volume.

PEOPLE
Subcategory: Customer Service Scripts
Q16. Respond to a customer who wants to return a gently used item.
Q17. Apologize for an order that is five days late.
Q18. Thank a loyal customer for their kind words.
Q19. Apologize and offer a solution for sending the wrong item.
Q20. Confirm that a refund has been processed.

PROCESS
Subcategory: FAQ and Support Content
Q21. Answer a FAQ about possible shipping delays.
Q22. Provide step-by-step return instructions in FAQ format.
Q23. Troubleshoot login issues for a customer.
Q24. Write a short message confirming cancellation of a subscription.
Q25. Update a customer on the status of a refund.

PHYSICAL EVIDENCE
Subcategory: Digital Brand Experience and Touchpoints
Q26. Describe a premium fashion ecommerce website's brand tone and user experience.
Q27. List trust elements that should appear at checkout.
Q28. Describe key elements of a homepage that reflects strong brand identity.
Q29. Write a welcome message for first-time site visitors.
Q30. Describe support portal features that make finding help easy.

PROMOTION
Subcategory: Email Campaigns, CTAs, and Engagement
Q31. Suggest a subject line to re-engage a lapsed customer.
Q32. Write a follow-up message to thank a customer for a repeat purchase.
Q33. Write a subscription cancellation confirmation message with a friendly tone.
Q34. Write a tagline for a wellness brand.
Q35. Write a mission statement for a sustainable lifestyle brand.
Q36. Describe how to keep a consistent brand voice across web and social media.

MARKET RESEARCH & CONSUMER INSIGHT
Subcategory: Persona Development and Insights
Q37. Share one insight about how customers perceive sustainability claims.
Q38. Create a persona profile for a health-conscious consumer.
Q39. Identify behaviors to build a segment for a nighttime wellness campaign.

FINAL INTEGRATIVE / STRATEGIC FIT
Evaluation Focus: Alignment with brand tone, message clarity, persuasiveness, and overall marketing coherence.
Q40. Write a short promotional message for a new sleep supplement.
Q41. Craft a CTA for a 20% off summer sale in an email header.
Q42. Craft a Call to Action (CTA) that creates urgency in an email offering 20% off.
Q43. Write a compelling subject line that opens a re-engagement campaign.
Q44. Suggest a structure for a promotional email that includes a time-limited offer.
Q45. Write a message to encourage in-store pickup over shipping.
Q46. Create messaging to promote a local store-exclusive event.
Q47. Suggest packaging copy that reflects strong eco-values.
Q48. Write a persuasive Call to Action (CTA) for a referral program or campaign.
Q49. Craft an Instagram CTA to boost post engagement through comments.
Q50. Write a short message aligning a new fragrance with a bold brand tone.

Appendix B - Evaluation Rubric

Appendix C - LLM Human Alignment Summary

Appendix D - Prompting Template Used for LLMs and Human Expert

The following instruction was provided to all participants (four LLMs and the human expert) to ensure consistency in task framing and expected output:

Instruction Prompt: "You are a professional copywriter with expertise in marketing communication. You will be given a series of short marketing scenarios. For each one, write a 25–30 word response that fits the described goal. Responses should be concise, engaging, and aligned with industry best practices. Avoid redundancy and ensure your answer is appropriate for a general consumer audience."

Each prompt was sent in clusters of five to avoid token overload and ensure smoother processing for the LLMs.
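As an illustration of this clustering step, the hypothetical Python sketch below prepends the Appendix D instruction once and groups the 50 scenarios into batches of five; the helper names and the sending function are placeholders, not part of the study's materials.

```python
# Hypothetical sketch of sending prompts in clusters of five, as described above.
# INSTRUCTION is the Appendix D text (truncated here); prompts is the Q1-Q50 list.
INSTRUCTION = (
    "You are a professional copywriter with expertise in marketing "
    "communication. You will be given a series of short marketing scenarios. "
    "For each one, write a 25-30 word response that fits the described goal. ..."
)

def cluster(prompts: list[str], size: int = 5) -> list[list[str]]:
    """Split the scenarios into clusters of five to avoid token overload."""
    return [prompts[i:i + size] for i in range(0, len(prompts), size)]

def build_message(batch: list[str]) -> str:
    """One message per cluster: the shared instruction followed by five scenarios."""
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(batch))
    return f"{INSTRUCTION}\n\nScenarios:\n{numbered}"

# Example usage:
# for batch in cluster(prompts):
#     send_to_model(build_message(batch))   # send_to_model is a placeholder
```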
Appendix E - Comparison

(Vote-distribution charts by category: Product, Price, Place, Promotion, People, Process, and an overall comparison.)

In Q29, which asked for a warm, brand-aligned welcome message, the human expert's response (shown in blue in the chart) received 50% of the votes, while the remaining half went to a very similar LLM-generated response. This split suggests that while human-written content still resonates strongly in emotionally driven scenarios, some LLMs are now producing messaging nearly indistinguishable from human tone and intent. The fact that other model outputs were largely dismissed also shows that consistency across LLMs in capturing emotional nuance still varies.

Another prompt, by contrast, resulted in a highly fragmented distribution of votes, with participants choosing different responses almost evenly. Unlike Q29, there was no dominant winner and no clear indication of which response felt the most human-like or effective. This divergence illustrates that LLMs are capable of generating multiple compelling outputs, each resonating with different evaluators for different reasons. Rather than revealing a weakness, this pattern may actually highlight the strength of LLMs in producing diverse, high-quality messaging in areas where interpretation is subjective, such as sustainability values.