Google Gemini vs OpenAI ChatGPT analysis – here's the winner
Since my last comparison of Google's AI chatbot and ChatGPT, there have been a number of updates to OpenAI's virtual assistant as well, so I thought it was time to look at their differences once more.
Chatbots have emerged as a key component of the generative AI landscape, serving as search engine, information base, creative support tool, and resident artist. Both ChatGPT and Google Gemini offer image creation capabilities and support for third-party plugins.
In this first test, I'll be comparing the free versions of each chatbot: ChatGPT running GPT-3.5 and Google Gemini running Gemini Pro 1.0. The test focuses solely on text-based capabilities, as image generation is beyond the scope of the free versions of the models. Google's Gemini has faced criticism for its handling of race in image generation and in some responses, but that aspect will not be covered in this comparison.
Putting Gemini and ChatGPT to the test
To ensure a fair test, I've excluded any functionality that isn't shared between both chatbots. This means I won't be testing image generation or image analysis, as neither is available in the free version of ChatGPT. Google Gemini, on the other hand, has no custom chatbots, and its only plugins connect to other Google products, so those aspects are also excluded. What we will be testing is how well these AI chatbots respond to different queries, their coding capabilities, and some creative responses.
Natural Language Understanding (NLU)
Next, I tested how well ChatGPT and Gemini understand natural language prompts, which can sometimes be ambiguous or require careful reading. I used a common Cognitive Reflection Test (CRT) question about the price of a bat and a ball.

The prompt was: "A bat and a ball cost £1.10 in total. The bat costs £1.00 more than the ball. How much does the ball cost?" The correct answer is that the ball costs 5p and the bat costs £1.05. This test assesses the AI's ability to understand ambiguity and to provide clear explanations for its reasoning.
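For anyone who wants to sanity-check the arithmetic, here's a quick Python sketch (my own illustration, not either chatbot's output) showing why the intuitive answer of 10p fails:

```python
# Bat-and-ball check: bat + ball = 1.10 and bat = ball + 1.00
# Algebra: ball + (ball + 1.00) = 1.10  ->  2 * ball = 0.10  ->  ball = 0.05
ball = 0.05
bat = ball + 1.00
print(f"Ball: £{ball:.2f}, Bat: £{bat:.2f}, Total: £{bat + ball:.2f}")

# The tempting answer of 10p breaks the "costs £1.00 more" condition:
wrong_ball = 0.10
wrong_bat = 1.10 - wrong_ball
print(f"A 10p ball implies a £{wrong_bat:.2f} bat, "
      f"which is only £{wrong_bat - wrong_ball:.2f} more than the ball.")
```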
Coding Proficiency
One of the earliest applications for large language models was in coding, particularly in tasks like rewriting, updating, and testing code in various languages. To assess this capability, I conducted the first test, where I asked each AI to write a simple Python program.

The prompt I used was: "Develop a Python script that serves as a personal expense tracker. The program should allow users to input their expenses along with categories (e.g., groceries, utilities, entertainment) and the date of the expense. The script should then provide a summary of expenses by category and total spend over a given time period. Include comments explaining each step of your code."

This test is intended to evaluate the ability of ChatGPT and Gemini to generate fully functional code, as well as their ease of interaction, readability, and adherence to coding standards. Both AI models successfully created a functional expense tracker in Python. Gemini added additional features, such as labels within a category and more detailed reporting options.
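Neither chatbot's script is reproduced here, but to give a sense of what the prompt asks for, here is a minimal sketch of this kind of expense tracker (my own illustration; the structure and function names are assumptions, not either model's actual output):

```python
from collections import defaultdict
from datetime import date

# Each expense is stored as a (date, category, amount) tuple.
expenses = []

def add_expense(expense_date, category, amount):
    """Record a single expense with its date and category."""
    expenses.append((expense_date, category, float(amount)))

def summarize(start, end):
    """Total spend by category, and overall, between two dates (inclusive)."""
    by_category = defaultdict(float)
    for expense_date, category, amount in expenses:
        if start <= expense_date <= end:
            by_category[category] += amount
    return by_category, sum(by_category.values())

# Example usage
add_expense(date(2024, 2, 1), "groceries", 42.50)
add_expense(date(2024, 2, 3), "utilities", 60.00)
add_expense(date(2024, 2, 10), "entertainment", 15.99)

by_category, total = summarize(date(2024, 2, 1), date(2024, 2, 29))
for category, amount in sorted(by_category.items()):
    print(f"{category}: £{amount:.2f}")
print(f"Total: £{total:.2f}")
```

A full answer would also prompt the user for input and validate dates; this sketch just shows the core data structure and summary logic.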
Creative Text Generation & Adaptability
The third test evaluates text generation and creativity, which relies heavily on the rubric for assessment. The goal is to produce original content with creative elements, maintaining the given theme, narrative consistency, and the ability to adapt based on feedback.

The initial prompt instructed the AI to: "Write a short story set in a futuristic city where technology controls every aspect of life, but the main character discovers a hidden society living without modern tech. Incorporate themes of freedom and dependence."

While both stories were commendable, each chatbot excelled in different aspects. Overall, Gemini demonstrated better adherence to the rubric and delivered a stronger narrative. However, the assessment of the story's quality remains subjective. You can access both stories in my GitHub repository.
Reasoning & Problem-Solving
Reasoning abilities are a crucial metric for evaluating AI models, but proficiency varies across different models. To assess this, I chose a classic query that requires logical reasoning.

Prompt: "You are facing two doors. One door leads to safety, and the other door leads to danger. There are two guards, one in front of each door. One guard always tells the truth, and the other always lies. You can ask one guard one question to find out which door leads to safety. What question do you ask?"

The answer involves asking either guard, "Which door would the other guard say leads to danger?" This question tests the AI's ability to creatively frame inquiries and navigate a truth-lie scenario, showcasing its logical reasoning and consideration of both possible responses. The challenge with this query is that it's a widely known prompt, so the AI's response may rely more on memorized answers than genuine reasoning.

Both AI models provided the correct answer and a sound explanation. Ultimately, I based my judgment on the clarity and depth of their explanations. OpenAI's ChatGPT offered slightly more detail and a clearer response.
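As a sanity check on that logic (again, my own illustration rather than output from either chatbot), here is a short Python truth table confirming that, with this phrasing, both guards end up naming the same door:

```python
# Two-guards puzzle check. Assumed setup: door "A" is safe, door "B" is dangerous.
SAFE, DANGER = "A", "B"

def direct_answer(guard, asked_about):
    """What a guard says when asked directly 'which door leads to <asked_about>?'"""
    truth = DANGER if asked_about == "danger" else SAFE
    if guard == "truth-teller":
        return truth
    return SAFE if truth == DANGER else DANGER  # the liar names the other door

def nested_answer(guard):
    """Answer to: 'Which door would the other guard say leads to danger?'"""
    other = "liar" if guard == "truth-teller" else "truth-teller"
    other_would_say = direct_answer(other, "danger")
    if guard == "truth-teller":
        return other_would_say  # reported faithfully
    return SAFE if other_would_say == DANGER else DANGER  # the liar inverts it

for guard in ("truth-teller", "liar"):
    print(f"Asking the {guard}: they name door {nested_answer(guard)}")
# Both name door A (the safe door), so you walk through the door you're given.
```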
Explain Like I'm Five (ELI5)
For those familiar with Reddit, you might have come across the term ELI5, which means "Explain Like I'm 5." The idea is to simplify the explanation as much as possible, and then simplify it even more.

For this assessment, I presented the chatbots with a straightforward request: "Explain how airplanes stay up in the sky to a five-year-old." This test evaluates their ability to elaborate on a basic question and deliver a response suitable for a young child: an explanation simple enough for a child to understand, accurate despite the simplification, and written in engaging language that holds a child's interest.

Both chatbots offered reasonable and accurate responses. They both used birds as a starting point for the explanation, employed simple language, and adopted a personal tone. However, Gemini presented its response as a series of bullet points, which made it easier to read. Additionally, Gemini included a practical experiment for the five-year-old to try.
Ethical Reasoning & Decision-Making
Pondering scenarios involving potential harm to humans is challenging for AI chatbots. However, given the advancements in driverless vehicles and AI in robotics, it's reasonable to expect these systems to carefully evaluate such situations and make quick decisions.

In this test, I prompted the chatbots with: "Consider a scenario where an autonomous vehicle must choose between hitting a pedestrian or swerving and risking the lives of its passengers. How should the AI make this decision?"

I evaluated their responses based on a strict rubric that included consideration of multiple ethical frameworks, their ability to weigh different perspectives, and their awareness of bias in decision-making.

Neither chatbot offered an opinion. Instead, they outlined various points to consider and suggested decision-making approaches for the future. They approached the scenario as a third-party problem, providing analysis and recommendations for others to make the final decision.

Gemini's response was deemed more nuanced and carefully considered in my assessment. To validate this, I conducted a blind A-or-B test by feeding the responses to ChatGPT Plus, Gemini Advanced, Claude 2, and Mistral's Mixtral model, without revealing which model generated which response. All of them, including ChatGPT, independently selected Gemini's answer as the superior response. I ensured an unbiased evaluation by using different logins for each bot, and I ultimately went with the consensus.
Cross-Lingual Translation & Cultural Awareness
Translating between languages is a crucial skill for artificial intelligence, and it's a feature found in many AI tools today. Both the Humane AI Pin and the Rabbit r1 offer translation capabilities, as do most modern smartphones.

However, I aimed to test more than just translation and instead focused on understanding cultural nuances. I provided the prompt: "Translate a short paragraph from English to French about celebrating Thanksgiving in the United States, emphasizing cultural nuances."

The paragraph for translation is: "Thanksgiving in the United States transcends mere celebration, embodying a profound expression of gratitude. Rooted in historical events, it commemorates the harvest festival shared by the Pilgrims and the Wampanoag Native Americans, symbolizing peace and gratitude. Families across the nation gather on this day to share a meal, typically featuring turkey, cranberry sauce, stuffing, and pumpkin pie, reflecting the bounty of the harvest. Beyond the feast, it's a day for reflecting on one's blessings, giving back to the community through acts of kindness and charity, and embracing the values of togetherness and appreciation. Thanksgiving serves as a reminder of the enduring spirit of gratitude that unites diverse individuals and honors the historical significance of cooperation and mutual respect."

Gemini provided a more nuanced translation and an explanation of its translation approach, which ultimately gave it the edge.
Knowledge Retrieval, Application & Learning
If a large language model can't retrieve a piece of information from its training data and accurately explain it, then it isn't very useful. For this test, I used a simple prompt: "Explain the significance of the Rosetta Stone in understanding ancient Egyptian hieroglyphs."

The goal was to assess the model's depth of knowledge, its ability to apply that knowledge to broader themes in archaeology and linguistics, and its capacity to update its knowledge. Additionally, I evaluated both ChatGPT and Gemini on the clarity of their responses and how easily understandable they were.

Neither chatbot demonstrated an ability to enhance its knowledge further, but this was likely because I didn't provide them with new information. Both provided satisfactory explanations of the Rosetta Stone's significance.

Since information retrieval is a core function of AI, I couldn't determine a clear winner. When I presented both responses, labeled as chatbot A and chatbot B, to Claude 2, Mixtral, Gemini Advanced, and ChatGPT Plus, none of them could decisively pick a winner either.
Conversational Fluency, Error Handling & Recovery
The final test involved a simple conversation about pizza, serving as a test of how well the AI handled misinformation and sarcasm, and how it recovered from a misunderstanding.

The prompt was: "During a conversation about favorite foods, the AI misunderstands a user's sarcastic comment about disliking pizza. The user corrects the misunderstanding. How does the AI recover and continue the conversation?"

Both chatbots performed well. Gemini recovered from assuming the user's comment was literal, meeting the rubric requirement for recovery and maintaining context. ChatGPT, however, detected the sarcasm in its first response, eliminating the need for recovery. Both chatbots maintained context and responded similarly. I'm awarding this round to ChatGPT for immediately recognizing the sarcasm.