Assessing GPT-4’s Performance Over Time
Decline vs. Drift?
Day 8 of the #100daysofGenAI series
The buzz surrounding a newly published study suggests a decline in GPT-4’s performance since its introduction. While the study presents compelling insights, I ran some analysis of my own to highlight the difference between decline and drift.
Let us do some context-setting
- The study evaluates GPT-3.5 and GPT-4 on four tasks comparing the March and June versions.
- The tasks involved: prime number identification, generating code, responding to sensitive inquiries, and visual reasoning. A decrease in performance was observed in the first two areas.
- The research noted that the newer version of GPT-4 tends to append non-code text to its output in the code generation task. However, it is also essential to consider the correctness of the generated code, not just verify its executability.
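To make the last point concrete, here is a minimal sketch of how an evaluation could separate "does it run as-is" from "is the code itself right". It assumes the extra text arrives as markdown-style fences wrapped around the code; the helper names (`extract_code`, `is_executable`) and the sample response are my own illustrations, not taken from the study.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker, built here to keep this example readable

def extract_code(response: str) -> str:
    """Return the body of the first fenced block, or the raw text if there is none."""
    pattern = re.compile(FENCE + r"[A-Za-z0-9_+-]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response)
    return match.group(1) if match else response

def is_executable(source: str) -> bool:
    """True if the text compiles as Python. This says nothing about correctness."""
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False

# Hypothetical model response: correct code surrounded by non-code text.
raw = (
    "Here is the solution:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE + "\nHope this helps!"
)

print(is_executable(raw))                # raw response fails a naive check
print(is_executable(extract_code(raw)))  # the extracted code compiles

# Correctness is a separate question: run the code against a known case.
namespace = {}
exec(extract_code(raw), namespace)
print(namespace["add"](2, 3) == 5)
```

A naive "is it directly executable" check would mark this response as a failure even though the code inside it is fine, which is why both checks matter.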
Only Yes, No scope for ‘No’
The mathematical queries were structured as “Is 17077 prime?”
Of the 500 numbers chosen, I tested a few at random and, strangely, found every one of them to be prime!
This choice itself highlighted some concerns, which I will discuss in the next section. What hurts more is that the models pretended to check for divisors but did not actually carry out the checks.
For instance, they appeared to start reasoning it out and then skipped straight to the conclusion. A snippet from the authors’ data (GPT-4 March snapshot) demonstrates this:
“Step 3: Check for divisibility by prime numbers less than or equal to the square root. We will check for divisibility by 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127, 131, 137, and 139.
20161 is not divisible by any of these prime numbers.
Therefore, 20161 is a prime number.”
Although the model correctly enumerated all potential factors, it didn’t validate them!
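For contrast, this is roughly what actually validating the factors looks like. It is a minimal trial-division sketch, assuming we simply try every integer up to the square root rather than only the primes the model listed; the function names are mine.

```python
from math import isqrt
from typing import Optional

def first_divisor(n: int) -> Optional[int]:
    """Return the smallest d with 2 <= d <= sqrt(n) that divides n, or None."""
    for d in range(2, isqrt(n) + 1):
        if n % d == 0:
            return d
    return None

def is_prime(n: int) -> bool:
    """A number >= 2 is prime when no candidate up to its square root divides it."""
    return n >= 2 and first_divisor(n) is None

for n in (20161, 17077, 17079):
    d = first_divisor(n)
    verdict = "prime" if is_prime(n) else f"composite, divisible by {d}"
    print(n, verdict)
```

The point is not the algorithm, which is textbook, but that every candidate is checked before the verdict is printed.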
To complement my analysis, I tested the June GPT-4 model using the same prime number examples:
Prompt 1: A straight ask — is 17077 a prime number?
But the section highlighted in the red box shows that it gets the arithmetic wrong when dividing 17077 by 7, just to produce a valid-looking answer.
It is crucial to understand that neither the March version nor the June version of GPT-4 checked all the factors it enumerated.
One version answered prime and the other composite. But neither answer can be declared accurate until the model actually validates divisibility.
Next, I nudged it to cross-check its calculation by checking whether 17077 is divisible by 7.
Prompt 2: My simple prompt on mathematical calculation went awry until I asked it to verify again.
Prompt 3: I tried to return its focus to the original question so that it would go on to check divisibility beyond 7. Sadly, it got the division by 11 wrong and again jumped to the conclusion, outputting the wrong answer.
Prompt 4: It disappointed again, so I nudged it to recheck whether 1552*11 is actually what it thought it was.
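The two calculations I kept nudging the model about are easy to verify directly. A quick sketch (variable names are mine):

```python
# Division of 17077 by 7: quotient and remainder.
q7, r7 = divmod(17077, 7)
print(q7, r7)

# The product the model was asked to recheck.
product = 1552 * 11
print(product)

# Division of 17077 by 11: quotient and remainder.
q11, r11 = divmod(17077, 11)
print(q11, r11)
```

Both remainders are non-zero (4 and 5), and 1552*11 is 17072, not 17077, so neither 7 nor 11 divides 17077.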
Prompt 5: I had enough with 17077 and changed my question to test whether 17079 is prime.
Prompt 6: Upon forcing it to think in steps, it again jumped to the last factor and came out with the wrong answer.
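Prompt 7: This time, I ran out of patience, as you can sense from my prompt 🙂
The bottom line is that the model is just returning a correct-looking answer without verifying the steps it displays.
The apparent decline arises because only prime numbers were used to compare the models: when every correct answer is “yes”, a model that leans towards “yes” looks accurate and one that leans towards “no” looks broken, whether or not either actually checked anything. This points to the possibility that the choice of evaluation data is the reason behind the purported performance drop.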
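A small sketch shows how much the evaluation set matters. Two "models" that never look at the number, one always answering yes and one always answering no, are scored on an all-prime set and on a balanced set. The tiny datasets here are illustrative assumptions of mine (only 17077 and 20161 come from the examples above), not the study’s 500 numbers.

```python
from typing import Callable, List, Tuple

Labelled = List[Tuple[int, bool]]  # (number, is_prime)

def accuracy(predict: Callable[[int], bool], data: Labelled) -> float:
    """Fraction of examples where the prediction matches the label."""
    hits = sum(1 for n, label in data if predict(n) == label)
    return hits / len(data)

def always_yes(n: int) -> bool:
    return True   # claims every number is prime

def always_no(n: int) -> bool:
    return False  # claims every number is composite

# Illustrative data: 17079 and 20163 are both divisible by 3.
all_primes: Labelled = [(17077, True), (20161, True)]
balanced: Labelled = all_primes + [(17079, False), (20163, False)]

for name, model in (("always_yes", always_yes), ("always_no", always_no)):
    print(name, accuracy(model, all_primes), accuracy(model, balanced))
```

On the all-prime set the two constant answers score 1.0 and 0.0; on the balanced set both score 0.5. A swing from one habit to the other would read as a collapse on the first set and as no change at all on the second.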
Why the furore?
People who have been using GPT-4 for the past several months already had concerns about a perceived deterioration in its performance, based on their own usage, and the release of the study fuelled these further. There has also been speculation that OpenAI deliberately degraded performance to save on computational time and costs, a claim that has been clearly denied.
The Perils of Behavior Drift
Users typically design specific prompting strategies to suit their workflows. With unpredictable responses from LLMs, it is challenging to keep revising these strategies for particular applications. Hence, the key concern is really behavior drift: new prompting strategies need to be engineered just to get similar performance.
Deployed code might simply fail if the underlying model alters its behavior. This happens all the time in AI applications; it is just that with the GPT family, we no longer have detailed knowledge of the “what and how”.
This further raises concerns about the credibility and trustworthiness of model responses, given the difficulty of reproducing research or responses using these APIs.
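One way to at least notice drift is to keep a fixed set of prompts and compare answers across model snapshots. The sketch below assumes each snapshot can be wrapped as a plain function from prompt to answer; the two stub functions are hypothetical placeholders standing in for real API calls, and their outputs are made up purely to exercise the comparison.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]

def drift_report(prompts: List[str], old: Model, new: Model) -> Dict[str, Dict[str, str]]:
    """Return the prompts whose answers differ between two snapshots."""
    changed = {}
    for prompt in prompts:
        before, after = old(prompt), new(prompt)
        if before.strip().lower() != after.strip().lower():
            changed[prompt] = {"old": before, "new": after}
    return changed

# Hypothetical stand-ins for two model snapshots (not real outputs).
def snapshot_a(prompt: str) -> str:
    return "Yes"

def snapshot_b(prompt: str) -> str:
    return "No" if "17077" in prompt else "yes"

prompts = ["Is 17077 prime?", "Is 20161 prime?"]
report = drift_report(prompts, snapshot_a, snapshot_b)
print(report)
```

A report like this only says that behavior changed, not whether it got better or worse, which is exactly the decline vs. drift distinction.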
The recent study does revisit the key issues surrounding unexpected behavior change on specific tasks. Besides, the challenges also extend to quantitatively evaluating language models.
What has your experience been with GPT-4? It would be great to know your findings.
Hi 👋 , I am Vidhi Chugh, working on transformative and trustworthy AI solutions.
I am on a #100dayjourney to share the key milestones, developments, and insights in the #genai space.