nostalgebraist

marcusseldon asked:

I just read your post about how GPT-4 does a lot better on many trick questions posed to it. Do you suspect OpenAI is specifically finetuning/prompting/training (not sure right terminology on this) GPT on these trick questions? It seems like often when GPT flubbing a trick question goes viral, GPT magically starts getting the question right a week later. If this is what's going on, it makes GPT seem less impressive tbh.

nostalgebraist answered:

(Here’s the post in question.)

I doubt it, since ChatGPT 3.5 still gets the questions wrong – I checked. And I would assume it’s much cheaper and easier to re-tune that model than to re-tune GPT-4.

Also, we now have some degree of ability to check for these changes, because OpenAI’s API provides “snapshot” versions of these models from a particular date. As of this writing:

  • For ChatGPT 3.5, you can either use the latest model, or a snapshot from March 1.
  • For GPT-4, you can either use the latest model, or a snapshot from March 14.

Both of these dates are later than that twitter thread (Feb 22), but not by a huge amount.
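For reference, here's a minimal sketch of how one might pin those snapshots through the API. The model IDs `gpt-3.5-turbo-0301` and `gpt-4-0314` are the API names for the March 1 and March 14 snapshots; the helper function is my own illustration (it assumes the `openai` Python package and an API key in the environment), not code from the post:

```python
# Sketch: querying a pinned snapshot vs. the "latest" alias.
# The ask() helper is an assumption for illustration; it requires
# the openai Python package (v1+ client) and OPENAI_API_KEY to be set.

MODELS = {
    "chatgpt-3.5 latest": "gpt-3.5-turbo",
    "chatgpt-3.5 snapshot (Mar 1)": "gpt-3.5-turbo-0301",
    "gpt-4 latest": "gpt-4",
    "gpt-4 snapshot (Mar 14)": "gpt-4-0314",
}

def ask(model_id: str, question: str, temperature: float = 1.0) -> str:
    """Send a single-turn question to one model and return its reply."""
    from openai import OpenAI  # imported here so the module loads without the package
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model_id,
        temperature=temperature,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```

Running the same trick question through both members of each pair is then just two `ask()` calls with different model IDs.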

I briefly tried comparing the snapshots to the latest models, on these questions, for GPT 3.5 and 4.

I didn’t observe any differences between snapshot and latest.

But I did notice that GPT-4, in the API, often got the same questions wrong!

I initially chalked this up to some difference between the API and the chat.openai.com experience (e.g. temperature, system message). However, after re-trying the questions on chat.openai.com, I found that GPT-4 sometimes fails there as well.

On the car crash and river crossing questions, it sometimes gives the right answer, but it also sometimes behaves like 3.5 and recites the solution to the more common version of the problem.

I also saw some odd cases like this, which aren’t easy to classify as success or failure:

[image: screenshot of a GPT-4 response]

This is not a failure of attentiveness, I think, though it is a failure to be helpful / to play along.

(Though there is a possible version of the situation in which the model’s response is correct: in real life, the resolution to an apparent paradox often involves the discovery of a flaw in one of your assumptions, rather than devising an ingenious way to wriggle out of the paradox “as written.” From the model’s perspective, this might well be the case for the user here.)

It has a much higher success rate on the Monty Hall and feathers/bricks problems. On these, after trying a number of times, I was not able to elicit a failure.

Intriguingly, if I set temperature=0 in the API, the resulting (deterministic) answer is…

  • Wrong, for the questions it sometimes gets wrong at nonzero temperature (car crash and river crossing)
  • Correct, for the questions it always gets right at nonzero temperature (Monty Hall and feathers/bricks)

For example, in the car crash question, the temperature=0 answer is the single sentence

The surgeon is the boy’s biological mother.

We might interpret the screenshot above as the model first sampling a wrong answer very close to this, then noticing the problem, and attempting to work its way out of the hole.
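Mechanically, temperature=0 is just greedy decoding: the logits are divided by the temperature before the softmax, which sharpens the distribution, and in the limit the model always emits its single most probable token. A toy illustration (the logits below are made up for the example, not real model outputs):

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature).
    temperature=0 is treated as the greedy limit: always pick the argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # numerically stable softmax
    return rng.choices(range(len(logits)), weights=weights)[0]

# Made-up logits: index 0 = the pat "biological mother" answer,
# index 1 = an answer that notices the question's actual wording.
logits = [2.0, 1.2, -1.0]
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(100)}    # only ever the argmax
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}  # can mix in other answers
```

So a question the model gets right only sometimes at temperature 1 can be reliably wrong at temperature 0, if the wrong answer happens to be the argmax.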

I guess this underscores the necessity of running the same question multiple times and computing a success rate, if we want to seriously investigate the model’s performance.

(Even a single success shows that it’s possible for the model to succeed, which is something. But even then, it’s worth verifying that one didn’t just get lucky.

“It got the right answer on my first try!” sounds very different from “it gets the right answer 25% of the time.” Yet, for ~25% of questions for which the latter is true, someone who only tries once would report the former experience.)
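That last point is easy to check numerically: if a question’s true per-try success rate is 25%, a tester who tries exactly once observes a success with probability exactly 0.25. A quick simulation (the 25% figure comes from the example above; everything else is made up):

```python
import random

def one_shot_reports(true_rate, n_testers, rng):
    """Fraction of single-try testers who happen to observe a success."""
    return sum(rng.random() < true_rate for _ in range(n_testers)) / n_testers

rng = random.Random(0)
frac = one_shot_reports(true_rate=0.25, n_testers=100_000, rng=rng)
print(f"{frac:.3f}")  # close to 0.25: one tester in four sees a "first-try success"
```

The same harness, with the stubbed coin flip replaced by an actual model call plus a grader, is what a multi-trial success-rate measurement would look like.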

jadagul

It seems like GPT-4 does much better on these things, possibly better than a well-read but lazy and inattentive human, but still worse than a human who is paying attention. Which maybe is what I’d expected originally, so nice to see that maybe that’s true?

One question obviously is how much you can get better at this just by scaling up what we’re currently doing.

And then, as you’ve pointed out, how much we can scale up what we’re currently doing; we might be running out of text. The AI people I’ve talked to think they can solve that by using AI to generate text and then training the next generation of AI on that generated text. And I gotta say that I just don’t believe that can work for this.

#ai #gpt
