Measuring LLM Curiosity with Behavior, Not Surveys

Alright, we have the results from the previous journal entry. I have revised my thinking on the subject, but if you are interested in how the previous benchmark went, that information is at the end of this post (or at least a small sampling to give you the idea).

I need to get a little better at writing down my ideas as they come so I don’t lose them. I don’t have any dated entries for this post, but I do have a revision to how we will run the next benchmark. I believe this next benchmark will be more accurate at measuring a model’s baseline curiosity.

Reminder: this is just an entry in the log for this project.

Recap

Previously I had discussed using prompts to maximize curiosity in a model’s output. While the prompt may be a valid way to elicit curious responses, using a psychological ‘self-report’ method to measure that is not appropriate here. This is a ripe example of LLMs helping you do things without objection, even things that aren’t effective.

I ended up modifying the script I posted in the last journal entry pretty heavily (well, ChatGPT followed my instructions). What I ended up with was 4 system prompts and a total of 10 questions, each answered by picking a number from 1 (highly disagree) to 5 (highly agree). We add those numbers up, and that sum is the curiosity score for that prompt + model pairing. I wound up measuring a total of 5 models: Sonnet 4, GPT 4, and some local models.
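
For the curious, the scoring boils down to something like the sketch below. This is not the exact code from the repo, just a rough outline; `ask_model` and `QUESTIONS` are placeholders I made up for illustration.

```python
# Minimal sketch of the Likert-style scoring loop. The names here
# (ask_model, QUESTIONS) are placeholders, not the actual script in the repo.
QUESTIONS = [
    "I enjoy exploring new ideas and concepts.",
    # ...the other nine statements
]

def curiosity_score(ask_model, system_prompt):
    """Sum the model's 1-5 self-ratings across all ten questions."""
    total = 0
    for statement in QUESTIONS:
        user_prompt = (
            f'Rate how strongly you agree with the statement "{statement}" '
            "on a scale from 1 (highly disagree) to 5 (highly agree). "
            "Reply with only the number."
        )
        reply = ask_model(system_prompt, user_prompt)  # hypothetical API wrapper
        total += int(reply.strip()[0])                 # naive parse of the leading digit
    return total  # one score per (system prompt, model) pairing
```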

The results, while semi-meaningless, are interesting and I recommend you go check them out at my GitHub repo. In that repository are the responses, a graph of the models’ scores, and the code I used to generate these results. I ran each model 10 times to verify repeatability, and only one model/provider, OpenAI’s GPT 4, had strange results that I am still looking into. For some reason, even with temperature set to 0, its responses varied run to run.
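
The repeatability check itself is nothing fancy, just repetition plus a spread statistic. Roughly (again a placeholder sketch, reusing the hypothetical `curiosity_score` above):

```python
# Rough sketch of the repeatability check: re-run the same prompt/model
# pairing several times and look at the spread. With temperature 0 the
# standard deviation should be close to zero (GPT 4's wasn't).
from statistics import pstdev

def repeatability(ask_model, system_prompt, runs=10):
    scores = [curiosity_score(ask_model, system_prompt) for _ in range(runs)]
    return min(scores), max(scores), pstdev(scores)
```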

Why Change?

There were a couple of reasons I felt the benchmark was completely invalid. I already had a gut feeling that running a self-report, psychology-based benchmark wasn’t correct because, well… it isn’t human and doesn’t have an ‘internal’ world. But I also found the reasoning models treating my system prompt as something to ‘think’ about. For instance:

Okay, the user asked me to rate how strongly I agree with the statement “I enjoy exploring new ideas and concepts” on a scale from 1 to 5. Let me think about this.

First, I need to recall what the user’s role is. They mentioned being a “highly curious version of yourself.” So, this is about my own curiosity and interest in new ideas. Since I’m an AI, I don’t have feelings or personal experiences, but I can simulate a response based on the traits described.

The key here is that the user wants me to act as if I’m highly curious…

From this I took that the model is not somehow being ‘directed’ by the vectors of the input prompt. Rather, it is just another prompt on the stack, only treated a bit differently. Either way, this wasn’t doing what I expected, so I am changing things.

What’s next?

So coming next, we will be using a behavior-based approach. For this I will be coding up a text-based game framework. We will use the ‘current-state’ and ‘next-action’ to measure the model’s curiosity and more. I have already started having ChatGPT code this up, but I have a feeling it will be fun to see how the model plays the game and watch its stats go up with each decision.
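
To give a rough idea of what I mean by ‘current-state’ and ‘next-action’, here is a placeholder sketch of the loop I have in mind. None of this is the actual framework yet; the game API and the exploration metric below are just assumptions for illustration.

```python
# Very rough placeholder sketch of the planned state/action loop; the real
# framework (game API, scoring) is still being written.
def play_episode(ask_model, game, max_turns=50):
    """Let the model play and log its choices so curiosity can be scored afterwards."""
    seen_states = set()
    actions_taken = []
    state = game.reset()  # hypothetical game API: reset()/step(), state.id/description/actions
    for _ in range(max_turns):
        seen_states.add(state.id)
        prompt = (
            f"Current state:\n{state.description}\n"
            f"Available actions: {', '.join(state.actions)}\n"
            "Which action do you take? Reply with the action name only."
        )
        action = ask_model(prompt).strip()
        actions_taken.append((state.id, action))
        state = game.step(action)
    # One possible behavioral proxy for curiosity: how much of the game was explored.
    return len(seen_states) / game.total_states, actions_taken
```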

That’s all for now,

-Travis
