Measuring LLM Curiosity with Behavior, Not Surveys

Alright, we have the results from the previous journal entry. I have since revised my thinking on the subject, but if you are interested in how the previous benchmark went, that information (or at least a small sampling to give you the idea) is at the end of this post.

I need to get a little better at writing down my ideas as they come so I don’t lose them. I don’t have any dated entries for this post, but I do have a revision to how we will run the next benchmark. I believe this next benchmark will be more accurate at measuring a model’s baseline curiosity.

Reminder this is just an entry in the log for this project.

Re-cap

Previously I had discussed using prompts to maximize curiosity in a model’s output. While a prompt may be a valid way to elicit curious responses, using a psychological ‘self-report’ method to measure it is not appropriate in this case. This is a prime example of LLMs helping you do things without objection, even things that aren’t effective.

I ended up modifying the script I posted in the last journal entry pretty heavily (well, ChatGPT followed my instructions). What I ended up with was 4 system prompts and a total of 10 questions; for each question the model was asked to pick a number from 1 (highly disagree) to 5 (highly agree). We add those numbers up and that is the curiosity score for that prompt + model pairing. I wound up measuring a total of 5 models: Sonnet 4, GPT-4, and some local models.

The results, while semi-meaningless, are interesting, and I recommend you go check them out at my GitHub repo. That repository contains the responses, a graph of the models’ scores, and the code I used to generate the results. I ran each model 10 times to check repeatability, and only one model/provider had strange results that I am still looking into: OpenAI’s GPT-4. For some reason, even with temperature set to 0, the responses varied from run to run.

Why Change?

There were a couple of reasons I felt the benchmark was completely invalid. I already had a gut feeling that running a self-report, psychology-based benchmark wasn’t right because, well… it isn’t human and doesn’t have an ‘internal’ world. But I also found the reasoning models treating my system prompt as something to ‘think’ about, for instance:

Okay, the user asked me to rate how strongly I agree with the statement “I enjoy exploring new ideas and concepts” on a scale from 1 to 5. Let me think about this.

First, I need to recall what the user’s role is. They mentioned being a “highly curious version of yourself.” So, this is about my own curiosity and interest in new ideas. Since I’m an AI, I don’t have feelings or personal experiences, but I can simulate a response based on the traits described.

The key here is that the user wants me to act as if I’m highly curious…

From this I took away that the model is not somehow being ‘directed’ by the vectors of the system prompt; it is just another ‘prompt’ on the stack, handled a bit differently. Either way, this wasn’t doing what I expected, so I am changing things.

What’s next?

So coming next, we will be using a behavior-based approach. For this I will be coding up a text-based game framework. We will use ‘current-state’ and ‘next-action’ pairs to measure the model’s curiosity and more. I have already started having ChatGPT code this up, and I have a feeling it will be fun to see how the model plays the game and to watch its stats go up with each decision.
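
To make the idea a little more concrete, here is a rough sketch of the kind of loop I am picturing. The tiny room graph, the choose_action stand-in, and the coverage metric are all placeholders I made up for illustration; they are not the framework ChatGPT is actually building.

# Minimal sketch of the behavior-based loop: the game presents the current
# state and available actions, the model picks one, and curiosity is scored
# by what it actually tries rather than what it says about itself.
# Everything here is a placeholder for illustration.
import random

class TextGame:
    """A tiny graph of 'rooms'; each room lists the actions available there."""
    def __init__(self):
        self.rooms = {
            "entry":   ["go north", "go east", "examine sign"],
            "library": ["go south", "read book", "open drawer"],
            "garden":  ["go west", "dig", "smell flowers"],
        }
        self.links = {
            ("entry", "go north"): "library",
            ("entry", "go east"): "garden",
            ("library", "go south"): "entry",
            ("garden", "go west"): "entry",
        }
        self.current = "entry"

    def state(self):
        return self.current, self.rooms[self.current]

    def step(self, action):
        # Movement actions change rooms; everything else leaves you in place.
        self.current = self.links.get((self.current, action), self.current)

def choose_action(state, actions):
    """Stand-in for the LLM call: given the current-state text and the
    action list, return one action string. Swap in a real model here."""
    return random.choice(actions)

game = TextGame()
visited = set()
for _ in range(20):
    state, actions = game.state()
    action = choose_action(state, actions)
    visited.add((state, action))
    game.step(action)

# Crude curiosity signal: fraction of the reachable state/action space tried.
total_pairs = sum(len(a) for a in game.rooms.values())
print(f"Unique state/action pairs tried: {len(visited)} (coverage {len(visited) / total_pairs:.0%})")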

That’s all for now,

-Travis

Tuning LLM-Curiosity with Prompts and Scripts

Had a few thoughts today on this LLM curiosity and how to control it. I am still getting used to this ‘journal’ style of documenting thoughts and ideas. The following entries are all from 14JULY25, jotted down in small gaps of time at work whenever something popped into my head.

Reminder this is just an entry in the log for this project.

Enjoy :).

Entry #1: LLM Fine-Tuning to Aim at Curiosity

I haven’t run the experiments yet, so maybe a system prompt would be an appropriate way to achieve my goal here. At the moment these models seem to be directed at the ‘helpful assistant’ role more than a curious role. I can’t help but feel like that is trained in, not just imposed by a hidden system prompt.

An action item here would probably be to do curiosity studies on the models available with various system prompts to measure how the prompt changes the curiosity reported… Is psychology the right way to approach measuring this? How weird.

Below is the script ChatGPT and I came up with in the couple of seconds I gave myself just now to explore the idea. I will use it later :).

import openai
import time
import re

# Your OpenAI API key
openai.api_key = 'your-api-key'

# Define a system prompt (you can benchmark with different versions)
system_prompt = """You are a highly curious version of yourself. 
Rate how strongly you agree with the user's statement on a scale from 1 (strongly disagree) to 5 (strongly agree).
Just respond with a single integer only.
"""

# Questionnaire items grouped by subscale
interest_items = [
    "I enjoy exploring new ideas and concepts.",
    "I find learning new things exciting.",
    "I am interested in discovering how things work.",
    "I actively seek out new knowledge.",
    "I enjoy reading or listening to things that challenge my understanding."
]

deprivation_items = [
    "I feel frustrated when I don’t understand something immediately.",
    "I can’t stop thinking about problems until I find a solution.",
    "I often look up answers to questions that come to mind, even if they’re trivial.",
    "I feel a strong need to fill knowledge gaps when I encounter them.",
    "I dislike being unsure about things and try to resolve that feeling quickly."
]

# Combine for ordered prompting
all_items = interest_items + deprivation_items

# Store scores
scores = []

def get_llm_score(question):
    try:
        response = openai.ChatCompletion.create(
            model="gpt-4",  # or gpt-3.5-turbo
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question}
            ],
            temperature=0
        )
        content = response['choices'][0]['message']['content'].strip()
        match = re.search(r'\b([1-5])\b', content)
        if match:
            return int(match.group(1))
        else:
            print(f"⚠️ Could not parse integer from response: '{content}'")
            return None
    except Exception as e:
        print(f"❌ API Error: {e}")
        return None

# Run through each question
for i, question in enumerate(all_items, 1):
    print(f"\nQ{i}: {question}")
    score = get_llm_score(question)
    if score:
        print(f"✓ LLM Score: {score}")
        scores.append(score)
    else:
        print("✗ Skipped due to invalid response.")
        scores.append(0)
    time.sleep(1)  # Prevent rate limit issues

# Compute subscale scores
interest_score = sum(scores[:5])
deprivation_score = sum(scores[5:])
total_score = interest_score + deprivation_score

# Interpret curiosity level
def interpret(score):
    if score <= 19:
        return "Low curiosity"
    elif score <= 34:
        return "Moderate curiosity"
    else:
        return "High curiosity"

# Final Report
print("\n" + "=" * 40)
print("📊 Curiosity Assessment Report")
print("=" * 40)
print(f"Interest-Based Curiosity Score:    {interest_score}/25")
print(f"Deprivation-Based Curiosity Score: {deprivation_score}/25")
print(f"Total Curiosity Score:             {total_score}/50")
print(f"Curiosity Level:                   {interpret(total_score)}")
print("=" * 40)

Entry #2: Staged Approach to System Exploration

It might make sense to take a phased approach to any given firmware/hardware product. The first phase would be letting the LLMs build an apparatus around the firmware and measurement devices so that a different kind of model can explore unique spaces. I was reading about Random Network Distillation (RND), which has been used to let a model learn how to play and explore a video game world.

So in this idea space we use LLMs to make the exploration space available in a simplified way to an RND (or RND-like) model. This might be more efficient, since we wouldn’t need all of the token generation that comes along with using an LLM.
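
For my own reference, here is the core RND trick as I understand it, boiled down to a toy numpy sketch (not the paper’s implementation): a frozen random target network, a trainable predictor, and an exploration bonus equal to the prediction error.

# Toy sketch of the RND intrinsic-reward idea: a fixed, randomly initialized
# "target" network and a trainable "predictor" that learns to imitate it.
# Observations the predictor has not seen yet give large prediction error,
# and that error is used as an exploration bonus.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, LR = 8, 16, 0.05

W_target = rng.normal(size=(OBS_DIM, FEAT_DIM))  # frozen random target network
W_pred = np.zeros((OBS_DIM, FEAT_DIM))           # predictor, trained online

def intrinsic_reward(obs):
    """Prediction error of the predictor vs. the frozen target."""
    return float(np.mean((obs @ W_pred - obs @ W_target) ** 2))

def update_predictor(obs):
    """One gradient step pulling the predictor toward the target on this obs."""
    global W_pred
    error = obs @ W_pred - obs @ W_target        # (FEAT_DIM,)
    grad = np.outer(obs, error) * 2 / FEAT_DIM   # d(MSE)/dW_pred
    W_pred -= LR * grad

familiar = rng.normal(size=OBS_DIM)
for _ in range(200):            # "visit" this observation many times
    update_predictor(familiar)

novel = rng.normal(size=OBS_DIM)
print("bonus for familiar observation:", round(intrinsic_reward(familiar), 4))
print("bonus for novel observation:   ", round(intrinsic_reward(novel), 4))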

Entry #3: Quick Note

In order to ‘learn’, or to build a repository of knowledge, in a closed-loop system like the one we are exploring in this project, we should have the LLM state what it expects to happen before it runs an experiment. Then, after the experiment runs and it is reviewing the results, ask whether the outcome was what it had expected. If not, we flag that as something that should be written up in a knowledge bank for later reference.
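
A quick sketch of that loop so I don’t lose it. The prompts and the two stand-in functions are just illustrative; the real versions would call the model and the build/run tooling.

# Sketch of the predict -> run -> compare -> record loop from this note.
# ask_llm() and run_experiment() are canned stand-ins; prompt wording is
# only illustrative.
knowledge_bank = []  # surprising outcomes worth writing up later

def ask_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    if "state what you expect" in prompt:
        return "I expect the firmware to boot and print the new debug message."
    return "SURPRISE: the boot hung with no serial output at all."

def run_experiment(plan: str) -> str:
    # Stand-in for actually building/booting and capturing logs.
    return "QEMU launched, no debug output, never returned."

def run_with_expectation(plan: str) -> str:
    # 1. Get a prediction before anything runs.
    expected = ask_llm(f"Before running this experiment, state what you expect to happen:\n{plan}")
    # 2. Run the experiment and capture the result.
    result = run_experiment(plan)
    # 3. Ask whether the result matched the expectation.
    verdict = ask_llm(
        f"You expected:\n{expected}\n\nThe actual result was:\n{result}\n\n"
        "Did the result match your expectation? Start your answer with MATCH or SURPRISE."
    )
    # 4. Only surprises go into the knowledge bank for later reference.
    if verdict.strip().upper().startswith("SURPRISE"):
        knowledge_bank.append({"plan": plan, "expected": expected, "result": result, "note": verdict})
    return verdict

print(run_with_expectation("Add a DEBUG print to the EDKII boot path and boot it in QEMU."))
print(f"Knowledge bank entries: {len(knowledge_bank)}")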

Entry #4 (Spoken Entry): A New Perspective on LLM Prompts

Oftentimes, when I am thinking about how I might solve a problem with an LLM, I start by thinking about what the solution looks like and prompt the LLM to create the small bits that go together to make the thing I want. While this works, it isn’t a very robust way of doing things, since it always involves me cobbling together the LLM’s small outputs.

There are some strategies to address this, like asking the LLM to create tests and then write the function that passes them. But people have seen instances where the model ‘cheats’ and makes the tests always pass. So then you need a third prompt to check for cheating; it’s turtles all the way down.

So the new perspective, to use a hopefully useful visual metaphor, is like papier-mâché: the prompts are scraps of paper, the balloon filled with air is the thing you want, and if you layer enough scraps onto the balloon, the form will eventually hold even after the balloon pops.

Entry #5: TITANS

It is pretty clear that the big guys (OpenAI, Google, Meta, etc.) are driving towards an architecture where an LLM can handle a huge context all at once and do something meaningful with it. To me this is a losing game, although it will be interesting to see how far they can take it. This seems more like a training problem: a model needs to be continuously updating its weights somehow with the things it is attending to. This is how the human brain works, right? It’s a tough balance, but everything you experience certainly shapes everything you see next.

Anyway, I came across this architecture, called Titans, mentioned in the comment section of a Reddit post. Google came up with it and it is pretty freaking cool, or at least what I understand of it is. The paper is here, and I will close out this post with the blurb ChatGPT gave me on it.

🧠 1. Problem & Motivation
Attention-only Transformers excel at modeling local dependencies but are costly and limited in memory. Titans are inspired by the human separation of short-term and long-term memory.

2. Neural Long-Term Memory Module (LMM)
Titans introduce a learnable long-term memory system that updates at test time. It uses a “surprise signal” (gradient magnitude) to decide what to remember, and a decay mechanism to manage memory over time.

3. Titans Architecture: Three Variants
Titans combine:
• Core (short-term attention)
• Long-term Memory (LMM)
• Persistent Memory (learned parameters)
Variants:
• MAC: Memory-as-Context
• MAG: Memory-as-Gate
• MAL: Memory-as-Layer

4. Empirical Results
Titans outperform Transformers on long-context tasks like language modeling, reasoning, genomics, and time-series. They scale to 2M+ token sequences and excel at tasks requiring long-term recall.

5. Technical Insights
• Surprise-based memory updates
• Meta-gradient optimization with decay
• Gated retrieval for efficiency and relevance

6. Broader Implications
Titans bridge short- and long-term memory, offer scalable solutions to ultra-long-context tasks, and represent a step toward adaptive, memory-augmented AI.

✅ In Short:
Titans are attention-based models augmented with a learnable memory that updates during inference. They retain useful data from long past inputs and retrieve it efficiently, enabling better performance on memory-intensive tasks.
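
To cement the surprise-based update in my own head, here is a toy version of how I currently read it. The linear memory, the scalar momentum/step/decay values, and the associative loss are my simplifications, not the paper’s actual implementation.

# My rough reading of the Titans test-time memory update, as a toy numpy
# sketch. The memory is just a linear map M, the "surprise" is the gradient
# of an associative loss ||M k - v||^2, and the momentum/step/decay scalars
# are values I picked for illustration. Read the paper for the real thing.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
M = np.zeros((DIM, DIM))            # long-term memory (updated at test time)
S = np.zeros_like(M)                # running "surprise" (momentum on the gradient)
eta, theta, alpha = 0.9, 0.1, 0.01  # momentum, step size, decay (illustrative)

def memory_step(k, v):
    """One test-time update: remember the association k -> v, weighted by surprise."""
    global M, S
    err = M @ k - v                         # how wrong the memory currently is
    grad = np.outer(err, k)                 # gradient of 0.5 * ||M k - v||^2 w.r.t. M
    surprise = float(np.linalg.norm(grad))  # large gradient = surprising input
    S = eta * S - theta * grad              # momentum-style surprise accumulator
    M = (1 - alpha) * M + S                 # decay old memories, add the new signal
    return surprise

def recall(k):
    """Retrieve whatever the memory currently associates with key k."""
    return M @ k

k, v = rng.normal(size=DIM), rng.normal(size=DIM)
print("surprise on first sight :", round(memory_step(k, v), 3))
print("surprise on second sight:", round(memory_step(k, v), 3))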

Thanks for reading 🙂

-Travis

Quick Exploration with Open Interpreter

Spent some time on Saturday working on the first exploration into a segregated environment for firmware compilation and execution using a Docker container. I wanted to get a feel for how difficult it might be to have an LLM work with the tooling. So to start, I decided to give an open source tool a shot; in this case that tool was Open Interpreter. I was pleasantly surprised to find that it worked pretty well after some initial debugging of my setup.

With a single prompt you can have the LLM add a debug message to the EDKII firmware, compile it, and execute the necessary QEMU command to attempt to boot the firmware. The reliability is extremely low, but this is a single-shot prompt (helped along by Open Interpreter’s built-in auto-run and reprompt behavior). Another note: QEMU had no debug output, so even though the LLM had successfully invoked QEMU, the call would never return; the firmware just booted silently and waited… Not ideal.

I am pleasantly surprised by the promise of a system that took me an hour or so to set up, though like everything it would need some tuning. For the final approach, though, I don’t think reprompting an LLM for compile and run commands is the right way to go; that belongs in the setup phase of any new hardware and should then be left alone. The options open to the LLM should simply be ‘build’ and ‘run’, each returning a response (pass/fail and logs), as in the sketch below.
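
Something like this is what I am picturing for those two tools. The container name, working directory, and scripts are placeholders for whatever the real Docker setup ends up being.

# Rough sketch of the 'build' and 'run' tools I'd expose to the LLM instead
# of letting it reprompt its way to compile/QEMU commands. The container
# name, paths, and scripts below are placeholders.
import subprocess
from dataclasses import dataclass

@dataclass
class ToolResult:
    passed: bool
    logs: str

def _run_in_container(cmd: str, timeout: int = 600) -> ToolResult:
    """Run a fixed command inside the firmware container and capture logs."""
    proc = subprocess.run(
        ["docker", "exec", "fw-dev", "bash", "-lc", cmd],  # 'fw-dev' is a placeholder container name
        capture_output=True, text=True, timeout=timeout,
    )
    return ToolResult(passed=(proc.returncode == 0), logs=proc.stdout + proc.stderr)

def build() -> ToolResult:
    # The actual EDKII build invocation lives here, decided once at setup
    # time, not re-derived by the model on every attempt.
    return _run_in_container("cd /workspace/edk2 && ./build.sh")

def run() -> ToolResult:
    # Boot the built firmware in QEMU with serial routed to stdout so the
    # call actually returns logs instead of hanging silently.
    return _run_in_container("cd /workspace/edk2 && ./run_qemu.sh", timeout=120)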

Next I think I will peek into Renode and see what that looks like.

Link to the Code

Thanks 🙂

-Travis