Can Your Prompts Optimize Themselves?
Exploring how DSPy's declarative approach to prompt engineering replaces hand-crafted templates with Bayesian-optimized programs and what happens when you apply it to a real failure detection pipeline.
Introduction
In my previous article, I explored how frameworks like τ²-benchmark [6] are shaping the emerging science of AI agents. But evaluation is only half of the story. Once we know how the agent performs, how do we improve it?
The traditional answer is prompt engineering: crafting system prompts, adding examples, tweaking instructions, and hoping the model responds better. It works, up to a point. A well-structured prompt of around 4,000 characters is a reasonable size for complex agent tasks: specific enough to reduce stochastic behavior, but not so bloated that instructions start to degrade. The problem is that getting there takes days or weeks of careful iteration, and even then the improvements do not scale. At some point you have exhausted what a hand-constructed prompt can do, and fine-tuning the model's weights becomes the more honest answer. But at the scale of current state-of-the-art models, fine-tuning is rarely an option unless you host a smaller model yourself.
This is where DSPy enters the picture. DSPy treats prompt engineering not as an art but as a compilation problem, borrowing ideas from machine learning. Instead of writing instructions, you declare what you want the model to do, and DSPy figures out how to ask for it. Much as backpropagation tunes a neural network's weights better than hand-tuning ever could, DSPy optimizes prompts automatically by propagating feedback from the examples you feed it back into the pipeline.
I’ll walk you through what DSPy is, how its optimizer works under the hood, and what happened when we applied it to our real failure detection pipeline at moyai.
The Prompt Engineering Problem
Consider an agent that examines data, tries to find problems in logs, and classifies each as "failure" or "not failure". The prompt includes nine detailed error-category sections, ranging from specific httpx issues to less obvious hallucinations or safety violations. The prompt works and was carefully authored, but it has a few fundamental limitations:
- The examples are subjectively chosen. Another domain may have different failure modes, and these may not be the best examples for this model or use case.
- The instructions are intuitive for humans. Does the model actually benefit from all nine error categories, or do some just add noise?
- There is no feedback loop. The prompt never improves from seeing its own mistakes. A human can adjust it bit by bit based on results, but that is time-consuming, and the process can feel like jumping from one edge case to the next, never quite catching up with a model whose outputs are inherently non-deterministic.
- It is fragile. Changing the model, the data distribution, or simply the use case may require rewriting everything.
The question now is:
What if instead of typing and guessing prompts, we could define the structure of the task and let the optimizer discover the best instructions and examples automatically?
What is DSPy?
DSPy (Declarative Self-improving Language Programs) is a framework from Stanford NLP [9] that fundamentally rethinks how we program language models [1] [2]. Instead of writing prompts, you write programs composed of typed modules. The core abstraction is the Signature, a typed input/output specification that describes what you want, not the steps for getting there. Figuring out those steps is DSPy's job.
Example
```python
from typing import Literal

import dspy

class AnomalyDetection(dspy.Signature):
    """Classify whether a system log contains signs of failure."""

    log_summary: str = dspy.InputField(desc="Statistical summary of the log")
    samples: str = dspy.InputField(desc="Representative log samples")
    label: Literal["healthy", "warning", "fault"] = dspy.OutputField(
        desc="Classification result"
    )
    reasoning: str = dspy.OutputField(
        desc="Explanation of the classification"
    )
```
This is radically different from traditional prompting. There are no written instructions about how to classify, no examples, no output format specification. Just typed fields with a brief nudge in the form of descriptions.
DSPy then wraps this signature in a Module, the building block of DSPy programs. The most common module is ChainOfThought (CoT) [3], which automatically prompts the model to reason step by step before producing output:
```python
class AnomalyClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classify = dspy.ChainOfThought(AnomalyDetection)

    def forward(self, log_summary: str, samples: str):
        return self.classify(log_summary=log_summary, samples=samples)
```
When we call this module, DSPy constructs a prompt with structured field markers and enforces a clear output format.
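To make the field markers concrete, here is a rough pure-Python sketch of the kind of prompt DSPy's chat adapter assembles. The `[[ ## field ## ]]` marker style mirrors DSPy's format, but the exact wording below is illustrative, not DSPy's literal output:

```python
def render_prompt(inputs: dict[str, str], output_fields: list[str]) -> str:
    """Assemble a prompt with DSPy-style field markers (illustrative only)."""
    parts = []
    for name, value in inputs.items():
        # Each input field gets its own delimited section.
        parts.append(f"[[ ## {name} ## ]]\n{value}")
    # The model is told exactly which output fields to produce, in order.
    parts.append(
        "Respond with the fields "
        + ", ".join(f"[[ ## {f} ## ]]" for f in output_fields)
        + ", then [[ ## completed ## ]]."
    )
    return "\n\n".join(parts)

prompt = render_prompt(
    {"log_summary": "170 records, 12% non-2xx", "samples": "GET /api -> 503"},
    ["reasoning", "label"],
)
print(prompt)
```

Because both sides of the exchange use the same markers, DSPy can parse the response back into typed fields without a hand-written output parser.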
MIPROv2 Optimization
The real power of DSPy is in its optimizers. The flagship optimizer is MIPROv2 (Multiprompt Instruction Proposal Optimizer v2), which uses Bayesian optimization to jointly search over instructions and few-shot examples [4] [8].
There is also a newer optimizer, published in October 2025 and still under active research, called MAPRO [7], but it is not native to DSPy.
In short: MIPROv2 automatically finds the best combination of instructions and few-shot examples for your prompts by running your program on real data and using Bayesian Optimization to search for what works best.
Without going too deep into the details, MIPROv2 [5] operates as a pipeline:
Data —> Bootstrapping few examples —> Propose instruction candidates —> Bayesian Optimization —> Optimized Prompt (final output)
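The search stage of that pipeline can be sketched in a few lines of plain Python. Everything here is a toy stand-in: the candidate instructions, demo sets, and metric are invented for illustration, and exhaustive search replaces the Bayesian optimization MIPROv2 actually uses:

```python
# Hypothetical candidate instructions and bootstrapped demo sets.
# In real MIPROv2 these are proposed by an LM; the joint search over
# (instruction, demos) pairs is done with Bayesian optimization.
instructions = ["Classify the log.", "Carefully check for failure signals."]
demo_sets = [[("ok log", "healthy")], [("timeout log", "fault")]]

def metric(instruction: str, demos: list) -> float:
    """Toy stand-in for running the program on a dev set and scoring it."""
    score = 0.5
    score += 0.3 if "failure" in instruction else 0.0
    score += 0.2 if any(label == "fault" for _, label in demos) else 0.0
    return score

# Jointly search instruction x demo-set combinations, keep the best.
best = max(
    ((inst, demos) for inst in instructions for demos in demo_sets),
    key=lambda cand: metric(*cand),
)
print(best[0])  # -> Carefully check for failure signals.
```

The point of the sketch is the shape of the problem: the optimizer never edits your program, it only searches over which instruction and which demonstrations to compile into the prompt.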
Applying DSPy to Failure Detection
At moyai, our deep failure analysis pipeline ingests agent logs and clusters them into entity groups by schema fingerprint. Each entity - sometimes 30 records, sometimes 170 - gets pre-computed artifacts: statistical summaries, schema analysis, outliers, and representative samples. The AnomalyClassifier then examines these artifacts and labels each entity as healthy, warning, or fault.
We applied DSPy’s optimization pipeline to this AnomalyClassifier to see whether MIPROv2 could find better prompts than our hand-typed ones.
Setup
The pipeline has a natural evaluation signal: after the AnomalyClassifier labels an entity, an EvaluationAgent independently investigates the raw logs and agrees or disagrees. This yields pseudo-ground-truth without manual labeling - the agreement/disagreement pairs become training data for the optimizer.
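A minimal sketch of how such agreement/disagreement pairs could be turned into optimizer training data. The record structure and field names here are hypothetical, not moyai's actual schema:

```python
# Each record pairs the classifier's label with the EvaluationAgent's verdict.
runs = [
    {"entity": "grp-1", "classifier": "healthy", "evaluator": "healthy"},
    {"entity": "grp-2", "classifier": "healthy", "evaluator": "fault"},
    {"entity": "grp-3", "classifier": "fault",   "evaluator": "fault"},
]

# The evaluator's label becomes pseudo-ground-truth: on agreements it
# confirms the classifier, on disagreements it overrides it, since the
# evaluator investigated the raw logs directly.
trainset = [{"entity": r["entity"], "label": r["evaluator"]} for r in runs]
disagreements = [r for r in runs if r["classifier"] != r["evaluator"]]
print(len(trainset), len(disagreements))  # -> 3 1
```

No human ever labels an entity; the disagreements double as the most informative examples, since they mark exactly where the classifier is blind.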
We ran MIPROv2 with a fast inference backend as the student model and a frontier model as the teacher for bootstrapping demonstrations. The dataset: 70 entity groups from a single benchmark run.
Limitations Encountered
The first optimization run revealed practical challenges:
- Data imbalance: 63/70 entities were `healthy` with EvaluationAgent agreement. Only 4 had non-trivial signals. Predicting `healthy` every time already scores 90%+.
- Token budget conflicts: Entity contexts can exceed 170K tokens, beyond the model's limit. We implemented progressive trimming to keep examples under 40K tokens.
- Context noise: A single entity's `stats_summary` could contain 531 columns (mostly web search snippets). Filtering to diagnostic columns only (error, status, timeout, reward) reduced stats tokens by 96%.
- Optimization outcome: 18 MIPROv2 trials, and no combination beat the baseline. Best score: 87.16%, identical to the unoptimized default.
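The progressive trimming mentioned above can be sketched as follows. This is a simplified illustration, not our production code: token counts are approximated with a rough 4-characters-per-token heuristic rather than a real tokenizer:

```python
def progressive_trim(samples: list[str], budget_tokens: int = 40_000) -> list[str]:
    """Drop trailing samples, then fall back to truncating the first one,
    until a rough token estimate (~1 token per 4 chars) fits the budget."""
    est = lambda texts: sum(len(t) for t in texts) // 4
    kept = list(samples)
    while kept and est(kept) > budget_tokens:   # pass 1: drop whole samples
        kept.pop()
    if not kept:                                # pass 2: truncate a single head
        kept = [samples[0][: budget_tokens * 4]]
    return kept

big = ["x" * 100_000] * 10           # ~250K estimated tokens of context
trimmed = progressive_trim(big)
print(sum(len(t) for t in trimmed) // 4)  # comfortably under the 40K budget
```

Dropping whole samples before truncating keeps each surviving example intact, which matters when the optimizer bootstraps demonstrations from them.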
Multi-Pass Stability (4 runs)
Running the DSPy-based AnomalyClassifier across 4 passes revealed detection stability issues:
| Entity (records) | Failure type | Pass 1 | Pass 2 | Pass 3 | Pass 4 | Score |
|---|---|---|---|---|---|---|
| 38 rec group | tool_timeout (content lock) | fault | fault | fault | fault | 4/4 |
| 30 rec group | tool_timeout (content lock) | fault | fault | fault | fault | 4/4 |
| 38 rec group | tool_timeout (content lock) | fault | fault | fault | fault | 4/4 |
| 45 rec group | zero_reward (no data) | healthy | healthy | healthy | fault | 1/4 NEW |
| 117 rec group | zero_reward (missing context) | fault | healthy | fault | healthy | 2/4 |
| 65 rec group | http_5xx + tool_timeout | fault | healthy | healthy | healthy | 1/4 |
| 52 rec group | zero_reward (data unavail.) | fault | healthy | healthy | healthy | 1/4 |
Key observations:
- Stable detections (4/4): Only `tool_timeout` patterns, the most obvious failures with clear grep signals (content lock errors).
- Unstable detections (1-2/4): `zero_reward`, `http_5xx`, `parse_error`, subtler patterns that flip between passes.
- Late emergence: The 45-record cluster was `healthy` in the first 3 passes, then surfaced in pass 4: "web-search tool returned irrelevant or empty results, confidence 0.0 and null answer across all records". A `zero_reward` pattern DSPy dismissed 3 times before catching it.
- Flip-flopping: The 117-record `zero_reward` group alternates between `fault` and `healthy` across passes, with no convergence.
This underscores the stochastic nature of LLM classification on borderline cases and reinforces the need for multiple passes or downstream validation with the EvaluationAgent.
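One cheap mitigation is to aggregate labels across passes instead of trusting any single one. The sketch below uses a fault-biased vote, flagging `fault` if it appears in at least two passes; the threshold is an assumed design choice for illustration, not moyai's actual policy:

```python
from collections import Counter

def aggregate(passes: list[str], min_fault_votes: int = 2) -> str:
    """Aggregate per-pass labels: flag 'fault' if it appears in at least
    `min_fault_votes` passes, otherwise return the majority label."""
    counts = Counter(passes)
    if counts.get("fault", 0) >= min_fault_votes:
        return "fault"
    return counts.most_common(1)[0][0]

print(aggregate(["fault", "fault", "fault", "fault"]))        # stable 4/4
print(aggregate(["fault", "healthy", "fault", "healthy"]))    # 2/4 flip-flop
print(aggregate(["healthy", "healthy", "healthy", "fault"]))  # 1/4 late signal
```

Biasing toward `fault` trades precision for recall, which is acceptable here precisely because the EvaluationAgent validates every flagged entity downstream.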
Surprising finding: Merely switching from the previous prompt to DSPy's format detected more failures by itself (7 `fault` labels vs 3 with the original).
Lessons Learned
- Signal > noise: Filtering out 500+ irrelevant columns (96% of stats tokens) mattered more than any prompt tweak.
- Balanced data required: 93% `healthy` labels starved the optimizer; we need 15-20+ examples per class.
- Format is content: Switching to DSPy's native markers changed behavior without touching the instructions.
- Drop conservatism: "Be conservative" kills recall; use downstream validation for precision instead.
- DSPy shines with the right data: MIPROv2 needs signal to optimize. The first experiment was data-limited; with balanced data it would pick up features more easily.
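The "signal > noise" lesson reduces to a small allowlist filter. A hypothetical sketch, with invented column names and the diagnostic keys taken from the limitations above:

```python
# Keep only columns whose names match a small diagnostic allowlist.
DIAGNOSTIC_KEYS = ("error", "status", "timeout", "reward")

def filter_columns(stats: dict[str, object]) -> dict[str, object]:
    """Keep only columns relevant for failure diagnosis."""
    return {
        name: value
        for name, value in stats.items()
        if any(key in name.lower() for key in DIAGNOSTIC_KEYS)
    }

stats = {
    "http_status_counts": {"200": 150, "503": 20},
    "tool_timeout_rate": 0.12,
    "reward_mean": 0.0,
    "web_search_snippet_1": "irrelevant text...",
    "web_search_snippet_2": "more irrelevant text...",
}
print(sorted(filter_columns(stats)))
```

A few lines of filtering cut 96% of stats tokens in our case, which did more for classification quality than 18 trials of prompt search.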
Blind Spot Detection
The most valuable output isn't the classifications but the disagreements. The EvaluationAgent investigates raw logs independently, then compares against the AnomalyClassifier's label. Each conflict reveals a failure pattern the classifier missed. Every disagreement becomes a new training example, a knowledge base entry, and a signal for prompt revision.
The pipeline works, and DSPy works; we just need better data to feed it before asking it to perform. One more experiment, two or three passes, and I think we'll be there.
The Evolutionary Failure Detection Loop
The experiment above showed that DSPy’s optimizer needs richer signal to work with. But the multi-pass results also showed something else: every disagreement between the AnomalyClassifier and the EvaluationAgent is a concrete data point - a failure pattern one caught and the other missed. At moyai, we are turning these disagreements into the engine of a continuous learning system.
AnomalyClassifier → EvaluationAgent → Knowledge Base
↑ |
└──────────── DSPy Optimizer ─────────┘
Each iteration works like this: the classifier labels entities, the evaluation agent independently investigates the raw logs, and conflicts get written into a failure knowledge base as new patterns. The optimizer then uses that growing knowledge base - now with more balanced, more signal-rich examples - to improve prompts for the next round. Not fine-tuning, but prompt evolution driven by empirical evidence.
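One iteration of that loop can be sketched in a few lines. The two agents are reduced to toy stand-in functions, and the knowledge-base schema is hypothetical:

```python
# Disagreements between the two agents become knowledge-base entries
# that seed the optimizer's trainset for the next round.
knowledge_base: list[dict] = []

def run_iteration(entities, classify, evaluate):
    """Classify, independently evaluate, and bank every conflict."""
    for entity in entities:
        predicted = classify(entity)   # fast AnomalyClassifier label
        verified = evaluate(entity)    # independent raw-log investigation
        if predicted != verified:
            knowledge_base.append(
                {"entity": entity, "predicted": predicted, "actual": verified}
            )

# Toy stand-ins for the two agents.
classify = lambda e: "healthy"
evaluate = lambda e: "fault" if "timeout" in e else "healthy"

run_iteration(["grp-ok", "grp-timeout-lock"], classify, evaluate)
print(len(knowledge_base))  # -> 1 disagreement banked for the optimizer
```

Each pass through the loop leaves the knowledge base larger and better balanced, which is exactly the signal the first MIPROv2 experiment was missing.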
The goal is to close the loop between observability and improvement - automatically learning from agent behavior, not just monitoring it.
Reflections
This experiment taught me something I did not expect. I came in thinking DSPy would be a prompt engineering shortcut: feed it data, get better prompts, done. Instead, it showed me that the bottleneck was never the prompt. It was the data, the signal, the feedback loop that feeds the system. DSPy did not fail on our pipeline. It told us, in a precise, quantitative way, exactly where our pipeline was starving. "Give me more and better signals," it said, in effect.
And this is exactly what DSPy makes visible. Prompts are no longer static artifacts authored once and hoped to hold; they are compiled outputs shaped by evidence, iterated on by machine-learning-style algorithms.
Our loop is slowly closing. Every disagreement between our agents gives us a signal for the next round, the next use case, the next failure, one conflict at a time. And that will make our agents, and yours, more reliable.
References
[1] DSPy Project, “DSPy: Programming-not prompting-Language Models,” dspy.ai. [Online]. Available: https://dspy.ai
[2] O. Khattab et al., “DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines,” ICLR 2024. [Online]. Available: https://arxiv.org/abs/2310.03714
[3] DSPy Documentation, “Modules - ChainOfThought,” dspy.ai. [Online]. Available: https://dspy.ai/learn/programming/modules
[4] DSPy Documentation, “MIPROv2 Optimizer,” dspy.ai. [Online]. Available: https://dspy.ai/api/optimizers/MIPROv2
[5] DSPy Documentation, “Optimizers Overview,” dspy.ai. [Online]. Available: https://dspy.ai/learn/optimization/optimizers
[6] Sierra Research, “τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment,” arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2506.07982
[7] Amazon Science, “MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference,” arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2510.07475
[8] K. Opsahl-Ong et al., “Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs,” EMNLP 2024. [Online]. Available: https://arxiv.org/abs/2406.11695
[9] Stanford NLP, “DSPy - GitHub Repository,” github.com. [Online]. Available: https://github.com/stanfordnlp/dspy
Wrap-up
AI agents can look fine in demos and still fail in production. Moyai helps teams catch reliability issues early with clustering, evaluation, and actionable alerts.
If that sounds like the kind of tooling you want to use — try Moyai or join us on Discord.