Blog
April 10, 2026

Robustness through meaning - one triple at a time

Exploring how ontologies cross-validate with LLMs to enable robust failure detection in agentic systems, and why this approach differs from Palantir's operational ontology.

Vladimir Vučković
9 mins read

Introduction

While working on our anomaly detection implementation, we realized that although the core pipeline was in place, it was missing one component: a fingerprint or pattern that could explain why something was happening, what structure it reflects, and what remediation it might require.

Drawing from our prior work in knowledge management and data structures, we identified ontologies as a promising way to fill this gap. Ontologies have gained significant traction in several areas recently, most notably in Palantir’s stack, where a comparable capability is exposed through OSDK [1]. In the end, our use case called for something different, but exploring Palantir’s library and open-source RDF/OWL taught us a few things worth sharing. It was a valuable exercise either way.

Ontologies explained

As enterprises accelerate their adoption of AI, the focus is shifting toward automation and agentic workflows. Over the past two years, agentic systems have gained momentum, moving beyond experimentation into genuine production use. Open source frameworks like LangChain, LlamaIndex, and CrewAI have made it possible to build sophisticated agent pipelines quickly, and with so many teams building in parallel, it’s encouraging to see OpenTelemetry emerging as a common standard for observability across agentic systems.

Against this backdrop, we are coining the term Agent Reliability: a way to prove that an agentic system is consistent, robust, and able to recover from faults. The concept builds on the foundation of research into agentic benchmarks, but extends it from raw performance measurement into something closer to a reliability discipline.

Our initial idea for tackling robustness was to give our agents something grounded in structured, shared meaning - some sort of pattern. This gives the system a foundation for handling ambiguity and edge cases without breaking the whole pipeline. Enter the ontology layer.

The classic academic definition comes from Gruber (1993):

Ontology is…

An ontology is a formal, explicit specification of a shared conceptualization. Unpacking that one-liner:

  • Formal - machine-readable, not just prose in a wiki. It has a syntax and semantics a reasoner can process.
  • Explicit - the concepts, relationships, and constraints are declared, not implicit in application code.
  • Shared - it represents consensus across a community or domain, not one person’s mental model.
  • Conceptualization - an abstract model of some domain that identifies the relevant entities, their properties, and the relationships between them.

TLDR: Where a database stores records, an ontology stores meaning.

In simple terms, we first need to define concepts, properties, and rules:

  1. What kinds of things exist (concepts) — “an AgentRun exists, a ToolCall exists, an Issue exists”
  2. How they relate to each other (properties/relations) — “an AgentRun hasStep Step, an Issue occursInRun AgentRun”
  3. What constraints govern them (rules) — “a FailedRun is an AgentRun that has at least one Issue”, “DeathLoop is a subclass of AgentBehavioralFailure”
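The three layers above can be sketched as plain Python data. This is only an illustration of the structure (a real system would use RDF/OWL tooling); the names mirror the examples in the list.

```python
# 1. Concepts: what kinds of things exist
CONCEPTS = {"AgentRun", "ToolCall", "Step", "Issue"}

# 2. Properties: how concepts relate, as (domain, predicate, range)
PROPERTIES = {
    ("AgentRun", "hasStep", "Step"),
    ("Issue", "occursInRun", "AgentRun"),
}

# 3. Rules: constraints and derived classes
SUBCLASS_OF = {"DeathLoop": "AgentBehavioralFailure"}

def is_failed_run(run_issues: list) -> bool:
    """A FailedRun is an AgentRun that has at least one Issue."""
    return len(run_issues) > 0

print(is_failed_run(["timeout"]))  # True
```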

Simple definition

Ontologies represent knowledge as simple triples of subject - predicate - object (SPO), where the predicate defines the relationship between two entities (for example, Aspirin - treats - Headache). By chaining these triples together, machines can reason over connected facts and infer new knowledge that isn’t explicitly stated.
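Chaining can be shown with a toy example in Python: triples are plain (subject, predicate, object) tuples, and a tiny inference step derives a fact that is never explicitly stated. The "inhibits"/"regulates" chain is a simplified illustration, not a real pharmacological rule.

```python
# A toy knowledge base of SPO triples
triples = {
    ("Aspirin", "inhibits", "COX-2"),
    ("COX-2", "regulates", "Inflammation"),
}

def infer_may_affect(kb):
    """Chain 'X inhibits Y' with 'Y regulates Z' into 'X may_affect Z'."""
    derived = set()
    for s, p, o in kb:
        if p == "inhibits":
            for s2, p2, o2 in kb:
                if p2 == "regulates" and s2 == o:
                    derived.add((s, "may_affect", o2))
    return derived

print(infer_may_affect(triples))
# {('Aspirin', 'may_affect', 'Inflammation')}
```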

When an ontology scan finds patterns for “likely_failure” and one of the agents in the pipeline says “no_failure”, or vice versa, those are the records worth investigating. The ontology’s rigid rules catch what the LLM’s soft attention misses, and the LLM catches semantic patterns the ontology can’t express. Working in tandem, they are each other’s support system.
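The comparison logic is essentially a two-verdict check. A minimal sketch (the verdict labels follow the example above; function name is hypothetical):

```python
def cross_validate(ontology_verdict: str, llm_verdict: str) -> str:
    """Agreement is routine; disagreement is the signal worth investigating."""
    if ontology_verdict == llm_verdict:
        return "agree"   # both systems see the same thing
    return "review"      # one caught something the other missed

print(cross_validate("likely_failure", "no_failure"))  # review
```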

Ontologies are well established in pharmacology

In pharmacology, ontologies like ChEBI, DrugBank, and the Gene Ontology are used to formally link drugs, targets, diseases, and biological pathways — for instance, capturing that a compound inhibits an enzyme that regulates a disease process.

Our methodology for ontologies

  • OWL ontology as deterministic pre-filter: Written in SPO format, it runs before any LLM touches the data, encoding failure types across nine categories (from GenAI operation failures to infrastructure errors), each tagged with a default severity and a hard/soft signal classification.
  • Two-layer detection: First, regex-based grep hints defined in the ontology catch structural patterns in raw logs (for example HTTP 5xx, rate limits, timeouts); second, custom structural detectors handle compound rules regex can’t express - like death loops (same tool called multiple times with no answer) or ungrounded/hallucinated responses (confident assertions) and so on.
  • Fully auditable chain: Every detection traces back to a specific ontology rule, log line, and piece of evidence.
  • Core design principle - transparency over accuracy: When you can see exactly why a record was flagged, you can trust the classification or challenge it with evidence.
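The two-layer detection described above can be sketched as follows. The regex patterns and the death-loop threshold are illustrative placeholders, not Moyai's actual rule set:

```python
import re
from collections import Counter

# Layer 1: regex "grep hints" for structural patterns in raw logs
GREP_HINTS = {
    "http_5xx": re.compile(r"status[= ]5\d\d"),
    "rate_limit": re.compile(r"rate.?limit|\b429\b", re.IGNORECASE),
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
}

def grep_layer(log_line: str):
    """Return the names of all hints that match a raw log line."""
    return [name for name, rx in GREP_HINTS.items() if rx.search(log_line)]

# Layer 2: a structural detector regex can't express - a "death loop"
# where the same tool is called repeatedly with no final answer.
def death_loop(tool_calls, threshold=3, answered=False):
    if answered:
        return False
    counts = Counter(tool_calls)
    return any(n >= threshold for n in counts.values())

print(grep_layer("request timed out after 30s"))   # ['timeout']
print(death_loop(["search", "search", "search"]))  # True
```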

Consequently, the ontology approach tackles the robustness dimension by answering the question: “can our agent system handle semantically equivalent but syntactically varied inputs?”. The traces we analyze are rarely uniform: users work with different frameworks and canonical templates, some traces follow OpenTelemetry conventions while others are just fields with an expected structure, and many include irrelevant context. We needed a fast way to catch all of these variations and map them to an underlying meaning.

The LLM and the ontology independently classify every entity, and we then compare the ontology’s rigid rules against the LLM’s probabilistic reasoning. The most valuable signal isn’t where they agree but where they disagree: when the ontology flags a failure the LLM missed (a structural pattern the model’s attention skipped), or when the LLM catches a semantic failure no regex could express (like forgotten instructions or attention degradation). This dual-path cross-validation design means neither system needs to be perfect on its own; they cover each other’s blind spots and surface new patterns for potential human review.

Ontology Landscape

The graph shows a trade-off: systems optimized for inference and reasoning (top) sacrifice operational capability, while systems built for runtime operations (bottom) sacrifice semantic expressiveness - and no single approach covers all four quadrants.

Why Palantir’s Ontology Is a Different Thing Entirely

Palantir uses the word “ontology” prominently, but their implementation serves a fundamentally different purpose. In Palantir’s Foundry platform, the ontology is a typed operational layer - a governed API that sits between raw data and the applications that consume the data. Their Ontology SDK generates type-safe client libraries so developers can query, filter, and mutate business objects (customers, orders, assets) through a single interface, complete with atomic transactions (Actions), role-based access control, and real-time subscriptions. Think of it as an auto-generated ORM with enterprise governance built in. [1]

What it doesn’t have is an inference engine, subclass reasoning, or detection rules embedded in the schema. Palantir’s ontology represents what things are: it connects disjointed data sources into a unified object graph where you can see how entities relate and interact. It transforms spreadsheets and databases into a navigable network that mirrors real-world relationships.

Moyai’s ontology does something different. It encodes detection logic: formal rules that define what constitutes a failure, how to find it, and how confident we should be. Every finding traces back to a specific rule in the knowledge base, not just a query against a data model. Our approach sits in the opposite corner from Palantir’s: high expressiveness with formal detection rules, built for classification and root cause analysis rather than runtime data access.
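One way to picture the auditable chain is as a finding record whose every field points back to its origin. The field names here are a hypothetical sketch, not Moyai's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    failure_type: str  # e.g. "DeathLoop"
    rule_id: str       # the ontology rule that fired
    log_line: int      # where in the trace the rule matched
    evidence: str      # the text that triggered the rule
    severity: str      # default severity inherited from the ontology

f = Finding("DeathLoop", "rule:behavioral/death_loop", 142,
            "tool 'search' called 5x with no answer", "high")
print(f.rule_id)  # rule:behavioral/death_loop
```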

How can we express ontologies?

The chart above maps ontology approaches along two axes: how expressive the knowledge representation is (from flat schemas to full OWL reasoning), and whether the system is built for operational use or analytical detection.

The distinction matters: Palantir’s model answers “what exists and what can we do about it?” and ours answers “what went wrong and why should we believe it?” An enterprise reliability system ultimately needs both: deterministic detection with audit trails for diagnosis, and governed operational response with feedback for remediation. They share a name but almost no architecture.

Reflections

Recent research is starting to formalize what we have been learning through building our tools. The Reliability Bench framework [2] introduces a three-dimensional reliability surface that captures exactly the properties we care about: consistency across repeated runs, robustness to input perturbations, and fault tolerance during infrastructure failures.

What struck us most is how closely their dimensions map onto our own pipeline. Their consistency axis (pass^k) is what our multi-judge panel addresses: running the same verdict through three independent providers and requiring majority agreement. Their robustness axis (perturbation resistance) is exactly the problem our ontology was built to solve: catching the same anomaly regardless of whether the trace uses OpenTelemetry conventions, custom field names, or reordered structures. And their fault tolerance axis maps to the graceful degradation we built into every stage: if a judge times out, the panel continues with two; if the ontology and the LLM disagree, the disagreement itself becomes the signal.

We are starting to think of this as Agent Reliability, a discipline that extends beyond benchmark scores into something closer to how traditional engineering treats system correctness. Not just “does the agent get the right answer?”, but does it get it consistently, across varied inputs, under stress, and can we prove why it decided what it decided? The ontology gives us that last part. Where monitoring tells you what happened, the ontology tells you what it means.

The loop is still closing. Every disagreement between the ontology’s rigid rules and the agentic LLM’s probabilistic reasoning surfaces a pattern one system missed. Every new pattern becomes a new rule, a new detection, a new piece of evidence for the next run. We are not building a static classifier. We are building a system that learns from its own blind spots, one conflict at a time, until it gets reliable.

References

[1] Palantir, “Ontology SDK (OSDK) Documentation,” palantir.com. [Online]. Available: https://www.palantir.com/docs/foundry/ontology-sdk/

[2] J. Wuolle et al., “ReliabilityBench: Assessing LLM Agent Reliability Through Consistency, Robustness, and Fault Tolerance,” arXiv:2601.06112, 2025. [Online]. Available: https://arxiv.org/abs/2601.06112

Wrap-up

AI agents can look fine in demos and still fail in production. Moyai helps teams catch reliability issues early with clustering, evaluation, and actionable alerts.

If that sounds like the kind of tooling you want to use — try Moyai or join us on Discord.