Rethinking Evaluation: From Agents to Ecosystems (why performance is no longer enough)

Designing Responsibility in Multi-Agent Systems ④

Mari Sekino

May 12, 2026

1. When evaluation no longer holds

If responsibility is distributed, fragmented, and time-dependent,
then evaluation cannot remain the same.

Most current approaches still focus on evaluating:

individual agents
isolated outputs
performance at a specific moment in time

This works when systems are stable,
and when outcomes can be attributed to discrete components.

But in multi-agent systems, those conditions no longer hold.

2. The problem with individual evaluation

When responsibility fragments, individual evaluation becomes misleading.

An agent may appear high-performing,
while systematically degrading the performance of others.

Another agent may appear weak,
while enabling capabilities that only become visible later.

In such systems, performance is not only about what an agent does,
but about how it shapes the system around it.

Evaluation that focuses only on individuals
misses this entirely.
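
One way to make this concrete is a leave-one-out comparison: score the whole system with and without each agent, and set that difference beside the agent's solo score. The sketch below is only an illustration. The agent names, the `SOLO` and `EFFECT` numbers, and the additive `system_score` model are all hypothetical assumptions, not measurements or a method described in this series.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical toy numbers (assumptions for illustration only).
# SOLO[x]: how well agent x scores when evaluated in isolation.
SOLO: Dict[str, float] = {"A": 0.9, "B": 0.4, "C": 0.6}
# EFFECT[x]: how much x's presence shifts every *other* agent's output.
EFFECT: Dict[str, float] = {"A": -0.4, "B": 0.3, "C": 0.0}


def system_score(team: FrozenSet[str]) -> float:
    """Assumed additive model: each member's solo score plus spillover from the others."""
    total = 0.0
    for agent in team:
        spillover = sum(EFFECT[other] for other in team if other != agent)
        total += SOLO[agent] + spillover
    return total


def marginal_contributions(
    agents: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Leave-one-out: full-system score minus the score with one agent removed."""
    team = frozenset(agents)
    full = score(team)
    return {agent: full - score(team - {agent}) for agent in sorted(team)}


contrib = marginal_contributions(SOLO, system_score)
for name in sorted(SOLO):
    print(f"{name}: solo={SOLO[name]:.2f}  marginal={contrib[name]:.2f}")
```

In this toy setup, A has the highest solo score and the lowest marginal contribution, while B has the lowest solo score and the highest. Leave-one-out is itself a simplification: it ignores time and higher-order interactions, so it is a starting point rather than an ecosystem-level measure.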

3. A shift in what we evaluate

This suggests a shift in perspective.

From:

evaluating agents
ranking outputs
optimizing for immediate performance

To:

evaluating the system as a whole
observing interaction patterns
understanding how capabilities evolve over time

In other words:

The primary object of evaluation is no longer the agent,
but the ecosystem.

4. What ecosystem-level evaluation means

This does not mean abandoning individual metrics.

But it does mean interpreting them differently.

What starts to matter is not only:

how well an agent performs

but also:

how it interacts
how it contributes
how it affects the system’s ability to grow and adapt

Some questions begin to replace others:

Is the system becoming more capable over time?
Is diversity being preserved or suppressed?
Can the system recover from disruption?
Are new types of tasks becoming possible?

These are not properties of individual agents.
They are properties of the system.

5. Why time matters

One of the most significant shifts is temporal.

In many current frameworks, evaluation happens at a fixed point:

after a task is completed,
after an output is produced,
after a decision is made.

But in complex systems, value does not always appear immediately.

Some contributions:

enable future capabilities
become relevant only in different contexts
are recognized only when extended by others

This creates a gap:

between when something is done,
and when its value becomes visible.

6. From point-in-time to multi-horizon evaluation

If value unfolds over time, evaluation must do the same.

Instead of a single moment, evaluation becomes distributed across horizons:

immediate signals (what is happening now)
short-term outcomes (what worked in this context)
medium-term adoption (what gets reused or extended)
long-term impact (what changes the system’s capabilities)

This does not eliminate uncertainty.
But it makes space for it.
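
As a rough sketch of what this could look like in practice, a contribution can carry a record that accumulates signals per horizon instead of a single score. Everything named here is an assumption for illustration: the `ContributionRecord` class, the `Horizon` enum, and especially the day boundaries between horizons, which a real system would have to choose for itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Horizon(Enum):
    IMMEDIATE = "immediate"      # what is happening now
    SHORT_TERM = "short_term"    # what worked in this context
    MEDIUM_TERM = "medium_term"  # what gets reused or extended
    LONG_TERM = "long_term"      # what changes the system's capabilities


# Hypothetical upper bounds, in days since the contribution was made.
HORIZON_BOUNDS_DAYS = [
    (Horizon.IMMEDIATE, 1),
    (Horizon.SHORT_TERM, 14),
    (Horizon.MEDIUM_TERM, 90),
]


def horizon_for(days_elapsed: float) -> Horizon:
    """Map elapsed time to a horizon; anything past the last bound is long-term."""
    for horizon, bound in HORIZON_BOUNDS_DAYS:
        if days_elapsed < bound:
            return horizon
    return Horizon.LONG_TERM


@dataclass
class ContributionRecord:
    contribution_id: str
    signals: Dict[Horizon, List[float]] = field(
        default_factory=lambda: {h: [] for h in Horizon}
    )

    def observe(self, days_elapsed: float, value: float) -> None:
        """File a signal under the horizon it was observed in."""
        self.signals[horizon_for(days_elapsed)].append(value)

    def summary(self) -> Dict[Horizon, Optional[float]]:
        """Mean signal per horizon; None means 'not yet observed', not zero."""
        return {h: (sum(v) / len(v) if v else None) for h, v in self.signals.items()}


record = ContributionRecord("shared-tool")
record.observe(days_elapsed=0.1, value=0.2)  # weak immediate signal
record.observe(days_elapsed=30, value=0.7)   # reused a month later
summary = record.summary()
```

The design choice that matters is keeping unobserved horizons as `None` rather than collapsing them into a score of zero. That is one small way of making space for uncertainty instead of hiding it.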

7. A different kind of signal

This also changes what we look for.

Not only correctness or performance,
but patterns:

convergence that may indicate lock-in
amplification that may signal feedback loops
suppression that may reduce diversity
unexpected reuse that may signal generative value

Evaluation becomes less about scoring,
and more about sensing.
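
One of these patterns, convergence, can be sensed with very simple tools. The sketch below tracks the Shannon entropy of the strategies agents use in successive time windows and flags a sustained drop. The strategy labels, the window data, and the `min_drop_bits` threshold are hypothetical assumptions, and a flag here is a prompt to look closer, not a verdict of lock-in.

```python
import math
from collections import Counter
from typing import List, Sequence


def strategy_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the strategy labels seen in one window."""
    counts = Counter(labels)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def may_be_converging(entropies: Sequence[float], min_drop_bits: float = 1.0) -> bool:
    """Flag a sequence that never rises and falls by at least min_drop_bits overall.

    min_drop_bits is an assumed threshold chosen for illustration.
    """
    if len(entropies) < 2:
        return False
    never_rises = all(b <= a for a, b in zip(entropies, entropies[1:]))
    return never_rises and (entropies[0] - entropies[-1]) >= min_drop_bits


# Hypothetical windows: which strategy each of four agents used.
windows: List[List[str]] = [
    ["plan", "search", "ask", "delegate"],  # four distinct strategies
    ["plan", "plan", "search", "ask"],      # one strategy gaining ground
    ["plan", "plan", "plan", "plan"],       # uniform
]

entropies = [strategy_entropy(w) for w in windows]
flag = may_be_converging(entropies)
```

The entropy is left unnormalized on purpose. Dividing by the number of strategies observed in each window would make two evenly used strategies look as diverse as four, which would hide exactly the collapse this signal is meant to surface.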

8. Why this matters

If evaluation remains tied to individuals and immediate outcomes,
systems will optimize for what is easy to measure.

simple outputs will be rewarded
complex contributions will be ignored
diversity will collapse into uniformity

Over time, this does not make systems more capable.
It makes them more fragile.

Ecosystem-level evaluation is not just a refinement.
It is necessary for sustaining capability in adaptive systems.

9. What this changes

Taken together, this leads to a different role for evaluation.

Not as a mechanism to rank or judge,
but as a way for the system to understand itself.

Evaluation becomes:

distributed rather than centralized
continuous rather than episodic
interpretive rather than purely quantitative

It does not produce a final answer.
It shapes the next iteration.

10. Where this leaves us

Across these pieces, a pattern begins to emerge:

responsibility no longer aligns with individual agents
governance cannot rely on assignment
systems must be designed differently
evaluation must move beyond individuals

This is not a small adjustment.

It is a shift in how we think about
responsibility, governance, and intelligence itself.

The Asking Principle: Notes by Mari Sekino
