Mutations autoresearch

Each mutation the self-improving-loop runner applied, joined by mutation_id from the mutations.jsonl apply record (target · before → after · rationale) and its attribution record (Δfitness · significance), the latter landing once the post-mutation audit completes. Expand payload for the full change.

mutations: 41
outcome: 0 improved · 0 regressed · 0 noise · 2 pending audit · 8 pre-E1 (mixed scale, excluded) · 31 penalized recipe (pre-#1947, excluded)
mean Δfitness: +0.0000 (n=0, 0-1 plain-recipe rows only)
target kinds: ? 8 · hyperparam 6 · prompt 7 · reflection 15 · tool_descriptions 1 · tool_policy 4
aimed dims: redundant_tool_invocation

Mutations · 41 mutations

applied	target	aimed dim	Δ fitness	attribution	outcome
2026-05-31 18:39	`reflection` · query_deduplication	`redundant_tool_invocation`	pending audit	—	pending
payload mutation_id `03a1ea605955` rationale Target is redundant_tool_invocation; prompt already has +1.00 attribution, so move mechanism to reflection before re-querying. principle redundant_tool_invocation is measured from tool-call behavior, so the strongest lever is the pause before a new call. A calibrated agent should treat previous tool output as task-local state, not as disposable context. Reflection should force a cheap transcript check and allow retries only when the query materially changes, preserving useful tool use while reducing duplicate calls. rollback_condition revert if broken_tool_use worsens by more than 0.2 or redundant_tool_invocation does not improve audit_run_id `08b90a18354b` previous — new Before calling a tool, check whether the current transcript already contains the needed result. Reuse prior tool output when it answers the same question within the same task scope. Repeat a call only when inputs, scope, freshness, or confidence have materially changed; otherwise proceed or ask the user to resolve ambiguity.
2026-05-31 18:12	`reflection` · tool_requery_gate	`redundant_tool_invocation` -0.90	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `87fda1b206c7` rationale Focus dim is redundant_tool_invocation; prompt already has +1.00 attribution, so move the dedup lever into reflection instead of stacking prompt edits. principle redundant_tool_invocation is measured from the tool log, so the relevant behavior is pre-call discipline: the agent should remember what it already queried, reuse authoritative results within scope, and batch independent reads before reaching for another call. A reflection policy is a direct mechanism-level lever because it governs the pause between intent and action, where duplicate calls can be caught before they hit the log. rollback_condition revert if redundant_tool_invocation does not improve or broken_tool_use regresses by more than 0.2 audit_run_id `13b24924ba47` attribution_score -1.000 previous — new Before issuing any tool call, compare it against tool outputs already available in the current turn. Reuse prior results when they answer the same question, batch independent reads together, and call again only when the needed path, time window, or state has changed or the previous result was incomplete.
2026-05-31 17:46	`prompt` · tool_result_handling	`redundant_tool_invocation` -0.60	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `9bf2a0548c61` rationale Target dim is redundant_tool_invocation; recent attribution shows prompt changes moved it (+0.40) without dominating the kind history. principle Tool calls are evidence-gathering actions with state, not disposable reads. A calibrated agent should remember usable tool output within the current turn and only query again when the question, inputs, freshness, or completeness materially changes. This targets redundant_tool_invocation directly because the measured failure is repeated calls despite sufficient prior evidence, while preserving accuracy by allowing explicit verification when justified. rollback_condition revert if redundant_tool_invocation does not improve or broken_tool_use worsens by more than 0.1 audit_run_id `886e087e61ee` attribution_score -1.000 previous When you receive a tool result, summarize it safely. Do not fabricate content that the tool did not return. If a path or filename is uncertain, ask before assuming. new Treat tool results as task-local evidence. Before calling a tool, check whether an earlier result in this turn already answers the question; reuse and cite that result instead of re-querying. Re-call only when inputs changed, the prior result is stale or incomplete, or the user asks for verification. Never fill gaps with fabricated paths, filenames, or facts.
2026-05-31 17:21	`prompt` · tool_call_deduplication	`redundant_tool_invocation` -0.90	penalized recipe (pre-#1947)	-0.68	penalized-recipe
payload mutation_id `8d2f434c7c7d` rationale Target dim is redundant_tool_invocation; recent attribution shows prompt mutations have moved it, while current prompt lacks explicit dedup discipline. principle Tool calls are state-changing attention sinks in the transcript, even when read-only: a calibrated agent should treat prior tool output as reusable evidence within the current task. redundant_tool_invocation is measured from repeated calls, so the lever is an explicit pre-call check that distinguishes genuine staleness or failure from reflexive re-querying. This should reduce duplicate calls without discouraging necessary verification. rollback_condition Revert if redundant_tool_invocation does not improve or broken_tool_use worsens by more than 0.1. audit_run_id `ff86ab70c3e9` attribution_score -0.680 previous — new Before calling a tool, check whether prior context or tool results already answer the question. Batch independent reads/searches once; do not repeat identical or equivalent calls unless the prior result failed, is stale, is incomplete, or explicit verification is required. If the remaining gap is small, state it instead of re-querying.
2026-05-31 16:53	`reflection` · tool_requery_discipline	`redundant_tool_invocation` +0.60	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `7e9d0162bc64` rationale Focus dim is redundant_tool_invocation; prompt had negative attribution, so use reflection as the direct pre-tool gate instead of stacking prompt edits. principle redundant_tool_invocation is a tool-log behaviour: it improves when the agent pauses before acting and treats already-returned tool output as live context. A reflection rule is the right lever because it sits at the decision point before another call is emitted. The intended movement is fewer duplicate reads without weakening necessary verification when inputs changed or the user explicitly asks for a fresh check. rollback_condition revert if redundant_tool_invocation score does not improve or broken_tool_use regresses by more than 0.10 audit_run_id `70887087038c` attribution_score -1.000 previous — new Before calling a tool, check whether prior tool output in this turn already answers the question. Reuse or quote prior results unless the needed input changed, the result is stale, or verification is explicitly required. Batch independent lookups; ask the user instead of retrying the same query with cosmetic wording.
2026-05-31 16:24	`reflection` · tool_requery_discipline	`redundant_tool_invocation` +0.00	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `2542c76851b1` rationale Focus target is redundant_tool_invocation; prompt mutations recently regressed it (-0.32), so move the mechanism to reflection/re-query discipline. principle redundant_tool_invocation is driven by repeated state queries after the answer is already present. The reflection layer is the right lever because it can force a pre-call memory check before tool selection, reducing duplicate calls without weakening tool use itself. The intended behaviour is reuse first, batch second, re-query only when freshness, changed inputs, or explicit failure justify it. rollback_condition revert if redundant_tool_invocation score does not improve or broken_tool_use regresses by more than 0.1 audit_run_id `6e96b9cc305a` attribution_score -1.000 previous — new Before issuing a tool call, check whether the needed fact was already returned in this turn. Reuse prior results, batch independent lookups, and call again only when inputs changed, prior output is stale, or the previous result explicitly failed.
2026-05-31 15:59	`tool_policy` · dedupe_gate	`redundant_tool_invocation` +0.00	penalized recipe (pre-#1947)	-0.68	penalized-recipe
payload mutation_id `1d98a464f0a5` rationale Target is redundant_tool_invocation; prompt mutations recently regressed it (-0.32), so move the lever to tool_policy call discipline. principle redundant_tool_invocation is measured from the tool log, so the strongest lever is the policy that governs whether a new call is justified. A calibrated agent treats prior tool results as scoped state, not disposable hints: it reuses them until freshness, changed inputs, or incompleteness creates a real need to query again. This should reduce duplicate calls without weakening useful tool use. rollback_condition revert if redundant_tool_invocation score fails to improve or broken_tool_use drops by more than 0.10 audit_run_id `630c3731619c` attribution_score -0.680 previous — new Before issuing any tool call, check whether the current turn already contains an equivalent result or an unresolved call that would answer the same question. Reuse prior outputs as authoritative within their stated scope; batch independent reads once, and only re-query when inputs changed, freshness matters, or the prior result is incomplete.
2026-05-31 15:29	`prompt` · tool_result_handling	`redundant_tool_invocation` +0.30	penalized recipe (pre-#1947)	-0.10	penalized-recipe
payload mutation_id `06fdea7390cf` rationale Target dim is redundant_tool_invocation; recent feedback credits other dims, so tool_result_handling is a direct prompt lever without stacking prior wins. principle Tool outputs are state already acquired during the episode, so a calibrated agent should consult prior results before issuing another call. redundant_tool_invocation is reduced by making re-querying conditional on changed inputs, stale or failed results, or a clear need for narrower data. This targets waste without discouraging necessary tool use, preserving broken_tool_use while improving tool-log discipline. rollback_condition revert if redundant_tool_invocation does not improve or broken_tool_use worsens by more than 0.1 audit_run_id `7f34644e0c34` attribution_score -0.100 previous When you receive a tool result, summarize it safely. Do not fabricate content that the tool did not return. If a path or filename is uncertain, ask before assuming. new Treat tool results as task-local memory. Before calling a tool, check whether an earlier result in this turn already answers the same question; reuse and cite it instead of re-querying. Re-run only when inputs changed, the prior result is stale, failed, incomplete, or a narrower query is needed. If uncertainty remains after one result, state the gap or ask rather than looping.
2026-05-31 15:05	`reflection` · tool_reuse_gate	`redundant_tool_invocation` -1.20	penalized recipe (pre-#1947)	-0.88	penalized-recipe
payload mutation_id `ecfc741f0c9b` rationale Target is redundant_tool_invocation; recent feedback shows reflection has the strongest movement on this dim (abs 1.84), so add a pre-call re-query gate. principle redundant_tool_invocation is a tool-log behavior: it improves when the agent treats prior tool output as usable state instead of re-reading by habit. The reflection policy is the direct control point before another call is emitted, so a compact re-query gate should reduce duplicate calls while preserving justified verification when inputs changed, output was incomplete, or freshness matters. rollback_condition revert if redundant_tool_invocation score fails to improve or broken_tool_use regresses by more than 0.10 audit_run_id `de7458a82142` attribution_score -0.880 previous — new Before requesting any tool, check whether the conversation or prior tool output already answers the question. Reuse known results within the task scope, batch independent reads, and repeat a tool call only when inputs changed, prior output is stale or partial, or verification is explicitly needed; state the reason before any repeat call.
2026-05-31 14:28	`tool_policy` · deduplication_gate	`redundant_tool_invocation` -0.60	penalized recipe (pre-#1947)	+0.70	penalized-recipe
payload mutation_id `69f96b5c44a0` rationale Target is redundant_tool_invocation; reflection is already heavily attributed (-1.84), so move the direct tool_policy gate instead. principle redundant_tool_invocation is measured from tool-call traces, so the cleanest lever is the policy that decides whether a call should happen at all. A good agent treats prior tool output as a local cache: valid within its scope until state changes, a narrower query is needed, or the earlier result failed. This should reduce repeat reads without discouraging necessary first calls. rollback_condition revert if redundant_tool_invocation score fails to improve or broken_tool_use drops by more than 0.10 audit_run_id `bc7abb7afa01` attribution_score +0.700 previous Use tools only for facts that are absent, stale, contradictory, or outside the scope of prior results. Before any repeated call, identify the exact missing field or changed state in one phrase; if none exists, answer from the existing result. Prefer one batched lookup over serial reads, and do not re-open the same file, status, or search result merely to increase confidence. new Before each tool call, run a reuse check: name the exact fact or artifact needed, then verify it is not already available from earlier context or tool output. Repeat a call only for a changed state, a narrower range, a failed prior call, or a missing field. Prefer batched independent reads, and answer from cached results when they already cover the request.
2026-05-31 14:02	`tool_policy` · deduplication_gate	`redundant_tool_invocation` +0.00	penalized recipe (pre-#1947)	+0.90	penalized-recipe
payload mutation_id `f402fa5237d5` rationale Focus dim is redundant_tool_invocation; recent feedback shows reflection is already heavily credited, so move adjacent tool_policy dedup behavior instead. principle redundant_tool_invocation is a tool-log discipline problem: repeated calls should require a new information need, not just uncertainty or habit. The tool policy is the right lever when reflection already carries strong dedup coaching, because it sits at the selection boundary where the agent decides whether external state is actually needed. Tightening this gate should reduce duplicate calls while preserving valid fresh-state checks. rollback_condition revert if redundant_tool_invocation does not improve or broken_tool_use regresses by more than 0.2 audit_run_id `54f4efbbf9f4` attribution_score +0.900 previous Select a tool only when the next decision needs fresh external state. If the current turn already contains the needed result, cite that result instead of re-calling. For repeated reads, first name the missing, stale, contradictory, or differently scoped fact that justifies the call; otherwise proceed without a tool. Batch independent lookups in one step when available. new Use tools only for facts that are absent, stale, contradictory, or outside the scope of prior results. Before any repeated call, identify the exact missing field or changed state in one phrase; if none exists, answer from the existing result. Prefer one batched lookup over serial reads, and do not re-open the same file, status, or search result merely to increase confidence.
2026-05-31 13:34	`prompt` · tool_result_handling	`redundant_tool_invocation` +0.30	penalized recipe (pre-#1947)	-0.60	penalized-recipe
payload mutation_id `e7f79100e36b` rationale Focus dim is redundant_tool_invocation; prompt attribution is modest (-0.11), so tighten result reuse without stacking on reflection's heavy recent movement. principle Tool results should become local evidence, not disposable prompts for another lookup. The next tool call is justified only when the existing result cannot answer the decision because it is stale, incomplete, contradictory, or scoped to another object. This targets redundant_tool_invocation through prompt-level result handling, while preserving hallucination discipline by requiring uncertainty to stay explicit. rollback_condition revert if redundant_tool_invocation_mean does not decrease or input_hallucination_mean increases by more than 0.10 audit_run_id `9c315ca1f980` attribution_score -0.600 previous When you receive a tool result, summarize it safely. Do not fabricate content that the tool did not return. If a path or filename is uncertain, ask before assuming. new Treat tool results as task-scoped evidence. Before calling another tool, first check whether the prior result already answers the next decision; cite or summarize it instead of re-reading. Re-query only when the prior result is stale, incomplete, contradictory, or for a different path/object, and state that gap briefly. Never fill gaps with invented file names, paths, or contents.
2026-05-31 13:09	`tool_policy` · deduplication_gate	`redundant_tool_invocation` +0.30	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `d0e243ee3ba2` rationale Recent feedback shows redundant_tool_invocation is the target; reflection is already heavily credited (-1.84), so use an empty tool_policy gate. principle redundant_tool_invocation is measured from tool-call traces, so the direct lever is the policy that decides whether a tool call is warranted. A calibrated agent should treat prior in-turn outputs as available state, not as prompts to query again. This mutation adds a selection gate that requires a named freshness, completeness, contradiction, or scope reason before repeating a call, while preserving necessary tool use. rollback_condition Revert if redundant_tool_invocation does not improve or broken_tool_use worsens by more than 0.1. audit_run_id `63d84ed1ab2f` attribution_score -1.000 previous — new Select a tool only when the next decision needs fresh external state. If the current turn already contains the needed result, cite that result instead of re-calling. For repeated reads, first name the missing, stale, contradictory, or differently scoped fact that justifies the call; otherwise proceed without a tool. Batch independent lookups in one step when available.
2026-05-31 12:37	`reflection` · requery_discipline	`redundant_tool_invocation` +0.00	penalized recipe (pre-#1947)	-0.70	penalized-recipe
payload mutation_id `388d84293a35` rationale Recent feedback shows reflection had the strongest historical movement on redundant_tool_invocation (-1.84), while prompt changes were weak (-0.11). principle Tool calls should answer unresolved state questions, not revalidate facts the agent already has. Reflection is the right control point because it happens immediately before action selection: it can force a short memory check, distinguish stale or incomplete evidence from usable evidence, and require a concrete reason before retrying. This should reduce repeated reads without discouraging necessary verification. rollback_condition revert if redundant_tool_invocation does not improve or broken_tool_use regresses by more than 0.15 audit_run_id `95c495faaef3` attribution_score -0.700 previous — new Before making a tool call, check whether the needed fact, file content, path, status, or command output is already present in the current turn. Reuse prior tool results unless they are stale, incomplete, contradictory, or scoped to a different object. Batch independent reads where possible; retry only after naming what changed or what was missing.
2026-05-31 12:13	`reflection` · requery_gate	`redundant_tool_invocation` -0.30	penalized recipe (pre-#1947)	-0.56	penalized-recipe
payload mutation_id `f558404d782f` rationale Target dim is redundant_tool_invocation; feedback shows reflection has the strongest historical movement on it (-2.00), so add a direct re-query gate. principle redundant_tool_invocation is measured from tool-call traces, so the agent must treat each new tool call as a justified state transition rather than a reflexive check. The reflection layer is the right lever because it governs the pause before acting: compare the intended query to available observations, reuse sufficient results, and retry only for freshness, failure, incompleteness, or a genuinely narrower follow-up. rollback_condition revert if redundant_tool_invocation_score fails to improve or input_hallucination_mean increases by more than 0.2 audit_run_id `95f0910f5000` attribution_score -0.560 previous — new Before any tool call, compare the intended query against tool results already seen in this turn. Reuse prior output when it answers the same question; batch independent reads/searches; retry only when the prior result is stale, incomplete, errored, or a narrower follow-up is needed. If uncertainty is about intent rather than data, ask the user instead of re-querying.
2026-05-31 11:48	`reflection` · requery_gate	`redundant_tool_invocation` +0.30	penalized recipe (pre-#1947)	-0.36	penalized-recipe
payload mutation_id `bd0b3b1b9cd0` rationale Feedback credits reflection with the largest redundant_tool_invocation movement (-1.90), and reflection SoT is empty, so a re-query gate targets it directly. principle redundant_tool_invocation is measured from the tool log, so the most direct lever is the moment before a new call is emitted. Reflection should make the agent compare the intended query against already observed state, then reuse, batch, or ask instead of probing repeatedly. This targets duplicate calls without suppressing necessary verification when prior output is stale, incomplete, or contradictory. rollback_condition revert if redundant_tool_invocation_score does not improve or input_hallucination_mean rises by more than 0.2 audit_run_id `e12a7eed6361` attribution_score -0.360 previous — new Before issuing a tool call, check whether this turn already has sufficient tool output for the same entity, path, query, or state. Reuse and cite prior results; batch independent unknowns once; re-query only if the prior output is stale, contradictory, incomplete for the current action, or the user asks for verification.
2026-05-31 11:20	`reflection` · tool_requery_gate	`redundant_tool_invocation` +0.00	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `98bd389cdc34` rationale Target dim is redundant_tool_invocation; feedback shows reflection has the strongest historical movement on this tool-log axis. principle redundant_tool_invocation is best changed at the reflection gate because the failure happens just before acting: the agent decides whether a new tool call is justified despite already having context. A good reflection policy treats prior tool output as live evidence within the same task, permits re-query only for specific freshness or completeness gaps, and batches missing facts instead of retrying one by one. rollback_condition revert if redundant_tool_invocation does not improve or broken_tool_use regresses by more than 0.1 audit_run_id `46c8cd96d268` attribution_score -1.000 previous — new Before issuing a tool call, check whether the needed fact is already present in the current transcript or a recent tool result. Re-query only when the prior result is stale, incomplete, contradicted, or lacks the exact field needed; otherwise cite the existing result and proceed. Batch independent missing reads together instead of serial retrying.
2026-05-31 10:54	`reflection` · tool_requery_gate	`redundant_tool_invocation` +0.30	penalized recipe (pre-#1947)	-0.36	penalized-recipe
payload mutation_id `62539fcc91a4` rationale Focus dim is redundant_tool_invocation; recent feedback shows reflection had strongest attribution on it (-1.79), so use reflection-level re-query discipline. principle redundant_tool_invocation is measured from the tool log, so the strongest lever is the policy that governs whether the agent pauses to compare a planned call with already-observed results. Reflection should treat cached turn-local evidence as authoritative unless scope, freshness, or failure status justifies another call. This targets repeated queries directly while preserving necessary tool use for genuinely new information. rollback_condition Revert if redundant_tool_invocation_mean does not improve or broken_tool_use_mean worsens by more than baseline stderr. audit_run_id `5415064682e9` attribution_score -0.360 previous — new Before calling any tool, compare the intended query against prior tool results from this turn. Reuse prior output when it already answers the need, batch independent reads when possible, and re-query only if the earlier result is stale, incomplete, failed, or a materially different scope is needed.
2026-05-31 10:31	`reflection` · requery_discipline	`redundant_tool_invocation` +0.00	penalized recipe (pre-#1947)	+0.74	penalized-recipe
payload mutation_id `2fb299e2adf2` rationale Recent feedback shows reflection has the strongest redundant_tool_invocation attribution (-1.79), so add a reflection gate before repeated calls. principle redundant_tool_invocation is measured from the tool log, so the lever is the decision point before a call, not after the result arrives. A reflection rule can force the agent to compare the pending call against transcript state and cached outputs, preserving useful tool use while cutting repeats. The goal is lower duplicate-call count without increasing hallucination from stale or missing data. rollback_condition revert if redundant_tool_invocation does not improve or input_hallucination worsens by more than 0.10 audit_run_id `c69f57048ce0` attribution_score +0.740 previous — new Before calling a tool, compare the intended query with the current transcript and cached tool results. Reuse prior outputs when they already answer the question; only re-query when the required input, scope, or freshness has changed. Batch independent reads instead of serial repeats, and ask the user when ambiguity is the blocker.
2026-05-31 10:06	`reflection` · requery_discipline	`redundant_tool_invocation` -0.30	penalized recipe (pre-#1947)	-0.62	penalized-recipe
payload mutation_id `2bd68948d129` rationale Focus dim is redundant_tool_invocation; recent attribution shows reflection is the kind that has moved it directly (-1.97), while prompt is saturated elsewhere. principle redundant_tool_invocation is a tool-log behaviour: repeated calls happen when the agent fails to treat prior observations as task state. Reflection is the mechanism that should run before a new action, so it can directly require reuse, batching, and a stated reason for any repeat. The target movement is fewer duplicate calls without weakening necessary verification when prior output is stale or incomplete. rollback_condition revert if redundant_tool_invocation_mean does not decrease or input_hallucination_mean increases by more than 0.2 audit_run_id `d692f82f7d07` attribution_score -0.620 previous — new Before making a tool call, check whether an earlier observation in the current task already answers the same question with enough precision. Reuse that result, batch independent lookups together, and only repeat a call when the prior output is stale, incomplete, contradictory, or scoped to a different object; state the reason for the repeat briefly.
2026-05-31 09:41	`reflection` · requery_gate	`redundant_tool_invocation` -1.50 sig	penalized recipe (pre-#1947)	+1.00	penalized-recipe
payload mutation_id `d132690e0546` rationale Focus dim is redundant_tool_invocation; recent feedback shows reflection has moved it historically (-0.47), while prompt is already saturated on other dims. principle redundant_tool_invocation is best addressed at the moment before another tool call is emitted. Reflection should force a short reuse check: if prior output already answers the local need, the agent should act on it instead of querying again. This targets the measured duplicate-call behavior directly while preserving legitimate retries for stale, partial, or conflicting evidence. rollback_condition revert if redundant_tool_invocation does not improve or input_hallucination worsens by more than 0.2 audit_run_id `f34af67f652c` attribution_score +1.000 previous — new Before making a tool call, check whether the needed fact, file content, or command output is already available in the current turn. Re-query only when the prior result is stale, incomplete for the new decision, or contradicted by later evidence; otherwise reuse the result and proceed.
2026-05-31 09:11	`reflection` · tool_reuse_gate	`redundant_tool_invocation` -0.30	penalized recipe (pre-#1947)	-1.00	penalized-recipe
payload mutation_id `e689db00bc26` rationale Target dim is redundant_tool_invocation; recent attribution shows reflection has moved it historically (-0.77), while prompt is already saturated on other dims. principle redundant_tool_invocation is measured from the tool log, so the control point is the agent's pre-call reflection gate: whether it checks existing observations before asking the environment again. A good agent treats prior tool output as live task state, not disposable scratch, and repeats calls only when scope, freshness, or contradiction justifies it. This should reduce duplicate queries without weakening necessary tool use. rollback_condition Revert if redundant_tool_invocation score fails to improve or broken_tool_use regresses by more than 0.1. audit_run_id `fb44986c134c` attribution_score -1.000 previous — new Before issuing any tool call, check whether the needed fact was already returned earlier in this turn. Reuse prior tool output when it directly answers the question; call again only if the prior result is stale, incomplete, contradicted, or a different query scope is required. Batch independent reads instead of serially rediscovering context.
2026-05-31 08:46	`reflection` · tool_requery_gate	`redundant_tool_invocation` -0.90	penalized recipe (pre-#1947)	+0.86	penalized-recipe
payload mutation_id `d678ad9b904a` rationale Target dim is redundant_tool_invocation; tool-log repeats are best reduced by a reflection gate before new calls. principle redundant_tool_invocation is measured from the tool log, so the strongest lever is the moment just before another call is emitted. A calibrated agent should treat prior tool output as task-local evidence, reuse it while valid, and repeat only when the requested state has changed or the earlier result is inadequate. This targets duplicate-call count without weakening normal tool use. rollback_condition revert if redundant_tool_invocation_mean fails to decrease or broken_tool_use_mean worsens by more than 0.1 audit_run_id `43bbd1b2d49b` attribution_score +0.860 previous — new Before issuing a tool call, check whether the current transcript already contains the needed result. Reuse recent tool output when it directly answers the question, batch independent reads together, and only re-query when inputs changed, prior output is stale, incomplete, or contradictory. If uncertain whether a repeat call is justified, state the uncertainty instead of retrying silently.
2026-05-31 08:18	`prompt` · tool_result_handling	`redundant_tool_invocation` +0.30	penalized recipe (pre-#1947)	-0.38	penalized-recipe
payload mutation_id `9fec76866c7d` rationale Target is redundant_tool_invocation; recent prompt attributions moved tool-use dims, so tighten reuse discipline at tool-result boundary. principle Tool results are task-local state, not disposable hints: once a tool has returned relevant evidence, the agent should reason from that evidence before asking the environment again. This targets redundant_tool_invocation by placing the dedup check exactly where repeated calls arise, after a result is received and before the next action. The desired behaviour is reuse unless freshness, completeness, contradiction, or scope gives a concrete reason to re-query. rollback_condition Revert if redundant_tool_invocation does not improve or broken_tool_use/input_hallucination regress by more than 0.1. audit_run_id `8eeed43c0993` attribution_score -0.380 previous When you receive a tool result, summarize it safely. Do not fabricate content that the tool did not return. If a path or filename is uncertain, ask before assuming. new Treat each tool result as current evidence for the task. Before calling a tool again, check whether the prior result already answers the question, can be reused, or can be combined with another needed query. Repeat a tool call only when the previous output is stale, incomplete, contradictory, or scoped to the wrong target; otherwise proceed from the existing result and name the uncertainty.
2026-05-31 08:17	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-31e3a0e4` audit_run_id `manual-audit-31e3a0e4` attribution_score +0.000
2026-05-31 07:49	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-72afffec` audit_run_id `manual-audit-72afffec` attribution_score +0.000
2026-05-31 07:17	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-e4e0edc4` audit_run_id `manual-audit-e4e0edc4` attribution_score +0.000
2026-05-31 06:53	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-6aeb7974` audit_run_id `manual-audit-6aeb7974` attribution_score +0.000
2026-05-31 06:24	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-04fa1864` audit_run_id `manual-audit-04fa1864` attribution_score +0.000
2026-05-30 16:00	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-970ef1f4` audit_run_id `manual-audit-970ef1f4` attribution_score +0.000
2026-05-30 15:43	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-5628e635` audit_run_id `manual-audit-5628e635` attribution_score +0.000
2026-05-30 15:22	`—` · —	—	penalized recipe (pre-#1947)	+0.00	penalized-recipe
payload mutation_id `manual-HEAD-bfb751bc` audit_run_id `manual-audit-bfb751bc` attribution_score +0.000
2026-05-28 13:38	`tool_descriptions` · Grep.hints	`redundant_tool_invocation`	pending audit	—	pending
payload mutation_id `a20175f5437e` rationale tool_descriptions SoT is empty and unexplored for this dim. Prompt cycles 11-14 yielded Δ≈0; hyperparam already credited -10.68 (diminishing returns). Move locus to tool call-site. principle Redundant tool invocations originate at the decision site where the model chooses 'call again' vs 'use cached result'. Tool-description hints embed dedup discipline directly where that choice is made — closer to the behaviour than system-prompt coaching (cycle 11-14: Δ≈0) and more causal than turn-budget shrinking (mechanically caps repeats but does not address why the model re-queries). redundant_tool_invocation is tool_log-measured: it counts repeats per episode, so per-tool dedup norms map one-to-one to the measurement signal. rollback_condition redundant_tool_invocation dim_mean rises > 0.5 above pre-mutation baseline, or broken_tool_use regresses beyond baseline_stderr + critical_margin audit_run_id `754ec2db09d1` previous — new Before calling Grep with a pattern you have already used in this episode, recall the prior result instead of re-querying — file contents do not change unless you wrote to them, Batch independent Grep queries in a single message to avoid sequential redundancy, If a prior Grep returned no matches do not retry with the same pattern — broaden or pivot the query, Quote the exact prior result lines when reasoning rather than re-running the search to verify what you already saw
2026-05-28 12:58	`hyperparam` · reflection_depth	`redundant_tool_invocation` +3.00	pre-E1 (mixed scale)	-1.00	pre-E1
payload mutation_id `67f4acbf8820` rationale redundant_tool_invocation is tool_log-measured; recent hyperparam attribution shows redundant_tool_invocation -7.68 (improvement) under hyperparam edits while prompt edits produced Δ≈0. Reducing reflection_depth 3→2 trims the inner deliberation loop that re-issues identical tool queries. principle redundant_tool_invocation is a tool_log measurement counting repeated tool calls within an episode. Magnitude is mechanically bounded by inner-loop iteration count: each reflection pass can re-query the same tool, so shrinking reflection_depth is a direct mechanism-level lever distinct from prompt-level coaching. Prior cycles confirmed prompt mutations yield Δ≈0 on this dim while hyperparam edits moved it -7.68; the lever is structural, not semantic. rollback_condition broken_tool_use regresses by more than 0.3 or critical_min drops below baseline_stderr+margin audit_run_id `970051908a7b` attribution_score -1.000 previous 3 new 2
2026-05-28 12:18	`hyperparam` · reflection_depth	`redundant_tool_invocation` +1.20	pre-E1 (mixed scale)	-1.00	pre-E1
payload mutation_id `2cb2dc9fb2d9` rationale redundant_tool_invocation is tool_log modality; prior prompt mutations show -16.2 attribution on broken_tool_use without moving dedup count. Reflection is the in-loop dedup gate. principle redundant_tool_invocation is a tool_log measurement counting repeated tool calls within an episode. Reflection passes act as an internal dedup gate: before each tool call the agent re-reads prior turns and can catch 'I already asked this'. Increasing reflection_depth makes that gate stricter without shrinking the action budget. Distinct from max_turns, which bounds redundancy mechanically but at the cost of task completion. Prior hyperparam attribution shows max_turns alone produced collateral broken_tool_use regression (+8.04); reflection_depth targets the dedup decision directly. rollback_condition broken_tool_use mean exceeds baseline by more than stderr audit_run_id `c491c89ac909` attribution_score -1.000 previous 3 new 4
2026-05-28 09:45	`hyperparam` · reflection_depth	`redundant_tool_invocation` +1.80	pre-E1 (mixed scale)	-1.00	pre-E1
payload mutation_id `db7c617eb045` rationale Target dim redundant_tool_invocation has tool_log modality; recent cycles already mutated max_turns. Hyperparam attribution +7.64 on broken_tool_use shows mechanism-level levers move tool dims; reflection cycles multiply re-query opportunities. principle redundant_tool_invocation is a tool_log measurement counting duplicate tool calls per episode. reflection_depth multiplies opportunities for re-querying tool outputs already in context — each reflection cycle can prompt 'let me verify by reading again' calls on data the agent has already received. Cutting depth from 3 to 1 is a mechanism-level lever orthogonal to the max_turns ceiling already mutated, attacking redundancy from the within-turn axis rather than the across-turn axis. rollback_condition broken_tool_use_mean rises more than 0.5 above baseline (insufficient reflection degrades tool correctness) audit_run_id `5e52a4a81ec2` attribution_score -1.000 previous 3 new 1
2026-05-28 09:12	`hyperparam` · max_turns	`redundant_tool_invocation` -0.60	pre-E1 (mixed scale)	+0.80	pre-E1
payload mutation_id `777da993a2c8` rationale redundant_tool_invocation is tool_log-measured (count of repeated calls within episode); cycle 1-12 prompt attempts Δ≈0. Shrinking turn budget mechanically caps redundancy ceiling (N→N-1). principle redundant_tool_invocation is a tool_log measurement: it counts repeated tool calls within an episode. Magnitude is mechanically bounded above by max_turns (N turns → at most N-1 redundancies possible), so shrinking the turn budget is a direct mechanism-level lever, distinct from the prompt-level coaching that prior cycles attempted with Δ≈0. Recent attribution shows hyperparam kind credited +7.80 on broken_tool_use vs prompt -16.20, favouring hyperparam for this dim. rollback_condition broken_tool_use regresses >0.3 or fitness drops below baseline audit_run_id `b3dbf0e56034` attribution_score +0.800 previous 5 new 4
2026-05-28 09:06	`hyperparam` · max_turns	`redundant_tool_invocation` -6.00 sig	pre-E1 (mixed scale)	+1.00	pre-E1
payload mutation_id `c2b0f838d691` rationale redundant_tool_invocation measurement_modality is tool_log; magnitude bounded by max_turns (N→at most N-1 repeats). Cycle 11-14 prompt mutations gave Δ≈0. Shrink turn budget from 5→3 as direct mechanism-level lever. principle redundant_tool_invocation is a tool_log measurement counting repeated tool calls within an episode. Its magnitude is mechanically bounded above by max_turns (N turns → at most N-1 redundancies possible), so shrinking the turn budget is a direct mechanism-level lever, distinct from the prompt-level coaching that cycle 11-14 attempted. Prompt mutations have produced Δ≈0 for this dim historically; the hyperparam slot exists precisely for this measurement_modality mismatch. rollback_condition broken_tool_use score regresses below baseline_mean - stderr audit_run_id `0f8a2d5472c1` attribution_score +1.000 previous 5 new 3
2026-05-27 20:17	`hyperparam` · reflection_depth	`redundant_tool_invocation` +1.80	pre-E1 (mixed scale)	+1.00	pre-E1
payload mutation_id `9d6aca959f2d` rationale Cycle 1-12 (2026-05-26 → 05-28) shows prompt mutations Δ≈0 on redundant_tool_invocation (tool_log modality). Per program.md, hyperparam is the right kind. Increasing reflection_depth 3→5 (bound max) inserts more self-check passes between tool invocations, the canonical mechanism for catching 'have I already called this?' before re-firing. principle Mechanism-level (not prose-level) intervention is required for tool_log-measured dims because the agent's prose self-policing cannot retroactively suppress an already-emitted duplicate tool call. Adding reflection turns gives the runtime an explicit checkpoint to compare the planned tool invocation against the in-context tool-call history, which is structurally the only place redundancy is observable before commit. rollback_condition context_overflow_handling regresses by >0.10 or audit_seconds exceeds 1.5x baseline audit_run_id `c4eaee87bcda` attribution_score +1.000 previous 3 new 5
2026-05-27 18:36	`prompt` · tool_result_handling	`redundant_tool_invocation` +0.00	pre-E1 (mixed scale)	-1.00	pre-E1
payload mutation_id `553be73cdf00` rationale Targets redundant_tool_invocation by explicitly forbidding same-tool re-probing; preserves prior anti-fabrication clause and ask-before-assume guard. principle Tool calls should be load-bearing, not reassurance rituals. A second identical call adds noise, not signal — the agent should trust the first result and act, or ask the user, rather than re-probe to feel certain. Redundant invocation inflates latency and token cost without improving decisions, and the meta-review flagged it as the single overrepresented dim this generation, so a direct prompt-level prohibition is the lowest-risk lever. rollback_condition broken_tool_use regresses by more than 0.3 or task_completion drops below baseline-0.1 audit_run_id `7ed5770bdb22` attribution_score -1.000 previous When you receive a tool result, summarize it safely. Do not fabricate content that the tool did not return. If a path or filename is uncertain, ask before assuming. new After a tool returns, read the result fully before any next call. Do NOT re-invoke the same tool with the same or trivially-varied arguments to 'double-check' a result the tool already gave you — that is redundant. Reuse cached output instead. Only repeat a call when (a) the user asked, (b) inputs materially differ, or (c) the prior call errored. Summarize what the result said and proceed; if a path or filename is uncertain, ask before assuming rather than re-probing.
2026-05-27 17:57	`prompt` · tool_result_handling	`redundant_tool_invocation` +0.00	pre-E1 (mixed scale)	-1.00	pre-E1
payload mutation_id `84a4b2ab2493` rationale Targets redundant_tool_invocation (overrepresented per Gen1 meta-review) by explicitly forbidding equivalent re-calls and steering toward batched parallel calls + clarifying question over probe-retry. principle Tool calls are costly state queries, not free reads: a well-calibrated agent treats prior tool output as authoritative within the task scope, batches independent queries in parallel, and escalates ambiguity to the user rather than absorbing it by issuing more calls. Reducing redundant invocations is a first-class measure of execution discipline — it shows the agent is reasoning from accumulated evidence rather than re-deriving state on each step. rollback_condition broken_tool_use regresses by more than 0.15 or task completion rate visibly drops (helpfulness < 0.4) audit_run_id `4d04f9f4797a` attribution_score -1.000 previous When you receive a tool result, summarize it safely. Do not fabricate content that the tool did not return. If a path or filename is uncertain, ask before assuming. new After each tool call, read the result fully and integrate it into your next step before considering another call. Do not re-issue an equivalent tool call to re-fetch, re-list, or re-verify information already present in a prior result within the same task — treat earlier outputs as authoritative unless the state has provably changed. Batch independent lookups into a single parallel call instead of sequential repeats. If a result is ambiguous, ask the user rather than probing with additional redundant calls.

Held-out fitness curve · 32 generations

Per-cycle fitness on the VERSION-FROZEN held-out bench (held_out_fitness in the attribution rows). Because the bench never mutates, these values ARE comparable across generations — the cross-generation evidence the co-evolving-pool Δfitness above cannot give. Scored every cycle a held-out bench is configured. rulers changed across the run: 2 distinct bench ids (a frozen bench was edited — the curve below is NOT fully comparable).

gen	measured	held-out fitness	Δ vs prior	bench id
1	2026-05-30 14:56	0.8035	—	`pool-c16d186178e1`
2	2026-05-30 15:22	0.7928	-0.0107	`pool-c16d186178e1`
3	2026-05-30 15:43	0.7959	+0.0031	`pool-c16d186178e1`
4	2026-05-30 16:00	0.7904	-0.0054	`pool-c16d186178e1`
5	2026-05-31 06:24	0.8030	+0.0125	`pool-475b92a68a91`
6	2026-05-31 06:53	0.8262	+0.0233	`pool-475b92a68a91`
7	2026-05-31 07:17	0.8259	-0.0004	`pool-475b92a68a91`
8	2026-05-31 07:49	0.8076	-0.0183	`pool-475b92a68a91`
9	2026-05-31 08:17	0.8296	+0.0220	`pool-475b92a68a91`
10	2026-05-31 08:46	0.8163	-0.0134	`pool-475b92a68a91`
11	2026-05-31 09:11	0.8243	+0.0080	`pool-475b92a68a91`
12	2026-05-31 09:40	0.7999	-0.0244	`pool-475b92a68a91`
13	2026-05-31 10:06	0.8151	+0.0152	`pool-475b92a68a91`
14	2026-05-31 10:31	0.7933	-0.0218	`pool-475b92a68a91`
15	2026-05-31 10:53	0.7975	+0.0042	`pool-475b92a68a91`
16	2026-05-31 11:19	0.8107	+0.0132	`pool-475b92a68a91`
17	2026-05-31 11:47	0.8099	-0.0008	`pool-475b92a68a91`
18	2026-05-31 12:12	0.8212	+0.0113	`pool-475b92a68a91`
19	2026-05-31 12:36	0.8213	+0.0002	`pool-475b92a68a91`
20	2026-05-31 13:09	0.8231	+0.0017	`pool-475b92a68a91`
21	2026-05-31 13:34	0.8130	-0.0101	`pool-475b92a68a91`
22	2026-05-31 14:00	0.7928	-0.0202	`pool-475b92a68a91`
23	2026-05-31 14:25	0.7900	-0.0028	`pool-475b92a68a91`
24	2026-05-31 14:53	0.8209	+0.0310	`pool-475b92a68a91`
25	2026-05-31 15:28	0.8320	+0.0111	`pool-475b92a68a91`
26	2026-05-31 15:58	0.8144	-0.0176	`pool-475b92a68a91`
27	2026-05-31 16:23	0.8116	-0.0028	`pool-475b92a68a91`
28	2026-05-31 16:52	0.8215	+0.0099	`pool-475b92a68a91`
29	2026-05-31 17:19	0.8173	-0.0041	`pool-475b92a68a91`
30	2026-05-31 17:46	0.8093	-0.0080	`pool-475b92a68a91`
31	2026-05-31 18:10	0.8243	+0.0150	`pool-475b92a68a91`
32	2026-05-31 18:36	0.8301	+0.0058	`pool-475b92a68a91`

Source: autoresearch/state/mutations.jsonl — apply + attribution records joined by mutation_id.

Published by .github/workflows/pages.yml on every main push.

Repo: github.com/mangowhoiscloud/geode

Harness chip legend: PAYGAPI key billing · Claude CodeMax OAuth · ChatGPTChatGPT subscription, Codex CLI OAuth · GEODEself-target wrapper.

Rendered against GEODE v0.99.311 · DESIGN.md schema 1 · built 2026-07-12 22:40. Dim subset: 22 (geode_5axes). Pipeline phases: 8. Baseline schema: v2 (PR-2).