Agent Tooling Should Be Composable, Not Convenient

Most software is designed around human fatigue. We reduce clicks, hide options, compress flows, and celebrate when a task goes from ten steps to one. That instinct makes sense for humans. It is not obviously right for agents. Agents can call tools sequentially, write glue code, retry failed steps, inspect intermediate outputs, and compose workflows without getting bored. For agents, the goal is not fewer steps. The goal is better primitives.

May 14, 2026
harness-engineeringagents

Agents don't need defaults. They need contracts.

Most tools are built for humans.

That means they optimize for human fatigue. Fewer keystrokes. Fewer clicks. Short flags. Implicit config. Magic defaults. Pretty output. Helpful prompts.

This makes sense when the caller is a person.

It starts to break when the caller is an agent.

An agent does not care if it has to type --output-format json instead of -o json. It does not get tired of passing a full path, a full account id, or a full argument name. What it struggles with is ambiguity: hidden state, silent defaults, overloaded commands, fuzzy output, and tools that assume the caller remembers what happened five steps ago.

This piece is not about harness design. Tool discovery, planning, context loading, retries, and orchestration are harness problems.

This is about the tool creator's side of the contract.

If you are writing a CLI, API, SDK method, MCP server, or internal utility that an agent might call, the job is simple: make the tool boring, explicit, composable, and safe.

The best reference point is Unix.

Small programs. Clear contracts. Stable inputs. Stable outputs. Composition over cleverness.

But the modern pipe should usually be structured data, not plain text.

Human tools assume context

A lot of human tooling is built around omission.

ls means "list the current directory with normal defaults."

git commit -am "fix" means "stage tracked files and commit with this message."

deploy -p prod -f means whatever your team has learned it means after months of use.

This works because the human brings context. They know the current directory. They know which account is active. They know what -f means in this CLI. They know whether they are in staging or production. Usually.

The tool saves effort by assuming the rest.

For an agent, those assumptions are a liability.

A default is invisible behavior. A current directory is ambient state. A short flag is compressed meaning. An interactive prompt is a blocking protocol. A colorful table is an unstable API.

Humans like tools that let them say less.

Agents need tools that make them guess less.

The goal is not fewer steps

Human UX asks: how do we reduce the number of steps?

Agent UX should ask: how do we reduce ambiguity at each step?

Those are different goals.

A one line magic command can be great for a human and terrible for an agent. A five step sequence of explicit calls can be annoying for a human and perfect for an agent.

Take this:

deploy -p prod -f

The agent has to infer too much.

Does -p mean project, profile, provider, path, port, or production? Does prod mean the production environment or a profile named prod? What does -f force? Does it overwrite? Does it skip checks? Which account is active? Is this safe to retry?

This is longer:

deploy \
  --environment production \
  --project-id payments-api \
  --region us-east-1 \
  --force false \
  --dry-run false \
  --output-format json

It is also much better.

A human may hate typing it. An agent does not care. The command is reviewable, auditable, reproducible, and far harder to misread.

That is the tradeoff.

For agents, correctness beats convenience.

The Unix lesson

Unix tools got one thing right: a tool does not need to understand the whole workflow.

grep does not know why you are searching. sort does not know what the rows mean. uniq does not know the business context. Each tool does one job and exposes an interface that another tool can use.

That is why this works:

cat access.log | grep "500" | awk '{print $1}' | sort | uniq -c | sort -nr

No single command is impressive. The composition is.

This is exactly the shape agent tools should have.

A big workflow-shaped tool gives the agent one path. A small primitive gives the agent a building block. Smaller primitives create more possible workflows because the agent can combine them in ways the tool author did not predict.

But we should not copy Unix literally.

Unix used text because text was the common interface. Agents can use something better: schemas, JSON, typed fields, stable ids, explicit error codes, and metadata.

The lesson is not "everything should be a shell pipeline."

The lesson is: every tool should expose a small contract that composes cleanly with the next tool.

Write tools for literal callers

Agents are capable, but literal. They are tireless, but not reliably careful. They can recover, but only if the tool tells them what happened.

So the tool should carry the clarity.

A good agent-facing tool answers these questions without making the caller infer them:

  • What operation does this perform?
  • What arguments are required?
  • What state does it depend on?
  • Is it read-only or mutating?
  • What happens if it is retried?
  • What output shape comes back?
  • What exactly failed?
  • Did any side effect already happen?

If the answer lives in a README, a convention, a default profile, or a human's memory, the agent will eventually get it wrong.

Make consequential values explicit

Defaults are not evil. Silent defaults are.

Some defaults are harmless: page size, indentation, timeout, maybe sort order. Others change the meaning of the operation: account, project, region, environment, visibility, overwrite behavior, notification behavior, authentication context.

Those should usually be explicit.

Bad:

backup create

Better:

backup create \
--source-directory /var/app/data \
--destination-uri s3://company-backups/payments-api \
--retention-days 30 \
--include-hidden-files false \
--compression gzip \
--encryption enabled \
--output-format json

The second command is verbose. Good.

You can read it in logs and know what the agent attempted. You can review it before execution. You can reproduce it later.

A simple rule: if a wrong default can cause damage, cost money, leak data, or confuse audit logs, do not hide it.

Require the value. Or require an explicit sentinel:

  • -overwrite-existing false
  • -notify-users false
  • -visibility private
  • -use-active-account false

Make the choice visible.

Prefer long names

Short flags are for fingers.

Long flags are for logs, reviews, policies, and agents.

This is human-friendly:

sync -r -f -q ./docs s3://bucket

This is agent-friendly:

sync \
--source-directory /home/agent/workspace/docs \
--destination-uri s3://bucket/docs \
--recursive true \
--overwrite-existing false \
--quiet false \
--output-format json

The second version wastes characters and saves meaning.

That is a good trade.

Parameter names matter. user_id is better than user. project_id is better than project. source_directory is better than src. conflict_resolution is better than mode.

When an agent-generated command causes a bad outcome, you do not want to decode what -x -r -f meant in version 0.8 of some CLI.

You want the command to explain itself.

Avoid ambient state

Ambient state is anything the tool uses without the caller passing it.

Current directory. Active profile. Selected account. Default org. Default region. Environment variables. Cached sessions. Local timezone. Previous command state.

Human tools use this constantly.

Agent tools should avoid it whenever the value matters.

Fragile:

analyze ./src

Better:

analyze \
--source-directory /home/agent/workspace/my-repo/src \

--language typescript \

--output-format json

Same for SaaS tools:

github issue create --title "Bug" --body "..."

Which account? Which owner? Which repo?

Better:

github issue create \
--owner acme \
--repository payments-api \
--title "Bug" \
--body-file /home/agent/workspace/issue.md \
--assignee alice \
--labels bug,priority-high \
--output-format json

A useful test: can someone understand the call from logs alone?

If not, the tool depends on too much hidden context.

Do one operation

"Do one thing well" is still right, but it needs a small update for agents.

Do one operation with a complete contract.

Small does not mean weak. It means bounded.

Bad:

workspace fix

What does that do? Format code? Install packages? Edit files? Run tests? Commit changes? Open a PR?

Better:

workspace inspect --path /repo --output-format json`
workspace format --path /repo --check-only false --output-format json
workspace test --path /repo --test-command "pytest" --output-format json
workspace diff --path /repo --output-format json
workspace commit --path /repo --message-file /tmp/message.txt --output-format json

Now the agent can build its own workflow.

It can stop after the diff. It can retry only the tests. It can ask for approval before the commit. It can swap pytest for another command. It can explain exactly where things failed.

A workflow-shaped tool assumes the path.

A primitive expands the paths available.

Separate reads from writes

This is where agent tooling gets serious.

Reading an issue is not the same as closing it. Listing calendar events is not the same as canceling one. Drafting an email is not the same as sending it.

Do not hide reads and writes behind the same vague verb.

Bad:

calendar manage --event-id evt_123 --status cancelled

Better:

calendar event get \
--event-id evt_123 \
--output-format json
calendar event cancel \
--event-id evt_123 \
--notify-attendees true \
--cancellation-reason-file /tmp/reason.txt \
--idempotency-key cancel_evt_123_2026_05_14 \
--output-format json

A mutating tool should say it mutates. In the name. In the schema. In the docs. In the response.

For writes, add safety controls:

  • -dry-run true
  • -idempotency-key <key>
  • -max-affected-records 10
  • -allow-destructive false
  • -require-confirmation true

Do not make the agent infer whether a tool changes the world.

Make retries safe

Agents retry.

Networks fail. Sandboxes timeout. The model misreads an error. The harness asks again. Long workflows resume after partial failure.

If your mutating tool is not idempotent, the agent will eventually duplicate a side effect.

Bad:

email send --to alice@example.com --subject "Hello" --body "..."

If the request times out, did the email send? Should the agent retry? Will it send twice?

Better:

email send \
--to alice@example.com \
--subject "Hello" \
--body-file /tmp/email-body.md \
--idempotency-key email_alice_2026_05_14_followup \
--output-format json

First response:

{
  "status": "sent",
  "message_id": "msg_123",
  "idempotency_key": "email_alice_2026_05_14_followup",
  "was_duplicate": false
}

Retry response:

{
  "status": "already_sent",
  "message_id": "msg_123",
  "idempotency_key": "email_alice_2026_05_14_followup",
  "was_duplicate": true
}

This is boring infrastructure work. It is also the difference between safe automation and accidental spam.

Return data, not decoration

Human tools can print pretty tables.

Agent tools should return structured output.

Bad:

Created issue #482: Login fails on Safari
Assigned to Alice
Labels: bug, high-priority

Better:

{
  "issue_id": 482,
  "issue_url": "https://github.com/acme/payments-api/issues/482",
  "title": "Login fails on Safari",
  "assignees": ["alice"],
  "labels": ["bug", "high-priority"],
  "state": "open"
}

The second output can be passed to the next step without brittle parsing.

Good agent output has stable ids, normalized enums, timestamps, cursors, status fields, and machine-readable errors.

Avoid color by default. Avoid prose-only responses. Avoid locale-specific dates. Avoid dumping huge payloads. Avoid making the agent scrape your terminal UI.

If a human wants a pretty table, add --output-format table.

The agent path should be JSON first.

Make errors useful

Agents consume failure output too.

Bad:

Error: invalid request

Better:

{
  "error": {
    "code": "missing_required_argument",
    "message": "Missing required argument: --repository",
    "recoverable": true,
    "retry_allowed": true,
    "missing_arguments": ["repository"],
    "side_effect_occurred": false,
    "example": "github issue create --owner acme --repository payments-api --title \"...\" --body-file /tmp/body.md --output-format json"
  }
}

A good error tells the agent what failed, whether retry is safe, whether anything changed, and what a valid call looks like.

A vague error makes the model guess.

Do not make the model guess.

Do not surprise with prompts

Interactive prompts are fine for humans. They are bad as surprise behavior in agent tools.

Bad:

prod delete --service payments

Then the tool blocks:

Are you sure? [y/N]

Better:

prod delete \
--service payments \
--require-confirmation true \
--confirmation-mode external_approval \
--dry-run false \
--idempotency-key delete_payments_2026_05_14 \
--output-format json

And the tool returns:

{
  "status": "approval_required",
  "approval_id": "approval_123",
  "approval_url": "https://example.com/approve/approval_123",
  "expires_at": "2026-05-14T12:00:00Z"
}

Now the agent can stop, ask the user, poll, or hand off.

Human interaction is allowed. Surprise blocking is not.

Make dry runs real

A dry run should not say "looks good."

It should say what would change.

{
  "dry_run": true,
  "would_modify": true,
  "affected_resources": [
    {
      "type": "dns_record",
      "id": "rec_123",
      "change": "update",
      "before": {
        "value": "1.2.3.4"
      },
      "after": {
        "value": "5.6.7.8"
      }
    }
  ],
  "requires_confirmation": true
}

This gives the agent something concrete to show a human or pass to a policy layer.

Execution is not the only interface. Inspection is also an interface.

Be verbose before the call, compact after it

Tool descriptions should be detailed. Tool responses should be tight.

Before the call, the agent needs to know what the tool does, when to use it, when not to use it, which arguments matter, what side effects happen, and what output will come back.

After the call, the agent needs the result and enough metadata to continue.

A vague tool with a giant response has it backwards.

Give the model clarity before the call. Give it signal after the call.

Human mode and agent mode can both exist

This is not an argument against good human UX.

Humans should still get shortcuts, defaults, color, prompts, aliases, and pretty tables.

Agents should get explicit contracts.

Both can exist in the same tool.

Human mode:

ls

Agent mode:

filesystem list \
--directory /home/agent/workspace \
--include-hidden false \
--recursive false \
--sort-by name \
--output-format json

The mistake is making the human shortcut the canonical agent interface.

Human UX and agent UX are different surfaces.

A checklist for agent-facing tools

Before shipping a tool that agents will call, ask:

  • Can the call be understood from logs alone?
  • Are consequential values explicit?
  • Are long argument names available and used in examples?
  • Does the tool avoid hidden account, project, region, and path state?
  • Is the operation bounded?
  • Are read and write operations separate?
  • Does every write support dry run or preview where possible?
  • Are mutating calls idempotent?
  • Is output structured?
  • Are errors machine-readable?
  • Does the tool report whether a side effect happened?
  • Can the output feed the next tool without scraping prose?
  • Can a human review the agent's call and understand intent?

If not, the tool may still be good for humans.

It is not yet good for agents.

The point

Agents do not need every tool to become an agent.

They need tools that are less magical.

Less implicit. Less overloaded. Less dependent on ambient state. Less eager to collapse a workflow into one clever command.

This feels backwards because software has spent years hiding complexity behind friendly interfaces. For humans, that often works. For agents, hiding complexity just moves it into inference.

Inference is a bad place to store an interface contract.

Put the contract in the tool.

Small operations. Explicit arguments. Structured output. Safe retries. Real dry runs. Machine-readable errors. No surprise prompts. No silent dangerous defaults.

Not fewer steps.

Clearer steps.

That is how tools should be written for agents.