[otel-advisor] add OTel span events for individual agent errors (follow exception semantic convention)

### 📡 OTel Instrumentation Improvement: Add span events for individual agent errors

**Analysis Date**: 2026-04-12
**Priority**: High
**Effort**: Small (< 2h)

### Problem

When a gh-aw agent job fails, all error messages are concatenated into a single span attribute `gh-aw.error.messages` as a pipe-delimited string (`"msg1 | msg2 | msg3"`), capped at the first 5 errors. This makes individual errors **unqueryable** in OTel backends.

The root cause is in `sendJobConclusionSpan` (`send_otlp_span.cjs:699–724`): errors are joined into one string attribute rather than emitted as separate, structured span events following the [OpenTelemetry exception event semantic convention](opentelemetry.io/redacted).

A DevOps engineer debugging a failed run today cannot:
- Filter spans where `exception.message` contains a specific error string
- Count distinct error types across failures in a Grafana dashboard
- See any error beyond the first five, or the full text of the first error once `statusMessage` is truncated at 256 chars
- Use backend-native exception detection (Datadog, Honeycomb, Tempo) that looks for `exception` events

### Why This Matters (DevOps Perspective)

**MTTR impact**: Backends like Grafana Tempo, Honeycomb, and Datadog have first-class support for the `exception` span event convention. They surface these as structured error records — each error is individually searchable. With the current pipe-delimited string approach, an on-call engineer must manually parse `"error A | error B | error C"` from a raw attribute and cannot alert on individual error types.

**Dashboard gap**: You cannot currently build a Grafana panel that counts `failure` runs grouped by their error type. With span events, you get `exception.message` as a filterable field.

**Alert gap**: A useful alert would be: "page if any span has an `exception` event with message matching `rate limit`". With the current approach this requires regex on the concatenated attribute string.
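
To make the parsing problem concrete, the sketch below shows why the joined string cannot be reliably split back into individual errors: a message that itself contains ` | ` is indistinguishable from two messages. The sample error texts are invented for illustration and do not come from a real run.

```javascript
// Illustrative only: these error messages are made up.
const errors = [
  "rate limit exceeded | retry after 60s", // one error whose text contains the delimiter
  "tool call timed out",
];

// What the current code stores in gh-aw.error.messages.
const joined = errors.join(" | ");

// Best-effort recovery on the consumer side.
const recovered = joined.split(" | ");

console.log(errors.length);    // 2
console.log(recovered.length); // 3
```

Two errors go in and three come out, so any count or alert built on the split string can be wrong.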

### Current Behavior

```javascript
// Current: actions/setup/js/send_otlp_span.cjs (lines 699–724)
const errorMessages = outputErrors
  .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
  .filter(Boolean)
  .slice(0, 5);                               // ← only first 5 errors

if (isAgentFailure && errorMessages.length > 0) {
  statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);  // ← truncated to 256 chars
}
// ...
if (isAgentFailure && errorMessages.length > 0) {
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));  // ← all in one attribute
}
```

The span payload has no `events` array; individual errors are invisible to backends.

### Proposed Change

```javascript
// Proposed addition to actions/setup/js/send_otlp_span.cjs

// 1. Extend OTLPSpanOptions typedef (add after @property {number} [kind]):
//    @property {Array<{timeUnixNano: string, name: string, attributes: Array<{key: string, value: object}>}>} [events]
//    Span events following the OTel events spec (e.g. exception events).

// 2. Include events in buildOTLPPayload output (add after `attributes` in the span object):
function buildOTLPPayload({ /* ...existing options... */ events, kind = SPAN_KIND_INTERNAL }) {
  // ...existing code...
  return {
    resourceSpans: [{
      // ...
      scopeSpans: [{
        scope: { name: "gh-aw", version: scopeVersion || "unknown" },
        spans: [{
          traceId, spanId,
          ...(parentSpanId ? { parentSpanId } : {}),
          name: spanName, kind,
          startTimeUnixNano: toNanoString(startMs),
          endTimeUnixNano: toNanoString(endMs),
          status, attributes,
          ...(events && events.length > 0 ? { events } : {}),  // ← add this
        }],
      }],
    }],
  };
}

// 3. Emit one exception event per error in sendJobConclusionSpan
//    (add after the existing error attribute block, around line 721):
const errorMs = nowMs();
const spanEvents = isAgentFailure
  ? outputErrors
      .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
      .filter(Boolean)
      .map(msg => ({
        timeUnixNano: toNanoString(errorMs),
        name: "exception",
        attributes: [buildAttr("exception.message", msg.slice(0, 1024))],
      }))
  : [];

const payload = buildOTLPPayload({
  // ...existing fields...
  events: spanEvents,  // ← add this
});
```
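
For experimenting outside the repository, step 3 can be sketched as a standalone function. This is a minimal sketch, not the actual implementation: `toNanoString` and `buildAttr` below are hypothetical stand-ins for the helpers of the same name in `send_otlp_span.cjs` (the real ones may behave differently), and `buildExceptionEvents` and `MAX_MESSAGE_LENGTH` are names introduced here for illustration.

```javascript
// Hypothetical stand-ins for helpers that already exist in send_otlp_span.cjs;
// the real implementations may differ.
const toNanoString = ms => (BigInt(Math.floor(ms)) * 1000000n).toString();
const buildAttr = (key, value) => ({ key, value: { stringValue: String(value) } });

const MAX_MESSAGE_LENGTH = 1024; // mirrors the 1024-character cap used in the proposal

/** Build one OTLP "exception" span event per non-empty error message. */
function buildExceptionEvents(outputErrors, timestampMs) {
  return outputErrors
    .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
    .filter(Boolean) // drops errors whose message is an empty string
    .map(msg => ({
      timeUnixNano: toNanoString(timestampMs),
      name: "exception",
      attributes: [buildAttr("exception.message", msg.slice(0, MAX_MESSAGE_LENGTH))],
    }));
}

// Sample input: an object with a message, a bare string, and an empty message.
const events = buildExceptionEvents(
  [{ message: "rate limit exceeded" }, "tool call timed out", { message: "" }],
  1700000000000
);
console.log(JSON.stringify(events, null, 2));
```

The sample yields two events, because the entry with an empty message is filtered out. All events share one timestamp, as in the proposal, since the individual errors carry no time of their own.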

### Expected Outcome

After this change:

- **In Grafana Tempo / Honeycomb / Datadog**: Each agent error appears as a named span event (`"exception"`) with `exception.message` as a queryable field. Backend exception detectors trigger automatically. Individual errors can be aggregated in dashboards (`count by exception.message`).
- **In the JSONL mirror**: The `events` array appears inside each failed-job span record, making post-hoc artifact debugging richer with no extra parsing.
- **For on-call engineers**: Each error message is visible in its own event, truncated only at 1024 characters (`MAX_ATTR_VALUE_LENGTH = 1024`) rather than at the 256-character `statusMessage` limit. All errors (not just the first 5) can be recorded. No more guessing which part of `"err A | err B | err C"` is the root cause.
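
As an illustration of the kind of aggregation this enables, the sketch below counts `exception` events by message over JSONL text. It assumes each line of the mirror is a complete OTLP/JSON trace payload (`resourceSpans` → `scopeSpans` → `spans`); the actual record shape of the mirror is not shown in this issue, so the traversal may need adjusting. `countExceptionMessages` and `spanLine` are hypothetical names, and the sample data is invented.

```javascript
/** Count "exception" span events by message across JSONL lines of OTLP trace payloads. */
function countExceptionMessages(jsonlText) {
  const counts = new Map();
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const payload = JSON.parse(line);
    for (const rs of payload.resourceSpans ?? []) {
      for (const ss of rs.scopeSpans ?? []) {
        for (const span of ss.spans ?? []) {
          for (const event of span.events ?? []) {
            if (event.name !== "exception") continue;
            const attr = (event.attributes ?? []).find(a => a.key === "exception.message");
            const message = attr?.value?.stringValue ?? "(no message)";
            counts.set(message, (counts.get(message) ?? 0) + 1);
          }
        }
      }
    }
  }
  return counts;
}

// Build one invented JSONL line containing a single span with the given error messages.
const spanLine = messages =>
  JSON.stringify({
    resourceSpans: [{ scopeSpans: [{ spans: [{
      name: "job",
      events: messages.map(m => ({
        name: "exception",
        attributes: [{ key: "exception.message", value: { stringValue: m } }],
      })),
    }] }] }],
  });

const sampleJsonl = [
  spanLine(["rate limit exceeded", "tool call timed out"]),
  spanLine(["rate limit exceeded"]),
].join("\n");

const counts = countExceptionMessages(sampleJsonl);
console.log(Object.fromEntries(counts));
```

With the pipe-delimited attribute, the same count would require splitting a string whose delimiter can also occur inside a message.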

<details>
<summary><b>Implementation Steps</b></summary>

- [ ] In `actions/setup/js/send_otlp_span.cjs`:
  - Add `events` to the `OTLPSpanOptions` typedef JSDoc (optional array of OTLP event objects)
  - Spread `events` into the span object inside `buildOTLPPayload` (only when non-empty)
  - In `sendJobConclusionSpan`, build a `spanEvents` array from `outputErrors` and pass it to `buildOTLPPayload`
- [ ] In `actions/setup/js/send_otlp_span.test.cjs`:
  - Add a test asserting `buildOTLPPayload` includes `events` in the span when passed
  - Add a test asserting `sendJobConclusionSpan` emits one `exception` event per error when `GH_AW_AGENT_CONCLUSION=failure` and `agent_output.json` contains errors
- [ ] In `actions/setup/js/action_conclusion_otlp.test.cjs` (or `action_otlp.test.cjs`):
  - Add a test asserting the `failure` scenario produces span events with `exception.message`
- [ ] Run `cd actions/setup/js && npx vitest run` to confirm tests pass
- [ ] Run `make fmt` to ensure formatting

</details>
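
The two `buildOTLPPayload` cases the new tests need to pin down are that `events` is present when a non-empty array is passed, and that the key is omitted entirely otherwise. The sketch below reproduces only the conditional spread from the proposal on a reduced, hypothetical `buildSpan` stand-in; it is not the real function and leaves out every other span field.

```javascript
// Reduced, hypothetical stand-in for the span object built inside buildOTLPPayload.
function buildSpan({ name, attributes = [], events }) {
  return {
    name,
    attributes,
    // Same conditional spread as in the proposal: no key at all when empty or missing.
    ...(events && events.length > 0 ? { events } : {}),
  };
}

const withEvents = buildSpan({ name: "job", events: [{ name: "exception", attributes: [] }] });
const withEmptyEvents = buildSpan({ name: "job", events: [] });
const withoutEvents = buildSpan({ name: "job" });

console.log("events" in withEvents);      // true
console.log("events" in withEmptyEvents); // false
console.log("events" in withoutEvents);   // false
```

Omitting the key rather than sending `events: []` keeps the payload for successful runs byte-for-byte unchanged, which should leave existing snapshot-style assertions intact.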

### Evidence from Live Sentry Data

> ⚠️ **Note**: No Sentry MCP tool was available during this analysis run. The recommendation is based entirely on static code review of the instrumentation files. The code gap is unambiguous: `buildOTLPPayload` has no `events` field and `sendJobConclusionSpan` never calls it with events. The OTel semantic convention for exception events (`exception` event name, `exception.message` attribute) is a standard that major backends implement natively.
>
> To validate against live data: query any recent failure span in Sentry/Grafana/Honeycomb and confirm that `events` is absent from the span payload (expected given the code), and that `gh-aw.error.messages` appears as a single concatenated string.
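
The check described above can be sketched as a small helper, assuming a failure span has been exported from the backend in OTLP/JSON shape (backends may expose spans in a different format, in which case the field access needs adapting). `describeErrorEncoding` is a hypothetical name and the sample span is invented.

```javascript
/** Report how a span (OTLP/JSON shape) currently encodes agent errors. */
function describeErrorEncoding(span) {
  const joined = (span.attributes ?? []).find(a => a.key === "gh-aw.error.messages");
  const exceptionEvents = (span.events ?? []).filter(e => e.name === "exception");
  return {
    hasEventsField: Array.isArray(span.events),
    exceptionEventCount: exceptionEvents.length,
    concatenatedMessages: joined?.value?.stringValue ?? null,
  };
}

// Invented example of a failure span as emitted by the current code.
const currentSpan = {
  name: "job",
  attributes: [
    { key: "gh-aw.error.count", value: { intValue: 2 } },
    { key: "gh-aw.error.messages", value: { stringValue: "msg1 | msg2" } },
  ],
};

const report = describeErrorEncoding(currentSpan);
console.log(report);
```

For a span produced by the current code, the expectation is `hasEventsField: false` together with a non-null `concatenatedMessages`.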

### Related Files

- `actions/setup/js/send_otlp_span.cjs` — `buildOTLPPayload` (add `events` support) and `sendJobConclusionSpan` (emit events)
- `actions/setup/js/send_otlp_span.test.cjs` — new test cases for span events
- `actions/setup/js/action_otlp.test.cjs` — extend failure scenario test to assert events
- `actions/setup/js/action_conclusion_otlp.cjs` — no changes needed (delegates to `sendJobConclusionSpan`)

---

*Generated by the [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24309756899) workflow*







> Generated by [Daily OTel Instrumentation Advisor](https://github.com/github/gh-aw/actions/runs/24309756899/agentic_workflow) · ● 156.8K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-otel-instrumentation-advisor%22&type=issues)
> - [x] expires on Apr 19, 2026, 3:19 PM UTC

[otel-advisor] add OTel span events for individual agent errors (follow exception semantic convention) #25912
