[otel-advisor] add OTel span events for individual agent errors (follow exception semantic convention) #25912

@github-actions

📡 OTel Instrumentation Improvement: Add span events for individual agent errors

Analysis Date: 2026-04-12
Priority: High
Effort: Small (< 2h)

Problem

When a gh-aw agent job fails, all error messages are concatenated into a single span attribute gh-aw.error.messages as a pipe-delimited string ("msg1 | msg2 | msg3"), capped at the first 5 errors. This makes individual errors unqueryable in OTel backends.

The root cause is in sendJobConclusionSpan (send_otlp_span.cjs:699–724): errors are joined into one string attribute rather than emitted as separate, structured span events following the OpenTelemetry exception event semantic convention (opentelemetry.io/redacted).

A DevOps engineer debugging a failed run today cannot:

  • Filter spans where exception.message contains a specific error string
  • Count distinct error types across failures in a Grafana dashboard
  • See errors beyond the first one when statusMessage is truncated at 256 chars
  • Use backend-native exception detection (Datadog, Honeycomb, Tempo) that looks for exception events
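
To make the loss concrete, here is a minimal sketch of the joining step in isolation. The `joinErrors` helper is a hypothetical stand-in for the logic quoted under Current Behavior, not the actual function in send_otlp_span.cjs. It suggests two failure modes: a message that itself contains " | " cannot be split back out, and errors past the fifth are dropped.

```javascript
// Hypothetical stand-in for the current joining logic (illustration only).
function joinErrors(outputErrors) {
  return outputErrors
    .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
    .filter(Boolean)
    .slice(0, 5) // only the first 5 errors survive
    .join(" | ");
}

const errors = [
  { message: "rate limit exceeded" },
  { message: "tool failed: expected a | b" }, // message contains the delimiter
  { message: "timeout" },
  { message: "e4" },
  { message: "e5" },
  { message: "e6" }, // sixth error: silently dropped
];

const joined = joinErrors(errors);
const recovered = joined.split(" | ");

console.log(recovered.length); // 6 fragments, but they are not the 6 original errors
console.log(recovered.includes("tool failed: expected a | b")); // false
console.log(joined.includes("e6")); // false
```

Splitting on the delimiter yields the right count here only by coincidence: one message was broken in two and one error was lost.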

Why This Matters (DevOps Perspective)

MTTR impact: Backends like Grafana Tempo, Honeycomb, and Datadog have first-class support for the exception span event convention. They surface these as structured error records — each error is individually searchable. With the current pipe-delimited string approach, an on-call engineer must manually parse "error A | error B | error C" from a raw attribute and cannot alert on individual error types.

Dashboard gap: You cannot currently build a Grafana panel that counts failed runs grouped by error type. With span events, you get exception.message as a filterable field.

Alert gap: A useful alert would be: "page if any span has an exception event with message matching rate limit". With the current approach this requires regex on the concatenated attribute string.
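
The dashboard and alert cases above reduce to simple operations once each error is its own event. The sketch below imitates them in plain JavaScript over OTLP-shaped span objects. `exceptionMessages`, `countByExceptionMessage`, and `hasExceptionMatching` are hypothetical helper names, and a real backend would express this in its own query language rather than in code.

```javascript
// Read the exception.message values out of one OTLP-shaped span.
function exceptionMessages(span) {
  return (span.events || [])
    .filter(ev => ev.name === "exception")
    .flatMap(ev => ev.attributes || [])
    .filter(attr => attr.key === "exception.message")
    .map(attr => attr.value.stringValue);
}

// Dashboard case: count occurrences of each message across spans.
function countByExceptionMessage(spans) {
  const counts = new Map();
  for (const msg of spans.flatMap(exceptionMessages)) {
    counts.set(msg, (counts.get(msg) || 0) + 1);
  }
  return counts;
}

// Alert case: does any span carry an exception event matching the pattern?
function hasExceptionMatching(spans, pattern) {
  return spans.some(span => exceptionMessages(span).some(m => pattern.test(m)));
}

// Small helper to build sample exception events for the demo.
const ev = msg => ({
  name: "exception",
  attributes: [{ key: "exception.message", value: { stringValue: msg } }],
});

const spans = [
  { events: [ev("rate limit exceeded"), ev("timeout")] },
  { events: [ev("rate limit exceeded")] },
  {}, // successful run: no events
];

console.log(countByExceptionMessage(spans).get("rate limit exceeded")); // 2
console.log(hasExceptionMatching(spans, /rate limit/i)); // true
```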

Current Behavior

// Current: actions/setup/js/send_otlp_span.cjs (lines 699–724)
const errorMessages = outputErrors
  .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
  .filter(Boolean)
  .slice(0, 5);                               // ← only first 5 errors

if (isAgentFailure && errorMessages.length > 0) {
  statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256);  // ← truncated to 256 chars
}
// ...
if (isAgentFailure && errorMessages.length > 0) {
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | ")));  // ← all in one attribute
}

The span payload has no events array; individual errors are invisible to backends.

Proposed Change

// Proposed addition to actions/setup/js/send_otlp_span.cjs

// 1. Extend OTLPSpanOptions typedef (add after `@property {number} [kind]`):
//    @property {Array<{timeUnixNano: string, name: string, attributes: Array<{key: string, value: object}>}>} [events]
//    Span events following the OTel events spec (e.g. exception events).

// 2. Include events in buildOTLPPayload output (add after `attributes` in the span object):
function buildOTLPPayload({ /* ...existing params... */ events, kind = SPAN_KIND_INTERNAL }) {
  // ...existing code...
  return {
    resourceSpans: [{
      // ...
      scopeSpans: [{
        scope: { name: "gh-aw", version: scopeVersion || "unknown" },
        spans: [{
          traceId, spanId,
          ...(parentSpanId ? { parentSpanId } : {}),
          name: spanName, kind,
          startTimeUnixNano: toNanoString(startMs),
          endTimeUnixNano: toNanoString(endMs),
          status, attributes,
          ...(events && events.length > 0 ? { events } : {}),  // ← add this
        }],
      }],
    }],
  };
}

// 3. Emit one exception event per error in sendJobConclusionSpan
//    (add after the existing error attribute block, around line 721):
const errorMs = nowMs();
const spanEvents = isAgentFailure
  ? outputErrors
      .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
      .filter(Boolean)
      .map(msg => ({
        timeUnixNano: toNanoString(errorMs),
        name: "exception",
        attributes: [buildAttr("exception.message", msg.slice(0, 1024))],
      }))
  : [];

const payload = buildOTLPPayload({
  // ...existing fields...
  events: spanEvents,  // ← add this
});
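
Step 3 can be exercised on its own with stand-ins for the helpers it relies on. In the sketch below, `buildAttr` and `toNanoString` are simplified assumptions about what the real helpers in send_otlp_span.cjs do (string attributes only, milliseconds converted to a nanosecond string), and `buildExceptionEvents` is a hypothetical wrapper name. Treat it as an illustration of the event shape rather than the final patch.

```javascript
const MAX_ATTR_VALUE_LENGTH = 1024; // value stated in this issue

// Assumed shape of the real helper: OTLP key/value pair with a string value.
function buildAttr(key, value) {
  return { key, value: { stringValue: String(value) } };
}

// Assumed behavior of the real helper: epoch milliseconds -> nanoseconds as a string.
function toNanoString(ms) {
  return (BigInt(Math.floor(ms)) * 1000000n).toString();
}

// Hypothetical wrapper: one "exception" span event per non-empty error message.
function buildExceptionEvents(outputErrors, timeMs) {
  return outputErrors
    .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
    .filter(Boolean)
    .map(msg => ({
      timeUnixNano: toNanoString(timeMs),
      name: "exception",
      attributes: [buildAttr("exception.message", msg.slice(0, MAX_ATTR_VALUE_LENGTH))],
    }));
}

const events = buildExceptionEvents(
  [{ message: "rate limit exceeded" }, new Error("timeout"), "plain string error", { message: "" }],
  1700000000000
);

console.log(events.length); // 3 (the empty message is filtered out)
console.log(events.map(e => e.attributes[0].value.stringValue));
// [ 'rate limit exceeded', 'timeout', 'plain string error' ]
```

Unlike the existing attribute block, this mapping has no slice(0, 5), so every error becomes an event; whether an upper bound on the number of events is still wanted is a design decision left to the implementer.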

Expected Outcome

After this change:

  • In Grafana Tempo / Honeycomb / Datadog: Each agent error appears as a named span event ("exception") with exception.message as a queryable field. Backend exception detectors trigger automatically. Individual errors can be aggregated in dashboards (count by exception.message).
  • In the JSONL mirror: The events array appears inside each failed-job span record, making post-hoc artifact debugging richer with no extra parsing.
  • For on-call engineers: Each event carries its own error message, truncated only at MAX_ATTR_VALUE_LENGTH = 1024 characters. All errors (not just the first 5) can be recorded. No more guessing which part of "err A | err B | err C" is the root cause.
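
For the JSONL mirror case, post-hoc debugging could then be a short script. The sketch below assumes each JSONL line is a complete OTLP payload ({ resourceSpans: [...] }); that record layout is an assumption, since this issue only states that the events array appears inside each failed-job span record. `exceptionMessagesFromJsonl` is a hypothetical helper name.

```javascript
// Assumption: each JSONL line is a full OTLP payload ({ resourceSpans: [...] }).
function exceptionMessagesFromJsonl(jsonlText) {
  const messages = [];
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const payload = JSON.parse(line);
    for (const rs of payload.resourceSpans || []) {
      for (const ss of rs.scopeSpans || []) {
        for (const span of ss.spans || []) {
          for (const ev of span.events || []) {
            if (ev.name !== "exception") continue;
            for (const attr of ev.attributes || []) {
              if (attr.key === "exception.message") messages.push(attr.value.stringValue);
            }
          }
        }
      }
    }
  }
  return messages;
}

// Two sample records: one failed job with two errors, one successful job.
const sample = [
  JSON.stringify({
    resourceSpans: [{ scopeSpans: [{ spans: [{
      name: "job",
      events: [
        { name: "exception", attributes: [{ key: "exception.message", value: { stringValue: "rate limit exceeded" } }] },
        { name: "exception", attributes: [{ key: "exception.message", value: { stringValue: "timeout" } }] },
      ],
    }] }] }],
  }),
  JSON.stringify({ resourceSpans: [{ scopeSpans: [{ spans: [{ name: "ok-job" }] }] }] }),
].join("\n");

console.log(exceptionMessagesFromJsonl(sample)); // [ 'rate limit exceeded', 'timeout' ]
```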

Implementation Steps

  • In actions/setup/js/send_otlp_span.cjs:
    • Add events to the OTLPSpanOptions typedef JSDoc (optional array of OTLP event objects)
    • Spread events into the span object inside buildOTLPPayload (only when non-empty)
    • In sendJobConclusionSpan, build a spanEvents array from outputErrors and pass it to buildOTLPPayload
  • In actions/setup/js/send_otlp_span.test.cjs:
    • Add a test asserting buildOTLPPayload includes events in the span when passed
    • Add a test asserting sendJobConclusionSpan emits one exception event per error when GH_AW_AGENT_CONCLUSION=failure and agent_output.json contains errors
  • In actions/setup/js/action_conclusion_otlp.test.cjs (or action_otlp.test.cjs):
    • Add a test asserting the failure scenario produces span events with exception.message
  • Run cd actions/setup/js && npx vitest run to confirm tests pass
  • Run make fmt to ensure formatting
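
The first test in the list hinges on the conditional spread: the events key should appear in the span only when the array is non-empty. A framework-free sketch of that behavior follows. `buildSpan` is a hypothetical, heavily reduced stand-in for the span object built inside buildOTLPPayload; a real test would call the actual function under vitest instead.

```javascript
// Hypothetical reduced stand-in for the span object built inside buildOTLPPayload.
function buildSpan({ name, attributes = [], events }) {
  return {
    name,
    attributes,
    ...(events && events.length > 0 ? { events } : {}), // include only when non-empty
  };
}

const withEvents = buildSpan({ name: "job", events: [{ name: "exception", attributes: [] }] });
const withEmpty = buildSpan({ name: "job", events: [] });
const withNone = buildSpan({ name: "job" });

console.log("events" in withEvents); // true
console.log("events" in withEmpty);  // false
console.log("events" in withNone);   // false
```

Omitting the key for successful runs keeps their payloads byte-for-byte unchanged, which is presumably why the proposal spreads conditionally instead of always emitting an empty array.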

Evidence from Live Sentry Data

⚠️ Note: No Sentry MCP tool was available during this analysis run. The recommendation is based entirely on static code review of the instrumentation files. The code gap is unambiguous: buildOTLPPayload has no events field and sendJobConclusionSpan never calls it with events. The OTel semantic convention for exception events (exception event name, exception.message attribute) is a standard that major backends implement natively.

To validate against live data: query any recent failure span in Sentry/Grafana/Honeycomb and confirm that events is absent from the span payload (expected given the code), and that gh-aw.error.messages appears as a single concatenated string.

Related Files

  • actions/setup/js/send_otlp_span.cjs: buildOTLPPayload (add events support) and sendJobConclusionSpan (emit events)
  • actions/setup/js/send_otlp_span.test.cjs: new test cases for span events
  • actions/setup/js/action_otlp.test.cjs: extend failure scenario test to assert events
  • actions/setup/js/action_conclusion_otlp.cjs: no changes needed (delegates to sendJobConclusionSpan)

Generated by the Daily OTel Instrumentation Advisor workflow

  • expires on Apr 19, 2026, 3:19 PM UTC
