📡 OTel Instrumentation Improvement: Add span events for individual agent errors
Analysis Date: 2026-04-12
Priority: High
Effort: Small (< 2h)
Problem
When a gh-aw agent job fails, all error messages are concatenated into a single span attribute `gh-aw.error.messages` as a pipe-delimited string (`"msg1 | msg2 | msg3"`), capped at the first 5 errors. This makes individual errors unqueryable in OTel backends.
The root cause is in `sendJobConclusionSpan` (`send_otlp_span.cjs:699–724`): errors are joined into one string attribute rather than emitted as separate, structured span events following the [OpenTelemetry exception event semantic convention](opentelemetry.io/redacted).
A DevOps engineer debugging a failed run today cannot:
- Filter spans where `exception.message` contains a specific error string
- Count distinct error types across failures in a Grafana dashboard
- See errors beyond the first one when `statusMessage` is truncated at 256 chars
- Use backend-native exception detection (Datadog, Honeycomb, Tempo) that looks for `exception` events
Why This Matters (DevOps Perspective)
MTTR impact: Backends like Grafana Tempo, Honeycomb, and Datadog have first-class support for the `exception` span event convention. They surface these as structured error records, each individually searchable. With the current pipe-delimited string approach, an on-call engineer must manually parse `"error A | error B | error C"` from a raw attribute and cannot alert on individual error types.
Dashboard gap: You cannot currently build a Grafana panel that counts `failure` runs grouped by their error type. With span events, you get `exception.message` as a filterable field.
Alert gap: A useful alert would be: "page if any span has an `exception` event with a message matching `rate limit`". With the current approach this requires regex matching over the concatenated attribute string.
Current Behavior
```javascript
// Current: actions/setup/js/send_otlp_span.cjs (lines 699–724)
const errorMessages = outputErrors
  .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
  .filter(Boolean)
  .slice(0, 5); // ← only first 5 errors
if (isAgentFailure && errorMessages.length > 0) {
  statusMessage = `agent ${agentConclusion}: ${errorMessages[0]}`.slice(0, 256); // ← truncated to 256 chars
}
// ...
if (isAgentFailure && errorMessages.length > 0) {
  attributes.push(buildAttr("gh-aw.error.count", outputErrors.length));
  attributes.push(buildAttr("gh-aw.error.messages", errorMessages.join(" | "))); // ← all in one attribute
}
```
The span payload has no `events` array; individual errors are invisible to backends.
Proposed Change
```javascript
// Proposed addition to actions/setup/js/send_otlp_span.cjs

// 1. Extend the OTLPSpanOptions typedef (add after `@property {number} [kind]`):
//    @property {Array<{timeUnixNano: string, name: string, attributes: Array<{key: string, value: object}>}>} [events]
//      Span events following the OTel events spec (e.g. exception events).

// 2. Include events in buildOTLPPayload output (add after `attributes` in the span object):
function buildOTLPPayload({ ..., events, kind = SPAN_KIND_INTERNAL }) {
  // ...existing code...
  return {
    resourceSpans: [{
      // ...
      scopeSpans: [{
        scope: { name: "gh-aw", version: scopeVersion || "unknown" },
        spans: [{
          traceId, spanId,
          ...(parentSpanId ? { parentSpanId } : {}),
          name: spanName, kind,
          startTimeUnixNano: toNanoString(startMs),
          endTimeUnixNano: toNanoString(endMs),
          status, attributes,
          ...(events && events.length > 0 ? { events } : {}), // ← add this
        }],
      }],
    }],
  };
}

// 3. Emit one exception event per error in sendJobConclusionSpan
//    (add after the existing error attribute block, around line 721):
const errorMs = nowMs();
const spanEvents = isAgentFailure
  ? outputErrors
      .map(e => (e && typeof e.message === "string" ? e.message : String(e)))
      .filter(Boolean)
      .map(msg => ({
        timeUnixNano: toNanoString(errorMs),
        name: "exception",
        attributes: [buildAttr("exception.message", msg.slice(0, 1024))],
      }))
  : [];
const payload = buildOTLPPayload({
  // ...existing fields...
  events: spanEvents, // ← add this
});
```
Expected Outcome
After this change:
- In Grafana Tempo / Honeycomb / Datadog: Each agent error appears as a named span event (`"exception"`) with `exception.message` as a queryable field. Backend exception detectors trigger automatically. Individual errors can be aggregated in dashboards (count by `exception.message`).
- In the JSONL mirror: The `events` array appears inside each failed-job span record, making post-hoc artifact debugging richer with no extra parsing.
- For on-call engineers: Full, untruncated error messages are visible per event (up to `MAX_ATTR_VALUE_LENGTH = 1024`). All errors (not just the first 5) can be recorded. No more guessing which part of `"err A | err B | err C"` is the root cause.
Implementation Steps
1. `actions/setup/js/send_otlp_span.cjs`:
   - Add `events` to the `OTLPSpanOptions` typedef JSDoc (optional array of OTLP event objects)
   - Spread `events` into the span object inside `buildOTLPPayload` (only when non-empty)
   - In `sendJobConclusionSpan`, build a `spanEvents` array from `outputErrors` and pass it to `buildOTLPPayload`
2. `actions/setup/js/send_otlp_span.test.cjs`:
   - Assert `buildOTLPPayload` includes `events` in the span when passed
   - Assert `sendJobConclusionSpan` emits one `exception` event per error when `GH_AW_AGENT_CONCLUSION=failure` and `agent_output.json` contains errors
3. `actions/setup/js/action_conclusion_otlp.test.cjs` (or `action_otlp.test.cjs`): assert the `failure` scenario produces span events with `exception.message`
4. Run `cd actions/setup/js && npx vitest run` to confirm tests pass
5. Run `make fmt` to ensure formatting
Evidence from Live Sentry Data
⚠️ Note: No Sentry MCP tool was available during this analysis run. The recommendation is based entirely on static code review of the instrumentation files. The code gap is unambiguous: `buildOTLPPayload` has no `events` field and `sendJobConclusionSpan` never calls it with events. The OTel semantic convention for exception events (`exception` event name, `exception.message` attribute) is a standard that major backends implement natively.
To validate against live data: query any recent failure span in Sentry/Grafana/Honeycomb and confirm that `events` is absent from the span payload (expected given the code), and that `gh-aw.error.messages` appears as a single concatenated string.
Related Files
actions/setup/js/send_otlp_span.cjs — buildOTLPPayload (add events support) and sendJobConclusionSpan (emit events)
actions/setup/js/send_otlp_span.test.cjs — new test cases for span events
actions/setup/js/action_otlp.test.cjs — extend failure scenario test to assert events
actions/setup/js/action_conclusion_otlp.cjs — no changes needed (delegates to sendJobConclusionSpan)
Generated by the Daily OTel Instrumentation Advisor workflow