Skip to content

Latest commit

 

History

History
344 lines (257 loc) · 14.2 KB

telemetry.md

File metadata and controls

344 lines (257 loc) · 14.2 KB

Telemetry

Development

See aws-toolkit-common/telemetry for full details about defining telemetry metrics.

  • You can define new metrics during development by adding items to telemetry/vscodeTelemetry.json.
    • The generateClients build task generates new symbols in shared/telemetry/telemetry, which you can import via:
      import { telemetry } from '/shared/telemetry/telemetry'
      
    • When your feature is released, the "development" metrics you defined in vscodeTelemetry.json should be upstreamed to aws-toolkit-common.
  • Metrics are dropped (not posted to the service) if the extension is running in CI or other automation tasks.
    • You can always test telemetry via assertTelemetry(), regardless of the current environment.

Guidelines

  • Use run() where possible. It automatically sets the result and reason fields. See below for details.
    • run() gets the reason value from the Error.code property of any exception that is thrown.
    • Your code can throw ToolkitError with a code field to communicate errors, validation issues, or cancellation. See below.
  • The reason and result fields are standard metric fields shared by all Toolkits (VSCode, JetBrains, VisualStudio). They should be used instead of special-purpose metrics or fields.
  • result allows the Toolkits team to monitor all features for potential regressions.
  • reason gives insight into the cause of a result=Failed metric.
  • telemetry.record() called during a telemetry.foo.run(…) context will automatically annotate the current foo metric.

Incrementally Building a Metric

User actions or other features may have multiple stages/steps, called a "workflow" or just "flow". A telemetry "trace" captures a flow as tree of "spans".

For example, setupThing() has multiple steps until it is completed, ending with lastSetupStep().

function setupThing() {
    setupStep1()
    setupStep2()
    ...
    lastSetupStep()
}

If we want to send a metric event, lets call it metric_setupThing, then the code could look something like this:

function setupThing() {
    try {
        ...
        lastSetupStep()
        telemetry.metric_setupThing.emit({result: 'Succeeded', ...})
    }
    catch (e) {
        telemetry.metric_setupThing.emit({result: 'Failed', reason: 'Not Really Sure Why' ...})
    }
}

Here we emitted a final metric based on the failure or success of the entire execution. Each metric is discrete and immediately gets sent to the telemetry service.


But usually code is not flat and there are many nested calls. If something goes wrong during the execution it would be useful to have more specific information at the area of failure. For that we can use run() along with telemetry.record().

run() accepts a callback, and when the callback is executed, any uses of telemetry.record() at any nesting level during execution of that callback, will update the attributes of the "current metric". And at the end (that is, when run() returns) we will emit a single metric with the last updated attributes. Example

When an exception is thrown from a run() context, run() will automatically set the reason field based on the Error code field. You can explicitly set code when throwing a ToolkitError, for example:

throw new ToolkitError('No sso-session name found in ~/.aws/config', { code: 'NoSsoSessionName' })

Note: prefer reason codes with a format similar to existing codes (not sentences). You can find existing codes by searching the codebase:

git grep 'code: '

Example

setupThing()

function setupThing() {
    // Start the run() for metric_setupThing
    telemetry.metric_setupThing.run(span => {
        // Update the metric with initial attributes
        span.record({sessionId: '123456'}) // now no matter where the control flow exits after this line in this method, this attribute will always be set
        ...
        setupStep2()
        ...

        if (userInput.CancelSelected) {
            // By setting the `cancelled` attribute to true, the `result` attribute will be set to Cancelled
            throw new ToolkitError("Thing has been cancelled", { cancelled: true})
        }
    })
    // At this point the final values from the `record()` calls are used to emit a the final metric.
    // If no exceptions have been thrown, the `result` attribute is automatically set to Success.
}

function setupStep2() {
    try {
        // Do work
    }
    catch (e) {
        // Here we can update the metric with more specific information regarding the failure.

        // Also notice we are able to use `telemetry.metric_setupThing` versus `span`.
        // This is due to `metric_setupThing` being added to the "context" from the above run()
        // callback argument. So when we use record() below it will update the same
        // thing that span.record() does.

        // Keep in mind record() must be run inside the callback argument of run() for
        // the attributes of that specific metric to be updated.
        telemetry.metric_setupThing.record({
            workDone: // ...
        })
        // If this exception is allowed to propogate to the `run()`, then the `result` will be automatically set to Failed and the `reason` to the `code` set here
        throw new ToolkitError(e as Error, { code: "SomethingWentWrongInStep2"})
    }
}

Finally, if setupStep2() was the thing that failed we would see a metric like:

{
    "metadata.metricName": "metric_setupThing",
    "sessionId": "123456",
    "result": "Failed",
    "reason": "SomethingWentWrongInStep2",
    ...
}

Adding a "Stack Trace" to your metric

When errors are thrown we do not attach the stack trace in telemetry. We only know about the error itself, but not the path it took to get there. We sometimes need this stack trace to debug, and only have telemetry to get insight on what happened since we do not have access to logs.

Scenario

Common example: "I have a function, thisFailsSometimes() that is called in multiple places. The function sometimes fails, I know from telemetry, but I do not know if it is failing when it is a specific caller. If I knew the call stack/trace that it took to call my function that would help me debug."

function runsSuccessfully() {
    thisFailsSometimes(1) // this succeeds
}

function thisThrows() {
    thisFailsSometimes(0) // this fails
}

function failsDependingOnInput(num: number) {
    return telemetry.my_Metric.run(() => {
        if (number === 0) {
            throw Error('Cannot be 0')
        }
        ...
    })
}

Solution

On class methods, use the @withTelemetryContext() decorator to add context to the execution. Depending on the args set, it provides features like emitting the result, or adding it's context to errors.

NOTE: Decorators are currently only supported for methods and not functions

class MyClass {
    @withTelemetryContext({ name: 'runsSuccessfully', class: 'MyClass' })
    public runsSuccessfully() {
        failsDependingOnInput(1)
    }

    @withTelemetryContext({ name: 'thisThrows', class: 'MyClass', errorCtx: true })
    public thisThrows() {
        failsDependingOnInput(0)
    }

    @withTelemetryContext({ name: 'failsDependingOnInput' class: 'MyClass', emit: true, errorCtx: true})
    private failsDependingOnInput(num: number) {
        if (number === 0) {
            throw Error('Cannot be 0')
        }
        ...
    }
}

// Results in a metric: { source: 'MyClass#thisThrows,failsDependingOnInput', result: 'Failed' }
// Results in an error that has context about the methods that lead up to it.
new MyClass().thisThrows()

Separately if you must use a function, add a value to function in the options of a run(). This will result in a stack of functions identifiers that were previously called before failsDependingOnInput() was run. You can then retrieve the stack in the run() of your final metric using getFunctionStack().

function runsSuccessfully() {
    telemetry.my_Metric.run(() => failsDependingOnInput(1), { functionId: { name: 'runsSuccessfully' }})
}

function thisThrows() {
    telemetry.my_Metric.run(() => failsDependingOnInput(0), { functionId: { source: 'thisThrows' }})
}

function failsDependingOnInput(num: number) {
    return telemetry.my_Metric.run(() => {
        telemetry.record({ theCallStack: asStringifiedStack(telemetry.getFunctionStack())})
        if (number === 0) {
            throw Error('Cannot be 0')
        }
        ...
    }, { functionId: { name: 'failsDependingOnInput' }})
}

// Results in a metric: { theCallStack: 'thisThrows:failsDependingOnInput', result: 'Failed' }
// { theCallStack: 'thisThrows:failsDependingOnInput' } implies 'thisThrows' was run first, then 'failsDependingOnInput'. See docstrings for more info.
thisThrows()

Important Notes

  • If a nested function does not use a run() then it will not be part of the call stack.

    function a() {
        return telemetry.my_Metric.run(() => b(), { functionId: { name: 'a' } })
    }
    
    function b() {
        return c()
    }
    
    function c() {
        return telemetry.my_Metric.run(() => asStringifiedStack(telemetry.getFunctionStack()), {
            functionId: { name: 'c' },
        })
    }
    
    c() // result: 'a:c', note that 'b' is not included
  • If you do not want your run() to emit telemetry, set emit: false in the options

    function a() {
        return telemetry.my_Metric.run(() => b(), { functionId: { name: 'a' }, emit: false })
    }
  • If you want to add to the function stack, but don't have a specific metric to use, use the metric named function_call. Also look to set emit: false if you do not want it to emit telemetry.

    function a() {
        return telemetry.function_call.run(() => b(), { functionId: { name: 'a' }, emit: false })
    }
  • If your function name is generic, look to make it unique so there is no confusion.

    function a() {
        return telemetry.my_Metric.run(() => b(), { functionId: { name: 'aButMoreUnique' } })
    }

Tracing Telemetry Events

All telemetry events include a traceId in addition to other attributes. Traceids allow for improved tracking and correlation of related events across a single operation or user flow.

What is a traceId?

A traceId is a unique identifier that is generated for the top-level telemetry event in a flow and then propagated to all subsequent related events. This allows us to group and analyze all events associated with a particular operation.

How it works

  1. When a top-level telemetry event is created (e.g., vscode_executeCommand), a new traceId is generated.
  2. This traceId is then attached to all subsequent related telemetry events that occur as part of the same operation or flow.
  3. The traceId remains consistent for all events within the same flow

Example

Consider a flow where vscode_executeCommand triggers amazonq_enterFocusChat and amazonq_openChat. The resulting telemetry events would look like this:

vscode_executeCommand:
traceId: 'aaaaa-aaaaa-aaaaa-aaaaa-aaaaa'

amazonq_enterFocusChat
traceId: 'aaaaa-aaaaa-aaaaa-aaaaa-aaaaa'

amazonq_openChat
traceId: 'aaaaa-aaaaa-aaaaa-aaaaa-aaaaa'

allowing us to look up traceId=aaaaa-aaaaa-aaaaa-aaaaa-aaaaa in our telemetry instance and find all the related events.

For more information visit the OpenTelemetry documentation on traces: https://opentelemetry.io/docs/concepts/signals/traces/

Manual Trace ID Instrumentation

In certain scenarios you may need to manually instrument disjoint flows to track how a traceId propagates through them. e.g.

  1. Measuring the time it takes for a message to travel from Amazon Q chat, through VS Code, and back to the customer.
  2. Determining the duration for Amazon Q inline to display a message to the user.

In these cases, where there isn't a direct hierarchy of function calls, manual instrumentation of the traceId is necessary.

Implementation Options

1. When not currently running in a span

If you're not within an active span and you know the traceId you want to use:

telemetry.withTraceId(() => {
    // Code to be executed within this trace
}, 'myTraceId')

This method wraps the provided function with the specified traceId

2. When currently running in a span

If you're already executing within a span (e.g., vscode_executeCommand) and you know the traceId you want to use:

telemetry.record({
    traceId: 'myTraceId',
})

This approach records the traceId for the current span and all future spans within the same execution context.