Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dispatch the metrics via telemetry #503

Open
slashmili opened this issue May 11, 2022 · 20 comments
Open

Dispatch the metrics via telemetry #503

slashmili opened this issue May 11, 2022 · 20 comments

Comments

@slashmili
Copy link

Hello!

I'm wondering if you are open to a contribution to introduce telemetry into this project?

There lots of important metric that I'd like to tap into especially when consumer connect and rebalance and would be a great addition to this library.

@zmstone
Copy link
Contributor

zmstone commented May 11, 2022

Hi @slashmili

It'd have to be configurable though.
WDYT? @ieQu1 @qzhuyan @id
and @kjellwinblad @starbelly @vikas15bhardwaj inputs are appreciated.

@slashmili
Copy link
Author

@zmstone configureable to disable it completely? Ok fair enough.

Do you prefer it to be done via Macros or via a condition that only runs during runtime? I'm asking since if Macro is they way to go, I need to do some study first.

@zmstone
Copy link
Contributor

zmstone commented May 11, 2022

Hi @slashmili
Ideally the dependency should be optional for users who do not need beam-telemetry.
We can probably introduce a brod_telemetry module to confine the telemetry dependencies
and use a macro to control if it's a no-op or call telemetry instead.
e.g.

-ifndef(BROD_USE_TELEMETRY).
execute(_Name, _Measurements, _Metadata) -> ok.
-else.
execute(Name, Measurements, Metadata) -> telemetry:exeucte(Name, Measurements, Metadata).
-endif.

@starbelly
Copy link
Contributor

Telemetry in brod would be a big win 👍 💯

@spencerdcarlson
Copy link
Contributor

spencerdcarlson commented May 13, 2022

I'd love to help on this wherever I can. I think librdkafka's STATISTICS page could be helpful when trying to figure out exactly what metrics should be emitted.

@spencerdcarlson
Copy link
Contributor

spencerdcarlson commented May 13, 2022

@zmstone regarding

Hi @slashmili Ideally the dependency should be optional for users who do not need beam-telemetry.

Does this actually need to be the case? I thought by nature telemetry uses dynamic dispatching to delegate the execution of the handler or "work" process to the the subscriber so the framework has an "optional" built in. If you subscribe to the events then you "opt in" and some overhead is added, if you do not subscribe, then there is no handler for the given event and there is no overhead added.

It looks like the only overhead added to brod would be an ets lookup to see if a handler for an event exists -- This seems like a very small cost to pay in favor of the added complexity of an optional dependency and macro.

  • emit event - If there is no handler for an event then lists:foreach(ApplyFun, Handlers). will not even execute since Handlers =:= []
  • handler lookup - returns []

I don't see telemetry as an optional dependency for any large packages like ecto_sql which can have an extremely high throughput.

@zmstone
Copy link
Contributor

zmstone commented May 14, 2022

I’m thinking about it more from dependency hygiene perspective but not concerned about performance.

Given it’s essentially two apis to integrate, I assumed it’s not a big effort to implement.

Btw, the compression libs are also made optional (at runtime).

One org may use beam-telemetry, another may have a different.
Being able to “plug” different instrumentation code at compile time makes it easy to extend in the future, also less intrusive to the wrapping project’s dependency management

@victorolinasc
Copy link

Hi @zmstone ! I think the purpouse of the beam-telemetry is to be agnostic of the handling details. It is just a dispatch library that is maintained by the community and endorsed by the observability working group of the ErlEF https://github.com/erlef/observability-wg .

I think that it's goal is exactly to be this standard glueing layer of telemetry. It seems to be aligned with your goal of dependency hygiene (if no handlers are provided, it is almost a pass-through call with no side-effect) and to allow consumers to choose which instrumentation library. From the ones I know it seems they all integrate nicely with telemetry.

Compression is a different case as there is no standard endorsed by the ErlEF currently.

What do you think?

@zmstone
Copy link
Contributor

zmstone commented May 24, 2022

fair enough,
Make sense to have the instrumentation code defined as BROD_ prefixed macros?

So a wrapping project still has the chance to change to other calls without going though beam-telemetry

@slashmili
Copy link
Author

TBH I don't mind wrapping around a brod_telemetry module either way works for me. telemetry has 3 functions we are interested in :

-module(brod_telemetry).

-export([execute/2,
         execute/3,
         span/3]).

execute(EventName, Measurements, Metadata) -> telemetry:execute(EventName, Measurements, Metadata).
execute(EventName, Measurements) -> telemetry:execute(EventName, Measurements).
span(EventPrefix, StartMetadata, SpanFunction) -> telemetry:span(EventPrefix, StartMetadata, SpanFunction).

What I've been struggling to do are :

  1. Where to add these telemetries. At @highmobility we are interested in brod_group_coordinator activities. Like rebalancing and the time it takes to join the group and so on. I suppose others have interest in other parts. I was thinking to start introducing telemetry in places I'm familiar with and then the rest could be added as separate PRs by others(and me ofc).

let me know if you think the first PR should cover telemetry in any other module.

  1. The way to use telemetry:

We can use telemetry in two ways:
A. telemetrye:execute
B. telemetrye:span

While execute is easy and requires less change in the code. I found myself most of the time reaching for span. for example this block of code using span will be like :
FROM:

do_connect(Endpoint, State) ->
  ConnConfig = conn_config(State),
  kpro:connect(Endpoint, ConnConfig).

TO:

do_connect(Endpoint, #state{client_id = ClientId} = State) ->
  StartMetadata =  #{client_id => ClientId},
  brod_telemetry:span(
    [brod, client, connect],
    StartMetadata,
    fun() ->
      ConnConfig = conn_config(State),
      Result = kpro:connect(Endpoint, ConnConfig),
      Metadata = maps:merge(StartMetadata, connect_result_to_metadata(Result)),
      {Result, Metadata}
    end
  ).

connect_result_to_metadata({ok, _}) -> #{status => ok, reason => nil};
connect_result_to_metadata({error, Reason}) -> #{status => error, reason => Reason}.

It's easier to use span since it does keep track of time and also triggers: start, stop and exception if it happens.

do_connect/2 is really a small function, imagine if it's a more complex function. To keep the code simple I have to move a lot of codes from their main function to another function. e.g brod_group_coordinator:join_group/1 becomes like :

join_group(#state{ groupId                   = GroupId
                   , client                  = ClientId
                 } = State0) ->
  StartMetadata = #{client_id => ClientId, group_id => GroupId} ,
  brod_telemetry:span(
    [brod, group_coordinator, join_group],
    StartMetadata,
    fun () ->
        Result = do_join_group(State0),
        {Result, maps:merge(StartMetadata, join_group_to_metadata(Result))}
    end).

join_group_to_metadata({ok, #state{
                memberId     = MemberId
                , leaderId     = LeaderId
                , generationId = GenerationId
                , members      = Members
                }}) ->
  #{
    member_id => MemberId
    , leader_id => LeaderId
    , generation_id => GenerationId
    , members => Members}.

do_join_group....

The code becomes too much noisy that I know I'll hate my PR already 😭

What do you think? Is this kind of changes are inevitable ? or do you have an other suggestions?

@spencerdcarlson
Copy link
Contributor

spencerdcarlson commented May 27, 2022

Just looking at ecto/db_connection suite as a reference — their approach seems to be to time the relevant parts and emit the timed event with execute. sql.ex#L1101-L1103

@slashmili
Copy link
Author

Thanks for the pointer!

I think it was kinda easier choice for Ecto since it already has the data for logging.

I also looked at:
broadway which uses only span
Finch which uses execute to simulate span like events

Ok so it's not uncommon to prevent messing with the code run execute with start, stop manually!

I'll take that approach then

@btkostner
Copy link

This would be a very welcomed feature and would also help integrate open telemetry tracing similar to how they do the oban integration.

@spencerdcarlson
Copy link
Contributor

spencerdcarlson commented Jun 19, 2022

@slashmili let me know what you think of #512. Just starting to take a very light stab at adding telemetry.

@tsloughter
Copy link

On the topic of telemetry (the general term, not the library) there is also a need for propagating distributed trace context in kafka messages and instrumentation with telemetry does not help with this.

So if adding some abstraction that maybe calls telemetry depending on user configuration, it'd be great if there is also a way for the user to set a callback or middleware for updating messages being produced so they can add the context as kafka headers.

If a non-telemetry based way to instrument was provided then we'd probably also create an OpenTelemetry instrumentation instead of doing brod->telemetry->opentelemetry. But having to use a different module for calls to make them instrumented is annoying since it means you can't instrument libraries you are using without forking and changing them.

The way OpenTelemetry tries to make integration zero-cost for a user who doesn't use it is the separation of the API and SDK. Including only opentelemetry_api as a library, like brod or a new app opentelemetry_brod would do, results in calls to start spans or record metrics are no-op's. If the user includes the SDK, opentelemetry, then it hooks itself up so those calls actually do something.

And just in case, the Otel semantic conventions for Kafka may be of use https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/instrumentation/kafka.md

@zmstone
Copy link
Contributor

zmstone commented Jul 8, 2022

Thanks you all for the ideas.
I have pushed a branch dev/telemetry, let's start instrumenting the code in this branch,
when everybody is happy with the instrumentations, we can merge it to master and bump a minor.

@tsloughter
to make the instrumentation nearly no-op for people who do not need it,
my original naive idea was to use a compile flag to be switched on by the wrapping project.

I only had a quick look at opentelemetry_api, do I understand it correctly that: the overhead
for someone who does not need it is a few reads from persistent_term?

@zmstone
Copy link
Contributor

zmstone commented Jul 8, 2022

Hi again @tsloughter
I am not quite sure what would opentelemetry_brod do, or why you brought it up,
is there any reason why someone should not include opentelemetry_api but create a opentelemetry_appname instead ?
There is probably a link somewhere I can read up? :D

@tsloughter
Copy link

@zmstone compile time flag can be nice for some, but being able to switch between no-op and active without recompiling is useful/easier for some.

And yes, the opentelemetry_api macros will lookup the tracer to use with persistent terms and get the no-op tracer in the cases that there is no SDK installed.

There isn't a reason to not include opentelemetry_api. It is just common to have an opentelemetry_appname for various reasons, like the author not wanting to tie themselves to a particular library or just to have it faster/easier to iterate on the instrumentation.

@Graborg
Copy link

Graborg commented Mar 6, 2024

Any updates on this? It would be very valuable.

@dimitarvp
Copy link

Just recently I picked kaffe which steps on brod and I was wondering if OTel is still considered or in progress?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants