maint: prep for 2.9 release #1453

Merged Dec 3, 2024 (6 commits)
90 changes: 90 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,95 @@
# Refinery Changelog

## 2.9.0 2024-12-03

This release introduces a variety of enhancements and bug fixes. It has two major features: one that improves memory consumption reporting, and one experimental feature for configuring trace locality mode.
See full details in [the Release Notes](./RELEASE_NOTES.md).

### Features

- feat: rename DisableTraceLocality to TraceCache (#1450) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Add 'unpublished' flag to configs (#1446) | [Kent Quirk](https://github.com/kentquirk)
- feat: Rename EnableTraceLocality to DisableTraceLocality (#1442) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: add a limit to queue draining logic (#1441) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Try to drain incoming and peer queues for an amount of time (#1440) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: ignore trace decision messages produced by the publishers (#1437) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Add basic telemetry to event, batch and OTLP endpoints (#1431) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: compress kept trace decision message (#1430) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: publish instanceID during peer comms (#1420) | [Kent Quirk](https://github.com/)
- feat: increase KeptDecisionSendInterval default value to 1s (#1421) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: batch kept decisions (#1419) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: set a better version for dev builds (#1415) | [Robb Kidd](https://github.com/robbkidd)
- feat: only enable stress relief for the entire cluster together (#1413) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: send kept trace decision in a separate goroutine (#1412) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: only redistribute traces when its ownership has changed (#1411) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Add a way to specify the team key for config fetches (experimental) (#1410) | [Kent Quirk](https://github.com/kentquirk)
- feat: send drop decisions in batch (#1402) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Update in-memory trace cache to use LRU instead of ring buffer (#1359) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: Log response bodies when sending events to Honeycomb (#1386) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: make collector health check timeout configurable (#1371) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Record original user agent for spans and logs (#1358) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: forward decision span through peer endpoint (#1342) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: extract decision span from full span (#1338) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat(doc): separate table for metrics contains prefix (#1354) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: generate metrics documentation (#1351) | [Yingrong Zhao](https://github.com/vinozzZ)
- feat: Improve shutdown logic (#1347) | [Kent Quirk](https://github.com/kentquirk)
- feat: Update Honeycomb logger to use EMAThroughput sampler (#1328) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: Improve log messages to be more informative (#1322) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- feat: extract key fields from rules config (#1327) | [Yingrong Zhao](https://github.com/vinozzZ)

### Fixes

- fix: documentation bug (#1449) | [Kent Quirk](https://github.com/kentquirk)
- fix: missing read lock in MapWithTTL (#1445) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: revert draining logic for incoming and peer queue (#1443) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: only ignore messages that are coming from the node itself (#1438) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: Update flaky test (#1436) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- fix: send all traffic through deterministic sampler during stress relief activated (#1433) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: more reliable dev version tagging (#1424) | [Robb Kidd](https://github.com/robbkidd)
- fix: do not use trace object during processTraceDecisions (#1423) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: count the number of IDs in drop decision messages (#1416) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- fix: replace api key with SendKey before transmission (#1404) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: deal with orphan traces and expired traces (#1408) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: reset redistribution delay on peer membership change (#1403) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: explicitly assign float64 type for trace cache metrics (#1406) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: deal with orphan traces in trace cache (#1405) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: use current node address as default peer list (#1388) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: Only set incoming user agent if not already present (#1366) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- fix: Put a limit on the size of sampler keys (#1364) | [Kent Quirk](https://github.com/kentquirk)
- fix: set 0 for otel metrics during registration (#1352) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: remove InMemoryCollector from liveness check on shutdown (#1349) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: ConvertNumeric now handles bools (#1336) | [Kent Quirk](https://github.com/kentquirk)
- fix: remove unnecessary assertion to any (#1333) | [Yingrong Zhao](https://github.com/vinozzZ)
- fix: Use peer transmission during redistribute and shutdown events (#1332) | [Mike Goldsmith](https://github.com/MikeGoldsmith)

### Maintenance

- maint: remove unused collect_cache metrics (#1452) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint(deps): bump the minor-patch group with 6 updates (#1451) | [dependabot](https://github.com/dependabot)
- maint: update metrics doc (#1448) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint: remove trace cache metrics (#1447) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint: clean up sampler log entry (#1444) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint(deps): bump the minor-patch group with 11 updates (#1428) | [dependabot](https://github.com/dependabot)
- maint: Add missing LICENSE file (#1429) | [Kent Quirk](https://github.com/kentquirk)
- maint: build fixes (#1427) | [Kent Quirk](https://github.com/kentquirk)
- maint: Update log level for making a trace decision to debug (#1425) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- docs: update config docs for compatibility of using DryRun and EnableTraceLocality together (#1418) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- maint: add comments about an edge case in SendKey (#1387) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint: Update OTel dependencies (#1409) | [Mike Goldsmith](https://github.com/MikeGoldsmith)
- docs: Document the ability to use prefix in dynamic sampler FieldList (#1396) | [Irving Popovetsky](https://github.com/irvingpop)
- maint: cherry pick v2.8.4 commits into main. (#1383) | [Tyler Helmuth](https://github.com/TylerHelmuth)
- maint: Update main documentation with 2.8.3 release (#1374) | [Tyler Helmuth](https://github.com/TylerHelmuth)
- docs: Update configMeta.yaml with capitalization fixes (#1373) | [Mary J.](https://github.com/mjingle)
- maint: add collector_redistribute_traces_duration_ms metric (#1368) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint(deps): bump the minor-patch group with 13 updates (#1357) | [dependabot](https://github.com/dependabot)
- maint: Refactor metrics registration to streamline declaration and enable easier documentation generation (#1350) | [Yingrong Zhao](https://github.com/vinozzZ)
- maint: rename sent_reason_cache to kept_reason_cache (#1346) | [Yingrong Zhao](https://github.com/vinozzZ)

## 2.8.4 2024-10-11

### Fixes
60 changes: 60 additions & 0 deletions RELEASE_NOTES.md
@@ -2,6 +2,66 @@

While [CHANGELOG.md](./CHANGELOG.md) contains detailed documentation and links to all the source code changes in a given release, this document is intended to give a more comprehensible account of each release from the point of view of users of Refinery.

## Version 2.9.0

This release has two major features: one that improves memory consumption reporting, and one experimental feature for configuring trace locality mode.

### Improved Memory Usage

Refinery 2.9 changes the way each instance consumes and releases memory, which results in a more accurate representation of actual memory usage. Previously, an instance would consume memory up to its allocated maximum and then stay roughly at that level. Now it releases memory sooner for traces it has already made a decision on, which makes reported memory usage a more accurate indicator.

We’re hopeful this change to memory consumption will lead to a more reliable way to automatically scale a Refinery cluster, such as with the Kubernetes Horizontal Pod Autoscaler (HPA).

However, because Refinery concentrates all of the spans on a single trace to the same node, it’s possible that a very large trace can cause memory spikes on a single node in a way that is immune to horizontal scaling. If your service generates very large traces (10,000 spans or more) you should address that issue before attempting autoscaling.

### Trace Locality Mode (Experimental)

Since its early days, Refinery has allocated individual traces to specific Refinery instances using the span’s trace ID, with an equal portion allocated to each. Refinery forwards spans belonging to other instances so that all spans for a given trace ID are stored and handled by a single instance. This method is known as “Trace Locality”, and its default mode is `concentrated`. It has served Refinery well for a long time but has two major issues:

1. It requires peer communication to forward these spans to the designated instance, and this traffic grows as the number of instances in the cluster increases.
2. Very large traces (10,000+ spans) put increased memory pressure on the designated instance, which can degrade stability, particularly because of the memory needed to store all the spans. This is also the primary reason why horizontal autoscaling is inconsistent: a single node might have irregular memory consumption compared to the others.
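
To make the concentrated model concrete, here is a minimal sketch of how a trace ID could map to an owning instance. The names `ownerFor` and `shouldForward` are hypothetical, and the hash-modulo scheme is an assumption chosen for brevity; it is not Refinery's actual sharding algorithm.

```go
package main

import "hash/fnv"

// ownerFor picks the peer that owns a trace by hashing its trace ID.
// Hypothetical helper: hash-modulo is a stand-in for illustration only,
// not Refinery's real sharding algorithm.
func ownerFor(traceID string, peers []string) string {
	if len(peers) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(traceID)) // Write on a hash.Hash never returns an error
	return peers[int(h.Sum32()%uint32(len(peers)))]
}

// shouldForward reports whether a span received on the instance named self
// belongs to a different peer and therefore has to be forwarded.
func shouldForward(traceID, self string, peers []string) bool {
	return ownerFor(traceID, peers) != self
}
```

Every instance computes the same owner for a given trace ID, so all spans of a trace converge on one node. That is also why a single very large trace lands entirely on one instance.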

Refinery 2.9 introduces a new experimental feature that allows configuring a different mode for trace locality within the cluster. With `TraceLocalityMode` set to `distributed`, spans are accepted and held by the first instance that receives them, and a special, smaller “decision” span is sent to the designated owner. Decision spans contain only the data required for the designated Refinery instance to make a decision; depending on the size of the span and the rules configuration, they can be significantly smaller. We have seen reductions of 70-90% in our testing.
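
As a rough illustration of why decision spans are smaller, the sketch below copies only the fields a sampler would need. The `Span` type, the `decisionSpan` function, and the field names in the usage note are all hypothetical; Refinery's real span representation and key-field extraction may differ.

```go
package main

// Span is a hypothetical, simplified span: a trace ID plus arbitrary fields.
type Span struct {
	TraceID string
	Fields  map[string]any
}

// decisionSpan returns a reduced copy of full that keeps only the fields
// named in keyFields (for example, the fields referenced by sampling rules).
// The original span is left untouched on the instance that received it.
func decisionSpan(full Span, keyFields []string) Span {
	out := Span{TraceID: full.TraceID, Fields: make(map[string]any, len(keyFields))}
	for _, k := range keyFields {
		if v, ok := full.Fields[k]; ok {
			out.Fields[k] = v
		}
	}
	return out
}
```

For a span with dozens of attributes and rules that reference only `http.status`, `decisionSpan(span, []string{"http.status"})` would carry a single field to the owner.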

Now that spans are spread across the cluster, when the designated instance of a trace makes a decision, it needs to share that decision with its peers so they can apply the decision on the spans they are holding. This is done using Redis’ publish/subscribe (pubsub) messaging system, which allows each Refinery instance to subscribe to decision messages.
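
The receiving side of that exchange can be sketched as follows. This is an in-process illustration only: the Redis pub/sub transport is omitted, and the `Decision` and `instance` types are hypothetical rather than Refinery's actual message format.

```go
package main

// Decision is a hypothetical decision message for one trace.
type Decision struct {
	TraceID string
	Kept    bool
}

// instance models one node that holds spans locally until a decision arrives.
type instance struct {
	held map[string][]string // trace ID -> names of spans held on this node
	sent []string            // spans forwarded upstream after a "kept" decision
}

// applyDecision is what a subscriber would run for each decision message:
// release the held spans, sending them on only if the trace was kept.
func (i *instance) applyDecision(d Decision) {
	spans := i.held[d.TraceID]
	delete(i.held, d.TraceID)
	if d.Kept {
		i.sent = append(i.sent, spans...)
	}
}
```

Each instance applies the same decision to whatever subset of the trace it holds, so the trace is kept or dropped consistently without moving full spans between peers.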

With distributed mode, the cluster is more resilient to large traces because spans are distributed across all instances, and peer communication is reduced because decision spans transfer less data between instances. Horizontal autoscaling is also more consistent.

In our testing, we have seen that memory usage on the nodes in the cluster tends to rise and fall together, which we believe may lead us to the ability to reliably scale horizontally based on memory consumption.

If you’re interested in testing this experimental feature, please reach out to your account manager. We’re very eager to get additional feedback.

Note: The increased use of Redis as the mechanism to share trace decisions between instances requires a larger Redis installation than has been needed to date. Our initial testing indicates that Redis requires more CPU and network IO compared to previous Refinery clusters that only handled peer changes.

### Bug fixes

In addition to the above features, we’ve also resolved the following bugs.

_Collect loop taking too long_

In some circumstances it was possible that a Refinery instance would spend longer than expected making trace decisions, causing it to not respond quickly enough to its health checks. This could result in Kubernetes marking the instance as unhealthy, and then killing it.

We added the new `MaxExpiredTraces` configuration option to limit the number of traces that can be handled at a time.
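
The idea behind such a limit can be sketched as below. The function name `takeExpired` is hypothetical, and treating a non-positive limit as "no limit" is an assumption of this sketch, not documented Refinery behavior.

```go
package main

// takeExpired splits the expired trace IDs into the batch to process in this
// cycle (at most max entries) and the remainder left for later cycles.
// Assumption for this sketch: max <= 0 means "no limit".
func takeExpired(expired []string, max int) (batch, rest []string) {
	if max <= 0 || max >= len(expired) {
		return expired, nil
	}
	return expired[:max], expired[max:]
}
```

Capping the batch bounds the time spent in one cycle of the collect loop, so the instance can get back to answering health checks even when many traces expire at once.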

_Consecutive trace redistribution events_

When the number of instances in a Refinery cluster changes, each instance schedules a redistribution of traces that no longer belong to it (e.g. because the owner changed). When consecutive changes to the number of peers happen, e.g. a Kubernetes deployment that restarts pods, instances can end up performing consecutive redistribution actions.

We’ve added the new `RedistributionDelay` configuration option; this is the duration that each instance will wait before performing trace redistribution. Each change in the cluster instances resets the timer so that trace redistribution happens after the cluster has stabilized.
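
The reset-on-change behavior is a debounce. A minimal sketch, assuming a hypothetical `redistributor` type that is told about peer changes and polled with the current time (Refinery's real implementation may be structured differently):

```go
package main

import "time"

// redistributor debounces peer-membership changes: redistribution becomes due
// only once delay has elapsed since the most recent change.
type redistributor struct {
	delay    time.Duration
	deadline time.Time
	pending  bool
}

// peersChanged records a membership change and restarts the wait.
func (r *redistributor) peersChanged(now time.Time) {
	r.deadline = now.Add(r.delay)
	r.pending = true
}

// due reports whether redistribution should run now. It returns true at most
// once per burst of changes.
func (r *redistributor) due(now time.Time) bool {
	if r.pending && !now.Before(r.deadline) {
		r.pending = false
		return true
	}
	return false
}
```

With a 30-second delay, changes at t=0s and t=20s produce a single redistribution at t=50s instead of two back-to-back ones.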

_SendKey_

There were two bug fixes related to the new SendKey feature, which allows Refinery to use its configured Honeycomb API key when sending telemetry. The first adds support for using SendKey via the command line. The second fixes a bug where spans forwarded to peers would be dropped if the list of allowed keys did not contain the send key.

### Configuration Changes

The following is a list of configuration changes for the 2.9 release.

- MaxExpiredTraces - The maximum number of traces that can be processed per cycle of the collect loop. This is used to prevent the collect loop from taking too long and causing health checks to fail.
- Insecure - Allows Refinery’s internal telemetry to be sent to an unsecured destination, for example an OpenTelemetry Collector on the same network.
- RedistributionDelay - The amount of time to wait after the last cluster peers have been updated before redistributing traces to new owners. Each observed change during the delay will reset the timer so that only one redistribution event occurs.
- CacheCapacity - This was the number of traces to keep in the cache's circular buffer. However, in this release, the trace cache was reimplemented using a priority queue which uses memory dynamically. This setting is now deprecated and no longer controls the cache size. Instead the maximum memory usage is controlled by MaxMemoryPercentage and MaxAlloc.

## Version 2.8.4

This is a bug fix release and includes the following change: