fix: improve Slack rate limiting logic when updating alert groups #5287

joeyorlando · 2024-11-21T15:56:39Z

What this PR does

https://www.loom.com/share/1ac33822301444748133ffe72638ddc4

The two asks in the original GH issue were:

Make the error message clearer. We can identify if it's delivering or updating and being rate-limited. This is possible because Slack sets limits per API method. Also, this limit is a per-slack channel while we are posting messages & applying ratelimit per on-call integration, which confuses customers.

Debounce update alert group message in Slack

Both of these have been addressed in this PR

Which issue(s) this PR closes

Closes https://github.com/grafana/oncall-private/issues/2947

Checklist

Unit, integration, and e2e (if applicable) tests updated
Documentation added (or pr:no public docs PR label added if not required)
Added the relevant release notes label (see labels prefixed w/ release:). These labels dictate how your PR will
show up in the autogenerated release notes.

…:grafana/oncall into jorlando/integration-slack-rate-limiting

joeyorlando · 2024-11-21T18:29:27Z

engine/apps/alerts/models/alert_receive_channel.py

mostly type hint updates in this file (was poking around in here so decided to leave type hints just a bit better than I found them)

joeyorlando · 2024-11-21T18:33:37Z

engine/apps/alerts/models/alert_receive_channel.py

        return (
            self.rate_limited_in_slack_at is not None
            and self.rate_limited_in_slack_at + SLACK_RATE_LIMIT_TIMEOUT > timezone.now()
        )

-    def start_send_rate_limit_message_task(self, delay=SLACK_RATE_LIMIT_DELAY):
+    def start_send_rate_limit_message_task(self, error_message_verb: str, delay: int) -> None:


this is currently only invoked in two spots, both of which have been updated (additionally, both of these spots that were invoking this function were already passing in a delay, hence why I removed the default of SLACK_RATE_LIMIT_DELAY)

joeyorlando · 2024-11-21T18:35:23Z

engine/apps/slack/models/slack_message.py

@@ -61,8 +66,10 @@ class SlackMessage(models.Model):
        related_name="slack_messages",
    )

-    # ID of a latest celery task to update the message
    active_update_task_id = models.CharField(max_length=100, null=True, default=None)


this field (conveniently) already existed, but it doesn't look like it was being used? (code search) in this PR I'm going to start reusing this field

joeyorlando · 2024-11-21T18:35:39Z

engine/apps/slack/models/slack_message.py

+    def update_alert_groups_message(self) -> None:
+        """
+        Schedule an update task for the associated alert group's Slack message, respecting the debounce interval.
+
+        This method ensures that updates to the Slack message related to an alert group are not performed
+        too frequently, adhering to the `ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS` debounce interval.
+        It schedules a background task to update the message after the appropriate countdown.
+
+        The method performs the following steps:
+        - Checks if there's already an active update task (`active_update_task_id` is set). If so, exits to prevent
+        duplicate scheduling.
+        - Calculates the time since the last update (`last_updated` field) and determines the remaining time needed
+        to respect the debounce interval.
+        - Schedules the `update_alert_group_slack_message` task with the calculated countdown.
+        - Stores the task ID in `active_update_task_id` to prevent multiple tasks from being scheduled.
+        """
+
+        if not self.alert_group:
+            logger.warning(
+                f"skipping update_alert_groups_message as SlackMessage {self.pk} has no alert_group associated with it"
+            )
+            return
+        elif self.active_update_task_id:
+            logger.info(
+                f"skipping update_alert_groups_message as SlackMessage {self.pk} has an active update task {self.active_update_task_id}"
+            )
+            return
+
+        now = timezone.now()
+
+        # we previously weren't updating the last_updated field for messages, so there will be cases
+        # where the last_updated field is None
+        last_updated = self.last_updated or now
+
+        time_since_last_update = (now - last_updated).total_seconds()
+        remaining_time = self.ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS - int(time_since_last_update)
+        countdown = max(remaining_time, 10)
+
+        logger.info(
+            f"updating message for alert_group {self.alert_group.pk} in {countdown} seconds "
+            f"(debounce interval: {self.ALERT_GROUP_UPDATE_DEBOUNCE_INTERVAL_SECONDS})"
+        )
+
+        task_id = celery_uuid()
+        update_alert_group_slack_message.apply_async((self.pk,), countdown=countdown, task_id=task_id)
+        self.active_update_task_id = task_id
+        self.save(update_fields=["active_update_task_id"])


main change

joeyorlando · 2024-11-21T18:37:46Z

engine/apps/slack/scenarios/distribute_alerts.py

+            if not alert_group_slack_message:
+                logger.info(
+                    f"Skip updating alert group in Slack because alert_group {alert_group.pk} doesn't "
+                    "have a slack message associated with it"
                )


I decided to centralize all of the logic of conditionally updating the alert group's slack message to SlackMessage.update_alert_groups_message (and also decided to drop this caching logic in favour of slack_message.active_update_task_id + celery task countdown (which represents the "debounce interval"))

joeyorlando · 2024-11-21T18:39:11Z

engine/apps/slack/tasks.py

+def update_alert_group_slack_message(slack_message_pk: int) -> None:
+    """
+    Background task to update the Slack message for an alert group.
+
+    This function is intended to be executed as a Celery task. It performs the following:
+    - Compares the current task ID with the `active_update_task_id` stored in the `SlackMessage`.
+      - If they do not match, it means a newer task has been scheduled, so the current task exits to prevent outdated updates.
+    - Uses the `AlertGroupSlackService` to perform the actual update of the Slack message.
+    - Upon successful completion, clears the `active_update_task_id` to allow future updates.
+
+    Args:
+        slack_message_pk (int): The primary key of the `SlackMessage` instance to update.
+    """
+
+    from apps.slack.models import SlackMessage
+
+    current_task_id = update_alert_group_slack_message.request.id

-        return (
-            f"update_incident_slack_message rescheduled because of current task_id ({current_task_id})"
-            f" for alert_group {alert_group_pk} doesn't exist in cache"
+    logger.info(
+        f"update_alert_group_slack_message for slack message {slack_message_pk} started with task_id {current_task_id}"
+    )
+
+    try:
+        slack_message = SlackMessage.objects.get(pk=slack_message_pk)
+    except SlackMessage.DoesNotExist:
+        logger.warning(f"SlackMessage {slack_message_pk} doesn't exist")
+        return
+
+    if not current_task_id == slack_message.active_update_task_id:
+        logger.warning(
+            f"update_alert_group_slack_message skipped, because current_task_id ({current_task_id}) "
+            f"does not equal to active_update_task_id ({slack_message.active_update_task_id}) "
        )
-    if not current_task_id == cached_task_id:
-        return (
-            f"update_incident_slack_message skipped, because of current task_id ({current_task_id})"
-            f" doesn't equal to cached task_id ({cached_task_id}) for alert_group {alert_group_pk}"
+        return
+
+    alert_group = slack_message.alert_group
+    if not alert_group:
+        logger.warning(
+            f"skipping update_alert_group_slack_message as SlackMessage {slack_message_pk} "
+            "doesn't have an alert group associated with it"
        )
+        return
+
+    AlertGroupSlackService(slack_message.slack_team_identity).update_alert_group_slack_message(alert_group)

+    slack_message.active_update_task_id = None
+    slack_message.last_updated = timezone.now()
+    slack_message.save(update_fields=["active_update_task_id", "last_updated"])


main change

joeyorlando · 2024-11-21T18:39:51Z

engine/apps/slack/tasks.py

+
+
+@shared_dedicated_queue_retry_task(autoretry_for=(Exception,), retry_backoff=True)
+def update_incident_slack_message(slack_team_identity_pk: int, alert_group_pk: int) -> None:


this has been renamed to update_alert_group_slack_message.. leaving this around to allow the celery workers to process any remaining update_incident_slack_message tasks (before removing this in a subsequent PR)

joeyorlando · 2024-11-21T18:45:33Z

engine/apps/alerts/tasks/escalate_alert_group.py

@@ -29,7 +28,7 @@ def escalate_alert_group(alert_group_pk):
        except IndexError:
            return f"Alert group with pk {alert_group_pk} doesn't exist"

-        if not compare_escalations(escalate_alert_group.request.id, alert_group.active_escalation_id):


this function felt improperly named to me, as the only invocations of it are to compare celery task ids. Given that this function simply looks like this 👇, decide to just completely remove it in favour of a simple equality check:

def compare_escalations(request_id, active_escalation_id): if request_id == active_escalation_id: return True return False

joeyorlando · 2024-11-27T15:33:53Z

engine/apps/alerts/tests/test_maintenance.py

mock_start_disable_maintenance_task isn't needed.. these tests still pass without it, deleting this fixture

joeyorlando · 2024-11-27T21:29:05Z

engine/apps/slack/models/slack_message.py

@@ -253,3 +263,54 @@ def send_slack_notification(
                    slack_user_identity.send_link_to_slack_message(slack_message)
        except (SlackAPITokenError, SlackAPIMethodNotSupportedForChannelTypeError):
            pass
+
+    def update_alert_groups_message(self, bypass_debounce: bool) -> None:


main change

joeyorlando · 2024-11-27T21:30:27Z

engine/apps/slack/scenarios/distribute_alerts.py

@@ -50,68 +46,85 @@
 logger.setLevel(logging.DEBUG)


-class AlertShootingStep(scenario_step.ScenarioStep):
+class IncomingAlertStep(scenario_step.ScenarioStep):


curious on others thoughts around the naming change here? Personally AlertShootingStep (🔫) was not super clear to me on what it did

joeyorlando · 2024-11-27T21:31:12Z

engine/apps/slack/scenarios/distribute_alerts.py

+            # NOTE: very important. We need to debounce the update_alert_groups_message call here. This is because
+            # we may possibly receive a flood of incoming alerts. We do not want to trigger a Slack message update
+            # for each of these, and hence we should instead debounce them
+            alert_group_slack_message.update_alert_groups_message(bypass_debounce=False)


most important line of the PR

joeyorlando · 2024-11-27T21:44:11Z

engine/apps/slack/scenarios/distribute_alerts.py

@@ -50,68 +46,92 @@
 logger.setLevel(logging.DEBUG)


-class AlertShootingStep(scenario_step.ScenarioStep):
+class IncomingAlertStep(scenario_step.ScenarioStep):
    def process_signal(self, alert: Alert) -> None:


the logic in IncomingAlertStep.process_signal should be functionally the same. I made a few cleanups (like assigning things to variables) and also think I got rid of a few unnecessary SQL queries.

For example, when we were doing AlertGroup.objects.filter(pk=alert.group.pk, ..., we don't need to do an additional SELECT query, since we already have this AlertGroup available via alert.group (alert is passed as the sole argument to IncomingAlertStep.process_signal).

See the _mark_alert_group_slack_message_as_sent function for more context (defined as a nested function inside of process_signal)

joeyorlando · 2024-11-27T21:46:41Z

engine/apps/slack/scenarios/distribute_alerts.py

            except (SlackAPIError, TimeoutError):
-                AlertGroup.objects.filter(pk=alert.group.pk).update(slack_message_sent=False)


by not doing the .update(slack_message_sent=True) call until after we successfully call self._post_alert_group_to_slack, I don't think we need this .update(slack_message_sent=False) call here.

if self._post_alert_group_to_slack throws an exception, _mark_alert_group_slack_message_as_sent will never get called, and slack_message_sent will never get set to True.

joeyorlando · 2024-11-27T21:48:34Z

engine/apps/slack/scenarios/resolution_note.py

-class ReadEditPostmortemStep(ResolutionNoteModalStep):
-    # Left for backward compatibility with slack messages created before postmortems -> resolution note change
-    pass


probably okay to drop this? I don't see any references to this over the last 6 weeks (logs)

joeyorlando added 2 commits November 21, 2024 05:37

fix: improve Slack rate limiting logic when updating alert groups

2e7e46f

remove django-deprecate-fields

ee013d8

joeyorlando mentioned this pull request Nov 21, 2024

wip: optimize slack alert group updates #5185

Closed

3 tasks

joeyorlando had a problem deploying to github-pages November 21, 2024 15:56 — with GitHub Actions Error

joeyorlando had a problem deploying to github-pages November 21, 2024 15:57 — with GitHub Actions Error

joeyorlando and others added 3 commits November 21, 2024 10:59

Merge branch 'dev' into jorlando/integration-slack-rate-limiting

202f8b6

wip

6229ff3

Merge branch 'jorlando/integration-slack-rate-limiting' of github.com…

26df057

…:grafana/oncall into jorlando/integration-slack-rate-limiting

joeyorlando had a problem deploying to github-pages November 21, 2024 15:59 — with GitHub Actions Error

linting

522a699

joeyorlando temporarily deployed to github-pages November 21, 2024 16:04 — with GitHub Actions Inactive

wip

4cafa21

joeyorlando temporarily deployed to github-pages November 21, 2024 18:28 — with GitHub Actions Inactive

wip

2d55fb5

joeyorlando had a problem deploying to github-pages November 21, 2024 18:43 — with GitHub Actions Error

wip

5b9b4a8

joeyorlando temporarily deployed to github-pages November 21, 2024 18:44 — with GitHub Actions Inactive

joeyorlando commented Nov 21, 2024

View reviewed changes

joeyorlando added the pr:no public docs Added to a PR that does not require public documentation updates label Nov 21, 2024

joeyorlando temporarily deployed to github-pages November 26, 2024 20:41 — with GitHub Actions Inactive

wip

0763c46

joeyorlando temporarily deployed to github-pages November 27, 2024 10:59 — with GitHub Actions Inactive

joeyorlando added 2 commits November 27, 2024 10:32

wip

8075140

wip

bd3f6fd

joeyorlando temporarily deployed to github-pages November 27, 2024 15:33 — with GitHub Actions Inactive

joeyorlando commented Nov 27, 2024

View reviewed changes

wip

bb6624b

joeyorlando temporarily deployed to github-pages November 27, 2024 21:21 — with GitHub Actions Inactive

wip

54561d6

joeyorlando temporarily deployed to github-pages November 27, 2024 21:40 — with GitHub Actions Inactive

wip

83a2bf3

joeyorlando temporarily deployed to github-pages November 27, 2024 21:49 — with GitHub Actions Inactive

joeyorlando commented Nov 27, 2024

View reviewed changes

wip

d9c9988

joeyorlando temporarily deployed to github-pages November 27, 2024 23:15 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve Slack rate limiting logic when updating alert groups #5287

fix: improve Slack rate limiting logic when updating alert groups #5287

joeyorlando commented Nov 21, 2024 •

edited

Loading

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 21, 2024

joeyorlando Nov 27, 2024

joeyorlando Nov 27, 2024

joeyorlando Nov 27, 2024

joeyorlando Nov 27, 2024

joeyorlando Nov 27, 2024

joeyorlando Nov 27, 2024

joeyorlando Nov 27, 2024



		@shared_dedicated_queue_retry_task(autoretry_for=(Exception,), retry_backoff=True)
		def update_incident_slack_message(slack_team_identity_pk: int, alert_group_pk: int) -> None:

		except (SlackAPIError, TimeoutError):
		AlertGroup.objects.filter(pk=alert.group.pk).update(slack_message_sent=False)

fix: improve Slack rate limiting logic when updating alert groups #5287

Are you sure you want to change the base?

fix: improve Slack rate limiting logic when updating alert groups #5287

Conversation

joeyorlando commented Nov 21, 2024 • edited Loading

What this PR does

Which issue(s) this PR closes

Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joeyorlando commented Nov 21, 2024 •

edited

Loading