query scheduler query component load verification #8189

Draft
wants to merge 17 commits into
base: main

@francoposa francoposa commented May 27, 2024

What this PR does

Branched off #8132. This uses the actual code going into main to produce an updated version of this sketch branch.

The important bit is here: https://github.com/grafana/mimir/pull/8189/files#diff-fb3d6b36d3cd727e1dd2353fb86c0aee0bfaae1c900e0f005ab913620021d551R447-R464, and the sleep added in pkg/storegateway/gateway.go

The rest of the diff is just making some private fields public for the purposes of this hacked-up demo.

I ran two instances of a query script against this setup with different parameters: one to hit ingesters only and one to hit store-gateways only. The goal was to verify that ingester queries continue to be serviced when the store-gateways back up.

The query script prints the response status and the error message, if any, so you can see the ingester-only requests continue to get 200 OK while the store-gateway queries receive something like:

Error querying Prometheus: Get "http://localhost:8007/prometheus/api/v1/query_range?query=up+offset+1h0m0s&start=2024-05-27T16:18:17Z&end=2024-05-27T19:18:17Z&step=1m": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

While running the Mimir microservices-mode docker compose setup, you can grep the scheduler logs for "request dropped".

This is a rudimentary simulation: we dequeue the store-gateway request, then drop it if the store-gateway capacity utilization is past the threshold. The current code cannot yet simulate leaving the request in the queue and picking a different one, though that is upcoming.
By extension, we also cannot yet simulate coming back to dequeue the request for the over-threshold component when no other options are found. As a result, this demo drops requests past the utilization threshold even when requests are coming in for only a single query component.

Query script utilized:

package main

import (
	"flag"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

const (
	prometheusURL = "http://localhost:8007/prometheus" // mimir development microservices mode query-frontend
	query         = "up"                               // Change this to your specific Prometheus query
)

func main() {
	// Define a flag for the time range as a string
	var rangeDurationFlag string
	flag.StringVar(&rangeDurationFlag, "range-duration", "6h", "go-parseable duration for the Prometheus query range (e.g., 30m, 1h)")
	var offsetDurationFlag string
	flag.StringVar(&offsetDurationFlag, "offset-duration", "0s", "go-parseable duration for the Prometheus query range offset (e.g., 30m, 1h)")
	var sleepDurationFlag string
	flag.StringVar(&sleepDurationFlag, "sleep-duration", "10s", "go-parseable sleep duration between requests")
	flag.Parse()

	var err error

	var rangeDuration time.Duration
	rangeDuration, err = time.ParseDuration(rangeDurationFlag)
	if err != nil {
		fmt.Printf("Invalid rangeDuration format: %s\n", err)
		return
	}
	var offsetDuration time.Duration
	offsetDuration, err = time.ParseDuration(offsetDurationFlag)
	if err != nil {
		fmt.Printf("Invalid offsetDuration format: %s\n", err)
		return
	}

	var sleepDuration time.Duration
	sleepDuration, err = time.ParseDuration(sleepDurationFlag)
	if err != nil {
		fmt.Printf("Invalid sleepDuration format: %s\n", err)
		return
	}

	client := &http.Client{
		Timeout: 2 * time.Minute,
	}

	for {
		go func() {
			end := time.Now().UTC()          // Current time
			start := end.Add(-rangeDuration) // Dynamic time range before now

			// Format the start and end times to RFC3339 format required by Prometheus
			startStr := start.Format(time.RFC3339)
			fmt.Printf("query_range start: %s\n", startStr)
			endStr := end.Format(time.RFC3339)
			fmt.Printf("query_range end: %s\n", endStr)

			fmt.Printf("query_range offset: %s\n", offsetDuration.String())

			// Apply the offset in PromQL so the query reads older data
			queryStr := query
			if offsetDuration > 0 {
				queryStr = fmt.Sprintf("%s offset %s", query, offsetDuration.String())
			}
			encodedQueryStr := url.QueryEscape(queryStr)

			// Construct the query URL; named queryURL to avoid shadowing the net/url package
			queryURL := fmt.Sprintf("%s/api/v1/query_range?query=%s&start=%s&end=%s&step=1m",
				prometheusURL, encodedQueryStr, startStr, endStr)

			// Make the HTTP GET request
			resp, err := client.Get(queryURL)
			if err != nil {
				fmt.Printf("Error querying Prometheus: %s\n", err)
				return
			}
			defer resp.Body.Close()
			fmt.Printf("response status: %s\n", resp.Status)

			// Drain the body so the connection can be reused; the content itself is not needed
			if _, err := io.Copy(io.Discard, resp.Body); err != nil {
				fmt.Printf("Error reading response body: %s\n", err)
			}
		}()

		// Wait for a bit before making the next request
		time.Sleep(sleepDuration)
	}
}

Script args for ingester-only queries, which continue to be serviced:
-range-duration 5m -sleep-duration 1s

Script args for store-gateway-only queries, which get backed up then dropped:
-range-duration 3h -offset-duration 1h -sleep-duration 1s

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@francoposa francoposa changed the title from "Francoposa/query scheduler query component load verification" to "query scheduler query component load verification" May 27, 2024
@francoposa francoposa changed the base branch from main to francoposa/query-scheduler-query-component-load May 27, 2024 22:44
@francoposa francoposa changed the base branch from francoposa/query-scheduler-query-component-load to main May 27, 2024 22:45
@francoposa francoposa force-pushed the francoposa/query-scheduler-query-component-load-verification branch from 14a1429 to d69510d Compare May 27, 2024 22:51
@francoposa francoposa force-pushed the francoposa/query-scheduler-query-component-load-verification branch from d69510d to 60cd195 Compare May 28, 2024 21:46