what am i missing? workers immediately disconnect #2945
-
it's a little bit hard to know, but a couple of things jump right out at me. This is at best meaningless:
every FastHttpUser gets its own connection pool by default, so there's probably no advantage to creating a shared one, and it's potentially harmful. This also looks a little weird, and potentially bad for performance:
Maybe just use indices instead of an iterator? And is your data really big or something? If that doesn't help, try simplifying your example further and I can have another look.
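A minimal sketch of what that suggestion could look like (the host, endpoint and data names here are hypothetical, and rows is assumed to be filled once per worker process before users spawn, e.g. by the init-listener sketch further down):

import random
from locust import task, constant_throughput
from locust.contrib.fasthttp import FastHttpUser

rows = []  # assumed to be populated once per worker process, not once per user

class ApiUser(FastHttpUser):
    # each FastHttpUser instance already keeps its own connection pool,
    # so no shared or custom pool is created here
    host = "http://localhost:8080"  # hypothetical target
    wait_time = constant_throughput(4)

    def on_start(self):
        # start each user at a random offset instead of giving it its own iterator
        self.idx = random.randrange(len(rows))

    @task
    def post_row(self):
        row = rows[self.idx % len(rows)]
        self.idx += 1
        self.client.post("/api/endpoint", json=row)  # hypothetical path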
-
as a quick response (i'll add more in a bit): i have the iterator randomly start somewhere for each user. i plan to test with 250/500/1250 users at a constant throughput of 4, targeting 1k/2k/5k qps. the data set is about 18m rows for it to burn through, although i suppose right now each user has its own 18m. i originally had the parquet read and the creation of the iterators and distributors in the master init, and that script runs fine all within one box with the --processes 6 command above. however i see 6 processes running, and then a 7th, which i assume is simply the master node, pegged at 100%. this was also the case when i separated the workers onto another machine; my conclusion was that the master node distributing the data to the workers was taking up too much cpu and i needed to move it into the workers. i don't have a way to create the values at random, at least for some of the elements, which is why i'm using an input data set in the first place.
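A rough sketch of "putting it into the workers" without giving every user its own 18m-row copy: each worker process loads only its own slice of the parquet file in an init listener. This assumes a recent Locust that exposes worker_index on the worker runner; the file name and worker count below are hypothetical.

import pandas as pd
from locust import events
from locust.runners import WorkerRunner

NUM_WORKERS = 6  # hypothetical: keep in sync with --processes
rows = []

@events.init.add_listener
def load_slice(environment, **kwargs):
    global rows
    runner = environment.runner
    # only worker processes load data; the master just coordinates
    if isinstance(runner, WorkerRunner):
        df = pd.read_parquet("input.parquet")  # hypothetical file name
        # take every NUM_WORKERS-th row, offset by this worker's index,
        # so each worker holds roughly 18m / NUM_WORKERS rows in memory
        rows = df.iloc[runner.worker_index::NUM_WORKERS].to_dict("records")

Users on that worker can then index into rows (as in the earlier sketch) instead of each building an iterator over all 18m rows.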
-
i have a script where i load data and hit an api. i was running it with a command like
locust -f fs_test.py --headless --processes 6 -u 40 -r 1 -t 3m
and it worked well up until i wanted to add more users and throughput; it appeared to be single-cpu bound. this was the same on another machine, and the same when i made one machine a master and another a worker. so my theory became that the distributor/iterator was the bottleneck, and that i needed each user to load the data themselves.
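For reference, the explicit master/worker variant of that command looks roughly like this (the master address and worker count are placeholders):

locust -f fs_test.py --master --headless -u 40 -r 1 -t 3m --expect-workers 6
locust -f fs_test.py --worker --processes 6 --master-host 10.0.0.1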
i have since moved that code into the users, but my new script bails as soon as it is expected to do anything. it appears to get to the self.client.post but never actually sends it; i see nothing for a user after that point and no requests accrue. the output says workers have been lost. strangely, the high-cpu (90%) warning also triggers even if i run just a couple of users on a big box. my code is below.
if i run the master and worker on the same machine, here is the output from my worker. eventually i kill the master.
as you can see, sending the post immediately results in the 90% cpu warning. this is the user's first transaction and i'm on a 16-cpu c7i box.
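One thing that can produce exactly this combination of symptoms: a long cpu-bound block of work inside each user's on_start (for example a big pandas/parquet read) never yields to gevent, so the worker's heartbeat greenlet may not get a chance to run, the master eventually reports the worker as missing, and the 90% cpu warning fires before the first post goes out. To get more detail from the worker side, the standard log-level flag can be raised, e.g.:

locust -f master.py --headless --processes 1 -u 2 -r 1 -t 3m --loglevel DEBUG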
what if i just run the script as is?
locust -f master.py --headless --processes 1 -u 2 -r 1 -t 3m
and get this output.
missed heartbeats, the 90% cpu warning, nothing sent, then just hanging around doing nothing. what am i missing? i feel like the fix is going to be one line or something, but i really don't have any idea what that one line is.
the script does indeed work if i forgo --processes and just use
locust -f master.py --headless -u 2 -r 1 -t 3
this doesn't seem like the way to go, though, if i'm trying to scale to 1000 users and 5000 requests/sec. i'm also unclear about the big 9200 in the stats, as if it is counting my user's on_start time or something?
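For completeness, once the per-user data loading is sorted out, scaling back up should just be the same invocation with more processes and users, e.g. something like the following (user count, spawn rate and duration are placeholders; --processes -1 launches one worker per logical core on recent Locust versions):

locust -f master.py --headless --processes -1 -u 1000 -r 50 -t 3m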