Experiment Manager Design Discussion #227

bharathappali · 2021-04-21T05:46:50Z

bharathappali
Apr 21, 2021
Collaborator

This discussion covers the changes we expect in the input, output, flow, etc. of the Experiment Manager.

Here are the expected input and output JSON.

Expected Input JSON:

{
   "trials": {
      "id": "101383821",
      "app-version": "v1",
      "deployment_name": "petclinic-sample",
      "trial_num": 1,
      "trial_run": "15mins",
      "trial_measurement_time": "3mins",
      "metrics": [
         {
            "name": "obj_fun_var1",
            "query": "obj_fun_var1_query",
            "datasource": "prometheus"
         },
         {
            "name": "obj_fun_var2",
            "query": "obj_fun_var2_query",
            "datasource": "prometheus"
         },
         {
            "name": "obj_fun_var3",
            "query": "obj_fun_var3_query",
            "datasource": "prometheus"
         },
         {
            "name": "cpuRequest",
            "query": "cpuRequest_query",
            "datasource": "prometheus"
         },
         {
            "name": "memRequest",
            "query": "memRequest_query",
            "datasource": "prometheus"
         }
      ],
      "update_config": [
         {
            "name": "update requests and limits",
            "spec": {
               "template": {
                  "spec": {
                     "container": {
                        "resources": {
                           "requests": {
                              "cpu": 2,
                              "memory": "512Mi"
                           },
                           "limits": {
                              "cpu": 3,
                              "memory": "1024Mi"
                           }
                        }
                     }
                  }
               }
            }
         },
         {
            "name": "update env",
            "spec": {
               "template": {
                  "spec": {
                     "container": {
                        "env": {
                           "JVM_OPTIONS": "-XX:MaxInlineLevel=23",
                           "JVM_ARGS": "-XX:MaxInlineLevel=23"
                        }
                     }
                  }
               }
            }
         }
      ]
   }
}

Expected Output JSON:

{
   "trials": {
      "id": "101383821",
      "app-version": "v1",
      "deployment_name": "petclinic_deployment",
      "trial_num": 1,
      "trial_run": "15mins",
      "trial_measurement_time": "3mins",
      "trial_result": "",
      "trial_result_info": "",
      "trial_result_error": "",
      "metrics": [
         {
            "name": "obj_fun_var1",
            "query": "obj_fun_var1_query",
            "datasource": "prometheus",
            "score": "",
            "Error": "",
            "mean": "",
            "mode": "",
            "95.0": "",
            "99.0": "",
            "99.9": "",
            "99.99": "",
            "99.999": "",
            "99.9999": "",
            "100.0": "",
            "spike": ""
         },
         {
            "name": "obj_fun_var2",
            "query": "obj_fun_var2_query",
            "datasource": "prometheus",
            "score": "",
            "Error": "",
            "mean": "",
            "mode": "",
            "95.0": "",
            "99.0": "",
            "99.9": "",
            "99.99": "",
            "99.999": "",
            "99.9999": "",
            "100.0": "",
            "spike": ""
         },
         {
            "name": "obj_fun_var3",
            "query": "obj_fun_var3_query",
            "datasource": "prometheus",
            "score": "",
            "Error": "",
            "mean": "",
            "mode": "",
            "95.0": "",
            "99.0": "",
            "99.9": "",
            "99.99": "",
            "99.999": "",
            "99.9999": "",
            "100.0": "",
            "spike": ""
         },
         {
            "name": "cpuRequest",
            "query": "cpuRequest_query",
            "datasource": "prometheus",
            "score": "",
            "Error": "",
            "mean": "",
            "mode": "",
            "95.0": "",
            "99.0": "",
            "99.9": "",
            "99.99": "",
            "99.999": "",
            "99.9999": "",
            "100.0": "",
            "spike": ""
         },
         {
            "name": "memRequest",
            "query": "memRequest_query",
            "datasource": "prometheus",
            "score": "",
            "Error": "",
            "mean": "",
            "mode": "",
            "95.0": "",
            "99.0": "",
            "99.9": "",
            "99.99": "",
            "99.999": "",
            "99.9999": "",
            "100.0": "",
            "spike": ""
         }
      ]
   }
}

dinogun · 2021-05-04T08:42:19Z

dinogun
May 4, 2021
Maintainer

As per our discussion and the RM design as seen here, here is the latest INPUT JSON

[
   {
      "experiment_id": "2190310A384BC90EF",
      "application_name": "galaxies-sample-cf764b6d-kmh27",
      "trials": [
         {
            "app-version": "v1",
            "trial_id": "",
            "trial_num": 1,
            "trial_run": "15mins",
            "trial_measurement_time": "3mins",
            "training": {
               "deployment_name": "petclinic-sample-training-1",
               "state": "",
               "result": "",
               "result_info": "",
               "result_error": "",
               "metrics": [
                  {
                     "name": "request_sum",
                     "query": "request_sum_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "request_count",
                     "query": "request_count_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "hotspot_function",
                     "query": "hotspot_function_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "cpuRequest",
                     "query": "cpuRequest_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "memRequest",
                     "query": "memRequest_query",
                     "datasource": "prometheus"
                  }
               ],
               "config": [
                  {
                     "name": "update requests and limits",
                     "spec": {
                        "template": {
                           "spec": {
                              "container": {
                                 "resources": {
                                    "requests": {
                                       "cpu": 2,
                                       "memory": "512Mi"
                                    },
                                    "limits": {
                                       "cpu": 3,
                                       "memory": "1024Mi"
                                    }
                                 }
                              }
                           }
                        }
                     }
                  },
                  {
                     "name": "update env",
                     "spec": {
                        "template": {
                           "spec": {
                              "container": {
                                 "env": {
                                    "JVM_OPTIONS": "-XX:MaxInlineLevel=23",
                                    "JVM_ARGS": "-XX:MaxInlineLevel=23"
                                 }
                              }
                           }
                        }
                     }
                  }
               ]
            },
            "production": {
               "deployment_name": "petclinic-sample-1",
               "state": "",
               "result": "",
               "result_info": "",
               "result_error": "",
               "metrics": [
                  {
                     "name": "request_sum",
                     "query": "request_sum_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "request_count",
                     "query": "request_count_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "hotspot_function",
                     "query": "hotspot_function_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "cpuRequest",
                     "query": "cpuRequest_query",
                     "datasource": "prometheus"
                  },
                  {
                     "name": "memRequest",
                     "query": "memRequest_query",
                     "datasource": "prometheus"
                  }
               ],
               "config": [
                  {
                     "name": "update requests and limits",
                     "spec": {
                        "template": {
                           "spec": {
                              "container": {
                                 "resources": {
                                    "requests": {
                                       "cpu": 2,
                                       "memory": "512Mi"
                                    },
                                    "limits": {
                                       "cpu": 3,
                                       "memory": "1024Mi"
                                    }
                                 }
                              }
                           }
                        }
                     }
                  },
                  {
                     "name": "update env",
                     "spec": {
                        "template": {
                           "spec": {
                              "container": {
                                 "env": {
                                    "JVM_OPTIONS": "-XX:MaxInlineLevel=23",
                                    "JVM_ARGS": "-XX:MaxInlineLevel=23"
                                 }
                              }
                           }
                        }
                     }
                  }
               ]
            }
         }
      ]
   },
   {
      "experiment_id": "21901213354545310A384BC90EF",
      "application_name": "petclinic-c7f764b6d-kmh27",
      "trials" : [
         {
         }
      ]
   }
]

1 reply

bharathappali May 4, 2021
Collaborator Author

Please correct me if I'm wrong: the deployment_name in the training and production sections will be the same, right?

dinogun · 2021-05-20T09:21:29Z

dinogun
May 20, 2021
Maintainer

Deployment use cases that the Experiment Manager (EM) needs to cater to

Deployment options for a single microservice that is part of an app with many microservices.

Update existing deployment
1. Running in a Dev/QA Cluster, use EM to do a K8s rolling update
2. Running in Prod (Using OpenShift routes to config traffic)
New Deployment
1. Use EM to create a new deployment
2. Use Operator to handle deployment
3. Use custom scripts to handle deployment using autotune client (Scripts / Helm charts etc)
4. Use Istio to load balance

The mode in the autotune deployment YAML can specify the deployment strategy.

bharathappali · 2021-05-25T08:27:40Z

bharathappali
May 25, 2021
Collaborator Author

Running in a Dev/QA Cluster, use EM to do a K8s rolling update

Experiment manager needs to update the existing deployment
Check if the deployment rolling update is successful
Rollback to production config after the experiment

Reference:
https://www.weave.works/blog/kubernetes-deployment-strategies
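
The three steps above (update, check the rollout, roll back) could be sketched as follows. This is only an illustration: DeploymentClient is a hypothetical abstraction over whatever Kubernetes client EM ends up using, and none of its method names come from a real library.

```java
// Sketch of the rolling-update flow, assuming a hypothetical DeploymentClient abstraction.
final class RollingUpdateTrial {

    // Hypothetical interface; a real implementation would wrap a Kubernetes client.
    interface DeploymentClient {
        String currentConfig(String deployment);
        void applyConfig(String deployment, String config);
        boolean rolloutSucceeded(String deployment);
    }

    // Applies the trial config, runs the experiment only if the rollout succeeded,
    // and always restores the production config afterwards.
    static boolean runTrial(DeploymentClient client, String deployment,
                            String trialConfig, Runnable experiment) {
        String productionConfig = client.currentConfig(deployment);
        client.applyConfig(deployment, trialConfig);
        try {
            if (!client.rolloutSucceeded(deployment)) {
                return false;
            }
            experiment.run();
            return true;
        } finally {
            // Rollback to the production config after the experiment (or after a failed rollout).
            client.applyConfig(deployment, productionConfig);
        }
    }
}
```

Putting the rollback in a finally block means the production config is restored even if the experiment itself throws.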

bharathappali · 2021-05-31T08:19:07Z

bharathappali
May 31, 2021
Collaborator Author

I would like to get views on this addition to the input JSON that is sent to EM (it will be added between the trial info and the training info)

{
  ... <trial info>
  "experiment_settings": {
    "deployment_policy" : {
      "type" : "rollingUpdate || newDeployment",
      "target_env" : "dev || qa || prod",
      "agent" : "EM || AutotuneOperator || CLI" 
    }
  }
 ... <training info>
}

type can be either rollingUpdate or newDeployment
target_env can be dev or qa or prod
agent is the tool which will be launching the experiment (we give options as EM, AutotuneOperator, CLI)
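
On the EM side, the three fields could be parsed into enums so that unsupported values are rejected early. The sketch below is an assumption about how that might look: the DeploymentPolicy class, its nested enums and the fromMap method are hypothetical names, and a plain Map stands in for the parsed JSON object.

```java
import java.util.Map;

// Hypothetical holder for the proposed "deployment_policy" section.
final class DeploymentPolicy {

    interface JsonNamed { String jsonName(); }

    enum Type implements JsonNamed {
        ROLLING_UPDATE("rollingUpdate"), NEW_DEPLOYMENT("newDeployment");
        private final String jsonName;
        Type(String jsonName) { this.jsonName = jsonName; }
        public String jsonName() { return jsonName; }
    }

    enum TargetEnv implements JsonNamed {
        DEV("dev"), QA("qa"), PROD("prod");
        private final String jsonName;
        TargetEnv(String jsonName) { this.jsonName = jsonName; }
        public String jsonName() { return jsonName; }
    }

    enum Agent implements JsonNamed {
        EM("EM"), AUTOTUNE_OPERATOR("AutotuneOperator"), CLI("CLI");
        private final String jsonName;
        Agent(String jsonName) { this.jsonName = jsonName; }
        public String jsonName() { return jsonName; }
    }

    final Type type;
    final TargetEnv targetEnv;
    final Agent agent;

    private DeploymentPolicy(Type type, TargetEnv targetEnv, Agent agent) {
        this.type = type;
        this.targetEnv = targetEnv;
        this.agent = agent;
    }

    // Builds the policy from the key/value pairs of the "deployment_policy" object.
    static DeploymentPolicy fromMap(Map<String, String> policy) {
        return new DeploymentPolicy(
                parse(Type.class, "type", policy),
                parse(TargetEnv.class, "target_env", policy),
                parse(Agent.class, "agent", policy));
    }

    // Matches the JSON value against the enum's JSON names; missing or unknown values are rejected.
    private static <E extends Enum<E> & JsonNamed> E parse(Class<E> kind, String key,
                                                           Map<String, String> policy) {
        String value = policy.get(key);
        for (E candidate : kind.getEnumConstants()) {
            if (candidate.jsonName().equals(value)) {
                return candidate;
            }
        }
        throw new IllegalArgumentException(
                "deployment_policy." + key + " has unsupported value: " + value);
    }
}
```

For example, a map with type "rollingUpdate", target_env "qa" and agent "EM" parses successfully, while a type such as "blueGreen" throws an IllegalArgumentException.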

1 reply

bharathappali May 31, 2021
Collaborator Author

Any other settings required later will be added as key-value pairs in the experiment_settings section of the input JSON.

bharathappali · 2021-06-02T03:14:35Z

bharathappali
Jun 2, 2021
Collaborator Author

I was thinking about a config store and validation object that stores the entities parsed from the JSON and supplies them on request to the ETC (Experiment Trial Controller). Currently the ExperimentTrialObject is passed across the stages of the FSM, and options such as trial run and measurement time are separate attributes in it. I would now like to keep the config data in a separate object (let's call it ExperimentTrailConfigSettings; we can change the name in the implementation if we are not comfortable with it) and attach it to the ExperimentTrialObject, just as we attach the FSM object to it.

ExperimentTrailConfigSettings object will have :

Private variables for settings
Getters of the settings (No setters)
Constructor which accepts the incoming JSON Object and maps the variables to appropriate values from the JSON object
An isValid variable which is set to false initially
A validate function which validates the settings and sets isValid to true
The getter functions check isValid and either return the value or throw an exception reporting invalid settings

Validations: (some validations I can think of, but not limited to these)

trial run < measurement time
deployment policy set to rollingUpdate and deployments_to_track set to both production and training
etc

I feel this can solve the storage and validation problem of the EM config. I accept that the storage part and the validation part could be separate, but keeping them accessible in one object avoids creating a separate validation flow for the config, which would involve new objects and new function calls. It also lets us keep the validation config-specific rather than generic, so it will be easier from a maintenance perspective. Please correct me if I'm wrong.
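
A minimal sketch of the object described above, under a few stated assumptions: a flat Map stands in for the incoming JSON object, the key names (trial_run, trial_measurement_time, deployment_policy_type, trackers) are simplified for illustration, durations look like "15mins", and both listed validations are read as combinations to reject. validate() also returns the list of problems, which goes slightly beyond the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: immutable settings that can be read only after a successful validate().
final class ExperimentTrailConfigSettings {
    private final int trialRunMinutes;
    private final int measurementMinutes;
    private final String deploymentPolicyType;
    private final List<String> trackers;
    private boolean isValid = false; // false until validate() succeeds

    @SuppressWarnings("unchecked")
    ExperimentTrailConfigSettings(Map<String, Object> json) {
        this.trialRunMinutes = parseMinutes(json.get("trial_run"));
        this.measurementMinutes = parseMinutes(json.get("trial_measurement_time"));
        Object type = json.get("deployment_policy_type");
        this.deploymentPolicyType = (type instanceof String) ? (String) type : "";
        Object tracked = json.get("trackers");
        this.trackers = (tracked instanceof List) ? List.copyOf((List<String>) tracked) : List.of();
    }

    // Assumes values such as "15mins"; returns -1 when no number can be read.
    private static int parseMinutes(Object value) {
        if (!(value instanceof String)) {
            return -1;
        }
        String digits = ((String) value).replaceAll("[^0-9]", "");
        return digits.isEmpty() ? -1 : Integer.parseInt(digits);
    }

    // Runs the config-specific checks, sets isValid, and returns the problems found.
    List<String> validate() {
        List<String> errors = new ArrayList<>();
        if (trialRunMinutes < 0 || measurementMinutes < 0) {
            errors.add("trial_run and trial_measurement_time must be durations such as 15mins");
        } else if (trialRunMinutes < measurementMinutes) {
            errors.add("trial_run is shorter than trial_measurement_time");
        }
        if ("rollingUpdate".equals(deploymentPolicyType)
                && trackers.contains("training") && trackers.contains("production")) {
            errors.add("rollingUpdate cannot track both training and production");
        }
        isValid = errors.isEmpty();
        return errors;
    }

    private void ensureValid() {
        if (!isValid) {
            throw new IllegalStateException("invalid settings: validate() has not succeeded");
        }
    }

    // Getters only, no setters.
    int getTrialRunMinutes() { ensureValid(); return trialRunMinutes; }
    int getMeasurementMinutes() { ensureValid(); return measurementMinutes; }
    String getDeploymentPolicyType() { ensureValid(); return deploymentPolicyType; }
    List<String> getTrackers() { ensureValid(); return trackers; }
}
```

Calling a getter before validate() has succeeded throws an IllegalStateException, so an unchecked config cannot leak into later FSM stages.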

bharathappali · 2021-06-02T03:47:42Z

bharathappali
Jun 2, 2021
Collaborator Author

For ease of adding further config JSON changes, I would like to propose a sectioned design for the config JSON, with sections like info, settings, and deployments.

Note: the deployments listed in trackers will be the keys of the JSON objects in the deployments array. If that is inappropriate, we can keep it generic by adding a name key that holds the value (training || production) and matching it against the list in trackers.

Can I please get a review of this input JSON structure? @dinogun @chandrams @kusumachalasani @shruacha1234

Example JSON:

{
  "experiment_id": "2190310A384BC90EF",
  "application_name": "petclinic-sample",
  "trials": [
    {
      "app-version": "v1",
      "info": {
        "trial_id": "",
        "trial_num": 1
      },
      "settings": {
        "trial_settings": {
          "trial_run": "15mins",
          "trial_measurement_time": "3mins"
        },
        "deployment_settings": {
          "deployment_policy" : {
            "type" : "rollingUpdate || newDeployment",
            "target_env" : "dev || qa || prod",
            "agent" : "EM || AutotuneOperator || CLI"
          },
          "deployment_tracking": {
            "trackers": [
              "training",
              "production"
            ]
          }
        }
      },
      "deployments": [
        {
          "training": {
            "deployment_name": "petclinic-sample-training-1",
            "state": "",
            "result": "",
            "result_info": "",
            "result_error": "",
            "metrics": [
              {
                "name": "request_sum",
                "query": "request_sum_query",
                "datasource": "prometheus"
              },
              {
                "name": "request_count",
                "query": "request_count_query",
                "datasource": "prometheus"
              },
              {
                "name": "hotspot_function",
                "query": "hotspot_function_query",
                "datasource": "prometheus"
              },
              {
                "name": "cpuRequest",
                "query": "cpuRequest_query",
                "datasource": "prometheus"
              },
              {
                "name": "memRequest",
                "query": "memRequest_query",
                "datasource": "prometheus"
              }
            ],
            "config": [
              {
                "name": "update requests and limits",
                "spec": {
                  "template": {
                    "spec": {
                      "container": {
                        "resources": {
                          "requests": {
                            "cpu": 2,
                            "memory": "512Mi"
                          },
                          "limits": {
                            "cpu": 3,
                            "memory": "1024Mi"
                          }
                        }
                      }
                    }
                  }
                }
              },
              {
                "name": "update env",
                "spec": {
                  "template": {
                    "spec": {
                      "container": {
                        "env": {
                          "JVM_OPTIONS": "-XX:MaxInlineLevel=23",
                          "JVM_ARGS": "-XX:MaxInlineLevel=23"
                        }
                      }
                    }
                  }
                }
              }
            ]
          }
        },
        {
          "production": {
            "deployment_name": "petclinic-sample-1",
            "state": "",
            "result": "",
            "result_info": "",
            "result_error": "",
            "metrics": [
              {
                "name": "request_sum",
                "query": "request_sum_query",
                "datasource": "prometheus"
              },
              {
                "name": "request_count",
                "query": "request_count_query",
                "datasource": "prometheus"
              },
              {
                "name": "hotspot_function",
                "query": "hotspot_function_query",
                "datasource": "prometheus"
              },
              {
                "name": "cpuRequest",
                "query": "cpuRequest_query",
                "datasource": "prometheus"
              },
              {
                "name": "memRequest",
                "query": "memRequest_query",
                "datasource": "prometheus"
              }
            ],
            "config": [
              {
                "name": "update requests and limits",
                "spec": {
                  "template": {
                    "spec": {
                      "container": {
                        "resources": {
                          "requests": {
                            "cpu": 2,
                            "memory": "512Mi"
                          },
                          "limits": {
                            "cpu": 3,
                            "memory": "1024Mi"
                          }
                        }
                      }
                    }
                  }
                }
              },
              {
                "name": "update env",
                "spec": {
                  "template": {
                    "spec": {
                      "container": {
                        "env": {
                          "JVM_OPTIONS": "-XX:MaxInlineLevel=23",
                          "JVM_ARGS": "-XX:MaxInlineLevel=23"
                        }
                      }
                    }
                  }
                }
              }
            ]
          }
        }
      ]
    }
  ]
}
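
With this layout, the note about trackers could be enforced by checking that every tracker has a matching key in the deployments array. A small sketch, assuming the JSON has already been parsed into Java collections (the TrackerCheck class and its method name are hypothetical):

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: finds trackers that have no matching entry in the "deployments" array.
final class TrackerCheck {

    // Each element of deployments is an object keyed by the tracker name,
    // e.g. {"training": {...}} or {"production": {...}}.
    static Set<String> missingDeployments(List<String> trackers,
                                          List<? extends Map<String, ?>> deployments) {
        Set<String> present = new HashSet<>();
        for (Map<String, ?> deployment : deployments) {
            present.addAll(deployment.keySet());
        }
        Set<String> missing = new LinkedHashSet<>(trackers);
        missing.removeAll(present);
        return missing;
    }
}
```

An empty result means every tracker in deployment_tracking has a corresponding deployment section.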

bharathappali · 2021-06-06T20:02:58Z

bharathappali
Jun 6, 2021
Collaborator Author

NOTE: This comment is added to make things clear about the data transfer between DA and EM. I'm just adding my views below, please correct me if I'm wrong

Requirement of AutotuneDTO dependency in EM:

Class:

/*******************************************************************************
 * Copyright (c) 2020, 2021 Red Hat, IBM Corporation, and others.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *******************************************************************************/
package com.autotune.queue;

import java.io.Serializable;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.autotune.analyzer.datasource.DataSourceFactory;
import com.autotune.utils.AutotuneUtils;

/**
 * AutotuneDTO is a data traversing object, it used to transfer data between different autotune components.
 * @author bipkumar
 *
 */

public class AutotuneDTO implements Serializable {
	private static final long serialVersionUID = 8789442223058487193L;
	private int id;
	private String name;
	private String url;
	private String operation;
	private StringBuffer infoMessage;
	private StringBuffer errorMessage;
	private static final Logger LOGGER = LoggerFactory.getLogger(DataSourceFactory.class);
	
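	public AutotuneDTO() {
		// Initialise the message buffers so that setUrl() and toString() do not hit a null reference
		this.infoMessage = new StringBuffer();
		this.errorMessage = new StringBuffer();
	}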
	
	/**
	 * 
	 * @param experimentId : Id generated by Dependency analyzer
	 * @param componentName: Name of the component processing this DTO
	 * @param dataURL: fetch the data(JSON)from recommendation manager using this url
	 * @param operation : set operation name using Operation enum from AutotuneUtil.java
	 *  
	 */
	public AutotuneDTO(int experimentId, String componentName, String dataURL, String operation) {
		this.id = experimentId;
		this.name = componentName;
		this.url = dataURL;
		this.infoMessage = new StringBuffer();
		this.errorMessage = new StringBuffer();
		this.operation = operation;
	}
	
	public int getId() {
		return id;
	}
	
	public void setId(int id) {
		this.id = id;
	}
	
	public String getName() {
		return name;
	}
	
	public void setName(String name) {
		this.name = name;
	}

	public String getUrl() {
		return url;
	}

	public void setUrl(String url) {
		if (! AutotuneUtils.isValidURL(url)) {
			errorMessage.append("\n URL is not valid or empty");
			LOGGER.error("URL is not valid");
		}
		this.url = url;
	}
	
	public StringBuffer getInfoMessage() {
		return infoMessage;
	}

	public void setInfoMessage(StringBuffer infoMessage) {
		this.infoMessage = infoMessage;
	}

	public StringBuffer getErrorMessage() {
		return errorMessage;
	}

	public void setErrorMessage(StringBuffer errorMessage) {
		this.errorMessage = errorMessage;
	}
	

	public String getOperation() {
		return operation;
	}

	public void setOperation(String operation) {
		this.operation = operation;
	}
	

	@Override
	public String toString() {
		return "id=" + id + ", name=" + name +  ", url=" + url + " , operation=" + operation + ", infoMessage="+infoMessage.toString() + ", errorMessage=" +errorMessage;
	}
}

I feel we need to add/return the JSONObject via the queue or an endpoint, as it makes the mechanism simpler. Currently it is a two-step process:
Step - 1 : Read autotune object from the queue or by calling DA endpoint
Step - 2 : Read URL from the returned object and call the URL to get the config JSON

I would like to suggest a change to this mechanism that makes it one step:
Set the caller as DA and the receiver as EM and pass the JSON object directly. Add the URL for posting results to the config JSON itself.

Now in EM we write a validator to check that the received JSON object is a config object (validating the required keys and structure). We then save the config as a config object and link it to the ExperimentTrailData object (which travels through all the stages of the FSM), so at any given point in time we have all the required config data in EM to make decisions.
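
As a rough idea of what such a validator might look like, the sketch below walks a parsed JSON object and reports the required keys that are absent. The required paths are picked from the example JSON proposed earlier in this thread and are only an assumption about what EM would insist on; the ConfigJsonValidator class and its method names are hypothetical, and nested Maps and Lists stand in for the parsed JSON.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a one-time structural check on the incoming config JSON.
final class ConfigJsonValidator {
    // Assumed required keys; dots separate nested keys.
    private static final List<String> REQUIRED_TOP_LEVEL =
            List.of("experiment_id", "application_name");
    private static final List<String> REQUIRED_PER_TRIAL = List.of(
            "info.trial_num",
            "settings.deployment_settings.deployment_policy.type",
            "deployments");

    // Returns the paths of all required keys that are missing; empty means the structure is acceptable.
    static List<String> missingKeys(Map<String, Object> config) {
        List<String> missing = new ArrayList<>();
        for (String path : REQUIRED_TOP_LEVEL) {
            if (lookup(config, path) == null) {
                missing.add(path);
            }
        }
        Object trials = config.get("trials");
        if (!(trials instanceof List)) {
            missing.add("trials");
            return missing;
        }
        int index = 0;
        for (Object trial : (List<?>) trials) {
            for (String path : REQUIRED_PER_TRIAL) {
                if (lookup(trial, path) == null) {
                    missing.add("trials[" + index + "]." + path);
                }
            }
            index++;
        }
        return missing;
    }

    // Follows a dotted path through nested maps; returns null if any step is absent.
    private static Object lookup(Object node, String path) {
        for (String key : path.split("\\.")) {
            if (!(node instanceof Map)) {
                return null;
            }
            node = ((Map<?, ?>) node).get(key);
        }
        return node;
    }
}
```

Because the required paths live in two lists, a change in the JSON format only needs a change to those lists rather than to the traversal code.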

Pros:

We have a one-time check (validator) for the structure and format of the JSON object
We can decouple EM from DA for data input; any service can just send a JSON object to make EM work
We remove multi-stage config gathering
ExperimentTrailData no longer needs to hold the config fields; we link the formatted config JSON object to it instead
Gives an everything-in-one-place feel for access
The config object is made final upon creation (no set functions, only get is allowed), so the config cannot be tampered with at any point during the FSM transfers

Cons:

Validation must change whenever the JSON format changes
The generated config JSON will change whenever the JSON format changes
The object is heavier than before, as we carry all the data we received (we can trim it down, but mostly we carry all the JSON data in the ExperimentTrailData object)

Please add any pros or cons that you feel the new mechanism could bring. I feel it is a tough way to go forward, as a minimal change in the JSON structure could cause a drastic change in the config object's design and functionality. Still, I feel the change is really useful because it is an everything-in-one-place design: changes need to be made in only one place, which makes it easy to maintain going forward.

PS: I'm expecting a lot of changes in the input JSON going forward, so limiting them to a change in one class and reflecting the changes in its callers can be a good first step towards keeping the EM design and code mapping error-free.

@dinogun @chandrams @kusumachalasani @shruacha1234 Can I please have your views on this?

bharathappali · 2021-06-18T18:15:41Z

bharathappali
Jun 18, 2021
Collaborator Author

Adding Unscheduled stage merge feature to EM:

In the current EM implementation, each stage is separate and executed by a different thread. Some stages can be very minimal (e.g. deploying the config) and can be in-lined with the previous stage. We keep the stages separate but provide an in-lining feature for unscheduled stages.

A stage that follows on continuously, without any scheduling delay, can be in-lined, and we could decide whether to in-line stages based on a setting. This can serve as performance tuning for the EM itself.

PROS:

Can be plugged into the existing EM design, as the EM stages enum provides the info on whether a stage is scheduled or not, and EMTransitionRegistry provides functionality to get the possible next continuous stage (method - getPossibleContinuousStage)
Removes intermediate steps for successive stages, such as creating a stage transition object, pushing it to the queue and then notifying the Queue Processor

CONS:

Complex implementation, as the control to launch the next stages is partially taken away from the Queue Processor and given to the individual stage components that are being in-lined
Very minimal performance improvement, and not an async way of doing things, as the same thread processes multiple stages.
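
To make the in-lining idea concrete, here is a small sketch of the control flow. The stage names are invented for illustration, and the getPossibleContinuousStage method below is only a stand-in for the one on EMTransitionRegistry, whose real signature is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of in-lining unscheduled stages on the calling thread.
final class StageInlining {

    // Hypothetical stages; only the "scheduled" flag mirrors what the EM stages enum is said to expose.
    enum Stage {
        CREATE_CONFIG(false), DEPLOY_CONFIG(false), COLLECT_METRICS(true), SEND_RESULT(false);

        final boolean scheduled;
        Stage(boolean scheduled) { this.scheduled = scheduled; }
    }

    // Stand-in lookup: the next stage if it can run without a scheduling delay, otherwise null.
    static Stage getPossibleContinuousStage(Stage current) {
        int next = current.ordinal() + 1;
        if (next >= Stage.values().length) {
            return null;
        }
        Stage candidate = Stage.values()[next];
        return candidate.scheduled ? null : candidate;
    }

    // Runs the given stage and, when in-lining is enabled, every directly following
    // unscheduled stage on the same thread. Returns the stages that were executed.
    static List<Stage> runInlined(Stage start, boolean inlineEnabled) {
        List<Stage> executed = new ArrayList<>();
        Stage current = start;
        while (current != null) {
            executed.add(current); // the stage's real work would happen here
            current = inlineEnabled ? getPossibleContinuousStage(current) : null;
        }
        return executed;
    }
}
```

With in-lining enabled, starting at CREATE_CONFIG also runs DEPLOY_CONFIG and then stops before the scheduled COLLECT_METRICS stage, which would still go through the queue; with it disabled, only CREATE_CONFIG runs.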

@dinogun @chandrams @kusumachalasani @shruacha1234 Can I have your views on this?

bharathappali · 2021-08-26T06:44:19Z

bharathappali
Aug 26, 2021
Collaborator Author

Updated JSON:

{
  "experiment_id": "2190310A384BC90EF",
  "name": "petclinic-autotune",
  "info": {
    "trial_id": "",
    "trial_num": 1
  },
  "settings": {
    "trial_settings": {
      "total_duration": "20 mins",
      "warmup_cycles": 5,
      "warmup_duration": "1 min",
      "measurement_cycles": 15,
      "measurement_duration": "1 min"
    },
    "deployment_settings": {
      "deployment_info": {
        "deployment_name" : "petclinic-sample",
        "target_env" : "qa"
      },
      "deployment_policy" : {
        "type" : "rollingUpdate"
      },
      "deployment_tracking": {
        "trackers": [
          "training",
          "production"
        ]
      }
    }
  },
  "deployments": [
    {
      "type" : "training",
      "deployment_name": "petclinic-sample",
      "namespace" : "default",
      "state": "",
      "result": "",
      "result_info": "",
      "result_error": "",
      "metrics": [
        {
          "name": "request_sum",
          "query": "request_sum_query",
          "datasource": "prometheus"
        },
        {
          "name": "request_count",
          "query": "request_count_query",
          "datasource": "prometheus"
        },
        {
          "name": "hotspot_function",
          "query": "hotspot_function_query",
          "datasource": "prometheus"
        },
        {
          "name": "cpuRequest",
          "query": "cpuRequest_query",
          "datasource": "prometheus"
        },
        {
          "name": "memRequest",
          "query": "memRequest_query",
          "datasource": "prometheus"
        }
      ],
      "config": [
        {
          "name": "update requests and limits",
          "spec": {
            "template": {
              "spec": {
                "container": {
                  "resources": {
                    "requests": {
                      "cpu": 2,
                      "memory": "512Mi"
                    },
                    "limits": {
                      "cpu": 3,
                      "memory": "1024Mi"
                    }
                  }
                }
              }
            }
          }
        },
        {
          "name": "update env",
          "spec": {
            "template": {
              "spec": {
                "container": {
                  "env": {
                    "JDK_JAVA_OPTIONS": "-XX:MaxInlineLevel=23"
                  }
                }
              }
            }
          }
        }
      ]
    },
    {
      "type" : "production",
      "deployment_name": "petclinic-sample-1",
      "state": "",
      "result": "",
      "result_info": "",
      "result_error": "",
      "metrics": [
        {
          "name": "request_sum",
          "query": "request_sum_query",
          "datasource": "prometheus"
        },
        {
          "name": "request_count",
          "query": "request_count_query",
          "datasource": "prometheus"
        },
        {
          "name": "hotspot_function",
          "query": "hotspot_function_query",
          "datasource": "prometheus"
        },
        {
          "name": "cpuRequest",
          "query": "cpuRequest_query",
          "datasource": "prometheus"
        },
        {
          "name": "memRequest",
          "query": "memRequest_query",
          "datasource": "prometheus"
        }
      ],
      "config": [
        {
          "name": "update requests and limits",
          "spec": {
            "template": {
              "spec": {
                "container": {
                  "resources": {
                    "requests": {
                      "cpu": 2,
                      "memory": "512Mi"
                    },
                    "limits": {
                      "cpu": 3,
                      "memory": "1024Mi"
                    }
                  }
                }
              }
            }
          }
        },
        {
          "name": "update env",
          "spec": {
            "template": {
              "spec": {
                "container": {
                  "env": {
                    "JDK_JAVA_OPTIONS": "-XX:MaxInlineLevel=23"
                  }
                }
              }
            }
          }
        }
      ]
    }
  ]
}

0 replies

bharathappali · 2021-10-06T04:52:03Z

bharathappali
Oct 6, 2021
Collaborator Author

Expected Output JSON - for listTrialStatus API:

{
  "experiment_id": "2190310A384BC90EF",
  "name": "petclinic-autotune",
  "status": "COMPLETED",
  "info": {
    "trial_id": "",
    "trial_num": 1,
    "trial_result_url": "http://analyser.com/receive/results?expId=2190310A384BC90EF"
  },
  "settings": {
    "trial_settings": {
      "total_duration": "7 mins",
      "warmup_cycles": 2,
      "warmup_duration": "1 min",
      "measurement_cycles": 5,
      "measurement_duration": "1 min"
    },
    "deployment_settings": {
      "deployment_info": {
        "deployment_name": "petclinic-sample",
        "target_env": "qa"
      },
      "deployment_policy": {
        "type": "rollingUpdate"
      },
      "deployment_tracking": {
        "trackers": [
          "training"
        ]
      }
    }
  },
  "deployments": [
    {
      "type": "training",
      "deployment_name": "petclinic-sample",
      "namespace": "default",
      "state": "",
      "result": "",
      "result_info": "",
      "result_error": "",
      "metrics": [
        {
          "name": "request_sum",
          "query": "request_sum_query",
          "datasource": "prometheus",
          "value_type": "double",
          "metric_results": {
            "warmup_results": {
              "cycles": 2,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            },
            "measurement_results": {
              "cycles": 5,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            }
          }
        },
        {
          "name": "request_count",
          "query": "request_count_query",
          "datasource": "prometheus",
          "value_type": "double",
          "metric_results": {
            "warmup_results": {
              "cycles": 2,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            },
            "measurement_results": {
              "cycles": 5,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            }
          }
        },
        {
          "name": "hotspot_function",
          "query": "hotspot_function_query",
          "datasource": "prometheus",
          "value_type": "double",
          "metric_results": {
            "warmup_results": {
              "cycles": 2,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            },
            "measurement_results": {
              "cycles": 5,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            }
          }
        },
        {
          "name": "cpuRequest",
          "query": "cpuRequest_query",
          "datasource": "prometheus",
          "value_type": "double",
          "metric_results": {
            "warmup_results": {
              "cycles": 2,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            },
            "measurement_results": {
              "cycles": 5,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            }
          }
        },
        {
          "name": "memRequest",
          "query": "memRequest_query",
          "datasource": "prometheus",
          "value_type": "double",
          "metric_results": {
            "warmup_results": {
              "cycles": 2,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            },
            "measurement_results": {
              "cycles": 5,
              "duration": "1 min",
              "results": [
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                },
                {
                  "score": "",
                  "Error": "",
                  "mean": "",
                  "mode": "",
                  "95.0": "",
                  "99.0": "",
                  "99.9": "",
                  "99.99": "",
                  "99.999": "",
                  "99.9999": "",
                  "100.0": "",
                  "spike": ""
                }
              ]
            }
          }
        }
      ],
      "config": [
        {
          "name": "update requests and limits",
          "spec": {
            "template": {
              "spec": {
                "container": {
                  "resources": {
                    "requests": {
                      "cpu": 2,
                      "memory": "512Mi"
                    },
                    "limits": {
                      "cpu": 3,
                      "memory": "1024Mi"
                    }
                  }
                }
              }
            }
          }
        },
        {
          "name": "update env",
          "spec": {
            "template": {
              "spec": {
                "container": {
                  "env": {
                    "JDK_JAVA_OPTIONS": "-XX:MaxInlineLevel=23"
                  }
                }
              }
            }
          }
        }
      ]
    }
  ]
}
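
In the sample above, every `results` array has as many entries as its `cycles` field (2 for warmup, 5 for measurement). A consumer of listTrialStatus could check that before reading any statistics. The following is only a sketch: the function name `check_metric_results` is hypothetical, and the "one result per cycle" rule is inferred from the sample rather than stated anywhere in this discussion.

```python
def check_metric_results(status: dict) -> list:
    """Return a list of mismatch descriptions; an empty list means consistent.

    Assumes the listTrialStatus layout shown above:
    deployments[].metrics[].metric_results.{warmup,measurement}_results
    """
    problems = []
    for deployment in status.get("deployments", []):
        for metric in deployment.get("metrics", []):
            metric_results = metric.get("metric_results", {})
            for phase in ("warmup_results", "measurement_results"):
                block = metric_results.get(phase)
                if block is None:
                    continue  # phase not reported for this metric
                expected = block["cycles"]
                found = len(block["results"])
                if found != expected:
                    problems.append(
                        f'{deployment["deployment_name"]}/{metric["name"]}/{phase}: '
                        f"expected {expected} results, found {found}"
                    )
    return problems


# A trimmed-down payload with the same shape as the sample above.
sample = {
    "deployments": [
        {
            "deployment_name": "petclinic-sample",
            "metrics": [
                {
                    "name": "request_sum",
                    "metric_results": {
                        "warmup_results": {"cycles": 2, "results": [{}, {}]},
                        "measurement_results": {"cycles": 5, "results": [{}] * 4},
                    },
                }
            ],
        }
    ]
}

for problem in check_metric_results(sample):
    print(problem)
```

With the trimmed payload, only the measurement phase is flagged, because it carries four results for five cycles.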

0 replies

bharathappali · 2022-02-23T08:20:58Z

bharathappali
Feb 23, 2022
Collaborator Author

Need for datasource availability monitor:

The current EM design expects the datasource to be available at all times. What if the datasource is not up and running? EM shouldn't have to wait for metric collection to fail on an unreachable endpoint or a similar error. We need a datasource monitor service that takes a trial ID as input and constantly monitors the datasources given in the trial input JSON. It holds a map of datasources to the trial IDs using them, so once it finds that a datasource is unavailable, it gets the ETD (Experiment Trial Data) object and updates a field (say, "is datasource available", kept for each datasource in the trial input JSON). When the CollectMetrics transition is invoked, it checks this status (we make the access synchronous, so the status is read only after an update completes) and proceeds to metric collection if the datasource is available. If not, it waits for a certain timeout, and if the datasource is still down after that, we report to the user that the datasource is compromised.
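
The flow described above could be sketched roughly as follows. This is not the EM implementation: `DatasourceMonitor`, `wait_for_datasource`, and the `probe` callback are hypothetical names, the ETD field is reduced to a per-datasource boolean, and polling is driven explicitly by the caller (a real service would presumably poll from a background thread) so that the sketch stays deterministic.

```python
import threading
import time
from typing import Callable, Dict, Iterable, Set


class DatasourceMonitor:
    """Maps datasources to the trials using them and records availability."""

    def __init__(self, probe: Callable[[str], bool]):
        self._probe = probe                      # hypothetical health check
        self._lock = threading.Lock()            # reads happen only after updates finish
        self._trials: Dict[str, Set[str]] = {}   # datasource -> trial ids
        self._available: Dict[str, bool] = {}    # datasource -> last probed status

    def register(self, trial_id: str, datasources: Iterable[str]) -> None:
        with self._lock:
            for datasource in datasources:
                self._trials.setdefault(datasource, set()).add(trial_id)

    def trials_using(self, datasource: str) -> Set[str]:
        with self._lock:
            return set(self._trials.get(datasource, ()))

    def poll_once(self) -> None:
        with self._lock:
            names = list(self._trials)
        for datasource in names:
            status = self._probe(datasource)     # probe outside the lock
            with self._lock:
                self._available[datasource] = status

    def is_available(self, datasource: str) -> bool:
        with self._lock:
            # A datasource that has never been probed counts as unavailable.
            return self._available.get(datasource, False)


def wait_for_datasource(monitor, datasource, timeout_s, poll_interval_s,
                        sleep=time.sleep) -> bool:
    """Gate for CollectMetrics: True means proceed, False means report to the user."""
    waited = 0.0
    while True:
        monitor.poll_once()
        if monitor.is_available(datasource):
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s
```

A caller would register a trial with something like `monitor.register("101383821", ["prometheus"])` and invoke `wait_for_datasource` at the start of the CollectMetrics transition. On a `False` return, `monitor.trials_using("prometheus")` gives the trials whose users need to be told that the datasource is compromised.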

0 replies
