Replies: 3 comments 7 replies
-
This only works for database maintenance; schema migrations must be finished before the application is started.
How would this work exactly when different jobs have different schedules? Is there one Docker image that is called with different options, or are there multiple Docker images?

The listed examples are mandatory tasks, like the Keycloak synchronization. What is the plan to implement that for the Docker Compose test environment?

If the runner for maintenance jobs were a permanent deployment, it would be possible to trigger maintenance jobs via an API endpoint. This would also allow triggering jobs manually on demand, for example, when it is known that the Keycloak synchronization failed and the next scheduled execution is too far in the future.

I am not sure it is a good idea that the application is not aware of the maintenance jobs. It means that it is not possible to implement an admin UI that shows any information about past and upcoming jobs.

Finally, it looks like my draft for implementing maintenance jobs was ignored in this proposal. Why?
-
So, there are currently two proposals on the table: one is to delegate task execution to Kubernetes (in the following referred to as the "Kubernetes proposal"), and the other is to have a deployment that runs a scheduler library like Quartz which handles the task execution (the "Quartz proposal"). After thinking for a while about the Quartz proposal and its possible implementation, I have come to the conclusion that there actually is not that much difference between the two proposals. This also impacts some of the arguments brought up so far in favor of or against one of the proposals. What they have in common is:
The main differences are IMHO the following:
As arguments for the Quartz proposal, it was stated that it would give more control over the execution of tasks and would therefore open up further use cases (e.g. for monitoring or an admin UI), and that everything would be contained in ORT Server without the need for further setup. Regarding the first point, one can have a look at what the execution of tasks with Quartz looks like (and this is considered typical for similar libraries or tools). The Quartz tutorial gives an example along the following lines:
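As a rough illustration (not the tutorial snippet verbatim), a minimal Quartz setup in Kotlin could look like the sketch below. The `KeycloakSyncTask` class and the cron expression are made-up placeholders, not existing ORT Server code:

```kotlin
import org.quartz.CronScheduleBuilder
import org.quartz.Job
import org.quartz.JobBuilder
import org.quartz.JobExecutionContext
import org.quartz.TriggerBuilder
import org.quartz.impl.StdSchedulerFactory

// Placeholder task: in ORT Server this would delegate to the actual task implementation.
class KeycloakSyncTask : Job {
    override fun execute(context: JobExecutionContext) {
        println("Synchronizing Keycloak roles...")
    }
}

fun main() {
    val scheduler = StdSchedulerFactory.getDefaultScheduler()

    val job = JobBuilder.newJob(KeycloakSyncTask::class.java)
        .withIdentity("keycloakSync", "maintenance")
        .build()

    // Example schedule: run every night at 02:00.
    val trigger = TriggerBuilder.newTrigger()
        .withIdentity("keycloakSyncTrigger", "maintenance")
        .withSchedule(CronScheduleBuilder.cronSchedule("0 0 2 * * ?"))
        .build()

    scheduler.scheduleJob(job, trigger)
    scheduler.start()
}
```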
So, the deployment for executing tasks would probably have an entry point script that contains statements like those to configure the scheduler. This script can either hard-code the tasks to execute or use some magic (based on service loaders or reflection) to determine them. In any case, after the execution of the script, the scheduler is fully responsible for the task execution.

Regarding control, this is exactly the same situation as in the Kubernetes scenario: the tasks are run by an external party, and in order to collect statistics or track the execution, additional means have to be implemented. Quartz may offer some hooks to receive notifications about task executions, but the same is true for Kubernetes. Only if an own scheduler mechanism is implemented in ORT Server can full control over task execution be achieved.

The other argument for the Quartz proposal is that it simplifies the ORT Server deployment, since no external configuration is needed to set up the task execution. This is, however, not the full truth: the deployment responsible for the task execution of course has to be defined and configured. This can be done (for instance) as part of a Helm chart; the complexity should be similar to other ORT Server components like core or the Kubernetes Job Monitor, with one major difference: the schedule of the tasks needs to be configured as well. This has to be done via a mechanism that is still to be defined. Probably, there will be configuration properties or environment variables corresponding to the single tasks supported by ORT Server and expressions defining when they should be executed. These settings are then evaluated by the entry point script of the deployment, which creates the task definitions and passes them to Quartz. This is a proprietary mechanism which needs to be adapted every time another task class is added. In the Kubernetes proposal, at least standard configuration is used to configure cron jobs, which should be familiar to people with an Ops background. It is also unclear how this mechanism could work with other task implementations not contained in the ORT Server codebase.

To sum this up, both proposals define a similar model for task execution that basically delegates the execution of ORT Server tasks to an external engine. They therefore share common characteristics with regard to control over execution or additional configuration.
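To make the configuration point more concrete, such an entry point could look roughly like the following sketch. The environment variable naming convention (`TASK_SCHEDULE_*`), the task registry, and the `KeycloakSyncTask` class are assumptions made for illustration only:

```kotlin
import org.quartz.CronScheduleBuilder
import org.quartz.Job
import org.quartz.JobBuilder
import org.quartz.JobExecutionContext
import org.quartz.TriggerBuilder
import org.quartz.impl.StdSchedulerFactory

// Placeholder task implementation, standing in for a real ORT Server task.
class KeycloakSyncTask : Job {
    override fun execute(context: JobExecutionContext) { /* ... */ }
}

// Hypothetical registry mapping task names to job classes. Every new task class would
// have to be added here (or discovered via a service loader / reflection).
val taskRegistry: Map<String, Class<out Job>> = mapOf(
    "keycloak-sync" to KeycloakSyncTask::class.java
)

fun main() {
    val scheduler = StdSchedulerFactory.getDefaultScheduler()

    taskRegistry.forEach { (name, jobClass) ->
        // Assumed convention: TASK_SCHEDULE_KEYCLOAK_SYNC="0 0 2 * * ?" etc.
        val cron = System.getenv("TASK_SCHEDULE_" + name.uppercase().replace('-', '_'))
            ?: return@forEach // Task is not scheduled in this environment.

        val job = JobBuilder.newJob(jobClass)
            .withIdentity(name, "ort-server-tasks")
            .build()

        val trigger = TriggerBuilder.newTrigger()
            .withIdentity("$name-trigger", "ort-server-tasks")
            .withSchedule(CronScheduleBuilder.cronSchedule(cron))
            .build()

        scheduler.scheduleJob(job, trigger)
    }

    scheduler.start()
}
```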
-
I just want to highlight #1687 in this context, which is about being able to automatically trigger (partial) runs based on events, including timers. The motivation is that for users who "live" exclusively in the server UI (and do not trigger jobs via the REST API from CI or via the CLI), there should be a way to configure regular reruns of e.g. the advisor to check for newly discovered security vulnerabilities for unchanged code. How does this use case fit with the proposals?
-
Motivation
In ORT Server, there are multiple use cases that require some sort of periodic job execution. In the current implementation, some (limited) scheduling functionality has already been implemented for the Kubernetes Job Monitor to check for long-running or lost jobs at certain intervals. For new pending features, running specific jobs on a customizable schedule will be important as well, for instance:
- For complex and long-running database migrations, it may be necessary to run them as background jobs that process data chunk-wise. In such a scenario, a migration job would be run at specific intervals until the whole migration has finished. (Removed 2024-11-04 after feedback: this is actually more of a one-time job, not a periodic job.)

The objective is to implement a technical solution that makes it easy to add periodic jobs and to execute them at arbitrary points in time.
There are the following constraints:
a.) The impact of the execution of periodic jobs on the processing of scan jobs should be kept as low as possible. There shall be no major performance degradation of the system when a periodic job is running, nor should periodic jobs have a negative impact on the stability of regular operation.
b.) The effort for implementation should be kept as low as possible. It is not the goal to implement a scheduler or job management system within ORT Server, as this is not seen as a key competency of ORT Server, and there are already enough scheduler and job management systems out there that can be used off-the-shelf.
Proposed Solution
ps1: Location of the job code: Like the (already implemented) Job Monitor, the code is located inside the ORT Server repository. The Kotlin/Gradle module system provides modules for specific purposes, e.g. for database access or for common functionality used by workers. This allows for a convenient programming model where components can be integrated with functionality provided by other parts of the codebase.
ps2: Standard programming interface for jobs: Jobs implement standardized interfaces for lookup, creation (factory), and configuration; see the interface sketch after the proposal points.
ps3: Execution of jobs: Like the Job Monitor, the jobs run in a separate Kubernetes pod. This keeps the impact on the regular scan activities as low as possible and poses no risk to the robustness of ORT Server operation.
ps4: Scheduling by date/time: For executing jobs periodically at a specific point in time, an external scheduler is used. If ORT Server is running in a Kubernetes environment, this could be a Kubernetes CronJob. This keeps the implementation effort at a minimum and is a proven solution. Documentation will be added describing how adopters of ORT Server can use these CronJobs. CronJobs also provide features such as ensuring that no more than one instance of a job runs at a time.
ps5: Reporting: In order to get information about job executions, their duration, and their outcomes, the features of the external scheduling solution (Kubernetes CronJobs) are regarded as sufficient at the moment.
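As referenced in ps2, a sketch of what such job interfaces could look like is shown below. The names and signatures are assumptions for illustration, not existing ORT Server code:

```kotlin
/**
 * Hypothetical contract for a periodic maintenance job. A job only defines what it does;
 * when it runs is decided by the external scheduler (e.g. a Kubernetes CronJob).
 */
interface MaintenanceJob {
    /** Unique name used to look up and trigger the job. */
    val name: String

    /** Executes one run of the job. */
    suspend fun run()
}

/**
 * Hypothetical factory interface. Implementations could be discovered via the JDK
 * ServiceLoader, which would also allow job implementations outside the ORT Server
 * codebase to be plugged in.
 */
interface MaintenanceJobFactory {
    /** The name of the job this factory creates. */
    val jobName: String

    /** Creates a configured job instance from the given configuration properties. */
    fun create(config: Map<String, String>): MaintenanceJob
}
```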
Next steps