Cron job stops after random hours/days #805
Comments
Is the Node process still running when the cron job stops? Is it only the cron part that fails?
Yes, the Node process keeps running.
I have a NestJS codebase running in production for more than 8 months (restarted once in a while for deployments) that runs a lot of cron jobs handling financial transactions, one of them running every second, and I've never encountered this issue. However, I do have more memory on my server, 2 GB I believe, so maybe that's the difference. Have you ever encountered a situation where your container exceeds its maximum memory usage?

Could you try updating your package.json with the following overrides entry?

```json
"overrides": {
  "@nestjs/schedule": {
    "cron": "2.4.4"
  }
}
```

This kind of issue is typically difficult to pinpoint without detailed information from those who have experienced it. Unfortunately, reports of this nature are quite rare and often lack critical details. If you can provide any detail or hint that would help us reproduce the issue, we would appreciate it and do our best to find a solution.
I understand how difficult it is to reproduce these kinds of bugs; I can't even reproduce it myself, and it has happened twice so far. For details: the container running the cron jobs usually sits at around 70% memory and <5% CPU, and there was no usage spike when the bug occurred. Thank you @sheerlox for the suggestion, but I think that blindly upgrading the dependency wouldn't tell us much without understanding the root cause. Again, I'm not even sure how to debug this, so I'm available to give more details if possible.
I took a good look at #232, and it actually contains a lot of information! From what I understand, the issue looks to be with line 191 of the file discussed there. Please add the patch file to your project, then follow the set-up instructions.
Once the issue manifests itself again, you should be able to extract the interesting output; the 15 lines above the failure should give us a pretty good picture of the job's previous runs. If you can get us this debug information, I'll make sure to free up some time to dig deeper into the issue and find a fix. Thank you for re-opening an issue about this to let us know it still happens.
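As a rough illustration of the kind of instrumentation being discussed (a hypothetical sketch, not the actual patch file; names like `createTrackedJob` are mine), the wrapper below logs every run of a cron v3 `CronJob` together with the gap since the previous run, so the last executions can be pulled out of the container logs after the job goes silent:

```ts
import { CronJob } from 'cron';

// Hypothetical stand-in for the debug patch discussed above: wrap the job
// callback so every run leaves a timestamped log line with the gap since the
// previous run. When the job silently stops, the last few lines can be
// grepped out of the container logs.
function createTrackedJob(cronTime: string, onTick: () => void, name = 'job'): CronJob {
  let lastRun: number | null = null;
  return new CronJob(cronTime, () => {
    const now = Date.now();
    const gapMs = lastRun === null ? 'n/a' : String(now - lastRun);
    lastRun = now;
    console.log(`[cron-debug] ${name} fired at ${new Date(now).toISOString()} (gap: ${gapMs} ms)`);
    onTick();
  });
}

// Usage: same schedule as the reporter's every-second job.
const job = createTrackedJob('* * * * * *', () => {
  // real work goes here
}, 'every-second');
job.start();
```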
Thanks for the update, will make sure to add this and get back to you if the issue manifests itself again 🙏.
A little update:

Will that be enough to debug the issue in case it happens?
That makes sense, I didn't consider that when writing the patch! Knowing the exact time the callback was started would be really useful, so instead of my first version, can I suggest saving the exact time the callback was started as well?
Sounds good to me, will add it 👌. Updated patch:
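A minimal sketch of the idea being agreed on here, recording when each callback started and keeping only the last few entries, could look like the following; the names are illustrative and not taken from the actual patch:

```ts
// Keep the start times of the most recent runs so they can be dumped once the
// job is found to have stopped (e.g. from a health-check endpoint).
const runHistory: string[] = [];
const MAX_HISTORY = 15;

export function recordRunStart(): void {
  runHistory.push(new Date().toISOString());
  if (runHistory.length > MAX_HISTORY) {
    runHistory.shift();
  }
}

export function dumpRunHistory(): readonly string[] {
  return [...runHistory];
}

// Inside the cron callback, call recordRunStart() before doing the real work.
```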
Just stopping by to say that it's still happening in version 3.1.7, using Node 14.19.3 (node:14-alpine). I had to switch to node-cron.
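For readers weighing the same workaround, a minimal schedule with the separate node-cron package might look like the sketch below, assuming node-cron's documented `schedule()` API and a six-field expression for second-level granularity:

```ts
import * as cron from 'node-cron';

// Every second, mirroring the schedule discussed in this thread.
const task = cron.schedule('* * * * * *', () => {
  console.log(`[node-cron] tick at ${new Date().toISOString()}`);
});

// node-cron tasks can be stopped and restarted explicitly if needed.
// task.stop();
```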
As I previously said, this issue is hard to reproduce and we're still looking for help from someone encountering it. I have critical production code running this library for more than a year and I've never been able to observe the issue, so we welcome anyone who would like to help us debug it with more information. Please find the patch to apply for the necessary debug information here: #805 (comment).

Regarding the transition to node-cron: note that this issue also seems to exist there.
There's an interesting blog post that might be relevant to this issue: https://lucumr.pocoo.org/2024/6/5/node-timeout/. In it, Armin describes how setTimeout usage in Node can leak memory. I see the same pattern in a couple of places in this library, so the described memory leak may very well apply here.
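As a side note on Node timer behavior that scheduling code can trip over (my own illustration, not taken from the blog post): out-of-range delays passed to setTimeout do not throw, they are clamped, so a miscalculated delay fires almost immediately instead of failing loudly.

```ts
// Node clamps out-of-range setTimeout delays instead of throwing.
setTimeout(() => console.log('negative delay fired'), -1); // delay is treated as 1 ms

setTimeout(() => console.log('overflowing delay fired'), 2_147_483_648);
// Node emits a TimeoutOverflowWarning for the call above and sets the delay to 1 ms,
// so a timeout meant to be ~24.8 days away fires right away.
```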
Hello guys. I would like to add my case, which happened a few days ago. I have a payment gateway application implemented with NestJS, in production, that has a cron job (from NestJS) running to trigger webhooks. The cron job runs every 10 seconds, using the CronExpression from the library, which resolves to */10 * * * * *. The application runs on GCP using Cloud Run. In my implementation, the retry counter increases by one on every execution until the webhook is processed. The API had been running for at least 20 days without an interruption or reset.

Moving on to what I have in GCP: I checked whether the application was reaching its memory limit, but it wasn't, and Cloud Run is configured to keep at least one instance. According to the Cloud Run documentation on instances, an instance can go idle but is not deactivated. The Cloud Run configuration has 512 MB and 1 CPU allocated. The CPU is not set to always be allocated, but that cannot be the problem, as the other webhooks were processed in time.

Your Environment:
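One lightweight way to catch this kind of silent stop in environments like Cloud Run is an in-process watchdog that logs when the job has not ticked for longer than expected. A hypothetical sketch (names and thresholds are mine), which of course only helps while the container's CPU is actually running:

```ts
// Record the last time the cron callback ran, and complain loudly if it goes
// quiet for much longer than its 10-second schedule.
let lastTick = Date.now();

// Call this at the top of the cron job callback.
export function markTick(): void {
  lastTick = Date.now();
}

const WATCHDOG_INTERVAL_MS = 30_000;
const MAX_SILENCE_MS = 60_000;

setInterval(() => {
  const silentForMs = Date.now() - lastTick;
  if (silentForMs > MAX_SILENCE_MS) {
    console.error(`[cron-watchdog] no tick for ${silentForMs} ms; the job may have stopped`);
  }
}, WATCHDOG_INTERVAL_MS).unref();
```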
I have the same issue: the cron job stops after a random time.

Description: I encountered an issue where the cron job stops running after a random amount of time. After monitoring the behavior for a while, I noticed that the job unexpectedly stops without any specific error message or pattern.

Steps to Reproduce: Set up a cron job using "cron": "^3.1.7".

Expected Behavior: The cron job should continue running consistently according to the schedule.

Actual Behavior: The cron job stops after a random period of time. There is no specific pattern or error log that explains the stoppage.

What I Did to Resolve the Issue: I forked the project and made modifications to the code. After testing the changes over the course of a few weeks, I no longer experienced the issue.

Changes Made:

Conclusion: Since the changes I made have resolved the issue for me, I wanted to report this in case others are facing the same problem. I'm happy to provide further details or submit a pull request if needed.
Hi @lmcuong25, thanks a lot for the effort. However, I'm afraid I might be missing something: except for returning more data for logging purposes, I don't see any changes to the getTimeout function. Could you point out what you think solves the issue in your patch?
Hi @sheerlox,
Hi guys, I spent some time trying to understand what could cause the cron job to stop running. I used https://github.com/wolfcw/libfaketime to try to replicate a situation where the cron job could stop. I don't know if this tool is the best way to replicate such a scenario, but I could stop the cron job: if the multiplier used to speed up time is too big, the timeout calculated by getTimeout() can come out as -1.

Another curiosity: for example, with the libfaketime ratio set to speed time up 1000x (in seconds), sometimes the job could not start at all (the timeout was -1 in the first loop), other times the code started and kept running, and in other cases it ran for N loops, then got -1 as the timeout and started the stop flow. I know it is inconclusive, but I think the problem could be with the way getTimeout() works, or with the environment, which is not related to the library and could explain why some people never experience this bug.

I don't understand why a timeout of -1 should stop the cron job from running. Maybe to avoid compromising the cron job's execution interval and causing unexpected behavior for the user? But in my opinion, stopping the cron job because of this already leads to unexpected behavior. My suggestion would be to throw a specialized error instead of silently stopping the job. As mentioned in other issues, like #232, the -1 timeout could be the problem. In this comment, #232 (comment), the person points out that Node is running very slowly, which could break the logic inside the library. If this is correct, the library should state the minimum specs it needs to run reliably, and throwing an error would be a better choice in these situations. This is the repository with my fork and the changes I made: https://github.com/droderuan/node-cron
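To make that failure mode concrete, here is a simplified sketch (not the library's actual getTimeout implementation) of how the remaining delay can come out negative, together with the "fail loudly" alternative suggested above:

```ts
// Simplified model: the delay until the next run is derived from the next
// scheduled date. If the clock jumps forward (libfaketime, NTP, VM pause) or
// the event loop stalls past that date, the result goes negative.
function getRemainingDelayMs(nextDate: Date): number {
  return nextDate.getTime() - Date.now();
}

// The suggestion in the comment above: surface the condition instead of
// silently stopping the job.
function scheduleNext(nextDate: Date, run: () => void): void {
  const delayMs = getRemainingDelayMs(nextDate);
  if (delayMs < 0) {
    // Alternatives: throw a specialized error, or clamp to 0 and run late.
    throw new Error(`Computed cron delay is negative (${delayMs} ms)`);
  }
  setTimeout(run, delayMs);
}
```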
Thanks so much to all of you for the work you've put into this. Thanks to your input, I think I might have a solution to the issue. I'll be working on it ASAP and keep you updated!
I'm using Docker. Here's my log output:
At the time of logging, CPU and RAM usage weren't particularly high. However, it's possible that a process was blocking the thread at that moment (since Node.js is single-threaded). To address the issue, I modified the code to call localTimeToMillis first, as shown in this comparison: main...lublue:node-cron-track:main. Updated code:
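Separately from the fork linked above, one way to check the "something blocked the thread" hypothesis is Node's built-in event-loop delay monitor from perf_hooks: logging its maximum periodically shows whether a long synchronous task coincided with the missed ticks. A diagnostic sketch, assuming plain Node APIs:

```ts
import { monitorEventLoopDelay } from 'node:perf_hooks';

// Sample how long the event loop was blocked; values are reported in nanoseconds.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  const maxMs = Math.round(histogram.max / 1e6);
  const p99Ms = Math.round(histogram.percentile(99) / 1e6);
  console.log(`[loop-delay] max=${maxMs} ms p99=${p99Ms} ms over the last minute`);
  histogram.reset();
}, 60_000).unref();
```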
🎉 This issue has been resolved in version 3.3.2-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀
Hey everyone, we just released a beta version with an idea for a fix so we don't let this issue go stale. Basically, it will execute the job even if it fires with some delay. There's still a maximum delay of 1000 ms (1 s) to avoid potentially disruptive behavior; past that, it will behave as before (stop the job). In any case, a few logs have been added that should give us more info on the best way to go about this issue.

We would highly appreciate it if some of you could use the beta and report back. All relevant logs start with a dedicated prefix.

This version should be production-safe and all tests are still passing, but feel free to take a look at the code before you commit. Happy Holiday Season! 🎄
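For reference, a rough sketch of the tolerance behavior described above (illustrative only; the real change ships in 3.3.2-beta.1 and may differ in detail):

```ts
const MAX_DELAY_TOLERANCE_MS = 1000;

// If a tick fires late but within the tolerance, run it anyway and log the
// delay; beyond the tolerance, fall back to the previous behavior (stop).
function handleTick(expectedAtMs: number, run: () => void, stop: () => void): void {
  const lateByMs = Date.now() - expectedAtMs;
  if (lateByMs <= MAX_DELAY_TOLERANCE_MS) {
    if (lateByMs > 0) {
      console.warn(`[cron] tick ran ${lateByMs} ms late`);
    }
    run();
  } else {
    console.error(`[cron] tick is ${lateByMs} ms late, exceeding tolerance; stopping job`);
    stop();
  }
}
```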
Description
I have a Node app running on a machine with 512 MB of memory and 256 CPU units (0.25 vCPU) on AWS, and I have a cron job (actually using NestJS, which in turn uses this package) that runs every second.
After a few days (it doesn't always happen), the cron job just stops working with no error or log.
In fact, an identical issue was reported before (#232) and was closed despite people still reporting the same behaviour.
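For context, a typical NestJS declaration of the kind of every-second job described above might look like this (illustrative; not the reporter's actual code):

```ts
import { Injectable, Logger } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';

@Injectable()
export class EverySecondService {
  private readonly logger = new Logger(EverySecondService.name);

  // CronExpression.EVERY_SECOND resolves to '* * * * * *'.
  @Cron(CronExpression.EVERY_SECOND)
  handleTick(): void {
    this.logger.debug('tick');
  }
}
```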
Expected Behavior
The cron job should not stop working, or at the very least it should throw an error.
Actual Behavior
Possible Fix
No response
Steps to Reproduce
Context
Your Environment
- cron version: 2.4.1
- Docker image: node:18.18.2-alpine3.17