Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kafka Crash #7

Open
PeterBKramer opened this issue Mar 22, 2023 · 57 comments
Open

Kafka Crash #7

PeterBKramer opened this issue Mar 22, 2023 · 57 comments

Comments

@PeterBKramer
Copy link

I have been 'successfully' connected to igwn.gwalert and receiving alerts for many hours. But after many successful downloads I received this error message and a fatal crash:

{"level":"ERROR","timestamp":"2023-03-22T12:08:23.130Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":17,"size":10}
{"level":"ERROR","timestamp":"2023-03-22T12:08:23.131Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:95:5)"}
{"level":"INFO","timestamp":"2023-03-22T12:08:23.210Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802”}

I am Running node.js v14.20.0
I run a node.js script (.cjs) monitoring kafka using forever (v4.0.3).
This is an extract of the node.js script:

const { Kafka } = require ('gcn-kafka');
(async () => {      
	console.log('starting');
	const kafka = new Kafka({
		client_id: '   [deleted]  '
		client_secret: '    [deleted]   '
	})
	const consumer = kafka.consumer()
    	try {    
   		await consumer.subscribe({
		   topics: [
			//	'gcn.classic.text.LVC_COUNTERPART',
			//	'gcn.classic.text.LVC_EARLY_WARNING',
			//	'gcn.classic.text.LVC_INITIAL',
			//	'gcn.classic.text.LVC_PRELIMINARY',
			//	'gcn.classic.text.LVC_RETRACTION',
			//	'gcn.classic.text.LVC_TEST',
			//	'gcn.classic.text.LVC_UPDATE',
			'igwn.gwalert'
		   ],
  		})
   
	} catch (error) {
	  if (error.type === 'TOPIC_AUTHORIZATION_FAILED'){
		console.warn('Not all subscribed topics are available')
          } else {
		console.warn('throwing an error' + error.type )
		throw error
	  }
	}
	await consumer.run({
 		eachMessage: async (payload) => {
			const valueText=payload.message.value?.toString();

The system started with these messages (except for the first two, the additional messages are unusual) (and when I just now restarted the script only the first two appeared)

{"level":"INFO","timestamp":"2023-03-21T22:39:33.323Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"INFO","timestamp":"2023-03-21T22:39:35.109Z","logger":"kafkajs","message":"[ConsumerGroup] Consumer has joined the group","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802","memberId":"kafkajs-911d2926-de7f-40fd-ad01-2d1fc460dadf","leaderId":"kafkajs-911d2926-de7f-40fd-ad01-2d1fc460dadf","isLeader":true,"memberAssignment":{"igwn.gwalert":[0]},"groupProtocol":"RoundRobinAssigner","duration":1784}
{"level":"ERROR","timestamp":"2023-03-21T22:39:36.197Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs"}
{"level":"ERROR","timestamp":"2023-03-21T22:39:36.359Z","logger":"kafkajs","message":"[Connection] Connection error: Client network socket disconnected before secure TLS connection was established","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: Client network socket disconnected before secure TLS connection was established\n at connResetException (internal/errors.js:639:14)\n at TLSSocket.onConnectEnd (_tls_wrap.js:1570:19)\n at TLSSocket.emit (events.js:412:35)\n at endReadableNT (internal/streams/readable.js:1333:12)\n at processTicksAndRejections (internal/process/task_queues.js:82:21)"}
{"level":"ERROR","timestamp":"2023-03-21T22:39:36.390Z","logger":"kafkajs","message":"[Connection] Connection error: Client network socket disconnected before secure TLS connection was established","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: Client network socket disconnected before secure TLS connection was established\n at connResetException (internal/errors.js:639:14)\n at TLSSocket.onConnectEnd (_tls_wrap.js:1570:19)\n at TLSSocket.emit (events.js:412:35)\n at endReadableNT (internal/streams/readable.js:1333:12)\n at processTicksAndRejections (internal/process/task_queues.js:82:21)"}
{"level":"ERROR","timestamp":"2023-03-21T22:39:36.520Z","logger":"kafkajs","message":"[Connection] Connection error: Client network socket disconnected before secure TLS connection was established","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: Client network socket disconnected before secure TLS connection was established\n at connResetException (internal/errors.js:639:14)\n at TLSSocket.onConnectEnd (_tls_wrap.js:1570:19)\n at TLSSocket.emit (events.js:412:35)\n at endReadableNT (internal/streams/readable.js:1333:12)\n at processTicksAndRejections (internal/process/task_queues.js:82:21)"}
{"level":"ERROR","timestamp":"2023-03-21T22:39:36.599Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSConnectionError: Connection timeout","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802","stack":"KafkaJSConnectionError: Connection timeout\n at Timeout.onTimeout (/home/peterb2/node_modules/kafkajs/src/network/connection.js:223:23)\n at listOnTimeout (internal/timers.js:557:17)\n at processTimers (internal/timers.js:500:7)"}
{"level":"INFO","timestamp":"2023-03-21T22:39:36.683Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"ERROR","timestamp":"2023-03-21T22:39:36.684Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"INFO","timestamp":"2023-03-21T22:39:36.984Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"INFO","timestamp":"2023-03-21T22:39:39.335Z","logger":"kafkajs","message":"[ConsumerGroup] Consumer has joined the group","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802","memberId":"kafkajs-04355959-3447-4455-8ffc-9eb0ab18edc9","leaderId":"kafkajs-04355959-3447-4455-8ffc-9eb0ab18edc9","isLeader":true,"memberAssignment":{"igwn.gwalert":[0]},"groupProtocol":"RoundRobinAssigner","duration":1580}
found a Notice
(node:150457) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
(Use node --trace-deprecation ... to show where the warning was created)

During the evening, while operating normally, the system at one point sent these messages:

{"level":"ERROR","timestamp":"2023-03-22T10:17:32.115Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs"}
{"level":"ERROR","timestamp":"2023-03-22T10:17:34.316Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSConnectionError: Connection timeout","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802","stack":"KafkaJSConnectionError: Connection timeout\n at Timeout.onTimeout [as _onTimeout] (/home/peterb2/node_modules/kafkajs/src/network/connection.js:223:23)\n at listOnTimeout (internal/timers.js:557:17)\n at processTimers (internal/timers.js:500:7)"}
{"level":"INFO","timestamp":"2023-03-22T10:17:34.397Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"ERROR","timestamp":"2023-03-22T10:17:34.399Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 300ms","retryTime":300,"groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"INFO","timestamp":"2023-03-22T10:17:34.700Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802"}
{"level":"ERROR","timestamp":"2023-03-22T10:17:36.282Z","logger":"kafkajs","message":"[Connection] Connection error: Client network socket disconnected before secure TLS connection was established","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: Client network socket disconnected before secure TLS connection was established\n at connResetException (internal/errors.js:639:14)\n at TLSSocket.onConnectEnd (_tls_wrap.js:1570:19)\n at TLSSocket.emit (events.js:412:35)\n at endReadableNT (internal/streams/readable.js:1333:12)\n at processTicksAndRejections (internal/process/task_queues.js:82:21)"}
{"level":"INFO","timestamp":"2023-03-22T10:17:37.573Z","logger":"kafkajs","message":"[ConsumerGroup] Consumer has joined the group","groupId":"ea3e9f07-6a67-4d8d-903f-afe206208802","memberId":"kafkajs-17308c8c-48a5-4918-b4aa-e56cf744bd53","leaderId":"kafkajs-17308c8c-48a5-4918-b4aa-e56cf744bd53","isLeader":true,"memberAssignment":{"igwn.gwalert":[0]},"groupProtocol":"RoundRobinAssigner","duration":1969}

@lpsinger
Copy link
Member

Your sample code is cut off. Would you please post example code that we can run?

@PeterBKramer
Copy link
Author

I think this will run.

var fetch=require(‘node-fetch');

// irrelevant
// this is to connect to CloudKit:
//var CloudKit=require('./cloudkit');
//CloudKit.configure({
// services:
// {fetch:fetch},
// containers:[{
// containerIdentifier:’ [deleted] ’,
// environment:'Production',
// serverToServerKeyAuth:{
// keyID:’ [deleted] ’,
// privateKeyFile:’ [deleted] ’
// }
// }]
//});

const { Kafka } = require ('gcn-kafka');
(async () => {
console.log('starting');
const kafka = new Kafka({
client_id: ‘ [deleted] ’,
client_secret: ‘ [deleted] ’,
})
const consumer = kafka.consumer()
try { // was return (async () => { // addded or removed
await consumer.subscribe({
topics: [
'igwn.gwalert'
],
})
}
catch (error) {
if (error.type === 'TOPIC_AUTHORIZATION_FAILED'){
console.warn('Not all subscribed topics are available')
} else {
console.warn('throwing an error' + error.type )
throw error
}
}
await consumer.run({
eachMessage: async (payload) => {

		// again irrelevant - this is to connect to CloudKit
		//var container = CloudKit.getContainer(‘   [deleted]   ’);
                    //var database = container.getDatabaseWithDatabaseScope 				
						(CloudKit.DatabaseScope['PUBLIC']);
             	//container.setUpAuth()


          	          var triggerNum='no value detected';
                  const valueText=payload.message.value?.toString();
                      console.log('found a Notice');
	         .then(function(userInfo){
                               console.log('go to save a record: ' );
                                const alert=JSON.parse(valueText);
		            var record={recordType: ‘Notices'};


			       // [many lines deleted]:
				
                                   //      Parse and do many things with alert.  
			       //       Create a record. 
			       // during that, do these things:
			      //	console.log(alert.time_created);
			      //	console.log(alert.time_created.toString());
			      //	console.log('typed time_created’);
                                  //  	console.log(record);               
			      //	console.log('saving a record’);
			      // and finally, post to CloudKit:
			      //	return database.saveRecords(record);

                      })
                      .then(function(response){
   	                                console.log('saved a record');
				//   if there is a record to delete:
				//   return database.deleteRecords({recordName: recordToDelete});
	          })
                      .then(function(response){	
			      console.log('here deleted a record');
		   })
                       .catch(function(error){
                   		      console.log('Error here '+ error);
                      	     //process.exit(1);
                        });
  	 },
  })

})()

@PeterBKramer
Copy link
Author

Tried to improve format. I think this will run.

var fetch=require(‘node-fetch');

 //     irrelevant
//  this is to connect to CloudKit:
//  var CloudKit=require('./cloudkit');
//CloudKit.configure({
//  	services:
//  	{fetch:fetch},
//  	containers:[{
//  		containerIdentifier:’ [deleted] ’,
//  		environment:'Production',
//  		serverToServerKeyAuth:{
//  			keyID:’ [deleted] ’,
//  			privateKeyFile:’ [deleted] ’
//  		}
//  	}]
//  });

const { Kafka } = require ('gcn-kafka’);
(async () => {
	console.log('starting’);
	const kafka = new Kafka({
		client_id: ‘ [deleted] ’,
		client_secret: ‘ [deleted] ’,
	})
	const consumer = kafka.consumer()
	try { 
		await consumer.subscribe({
			topics: [
				‘igwn.gwalert'
			],
		})
	}
	catch (error) {
		if (error.type === 'TOPIC_AUTHORIZATION_FAILED’){
		console.warn('Not all subscribed topics are available’)
	} else {
		console.warn('throwing an error' + error.type )
		throw error
	}
}
await consumer.run({
	eachMessage: async (payload) => {
  
           // again irrelevant - this is to connect to CloudKit
           //   var container = CloudKit.getContainer(‘   [deleted]   ’);
           //    var database = container.getDatabaseWithDatabaseScope 
                                        (CloudKit.DatabaseScope['PUBLIC']);
           //    container.setUpAuth()


           var triggerNum='no value detected';
           const valueText=payload.message.value?.toString();
           console.log('found a Notice');
           .then(function(userInfo){
                console.log('go to save a record: ' );
                const alert=JSON.parse(valueText);
                var record={recordType: ‘Notices’};
   
                     // [many lines deleted]:
				
                     //      Parse and do many things with alert.         
                     //       Create a record. 
                     //           during that, do these things:
                     //           console.log(alert.time_created);
                      //          console.log(alert.time_created.toString());
                      //          console.log('typed time_created’);
                      //           console.log(record);               
                      //           console.log('saving a record’);
                      //     and finally, post to CloudKit:
                      //           return database.saveRecords(record);

           })
           .then(function(response){
                console.log('saved a record');
                   //   if there is a record to delete:
                   //        return database.deleteRecords({recordName: recordToDelete});
           })
           .then(function(response){	
                  console.log('here deleted a record');
           })
           .catch(function(error){
                  console.log('Error here '+ error);
                 //process.exit(1);
           });
      },
 })

})() `

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 22, 2023 via email

@lpsinger
Copy link
Member

I've been letting our demo script from https://gcn.nasa.gov/docs/client#ecmascript-mjs run for several hours and I have seen similar log messages:

{"level":"ERROR","timestamp":"2023-03-23T04:57:39.094Z","logger":"kafkajs","message":"[Connection] Connection error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov","broker":"kafka.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov\n    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:107:26)"}
{"level":"ERROR","timestamp":"2023-03-23T04:57:39.097Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to seed broker, trying another broker from the list: Connection error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov","retryCount":4,"retryTime":4712}
{"level":"ERROR","timestamp":"2023-03-23T04:57:43.825Z","logger":"kafkajs","message":"[Connection] Connection error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov","broker":"kafka.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov\n    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:107:26)"}
{"level":"ERROR","timestamp":"2023-03-23T04:57:43.826Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to seed broker, trying another broker from the list: Connection error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov","retryCount":5,"retryTime":7834}
{"level":"ERROR","timestamp":"2023-03-23T04:57:43.827Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNumberOfRetriesExceeded: Connection error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov","groupId":"89b18835-ca33-4b9e-a5fe-34bb2d887dee","retryCount":5,"stack":"KafkaJSNonRetriableError\n  Caused by: KafkaJSConnectionError: Connection error: getaddrinfo ENOTFOUND kafka.gcn.nasa.gov\n    at TLSSocket.onError (/Users/lpsinger/Downloads/example/node_modules/kafkajs/src/network/connection.js:210:23)\n    at TLSSocket.emit (node:events:513:28)\n    at emitErrorNT (node:internal/streams/destroy:151:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:116:3)\n    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"}
{"level":"INFO","timestamp":"2023-03-23T04:57:43.859Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"89b18835-ca33-4b9e-a5fe-34bb2d887dee"}
{"level":"ERROR","timestamp":"2023-03-23T04:57:43.859Z","logger":"kafkajs","message":"[Consumer] Restarting the consumer in 7834ms","retryCount":5,"retryTime":7834,"groupId":"89b18835-ca33-4b9e-a5fe-34bb2d887dee"}

These probably correspond to times when my laptop's VPN flakes out. However, the script automatically reconnects and continues receiving alerts.

Are you finding that you stop receiving alerts, or that the script actually terminates? If your client is recovering automatically and continuing to receive alerts, then I think that you can safely ignore these --- or at worst, treat them as warnings that your Internet connectivity is slightly unreliable.

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 24, 2023 via email

@lpsinger
Copy link
Member

FYI - Using the time stamps on your script, "2023-03-23T04:57:39.094Z” , my script was running fine during that time as shown by these console.logs:

Like I said, those log messages were in my output were probably due to my VPN dropping out momentarily: a local network connectivity issue. So it wouldn't be correlated with any warnings in your log output.

@lpsinger
Copy link
Member

Within 1 second of the time that your script crashed (2023-03-22T12:08:23.130Z), one of our Kafka brokers logged the following message (IP addresses redacted):

[2023-03-22 20:08:23,612] INFO [SocketServer listenerType=ZK_BROKER, nodeId=1] Failed authentication with /xxx.xxx.xxx.xxx (Unsupported SASL mechanism SCRAM-SHA-512) (org.apache.kafka.common.network.Selector)

However, that log message was from kafka1.gcn.nasa.gov, whereas your log message refers to kafka2.gcn.nasa.gov. So it is not clear whether or not this is related.

Have you been able to reproduce this more than once?

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 24, 2023 via email

@lpsinger
Copy link
Member

Nor have I. Please report again on this issue if it recurs. Thanks!

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 25, 2023 via email

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 25, 2023 via email

@lpsinger
Copy link
Member

It turns out that my forever script continues to run but I no longer receive Kafka notices (assuming they have come out since earlier this morning). Any idea how to detect this error and restart with:

You should not have to restart your Kafka client. Most standard Kafka clients are designed to recover automatically from network connectivity outages, outages of the broker, and so on. If it isn't recovering automatically, then we need to report that upstream as a bug.

@lpsinger
Copy link
Member

@PeterBKramer, how long does your eachMessage callback take to run? I wonder if KafkaJS might be assuming that it returns quickly, and it might get into an invalid state if it does not.

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 27, 2023 via email

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 28, 2023 via email

@lpsinger
Copy link
Member

I use “.then” functions in node.js to handle those delays.

Interesting that you are mixing the then callback style and the newer await syntactic sugar.

I believe that node.js handles those .then calls asynchronously and can receive another eachMessage while awaiting a .then response from CloudKit. If so, the answer to your question would be "less than .001 seconds". If the node.js code is not handling them asynchronously and cannot handle another eachMessage before the three .then are executed then the answer to your question would be “usually 1.1 seconds" but on one occasion out of about 30 examined it was “5.2 seconds" because of a delay in getting a response to an authorization request. I cannot say how long it might have been during the fatal error.

OK, I wouldn't worry about that, then.

@PeterBKramer
Copy link
Author

PeterBKramer commented Mar 31, 2023 via email

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 1, 2023 via email

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 4, 2023 via email

@lpsinger
Copy link
Member

lpsinger commented Apr 4, 2023

Is your full script somewhere on GitHub where I can view it?

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 4, 2023 via email

@lpsinger
Copy link
Member

lpsinger commented Apr 4, 2023

@PeterBKramer, I'm trying to reproduce this but it's hard to do it while seeing just bits of your code out of context. Here's what I need you to do:

  • Post the simplest complete, self-contained script that reproduces the crash to https://gist.github.com and place a link to it in this issue.
  • Keep a list of the UTC times of the crashes and put them in a text file, ideally also in a gist (https://gist.github.com). Don't email every time it crashes, just add the timestamp to the list.
  • Note for us as many details as you can think of about your runtime environment: at minimum, your operating system and architecture, your version of NodeJS, your versions of gcn-kafka-js and KafkaJS.
  • Also, what kind of Internet connectivity does the machine on which you are testing have? Is it a cloud instance or server with a reliable connection? Is it a laptop that goes to sleep sometimes or roams from one WiFi network to another?

I have been running a KafkaJS client for a day or so now in Amazon Fargate but haven't yet seen a crash.

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 4, 2023 via email

@lpsinger
Copy link
Member

lpsinger commented Apr 4, 2023

link: https://gist.github.com/PeterBKramer/2dd52817383d941b7595adaebebcc9d3 Gravitational Wave Events Kafka handler gist.github.com

I get a 404 error. Did you make a private Gist? Try making a public gist.

There are 2 issues - why am I getting fatal crashes and how can I recover from them. Can anyone help me with the second - what is wrong with my payload.payload.error that it does not return a yes or no for "instanceof KafkaJSNonRetriableError”.

Would you please make two separate code samples? One with your crash handling, and one without.

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 4, 2023 via email

@lpsinger
Copy link
Member

lpsinger commented Apr 4, 2023

https://gist.github.com/PeterBKramer/2dd52817383d941b7595adaebebcc9d3 Gravitational Wave Events Kafka handler gist.github.com

I still get a 404.

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 4, 2023 via email

@PeterBKramer
Copy link
Author

@PeterBKramer
Copy link
Author

I think this link will work......
https://gist.github.com/PeterBKramer/264e82962f94fc59317f60d82ddb05ea

@PeterBKramer
Copy link
Author

Again, I am sorry about the spacing - tabs versus spaces.
There are two issues - how to prevent the fatal crashes and how to recover from the fatal crashes. The first is out of my control, the second is beyond my skill level. But I am close to solving the second...unfortunately I can't parse the payload in:

    consumer.on('consumer.crash', async (payload) => {

I am trying to select the nature of the error (so I only restart if it is non-retriable) with:

     const isNonRetriableError = payload.payload.error instanceof KafkaJSNonRetriableError;

I get:

      KafkaJSNonRetriableError is not defined"

Any suggestions?

@lpsinger
Copy link
Member

lpsinger commented Apr 5, 2023

I use a DreamHost VPS to receive these Kafka packages. [vps37660]$ npm -v 8.19.1 [vps37660]$ node -v v14.20.0 [vps37660]$ nvm --version 0.33.0

NodeJS 14.x is past end of life. Would you please try using a supported version of NodeJS? I recommend NodeJS 18.x which is the most recent LTS version.

@lpsinger
Copy link
Member

lpsinger commented Apr 5, 2023

I think this link will work......
https://gist.github.com/PeterBKramer/264e82962f94fc59317f60d82ddb05ea

That code sample has your CloudKit code in it. Can your reproduce the crash without it? Please isolate the simplest code that reproduces the crash.

@lpsinger
Copy link
Member

lpsinger commented Apr 5, 2023

I am trying to select the nature of the error (so I only restart if it is non-retriable) with:

     const isNonRetriableError = payload.payload.error instanceof KafkaJSNonRetriableError;

I get:

      KafkaJSNonRetriableError is not defined"

Any suggestions?

You need to require it:

const { KafkaJSNonRetriableError } = require('kafkajs')

@PeterBKramer
Copy link
Author

Great news. I had two crashes over the past few hours. One was Retriable and the system recovered itself. The other was the NonRetriable crash that has been causing the problem. Using the recovery code above, and 'require' as instructed by Leo, the system recovered itself as desired. Here is the full payload package describing the event that was recovered:

through to bottom 2023-04-07T02:54:23.591Z
here is a package

{"level":"ERROR","timestamp":"2023-04-07T02:57:47.740Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka3.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":125,"size":10}

{"level":"ERROR","timestamp":"2023-04-07T02:57:48.306Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"6dafd3e1-eeb2-4d0e-95a7-dc4609a8d498","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:95:5)"}

{"level":"INFO","timestamp":"2023-04-07T02:57:48.397Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"6dafd3e1-eeb2-4d0e-95a7-dc4609a8d498"}

WARNING WARNING sss
payload [object Object]
payload.payload [object Object]
payload.payload.error KafkaJSNonRetriableError: Request is not valid given the current SASL state
Consumer crashed on non-retriable error: restarting
KafkaJSNonRetriableError: Request is not valid given the current SASL state
at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:95:5) {
retriable: false,
helpUrl: undefined,
cause: KafkaJSProtocolError: Request is not valid given the current SASL state
at createErrorFromCode (/home/peterb2/node_modules/kafkajs/src/protocol/error.js:581:10)
at Object.parse (/home/peterb2/node_modules/kafkajs/src/protocol/requests/saslHandshake/v0/response.js:24:11)
at Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:433:35)
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at async SASLAuthenticator.authenticate (/home/peterb2/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:35:23)
at async /home/peterb2/node_modules/kafkajs/src/network/connection.js:139:9
at async Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:381:7)
at async Broker.[private:Broker:sendRequest] (/home/peterb2/node_modules/kafkajs/src/broker/index.js:904:14)
at async Broker.metadata (/home/peterb2/node_modules/kafkajs/src/broker/index.js:177:12) {
retriable: false,
helpUrl: undefined,
cause: undefined,
type: 'ILLEGAL_SASL_STATE',
code: 34
}
}
awaiting disconnect

{"level":"INFO","timestamp":"2023-04-07T02:57:48.405Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"6dafd3e1-eeb2-4d0e-95a7-dc4609a8d498"}

Going to recall KafkaConnect
reconnecting in next 5 seconds
starting
reconnected

{"level":"ERROR","timestamp":"2023-04-07T02:57:55.051Z","logger":"kafkajs","message":"[Connection] Response Metadata(key: 3, version: 6)","broker":"kafka.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Not authorized to access topics: [Topic authorization failed]","correlationId":3,"size":499}
Not all subscribed topics are available
{"level":"INFO","timestamp":"2023-04-07T02:57:55.052Z","logger":"kafkajs","message":"[Consumer] Starting","groupId":"4a7f5f26-68be-47e8-8646-485451c89997"}
{"level":"INFO","timestamp":"2023-04-07T02:57:56.457Z","logger":"kafkajs","message":"[ConsumerGroup] Consumer has joined the group","groupId":"4a7f5f26-68be-47e8-8646-485451c89997","memberId":"kafkajs-6cfb689f-2948-49be-a889-91ca7e155dae","leaderId":"kafkajs-6cfb689f-2948-49be-a889-91ca7e155dae","isLeader":true,"memberAssignment":{"gcn.classic.text.LVC_INITIAL":[0],"gcn.classic.text.LVC_PRELIMINARY":[0],"gcn.classic.text.LVC_RETRACTION":[0]},"groupProtocol":"RoundRobinAssigner","duration":1246}
found a Notice 2023-04-07T02:59:42.366Z

@lpsinger
Copy link
Member

lpsinger commented Apr 7, 2023

That's great.

I'm still trying to reproduce the crash, and I haven't seen one yet. Once again, would you please post a minimal script that reproduces the crash?

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 7, 2023

Minimal code is below.

I am running this type of code simultaneously from the same server looking at different "consumer.subscribe topic" entries. The code runs with "forever start -a filename.cjs". For example, on April 7 at 02:24:03 GMT I had a NonRetriable crash when looking at topics:[gcn.classic.text.SNEWS] but at the same time different code looking at
topics: [
'gcn.classic.text.LVC_COUNTERPART',
'gcn.classic.text.LVC_EARLY_WARNING',
'gcn.classic.text.LVC_INITIAL',
'gcn.classic.text.LVC_PRELIMINARY',
'gcn.classic.text.LVC_RETRACTION',
'gcn.classic.text.LVC_TEST',
'gcn.classic.text.LVC_UPDATE',
],
did not crash. However, that different code did crash at April 7 02:57:47. The crashes were the same:

{"level":"ERROR","timestamp":"2023-04-07T02:57:47.740Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka3.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":125,"size":10}

{"level":"ERROR","timestamp":"2023-04-07T02:57:48.306Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"6dafd3e1-eeb2-4d0e-95a7-dc4609a8d498","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:95:5)"}

KafkaJSNonRetriableError: Request is not valid given the current SASL state
at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:95:5) {
retriable: false,
helpUrl: undefined,
cause: KafkaJSProtocolError: Request is not valid given the current SASL state
at createErrorFromCode (/home/peterb2/node_modules/kafkajs/src/protocol/error.js:581:10)
at Object.parse (/home/peterb2/node_modules/kafkajs/src/protocol/requests/saslHandshake/v0/response.js:24:11)
at Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:433:35)
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at async SASLAuthenticator.authenticate (/home/peterb2/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:35:23)
at async /home/peterb2/node_modules/kafkajs/src/network/connection.js:139:9
at async Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:381:7)
at async Broker.[private:Broker:sendRequest] (/home/peterb2/node_modules/kafkajs/src/broker/index.js:904:14)
at async Broker.metadata (/home/peterb2/node_modules/kafkajs/src/broker/index.js:177:12) {
retriable: false,
helpUrl: undefined,
cause: undefined,
type: 'ILLEGAL_SASL_STATE',
code: 34
}
}

I was able to recover using code in: consumer.on('consumer.crash', async (payload) => {

MINIMAL CODE:

`var fetch=require('node-fetch');
function kafkaConnect(){
const { Kafka } = require ('gcn-kafka');
const kafka = new Kafka({
client_id: ' omitted ',
client_secret: ' omitted ',
})
var consumer;
(async () => {
console.log('starting');
consumer = kafka.consumer();
var triggerNumArray=[];
var triggerTypesMap=new Map()
try {
await consumer.subscribe({
topics: [
'igwn.gwalert'
],
})
}
catch (error) {
if (error.type === 'TOPIC_AUTHORIZATION_FAILED'){
console.warn('Not all subscribed topics are available')
} else {
console.warn('throwing an error' + error.type )
throw error
}
}
await consumer.run({
eachMessage: async (payload) => {

			const valueText=payload.message.value?.toString();
			const d=new Date();
			console.log('found a Notice                '+ d.toISOString());
	
			//  lots of stuff here unrelated to crash
			
              
   		},
	})
})()

}
setTimeout(kafkaConnect,10);`

@lpsinger
Copy link
Member

lpsinger commented Apr 7, 2023

I can't run your example:

$ node example.cjs 
/Users/lpsinger/Downloads/example/example.cjs:1
var fetch = require('node-fetch')
            ^

Error [ERR_REQUIRE_ESM]: require() of ES Module /Users/lpsinger/Downloads/example/node_modules/node-fetch/src/index.js from /Users/lpsinger/Downloads/example/example.cjs not supported.
Instead change the require of index.js in /Users/lpsinger/Downloads/example/example.cjs to a dynamic import() which is available in all CommonJS modules.
    at Object.<anonymous> (/Users/lpsinger/Downloads/example/example.cjs:1:13) {
  code: 'ERR_REQUIRE_ESM'
}

Node.js v18.15.0

Does it reproduce if I remove var fetch = require('node-fetch')?

@PeterBKramer
Copy link
Author

I see now that the line:

var fetch = require('node-fetch')

can be omitted. It is left over from:

 var fetch=require('node-fetch');
 var CloudKit=require('./cloudkit');
 CloudKit.configure({
   	services:{
           	fetch:fetch
},
    containers:[{	}]
 });   

@lpsinger
Copy link
Member

lpsinger commented Apr 7, 2023

OK. What's left is basically our own demo script, right?

Were you able to reproduce with a recent, supported version of NodeJS?

@PeterBKramer
Copy link
Author

I have not yet updated.

@PeterBKramer
Copy link
Author

New crashes to report.

System 1 recovered from a crash at April 8, 01:32:50. (at this time System 2 showed no error)
System 2 did not recover from a crash at April 8, 06:03:33 (at this time System 1 showed no error)

Crash logs:

System 1 crash:

{"level":"ERROR","timestamp":"2023-04-08T01:32:50.241Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":81,"size":10}
{"level":"ERROR","timestamp":"2023-04-08T01:32:50.242Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"b28624b8-d5b0-4a04-b138-7a8ed71f6504","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:95:5)"}

My system was able to reconnect after this crash.

System 2 crash:
{"level":"ERROR","timestamp":"2023-04-08T06:03:33.447Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":43,"size":10}
{"level":"ERROR","timestamp":"2023-04-08T06:03:33.448Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"4a7f5f26-68be-47e8-8646-485451c89997","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:95:5)"}
{"level":"INFO","timestamp":"2023-04-08T06:03:33.527Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"4a7f5f26-68be-47e8-8646-485451c89997"}

{"level":"INFO","timestamp":"2023-04-08T06:03:33.529Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"4a7f5f26-68be-47e8-8646-485451c89997"}

My system then tried to reconnect but:

{"level":"ERROR","timestamp":"2023-04-08T06:03:43.043Z","logger":"kafkajs","message":"[BrokerPool] outgoing request timed out after 3500ms","retryCount":0,"retryTime":257}

@lpsinger
Copy link
Member

@PeterBKramer, thanks for continuing to catalog these crashes.

Unfortunately, neither of my two test client instances (in AWS, with ultra-reliable network connectivity; and on my laptop, with flaky network connectivity) have reproduced this.

I can only think of two significant differences in our setups: the Node.js version, and where it is running.

Have you tried a recent version of Node.js? Are you seeing this only on your Dreamhost box? Do you also see these crashes on your own desktop machine? Do you have another physical machine or another compute resource in a different cloud provider where you can try this out?

@PeterBKramer
Copy link
Author

Thank you for continuing to explore this issue. I will try to update my version of node.js next week. Last time I tried to do that it created a number of issues. Hopefully it will be easier this time.

Regarding where I am running - I only run from my DreamHost VPS.

I run multiple scripts all using 'forever'. One script is downloading from gcn.classic.text.SNEWS and has never seen a crash. Another is downloading from 6 different gcn.classic.text.LVC_ topics and has seen multiple crashes. The third is downloading only from igwn.gwalert and sees a few crashes. The good news is that I am now able to recover cleanly from all crashes. While all my scripts are running from the same DreamHost VPS they crash at different times - and the other scripts continue to operate while one is crashing. All of the crashes generate the following errors:

{"level":"ERROR","timestamp":"2023-04-09T03:49:02.562Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":26,"size":10}
{"level":"ERROR","timestamp":"2023-04-09T03:49:02.564Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"f78c966c-2c2b-4e61-a3ac-7c83ccbd9bc0","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at runMicrotasks ()\n at processTicksAndRejections (internal/process/task_queues.js:95:5)"}
{"level":"INFO","timestamp":"2023-04-09T03:49:02.647Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"f78c966c-2c2b-4e61-a3ac-7c83ccbd9bc0"}

My code then catches the error and generates the following console.log messages:

WARNING WARNING sss
payload [object Object]
payload.payload [object Object]
payload.payload.error KafkaJSNonRetriableError: Request is not valid given the current SASL state
Consumer crashed on non-retriable error: restarting
KafkaJSNonRetriableError: Request is not valid given the current SASL state
at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:95:5) {
retriable: false,
helpUrl: undefined,
cause: KafkaJSProtocolError: Request is not valid given the current SASL state
at createErrorFromCode (/home/peterb2/node_modules/kafkajs/src/protocol/error.js:581:10)
at Object.parse (/home/peterb2/node_modules/kafkajs/src/protocol/requests/saslHandshake/v0/response.js:24:11)
at Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:433:35)
at runMicrotasks ()
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at async SASLAuthenticator.authenticate (/home/peterb2/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:35:23)
at async /home/peterb2/node_modules/kafkajs/src/network/connection.js:139:9
at async Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:381:7)
at async Broker.[private:Broker:sendRequest] (/home/peterb2/node_modules/kafkajs/src/broker/index.js:904:14)
at async Broker.metadata (/home/peterb2/node_modules/kafkajs/src/broker/index.js:177:12) {
retriable: false,
helpUrl: undefined,
cause: undefined,
type: 'ILLEGAL_SASL_STATE',
code: 34
}
}
awaiting disconnect

My code then disconnects and reconnects smoothly within a few seconds.

I have had such errors generated at the following times:

2023-04-11T13:32:24.090Z (one script)

2023-04-09T03:49:02.562Z (the other script)
2023-04-09T06:35:39.111Z

@PeterBKramer
Copy link
Author

On the continuing saga, again, before I update node.js, I record the following 3 errors from the 2 scripts over the past 2 days tying the 2 scripts' score 3 to 3:

2023-04-12T19:05:42.165Z (one script)
2023-04-13T13:10:01.917Z

2023-04-13T12:29:37.026Z (the other script)

and again - these are 'no-retrieable' and my code is able to recover smoothly.

@PeterBKramer
Copy link
Author

I have upgraded to node.js v18.16.0 but still have errors.
I capture errors and restart the connection if necessary.

I have recently received the following 2 errors:

  1. KafkaJSNonRetriableError: Request is not valid given the current SASL state:
    {"level":"ERROR","timestamp":"2023-04-18T09:12:00.925Z","logger":"kafkajs","message":"[Connection] Response SaslHandshake(key: 17, version: 1)","broker":"kafka2.gcn.nasa.gov:9092","clientId":"kafkajs","error":"Request is not valid given the current SASL state","correlationId":27,"size":10}
    {"level":"ERROR","timestamp":"2023-04-18T09:12:00.927Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: Request is not valid given the current SASL state","groupId":"5b05c77c-ebf0-4d85-a634-9a3795b6b6c3","stack":"KafkaJSNonRetriableError: Request is not valid given the current SASL state\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"}
    {"level":"INFO","timestamp":"2023-04-18T09:12:01.011Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"5b05c77c-ebf0-4d85-a634-9a3795b6b6c3"}

// At this point my code captures the error and produces this information:
payload.payload.error KafkaJSNonRetriableError: Request is not valid given the current SASL state
parsing the error message Request is not valid given the current SASL state
parsing the error stack KafkaJSNonRetriableError: Request is not valid given the current SASL state
at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
Consumer crashed on non-retriable error: restarting
KafkaJSNonRetriableError: Request is not valid given the current SASL state
at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
retriable: false,
helpUrl: undefined,
[cause]: KafkaJSProtocolError: Request is not valid given the current SASL state
at createErrorFromCode (/home/peterb2/node_modules/kafkajs/src/protocol/error.js:581:10)
at Object.parse (/home/peterb2/node_modules/kafkajs/src/protocol/requests/saslHandshake/v0/response.js:24:11)
at Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:433:35)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async SASLAuthenticator.authenticate (/home/peterb2/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:35:23)
at async /home/peterb2/node_modules/kafkajs/src/network/connection.js:139:9
at async Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:381:7)
at async [private:Broker:sendRequest] (/home/peterb2/node_modules/kafkajs/src/broker/index.js:904:14)
at async Broker.metadata (/home/peterb2/node_modules/kafkajs/src/broker/index.js:177:12)
at async /home/peterb2/node_modules/kafkajs/src/cluster/brokerPool.js:158:25 {
retriable: false,
helpUrl: undefined,
type: 'ILLEGAL_SASL_STATE',
code: 34,
[cause]: undefined
}
}

  1. Error: Client network socket disconnected before secure TLS connection was established
    followed by a:
    KafkaJSNonRetriableError: SASL OAUTHBEARER authentication failed: Not connected

{"level":"ERROR","timestamp":"2023-04-18T18:52:11.849Z","logger":"kafkajs","message":"[Connection] Connection timeout","broker":"kafka3.gcn.nasa.gov:9092","clientId":"kafkajs"}
{"level":"ERROR","timestamp":"2023-04-18T18:52:12.453Z","logger":"kafkajs","message":"[Connection] Connection error: Client network socket disconnected before secure TLS connection was established","broker":"kafka3.gcn.nasa.gov:9092","clientId":"kafkajs","stack":"Error: Client network socket disconnected before secure TLS connection was established\n at connResetException (node:internal/errors:717:14)\n at TLSSocket.onConnectEnd (node:_tls_wrap:1595:19)\n at TLSSocket.emit (node:events:525:35)\n at endReadableNT (node:internal/streams/readable:1359:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)"}
{"level":"ERROR","timestamp":"2023-04-18T18:52:12.929Z","logger":"kafkajs","message":"[SaslAuthenticator-OAUTHBEARER] SASL OAUTHBEARER authentication failed: Not connected","broker":"kafka3.gcn.nasa.gov:9092"}
{"level":"ERROR","timestamp":"2023-04-18T18:52:12.930Z","logger":"kafkajs","message":"[Consumer] Crash: KafkaJSNonRetriableError: SASL OAUTHBEARER authentication failed: Not connected","groupId":"2964bedb-d104-4b5f-90b1-88961648a27c","stack":"KafkaJSNonRetriableError: SASL OAUTHBEARER authentication failed: Not connected\n at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"}
{"level":"INFO","timestamp":"2023-04-18T18:52:13.302Z","logger":"kafkajs","message":"[Consumer] Stopped","groupId":"2964bedb-d104-4b5f-90b1-88961648a27c"}

// At this point my code captured the error and produced the following information:
payload.payload.error KafkaJSNonRetriableError: SASL OAUTHBEARER authentication failed: Not connected
Consumer crashed on non-retriable error: restarting
KafkaJSNonRetriableError: SASL OAUTHBEARER authentication failed: Not connected
at /home/peterb2/node_modules/kafkajs/src/retry/index.js:55:18
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
retriable: false,
helpUrl: undefined,
[cause]: KafkaJSSASLAuthenticationError: SASL OAUTHBEARER authentication failed: Not connected
at Object.authenticate (/home/peterb2/node_modules/kafkajs/src/broker/saslAuthenticator/oauthBearer.js:40:23)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async SASLAuthenticator.authenticate (/home/peterb2/node_modules/kafkajs/src/broker/saslAuthenticator/index.js:73:5)
at async /home/peterb2/node_modules/kafkajs/src/network/connection.js:139:9
at async Connection.send (/home/peterb2/node_modules/kafkajs/src/network/connection.js:381:7)
at async [private:Broker:sendRequest] (/home/peterb2/node_modules/kafkajs/src/broker/index.js:904:14)
at async Broker.metadata (/home/peterb2/node_modules/kafkajs/src/broker/index.js:177:12)
at async /home/peterb2/node_modules/kafkajs/src/cluster/brokerPool.js:158:25
at async /home/peterb2/node_modules/kafkajs/src/cluster/index.js:115:14 {
retriable: false,
helpUrl: undefined,
[cause]: undefined
}
}

@lpsinger
Copy link
Member

Would you please make a list of the UTC times of these crashes?

@PeterBKramer
Copy link
Author

No Kafka errors were generated in my script monitoring igwn.gwalert since 2023-04-19T00:00 (when you sent your request). I will keep this updated.

(The errors previously noted above have UTC times in their error messages)
And so far everything is 'caught' and the system stays up.

@PeterBKramer
Copy link
Author

It is now 2023-04-20T05:00:00.000Z
My auxiliary channel had a "Request is not valid given the current SASL State" error at 2023-04-19T22:14:40.779Z
My main channel had a "Request is not valid given the current SASL State" error at 2023-04-20T03:55:59.839Z
My main channel had a "The coordinator is not aware of this member" error at 2023-04-20T04:38:06.202Z
and, at almost the same time,
My auxiliary channel had a "The coordinator is not aware of this member" error at 2023-04-20T04:38:08.309Z

(My main channel subscribes to igmn.gwalert and gcn.classic.text.SNEWS.
(My auxiliary channel subscribes to gcn.classic.text.LVC_INITIAL/PRELIMINARYT/RETRACTION

@lpsinger
Copy link
Member

@PeterBKramer, as I said, would you please make a table or list of the UTC times of these crashes? My next step will be to correlate these error messages with our log files, but it's going to be a lot of work to collect all of the timestamps from your posts.

@PeterBKramer
Copy link
Author

PeterBKramer commented Apr 21, 2023

I have started a spreadsheet. There are a few different types of crashes.
KafkaCrashes.xlsx

@lpsinger
Copy link
Member

Did you close this issue intentionally?

@PeterBKramer
Copy link
Author

I did not intentionally close this issue.
My bad - again, I am a newbie here, sorry.

@lpsinger
Copy link
Member

That's alright. Thank you for putting together the spreadsheet. That's a huge help.

@PeterBKramer
Copy link
Author

Revised spreadsheet with 29 errors over 7 days; all that required restarting the script.
12 are from a script that subscribes to gcn.classic.text.LVC_ Kafkas
17 are from a script that subscribes to igwn.gwalert

Of the 17 from the igwn.gwalert script:
9 are "Response SaslHandshake" errors while sitting idle
6 are from an attempt at consumer.subscribe that fails with a 3500 ms timeout
1 is from an authentication error immediately after a "Connection timeout"
and 1 is from too many "Connection timeouts".
("Connection timeouts" also occur about once per day but usually do not cause any problem)

KafkaCrashes 4 25.xlsx

@PeterBKramer
Copy link
Author

I had an error at 2023-04-26T16:40:19.962Z from which I did not recover. It started with a typical "[Connection] Response SaslHandshake(key: 17, version: 1)". But that did not generate the typical crash which my code can catch and recover from. Instead it was followed by a:

{"level":"ERROR","timestamp":"2023-04-26T16:40:19.963Z","logger":"kafkajs","message":"[BrokerPool] Failed to connect to broker, reconnecting","retryCount":0,"retryTime":261}

that ended 13 seconds later with a:

{"level":"ERROR","timestamp":"2023-04-26T16:40:32.225Z","logger":"kafkajs","message":"[BrokerPool] KafkaJSLockTimeout: Timeout while acquiring lock (1 waiting locks): "connect to broker kafka2.gcn.nasa.gov:9092"","retryCount":1,"retryTime":542,"stack":"KafkaJSLockTimeout: Timeout while acquiring lock (1 waiting locks): "connect to broker kafka2.gcn.nasa.gov:9092"\n at Timeout._onTimeout (/home/peterb2/node_modules/kafkajs/src/utils/lock.js:48:23)\n at listOnTimeout (node:internal/timers:569:17)\n at process.processTimers (node:internal/timers:512:7)"}

and the script stopped.

@PeterBKramer
Copy link
Author

Final spreadsheet containing 35 crashes from the script subscribed to gcn.classic.text.LVC_ Kafkas and 37 crashes from the script subscribed to igwn.gwalert and gcn.classic.text.SNEWS
KafkaCrashes.xlsx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants