Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nil pointer dereference when setting partition leader #354

Open
dvolodin7 opened this issue Jun 23, 2021 · 4 comments
Open

nil pointer dereference when setting partition leader #354

dvolodin7 opened this issue Jun 23, 2021 · 4 comments

Comments

@dvolodin7
Copy link
Contributor

Found weird nil pointer dereference issue. Simple 3-node cluster in docker-compose. Pulled up to v1.6.0.
Suddenly, 2nd node starts to crash on startup during the raft election.

liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="Liftbridge Version:        v1.6.0"
liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="Server ID:                 liftbridge2"
liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="Namespace:                 liftbridge-default"
liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="NATS Servers:              [nats://nats:4222/]"
liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="Default Retention Policy:  [Age: 1 day, Compact: true]"
liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="Default Partition Pausing: disabled"
liftbridge2_1        | time="2021-06-23 08:03:10" level=info msg="Starting Liftbridge server on 0.0.0.0:9292..."
liftbridge2_1        | panic: runtime error: invalid memory address or nil pointer dereference
liftbridge2_1        | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xdcf642]
liftbridge2_1        | 
liftbridge2_1        | goroutine 1 [running]:
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*replica).updateLatestOffset(0x0, 0xffffffffffffffff, 0xc0002bad90)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:40 +0x22
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*partition).becomeLeader(0xc000497e00, 0x5f48, 0x32, 0xc00045e4c0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:494 +0xe1
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*partition).startLeadingOrFollowing(0xc000497e00, 0xc000128948, 0x10c3018)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:430 +0x3c3
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*partition).SetLeader(0xc000497e00, 0xc0000b3240, 0xb, 0x5f48, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:402 +0x117
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).addPartition(0xc000174b00, 0xc000448af0, 0xc000412900, 0x0, 0xc0000bcd80, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/metadata.go:780 +0xf3
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).AddStream(0xc000174b00, 0xc000133810, 0x0, 0x0, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/metadata.go:740 +0x3fd
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).applyCreateStream(0xc0002b6b40, 0xc000133810, 0x0, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/fsm.go:367 +0x49
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).Restore(0xc0002b6b40, 0x121bfc0, 0xc000106c20, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/fsm.go:347 +0x344
liftbridge2_1        | github.com/hashicorp/raft.(*Raft).restoreSnapshot(0xc00041e000, 0x1, 0x1)
liftbridge2_1        | 	/go/pkg/mod/github.com/hashicorp/[email protected]/api.go:594 +0x28d
liftbridge2_1        | github.com/hashicorp/raft.NewRaft(0xc000129150, 0x12243c0, 0xc0002b6b40, 0x1236c00, 0xc0000bbc40, 0x122a9c0, 0xc0000a8f40, 0x1223980, 0xc0003861b0, 0x123d280, ...)
liftbridge2_1        | 	/go/pkg/mod/github.com/hashicorp/[email protected]/api.go:542 +0x801
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).createRaftNode(0xc0002b6b40, 0xc000129238, 0x47bc1c, 0xc0002a15b0, 0xc0001a8600)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/raft.go:395 +0x6a2
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).setupMetadataRaft(0xc0002b6b40, 0x10985e3, 0x23, 0xc000064110)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/raft.go:203 +0x40
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).Start(0xc0002b6b40, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/server.go:178 +0xa1b
liftbridge2_1        | main.start(0xc00026ef20, 0x0, 0xc0001cf940)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/main.go:39 +0xbe
liftbridge2_1        | github.com/urfave/cli.HandleAction(0xebfc20, 0x10c26f8, 0xc00026ef20, 0xc00026ef20, 0x0)
liftbridge2_1        | 	/go/pkg/mod/github.com/urfave/[email protected]/app.go:526 +0xbe
liftbridge2_1        | github.com/urfave/cli.(*App).Run(0xc0002ac1c0, 0xc000094150, 0x3, 0x3, 0x0, 0x0)
liftbridge2_1        | 	/go/pkg/mod/github.com/urfave/[email protected]/app.go:288 +0x5f8
liftbridge2_1        | main.main()
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/main.go:24 +0x11a
noc_liftbridge2_1 exited with code 2

I've tried to clean 2nd node data directiory without visible result.

I can suppose the reason in following code in partition.go

	// Update this replica's latest offset to ensure it's up to date.
	rep := p.isr[p.srv.config.Clustering.ServerID]
	rep.updateLatestOffset(p.log.NewestOffset())

p.isr contains nil pointer, set somewhere before.

@tylertreat
Copy link
Member

Are you able to reproduce the issue reliably?

@dvolodin7
Copy link
Contributor Author

Yes. This node fails every restart.

@tylertreat
Copy link
Member

It seems like what's happening is that the node is becoming leader for a partition but the node is not part of the ISR, thus p.isr[p.srv.config.Clustering.ServerID] is returning nil. Does one of the other two nodes in the cluster not take over as leader when this node crashes?

Do you have the logs leading up to the crash? It would seem this node was removed from the ISR and then somehow became leader for the partition.

@dvolodin7
Copy link
Contributor Author

Crash occuring when I'm starting 3 liftbridge nodes in docker containers. There is full log from all 3 nodes.

Starting noc_liftbridge2_1 ... done
Starting noc_liftbridge1_1 ... done
Starting noc_liftbridge3_1 ... done
Attaching to noc_liftbridge2_1, noc_liftbridge1_1, noc_liftbridge3_1
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="Liftbridge Version:        v1.6.0"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="Liftbridge Version:        v1.6.0"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="Server ID:                 liftbridge2"
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="Server ID:                 liftbridge1"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="Namespace:                 liftbridge-default"
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="Namespace:                 liftbridge-default"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="NATS Servers:              [nats://nats:4222/]"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="Default Retention Policy:  [Age: 1 day, Compact: true]"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="Default Partition Pausing: disabled"
liftbridge2_1        | time="2021-07-04 09:14:57" level=info msg="Starting Liftbridge server on 0.0.0.0:9292..."
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="NATS Servers:              [nats://nats:4222/]"
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="Default Retention Policy:  [Age: 1 day, Compact: true]"
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="Default Partition Pausing: disabled"
liftbridge1_1        | time="2021-07-04 09:14:57" level=info msg="Starting Liftbridge server on 0.0.0.0:9292..."
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="Liftbridge Version:        v1.6.0"
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="Server ID:                 liftbridge3"
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="Namespace:                 liftbridge-default"
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="NATS Servers:              [nats://nats:4222/]"
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="Default Retention Policy:  [Age: 1 day, Compact: true]"
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="Default Partition Pausing: disabled"
liftbridge3_1        | time="2021-07-04 09:14:57" level=info msg="Starting Liftbridge server on 0.0.0.0:9292..."
liftbridge2_1        | panic: runtime error: invalid memory address or nil pointer dereference
liftbridge2_1        | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xdcf642]
liftbridge2_1        | 
liftbridge2_1        | goroutine 1 [running]:
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*replica).updateLatestOffset(0x0, 0xffffffffffffffff, 0xc000290d90)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:40 +0x22
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*partition).becomeLeader(0xc00039d500, 0x5f48, 0x32, 0xc00000c640)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:494 +0xe1
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*partition).startLeadingOrFollowing(0xc00039d500, 0xc0004d6948, 0x10c3018)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:430 +0x3c3
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*partition).SetLeader(0xc00039d500, 0xc0001571f0, 0xb, 0x5f48, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/partition.go:402 +0x117
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).addPartition(0xc00014ea80, 0xc0000fe690, 0xc00042c900, 0x0, 0xc000304ea0, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/metadata.go:780 +0xf3
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*metadataAPI).AddStream(0xc00014ea80, 0xc000318460, 0x0, 0x0, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/metadata.go:740 +0x3fd
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).applyCreateStream(0xc00028cd80, 0xc000318460, 0x0, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/fsm.go:367 +0x49
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).Restore(0xc00028cd80, 0x121bfc0, 0xc0002d47c0, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/fsm.go:347 +0x344
liftbridge2_1        | github.com/hashicorp/raft.(*Raft).restoreSnapshot(0xc000446000, 0x1, 0x1)
liftbridge2_1        | 	/go/pkg/mod/github.com/hashicorp/[email protected]/api.go:594 +0x28d
liftbridge2_1        | github.com/hashicorp/raft.NewRaft(0xc0004d7150, 0x12243c0, 0xc00028cd80, 0x1236c00, 0xc0001543c0, 0x122a9c0, 0xc000152320, 0x1223980, 0xc000158a50, 0x123d280, ...)
liftbridge2_1        | 	/go/pkg/mod/github.com/hashicorp/[email protected]/api.go:542 +0x801
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).createRaftNode(0xc00028cd80, 0xc000125238, 0x47bc1c, 0xc0002755b0, 0xc00017a680)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/raft.go:395 +0x6a2
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).setupMetadataRaft(0xc00028cd80, 0x10985e3, 0x23, 0xc000102a00)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/raft.go:203 +0x40
liftbridge2_1        | github.com/liftbridge-io/liftbridge/server.(*Server).Start(0xc00028cd80, 0x0, 0x0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/server/server.go:178 +0xa1b
liftbridge2_1        | main.start(0xc00024d080, 0x0, 0xc0001a1ae0)
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/main.go:39 +0xbe
liftbridge2_1        | github.com/urfave/cli.HandleAction(0xebfc20, 0x10c26f8, 0xc00024d080, 0xc00024d080, 0x0)
liftbridge2_1        | 	/go/pkg/mod/github.com/urfave/[email protected]/app.go:526 +0xbe
liftbridge2_1        | github.com/urfave/cli.(*App).Run(0xc0002801c0, 0xc000096150, 0x3, 0x3, 0x0, 0x0)
liftbridge2_1        | 	/go/pkg/mod/github.com/urfave/[email protected]/app.go:288 +0x5f8
liftbridge2_1        | main.main()
liftbridge2_1        | 	/go/src/github.com/liftbridge-io/liftbridge/main.go:24 +0x11a
noc_liftbridge2_1 exited with code 2
liftbridge1_1        | time="2021-07-04 09:15:00" level=error msg="Failed to fetch last offset for leader epoch for partition [subject=ch.alarms, stream=ch.alarms, partition=0]: nats: timeout"
liftbridge3_1        | time="2021-07-04 09:15:00" level=error msg="Failed to fetch last offset for leader epoch for partition [subject=__offset.message, stream=__offset.message, partition=0]: nats: timeout"
liftbridge1_1        | time="2021-07-04 09:15:03" level=error msg="Error sending replication request for partition [subject=ch.alarms, stream=ch.alarms, partition=0]: nats: timeout"
liftbridge3_1        | time="2021-07-04 09:15:03" level=error msg="Error sending replication request for partition [subject=__offset.message, stream=__offset.message, partition=0]: nats: timeout"
liftbridge3_1        | time="2021-07-04 09:15:04" level=error msg="Failed to fetch last offset for leader epoch for partition [subject=ch.sla, stream=ch.sla, partition=0]: nats: timeout"
liftbridge1_1        | time="2021-07-04 09:15:05" level=error msg="Failed to fetch last offset for leader epoch for partition [subject=ch.mac, stream=ch.mac, partition=0]: nats: timeout"

For reason unexplained the second node crashes just after start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants