Skip to content

Commit

Permalink
Implement RETRY room lifecycle operation
Browse files Browse the repository at this point in the history
Based on the spec referenced in 20f21c7. The RETRY part of this spec was
quite unclear, and I asked quite a few questions on the PR to understand
it better, so the behaviour implemented here is based on the spec plus
Andy’s answers to my questions (I’ve linked to the discussions in the
code and / or tests). Recently (i.e. after most of this commit was
already implemented, Andy has updated the spec with answers to these
questions, but in the interests of not dragging out the current task,
I’ll incorporate these updates in #66.

The internal triggering of the RETRY operation (as specified by
CHA-RL1h3 and CHA-RL4b9) will come in #50.

Resolves #51.
  • Loading branch information
lawrence-forooghian committed Nov 12, 2024
1 parent cee3d29 commit cb45a3f
Show file tree
Hide file tree
Showing 3 changed files with 478 additions and 20 deletions.
170 changes: 153 additions & 17 deletions Sources/AblyChat/RoomLifecycleManager.swift
Original file line number Diff line number Diff line change
Expand Up @@ -181,11 +181,12 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
internal enum Status: Equatable {
case initialized
case attachingDueToAttachOperation(attachOperationID: UUID)
case attachingDueToRetryOperation(retryOperationID: UUID)
case attachingDueToContributorStateChange(error: ARTErrorInfo?)
case attached
case detaching(detachOperationID: UUID)
case detached
case suspended(error: ARTErrorInfo)
case suspended(retryOperationID: UUID, error: ARTErrorInfo)
case failed(error: ARTErrorInfo)
case releasing(releaseOperationID: UUID)
case released
Expand All @@ -196,6 +197,8 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
.initialized
case .attachingDueToAttachOperation:
.attaching(error: nil)
case .attachingDueToRetryOperation:
.attaching(error: nil)
case let .attachingDueToContributorStateChange(error: error):
.attaching(error: error)
case .attached:
Expand All @@ -204,7 +207,7 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
.detaching
case .detached:
.detached
case let .suspended(error):
case let .suspended(_, error):
.suspended(error: error)
case let .failed(error):
.failed(error: error)
Expand All @@ -219,12 +222,15 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
switch self {
case let .attachingDueToAttachOperation(attachOperationID):
attachOperationID
case let .attachingDueToRetryOperation(retryOperationID):
retryOperationID
case let .detaching(detachOperationID):
detachOperationID
case let .releasing(releaseOperationID):
releaseOperationID
case .suspended,
.initialized,
case let .suspended(retryOperationID, _):
retryOperationID
case .initialized,
.attached,
.detached,
.failed,
Expand Down Expand Up @@ -462,7 +468,7 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor

clearTransientDisconnectTimeouts()

changeStatus(to: .suspended(error: reason))
changeStatus(to: .suspended(retryOperationID: UUID(), error: reason))
}
case .attaching:
if !hasOperationInProgress, !contributorAnnotations[contributor].hasTransientDisconnectTimeout {
Expand Down Expand Up @@ -532,7 +538,7 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor

/// Whether the room lifecycle manager currently has a room lifecycle operation in progress.
///
/// - Warning: I haven’t yet figured out the exact meaning of “has an operation in progress” — at what point is an operation considered to be no longer in progress? Is it the point at which the operation has updated the manager’s status to one that no longer indicates an in-progress operation (this is the meaning currently used by `hasOperationInProgress`)? Or is it the point at which the `bodyOf*Operation` method for that operation exits (i.e. the point at which ``performAnOperation(_:)`` considers the operation to have completed)? Does it matter? I’ve chosen to not think about this very much right now, but might need to revisit. See TODO against `emitPendingDiscontinuityEvents` in `bodyOfDetachOperation` for an example of something where these two notions of “has an operation in progress” are not equivalent.
/// - Warning: I haven’t yet figured out the exact meaning of “has an operation in progress” — at what point is an operation considered to be no longer in progress? Is it the point at which the operation has updated the manager’s status to one that no longer indicates an in-progress operation (this is the meaning currently used by `hasOperationInProgress`)? Or is it the point at which the `bodyOf*Operation` method for that operation exits (i.e. the point at which ``performAnOperation(_:)`` considers the operation to have completed)? Does it matter? I’ve chosen to not think about this very much right now, but might need to revisit. See TODO against `emitPendingDiscontinuityEvents` in `performAttachmentCycle` for an example of something where these two notions of “has an operation in progress” are not equivalent.
private var hasOperationInProgress: Bool {
status.operationID != nil
}
Expand Down Expand Up @@ -675,7 +681,7 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
case .released:
// CHA-RL1c
throw ARTErrorInfo(chatError: .roomIsReleased)
case .initialized, .suspended, .attachingDueToAttachOperation, .attachingDueToContributorStateChange, .detached, .detaching, .failed:
case .initialized, .suspended, .attachingDueToAttachOperation, .attachingDueToRetryOperation, .attachingDueToContributorStateChange, .detached, .detaching, .failed:
break
}

Expand All @@ -684,8 +690,39 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
try? await waitForCompletionOfOperationWithID(currentOperationID, waitingOperationID: operationID)
}

try await performAttachmentCycle(trigger: .attachOperation(id: operationID))
}

private enum AttachmentCycleTrigger {
case attachOperation(id: UUID)
case retryOperation(id: UUID)

var attachingStatus: Status {
switch self {
case let .attachOperation(id):
.attachingDueToAttachOperation(attachOperationID: id)
case let .retryOperation(id):
.attachingDueToRetryOperation(retryOperationID: id)
}
}

// TODO: document
func createRetryOperationID() -> UUID {
switch self {
case .attachOperation:
// i.e. generate a new ID
UUID()
case let .retryOperation(id):
// i.e. keep the ID of the existing RETRY operation
id
}
}
}

// TODO: explain what this is, and what I’ve guessed
private func performAttachmentCycle(trigger: AttachmentCycleTrigger) async throws {
// CHA-RL1e
changeStatus(to: .attachingDueToAttachOperation(attachOperationID: operationID))
changeStatus(to: trigger.attachingStatus)

// CHA-RL1f
for contributor in contributors {
Expand All @@ -701,18 +738,34 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
case .suspended:
// CHA-RL1h2
let error = ARTErrorInfo(chatError: .attachmentFailed(feature: contributor.feature, underlyingError: contributorAttachError))
changeStatus(to: .suspended(error: error))

// CHA-RL1h3
throw error
let retryOperationID = trigger.createRetryOperationID()
changeStatus(to: .suspended(retryOperationID: retryOperationID, error: error))

switch trigger {
case .attachOperation:
// CHA-RL1h3
throw error
case .retryOperation:
// CHA-RL5f3 (TODO confirm that the idea is continue the existing RETRY operation, not to schedule a new one)
// TODO: it’s not great that there’s recursion here, could have stack overflow; would be better as a loop
logger.log(message: "Attachment cycle will continue existing RETRY operation with ID \(retryOperationID)", level: .debug)
await bodyOfRetryOperation(operationID: retryOperationID, triggeredByContributor: contributor)
return
}
case .failed:
// CHA-RL1h4
let error = ARTErrorInfo(chatError: .attachmentFailed(feature: contributor.feature, underlyingError: contributorAttachError))
changeStatus(to: .failed(error: error))

// CHA-RL1h5
// TODO: Implement the "asynchronously with respect to CHA-RL1h4" part of CHA-RL1h5 (https://github.com/ably-labs/ably-chat-swift/issues/50)
await detachNonFailedContributors()
switch trigger {
case .attachOperation:
await detachNonFailedContributors()
case .retryOperation:
// TODO: _are_ we meant to do the CHA-RL1h5 in the CHA-RL5f case? (https://github.com/ably/specification/pull/200/files#r1829964805)
break
}

throw error
default:
Expand All @@ -728,8 +781,8 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
// CHA-RL1g1
changeStatus(to: .attached)

// CHA-RL1g2
// TODO: It’s not clear to me whether this is considered to be part of the ATTACH operation or not; see the note on the ``hasOperationInProgress`` property
// CHA-RL1g2, CHA-RL5f1
// TODO: It’s not clear to me whether this is considered to be part of the ATTACH / RETRY operation or not; see the note on the ``hasOperationInProgress`` property
await emitPendingDiscontinuityEvents()
}

Expand Down Expand Up @@ -798,7 +851,7 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
case .failed:
// CHA-RL2d
throw ARTErrorInfo(chatError: .roomInFailedState)
case .initialized, .suspended, .attachingDueToAttachOperation, .attachingDueToContributorStateChange, .attached, .detaching:
case .initialized, .suspended, .attachingDueToAttachOperation, .attachingDueToRetryOperation, .attachingDueToContributorStateChange, .attached, .detaching:
break
}

Expand Down Expand Up @@ -897,7 +950,7 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
// See note on waitForCompletionOfOperationWithID for the current need for this force try
// swiftlint:disable:next force_try
return try! await waitForCompletionOfOperationWithID(releaseOperationID, waitingOperationID: operationID)
case .initialized, .attached, .attachingDueToAttachOperation, .attachingDueToContributorStateChange, .detaching, .suspended, .failed:
case .initialized, .attached, .attachingDueToAttachOperation, .attachingDueToRetryOperation, .attachingDueToContributorStateChange, .detaching, .suspended, .failed:
break
}

Expand Down Expand Up @@ -935,4 +988,87 @@ internal actor DefaultRoomLifecycleManager<Contributor: RoomLifecycleContributor
// CHA-RL3g
changeStatus(to: .released)
}

// MARK: - RETRY operation

/// Implements CHA-RL5’s RETRY operation.
///
/// - Parameters:
/// - forcedOperationID: Allows tests to force the operation to have a given ID. In combination with the ``testsOnly_subscribeToOperationWaitEvents`` API, this allows tests to verify that one test-initiated operation is waiting for another test-initiated operation.
/// - triggeringContributor: This is, in the language of CHA-RL5d, “the channel that caused the retry loop”.
internal func performRetryOperation(testsOnly_forcingOperationID forcedOperationID: UUID? = nil, triggeredByContributor triggeringContributor: Contributor) async {
await performAnOperation(forcingOperationID: forcedOperationID) { operationID in
await bodyOfRetryOperation(operationID: operationID, triggeredByContributor: triggeringContributor)
}
}

private func bodyOfRetryOperation(operationID: UUID, triggeredByContributor triggeringContributor: Contributor) async {
detachAllContributorsLoop: while true {
// CHA-RL5a - This is actually what’s written in the spec plus Andy’s clarification from https://github.com/ably/specification/pull/200/files#r1794116352 that "the implementation is supposed to be that all channels _except_ the one that entered suspended should be sent to detached" (TODO remove this comment once spec updated)
for contributor in contributors where contributor.id != triggeringContributor.id {
do {
logger.log(message: "RETRY operation will detach contributor \(contributor)", level: .debug)
try await contributor.channel.detach()
} catch {
let contributorState = await contributor.channel.state
if contributorState == .failed {
// CHA-RL5c
guard let contributorErrorReason = await contributor.channel.errorReason else {
// TODO: We assume this will be populated, but working in a multi-threaded environment means it might not be (https://github.com/ably-labs/ably-chat-swift/issues/49)
preconditionFailure("Contributor entered FAILED but its errorReason is not set")
}
// TODO: which error to use? https://github.com/ably/specification/pull/200/files#r1827819961
changeStatus(to: .failed(error: contributorErrorReason))
return
} else {
// CHA-RL5b - This is actually what’s written in the spec plus Andy’s clarification from https://github.com/ably/specification/pull/200/files#r1794116352 that this means "If the operation above fails because a channel has entered a state other than FAILED" (TODO remove this comment once spec updated)
// TODO: This duration is a guess, find out correct duration (https://github.com/ably/specification/pull/200/files#r1825086288)
let waitDuration = 0.5
logger.log(message: "Got error \(error) attempting to detach contributor \(contributor); will retry detaching contributors in \(waitDuration)s", level: .debug)
// TODO: what do to if cancelled?
try! await clock.sleep(timeInterval: waitDuration)
// TODO: this is a bit messy
continue detachAllContributorsLoop
}
}
}

break
}

// TODO: what if this had already happened sometime during the above detaches?
logger.log(message: "RETRY waiting for \(triggeringContributor) to enter ATTACHED", level: .debug)
waitForAttached: for await stateChange in await triggeringContributor.channel.subscribeToState() {
switch stateChange.current {
// CHA-RL5d
case .attached:
logger.log(message: "RETRY completed waiting for \(triggeringContributor) to enter ATTACHED", level: .debug)
break waitForAttached
// CHA-RL5e
case .failed:
guard let contributorErrorReason = stateChange.reason else {
preconditionFailure("Contributor entered FAILED but state change’s reason is not set")
}
logger.log(message: "RETRY failed waiting for \(triggeringContributor) to enter ATTACHED, since it entered FAILED with error \(contributorErrorReason)", level: .debug)

// TODO: which error to use? https://github.com/ably/specification/pull/200/files#r1829344042
changeStatus(to: .failed(error: contributorErrorReason))
return
case .attaching, .detached, .detaching, .initialized, .suspended:
break
@unknown default:
break
}
}

// CHA-RL5f
// TODO: put spec points for the other CHA-RL5f1* points once they’re clarified
do {
try await performAttachmentCycle(trigger: .retryOperation(id: operationID))
} catch {
// CHA-RL5f2
// TODO: document the circumstances in which the attachment cycle throws an error (it should only be FAILED)
logger.log(message: "RETRY operation attachment cycle resulted in \(error); ending RETRY", level: .debug)
}
}
}
Loading

0 comments on commit cb45a3f

Please sign in to comment.