Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refine error messages about not having a propolis address #7263

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
41 changes: 25 additions & 16 deletions nexus/src/app/instance.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1610,11 +1610,19 @@ impl super::Nexus {
if let Some(max_bytes) = params.max_bytes {
request = request.max_bytes(max_bytes);
}
if let Some(from_start) = params.from_start {
request = request.from_start(from_start);
}
if let Some(most_recent) = params.most_recent {
request = request.most_recent(most_recent);
match (params.from_start, params.most_recent) {
(Some(from_start), None) => {
request = request.from_start(from_start);
}
(None, Some(most_recent)) => {
request = request.most_recent(most_recent);
}
_ => {
return Err(Error::invalid_request(
"Exactly one of 'from_start' \
or 'most_recent' must be specified.",
));
Comment on lines +1621 to +1624
Copy link
Member Author

@iximeow iximeow Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't super love this, and maybe @hawkw's new dropshot error type stuff can help? this reproduces a check that's done in both Propolis and propolis-sim. when they see invalid parameters, they respond with a 400 whose message is the same as here. my thinking is, the connection to Propolis could error for other reasons, and for any other reason it would be correctly categorized as a 500. and i didn't want to just forward any 400 error from Propolis out to users here because Propolis errors aren't necessarily intended for direct end-user consumption...

if parameter validation gets more strict on the Propolis side, though, things checked there and not here will be 500s..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, hm. I think once #7196 we could have Propolis return a structured error here. But, since we are constructing the request in Nexus, I think it's reasonable-ish to say that any invalid request sent to Propolis is arguably an "internal server error" and we should be validating it here beforehand. On the other hand, duplicating the validation also feels bad. I dunno...

}
}
let data = request
.send()
Expand Down Expand Up @@ -1714,29 +1722,30 @@ impl super::Nexus {
match vmm.runtime.state {
DbVmmState::Running
| DbVmmState::Rebooting
| DbVmmState::Migrating => {
Ok((vmm.clone(), SocketAddr::new(vmm.propolis_ip.ip(), vmm.propolis_port.into())))
}
| DbVmmState::Migrating => Ok((
vmm.clone(),
SocketAddr::new(
vmm.propolis_ip.ip(),
vmm.propolis_port.into(),
),
)),

DbVmmState::Starting
| DbVmmState::Stopping
| DbVmmState::Stopped
| DbVmmState::Failed
| DbVmmState::Creating => {
| DbVmmState::Creating
| DbVmmState::Destroyed
| DbVmmState::SagaUnwound => {
Err(Error::invalid_request(format!(
"cannot connect to serial console of instance in state \"{}\"",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i tried to make this a stronger error type than Error here, so we could have nice "serial console" errors for the serial console path, and "propolis" errors for the region{, snapshot}_replacement saga path. but this runs into a bunch of new wrinkles like, instance_lookup.lookup_for() can fail in an Error-y way, db_datastore.instance_fetch_with_vmm() can fail in an Error-y way.

so after spinning a little too long on error plumbing here, i decided to take the different approach of picking the lowest common denominator message that makes sense for any caller of propolis_addr_for_instance. maybe "cannot connect to admin socket" or something would be more accurate, but i don't want to be confusing in a user-facing error message - there's no secret backdoor socket in your instance! we're just talking to the VMM!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think just making it generic is fine. 's fair. a slightly different wording is something like "this operation cannot be performed on instances in state ...", which is also generic but suggests that the specific thing you tried to do can't be done in that state. I worry a little that "administer an instance" is broad enough that it does include operations you can perform in these states --- i.e., is deleting an instance "administering" it?

"cannot administer instance in state \"{}\"",
state.effective_state(),
)))
}

DbVmmState::Destroyed | DbVmmState::SagaUnwound => Err(Error::invalid_request(
"cannot connect to serial console of instance in state \"Stopped\"",
)),
}
} else {
Err(Error::invalid_request(format!(
"instance is {} and has no active serial console \
server",
"instance is {} and cannot be administered",
state.effective_state(),
)))
}
Expand Down
90 changes: 90 additions & 0 deletions nexus/tests/integration_tests/instances.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5430,6 +5430,8 @@ async fn test_instance_serial(cptestctx: &ControlPlaneTestContext) {
// Make sure we get a 404 if we try to access the serial console before creation.
let instance_serial_url =
get_instance_url(format!("{}/serial-console", instance_name).as_str());
let instance_serial_stream_url =
get_instance_url(format!("{}/serial-console", instance_name).as_str());
let error: HttpErrorResponseBody = NexusRequest::expect_failure(
client,
StatusCode::NOT_FOUND,
Expand Down Expand Up @@ -5517,6 +5519,47 @@ async fn test_instance_serial(cptestctx: &ControlPlaneTestContext) {
}
assert_eq!(&actual[..expected.len()], expected);

// Now try querying the serial output history endpoint with bad parameters.
// Check that the errors are indicative here.
let builder =
RequestBuilder::new(client, http::Method::GET, &instance_serial_url)
.expect_status(Some(http::StatusCode::BAD_REQUEST));

let error = NexusRequest::new(builder)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<dropshot::HttpErrorResponseBody>()
.unwrap();

// We provided neither from_start nor most_recent...
assert_eq!(
error.message,
"Exactly one of 'from_start' or 'most_recent' must be specified.",
);

let builder = RequestBuilder::new(
client,
http::Method::GET,
&format!("{}&from_start=0&most_recent=0", instance_serial_url),
)
.expect_status(Some(http::StatusCode::BAD_REQUEST));

let error = NexusRequest::new(builder)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<dropshot::HttpErrorResponseBody>()
.unwrap();

// We provided both from_start and most_recent...
assert_eq!(
error.message,
"Exactly one of 'from_start' or 'most_recent' must be specified.",
);

// Request a halt and verify both the immediate state and the finished state.
let instance = instance_next;
let instance_next =
Expand All @@ -5527,6 +5570,53 @@ async fn test_instance_serial(cptestctx: &ControlPlaneTestContext) {
> instance.runtime.time_run_state_updated
);

// As the instance is now stopping, we can't connect to its serial console
// anymore. We also can't get its cached data; the (simulated) Propolis is
// going away.

let builder = RequestBuilder::new(
client,
http::Method::GET,
&instance_serial_stream_url,
)
.expect_status(Some(http::StatusCode::BAD_REQUEST));

let error = NexusRequest::new(builder)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<dropshot::HttpErrorResponseBody>()
.unwrap();

assert_eq!(
error.message,
"cannot administer instance in state \"stopping\"",
);

// Have to pass some offset for the cached data request otherwise we'll
// error out of Nexus early on while validating parameters, before
// discovering the serial console is unreachable.
let builder = RequestBuilder::new(
client,
http::Method::GET,
&format!("{}&from_start=0", instance_serial_url),
)
.expect_status(Some(http::StatusCode::BAD_REQUEST));

let error = NexusRequest::new(builder)
.authn_as(AuthnMode::PrivilegedUser)
.execute()
.await
.unwrap()
.parsed_body::<dropshot::HttpErrorResponseBody>()
.unwrap();

assert_eq!(
error.message,
"cannot administer instance in state \"stopping\"",
);

let instance = instance_next;
instance_simulate(nexus, &instance_id).await;
let instance_next =
Expand Down
Loading