'Service Bus Namespace' Continues Running Even After $ make tre-stop #3953

Open
BiologyGeek opened this issue May 27, 2024 · 40 comments
Labels: question (Further information is requested)

@BiologyGeek

Hello team,

Is it expected behavior for the 'Service Bus Namespace' to keep running even after executing the $ make tre-stop command?

This screenshot was captured after running the $ make tre-stop command:
image

Given that the Premium tier of this service is not inexpensive, is there a way to turn it off or disable it when not needed?

BiologyGeek added the "question" (Further information is requested) label May 27, 2024
@jonnyry
Collaborator

jonnyry commented May 28, 2024

It's not possible to temporarily stop the Service Bus (or suspend the billing) without deletion. Thread below when I posed a similar question:

#3782

@BiologyGeek
Author

BiologyGeek commented Jun 5, 2024

> It's not possible to temporarily stop the Service Bus (or suspend the billing) without deletion. Thread below when I posed a similar question:
>
> #3782

Thank you @jonnyry!

I deleted the 'Service Bus Namespace', but this resulted in abnormal activity in the 'Log Analytics workspace' and a lot of data ingestion, which caused a higher cost than the Service Bus Namespace itself.

Is there a way to prevent abnormal activity after removing the Service Bus Namespace? @marrobi

image

@marrobi
Member

marrobi commented Jun 5, 2024

@BiologyGeek I guess the logs are coming from the API web app and resource processor VMSS.

As per the other issue you raised, stopping the web app won't save money, but stopping both of them should hopefully stop these errors being logged.

@marrobi
Member

marrobi commented Jun 5, 2024

It might be that someone could look at using a Standard SKU Service Bus, with a config option for users who don't require the Service Bus to be on a private network - for example for development purposes.

@jonnyry
Collaborator

jonnyry commented Jun 5, 2024

Would it be possible to switch it out for one of the other (less expensive) queue/event type Azure services... Queue Storage, Event Grid, Event Hubs... are there features/characteristics of the Service Bus that we specifically require?

@marrobi
Member

marrobi commented Jun 5, 2024

I think it's session support. @damoodamoo @tamirkamara may be able to advise.

@damoodamoo
Member

We do require session support for ordered delivery, unfortunately. I think that's also part of the Standard SKU though, so a 'dev' switch to allow it to be deployed in Standard would probably be the best bet...

@jonnyry
Collaborator

jonnyry commented Jun 27, 2024

Thanks @marrobi @damoodamoo

I'm guessing some additional network configuration would be required once the Service Bus SKU was switched to Standard, since the private endpoints & VNET integration are no longer available...

  • An outbound app rule in the Firewall to allow access to the Service Bus public FQDN on ports 5671, 5672, 443. It would be nice to tie down the source IPs/subnets if possible.
  • Adding an IP restriction on the Service Bus for the Firewall's public IP? This appears to be possible via the REST API, even though it's not visible in the Azure portal.

Would that do it?

In terms of locking down the source IPs/subnets, am I right in thinking the following components connect to the Service Bus?:

  • Resource Processor
  • API
  • Airlock Processor Function App
  • anything else?

@marrobi
Member

marrobi commented Jun 27, 2024

I think that's it.

Re the firewall rules, you can do it in an ARM template ( https://learn.microsoft.com/en-us/azure/service-bus-messaging/service-bus-ip-filtering#use-resource-manager-template ), so if it's not supported in Terraform, I would think it can be done using the AzAPI provider.

@marrobi
Member

marrobi commented Jun 27, 2024

I think it can be done here - https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/servicebus_namespace#ip_rules

@jonnyry
Collaborator

jonnyry commented Jun 27, 2024

Thanks.

Re opening the Azure Firewall outbound to the Service Bus FQDN: it is looking trickier than I first thought -

  • a network rule is required rather than application rule since we need to open ports 443,5671,5672
  • network rules don't support FQDN resolution, unless you turn on DNS on the Firewall Policy. The TRE has it turned off currently, plus it's only supported in Firewall Standard & Premium - we could do this, though it doesn't help in achieving the aim of lowering the SKUs for non-prod TRE instances.

Attempted to use a network rule with a Service Tag instead of FQDN:

  • Service Bus Service tag is only maintained for the Service Bus Premium:

image

A network rule on IP does work, however I'm imagining the IP will not stay the same for long.

Not sure there's any easy solution to this one!

@marrobi
Member

marrobi commented Jun 27, 2024

Is the purpose of this PR to reduce costs when NOT in production? If so then does the IP filtering matter? As long as the non-VNet Service Bus is only enabled by a clear flag.

@jonnyry
Collaborator

jonnyry commented Jun 27, 2024

> Is the purpose of this PR to reduce costs when NOT in production?

Yes correct - to reduce the cost of dev/test instances.

Production would use premium SKUs (private endpoints/VNET integration etc).

> If so then does the IP filtering matter? As long as the non-VNet Service Bus is only enabled by a clear flag.

I suppose not (or less so anyway). The firewall still needs opening to allow traffic out to the Service Bus public IP - and it would be preferable if it wasn't open to any destination.

@marrobi
Member

marrobi commented Jun 28, 2024

We already do this for local dev:

ip_rules = var.enable_local_debugging ? [local.myip] : null

I'm not sure it matters if it's open to the internet for dev purposes?

@jonnyry
Collaborator

jonnyry commented Jun 28, 2024

No I agree, not that important.

However it's opening the firewall in the outbound direction to the service bus FQDN that's tricky...

image

It can be opened to the Service Bus IP, but it's not ideal - I don't know what the frequency of the IP changing is.

@TonyWildish-BH
Contributor

adding my £0.02, I don't think we need the Service Bus at all. It's just a FIFO queue, and there are much cheaper ways to implement that than a premium tier Service Bus, especially given that the traffic is so low that performance will never be an issue.

I also don't think it's a good idea to have different architectural flavours in dev/test vs. production, that's asking for trouble.

So I'd like to see this expense removed from the production instance(s), not just dev or test.

@marrobi
Member

marrobi commented Jul 1, 2024

@TonyWildish-BH There probably are other ways, but as with everything there is the time and effort to implement them vs the actual cost of using the managed offering. Maybe you can suggest a design and submit a PR?

(agree test and prod should be consistent, but for dev, less so - we often develop using local compute for the API, resource processor etc so we can debug and have a shorter dev loop)

@damoodamoo
Member

@TonyWildish-BH We use Service Bus with sessions for guaranteed ordered delivery. This is required when multiple operations stack up against a single resource, and there are multiple nodes/threads servicing those requests.

I'm not hearing that the cost is really a factor in production, so if it's a case of saving costs in dev then implementing a switch to use the Standard SKU and skip a few PEs sounds pretty reasonable to me.

It's a pain to pay so much more for private networking, but it's definitely a requirement for most prod workloads.
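
As an illustration of the session semantics described above, here is a minimal in-memory sketch. It is not the Azure SDK and not the TRE's code: the `SessionQueue` class, its method names and the `workspace-1`/`workspace-2` ids are all hypothetical. It only shows the idea that messages sharing a session id (here, a resource id) are handed to one receiver at a time, in send order, while other receivers pick up other sessions.

```python
from collections import defaultdict, deque


class SessionQueue:
    """Toy in-memory model of session-aware delivery (hypothetical, not the Azure SDK)."""

    def __init__(self):
        self._sessions = defaultdict(deque)  # session_id -> pending messages, FIFO
        self._locked = set()                 # sessions currently held by a receiver

    def send(self, session_id, body):
        self._sessions[session_id].append(body)

    def accept_next_session(self):
        """Lock and return the first unlocked session with pending messages, or None."""
        for session_id, pending in self._sessions.items():
            if pending and session_id not in self._locked:
                self._locked.add(session_id)
                return session_id
        return None

    def receive(self, session_id):
        """Pop the oldest message of a session that is currently locked."""
        if session_id not in self._locked:
            raise RuntimeError("session is not locked by a receiver")
        pending = self._sessions[session_id]
        return pending.popleft() if pending else None

    def release(self, session_id):
        self._locked.discard(session_id)


queue = SessionQueue()
queue.send("workspace-1", "install")
queue.send("workspace-2", "install")
queue.send("workspace-1", "upgrade")
queue.send("workspace-1", "uninstall")

worker_a = queue.accept_next_session()  # locks workspace-1
worker_b = queue.accept_next_session()  # workspace-1 is locked, so this gets workspace-2

# Worker A drains its session; the three operations arrive in send order.
handled_by_a = []
while (message := queue.receive(worker_a)) is not None:
    handled_by_a.append(message)
queue.release(worker_a)
```

Because a session is locked to a single receiver, stacked operations against one resource cannot be interleaved across nodes or threads, which is the property any replacement queue would need to reproduce.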

@jonnyry
Collaborator

jonnyry commented Jul 1, 2024

> adding my £0.02, I don't think we need the Service Bus at all. It's just a FIFO queue, and there are much cheaper ways to implement that than a premium tier Service Bus, especially given that the traffic is so low that performance will never be an issue.
>
> I also don't think it's a good idea to have different architectural flavours in dev/test vs. production, that's asking for trouble.
>
> So I'd like to see this expense removed from the production instance(s), not just dev or test.

@TonyWildish-BH yes I've recently come to that conclusion also. An enterprise message queue seems unnecessary (and costly) for tens or hundreds of messages a day.

I've parked trying to refactor the Service Bus to use a Standard SKU for dev/test as there are too many gnarly changes required to make it work - as you say, asking for trouble when your dev/test flavour is that different from prod.

@damoodamoo unfortunately it's more than just removing a few PEs. Here are the key issues:

  1. Outbound Firewall rule. Service Bus Standard runs on a public IP, therefore traffic requires a route out of the firewall. It is not possible to lock this down to an FQDN due to the non-443 ports (without re-upgrading the firewall SKU, which defeats the object). You can refactor the Service Bus client to use AMQP over websockets, which allows you to use a Firewall Application Rule on the FQDN, but the library has a bug that causes this to fail: "Service bus websocket connection break after a minute or so" Azure/azure-sdk-for-python#31067

  2. Firewall deployment catch-22. During initial deployment, the TRE is deployed (inc the Service Bus) before the firewall is in place. Therefore the resource processor has a connection out to the Service Bus without the firewall's routing rules in place. As the resource processor installs the firewall routing rules, the resource processor's Service Bus connection breaks (since it can no longer go direct to the internet). This causes all kinds of fun, with the resource processor stuck in a loop attempting to install the firewall. There is no easy way out of that without a reasonable amount of rewrite.

  3. Large messages - according to this comment we might encounter large messages:

# The returned payload might be large, especially for errors.
# Cosmos is the final destination of the messages where 2048 is the limit.
max_message_size_in_kilobytes = 2048 # default=1024

Service Bus Standard SKU won't cope with these.
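
To make the size concern concrete, here is a small sketch that compares a JSON payload against two limits. The 256 KB figure is an assumption taken from the published Service Bus quotas for the Standard tier (worth re-checking against current documentation); the 2048 KB figure mirrors `max_message_size_in_kilobytes` in the snippet above. The payload shape is hypothetical, and the check ignores message headers and properties, which also count towards the real limit.

```python
import json

# Assumption: 256 KB is the per-message limit on the Standard tier.
STANDARD_LIMIT_KB = 256
# Mirrors max_message_size_in_kilobytes = 2048 from the Terraform snippet.
CONFIGURED_PREMIUM_LIMIT_KB = 2048


def payload_size_kb(payload: dict) -> float:
    """Size of the JSON-encoded payload in kilobytes (1 KB = 1024 bytes)."""
    return len(json.dumps(payload).encode("utf-8")) / 1024


def fits(payload: dict, limit_kb: int) -> bool:
    """True if the encoded payload is within the given limit."""
    return payload_size_kb(payload) <= limit_kb


# Hypothetical status message carrying a long error trace.
status_message = {"operationId": "op-1", "status": "failed", "message": "x" * 300_000}

fits_standard = fits(status_message, STANDARD_LIMIT_KB)
fits_premium = fits(status_message, CONFIGURED_PREMIUM_LIMIT_KB)
```

A roughly 300,000-character error message already exceeds the assumed Standard limit while sitting comfortably inside the configured 2048 KB.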

@TonyWildish-BH
Contributor

Thanks for the quick feedback. We've got it on our backlog to do something about the Service Bus, but it hasn't risen far enough up the stack yet; we'll probably get to it in a couple of months. Will be happy to post more details here when we get there.

@damoodamoo
Member

@jonnyry - thanks for the comments there, there was a bunch of stuff I'd not realised.

@marrobi
Member

marrobi commented Jul 4, 2024

This might be another option to reduce dev costs when available: Azure/azure-service-bus#223 (comment)

@jonnyry
Collaborator

jonnyry commented Jul 4, 2024

OK useful to know.

Summarising potential options (and adding a few new ones) - please add any more to the list you have!

1. Resolve issues (above) in Service Bus SKU Standard

See above for issues encountered when attempting to change Service Bus to SKU Standard.

2. Delete the Service Bus on TRE stop

Previously proposed by @marrobi - alter the TRE stop/start scripts to terraform destroy and apply the Service Bus and directly associated resources (PEs etc).

3. Integrate the Service Bus emulator

Wait for Service Bus emulator (see above) and integrate that in dev/test. Currently estimated for end of 2024.

4. Use different queue product

Swap out Service Bus for a different queue product, e.g. RabbitMQ.

5. Use other Azure queue technology

Consider using one of the other queuing products in Azure and determine whether its behaviour meets the needs of the TRE - FIFO & large messages. Can we augment/workaround any limitations in behaviour (e.g. no large messages/lack of FIFO)?
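
One commonly used workaround for a message size limit is the claim-check pattern: store the large body elsewhere and put only a reference on the queue. The sketch below is illustrative only and is not something proposed in this thread. `InMemoryBlobStore` is a hypothetical stand-in for external storage, and the 256 KB threshold is an assumed limit. The size check is applied to the payload rather than the final envelope, so a real implementation would want some headroom.

```python
import json
import uuid

INLINE_LIMIT_BYTES = 256 * 1024  # assumed queue limit, for illustration only


class InMemoryBlobStore:
    """Hypothetical stand-in for external storage such as a blob container."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = str(uuid.uuid4())
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def wrap(payload: dict, store: InMemoryBlobStore) -> dict:
    """Return an envelope: the body inline if small, else a reference into the store."""
    data = json.dumps(payload).encode("utf-8")
    if len(data) <= INLINE_LIMIT_BYTES:
        return {"kind": "inline", "body": payload}
    return {"kind": "reference", "key": store.put(data)}


def unwrap(envelope: dict, store: InMemoryBlobStore) -> dict:
    """Recover the original payload from either kind of envelope."""
    if envelope["kind"] == "inline":
        return envelope["body"]
    return json.loads(store.get(envelope["key"]))


store = InMemoryBlobStore()
small = {"status": "ok"}
large = {"status": "failed", "message": "x" * 300_000}

small_envelope = wrap(small, store)  # stays inline
large_envelope = wrap(large, store)  # replaced by a reference
```

The trade-off is an extra storage dependency and a clean-up step for stored bodies, which would need weighing against the cost saving.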

6. SQL queue library

Given the relatively low message throughput consider a queue library on top of a SQL instance, e.g. https://pypi.org/project/pq/
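
A rough sketch of what option 6 could look like, using SQLite from the Python standard library purely so the example is self-contained. This is not the `pq` API; the table layout and the function names are assumptions for illustration. It hands out the oldest pending message, but skips any resource that already has a message in progress, which approximates per-resource ordered delivery.

```python
import sqlite3


def create_queue(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS queue (
            id INTEGER PRIMARY KEY AUTOINCREMENT,  -- insertion order
            resource_id TEXT NOT NULL,
            body TEXT NOT NULL,
            state TEXT NOT NULL DEFAULT 'pending'  -- pending | in_progress | done
        )
        """
    )


def enqueue(conn, resource_id, body):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO queue (resource_id, body) VALUES (?, ?)", (resource_id, body)
        )


def dequeue(conn):
    """Claim the oldest pending message whose resource has nothing in progress."""
    with conn:
        row = conn.execute(
            """
            SELECT id, resource_id, body FROM queue AS q
            WHERE state = 'pending'
              AND NOT EXISTS (
                  SELECT 1 FROM queue AS busy
                  WHERE busy.resource_id = q.resource_id
                    AND busy.state = 'in_progress'
              )
            ORDER BY id LIMIT 1
            """
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE queue SET state = 'in_progress' WHERE id = ?", (row[0],))
        return row


def complete(conn, message_id):
    with conn:
        conn.execute("UPDATE queue SET state = 'done' WHERE id = ?", (message_id,))


conn = sqlite3.connect(":memory:")
create_queue(conn)
enqueue(conn, "workspace-1", "install")
enqueue(conn, "workspace-1", "upgrade")
enqueue(conn, "workspace-2", "install")

first = dequeue(conn)    # workspace-1 / install
second = dequeue(conn)   # workspace-1 is busy, so workspace-2 / install
complete(conn, first[0])
third = dequeue(conn)    # workspace-1 / upgrade
```

With a single connection the select-then-update is safe; with several workers the claim step would need to be made atomic, for example with an immediate transaction in SQLite or row locking on a server database.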

@jonnyry
Collaborator

jonnyry commented Jul 10, 2024

@marrobi wondering what your thoughts are on replacing the Service Bus with a containerised version of RabbitMQ? Would this be accepted as a PR?

@marrobi
Member

marrobi commented Jul 10, 2024

For dev purposes as an option or completely? What compute would it run on?

@damoodamoo thoughts?

@jonnyry
Collaborator

jonnyry commented Jul 10, 2024

Dev and production. Not looked closely at the compute, but something like Azure Container Apps.

@marrobi
Member

marrobi commented Jul 10, 2024

It could go on the resource processor VM. We used a VMSS as we expected to scale out to multiple instances, but have never seen the need. The TF deployments tend to be low CPU.

FWIW, could also put the API on it rather than the web app, and use something like Portainer to manage.

My only worry is that in production it is a single instance and needs to be "supported" by the user rather than being a managed service.

@TonyWildish-BH
Contributor

Another option for RabbitMQ is to use the service from the Azure Marketplace. Trade some of the savings for better maintenance etc.

@marrobi
Member

marrobi commented Jul 10, 2024

Thanks @TonyWildish-BH. Marketplace offers are a challenge for us, as we have to have a credit card associated with the subscription to pay the vendor. This isn't an option for us internally, so we couldn't deploy the Marketplace option.

@damoodamoo
Member

I'd have a concern that in order to reduce cost, we're increasing complexity. What is the target running cost of a TRE in Prod or Dev that we're trying to get to?

@BiologyGeek
Author

> Another option for RabbitMQ is to use the service from the Azure Marketplace. Trade some of the savings for better maintenance etc.

> Thanks @TonyWildish-BH. Marketplace offers are a challenge for us, as we have to have a credit card associated with the subscription to pay the vendor. This isn't an option for us internally, so we couldn't deploy the Marketplace option.

Guys, as RabbitMQ is open source, is it really necessary to use marketplace offers?

@TonyWildish-BH
Contributor

> Guys, as RabbitMQ is open source, is it really necessary to use marketplace offers?

That depends entirely on who you want to pay for the maintenance. You can DIY, or you can use Marketplace and let someone else do it for you. I prefer the latter; we don't have the bandwidth to be sysadmins in the cloud as well.

@jonnyry
Collaborator

jonnyry commented Jul 18, 2024

> It could go on the resource processor VM. We used a VMSS as we expected to scale out to multiple instances, but have never seen the need. The TF deployments tend to be low CPU.

I like that idea, although given the resource processor is a VM scale set, does it have a stable IP / hostname?

@marrobi
Member

marrobi commented Nov 29, 2024

May be of interest - https://github.com/Azure/azure-service-bus-emulator-installer

@marrobi
Member

marrobi commented Jan 6, 2025

@jonnyry @TonyWildish-BH I had a brief look at the emulator for development/test scenarios.

It was looking promising until I got onto Event Grid subscriptions with the airlock. Event Grid is heavily used by the airlock, and potentially by future scenarios. These integrations require a Service Bus resource ID, as per https://learn.microsoft.com/en-us/azure/event-grid/handler-service-bus, which the emulator does not have. If anyone has a suggestion it would be good to hear, but at the moment, moving away from Azure PaaS resources will require a lot of DIY effort in stringing things together.

@marrobi
Member

marrobi commented Jan 6, 2025

This should reduce the Service Bus cost to under c. £10 a month - #4256

Would only recommend for development purposes and isn't fully tested.

Couple of bits to finish off.

@jonnyry @TonyWildish-BH is this something that would be valuable?

@marrobi
Member

marrobi commented Jan 6, 2025

Actually just seen this - #3953 (comment) :-D

@jonnyry
Collaborator

jonnyry commented Jan 6, 2025

@marrobi very much so :-)

though there are a few gnarly challenges to solve above.

@jonnyry
Collaborator

jonnyry commented Jan 6, 2025

Here's my original attempt if it's any use: https://github.com/nwsde/nwsde-azuretre/commits/jr/54-service-bus-sku/

@TonyWildish-BH
Contributor

TonyWildish-BH commented Jan 6, 2025 via email
