Airlock fails due to DNS timeout - returns "Request failed due to an unknown reason." #3767

Closed
marrobi opened this issue Oct 26, 2023 · 7 comments · Fixed by #3769
Labels
bug Something isn't working

Comments

@marrobi
Member

marrobi commented Oct 26, 2023

I've seen this a few times across multiple deployments, and again in tests today.

"new_status":"submitted","previous_status":"draft","type":"import"
Failed to resolve 'stalimiptre0db03523.blob.core.windows.net' ([Errno -2] Name or service not known)

In this code:

def create_container(account_name: str, request_id: str):
    try:
        container_name = request_id
        blob_service_client = BlobServiceClient(account_url=get_account_url(account_name),
                                                credential=get_credential())
        blob_service_client.create_container(container_name)
        logging.info(f'Container created for request id: {request_id}.')
    except ResourceExistsError:
        logging.info(f'Did not create a new container. Container already exists for request id: {request_id}.')

The root cause is that the DNS record does not appear in the private DNS zone, although the private endpoint reports it as present:

image
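To confirm this failure mode from inside the affected network, the storage account hostname can be resolved directly. The following is a minimal sketch, not part of the airlock processor: `check_dns` is a hypothetical helper, and its `resolver` parameter is an assumption added only so the lookup can be swapped out in tests.

```python
import socket
from typing import Callable, List, Optional


def check_dns(hostname: str,
              resolver: Callable[..., list] = socket.getaddrinfo) -> Optional[List[str]]:
    """Return the sorted unique IPs for hostname, or None if resolution fails."""
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Same class of failure as "[Errno -2] Name or service not known".
        return None
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

Calling `check_dns("stalimiptre0db03523.blob.core.windows.net")` from the airlock processor's network would return `None` while the A record is missing, and the private endpoint IP once it is back.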

Failed processing Airlock request with ID: '2c3aa93d-fa41-41fe-a6e7-83673bc1baa1', changing request status to 'failed'.
Traceback (most recent call last):
  File "/home/site/wwwroot/StatusChangedQueueTrigger/__init__.py", line 37, in main
    handle_status_changed(request_properties, stepResultEvent, dataDeletionEvent, request_files)
  File "/home/site/wwwroot/StatusChangedQueueTrigger/__init__.py", line 73, in handle_status_changed
    blob_operations.create_container(containers_metadata.dest_account_name, req_id)
  File "/home/site/wwwroot/shared_code/blob_operations.py", line 31, in create_container
    blob_service_client.create_container(container_name)
  File "/usr/local/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_blob_service_client.py", line 563, in create_container
    container.create_container(
  File "/usr/local/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_container_client.py", line 308, in create_container
    return self._client.container.create( # type: ignore
  File "/usr/local/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_generated/operations/_container_operations.py", line 982, in create
    pipeline_response: PipelineResponse = self._client._pipeline.run( # pylint: disable=protected-access
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 230, in run
    return first_node.send(pipeline_request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/policies/_redirect.py", line 197, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_shared/policies.py", line 549, in send
    raise err
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_shared/policies.py", line 523, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/policies/_authentication.py", line 126, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_shared/policies.py", line 312, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 86, in send
    response = self.next.send(request)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 119, in send
    self._sender.send(request.http_request, **request.context.options),
  File "/usr/local/lib/python3.8/site-packages/azure/storage/blob/_shared/base_client.py", line 332, in send
    return self._transport.send(request, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 381, in send
    raise error
azure.core.exceptions.ServiceRequestError: <urllib3.connection.HTTPSConnection object at 0x7dc72cdff370>: Failed to resolve 'stalimiptre0db03523.blob.core.windows.net' ([Errno -2] Name or service not known)
@marrobi marrobi added the bug Something isn't working label Oct 26, 2023
@marrobi marrobi changed the title from: Airlock fails due to DNS timeout - returns "Unknown Error" to: Airlock fails due to DNS timeout - returns "Request failed due to an unknown reason." Oct 26, 2023
@marrobi
Member Author

marrobi commented Oct 26, 2023

In the logs I can see various events, including two separate creates with the same FQDN but different IPs on the same DNS zone.

"responseBody": "{\"id\":\"/subscriptions/bb0916b2-1642-4efb-897d-a41ecc9fd132/resourceGroups/rg-tre0db03523/providers/Microsoft.Network/privateDnsZones/privatelink.blob.core.windows.net/A/stalimiptre0db03523\",\"name\":\"stalimiptre0db03523\",\"type\":\"Microsoft.Network/privateDnsZones/A\",\"etag\":\"89265689-ac8d-4f1c-8952-5c4cae12f7d6\",\"properties\":{\"metadata\":{\"creator\":\"created by private endpoint pe-stg-import-inprogress-blob-tre0db03523 with resource guid 6e336f1c-5dc4-4a5d-b0bc-b0e5474922f5\"},\"fqdn\":\"stalimiptre0db03523.privatelink.blob.core.windows.net.\",\"ttl\":10,\"aRecords\":[{\"ipv4Address\":\"10.0.144.5\"}],\"isAutoRegistered\":false}}",

"fqdn":"stalimiptre0db03523.privatelink.blob.core.windows.net.","ttl":10,"aRecords":[{"ipv4Address":"10.0.144.5"}]

        "responseBody": "{\"id\":\"/subscriptions/bb0916b2-1642-4efb-897d-a41ecc9fd132/resourceGroups/rg-tre0db03523/providers/Microsoft.Network/privateDnsZones/privatelink.blob.core.windows.net/A/stalimiptre0db03523\",\"name\":\"stalimiptre0db03523\",\"type\":\"Microsoft.Network/privateDnsZones/A\",\"etag\":\"e9c0add1-69d2-43e6-8532-0453b21e7fda\",\"properties\":{\"metadata\":{\"creator\":\"created by private endpoint stg-ip-import-blob-tre0db03523-ws-6cab with resource guid b4d3bb5f-31a4-45f5-bccf-7021e1661799\"},\"fqdn\":\"stalimiptre0db03523.privatelink.blob.core.windows.net.\",\"ttl\":10,\"aRecords\":[{\"ipv4Address\":\"10.1.1.4\"}],\"isAutoRegistered\":false}}",

"fqdn":"stalimiptre0db03523.privatelink.blob.core.windows.net.","ttl":10,"aRecords":[{"ipv4Address":"10.1.1.4"}]

The second record is created by the private endpoint in the airlock import review workspace. This is why we get 403s, and why the A record disappears when the review workspace is deleted.

We cannot have two records in the same zone with the same FQDN.
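This kind of clash can be spotted mechanically in the activity log. The sketch below is only an illustration: `EVENTS` holds trimmed copies of the two `responseBody` payloads above (only the fields used here), and `find_conflicts` is a hypothetical helper, not part of Azure TRE.

```python
import json
from collections import defaultdict

# Trimmed copies of the two responseBody payloads from the activity log.
EVENTS = [
    json.dumps({
        "name": "stalimiptre0db03523",
        "properties": {
            "metadata": {"creator": "created by private endpoint pe-stg-import-inprogress-blob-tre0db03523 with resource guid 6e336f1c-5dc4-4a5d-b0bc-b0e5474922f5"},
            "fqdn": "stalimiptre0db03523.privatelink.blob.core.windows.net.",
            "aRecords": [{"ipv4Address": "10.0.144.5"}],
        },
    }),
    json.dumps({
        "name": "stalimiptre0db03523",
        "properties": {
            "metadata": {"creator": "created by private endpoint stg-ip-import-blob-tre0db03523-ws-6cab with resource guid b4d3bb5f-31a4-45f5-bccf-7021e1661799"},
            "fqdn": "stalimiptre0db03523.privatelink.blob.core.windows.net.",
            "aRecords": [{"ipv4Address": "10.1.1.4"}],
        },
    }),
]


def find_conflicts(response_bodies):
    """Return FQDNs written by more than one distinct (creator, IPs) pair."""
    writers = defaultdict(set)
    for body in response_bodies:
        props = json.loads(body)["properties"]
        ips = tuple(record["ipv4Address"] for record in props["aRecords"])
        writers[props["fqdn"]].add((props["metadata"]["creator"], ips))
    return {fqdn: sorted(w) for fqdn, w in writers.items() if len(w) > 1}


conflicts = find_conflicts(EVENTS)
for fqdn, entries in conflicts.items():
    print(fqdn)
    for creator, ips in entries:
        print(f"  {ips} <- {creator}")
```

Any FQDN that shows up in the result is being written by more than one private endpoint into the same zone.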

@marrobi
Member Author

marrobi commented Oct 26, 2023

Looks like the issue has been seen before - #3215

@tamirkamara any thoughts? I'm not sure of the best approach to resolve this.

This is the code:

resource "azurerm_private_endpoint" "sa_import_inprogress_pe" {
  name                = "stg-ip-import-blob-${local.workspace_resource_name_suffix}"
  location            = var.location
  resource_group_name = azurerm_resource_group.ws.name
  subnet_id           = module.network.services_subnet_id
  lifecycle { ignore_changes = [tags] }
  private_dns_zone_group {
    name                 = "pdzg-stg-ip-import-blob-${local.workspace_resource_name_suffix}"
    private_dns_zone_ids = [data.azurerm_private_dns_zone.blobcore.id]
  }
  private_service_connection {
    name                           = "psc-stg-ip-import-blob-${local.workspace_resource_name_suffix}"
    private_connection_resource_id = data.azurerm_storage_account.sa_import_inprogress.id
    is_manual_connection           = false
    subresource_names              = ["Blob"]
  }
  tags = local.tre_workspace_tags
}

I'm also confused as to why this has been working. Maybe if there are two addresses it uses the IP in the appropriate subnet, and it's the delete that is deleting both?

@marrobi
Member Author

marrobi commented Oct 26, 2023

https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-dns

Existing Private DNS Zones linked to a single service should not be associated with two different Private Endpoints. This will cause a deletion of the initial A-record and result in resolution issue when attempting to access that service from each respective Private Endpoint. However, linking a Private DNS Zones with private endpoints associated with different services would not face this resolution constraint.
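The behaviour described in that note can be pictured with a toy model. This is a simplified sketch of the symptoms reported in this thread, not Azure's actual implementation: `PrivateDnsZone` and its methods are hypothetical, and the assumption that a record is removed by name regardless of which endpoint wrote it is inferred from the observations above.

```python
class PrivateDnsZone:
    """Toy model: one A record per name, tagged with the endpoint that last wrote it."""

    def __init__(self):
        self.records = {}  # record name -> (ip, owning endpoint)

    def register(self, endpoint, name, ip):
        # A second endpoint registering the same name replaces the first record.
        self.records[name] = (ip, endpoint)

    def deregister(self, endpoint, name):
        # Assumption: removal is keyed on the record name only, so deleting one
        # endpoint drops the record even if another endpoint still needs it.
        self.records.pop(name, None)

    def resolve(self, name):
        entry = self.records.get(name)
        return entry[0] if entry else None


zone = PrivateDnsZone()
name = "stalimiptre0db03523"

# Core deployment registers first, then the review workspace overwrites it.
zone.register("pe-stg-import-inprogress-blob-tre0db03523", name, "10.0.144.5")
zone.register("stg-ip-import-blob-tre0db03523-ws-6cab", name, "10.1.1.4")
after_second_create = zone.resolve(name)

# Deleting the review workspace removes the only remaining record.
zone.deregister("stg-ip-import-blob-tre0db03523-ws-6cab", name)
after_workspace_delete = zone.resolve(name)
```

In this model the core services resolve the workspace IP after the second create, and nothing at all once the review workspace is deleted, which matches the "Name or service not known" error in the original report.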

@marrobi
Member Author

marrobi commented Oct 26, 2023

This is how export is handled:

resource "azurerm_storage_account_network_rules" "sa_export_inprogress_rules" {
  storage_account_id = azurerm_storage_account.sa_export_inprogress.id
  # The Airlock processor is unable to copy blobs from the export-inprogress storage account when the only method of access from the Airlock processor is a private endpoint in the core VNet,
  # so we need to allow the Airlock processor subnet to access this storage account without using a private endpoint.
  # https://github.com/microsoft/AzureTRE/issues/2098
  virtual_network_subnet_ids = [var.airlock_processor_subnet_id]
  default_action             = var.enable_local_debugging ? "Allow" : "Deny"
  bypass                     = ["AzureServices"]
}

I propose we do the same in the core deployment to allow access to the import in progress account from the airlock processor subnet and then leave the private endpoint in the import review workspace.

@marrobi
Member Author

marrobi commented Oct 26, 2023

The reason we started seeing this in our tests is that the airlock review workspace is now installed and deleted in the extended e2e tests (#3704).

@marrobi
Member Author

marrobi commented Oct 27, 2023

This is how export is handled:

resource "azurerm_storage_account_network_rules" "sa_export_inprogress_rules" {
  storage_account_id = azurerm_storage_account.sa_export_inprogress.id
  # The Airlock processor is unable to copy blobs from the export-inprogress storage account when the only method of access from the Airlock processor is a private endpoint in the core VNet,
  # so we need to allow the Airlock processor subnet to access this storage account without using a private endpoint.
  # https://github.com/microsoft/AzureTRE/issues/2098
  virtual_network_subnet_ids = [var.airlock_processor_subnet_id]
  default_action             = var.enable_local_debugging ? "Allow" : "Deny"
  bypass                     = ["AzureServices"]
}

I propose we do the same in the core deployment to allow access to the import in progress account from the airlock processor subnet and then leave the private endpoint in the import review workspace.

This won't work: if there are two review workspaces and one gets deleted, the record will disappear.

We could leave the core private endpoint as is and add a private DNS zone specific to the import in-progress storage account in the airlock review workspace.

@eladiw
Contributor

eladiw commented Oct 28, 2023

@anatbal

@anatbal anatbal assigned anatbal and unassigned anatbal Oct 29, 2023
marrobi added a commit that referenced this issue Nov 7, 2023
* Airlock fails due to DNS timeout - returns "Request failed due to an unknown reason."
Fixes #3767

* Update changelog description

* Word smithing

* Add HACK comment to more easily id items pending delete

---------

Co-authored-by: Sven Aelterman