[Design Proposal] Event-Based Traffic Mirroring Setup/Teardown #35
Comments
Overall seems like a good plan, a few comments.
Thanks for the review! In order:
Hopefully that all makes sense?
[1] https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-service-event-list.html
Updated the FAQ to address the question about using SNS/SQS instead.
Ok I think I understand. The only other feedback is priority. Unless it's relatively "free", the phase 1 scheduled scan seems like lower priority than phase 2.
The auto-firing is (hypothetically) extremely easy to set up; should just be a few lines of CDK once the rest of phase 1 is in place. Additionally and philosophically, I'd argue it's actually higher priority than listening for specific event types because it serves as a backstop that hypothetically functions for all valid traffic sources. If we never implemented Phase 2, the automated scan would provide basic (if not ideal) capability for all resource types, while if the reverse were true we'd only have support for a few specific resource types, and wouldn't have an automated way to "catch" failures if something wacky happened to the configuration process for a specific ENI (i.e. listening for the AWS Service Events is an edge-triggered system). Obviously, we could begin listening to the dead letter queue and actioning failed AWS Service Events... but the automated scan gives us an equivalent capability that also works for all resource types.
Ya, doing the scheduling part is easy; I didn't know if you were already going to be doing the scanning part. Since it won't be CDK anymore, I'm assuming there is a lot of code to write for the scanning? Maybe I still don't understand the design. If we never implement phase 2, we have failed. My philosophy is the opposite: if you implement the scan first, you might miss cases that the scan fixes for you and you might never do phase 2. For an MVP, phase 2 is much more important. If I go to a customer and say you will have to wait on average 30 seconds for each event before we detect it, we've failed.
Fair enough, though it's probably all academic because we'll do both Phase 1 and Phase 2. We're already doing the scan part of this in our CLI's client-side code; I'm just going to move it to a Lambda.
ah ok! awesome :)
You may find this follow-up task for the work helpful to understand what's being proposed: #36 |
Proposal
It is proposed to add event-based setup and teardown of traffic mirroring to the Arkime Cloud tooling. The capability would be rolled out in phases. The first phase is a scheduled event that triggers updates to the per-ENI Mirroring configuration on a regular cadence. The second phase is adding rules to listen for the AWS EventBridge events natively fired by AWS Services (e.g. EC2 Autoscaling) on state changes in order to more proactively update the per-ENI Mirroring configuration. The third phase would be to automatically manage per-Subnet Mirroring configuration instead of just the per-ENI configuration.
Background - Existing Solution
The existing solution uses VPC Traffic Mirroring [1] to send a copy of the user's traffic from the User VPC through a Gateway Load Balancer [2] to Capture Nodes in the Capture VPC. The Capture Nodes are running a copy of the Arkime Capture process. The source of mirrored traffic must currently be a Network Interface [3] in the user's VPC.
There are three levels of configuration required to make this work: per-VPC, per-Subnet, and per-ENI.
Currently, the add-vpc CLI operation creates these resources based on a point-in-time understanding of the User VPC's subnets and ENIs, and the remove-vpc CLI operation tears them down. The Per-VPC and Per-Subnet resources are managed using CDK/CloudFormation and the Per-ENI resources are managed using the Python SDK (boto).
[1] https://docs.aws.amazon.com/vpc/latest/mirroring/what-is-traffic-mirroring.html
[2] https://docs.aws.amazon.com/elasticloadbalancing/latest/gateway/introduction.html
[3] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html
Background - AWS EventBridge
AWS EventBridge [1] is a service that provides an event bus that both AWS Services and user-defined applications can use to communicate state changes. A sample event might be that an EC2 Autoscaling Group successfully launched a new EC2 Instance. Users can set up rules to listen for specific event types on a given bus, perform transformations of the event messages, and direct the messages to a target which can take action (e.g. a Lambda function). All rules that apply to an event fire, and each rule can send the event to multiple targets. Each target can be configured with a different retry policy, and events that fail to be actioned can be sent to a dead letter queue. The delivery guarantee for a given target is at-least-once. AWS EventBridge supports both cross-region and cross-account operation.
[1] https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html
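To make the rule/target model concrete, below is an illustrative sketch of an event and a matching event pattern. The field values (and the Auto Scaling example) are just examples of the general shape, not this project's event schema.

```python
# Illustrative only: the general shape of an EventBridge event and of a rule's
# event pattern. The values are examples, not this project's schema.
sample_event = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance Launch Successful",
    "detail": {"EC2InstanceId": "i-0123456789abcdef0"},
}

# A rule's event pattern lists the values it matches on; any event whose
# fields contain these values is delivered to all of the rule's targets.
event_pattern = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance Launch Successful"],
}
```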
Phase 1 Proposal
It is proposed to create an EventBridge Bus for each Arkime Cluster. This will be accomplished using CDK, and bundled with the existing Capture VPC resource(s). We create the Bus per-Cluster because the default quota for Buses per account/region is fairly low; the default TPS throttling rate per bus is fairly high; it should be easy to distinguish between events meant for each User VPC; and it should be possible to move to per-VPC Buses later if necessary.
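A minimal CDK (Python) sketch of what the per-Cluster Bus could look like; the construct IDs, stack class, and naming scheme below are placeholders, not the final implementation.

```python
from aws_cdk import Stack
from aws_cdk import aws_events as events
from constructs import Construct


class CaptureVpcStack(Stack):
    """Hypothetical stack bundling the per-Cluster Bus with the Capture VPC resources."""

    def __init__(self, scope: Construct, construct_id: str, *, cluster_name: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # One Bus per Arkime Cluster; name derived from the cluster for easy identification.
        self.cluster_bus = events.EventBus(
            self, "ClusterEventBus",
            event_bus_name=f"{cluster_name}-events",  # hypothetical naming scheme
        )
```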
It is proposed that we create per-VPC Rules and Lambda Functions to watch for events on the per-Cluster Bus and action them, and bundle them with the other per-VPC resources deployed via CDK. We make these resources per-VPC because we want our Lambda functions to have as few permissions as possible. While we currently have the entire system operate within a single AWS Account/Region, in the future we would like to enable a single Arkime Cluster to monitor traffic in many User VPCs spread across multiple AWS Accounts and Regions. In that scenario, we don't want our Lambda Functions to have access across all User accounts/regions; just the ones required to action a specific VPC. There's also no apparent downside to beginning this segregation now.
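A hedged sketch of the per-VPC wiring follows. The event source/detail-type strings, Lambda asset path, and stack shape are hypothetical placeholders; the actual event contract would be defined during implementation.

```python
from aws_cdk import Duration, Stack
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_lambda as lambda_
from constructs import Construct


class UserVpcMirroringStack(Stack):
    """Hypothetical per-VPC stack: one Rule and one Lambda scoped to a single User VPC."""

    def __init__(self, scope: Construct, construct_id: str, *,
                 cluster_bus: events.IEventBus, user_vpc_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Lambda that runs the existing per-ENI setup/teardown logic for this VPC only.
        configure_fn = lambda_.Function(
            self, "ConfigureEniMirroring",
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="handler.on_event",
            code=lambda_.Code.from_asset("lambda/configure_mirroring"),  # placeholder path
            timeout=Duration.minutes(5),
        )

        # Only react to events on the Cluster Bus that are addressed to this User VPC.
        events.Rule(
            self, "ConfigureEniMirroringRule",
            event_bus=cluster_bus,
            event_pattern=events.EventPattern(
                source=["arkime.cli"],                  # hypothetical source name
                detail_type=["ConfigureVpcMirroring"],  # hypothetical detail-type
                detail={"vpc_id": [user_vpc_id]},
            ),
            targets=[targets.LambdaFunction(configure_fn)],
        )
```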
The Lambda Functions we create will effectively just run the Python code currently being performed by our add-vpc and remove-vpc CLI operations to set up/tear down ENI-specific mirroring configuration (excluding all the CDK-related behavior). The add-vpc and remove-vpc CLI commands will be updated to emit events to the Cluster bus to trigger the Lambda functions.

Additionally, we will have scheduled events fire every minute to continuously scan for changes in the ENIs and trigger the add/remove lambdas.
At the end of Phase 1, we will have a system in place that ensures that the per-ENI configuration is checked and updated at least once per minute.
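The scheduled piece behind that guarantee could be a few lines of CDK like the sketch below, assuming a hypothetical reconciliation Lambda that performs the ENI scan.

```python
from aws_cdk import Duration
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_lambda as lambda_
from constructs import Construct


def add_scheduled_eni_scan(scope: Construct, scan_fn: lambda_.IFunction) -> events.Rule:
    """Fire the (hypothetical) ENI-reconciliation Lambda once per minute."""
    return events.Rule(
        scope, "EniScanSchedule",
        schedule=events.Schedule.rate(Duration.minutes(1)),
        targets=[targets.LambdaFunction(scan_fn)],
    )
```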
Phase 2 Proposal
It is proposed that we build on Phase 1 by beginning to listen to the existing events continuously emitted by AWS Services on their state changes. A couple of examples are the events that AWS EC2 Autoscaling emits to EventBridge when new instances start/stop running and the events that AWS ECS emits when containers start/stop running. These events are natively emitted to the default EventBridge bus that exists in every AWS Account without requiring any user action.
Creating EventBridge Rules to listen to these events would enable us to create/destroy mirroring configuration at the moment that the state-change occurs rather than waiting for the next scheduled scan. The downside is that every AWS Service emits different events with different formats, so rules will need to be created for each scenario. We would add these Rules to the per-VPC CDK configuration, as the rules are inherently tied to a specific AWS Account/Region via the default EventBridge they listen to, and we want to enable an easy transition to multi-Region/multi-Account setups in the future.
Starting with high-value event types (such as EC2/ECS changes) seems reasonable, with additional event types added incrementally as they are identified.
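As an example of what a Phase 2 Rule could look like, the sketch below listens on the account's default bus for EC2 Auto Scaling launch/terminate lifecycle events; the handler Lambda and any further filtering on the event detail are placeholders.

```python
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_lambda as lambda_
from constructs import Construct


def add_autoscaling_lifecycle_rule(scope: Construct, handler_fn: lambda_.IFunction) -> events.Rule:
    """Invoke the (hypothetical) mirroring handler when Auto Scaling launches/terminates instances."""
    # Omitting event_bus targets the account/region default bus, which is
    # where AWS services emit their native events.
    return events.Rule(
        scope, "AutoscalingLifecycleRule",
        event_pattern=events.EventPattern(
            source=["aws.autoscaling"],
            detail_type=[
                "EC2 Instance Launch Successful",
                "EC2 Instance Terminate Successful",
            ],
        ),
        targets=[targets.LambdaFunction(handler_fn)],
    )
```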
Phase 3 Proposal
It is proposed that we automatically update our per-Subnet mirroring configuration using either scheduled scans for changes or emitted VPC events, similar to how we update the per-ENI configuration. This likely means changing our per-Subnet configuration from being managed by CDK/CloudFormation to being managed by direct SDK invocations. Currently, a human needs to intervene when the subnets within a User VPC change (see [1]).
[1] #32
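As a sketch of the detection half of this, a scheduled Lambda could list the VPC's current subnets with boto3 and diff them against what was previously recorded; the recording/diffing side (e.g. in Parameter Store) is omitted here and hypothetical.

```python
import boto3


def find_current_subnet_ids(vpc_id: str) -> set[str]:
    """Return the IDs of all subnets currently present in the given User VPC."""
    ec2 = boto3.client("ec2")
    paginator = ec2.get_paginator("describe_subnets")
    subnet_ids: set[str] = set()
    for page in paginator.paginate(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]):
        subnet_ids.update(subnet["SubnetId"] for subnet in page["Subnets"])
    # The caller would compare this set against previously recorded subnets and
    # create/destroy per-Subnet mirroring resources for the difference.
    return subnet_ids
```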
FAQs
Why use EventBridge instead of using SNS/SQS directly?
SNS/SQS are general queuing and notification solutions; EventBridge is specifically designed for handling AWS state changes. AWS services have out-of-the-box integration [1] with EventBridge to make it easy to take action when changes occur, and they emit events following standardized schemas. For the example of a change in EC2 Autoscaling capacity, the EC2 Autoscaling service already emits an event to EventBridge without any effort on our part [2]. If we used SNS/SQS, we'd have to detect the state change and emit the event ourselves.
[1] https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-service-event-list.html
[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/automating-ec2-auto-scaling-with-eventbridge.html
How can users initiate creation of mirroring resources themselves?
An example use-case might be that a user wants to ensure that their EC2 instances are having their traffic captured before instance spin-up is allowed to complete.
In this case, the user can emit our standardized event for creating per-ENI resources as part of their User Data Script (or ECS startup command) and wait for the expected entry for that ENI to appear in our data store (currently, SSM Parameter Store).
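A hedged sketch of that user-side flow with boto3 follows; the event fields and the Parameter Store naming scheme are hypothetical placeholders, not a finalized contract.

```python
import json
import time

import boto3


def request_mirroring_and_wait(cluster_bus_name: str, eni_id: str, timeout_s: int = 120) -> None:
    """Emit the (hypothetical) per-ENI mirroring event, then wait for the Parameter Store entry."""
    events_client = boto3.client("events")
    ssm = boto3.client("ssm")

    events_client.put_events(
        Entries=[{
            "EventBusName": cluster_bus_name,
            "Source": "arkime.user",                # hypothetical source name
            "DetailType": "ConfigureEniMirroring",  # hypothetical detail-type
            "Detail": json.dumps({"eni_id": eni_id}),
        }]
    )

    # Poll for the per-ENI entry the Lambda writes once the Mirroring Session exists.
    parameter_name = f"/arkime/enis/{eni_id}"  # hypothetical naming scheme
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            ssm.get_parameter(Name=parameter_name)
            return
        except ssm.exceptions.ParameterNotFound:
            time.sleep(5)
    raise TimeoutError(f"Mirroring for {eni_id} was not configured within {timeout_s}s")
```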
How will the system behave for longer-lived (e.g. EC2 instances) and shorter-lived (e.g. Lambda containers) resources?
The system is designed for longer-lived resources, such as EC2 instances and ECS containers, and will behave well for them starting in Phase 1, with capture beginning no later than 1 minute after the resource is created. Phase 2 probably brings the delay down to ~1 second.
Shorter-lived and ephemeral resources (such as AWS Lambda Functions) are tricky to deal with given the underlying constraints imposed by VPC Traffic Mirroring. The traffic of Lambda Functions is hypothetically mirrorable (though I have not tested this), and longer-lived Functions should be caught by the scheduled scans - if the Function in question is executing inside the User VPC. Given current constraints, very short-lived resources would likely be best addressed by having the user manually trigger an event to set up mirroring for the resource before continuing with the compute operation, but this necessarily adds latency to the operation they are trying to perform (it's unclear how much).
Ideally, VPC Traffic Mirroring would be improved in a manner that obviates the need for per-ENI configuration.
How will the system handle multiple, concurrent, and/or conflicting instructions?
The actual operations being performed at the per-ENI level appear fairly simple and reasonably easy to make idempotent. The resources/state for each ENI are a Traffic Mirroring Session and a Parameter Store entry recording its metadata.
Multiple creation attempts for the same ENI would only ever create a Mirroring Session with the same configuration, and write the same metadata to the same Parameter Store key. Multiple deletions would only ever delete the same Mirroring Session and Parameter Store key. If later duplicates of the same operation fail, there shouldn't be user impact. As a result, the at-least-once delivery guaranteed by EventBridge to the Target Lambda functions should not cause problems.
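A sketch of what an idempotent per-ENI create could look like, assuming (hypothetically) that the Parameter Store entry records the session that was created; the parameter naming and session fields are placeholders.

```python
import json

import boto3


def ensure_eni_mirroring(eni_id: str, target_id: str, filter_id: str, vni: int) -> str:
    """Create the ENI's Mirroring Session and Parameter Store entry if they don't already exist."""
    ec2 = boto3.client("ec2")
    ssm = boto3.client("ssm")
    parameter_name = f"/arkime/enis/{eni_id}"  # hypothetical naming scheme

    # If a previous attempt already recorded a session for this ENI, reuse it.
    try:
        existing = json.loads(ssm.get_parameter(Name=parameter_name)["Parameter"]["Value"])
        return existing["session_id"]
    except ssm.exceptions.ParameterNotFound:
        pass

    session = ec2.create_traffic_mirror_session(
        NetworkInterfaceId=eni_id,
        TrafficMirrorTargetId=target_id,
        TrafficMirrorFilterId=filter_id,
        SessionNumber=1,
        VirtualNetworkId=vni,
    )["TrafficMirrorSession"]

    # Record the session so duplicate events and scheduled scans become no-ops.
    ssm.put_parameter(
        Name=parameter_name,
        Value=json.dumps({"session_id": session["TrafficMirrorSessionId"]}),
        Type="String",
        Overwrite=True,
    )
    return session["TrafficMirrorSessionId"]
```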
EventBridge does not guarantee ordering for simultaneous events, so it's possible that a Create and a Delete operation could arrive in either order, but that should be resolved either way during the next scheduled scan of the User VPC (i.e. the Phase 1 deliverable).