Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(data_masking): add new sensitive data masking utility #2197

Merged
merged 72 commits into from
Sep 27, 2023

Conversation

seshubaws
Copy link
Contributor

@seshubaws seshubaws commented May 2, 2023

Issue number:
#1858

Summary

Changes

Please provide a summary of what's being changed

As per the above RFC and this issue: #1173, this is the initial PR for adding a data masker for sensitive information like PII into Powertools.

This feature is meant to mask, encrypt, and decrypt any and all data types entirely, or if provided a list of keys for a given dictionary (even nested keys), it has the ability to mask, encrypt, and decrypt the corresponding values in the dictionary.

In this PR, there are additions for:

  1. A new encryption Base Provider class that can mask, encrypt, and decrypt data, giving users the ability to install any encryption library they want and use it with our Data Masker.
  2. A Data Masker class that takes in an encryption provider, the abstraction through which the customer would mask, encrypt, and/or decrypt their data.
  3. An implementation of how the AWS Encryption SDK would implement the Base Provider class, which is an out of the box encryption provider for users to mask/encrypt their data.

NOTE: Measured latency for encrypting and decrypting nested fields in a 10KB json blob, came out to 5ms.

User experience

Please share what the user experience looks like before and after this change

This would be a new feature, so before a user would have to implement this logic themselves.

AFTER

from aws_lambda_powertools.utilities.data_masking.constants import KMS_KEY_ARN
from aws_lambda_powertools.utilities.data_masking import DataMasking
from aws_lambda_powertools.utilities.data_masking.provider.kms.aws_encryption_sdk import AwsEncryptionSdkProvider

def lambda_handler(event, context: LambdaContext):

    data_masker = DataMasking()
    data = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "[email protected]",
               "address": {
                    "street": "123 Main St", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }

    masked = data_masker.mask(data=data, fields=["email", "address.street"])
    """
    masked = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "*****",
               "address": {
                    "street": "*****", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """

    encryption_provider = AwsEncryptionSdkProvider(keys=[KMS_KEY_ARN])
    data_masker = DataMasking(provider=encryption_provider)

    encrypted = data_masker.encrypt(data=data, fields=["email", "address.street"])
    """
    encrypted = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "InRoaXMgaXMgYSBzdHJpbmciHsLZGx2na-XzP_TB5Bf2LNU1bLc",
               "address": {
                    "street": "XMgYSB_KDddaDJYMb-JpbmGnagTklwQ-msdaDLP", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """
 
    decrypted = data_masker.decrypt(data=encrypted, fields=["email", "address.street"])
    """
    decrypted = {
              "id": 1,
               "name": "John Doe",
               "age": 30,
               "email": "[email protected]",
               "address": {
                    "street": "123 Main St", 
                    "city": "Anytown", 
                    "state": "CA", 
                    "zip": "12345"
               },
      }
   """

Checklist

If your change doesn't seem to apply, please leave them unchecked.

Is this a breaking change?

RFC issue number:

Checklist:

  • Migration process documented
  • Implement warnings (if it can live side by side)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.

@pull-request-size pull-request-size bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 2, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented May 2, 2023

Thanks a lot for your first contribution! Please check out our contributing guidelines and don't hesitate to ask whatever you need.
In the meantime, check out the #python channel on our AWS Lambda Powertools Discord: Invite link

@github-actions github-actions bot added the feature New feature or functionality label May 2, 2023
@leandrodamascena leandrodamascena linked an issue May 2, 2023 that may be closed by this pull request
2 tasks
@leandrodamascena leandrodamascena self-requested a review May 8, 2023 13:43
@leandrodamascena
Copy link
Contributor

leandrodamascena commented May 8, 2023

Hi @seshubaws! Thank you so much for submitting this PR and for all the work put into it, this feature is a real improvement on the project as we certainly have a lot of customers who want it. The CI is failing for some reason that I haven't looked into yet, but at this point I see room for us to improve on the following and make the developer/user experience even better:

  • Organization of folders/files: perhaps it is better to reorganize folders and files in order to have smaller files and with a kind of single responsibility, something like this:

    • base.py - for the base class
    • provider.py - for the provider class
    • one file per provider
  • Write some performance tests for us to make this public and hear some feedback from the community.

  • Make some code changes to maintain an architecture pattern similar to the one we use.

  • Write documentation

  • Check the dependencies and what we are missing to include

  • Tests in general

I'm really excited to move forward with this PR. Would you rather I convert the draft to "Ready" and push some commits, or do I review and make some suggestions for you to work on?
Let's collaborate to make it a success! 🚀

@seshubaws
Copy link
Contributor Author

Hey @leandrodamascena, thanks for the comments! I just got back from PTO, I'll start working on revising this to be in smaller files like base.py, provider.py, etc. but you can convert this draft to Ready and push some commits as well!

@seshubaws seshubaws marked this pull request as ready for review May 17, 2023 20:10
@seshubaws seshubaws requested a review from a team as a code owner May 17, 2023 20:10
@boring-cyborg boring-cyborg bot added the dependencies Pull requests that update a dependency file label May 17, 2023
@seshubaws
Copy link
Contributor Author

Hey all, I'm running into some confusing when trying to mock tests for the AWS Encryption SDK. I tried using Stubber to stub a KMS key, but the encryption SDK doesn’t accept that mock key to actually encrypt anything, and Stubber doesn't accept the SDK as a client since it’s not an AWS service, just an SDK. Are there are any examples on how to mock this SDK?

Outside sources say try using the moto library, but I don't want to bring in more dependencies unless strictly necessary. @heitorlessa @rubenfonseca any thoughts on this?

@boring-cyborg boring-cyborg bot added the typing Static typing definition related issues (mypy, pyright, etc.) label May 19, 2023
@codecov-commenter
Copy link

codecov-commenter commented Sep 27, 2023

Codecov Report

Attention: Patch coverage is 77.27273% with 25 lines in your changes missing coverage. Please review.

Project coverage is 95.88%. Comparing base (4b90ed2) to head (c01ea35).
Report is 991 commits behind head on develop.

Files with missing lines Patch % Lines
...es/data_masking/provider/kms/aws_encryption_sdk.py 48.88% 23 Missing ⚠️
...s_lambda_powertools/utilities/data_masking/base.py 94.44% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2197      +/-   ##
===========================================
- Coverage    96.13%   95.88%   -0.26%     
===========================================
  Files          186      193       +7     
  Lines         8131     8241     +110     
  Branches      1525     1536      +11     
===========================================
+ Hits          7817     7902      +85     
- Misses         252      276      +24     
- Partials        62       63       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sonarcloud
Copy link

sonarcloud bot commented Sep 27, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@leandrodamascena leandrodamascena self-requested a review September 27, 2023 19:03
@leandrodamascena
Copy link
Contributor

Just curious - is this feature close? I have a team that would love to rip out some custom data masking we've implemented, and would favor a solution to be built into power tools

Hello @justinhauer! We are merging this PR today and hope to release this feature in our next release - next week. We still have some work to do, mainly on the documentation side, but this is one of the main Powertools news this year and I'm happy that you already have some cases to use it! As soon as we officially release this utility, I will ping you.

Thanks

Copy link
Contributor

@leandrodamascena leandrodamascena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @seshubaws, I have no words to describe the amount of great work you've done here. Implementing a utility that handles encryption, keys, key material, and a complex SDK is not simple at all!

I'm approving this PR and let start writing the documentation. We will probably detect some user experience improvements, but they are more cosmetic changes than logical ones.

image

@heitorlessa
Copy link
Contributor

This is AWESOME - echoing what Leandro said about the effort. I know we'll make more internal changes to get to an unwritten standard yet but that shouldn't be a blocking factor.

A few major pieces we should handle in a separate PR before we release this feature:

  • Lambda Layer. We need to get customers feedback on whether we include Encryption SDK as part of the Layer. It might increase ~6M in the final Layer, after removing AWS SDK and such. We need a comment in the RFC to drive customers and the community to discuss whether we include it or not. I favour including it as long as we stay within 10M boundary given how hard it is to bring these type of mixed languages dependencies in customers builds (default goes ~100M uncompressed).
  • Dataclasses/Classes and Pydantic. We need to add support for dataclasses and pydantic models while keeping the return type accurate (generics!). If a customer uses a dataclass/pydantic model along with fields, we should return the same instance with the values encrypted/masked - tremendous value here in UX while helping customers avoid perf pitfall of deserialization+serialization overhead.
  • Update RFC. We should update the linked RFC with the latest UX and once again after the docs PR is complete. This will serve as a future document for other languages and any discussions around it too.

@heitorlessa
Copy link
Contributor

Forgot two things..

@leandrodamascena we're gonna need a new label for this new feature to track future changes (changelog, issues, effort, etc.), and update our issue template as well as workflow that labels PRs based on title ;)

@seshubaws you took one of the hardest roadmap items and I can't begin to appreciate your resiliency throughout this long review cycle. I hope you learned a thing or two along the way too <3

Merging..

@heitorlessa heitorlessa changed the title feat(data_masking): add new data masking utility feat(data_masking): add new sensitive data masking utility Sep 27, 2023
@heitorlessa heitorlessa added the data-masking Sensitive Data Masking feature label Sep 27, 2023
@github-actions
Copy link
Contributor

⚠️Large PR detected⚠️

Please consider breaking into smaller PRs to avoid significant review delays. Ignore if this PR has naturally grown to this size after reviews.

@heitorlessa heitorlessa merged commit 35fb001 into aws-powertools:develop Sep 27, 2023
17 checks passed
@boring-cyborg
Copy link

boring-cyborg bot commented Sep 27, 2023

Awesome work, congrats on your first merged pull request and thank you for helping improve everyone's experience!

@justinhauer
Copy link

Just curious - is this feature close? I have a team that would love to rip out some custom data masking we've implemented, and would favor a solution to be built into power tools

Hello @justinhauer! We are merging this PR today and hope to release this feature in our next release - next week. We still have some work to do, mainly on the documentation side, but this is one of the main Powertools news this year and I'm happy that you already have some cases to use it! As soon as we officially release this utility, I will ping you.

Thanks

This is great, thank you and others for all the great work in this project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
commons data-masking Sensitive Data Masking feature dependencies Pull requests that update a dependency file feature New feature or functionality internal Maintenance changes size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tests typing Static typing definition related issues (mypy, pyright, etc.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: Sensitive data masking utility Sensitive data masking utility
6 participants