Addition of three new predefined recognizers, improved regex for IN_PAN #1323

Open
wants to merge 47 commits into
base: main
Changes from 46 commits
Commits
818fe90
IN_PAN pattern recognizer
devopam Jun 26, 2023
87a1aae
refined IN_PAN regex
devopam Jun 27, 2023
8756c93
Update recognizer_registry.py
devopam Jun 28, 2023
2f85d5d
Fixed Lint errors
devopam Jun 28, 2023
1b47061
Merge branch 'main' of https://github.com/devopam/presidio
devopam Jun 28, 2023
b0d1ce8
Added more test cases in test_in_pan_recognizer.py
devopam Jul 4, 2023
b3e94ed
Merge branch 'main' into main
devopam Jul 9, 2023
838402f
Merge branch 'main' into main
devopam Jul 9, 2023
d4ae26d
Merge branch 'main' into main
omri374 Jul 11, 2023
1e81cfb
Merge branch 'main' of https://github.com/devopam/presidio
devopam Jan 11, 2024
88c6c1f
added IN_AADHAAR recognizer
devopam Jan 15, 2024
b4edab4
Merge branch 'microsoft:main' into main
devopam Jan 15, 2024
2d01bd0
Update in_aadhaar_recognizer.py
devopam Jan 16, 2024
b7c6e65
Merge branch 'main' into main
devopam Jan 16, 2024
2434bb5
Update in_aadhaar_recognizer.py
devopam Jan 17, 2024
b6db593
added utility function class
devopam Jan 23, 2024
2dd5cec
Merge branch 'main' into main
devopam Jan 23, 2024
dfb2d26
Merge branch 'main' into main
omri374 Jan 28, 2024
fd28708
Create test_analyzer_utils.py
devopam Jan 28, 2024
f0c9737
Update test_recognizer_registry.py
devopam Jan 29, 2024
a67f19f
Merge branch 'main' into main
omri374 Jan 30, 2024
8383e08
Merge branch 'microsoft:main' into main
devopam Jan 31, 2024
37b2f97
Merge branch 'microsoft:main' into main
devopam Feb 9, 2024
57b2294
added predefined recognizer : IN_VEHICLE_REGISTRATION
devopam Feb 9, 2024
365be21
review comments incorporated
devopam Feb 12, 2024
3cdec15
Merge branch 'main' into main
devopam Feb 12, 2024
28f8bec
Merge branch 'main' into main
omri374 Feb 15, 2024
bc059ce
review comments incorporated
devopam Feb 15, 2024
1ffbb8b
added null/min vehicle number size
devopam Feb 15, 2024
b05399f
Merge branch 'main' into main
devopam Feb 15, 2024
2a4708b
incorporated review comments
devopam Feb 18, 2024
22003a4
Merge branch 'main' into main
devopam Feb 19, 2024
3f00fdc
Merge branch 'main' into main
devopam Feb 21, 2024
424174d
added two predefined recognizers : ISIN, CFI
devopam Feb 24, 2024
b0767aa
added three predefined recognizers, improvements
devopam Mar 4, 2024
4133632
merged main branch conflicts
devopam Mar 4, 2024
d1f2fc6
removed pycountry
devopam Mar 9, 2024
93a79cf
Merge branch 'main' into main
devopam Mar 9, 2024
3053088
Merge branch 'main' into main
omri374 Mar 13, 2024
65a2e70
review feedback incorporation
devopam Mar 14, 2024
f4a1541
Merge branch 'main' into main
devopam Mar 14, 2024
4391cd4
interim commit - not ready for merging
devopam Mar 25, 2024
acf7331
Merge branch 'microsoft:main' into main
devopam Mar 25, 2024
e26da0f
Merge branch 'microsoft:main' into main
devopam Mar 31, 2024
e040ecc
incorporated review suggestions
devopam Apr 17, 2024
7d6ee38
Merge branch 'main' into main
omri374 Apr 23, 2024
64407fb
interim commit
devopam May 21, 2024
39 changes: 21 additions & 18 deletions docs/supported_entities.md
@@ -10,20 +10,22 @@ For more information, refer to the [adding new recognizers documentation](analyz

### Global

|Entity Type | Description | Detection Method |
| --- | --- | --- |
|CREDIT_CARD |A credit card number is between 12 to 19 digits. <https://en.wikipedia.org/wiki/Payment_card_number>|Pattern match and checksum|
|CRYPTO|A Crypto wallet number. Currently only Bitcoin address is supported|Pattern match, context and checksum|
|DATE_TIME|Absolute or relative dates or periods or times smaller than a day.|Pattern match and context|
|EMAIL_ADDRESS|An email address identifies an email box to which email messages are delivered|Pattern match, context and RFC-822 validation|
|IBAN_CODE|The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors.|Pattern match, context and checksum|
|IP_ADDRESS|An Internet Protocol (IP) address (either IPv4 or IPv6).|Pattern match, context and checksum|
|NRP|A person’s Nationality, religious or political group.|Custom logic and context|
|LOCATION|Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains|Custom logic and context|
|PERSON|A full person name, which can include first names, middle names or initials, and last names.|Custom logic and context|
|PHONE_NUMBER|A telephone number|Custom logic, pattern match and context|
|MEDICAL_LICENSE|Common medical license numbers.|Pattern match, context and checksum|
|URL|A URL (Uniform Resource Locator), unique identifier used to locate a resource on the Internet|Pattern match, context and top level url validation|
| Entity Type | Description | Detection Method |
|-----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|
| CFI | CFI (Classification of Financial Instruments) is a six letter code to classify a financial instrument as per ISO 10962 | Pattern match, context |
| CREDIT_CARD | A credit card number is between 12 to 19 digits. <https://en.wikipedia.org/wiki/Payment_card_number> | Pattern match and checksum |
| CRYPTO | A Crypto wallet number. Currently only Bitcoin address is supported | Pattern match, context and checksum |
| DATE_TIME | Absolute or relative dates or periods or times smaller than a day. | Pattern match and context |
| EMAIL_ADDRESS | An email address identifies an email box to which email messages are delivered | Pattern match, context and RFC-822 validation |
| IBAN_CODE | The International Bank Account Number (IBAN) is an internationally agreed system of identifying bank accounts across national borders to facilitate the communication and processing of cross border transactions with a reduced risk of transcription errors. | Pattern match, context and checksum |
| IP_ADDRESS | An Internet Protocol (IP) address (either IPv4 or IPv6). | Pattern match, context and checksum |
| ISIN            | An ISIN (International Securities Identification Number) is a 12-character unique identifier used to recognize a security as per ISO 6166                                                                                                                      | Pattern match, context                              |
| NRP | A person’s Nationality, religious or political group. | Custom logic and context |
| LOCATION | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains | Custom logic and context |
| PERSON | A full person name, which can include first names, middle names or initials, and last names. | Custom logic and context |
| PHONE_NUMBER | A telephone number | Custom logic, pattern match and context |
| MEDICAL_LICENSE | Common medical license numbers. | Pattern match, context and checksum |
| URL | A URL (Uniform Resource Locator), unique identifier used to locate a resource on the Internet | Pattern match, context and top level url validation |
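
As a rough illustration of what "pattern match" could look like for the ISIN row, the sketch below checks the 12-character shape and then the trailing Luhn check digit. The checksum step goes beyond what the table lists for ISIN, and the names `ISIN_PATTERN` and `looks_like_isin` are assumptions made for this example, not the recognizer added in this PR.

```python
import re

# Hypothetical pattern: 2-letter prefix, 9 alphanumerics, 1 numeric check digit.
ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def looks_like_isin(candidate: str) -> bool:
    """Return True if candidate has the ISIN shape and a valid Luhn check digit."""
    if not ISIN_PATTERN.match(candidate):
        return False
    # Expand letters to numbers (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in candidate)
    total = 0
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0
```

A real recognizer would likely combine a looser pattern with context words, so treat this only as a shape-plus-checksum sketch.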

### USA

@@ -80,11 +82,12 @@ For more information, refer to the [adding new recognizers documentation](analyz
|AU_MEDICARE| Medicare number is a unique identifier issued by Australian Government that enables the cardholder to receive a rebates of medical expenses under Australia's Medicare system| Pattern match, context, and checksum |

### India
| FieldType | Description |Detection Method|
|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--- |
| IN_PAN | The Indian Permanent Account Number (PAN) is a unique 12 character alphanumeric identifier issued to all business and individual entities registered as Tax Payers. | Pattern match, context |
| IN_AADHAAR | Indian government issued unique 12 digit individual identity number | Pattern match, context, and checksum |
| FieldType | Description |Detection Method|
|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--- |
| IN_PAN | The Indian Permanent Account Number (PAN) is a unique 12 character alphanumeric identifier issued to all business and individual entities registered as Tax Payers. | Pattern match, context |
| IN_AADHAAR | Indian government issued unique 12 digit individual identity number | Pattern match, context, and checksum |
| IN_VEHICLE_REGISTRATION | Indian government issued transport (govt, personal, diplomatic, defence) vehicle registration number | Pattern match, context, and checksum |
| IN_GSTIN | Indian government issued unique goods and services tax identification number | Pattern match, context, and checksum |
| IN_VOTER                | 10-character alphanumeric voter ID issued by the Election Commission of India to all Indian citizens (age 18 or above)                                              | Pattern match, context               |
| IN_PASSPORT             | Indian Passport Number                                                                                                                                              | Pattern match, context               |

175 changes: 174 additions & 1 deletion presidio-analyzer/presidio_analyzer/analyzer_utils.py
@@ -1,4 +1,6 @@
from typing import List, Tuple
import csv
import os


class PresidioAnalyzerUtils:
@@ -9,6 +11,20 @@ class PresidioAnalyzerUtils:
    logic for re-usability and maintainability
    """

    __country_master_file_path__ = "presidio_analyzer/data/country_master.csv"
    __country_master__ = []

    def __init__(self):
        # provision to override the default path for future need
        __country_master_file_path__ = "presidio_analyzer/data/country_master.csv"
        __country_master_file_path__ = (
            __country_master_file_path__
            if __country_master_file_path__
            else self.__country_master_file_path__
        )

        self.__load_country_master__()
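
Note that the constructor in this hunk assigns the path to a local name only, so the "provision to override" never reaches the instance. A minimal sketch of one way an override could be wired in, assuming an optional constructor argument; the class name `PathHolder`, the constant `DEFAULT_PATH`, and the parameter name are hypothetical and not part of the PR:

```python
from typing import Optional

# Default taken from the diff; the constant name itself is an assumption.
DEFAULT_PATH = "presidio_analyzer/data/country_master.csv"


class PathHolder:
    """Hypothetical stand-in for the class in the diff; it only resolves the path."""

    def __init__(self, country_master_file_path: Optional[str] = None):
        # Fall back to the default when no override (or an empty one) is supplied.
        self.country_master_file_path = country_master_file_path or DEFAULT_PATH
```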

    @staticmethod
    def is_palindrome(text: str, case_insensitive: bool = False):
        """
@@ -36,13 +52,33 @@ def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
            text = text.replace(search_string, replacement_string)
        return text

    @staticmethod
    def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
Review comment (Contributor):

Suggested change
    def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    def get_luhn_mod_n(input_str: str, alphabet="0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ") -> bool:

        """
        Check if the given input number has a valid last checksum as per LUHN algorithm.

        https://en.wikipedia.org/wiki/Luhn_mod_N_algorithm
        :param alphabet: input alpha-numeric list of characters to determine mod 'N'
        :param input_str: the alpha numeric string to be checked for LUHN algorithm
        :return: True/False
        """
        if len(alphabet) == 0:
            return False

        charset = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        n = len(charset)
        luhn_input = tuple(alphabet.index(i) for i in reversed(str(input_str)))
        return (
            sum(luhn_input[::2]) + sum(sum(divmod(i * 2, n)) for i in luhn_input[1::2])
        ) % n == 0
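
The function above takes the modulus from a hard-coded `charset` rather than from the `alphabet` argument, so the two only agree for the default alphabet. A standalone sketch of the same Luhn mod N check with `n` derived from `alphabet`, and with characters outside the alphabet treated as invalid instead of raising; the function name `luhn_mod_n_is_valid` is an assumption for illustration, not the PR's API:

```python
def luhn_mod_n_is_valid(
    input_str: str, alphabet: str = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
) -> bool:
    """Return True if the last character of input_str is a valid Luhn mod N check."""
    n = len(alphabet)
    if n == 0 or not input_str:
        return False
    try:
        # Map characters to code points, starting from the check character.
        code_points = [alphabet.index(ch) for ch in reversed(input_str)]
    except ValueError:
        return False  # a character is not part of the alphabet
    # Every second code point (from the right) is doubled and its base-n digits summed.
    total = sum(code_points[::2]) + sum(
        sum(divmod(cp * 2, n)) for cp in code_points[1::2]
    )
    return total % n == 0
```

With a ten-digit alphabet this reduces to the ordinary Luhn check used for payment card numbers.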

    @staticmethod
    def is_verhoeff_number(input_number: int):
Review comment (Contributor):

Suggested change
    def is_verhoeff_number(input_number: int):
    def is_verhoeff_number(input_number: int) -> bool:

        """
        Check if the input number is a true verhoeff number.

        :param input_number:
        :return:
        :return: Bool
        """
        __d__ = [
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
@@ -73,3 +109,140 @@ def is_verhoeff_number(input_number: int):
        for i in range(len(inverted_number)):
            c = __d__[c][__p__[i % 8][inverted_number[i]]]
        return __inv__[c] == 0
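
Since the diff collapses most of the lookup tables, here is a self-contained sketch of a Verhoeff validity check using the standard multiplication and permutation tables of the published algorithm. The names `D`, `P` and `verhoeff_is_valid` are chosen for this example and are not the identifiers used in the PR.

```python
# Multiplication table of the dihedral group D5.
D = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 2, 3, 4, 0, 6, 7, 8, 9, 5),
    (2, 3, 4, 0, 1, 7, 8, 9, 5, 6),
    (3, 4, 0, 1, 2, 8, 9, 5, 6, 7),
    (4, 0, 1, 2, 3, 9, 5, 6, 7, 8),
    (5, 9, 8, 7, 6, 0, 4, 3, 2, 1),
    (6, 5, 9, 8, 7, 1, 0, 4, 3, 2),
    (7, 6, 5, 9, 8, 2, 1, 0, 4, 3),
    (8, 7, 6, 5, 9, 3, 2, 1, 0, 4),
    (9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
)

# Position-dependent permutation table (period 8).
P = (
    (0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    (1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
    (5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    (8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
    (9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
    (4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    (2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
    (7, 0, 4, 6, 9, 1, 3, 2, 5, 8),
)


def verhoeff_is_valid(number: int) -> bool:
    """Return True if the last digit of number is a valid Verhoeff check digit."""
    if number < 0:
        return False
    c = 0
    # Walk the digits from right to left, check digit first.
    for i, digit in enumerate(int(ch) for ch in reversed(str(number))):
        c = D[c][P[i % 8][digit]]
    return c == 0
```

Comparing `c` with 0 directly is equivalent to the `__inv__[c] == 0` test in the diff, because 0 is the only element whose inverse is 0.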

    def __load_country_master__(self):
        """
        Load various standards as defined in Country specific metadata.

        :return: None
        """
        if os.path.isfile(self.__country_master_file_path__) is not True:
            raise FileNotFoundError()
        else:
            with open(
                file=self.__country_master_file_path__,
devopam marked this conversation as resolved.
                mode="r",
                newline="",
                encoding="utf-8",
            ) as csvfile:
                if csv.Sniffer().has_header(csvfile.readline()) is not True:
                    raise Exception(
                        "Header missing in file: {}".format(
                            self.__country_master_file_path__
                        )
                    )
                csvfile.seek(0)  # read the header as well, hence start from beginning
                country_info = csv.DictReader(csvfile, fieldnames=None)
                self.__country_master__ = list(country_info)

            if len(self.__country_master__) <= 1:
                raise Exception(
                    "Blank file: {} detected.".format(self.__country_master_file_path__)
                )
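
For comparison, a compact sketch of the same load step. It swaps the `csv.Sniffer` header test for a simpler check on `DictReader.fieldnames`, and raises `ValueError` instead of a bare `Exception`; both choices, and the function name `load_rows`, are assumptions of this example rather than what the PR does.

```python
import csv
from typing import Dict, List


def load_rows(path: str) -> List[Dict[str, str]]:
    """Read a CSV file with a header row into a list of dicts, one per data row."""
    with open(path, mode="r", newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        fieldnames = reader.fieldnames  # reads the header row, None if file is empty
        rows = list(reader)
    if not fieldnames:
        raise ValueError("Header missing in file: {}".format(path))
    if not rows:
        raise ValueError("Blank file: {} detected.".format(path))
    return rows
```

`open` already raises `FileNotFoundError` for a missing path, so no separate `os.path.isfile` check is needed in this variant.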

    def __get_country_master_full_data__(self, iso_code: str = ""):
        """
        Fetch all country information for a specific column (index).

        :param iso_code:
        :return:
        """
        supported_codes = [
            "ISO3166-1-Alpha-2",
            "ISO3166-1-Alpha-3",
            "ISO3166-1-Numeric",
            "ISO4217-Alpha-3",
            "ISO4217-Numeric",
        ]
        if iso_code.strip() not in supported_codes:
            return None
        else:
            # return full country list for given code
            country_information = [
                country[iso_code] for country in self.__country_master__
            ]
            country_information = list(filter(None, country_information))
            return country_information
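
The column extraction above can be sketched on plain data as follows. The helper name `column_values` and its parameters are hypothetical; unlike the diff, this version strips the column name before using it as a key, so a padded name does not raise `KeyError`.

```python
from typing import Dict, List, Optional, Sequence


def column_values(
    rows: Sequence[Dict[str, str]], column: str, supported: Sequence[str]
) -> Optional[List[str]]:
    """Return the non-empty values of one column, or None if it is unsupported."""
    column = column.strip()
    if column not in supported:
        return None
    # Skip rows where the column is missing or empty, as filter(None, ...) does.
    return [row[column] for row in rows if row.get(column)]
```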

    def get_country_codes(self, iso_code: str):
        """
        Fetch all defined country codes per required ISO format.

        :param iso_code: currently supporting : ISO3166-1-Alpha-2,
        ISO3166-1-Alpha-3, ISO3166-1-Numeric
        :return: List of country codes in provided ISO format.
        """
        supported_codes = [
            "ISO3166-1-Alpha-2",
            "ISO3166-1-Alpha-3",
            "ISO3166-1-Numeric",
        ]
        if iso_code.strip() not in supported_codes:
            print("Code Invalid: ")
            return None
        else:
            # return full country list for given code
            return self.__get_country_master_full_data__(iso_code=iso_code)

    def get_currency_codes(self, iso_code: str = ""):
        """
        Retrieve all defined currency codes across countries.

        :param iso_code: currently supporting : ISO4217-Alpha-3, ISO4217-Numeric
        :return: List of currency codes in provided ISO format.
        """
        supported_codes = ["ISO4217-Alpha-3", "ISO4217-Numeric"]
        if iso_code.strip() not in supported_codes:
            return None
        else:
            # return full country list for given code
            return self.__get_country_master_full_data__(iso_code=iso_code)

    def get_full_country_information(self, lookup_key: str, lookup_index: str):
Review comment (Contributor):

please define return type (List[str]?)

        """
        Fetch additional information through lookup_index in index of lookup_key.

        :param lookup_key: Item to be searched
        :param lookup_index: A valid index_name out of available values
        English_short_name_using_title_case, English_full_name,
        FIFA_country_code, International_olympic_committee_country_code,
        ISO3166-1-Alpha-2,ISO3166-1-Alpha-3, ISO3166-1-Numeric,
        International_licence_plate_country_code, Country_code_top_level_domain,
        Currency_Name, ISO4217-Alpha-3, ISO4217-Numeric, Capital_City, Dialing_Code
        :return: Dictionary object with additional information enriched from
Review comment (Contributor):

It says it returns a dictionary, but it looks like the code returns a list

        master lookup

        """
        allowed_indices = [
            "English_short_name_using_title_case",
            "English_full_name",
            "FIFA_country_code",
            "International_olympic_committee_country_code",
            "ISO3166-1-Alpha-2",
            "ISO3166-1-Alpha-3",
            "ISO3166-1-Numeric",
            "International_licence_plate_country_code",
            "Country_code_top_level_domain",
            "Currency_Name",
            "ISO4217-Alpha-3",
            "ISO4217-Numeric",
            "Capital_City",
            "Dialing_Code",
        ]
        if (
            lookup_index is None
            or len(lookup_index.strip()) == 0
            or lookup_index not in allowed_indices
        ):
            print("Lookup Index problem")
            return None
        elif lookup_key is None or len(lookup_key.strip()) == 0:
            print("Lookup Key issue")
            return None
        else:
            return list(
                filter(
                    lambda country: country[lookup_index] == lookup_key,
                    self.__country_master__,
                )
            )
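
As the review comment points out, the lookup returns a list of matching rows rather than a single dictionary. A small sketch of that behaviour on plain data, with an explicit return type; the function name `find_rows` is hypothetical, and returning an empty list for a blank key (instead of `None`) is a design choice of this example, not the PR's behaviour:

```python
from typing import Dict, List, Sequence


def find_rows(
    rows: Sequence[Dict[str, str]], lookup_key: str, lookup_index: str
) -> List[Dict[str, str]]:
    """Return every row whose lookup_index column equals lookup_key."""
    if not lookup_key or not lookup_key.strip():
        return []
    # A row that lacks the column simply does not match.
    return [row for row in rows if row.get(lookup_index) == lookup_key]
```

Returning a list in every case lets callers iterate over the result without a `None` check.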