Accepted at the Data and Tool Showcase Track of the 20th International Conference on Mining Software Repositories (MSR 2023)
Throughout 2021, GitGuardian’s monitoring of public GitHub repositories revealed a two-fold increase in the number of exposed secrets (API keys, authentication tokens, database credentials, and other credentials) compared to 2020, accumulating more than six million secrets. However, no benchmark dataset is publicly available for researchers and tool developers to improve secret detection tools and thereby prevent secret leakage.
We present a large and versatile dataset, SecretBench, consisting of 97,479 manually labeled secrets of various secret types extracted from 818 public GitHub repositories. The dataset contains 15,084 true secrets out of 97,479 secrets. Our dataset covers 49 programming languages and 311 file types. We have made the dataset available for researchers and tool developers in Google BigQuery and Cloud Storage.
The dataset is stored in Google BigQuery and Cloud Storage. First, you need to create a Google Cloud Account. Google Cloud gives a $300 free credit after opening the account. You can run SQL queries in Google BigQuery to access the secrets and download repositories and related files from Google Cloud Storage.
- Google BigQuery Dataset id (dev-range-332204.secretbench.secrets): Google BigQuery contains 97,479 secrets with ground truth information indicating whether each secret is true or false. Additional metadata about the secrets, such as commit id, start line, and end line, can also be accessed. More details of the metadata are described in the Data Overview section.
- Google Cloud Storage (bucket-name: secretbench): The mined 818 public GitHub repositories are stored in Google Cloud Storage in a "Repos.zip" file. We have also stored the individual files containing the secrets in a "Files.zip" file.
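Once access has been granted, the two storage locations above could be reached from Python roughly as follows. This is a minimal sketch, not an official client: the table id and bucket name come from this README, while the function names, the `project_id` argument, and the example category value are assumptions made for illustration. It assumes the `google-cloud-bigquery` and `google-cloud-storage` packages are installed and that your Google Cloud credentials are already configured.

```python
TABLE_ID = "dev-range-332204.secretbench.secrets"  # dataset id from this README
BUCKET_NAME = "secretbench"                         # bucket name from this README


def build_query(limit=10):
    """Build SQL that selects true secrets of one category.

    The category is passed as the named query parameter @category.
    """
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT id, secret, repo_name, file_path, start_line, end_line "
        f"FROM `{TABLE_ID}` "
        "WHERE label = TRUE AND category = @category "
        f"LIMIT {limit}"
    )


def fetch_true_secrets(category, project_id, limit=10):
    """Run the query in BigQuery and return the rows as dictionaries."""
    # Imported lazily so build_query works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("category", "STRING", category)
        ]
    )
    rows = client.query(build_query(limit), job_config=job_config).result()
    return [dict(row.items()) for row in rows]


def download_archive(blob_name, destination, project_id):
    """Download one archive (for example "Files.zip") from the bucket."""
    from google.cloud import storage

    client = storage.Client(project=project_id)
    client.bucket(BUCKET_NAME).blob(blob_name).download_to_filename(destination)
```

A call such as `fetch_true_secrets("Private Key", "my-project")` would then return up to ten labeled rows; `"my-project"` is a placeholder for your own project id, and the exact category strings stored in the table should be checked against the data.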
Important: Researchers and developers who want to use our dataset need to contact us. Since the dataset contains sensitive information, a data protection agreement has to be signed with us to avoid any unethical use of the data. Once the agreement is signed, we will grant access to the dataset for the email addresses they provide.
We used 761 regular expression patterns to collect candidate repositories containing secrets. A snapshot of 10 regular expression patterns is presented in the table below. The complete list of regular expression patterns can be found here.
Pattern ID | Secret Type | Regular Expression | Source |
---|---|---|---|
65 | AWS API Secret | \b([A-Za-z0-9+/]{40})[ \r\n'"\x60] | TruffleHog |
71 | Azure Client Secret | (?i)(%s).{0,20}([a-z0-9_.-~]{34}) | TruffleHog |
216 | Dropbox API Key | \b(sl.[A-Za-z0-9-_]{130,140})\b | TruffleHog |
237 | Facebook Access Token | EAACEdEose0cBA[0-9A-Za-z]+ | Meli et al. |
278 | Generic Pattern | (?i)(?:pass|token|cred|secret|key)(?:.|[\n\r]){0,40}(\b[\x21-\x7e]{16,64}\b) | TruffleHog |
290 | Github Token | \b((?:ghp|gho|ghu|ghs|ghr)_[a-zA-Z0-9]{36,255})\b | TruffleHog |
605 | Slack Token | (xoxb|xoxp|xapp|xoxa|xoxr)-[0-9]{10,13}-[a-zA-Z0-9-]* | TruffleHog |
640 | Stripe API Key | [rs]k_live_[a-zA-Z0-9]{20,30} | TruffleHog |
691 | Twitter Access Token | (?i)(?:twitter)(?:.|[\n\r]){0,40}\b[1-9][0-9]+-[0-9a-zA-Z]{40}\b | Meli et al. |
747 | Youtube/Google OAuth ID | [0-9]+-[0-9A-Za-z_]{32}.apps.googleusercontent.com | Meli et al. |
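To illustrate how such patterns flag candidate secrets, the sketch below applies three of the patterns from the table (IDs 237, 290, and 640) to a synthetic snippet using Python's `re` module. The helper `find_candidates`, the sample text, and the fake tokens are made up for illustration; the original patterns come from tools that may use a different regex engine, so this is not necessarily how the candidates in the dataset were extracted.

```python
import re

# Three patterns copied from the table above (IDs 640, 290, 237).
PATTERNS = {
    "Stripe API Key": re.compile(r"[rs]k_live_[a-zA-Z0-9]{20,30}"),
    "Github Token": re.compile(
        r"\b((?:ghp|gho|ghu|ghs|ghr)_[a-zA-Z0-9]{36,255})\b"
    ),
    "Facebook Access Token": re.compile(r"EAACEdEose0cBA[0-9A-Za-z]+"),
}


def find_candidates(text):
    """Return (secret_type, line_no, start_column, match) for every hit.

    Line numbers are 1-based and columns are 0-based here; that convention
    is a choice of this sketch, not necessarily the one used in the dataset.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for secret_type, pattern in PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append(
                    (secret_type, line_no, match.start(), match.group(0))
                )
    return hits


# Synthetic, non-functional tokens built only to satisfy the patterns.
fake_stripe = "sk_live_" + "A1b2" * 6
fake_github = "ghp_" + "x" * 36
SAMPLE = (
    f'stripe_key = "{fake_stripe}"\n'
    'name = "demo"\n'
    f'token = "{fake_github}"\n'
)

for hit in find_candidates(SAMPLE):
    print(hit)
```

Scanning line by line keeps the position bookkeeping simple but cannot find secrets that span several lines, such as private keys; those would need a match over the whole file content.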
We curated 818 public GitHub repositories and extracted 97,479 candidate secrets. Out of 97,479 secrets, we labeled 15,084 secrets as true secrets. Each secret was manually labeled as actual or not after inspecting the secret and its source code context. Below we present an overview of the SecretBench data.
Field Name | Description | Data Type |
---|---|---|
id | Unique identifier of the secret. | String |
secret | Candidate secret string. The secret is surrounded by square brackets "[]". | String |
repo_name | Name of the repository. For example: "setu1421/SecretBench" | String |
domain | Domain of the repository such as GitHub | String |
commit_id | Commit hash where the secret is added. For example: "a074a5afe1d2663fda756c1bf3c87bad426cf7de" | String |
file_path | File path where the secret is included. For example: "dev.config" and "config/test.env". | String |
file_type | Type of the file such as .py and .config. | String |
start_line | Start line no. in the file where the secret is present. | Integer |
end_line | End line no. in the file where the secret is present. For secrets present in a single line, the start_line and end_line will be the same. | Integer |
start_column | Start index of the secret in the start line. | Integer |
end_column | End index of the secret in the end line. | Integer |
committer_email | Email address of the developer who committed the secret. | String |
commit_date | The timestamp of the commit. For example: 2018-10-24T21:22:19Z | TimeStamp |
label | The ground truth label of the secret. "True" for actual secret and "False" for fake/dummy secret. | Boolean |
is_template | Flag to indicate if the secret is a placeholder such as "MY_PASSWORD" and "Place_Your_Token_Here". | Boolean |
in_url | Flag to indicate if the secret is part of URL such as "http://user:[email protected]". | Boolean |
entropy | Shannon entropy value of the secret. | Float |
character_set | Characters used in the secret such as NumberOnly, CharOnly and Any. | String |
has_words | Flag to indicate if any common English word at least 4 characters long is present within the secret. | Boolean |
length | Length of the secret. | Integer |
is_multiline | Flag to indicate if the secret is present in multiple lines. Most of the time true for private keys. | Boolean |
category | The category of the secret. The secrets are categorized in eight categories. See section Secret Categorization. | String |
file_identifier | Unique identifier of the file to check the secret from local system. | String |
repo_identifier | Unique identifier of the repository to check the secret from local system. | String |
comment | A description of the secret types such as Slack Token, AWS Access Key ID and Value with "key" as part of attribute name. | String |
The "repo_identifier" and "file_identifier" can be used to locate the specific repository and the file where the secret is present. The repositories and files can be downloaded from Google Cloud Storage. See Section How to Use.
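For readers unfamiliar with the `entropy` field in the table above, Shannon entropy can be computed per string as shown below. This is a sketch under an assumption: it uses character-level entropy in bits (base-2 logarithm), and this README does not state which base or alphabet was used to fill the field, so the values may not match the dataset exactly.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = -sum(p * log2(p)) over the relative frequency p of each character.
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )


# A repeated character carries no information; four distinct characters
# that are equally frequent give log2(4) = 2 bits per character.
print(shannon_entropy("aaaa"))
print(shannon_entropy("abcd"))
```

Placeholder-like strings tend to score low and random-looking tokens tend to score high, which is why entropy is often used as one signal when separating real secrets from dummy values.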
The secrets present in our dataset are categorized into eight categories. In the table below, we present the category description and the number of true secrets and total candidate secrets.
Category Name | Description | True Secrets | Total Secrets |
---|---|---|---|
Private Key | This category contains the private keys such as cryptographic RSA private key and EC private key. | 5,789 | 8,584 |
API Key and Secret | This category contains any API keys and secrets such as Twilio API key and Stripe API key. | 4,529 | 5,162 |
Authentication Key and Token | This category contains the access keys and tokens such as AWS Access Key ID and Slack Token. | 3,569 | 5,833 |
Generic Secret | This category contains any generic secrets such as application package secret and reCAPTCHA site key. | 334 | 439 |
Database and Server URL | This category contains the database and server URLs. For example, any MongoDB connection string and any FTP server URL. | 162 | 9,970 |
Password | This category contains any plain text passwords. | 150 | 705 |
Username | This category contains any plain text usernames. | 27 | 96 |
Other | This category contains other possible secrets such as Package Key ID or any random string. | 524 | 66,690 |
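The counts in the table above can be cross-checked against the dataset totals and turned into a per-category share of true secrets. The sketch below only uses the numbers from the table; the dictionary and function names are chosen for illustration.

```python
# (true secrets, total candidate secrets) per category, from the table above.
CATEGORIES = {
    "Private Key": (5789, 8584),
    "API Key and Secret": (4529, 5162),
    "Authentication Key and Token": (3569, 5833),
    "Generic Secret": (334, 439),
    "Database and Server URL": (162, 9970),
    "Password": (150, 705),
    "Username": (27, 96),
    "Other": (524, 66690),
}


def true_share(category):
    """Fraction of candidate secrets in a category labeled as true."""
    true_count, total_count = CATEGORIES[category]
    return true_count / total_count


total_true = sum(true_count for true_count, _ in CATEGORIES.values())
total_all = sum(total_count for _, total_count in CATEGORIES.values())

# The category counts add up to the dataset totals: 15,084 and 97,479.
print(total_true, total_all)

for name in CATEGORIES:
    print(f"{name}: {true_share(name):.1%} true")
```

The shares differ widely between categories: most "API Key and Secret" candidates are true secrets, while the "Other" and "Database and Server URL" categories consist almost entirely of false positives.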
Our dataset covers 49 programming languages. Note that each GitHub repository can have multiple programming languages. The top 10 programming languages based on the number of repositories are presented below. The full list of programming languages can be found here.
Language Name | No. of Repositories |
---|---|
Shell | 459 |
JavaScript | 414 |
Python | 312 |
Java | 180 |
Ruby | 172 |
C | 128 |
Batch | 124 |
C++ | 111 |
PHP | 107 |
Go | 88 |
Our dataset consists of secrets present in 311 file types. Below we present the top 5 file types based on the number of candidate secrets in our dataset. The full list of file types can be found here.
File type | Description | Total Secrets |
---|---|---|
js | JavaScript File | 10,412 |
nix | Package Manager File | 8,623 |
json | JavaScript Object Notation File | 8,132 |
txt | Text File | 7,737 |
xml | Extensible Markup Language File | 6,429 |
In addition, we present the top 5 file types based on the number of true secrets in our dataset.
File type | Description | True Secrets |
---|---|---|
txt | Text File | 2,935 |
toml | Configuration File | 1,985 |
js | JavaScript File | 1,583 |
html | Hypertext Markup Language File | 1,337 |
pem | Privacy Enhanced Mail Format File | 813 |
- Currently, our dataset consists of only GitHub repositories. We will expand our dataset by including repositories from other hosting platforms such as GitLab and Bitbucket.
- We will enrich our dataset with more features related to secrets, such as whether the secrets have parentheses (possible function call) and begin with a $ sign (possible variable). The complete list of our additional features is available here.
This project is licensed under the terms of the MIT license. Please check LICENSE for more details.
Since our dataset contains sensitive information, we will make the dataset available only to researchers and tool developers. The researchers and tool developers will sign an agreement to protect the data from any unethical use.
Please email us if you want to contribute. See Authors section for contact information.
- Setu Kumar Basak ([email protected])
- Lorenzo Neil ([email protected])
- Bradley Reaves ([email protected])
- Laurie Williams ([email protected])
BibTeX
@INPROCEEDINGS{10174157,
author={Basak, Setu Kumar and Neil, Lorenzo and Reaves, Bradley and Williams, Laurie},
booktitle={2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR)},
title={SecretBench: A Dataset of Software Secrets},
year={2023},
volume={},
number={},
pages={347-351},
doi={10.1109/MSR59073.2023.00053}}
Plain Text
S. K. Basak, L. Neil, B. Reaves and L. Williams, "SecretBench: A Dataset of Software Secrets," 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), Melbourne, Australia, 2023, pp. 347-351, doi: 10.1109/MSR59073.2023.00053.