Refactor FIPS & unify CLI actions #275

Merged
merged 83 commits into from
Dec 8, 2022
Commits (83)
a900b90
unify CLI actions
adamjanovsky Oct 25, 2022
0e12ba4
merge dataset constructors, from_web_latest code
adamjanovsky Oct 25, 2022
7aff8bf
unify _get_certs_by_name methods
adamjanovsky Oct 25, 2022
d0d9f91
unify get_keywords_df method
adamjanovsky Oct 25, 2022
1b150aa
unify and generalize dataset method get_keywords_df()
adamjanovsky Oct 27, 2022
5110b52
root_dir setter for FIPSDataset
adamjanovsky Oct 27, 2022
39c89c1
WiP: refactor FIPS get_certs_from_web()
adamjanovsky Oct 27, 2022
9433658
implement artifact download FIPS
adamjanovsky Oct 27, 2022
933b469
refactor tests unittest -> pytest
adamjanovsky Nov 4, 2022
80af3a2
add type hint for json serialization
adamjanovsky Nov 9, 2022
6134479
new object to hold auxillary datasets
adamjanovsky Nov 9, 2022
fad4fbf
use temp folders for cc analysis test data
adamjanovsky Nov 9, 2022
e26fb0c
mark further download tests with xfail
adamjanovsky Nov 9, 2022
14c369b
fix xfail marker on cpe_dset_from_web test
adamjanovsky Nov 9, 2022
8482d82
pandas tests, cve_dset, cpe_dset unify json_path approach
adamjanovsky Nov 10, 2022
c6913dc
merge main
adamjanovsky Nov 10, 2022
24f11fb
test maintenance updates
adamjanovsky Nov 10, 2022
9052138
fix paths handling in CPEDataset, CVEDataset
adamjanovsky Nov 11, 2022
67c6295
cleanup path issues
adamjanovsky Nov 11, 2022
1bc1c7d
auxillary dataset processing CC
adamjanovsky Nov 11, 2022
ca5d4c8
fix mypy error in cli.py
adamjanovsky Nov 11, 2022
b41fa12
fix error in mu dset tests
adamjanovsky Nov 11, 2022
d51c65c
FIPS policy pdf convert refactoring
adamjanovsky Nov 11, 2022
f7c5915
cleanup in fips code structure
adamjanovsky Nov 11, 2022
9fbba9e
common interface for Dataset.analyze_certificates()
adamjanovsky Nov 16, 2022
c0ca076
merge main
adamjanovsky Nov 16, 2022
2a694c4
delete plot_graph() of FIPSDataset
adamjanovsky Nov 16, 2022
9462624
analyce_certificate() interface, delete dead code
adamjanovsky Nov 16, 2022
01e4156
FIPSDataset new parsing of html modules
adamjanovsky Nov 17, 2022
315270a
fix tests
adamjanovsky Nov 23, 2022
2871384
refactor algorithm extraction from policy tables
adamjanovsky Nov 23, 2022
048f3f6
delete InternalState.errors of cert objects
adamjanovsky Nov 23, 2022
4e582e1
deduplicate FIPSAlgorithm data structures
adamjanovsky Nov 23, 2022
0568ca4
remove graphviz requirement
adamjanovsky Nov 23, 2022
d7603e1
move AlgorithmDataset to AuxillaryDatasets class
adamjanovsky Nov 23, 2022
35c5734
Refactor FIPSAlgorithm objects
adamjanovsky Nov 25, 2022
25d42fc
update flake8 CI workflow
adamjanovsky Nov 25, 2022
67fc667
update flake8 config
adamjanovsky Nov 25, 2022
97dce48
cleanup
adamjanovsky Nov 25, 2022
4d0ae40
clean-up, update docs, cli
adamjanovsky Nov 25, 2022
cb879f3
fix json objects for fips test
adamjanovsky Nov 29, 2022
5895c85
rename dependency -> references of transitive vulns
adamjanovsky Nov 29, 2022
c6d826c
fips refactor reference computation
adamjanovsky Nov 30, 2022
fc49b6a
implement transitive vuln. search for FIPS
adamjanovsky Nov 30, 2022
cae2dc2
restrict usage of fresh bool param
adamjanovsky Nov 30, 2022
e062f3e
improve dataset processing logging
adamjanovsky Nov 30, 2022
5b0a7cb
fix table extraction from fips policies
adamjanovsky Dec 2, 2022
f681244
fix reference computation fips
adamjanovsky Dec 2, 2022
08ff031
update readme
adamjanovsky Dec 2, 2022
3fbf5f0
random fixes for cc pipeline
adamjanovsky Dec 2, 2022
6953dfb
fix CC notebooks
adamjanovsky Dec 2, 2022
6d7a907
random fixes in FIPS notebooks
adamjanovsky Dec 2, 2022
2f21854
move label studio interface layout file
adamjanovsky Dec 2, 2022
91b0973
update readme
adamjanovsky Dec 2, 2022
4ddae8a
introduce pyupgrade
adamjanovsky Dec 2, 2022
6d66552
bump scipy, dependabot errors on it
adamjanovsky Dec 2, 2022
40206cd
bump pillow lib
adamjanovsky Dec 2, 2022
25dcec9
bump Github action versions
adamjanovsky Dec 2, 2022
ca0c4e2
convert examples to notebooks
adamjanovsky Dec 2, 2022
86a62cb
fips normalize embodiment string
adamjanovsky Dec 2, 2022
6ce7007
unify from __future__ import annotations
adamjanovsky Dec 5, 2022
4e62ae1
Update sec_certs/dataset/common_criteria.py
adamjanovsky Dec 5, 2022
1a3502a
Update sec_certs/dataset/fips.py
adamjanovsky Dec 5, 2022
8f7a14b
entry guard
adamjanovsky Dec 5, 2022
4085c61
revive tests settings
adamjanovsky Dec 5, 2022
b37eaaf
fix here, fix there
adamjanovsky Dec 5, 2022
6c02383
rename dataset of maintenance updates
adamjanovsky Dec 5, 2022
8ac389a
Update sec_certs/dataset/common_criteria.py
adamjanovsky Dec 5, 2022
bc5a532
Update sec_certs/model/cpe_matching.py
adamjanovsky Dec 5, 2022
4712279
chain.from_iterable() now working with generator expessions
adamjanovsky Dec 5, 2022
f14dfe3
fix getitem on fips dataset
adamjanovsky Dec 5, 2022
a1ec986
test config global fixture
adamjanovsky Dec 6, 2022
577300e
add pyupgrade into linter pipeline
adamjanovsky Dec 6, 2022
ed8813e
reimplement dataset serialization constraints
adamjanovsky Dec 7, 2022
29dd48c
delete pp dataset json
adamjanovsky Dec 7, 2022
52bddce
update docs
adamjanovsky Dec 7, 2022
30ef160
attempt to fix pipelines
adamjanovsky Dec 8, 2022
0bcda6b
don't download spacy model test pipeline
adamjanovsky Dec 8, 2022
7710ef8
test pipeline ubuntu 20.04
adamjanovsky Dec 8, 2022
7af69ca
disable CPE from web test
adamjanovsky Dec 8, 2022
be7f6d7
try ubuntu 22.04 test runner
adamjanovsky Dec 8, 2022
7d59063
cli print -> click.echo()
adamjanovsky Dec 8, 2022
4574a3d
FIPSCertificate no longer hashable
adamjanovsky Dec 8, 2022
4 changes: 2 additions & 2 deletions cli.py
@@ -161,9 +161,9 @@ def main(
),
ProcessingStep(
"download",
"download_all_pdfs",
"download_artifacts",
precondition="meta_sources_parsed",
precondition_error_msg="Error: You want to download all pdfs, but the data from the cert. framework website was not parsed. You must use 'build' action first.",
precondition_error_msg="Error: You want to download all artifacts, but the data from the cert. framework website was not parsed. You must use 'build' action first.",
pre_callback_func=None,
),
ProcessingStep(
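The rename above changes a `ProcessingStep` whose `precondition` gates the "download" action on `meta_sources_parsed`. The sketch below illustrates how such a precondition-guarded CLI step can work; it is a minimal reconstruction with hypothetical names (`run`, `_State`), not the project's actual `ProcessingStep` implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProcessingStep:
    """One named CLI action, guarded by a flag on the dataset's state."""

    name: str
    method: str  # name of the dataset method to invoke
    precondition: Optional[str] = None  # state attribute that must be truthy
    precondition_error_msg: Optional[str] = None
    pre_callback_func: Optional[Callable] = None

    def run(self, dataset) -> None:
        # Refuse to run unless the required state flag is set
        if self.precondition and not getattr(dataset.state, self.precondition, False):
            raise RuntimeError(
                self.precondition_error_msg
                or f"Precondition '{self.precondition}' not met"
            )
        if self.pre_callback_func is not None:
            self.pre_callback_func(dataset)
        getattr(dataset, self.method)()
```

With this shape, renaming the underlying dataset method (here `download_all_pdfs` to `download_artifacts`) only requires updating the `method` string in the step definition.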
10 changes: 5 additions & 5 deletions sec_certs/dataset/common_criteria.py
@@ -278,7 +278,7 @@ def get_certs_from_web(
csv_certs = self._get_all_certs_from_csv(get_active, get_archived)
self._merge_certs(csv_certs, cert_source="csv")

# TODO: Someway along the way, 3 certificates get lost. Investigate and fix.
# Someway along the way, 3 certificates get lost.
logger.info("Adding HTML certificates to CommonCriteria dataset.")
html_certs = self._get_all_certs_from_html(get_active, get_archived)
self._merge_certs(html_certs, cert_source="html")
@@ -290,7 +290,6 @@ def get_certs_from_web(

self._set_local_paths()
self.state.meta_sources_parsed = True
self.process_protection_profiles()

def _get_all_certs_from_csv(self, get_active: bool, get_archived: bool) -> Dict[str, "CommonCriteriaCert"]:
"""
@@ -538,7 +537,7 @@ def _download_targets(self, fresh: bool = True) -> None:
)

@serialize
def download_all_pdfs(self, fresh: bool = True) -> None:
def download_all_artifacts(self, fresh: bool = True) -> None:
"""
Downloads all pdf files associated with certificates of the dataset.

@@ -804,14 +803,15 @@ def process_maintenance_updates(self) -> CCDatasetMaintenanceUpdates:
update_dset: CCDatasetMaintenanceUpdates = CCDatasetMaintenanceUpdates(
{x.dgst: x for x in updates}, root_dir=self.mu_dataset_path, name="Maintenance updates"
)
update_dset.download_all_pdfs()
update_dset.download_all_artifacts()
update_dset.convert_all_pdfs()
update_dset._extract_data()

return update_dset

def process_auxillary_datasets(self) -> None:
raise NotImplementedError
self.process_protection_profiles()
# TODO: Also process MUs


class CCDatasetMaintenanceUpdates(CCDataset, ComplexSerializableType):
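The comment retained in the hunk above notes that a few certificates get lost somewhere during the CSV/HTML merge. A hedged sketch of a digest-keyed merge that logs the key differences between the two sources, which makes that kind of silent loss visible; `merge_certs` is a hypothetical helper, not the project's actual `_merge_certs`.

```python
import logging

logger = logging.getLogger(__name__)


def merge_certs(existing: dict, incoming: dict, source: str) -> dict:
    """Merge `incoming` certs into `existing`, keyed by digest.

    Overlapping keys are overwritten by `incoming`; keys present in only
    one side are logged so that discrepancies between sources stand out.
    """
    only_existing = existing.keys() - incoming.keys()
    only_incoming = incoming.keys() - existing.keys()
    if only_existing:
        logger.info("%d certs absent from %s source: %s",
                    len(only_existing), source, sorted(only_existing))
    if only_incoming:
        logger.info("%d certs new in %s source", len(only_incoming), source)
    return {**existing, **incoming}
```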
6 changes: 3 additions & 3 deletions sec_certs/dataset/dataset.py
@@ -46,7 +46,7 @@ def __bool__(self):
def __init__(
self,
certs: Dict[str, CertSubType] = dict(),
root_dir: Optional[Path] = None,
root_dir: Optional[Union[str, Path]] = None,
name: Optional[str] = None,
description: str = None,
state: Optional[DatasetInternalState] = None,
@@ -58,7 +58,7 @@ def __init__(

if not root_dir:
root_dir = Path.cwd() / (type(self).__name__).lower()
self._root_dir = root_dir
self._root_dir = Path(root_dir)
self.timestamp = datetime.now()
self.sha256_digest = "not implemented"

@@ -203,7 +203,7 @@ def process_auxillary_datasets(self) -> None:
raise NotImplementedError("Not meant to be implemented by the base class.")

@abstractmethod
def download_all_pdfs(self, cert_ids: Optional[Set[str]] = None) -> None:
def download_all_artifacts(self, cert_ids: Optional[Set[str]] = None) -> None:
raise NotImplementedError("Not meant to be implemented by the base class.")

@abstractmethod
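The `dataset.py` change above widens `root_dir` to `Optional[Union[str, Path]]` and normalizes it with `Path(root_dir)`, so callers can pass either a string or a `Path`. A minimal sketch of that idiom (a simplified constructor, not the project's full `Dataset.__init__`):

```python
from pathlib import Path
from typing import Optional, Union


class Dataset:
    def __init__(self, root_dir: Optional[Union[str, Path]] = None) -> None:
        # Default to a per-class directory under the current working directory,
        # then normalize whatever we got (str or Path) into a Path.
        if not root_dir:
            root_dir = Path.cwd() / type(self).__name__.lower()
        self._root_dir = Path(root_dir)

    @property
    def root_dir(self) -> Path:
        return self._root_dir
```

Coercing once at the boundary means every method downstream can rely on `Path` semantics (`/` joins, `mkdir`, suffix handling) without re-checking the type.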
105 changes: 47 additions & 58 deletions sec_certs/dataset/fips.py
@@ -1,6 +1,8 @@
import itertools
import logging
import shutil
from pathlib import Path
from typing import List, Optional, Set
from typing import Dict, Final, List, Optional, Set

import numpy as np
import pandas as pd
@@ -14,7 +16,6 @@
from sec_certs.model.dependency_finder import DependencyFinder
from sec_certs.sample.fips import FIPSCertificate
from sec_certs.serialization.json import ComplexSerializableType, serialize
from sec_certs.utils import helpers as helpers
from sec_certs.utils import parallel_processing as cert_processing
from sec_certs.utils.helpers import fips_dgst

@@ -26,6 +27,12 @@ class FIPSDataset(Dataset[FIPSCertificate], ComplexSerializableType):
Class for processing of FIPSCertificate samples. Inherits from `ComplexSerializableType` and base abstract `Dataset` class.
"""

LIST_OF_CERTS_HTML: Final[Dict[str, str]] = {
"fips_modules_active.html": constants.FIPS_ACTIVE_MODULES_URL,
"fips_modules_historical.html": constants.FIPS_HISTORICAL_MODULES_URL,
"fips_modules_revoked.html": constants.FIPS_REVOKED_MODULES_URL,
}

@property
def policies_dir(self) -> Path:
return self.root_dir / "security_policies"
@@ -52,13 +59,21 @@ def _extract_data(self, redo: bool = False) -> None:
for keyword, cert in keywords:
self.certs[cert.dgst].pdf_data.keywords = keyword

def download_all_pdfs(self, cert_ids: Optional[Set[str]] = None) -> None:
def download_all_artifacts(self, cert_ids: Optional[Set[str]] = None) -> None:
"""
Downloads all pdf files related to the certificates specified with cert_ids.

:param Optional[Set[str]] cert_ids: cert_ids to download the pdfs for, defaults to None
:raises RuntimeError: If no cert_ids are specified, raises.
"""
# TODO: The code below was migrated here from get_certs_web()
# self.policies_dir.mkdir(exist_ok=True)
# self.algorithms_dir.mkdir(exist_ok=True)
# logger.info("Downloading certificate html and security policies")
# self._download_all_htmls(cert_ids)
# self.download_all_pdfs(cert_ids)
# self.web_scan(cert_ids, redo=redo_web_scan, update_json=False)

sp_paths, sp_urls = [], []
self.policies_dir.mkdir(exist_ok=True)
if cert_ids is None:
@@ -118,45 +133,36 @@ def convert_all_pdfs(self) -> None:
FIPSCertificate.convert_pdf_file, tuples, config.n_threads, progress_bar_desc="Converting to txt"
)

# TODO: this "test" parameter is nasty.
def _prepare_dataset(self, test: Optional[Path] = None, update: bool = False) -> Set[str]:
if test:
html_files = [test]
else:
html_files = [
Path("fips_modules_active.html"),
Path("fips_modules_historical.html"),
Path("fips_modules_revoked.html"),
]
helpers.download_file(constants.FIPS_ACTIVE_MODULES_URL, Path(self.web_dir / "fips_modules_active.html"))
helpers.download_file(
constants.FIPS_HISTORICAL_MODULES_URL, Path(self.web_dir / "fips_modules_historical.html")
)
helpers.download_file(constants.FIPS_REVOKED_MODULES_URL, Path(self.web_dir / "fips_modules_revoked.html"))
def _download_html_resources(self) -> None:
logger.info("Downloading HTML files that list FIPS certificates.")
html_urls = list(FIPSDataset.LIST_OF_CERTS_HTML.values())
html_paths = [self.web_dir / x for x in FIPSDataset.LIST_OF_CERTS_HTML.keys()]
self._download_parallel(html_urls, html_paths)

# Parse those files and get list of currently processable files (always)
cert_ids: Set[str] = set()
for f in html_files:
cert_ids |= self._get_certificates_from_html(self.web_dir / f, update)
def _get_all_certs_from_html_sources(self) -> Set[FIPSCertificate]:
return set(
itertools.chain.from_iterable(
[self._get_certificates_from_html(self.web_dir / x) for x in self.LIST_OF_CERTS_HTML.keys()]
)
)

return cert_ids
def _get_certificates_from_html(self, html_file: Path) -> Set[FIPSCertificate]:
logger.debug(f"Getting certificate ids from {html_file}")

def _get_certificates_from_html(self, html_file: Path, update: bool = False) -> Set[str]:
logger.info(f"Getting certificate ids from {html_file}")
with open(html_file, "r", encoding="utf-8") as handle:
html = BeautifulSoup(handle.read(), "html5lib")

table = [x for x in html.find(id="searchResultsTable").tbody.contents if x != "\n"]
entries: Set[str] = set()
cert_ids: Set[int] = set()

for entry in table:
if isinstance(entry, NavigableString):
continue
cert_id = entry.find("a").text
if cert_id not in entries:
entries.add(cert_id)
if cert_id not in cert_ids:
cert_ids.add(int(cert_id))

return entries
return {FIPSCertificate(cert_id) for cert_id in cert_ids}

@serialize
def web_scan(self, cert_ids: Set[str], redo: bool = False) -> None:
@@ -172,10 +178,8 @@ def web_scan(self, cert_ids: Set[str], redo: bool = False) -> None:
self.certs[dgst] = FIPSCertificate.from_html_file(
self.web_dir / f"{cert_id}.html",
FIPSCertificate.InternalState(
(self.policies_dir / str(cert_id)).with_suffix(".pdf"),
(self.web_dir / str(cert_id)).with_suffix(".html"),
False,
None,
False,
False,
),
self.certs.get(dgst),
Expand All @@ -195,36 +199,21 @@ def _set_local_paths(self) -> None:
cert.set_local_paths(self.policies_dir, self.web_dir)

@serialize
def get_certs_from_web(
self,
# TODO: REMOVE THIS TEST ARGUMENT, OMG!
test: Optional[Path] = None,
update: bool = False,
redo_web_scan=False,
) -> None:
"""Downloads HTML search pages, parses them, populates the dataset,
and performs `web-scan` - extracting information from CMVP pages for
each certificate.

Args:
test (Optional[Path], optional): Path to dataset used in testing. Defaults to None.
update (bool, optional): Whether to update dataset with new entries. Defaults to False.
redo_web_scan (bool, optional): Whether to redo the `web-scan` functionality. Defaults to False.
"""
logger.info("Downloading required html files")

def get_certs_from_web(self, to_download: bool = True, keep_metadata: bool = True) -> None:
self.web_dir.mkdir(parents=True, exist_ok=True)
self.policies_dir.mkdir(exist_ok=True)
self.algorithms_dir.mkdir(exist_ok=True)

# Download files containing all available module certs (always)
cert_ids = self._prepare_dataset(test, update)
if to_download:
self._download_html_resources()

logger.info("Adding empty FIPS certificates into FIPSDataset.")
self.certs = {x.dgst: x for x in self._get_all_certs_from_html_sources()}
logger.info(f"The dataset now contains {len(self)} certificates.")

logger.info("Downloading certificate html and security policies")
self._download_all_htmls(cert_ids)
self.download_all_pdfs(cert_ids)
if not keep_metadata:
shutil.rmtree(self.web_dir)

self.web_scan(cert_ids, redo=redo_web_scan, update_json=False)
self._set_local_paths()
self.state.meta_sources_parsed = True

@serialize
def process_auxillary_datasets(self) -> None:
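The refactored `_get_certificates_from_html` above pulls certificate IDs out of the rows of the `searchResultsTable` element, deduplicates them, and converts them to `int`. A self-contained sketch of the same extraction, using the stdlib `html.parser` instead of BeautifulSoup/html5lib so it runs without dependencies; the table `id` and row structure are assumptions read off the diff, not verified against the live CMVP pages.

```python
from html.parser import HTMLParser


class CertIdParser(HTMLParser):
    """Collect numeric <a> texts inside the table with id=searchResultsTable."""

    def __init__(self) -> None:
        super().__init__()
        self.in_table = False
        self.in_link = False
        self.cert_ids: set[int] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == "searchResultsTable":
            self.in_table = True
        elif tag == "a" and self.in_table:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # The set dedupes repeated IDs, mirroring the diff's behavior
        if self.in_link and data.strip().isdigit():
            self.cert_ids.add(int(data.strip()))


def parse_cert_ids(html: str) -> set[int]:
    parser = CertIdParser()
    parser.feed(html)
    return parser.cert_ids
```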
2 changes: 1 addition & 1 deletion sec_certs/dataset/fips_algorithm.py
@@ -126,7 +126,7 @@ def from_dict(cls, dct: Dict[str, Any]) -> "FIPSAlgorithmDataset":
def convert_all_pdfs(self):
raise NotImplementedError("Not meant to be implemented")

def download_all_pdfs(self, cert_ids: Optional[Set[str]] = None) -> None:
def download_all_artifacts(self, cert_ids: Optional[Set[str]] = None) -> None:
raise NotImplementedError("Not meant to be implemented")

def __getitem__(self, item: str) -> FIPSAlgorithm:
3 changes: 3 additions & 0 deletions sec_certs/sample/certificate.py
@@ -74,6 +74,9 @@ def __eq__(self, other: object) -> bool:
return False
return self.dgst == other.dgst

def __hash__(self) -> int:
return hash(self.dgst)

def to_dict(self) -> Dict[str, Any]:
return {
**{"dgst": self.dgst},
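Python implicitly sets `__hash__` to `None` on any class that defines `__eq__`, so the explicit `__hash__` added above is what keeps certificates usable as set members and dict keys, with hashing consistent with equality on `dgst`. A minimal sketch of the contract (a stripped-down stand-in for the project's `Certificate` base class):

```python
class Certificate:
    def __init__(self, dgst: str) -> None:
        self.dgst = dgst

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, Certificate):
            return False
        return self.dgst == other.dgst

    def __hash__(self) -> int:
        # Must agree with __eq__: objects that compare equal hash equally
        return hash(self.dgst)
```

Without the `__hash__`, expressions like `{x.dgst: x for x in certs}` still work, but `set(certs)` raises `TypeError: unhashable type`.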