
Refactor FIPS & unify CLI actions #275

Merged: 83 commits into main, Dec 8, 2022

Conversation

@adamjanovsky (Collaborator) commented Oct 25, 2022

High-level plan

Tests

  • Add tests that check working pandas serialization
  • Allow for default paths of CPEDataset and CVEDataset
    • Or forbid it and demand explicit paths (that will not get serialized, omg). Either way, act consistently.
  • Get rid of class design in favour of fixtures
  • Refactor module design
  • Test maintenance updates
  • Select more appropriate names for the methods
  • Migrate to pytest
  • Divide the test logic into download / the rest.
  • Allow download tests to fail on unreliable resources; test download on resources that we trust (e.g. our server)
    • Forbid download in non-download tests via an auto-use fixture that raises an exception on requests.get / requests.post (see the fixture sketch after this list)
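
A minimal sketch of such an auto-use fixture, assuming a hypothetical `download` pytest marker for the tests that are allowed to touch the network:

```python
import pytest
import requests


@pytest.fixture(autouse=True)
def block_network(request, monkeypatch):
    # Tests marked @pytest.mark.download (hypothetical marker) may use the network.
    if "download" in request.keywords:
        return

    def _blocked(*args, **kwargs):
        raise RuntimeError("Network access is forbidden in non-download tests.")

    # Any call to requests.get / requests.post now fails loudly.
    monkeypatch.setattr(requests, "get", _blocked)
    monkeypatch.setattr(requests, "post", _blocked)
```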

FIPS

Processing steps progress (an invocation sketch follows the list):

  • get_certs_from_web()
  • process_auxillary_datasets()
  • download_all_artifacts()
  • convert_all_pdfs()
  • analyze_certificates()
    • _compute_references()
    • _compute_transitive_vulnerabilities()
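
For orientation, the steps above chain into one run roughly as follows (a sketch; the import path and constructor arguments are hypothetical):

```python
from sec_certs.dataset import FIPSDataset  # import path may differ

dset = FIPSDataset(root_dir="./fips_dataset")  # hypothetical construction
dset.get_certs_from_web()
dset.process_auxillary_datasets()
dset.download_all_artifacts()
dset.convert_all_pdfs()
dset.analyze_certificates()  # internally computes references and transitive vulnerabilities
```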

Random notes of things to do there:

  • Refactor FIPS to hold auxiliary datasets in a designated class
  • Implement the new folder structure
  • Rewrite redo -> fresh, implement re-attempting computations on failed samples
  • Introduce InternalState both to FIPSCertificate and to FIPSDataset

Done outside of the processing steps:

  • Delete plot graph methods
  • [ ] Add examples of plot graph usage into the FIPS reference notebook
  • Verify intermediate results on a small-scale dataset before proceeding further
  • Verify whether FIPS webpages load on scroll (historical modules). If yes, figure out how to force a full load with BeautifulSoup
    • This was not confirmed by manual analysis.
  • Implement copying of FIPSDataset contents
  • Unify CLIs, closes FIPS: Unify CLI actions with CC #212
  • FIPSAlgorithmDataset should reside in its own json in auxiliary datasets

> It is unclear what things like clean_cert_ids, or algorithms in the various certificate subobjects (PdfData and WebData) are. Are they processed or raw? What sort of cleaning was done, and what method needs to be called to do it? Specifically, this cleaning that is done on FIPS certs is split between the dataset class and the certificate class in a nasty way; this should be unified and moved, maybe even to a separate class, similar to how CC certificate ID canonicalization is separate.

  • Address the above

> Incomplete: See parse_html_main, the logic there just is not good.

  • Address the above

Misc

  • Refactor CC to hold auxiliary datasets in a designated class
  • Update docs
  • Update Docker image
  • Restrict usage of fresh in places where it does not make sense
  • Update project readme
  • Revisit all TODOs (internal check)
  • Unify logging messages.
  • Update code documentation
  • Add pyupgrade and a flake8 plugin to the linting pipeline

Sanity check

  • Perform full FIPS run
  • Perform full CC run
  • Run all notebooks

Still part of the refactoring, but to be addressed by separate PRs

@J08nY (Member) commented Oct 25, 2022

I applaud this effort! When I get more time hopefully I can help as well.

In the meantime, let me add a few pointers/issues I hit when working with the FIPS code:

  • The attributes of the certificate that are extracted and processed are unclear and incomplete.
    • It is unclear what things like clean_cert_ids, or algorithms in the various certificate subobjects (PdfData and WebData) are. Are they processed or raw? What sort of cleaning was done, and what method needs to be called to do it? Specifically, this cleaning that is done on FIPS certs is split between the dataset class and the certificate class in a nasty way; this should be unified and moved, maybe even to a separate class, similar to how CC certificate ID canonicalization is separate.
    • Incomplete: See parse_html_main, the logic there just is not good.
  • Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.

@adamjanovsky (Collaborator, Author):

> • Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.

Thanks for the input. I'll probably leave this for future work; this PR is going to be giant anyway. Could you please create a separate issue to address this?

@J08nY (Member) commented Oct 26, 2022

> > • Data from the FIPS algorithm dataset is not utilized and mined fully. We can follow the links to the algorithm page and get more data that will help us.
>
> Thanks for the input. I'll probably leave this for future work; this PR is going to be giant anyway. Could you please create a separate issue to address this?

Did this in #276.

(Resolved review thread on sec_certs/dataset/dataset.py)
@adamjanovsky (Collaborator, Author):

I'll re-implement the folder structure of FIPS Dataset. Outline:

dset
├── FIPS Dataset.json
├── auxillary_datasets
│   └── algorithms
│       └── fips_algorithms.json
├── certs
│   ├── modules
│   └── policies
│       ├── pdf
│       └── txt
└── web
    ├── fips_modules_active.html
    ├── fips_modules_historical.html
    └── fips_modules_revoked.html
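
The outline above could back the dataset's path handling roughly like this (a sketch; module_dir and policies_pdf_dir appear in snippets later in this thread, the other property names are illustrative):

```python
from pathlib import Path


class FIPSDatasetPaths:
    """Illustrative mapping of the folder outline onto path properties."""

    def __init__(self, root_dir: Path):
        self.root_dir = root_dir

    @property
    def web_dir(self) -> Path:
        return self.root_dir / "web"

    @property
    def algorithms_dir(self) -> Path:
        return self.root_dir / "auxillary_datasets" / "algorithms"

    @property
    def module_dir(self) -> Path:
        return self.root_dir / "certs" / "modules"

    @property
    def policies_pdf_dir(self) -> Path:
        return self.root_dir / "certs" / "policies" / "pdf"

    @property
    def policies_txt_dir(self) -> Path:
        return self.root_dir / "certs" / "policies" / "txt"
```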

(Resolved review threads on sec_certs/sample/fips.py, twice, and cli.py)
@adamjanovsky (Collaborator, Author) commented Oct 27, 2022

@J08nY looking at the code here and there, I planned the following:

  • I decomposed the whole processing pipeline into four steps: (i) parse metadata, (ii) download artifacts, (iii) process artifacts, (iv) analyze artifacts.
  • Each child class would hold a property that returns a list of callables for each step, e.g. artifact_download_methods:

    ```python
    def artifact_download_methods(self) -> List[Callable]:
        return [self._download_modules, self._download_policies]
    ```

  • The parent class would just orchestrate these steps, i.e. call all required methods (which must take no arguments; this so far holds):

    ```python
    def download_all_artifacts(self, fresh: bool = True) -> None:
        if self.state.meta_sources_parsed is False:
            logger.error("Attempting to download pdfs while not having csv/html meta-sources parsed. Returning.")
            return

        for method in self.artifact_download_methods:
            method(fresh)
        if fresh:
            for method in self.artifact_download_methods:
                method(False)

        self.state.artifacts_downloaded = True
    ```

  • I'll try to introduce the state variables and execute the methods on the individual samples in the same fashion as in the CC world.

This comes with some advantages:

  • Refactoring is very fast, because I just follow the design already implemented in CC
  • It should also produce fewer errors, for the same reason as above

Some disadvantages that cross my mind:

  • Whatever we're unsatisfied with in CC will probably also hit FIPS 😆
  • There's a lot of spaghetti code that we could possibly get rid of. E.g., compare _download_policies() and _download_modules():

    ```python
    def _download_policies(self, fresh: bool = True) -> None:
        self.policies_pdf_dir.mkdir(exist_ok=True)
        if fresh:
            logger.info("Downloading PDF security policies.")
        else:
            logger.info("Attempting re-download of failed PDF security policies.")
        certs_to_process = [x for x in self if x.state.policy_is_ok_to_download(fresh)]
        cert_processing.process_parallel(
            FIPSCertificate.download_policy,
            certs_to_process,
            config.n_threads,
            progress_bar_desc="Downloading PDF security policies",
        )

    def _download_modules(self, fresh: bool = True) -> None:
        self.module_dir.mkdir(exist_ok=True)
        if fresh:
            logger.info("Downloading HTML cryptographic modules.")
        else:
            logger.info("Attempting re-download of failed HTML cryptographic modules.")
        certs_to_process = [x for x in self if x.state.module_is_ok_to_download(fresh)]
        cert_processing.process_parallel(
            FIPSCertificate.download_module,
            certs_to_process,
            config.n_threads,
            progress_bar_desc="Downloading HTML modules",
        )
    ```

    They are very similar; just one variable changes in multiple places, plus some logging. The same problem applies to the methods executed on the individual samples.
  • I am not certain whether we want to attempt to solve the spaghetti code by adding an additional layer. First, I'm not sure how to do that without getattr("lame string of some variable"); a sketch of one possible shape follows this list. Second, if we break the pattern, we would need to introduce a new design...
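
For illustration only, a getattr-free shape of that extra layer could pass the varying pieces as plain values (names are hypothetical; logger, config and cert_processing as in the snippets above):

```python
from pathlib import Path
from typing import Callable, Iterable


# Sketched as a method body, shown standalone; not part of the actual codebase.
def _download_artifacts(
    self,
    target_dir: Path,
    artifact_name: str,
    download_method: Callable,
    certs_to_process: Iterable,
) -> None:
    # The caller supplies everything that differs between policies and modules.
    target_dir.mkdir(exist_ok=True)
    logger.info(f"Downloading {artifact_name}.")
    cert_processing.process_parallel(
        download_method,
        certs_to_process,
        config.n_threads,
        progress_bar_desc=f"Downloading {artifact_name}",
    )
```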

Now's the time to speak up if you have comments/tips/whatever :).

@J08nY (Member) commented Oct 27, 2022

Interesting approach, but I am worried about a few things:

  • Passing callables around smells. For example, "must take no arguments" as a limitation sounds pretty bad. Maybe the solution here is to not have the superclass orchestrate, but instead have it the other way around? I dunno, but passing callables seems to smell and locks us into a thing that is very constraining.
  • I would maybe have a quick look at design patterns to see if some fit? https://refactoring.guru/design-patterns/catalog

I guess I am really not sure what problem you are trying to solve by passing the callables. I will have to look at the code.

@J08nY (Member) commented Oct 28, 2022

Looking at the code a bit, I think now I would just match the FIPS Dataset API to the CC Dataset API manually, even at the cost of some code duplication. I don't feel like it is a good time to rework the handling like this during this FIPS unification.

@J08nY (Member) commented Oct 28, 2022

Ah yes, let's do this: https://refactoring.guru/design-patterns/template-method and not pass callables around. Just call the methods directly; even if there might be some duplication, I think it makes sense for future flexibility.
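
Roughly, the template method variant could look like this (a sketch with illustrative names; the state attribute and concrete download methods are assumed to exist as in the snippets above):

```python
from abc import ABC, abstractmethod
import logging

logger = logging.getLogger(__name__)


class Dataset(ABC):
    def download_all_artifacts(self, fresh: bool = True) -> None:
        # Invariant skeleton: state check, fresh run, one retry of failed samples.
        if not self.state.meta_sources_parsed:
            logger.error("Meta-sources not parsed, skipping artifact download.")
            return
        self._download_all_artifacts_body(fresh)
        if fresh:
            self._download_all_artifacts_body(False)
        self.state.artifacts_downloaded = True

    @abstractmethod
    def _download_all_artifacts_body(self, fresh: bool) -> None:
        """Variant part: each subclass calls its concrete download methods directly."""


class FIPSDataset(Dataset):
    def _download_all_artifacts_body(self, fresh: bool) -> None:
        # No callables passed around; some duplication across subclasses is accepted.
        self._download_modules(fresh)
        self._download_policies(fresh)
```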

@adamjanovsky (Collaborator, Author):

I was thinking about the logic of the download testing. At the moment, the situation is as follows:

  • Tests that don't test download functionality from unreliable servers (CC, FIPS) but use it (where this is difficult to get rid of) check whether the download succeeded and raise pytest.xfail if it did not (see the sketch after this list)
  • Tests that test download from unreliable servers were decorated with xfail. A problem with the tested functionality is discovered only on repeated failures.
    • We could possibly re-run these tests, but that would require pytest plugins?
  • Tests that download from reliable servers were left as they were.
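
The xfail handling from the first bullet, sketched (the URL is a placeholder and the assertions are elided):

```python
import pytest
import requests


def test_analysis_needing_remote_page(tmp_path):
    try:
        resp = requests.get("https://example.com/unreliable-resource", timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        # Unreliable upstream, not our bug: mark as expected failure at runtime.
        pytest.xfail("Download from unreliable server failed.")
    (tmp_path / "page.html").write_bytes(resp.content)
    # ... assertions on the processed page would follow
```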

@adamjanovsky (Collaborator, Author):

> It seems the whole PP dataset was included in the test data, was this intended? It shows up at almost 77k lines. I believe just downloading it from our site is a reasonable alternative. Although as it is already in git, removing it now is pointless (without a rebase and history editing, the repository will have it forever).

Oh, this hurts. It was added by accident. I'll probably rewrite the history...

@adamjanovsky (Collaborator, Author):

> Connected with the root_dir handling and serialization is the DUMMY_PATH issue. I see the current solution as an acceptable one, although I think there is room for improvement; we just need to identify the core issue in the API design and assumptions to see what needs to change. I think the current situation arises because our assumptions in different places in the code are not consistent or clear, specifically assumptions about whether a dataset object is backed by its JSON (directory structure) at different points in time as it is constructed from different sources.

You're perfectly right about the cause.

> One possible way to resolve this is to say that the dataset object is always backed by its directory structure/JSON and explicitly require the path in all constructors, even the web ones. The other possible way is the opposite: do not assume the dataset is always backed, and disable the serialization features unless a path is set.

I was actually opting for the second, which is kind of the current state. The serialization will fail unless a path is set. See:

```python
if self.json_path == constants.DUMMY_NONEXISTING_PATH:  # type: ignore
    raise ValueError(f"json_path variable for {type(self)} was not yet set.")
```

It's just implemented with that DUMMY_NONEXISTING_PATH variable...

@adamjanovsky (Collaborator, Author):

> I dislike that the root_dir property setter on dataset classes actually serializes the dataset. I think this is highly unintuitive and unexpected. I propose a more explicit solution in a comment in the review. The TL;DR is to create a different method for moving/copying the dataset and make it the only method to change the root_dir.

I agree this can be counterintuitive. I'll take a look at that, thanks for spotting this.
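
A minimal sketch of that proposal, with illustrative names (not the actual API): root_dir becomes read-only, and an explicit copy method is the only way to change it.

```python
import shutil
from pathlib import Path


class Dataset:
    def __init__(self, root_dir: Path):
        self._root_dir = root_dir

    @property
    def root_dir(self) -> Path:
        # Read-only: plain assignment can no longer silently serialize the dataset.
        return self._root_dir

    def copy_dataset(self, new_root_dir: Path) -> None:
        """Copy the backing directory structure, then point the dataset at it."""
        shutil.copytree(self._root_dir, new_root_dir, dirs_exist_ok=True)
        self._root_dir = new_root_dir
```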

@adamjanovsky requested a review from J08nY on December 7, 2022 at 13:40
@J08nY (Member) left a review:

lgtm except the hash impl.

(Resolved review threads on sec_certs/cli.py, sec_certs/constants.py, sec_certs/dataset/dataset.py, and sec_certs/sample/certificate.py)
@adamjanovsky merged commit 846144e into main on Dec 8, 2022 and deleted the issue/212-FIPS-Unify-CLI-actions-with-CC branch at 12:44.