Add rudimentary profiling to processing pipeline #288

Closed
J08nY opened this issue Dec 5, 2022 · 1 comment · Fixed by #353
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@J08nY
Member

J08nY commented Dec 5, 2022

From #275 (comment):

logger.info("Extracting report keywords")

Log entries like this could be replaced with a more elegant way of tracking how long the stages and steps of processing take, such as a context manager that:

  • Logs this message when entered and stores the time.
  • Logs some exit message when exited, together with the elapsed time.
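
A minimal sketch of such a context manager, using only the standard library. The name `log_stage`, its `log` parameter, and the exit message format are assumptions made for illustration; this is not the helper that was eventually merged into the project.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def log_stage(message, log=logger):
    """Log `message` on entry, then log the elapsed time on exit.

    Hypothetical helper: the name and message format are illustrative only.
    """
    log.info(message)
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        # Runs even if the wrapped stage raises, so the duration is always logged.
        elapsed = time.perf_counter() - start
        log.info("%s finished in %.2f s", message, elapsed)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    with log_stage("Extracting report keywords"):
        time.sleep(0.1)  # stand-in for the real processing step
```

Wrapping each stage this way keeps the existing log message while adding the duration, so the call sites change from a single `logger.info(...)` line to a `with` block around the stage.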
J08nY added the enhancement label Dec 5, 2022
J08nY added the good first issue label Dec 8, 2022
@J08nY
Member Author

J08nY commented Apr 27, 2023

Here is some manually extracted data from a full CC run on the server.
Commit: 6448911
Total duration: 9h:36m:15s

| What | When | Length |
| --- | --- | --- |
| Initial CSV/HTML download + process | 2023-04-26 13:43:03,860 | 0h0m |
| CPEDataset from JSON | 2023-04-26 13:43:43,700 | 0h1m |
| CVEDataset from JSON | 2023-04-26 13:44:01,253 | 0h0m |
| PPDataset | 2023-04-26 13:44:32,354 | 0h0m |
| MU dataset - download reports | 2023-04-26 13:44:32,637 | 0h3m |
| MU dataset - download targets | 2023-04-26 13:47:36,796 | 0h4m |
| MU dataset - convert reports | 2023-04-26 13:51:33,456 | 0h5m |
| MU dataset - convert targets | 2023-04-26 13:56:41,531 | 0h11m |
| MU dataset - extract report meta | 2023-04-26 14:07:21,226 | 0h0m |
| MU dataset - extract target meta | 2023-04-26 14:07:23,571 | 0h0m |
| MU dataset - extract report frontpage | 2023-04-26 14:07:54,051 | 0h0m |
| MU dataset - extract target frontpage | 2023-04-26 14:07:56,402 | 0h0m |
| MU dataset - extract report keywords | 2023-04-26 14:08:03,717 | 0h0m |
| MU dataset - extract target keywords | 2023-04-26 14:08:29,043 | 0h6m |
| CC scheme pages | 2023-04-26 14:14:40,720 | 0h15m |
| download reports | 2023-04-26 14:29:18,751 | 0h33m |
| download targets | 2023-04-26 15:02:33,414 | 0h38m |
| convert reports | 2023-04-26 15:40:24,238 | 2h27m |
| convert targets | 2023-04-26 18:07:29,028 | 3h9m |
| extract report meta | 2023-04-26 21:18:53,521 | 0h3m |
| extract target meta | 2023-04-26 21:21:46,177 | 0h7m |
| extract report frontpage | 2023-04-26 21:28:41,745 | 0h1m |
| extract target frontpage | 2023-04-26 21:29:45,351 | 0h2m |
| extract report keywords | 2023-04-26 21:31:30,754 | 0h16m |
| extract target keywords | 2023-04-26 21:47:04,355 | 1h1m |
| heuristics - cert_id | 2023-04-26 22:48:02,540 | 0h0m |
| heuristics - cpe match | 2023-04-26 22:48:02,729 | 0h6m |
| heuristics - cve | 2023-04-26 22:54:21,816 | 0h2m |
| heuristics - references | 2023-04-26 22:56:16,638 | 0h0m |
| heuristics - transitive vulns | 2023-04-26 22:56:18,026 | 0h0m |
| heuristics - cert labs | 2023-04-26 22:56:38,557 | 0h0m |
| heuristics - SARs | 2023-04-26 22:56:38,622 | 0h23m |
| End | 2023-04-26 23:19:19,853 | |
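
For anyone repeating this kind of manual extraction, the sketch below shows one way to turn log timestamps in the format above into durations. The function names are hypothetical. The gap between consecutive start timestamps is only an approximation of a stage's length and need not match the Length column, which was extracted by hand.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the "When" column (Python logging's default asctime).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp: str) -> datetime:
    """Parse a timestamp such as '2023-04-26 13:43:03,860'."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as e.g. '9h:36m:15s', truncating fractional seconds."""
    total_seconds = int(delta.total_seconds())
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h:{minutes:02d}m:{seconds:02d}s"


def gaps_between_starts(stages):
    """Given (name, timestamp) pairs in order, return (name, gap to next start)."""
    times = [(name, parse_log_time(stamp)) for name, stamp in stages]
    return [(name, nxt - start) for (name, start), (_, nxt) in zip(times, times[1:])]


# First and last rows of the table.
start = parse_log_time("2023-04-26 13:43:03,860")
end = parse_log_time("2023-04-26 23:19:19,853")
print(format_duration(end - start))  # 9h:36m:15s, matching the stated total
```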

Some numbers:

The resulting dataset has 5326 certificates.
In total, we identified 22546 vulnerabilities in 367 vulnerable certificates.
A total of 151 certificates were skipped as duplicates.

The biggest culprits in the runtime are the OCR in our PDF-to-text conversion and the downloads from the CC pages.

J08nY added a commit to J08nY/sec-certs that referenced this issue Jul 26, 2023
@J08nY J08nY mentioned this issue Jul 26, 2023
J08nY added a commit to J08nY/sec-certs that referenced this issue Aug 24, 2023