Skip to content

Commit

Permalink
update homepage
Browse files Browse the repository at this point in the history
  • Loading branch information
SinclairCoder committed Apr 30, 2024
1 parent 8163fb8 commit 46fb9b9
Show file tree
Hide file tree
Showing 3 changed files with 5 additions and 12 deletions.
Binary file removed imgs/mathpile-features.png
Binary file not shown.
Binary file removed imgs/mathpile-overview.png
Binary file not shown.
17 changes: 5 additions & 12 deletions index.html
Original file line number Diff line number Diff line change
Expand Up @@ -176,15 +176,15 @@ <h1 class="title is-1 publication-title">Benchmarking Benchmark Leakage in Large


<!-- Twitter Link. -->
<!-- <span class="link-block">
<a href="https://twitter.com/_akhaliq/status/1740571256234057798" target="_blank"
<span class="link-block">
<a href="https://twitter.com/SinclairWang1/status/1785298912942948633" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-twitter"></i>
</span>
<span>Twitter by AK</span>
<span>Twitter</span>
</a>
</span> -->
</span>
</div>

</div>
Expand Down Expand Up @@ -699,14 +699,7 @@ <h2 class="title is-3">📊 N-gram Accuracy Helps Instance-level Leakage Detecti
</figure>


<p>We can observe that many models can all ngrams of an example from benchmark training set even test set.
Surprisingly, Qwen-1.8B can accurately predict all 5-grams in 223 examples from the GSM8K training set and
67 from the MATH training set, with an additional 25 correct predictions even in the MATH test set. We
would like to emphasize that the n-gram accuracy metric can mitigate issues in our detection pipeline,
particularly when the training and test datasets are simultaneously leaked and remain undetected. However,
this also has its limitations; it can only detect examples that are integrated into the model training in
their original format and wording, unless we know the organizational format of the training data used by
the model in advance.
<p>We can observe that many models can pricisely predict all ngrams of an example from benchmark training set even test set. Surprisingly, Qwen-1.8B can accurately predict all 5-grams in 223 examples from the GSM8K training set and 67 from the MATH training set, with an additional 25 correct predictions even in the MATH test set. We would like to emphasize that the n-gram accuracy metric can mitigate issues in our detection pipeline, particularly when the training and test datasets are simultaneously leaked and remain undetected. However, this also has its limitations; it can only detect examples that are integrated into the model training in their original format and wording, unless we know the organizational format of the training data used by the model in advance.
</p>

</div>
Expand Down

0 comments on commit 46fb9b9

Please sign in to comment.