update homepage

GAIR-NLP · Apr 30, 2024 · 46fb9b9 · 46fb9b9
1 parent 8163fb8
commit 46fb9b9
Show file tree

Hide file tree

Showing 3 changed files with 5 additions and 12 deletions.
diff --git a/imgs/mathpile-features.png b/imgs/mathpile-features.png
diff --git a/imgs/mathpile-overview.png b/imgs/mathpile-overview.png
diff --git a/index.html b/index.html
@@ -176,15 +176,15 @@ <h1 class="title is-1 publication-title">Benchmarking Benchmark Leakage in Large
 
 
                 <!-- Twitter Link. -->
-                <!-- <span class="link-block">
-                <a href="https://twitter.com/_akhaliq/status/1740571256234057798" target="_blank"
+                <span class="link-block">
+                <a href="https://twitter.com/SinclairWang1/status/1785298912942948633" target="_blank"
                    class="external-link button is-normal is-rounded is-dark">
                   <span class="icon">
                       <i class="fab fa-twitter"></i>
                   </span>
-                  <span>Twitter by AK</span>
+                  <span>Twitter</span>
                 </a>
-              </span> -->
+              </span>
               </div>
 
             </div>
@@ -699,14 +699,7 @@ <h2 class="title is-3">📊 N-gram Accuracy Helps Instance-level Leakage Detecti
             </figure>
 
 
-            <p>We can observe that many models can all ngrams of an example from benchmark training set even test set.
-              Surprisingly, Qwen-1.8B can accurately predict all 5-grams in 223 examples from the GSM8K training set and
-              67 from the MATH training set, with an additional 25 correct predictions even in the MATH test set. We
-              would like to emphasize that the n-gram accuracy metric can mitigate issues in our detection pipeline,
-              particularly when the training and test datasets are simultaneously leaked and remain undetected. However,
-              this also has its limitations; it can only detect examples that are integrated into the model training in
-              their original format and wording, unless we know the organizational format of the training data used by
-              the model in advance.
+            <p>We can observe that many models can pricisely predict all ngrams of an example from benchmark training set even test set.          Surprisingly, Qwen-1.8B can accurately predict all 5-grams in 223 examples from the GSM8K training set and 67 from the MATH training set, with an additional 25 correct predictions even in the MATH test set. We               would like to emphasize that the n-gram accuracy metric can mitigate issues in our detection pipeline,  particularly when the training and test datasets are simultaneously leaked and remain undetected. However,           this also has its limitations; it can only detect examples that are integrated into the model training in               their original format and wording, unless we know the organizational format of the training data used by               the model in advance.
             </p>
 
           </div>