From 1656945a7e270178d8a7cb7925fd3237cf6a5fca Mon Sep 17 00:00:00 2001
From: "github-actions[bot]"
 <41898282+github-actions[bot]@users.noreply.github.com>
Date: Mon, 4 Nov 2024 07:04:08 +0000
Subject: [PATCH] Deployed 6b3222f with MkDocs version: 1.6.1

---
 leaderboard/index.html   | 6 +++---
 search/search_index.json | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/leaderboard/index.html b/leaderboard/index.html
index 7f7c47e..138a521 100644
--- a/leaderboard/index.html
+++ b/leaderboard/index.html
@@ -556,15 +556,15 @@ <h1 id="scicode-leaderboard">SciCode Leaderboard</h1>
 </thead>
 <tbody>
 <tr>
-<td>OpenAI o1-preview</td>
+<td>🥇OpenAI o1-preview</td>
 <td>7.7%</td>
 </tr>
 <tr>
-<td>Claude3.5-Sonnet</td>
+<td>🥈Claude3.5-Sonnet</td>
 <td>4.6%</td>
 </tr>
 <tr>
-<td>Deepseek-Coder-v2</td>
+<td>🥉Deepseek-Coder-v2</td>
 <td>3.1%</td>
 </tr>
 <tr>
diff --git a/search/search_index.json b/search/search_index.json
index 922a3a6..771075b 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"SciCode: A Research Coding Benchmark Curated by Scientists","text":"<p> Minyang Tian<sup>1,2*\u2021</sup>, Luyu Gao<sup>3*</sup>, Shizhuo Dylan Zhang<sup>1</sup>, Xinan Chen<sup>1\u2020</sup>, Cunwei Fan<sup>1\u2020</sup>, Xuefei Guo<sup>1\u2020</sup>, Roland Haas<sup>1\u2020</sup>, Pan Ji<sup>4\u2020</sup>, Kittithat Krongchon<sup>1\u2020</sup>, Yao Li<sup>1\u2020</sup>, Shengyan Liu<sup>1\u2020</sup>, Di Luo<sup>5,6,11\u2020</sup>, Yutao Ma<sup>7\u2020</sup>, Hao Tong<sup>1\u2020</sup>, Kha Trinh<sup>7\u2020</sup>, Chenyu Tian<sup>8\u2020</sup>, Zihan Wang<sup>1\u2020</sup>, Bohao Wu<sup>1\u2020</sup>, Yanyu Xiong<sup>9\u2020</sup>, Shengzhu Yin<sup>1\u2020</sup>, Minhui Zhu<sup>1\u2020</sup>, Kilian Lieret<sup>10</sup>, Yanxin Lu<sup>1</sup>, Genglin Liu<sup>1</sup>, Yufeng Du<sup>1</sup>, Tianhua Tao<sup>1</sup>, Ofir Press<sup>10</sup>, Jamie Callan<sup>3</sup>, Eliu Huerta<sup>1,2,7\u2021</sup>, Hao Peng<sup>1\u2021</sup> </p> <p> <sup>1</sup>University of Illinois Urbana-Champaign   <sup>2</sup>Argonne National Laboratory   <sup>3</sup>Carnegie Mellon University   <sup>4</sup>University of North Carolina at Chapel Hill   <sup>5</sup>Massachusetts Institute of Technology   <sup>6</sup>Harvard University   <sup>7</sup>University of Chicago   <sup>8</sup>University of Texas at Austin   <sup>9</sup>Stanford University   <sup>10</sup>Princeton University   <sup>11</sup>The NSF AI Institute for Artificial Intelligence and Fundamental Interactions   </p> <p> * Equal contribution lead authors. \u2020 Data curation, alphabetical order. 
\u2021 Corresponding to: {mtian8, haopeng}@illinois.edu, elihu@{anl.gov, uchicago.edu} </p> <ul> <li> <p> Paper</p> <p>Learn all the details</p> <p> Read the paper</p> </li> <li> <p> Dataset</p> <p>Browse all the problems</p> <p> Download Dataset</p> </li> <li> <p> Github Repo</p> <p>Learn how to evaluate your model</p> <p> Installation &amp; usage</p> </li> <li> <p> FAQ</p> <p> Read the FAQ</p> </li> <li> <p> Leaderboard</p> <p>How good are LMs at science, really? (Coming soon...)</p> <p> Browse the results</p> </li> </ul>"},{"location":"#introduction","title":"Introduction","text":"<p> SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of 16 subdomains from 6 domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI. 
</p>"},{"location":"#overview","title":"Overview","text":"<p> SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. This diverse selection ensures a comprehensive representation of the natural sciences, where extensive code development is essential. SciCode is mainly drawn from the scripts that scientists use in their everyday workflow. Many of these have been used in one or more publications, demonstrating their robustness and correctness.  Among various coding necessities, Scicode mainly focuses on: 1. Numerical methods. 2. Simulation of systems. 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM\u2019s science capability. The below figure is an example of the combination of 1 and 3.  In designing test cases for evaluation, we incorporate domain-specific test cases in addition to numerical cases. These tests are extracted from real scientific workflows: scientists must design domain-specific test cases to verify code accuracy by reproducing results published in papers or matching analytical solutions derived from theoretical models. Each problem goes through 3 rounds of validation (i.e. by in-domain scientists, out-of-domain scientists, GPT4) for quality control. 
</p> <p></p>"},{"location":"#benchmark-statistics","title":"Benchmark Statistics","text":"Fields Subfields Mathematics Numerical Linear Algebra (8), Computational Mechanics (5), Computational Finance (1) Physics Condensed Matter Physics (13), Optics (10), Quantum Information/Computing (6), Computational Physics (5), Astrophysics (2), Particle Physics (1) Chemistry Quantum Chemistry (5), Computational Chemistry (3) Biology Ecology (6), Biochemistry (1), Genetics (1) Material Science Semiconductor Materials (7), Molecular Modeling (6) <p>Left: Distribution of Main Problems   Right: Distribution of Subproblems</p> <p> We include several research problems that are built upon or reproduce methods used in Nobel Prize-winning studies to highlight current trends in scientific research: the self-consistent field (SCF) method for density functional theory (DFT) calculations (The Nobel Prize in Chemistry 1998), the PMNS matrix for neutrino oscillation in matter (The Nobel Prize in Physics 2015), the Haldane model for the anomalous quantum Hall effect (The Nobel Prize in Physics 2016), optical tweezer simulations for microscopic thermodynamics (The Nobel Prize in Physics 2018), and the replica method for spin glasses (The Nobel Prize in Physics 2021). </p>"},{"location":"#experiment-results","title":"Experiment Results","text":"<p> We evaluate our model using zero-shot prompts. We keep the prompts general and design different ones for different evaluation setups only to inform the model about the tasks. We keep prompts the same across models and fields, and they contain the model\u2019s main and sub-problem instructions and code for previous subproblems. The standard setup means the model is tested without background knowledge and carrying over generated solutions to previous subproblems. 
The scientists' annotated background provides the necessary knowledge and reasoning steps to solve the problems, shifting the evaluation\u2019s focus more towards the models\u2019 coding and instruction-following capabilities. </p> <p> </p>"},{"location":"#citation","title":"Citation","text":"<pre><code>@misc{tian2024scicode,\n    title={SciCode: A Research Coding Benchmark Curated by Scientists},\n    author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},\n    year={2024},\n    eprint={2407.13168},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI}\n}\n</code></pre>"},{"location":"_footer/","title":"footer","text":"<ul> <li> <p> Something broken?  Report bug</p> </li> <li> <p> Something unclear?  Ask question</p> </li> </ul>"},{"location":"example_problem/","title":"Example: Calculate Chern numbers for the Haldane Model","text":""},{"location":"example_problem/#main-problem-and-dependencies","title":"Main Problem and Dependencies","text":"<p>1. Generate an array of Chern numbers for the Haldane model on a hexagonal lattice by sweeping the following parameters: the on-site energy to next-nearest-neighbor coupling constant ratio (\\(m/t_2\\) from -6 to 6 with \\(N\\) samples) and the phase (\\(\\phi\\) from -\\(\\pi\\) to \\(\\pi\\) with \\(N\\) samples) values. 
Given the lattice spacing \\(a\\), the nearest-neighbor coupling constant \\(t_1\\), the next-nearest-neighbor coupling constant \\(t_2\\), the grid size \\(\\delta\\) for discretizing the Brillouin zone in the \\(k_x\\) and \\(k_y\\) directions (assuming the grid sizes are the same in both directions), and the number of sweeping grid points \\(N\\) for \\(m/t_2\\) and \\(\\phi\\).</p> <p><pre><code>'''\nInputs:\ndelta : float\n    The grid size in kx and ky axis for discretizing the Brillouin zone.\na : float\n    The lattice spacing, i.e., the length of one side of the hexagon.\nt1 : float\n    The nearest-neighbor coupling constant.\nt2 : float\n    The next-nearest-neighbor coupling constant.\nN : int\n    The number of sweeping grid points for both the on-site energy to next-nearest-neighbor coupling constant ratio and phase.\n\nOutputs:\nresults: matrix of shape(N, N)\n    The Chern numbers by sweeping the on-site energy to next-nearest-neighbor coupling constant ratio (m/t2) and phase (phi).\nm_values: array of length N\n    The swept on-site energy to next-nearest-neighbor coupling constant ratios.\nphi_values: array of length N\n    The swept phase values.\n'''\n</code></pre> <pre><code># Package Dependencies\nimport numpy as np\nimport cmath\nfrom math import pi, sin, cos, sqrt\n</code></pre></p>"},{"location":"example_problem/#subproblems","title":"Subproblems","text":"<p>1.1 Write a Haldane model Hamiltonian on a hexagonal lattice, given the following parameters: wavevector components \\(k_x\\) and \\(k_y\\) (momentum) in the x and y directions, lattice spacing \\(a\\), nearest-neighbor coupling constant \\(t_1\\), next-nearest-neighbor coupling constant \\(t_2\\), phase \\(\\phi\\) for the next-nearest-neighbor hopping, and the on-site energy \\(m\\).</p> <p>Scientists Annotated Background:</p> <p>Source: Haldane, F. D. M. (1988). Model for a quantum Hall effect without Landau levels: Condensed-matter realization of the\" parity anomaly\". 
Physical review letters, 61(18).</p> <p>We denote \\(\\{\\mathbf{a}_i\\}\\) are the vectors from a B site to its three nearest-neighbor A sites, and \\(\\{\\mathbf{b}_i\\}\\) are next-nearest-neighbor distance vectors, then we have</p> \\[ {\\mathbf{a}_1} = (0,a), \\] \\[ {\\mathbf{a}_2} = (\\sqrt 3 a/2, - a/2), \\] \\[ {\\mathbf{a}_3} = ( - \\sqrt 3 a/2, - a/2) \\] \\[ {\\mathbf{b}_1} = {\\mathbf{a}_2} - {\\mathbf{a}_3} = (\\sqrt 3 a,0), \\] \\[ {\\mathbf{b}_2} = {\\mathbf{a}_3} - {\\mathbf{a}_1} = ( - \\sqrt 3 a/2, - 3a/2), \\] \\[ {\\mathbf{b}_3} = {\\mathbf{a}_1} - {\\mathbf{a}_2} = ( - \\sqrt 3 a/2,3a/2) \\] <p>Then the Haldane model on a hexagonal lattice can be written as</p> \\[ H(k) = {d_0}I + {d_1}{\\sigma _1} + {d_2}{\\sigma _2} + {d_3}{\\sigma _3} \\] \\[{d_0} = 2{t_2}\\cos \\phi \\sum\\nolimits_i {\\cos (\\mathbf{k} \\cdot {\\mathbf{b}_i})} = 2{t_2}\\cos \\phi \\left[ {\\cos \\left( {\\sqrt 3 {k_x}a} \\right) + \\cos \\left( { - \\sqrt 3 {k_x}a/2 + 3{k_y}a/2} \\right) + \\cos \\left( { - \\sqrt 3 {k_x}a/2 - 3{k_y}a/2} \\right)} \\right] \\] \\[ {d_1} = {t_1}\\sum\\nolimits_i {\\cos (\\mathbf{k} \\cdot {\\mathbf{a}_i})}  = {t_1}\\left[ {\\cos \\left( {{k_y}a} \\right) + \\cos \\left( {\\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right) + \\cos \\left( { - \\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right)} \\right]\\\\ \\] \\[ {d_2} = {t_1}\\sum\\nolimits_i {\\sin (\\mathbf{k} \\cdot {\\mathbf{a}_i})}  = {t_1}\\left[ {\\sin \\left( {{k_y}a} \\right) + \\sin \\left( {\\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right) + \\sin \\left( { - \\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right)} \\right] \\\\ \\] \\[ {d_3} = m - 2{t_2}\\sin \\phi \\sum\\nolimits_i {\\sin (\\mathbf{k} \\cdot {\\mathbf{b}_i})}  = m - 2{t_2}\\sin \\phi \\left[ {\\sin \\left( {\\sqrt 3 {k_x}a} \\right) + \\sin \\left( { - \\sqrt 3 {k_x}a/2 + 3{k_y}a/2} \\right) + \\sin \\left( { - \\sqrt 3 {k_x}a/2 - 3{k_y}a/2} \\right)} \\right] \\\\ \\] <p>where \\(\\sigma_i\\) are the Pauli matrices and \\(I\\) is the identity matrix. 
<pre><code>def calc_hamiltonian(kx, ky, a, t1, t2, phi, m):\n    \"\"\"\n    Function to generate the Haldane Hamiltonian with a given set of parameters.\n\n    Inputs:\n    kx : float\n        The x component of the wavevector.\n    ky : float\n        The y component of the wavevector.\n    a : float\n        The lattice spacing, i.e., the length of one side of the hexagon.\n    t1 : float\n        The nearest-neighbor coupling constant.\n    t2 : float\n        The next-nearest-neighbor coupling constant.\n    phi : float\n        The phase ranging from -\u03c0 to \u03c0.\n    m : float\n        The on-site energy.\n\n    Output:\n    hamiltonian : matrix of shape(2, 2)\n        The Haldane Hamiltonian on a hexagonal lattice.\n    \"\"\"\n</code></pre> <pre><code># test case 1\nkx = 1\nky = 1\na = 1\nt1 = 1\nt2 = 0.3\nphi = 1\nm = 1\nassert np.allclose(calc_hamiltonian(kx, ky, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># Test Case 2\nkx = 0\nky = 1\na = 0.5\nt1 = 1\nt2 = 0.2\nphi = 1\nm = 1\nassert np.allclose(calc_hamiltonian(kx, ky, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># Test Case 3\nkx = 1\nky = 0\na = 0.5\nt1 = 1\nt2 = 0.2\nphi = 1\nm = 1\nassert np.allclose(calc_hamiltonian(kx, ky, a, t1, t2, phi, m), target)\n</code></pre> 1.2 Calculate the Chern number using the Haldane Hamiltonian, given the grid size \\(\\delta\\) for discretizing the Brillouin zone in the \\(k_x\\) and \\(k_y\\) directions (assuming the grid sizes are the same in both directions), the lattice spacing \\(a\\), the nearest-neighbor coupling constant \\(t_1\\), the next-nearest-neighbor coupling constant \\(t_2\\), the phase \\(\\phi\\) for the next-nearest-neighbor hopping, and the on-site energy \\(m\\).</p> <p>Scientists Annotated Background:</p> <p>Source: Fukui, Takahiro, Yasuhiro Hatsugai, and Hiroshi Suzuki. 
\"Chern numbers in discretized Brillouin zone: efficient method of computing (spin) Hall conductances.\" Journal of the Physical Society of Japan 74.6 (2005): 1674-1677.</p> <p>Here we can discretize the two-dimensional Brillouin zone into grids with step \\(\\delta {k_x} = \\delta {k_y} = \\delta\\). If we define the U(1) gauge field on the links of the lattice as \\(U_\\mu (\\mathbf{k}_l) := \\frac{\\left\\langle n(\\mathbf{k}_l)\\middle|n(\\mathbf{k}_l + \\hat{\\mu})\\right\\rangle}{\\left|\\left\\langle n(\\mathbf{k}_l)\\middle|n(\\mathbf{k}_l + \\hat{\\mu})\\right\\rangle\\right|}\\), where \\(\\left|n(\\mathbf{k}_l)\\right\\rangle\\) is the eigenvector of Hamiltonian at \\(\\mathbf{k}_l\\), \\(\\hat{\\mu}\\) is a small displacement vector in the direction \\(\\mu\\) with magnitude \\(\\delta\\), and \\(\\mathbf{k}_l\\) is one of the momentum space lattice points \\(l\\). The corresponding curvature (flux) becomes</p> \\[ F_{xy}(\\mathbf{k}_l) := \\ln \\left[U_x(\\mathbf{k}_l)U_y(\\mathbf{k}_l+\\hat{x})U_x^{-1}(\\mathbf{k}_l+\\hat{y})U_y^{-1}(\\mathbf{k}_l)\\right] \\] <p>and the Chern number of a band can be calculated as</p> <p>$$ c = \\frac{1}{2\\pi i} \\Sigma_l F_{xy}(\\mathbf{k}_l), $$ where the summation is over all the lattice points \\(l\\). Note that the Brillouin zone of a hexagonal lattice with spacing \\(a\\) can be chosen as a rectangle with \\(0 \\le {k_x} \\le k_{x0} = 2\\sqrt 3 \\pi /(3a),0 \\le {k_y} \\le k_{y0} = 4\\pi /(3a)\\). 
<pre><code>def compute_chern_number(delta, a, t1, t2, phi, m):\n    \"\"\"\n    Function to compute the Chern number with a given set of parameters.\n\n    Inputs:\n    delta : float\n        The grid size in kx and ky axis for discretizing the Brillouin zone.\n    a : float\n        The lattice spacing, i.e., the length of one side of the hexagon.\n    t1 : float\n        The nearest-neighbor coupling constant.\n    t2 : float\n        The next-nearest-neighbor coupling constant.\n    phi : float\n        The phase ranging from -\u03c0 to \u03c0.\n    m : float\n        The on-site energy.\n\n    Output:\n    chern_number : float\n        The Chern number, a real number that should be close to an integer. The imaginary part is cropped out due to the negligible magnitude.\n    \"\"\"\n</code></pre></p> <pre><code># test case 1\ndelta = 2 * np.pi / 200\na = 1\nt1 = 4\nt2 = 1\nphi = 1\nm = 1\nassert np.allclose(compute_chern_number(delta, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># test case 2\ndelta = 2 * np.pi / 100\na = 1\nt1 = 1\nt2 = 0.3\nphi = -1\nm = 1\nassert np.allclose(compute_chern_number(delta, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># test case 3\ndelta = 2 * np.pi / 100\na = 1\nt1 = 1\nt2 = 0.2\nphi = 1\nm = 1\nassert np.allclose(compute_chern_number(delta, a, t1, t2, phi, m), target)\n</code></pre> <p>1.3 Make a 2D array of Chern numbers by sweeping the parameters: the on-site energy to next-nearest-neighbor coupling ratio (\\(m/t_2\\) from -6 to 6 with \\(N\\) samples) and phase (\\(\\phi\\) from -\\(\\pi\\) to \\(\\pi\\) with \\(N\\) samples) values. Given the grid size \\(\\delta\\) for discretizing the Brillouin zone in the \\(k_x\\) and \\(k_y\\) directions (assuming the grid sizes are the same in both directions), the lattice spacing \\(a\\), the nearest-neighbor coupling constant \\(t_1\\), and the next-nearest-neighbor coupling constant \\(t_2\\). 
<pre><code>def compute_chern_number_grid(delta, a, t1, t2, N):\n    \"\"\"\n    Function to calculate the Chern numbers by sweeping the given set of parameters and returns the results along with the corresponding swept next-nearest-neighbor coupling constant and phase.\n\n    Inputs:\n    delta : float\n        The grid size in kx and ky axis for discretizing the Brillouin zone.\n    a : float\n        The lattice spacing, i.e., the length of one side of the hexagon.\n    t1 : float\n        The nearest-neighbor coupling constant.\n    t2 : float\n        The next-nearest-neighbor coupling constant.\n    N : int\n        The number of sweeping grid points for both the on-site energy to next-nearest-neighbor coupling constant ratio and phase.\n\n    Outputs:\n    results: matrix of shape(N, N)\n        The Chern numbers by sweeping the on-site energy to next-nearest-neighbor coupling constant ratio (m/t2) and phase (phi).\n    m_values: array of length N\n        The swept on-site energy to next-nearest-neighbor coupling constant ratios.\n    phi_values: array of length N\n        The swept phase values.\n    \"\"\"\n</code></pre></p>"},{"location":"example_problem/#domain-specific-test-cases","title":"Domain Specific Test Cases","text":"<p>Both the \\(k\\)-space and sweeping grid sizes are set to very rough values to make the computation faster, feel free to increase them for higher accuracy.</p> <p>At zero on-site energy, the Chern number is 1 for \\(\\phi &gt; 0\\), and the Chern number is -1 for \\(\\phi &lt; 0\\).</p> <p>For complementary plots, we can see that these phase diagrams are similar to the one in the original paper: Fig.2 in Haldane, F. D. M. (1988). To achieve a better match, decrease all grid sizes.</p> <p>Compare the following three test cases. 
We can find that the phase diagram is independent of the value of \\(t_1\\), and the ratio of \\(t_2/t_1\\), which is consistent with our expectations.</p> <p><pre><code># Test Case 1\ndelta = 2 * np.pi / 30\na = 1.0\nt1 = 4.0\nt2 = 1.0\nN = 40\n</code></pre> </p> <p><pre><code># Test Case 2\ndelta = 2 * np.pi / 30\na = 1.0\nt1 = 5.0\nt2 = 1.0\nN = 40\n</code></pre> </p> <p><pre><code># Test Case 3\ndelta = 2 * np.pi / 30\na = 1.0\nt1 = 1.0\nt2 = 0.2\nN = 40\n</code></pre> </p>"},{"location":"faq/","title":"FAQ","text":"<ul> <li> <p>How do you know that the subproblems are non-overlapping and complete? I assume this is up to the judgment of the question-writers and the in-domain validators, but it'd be nice if there was some way of more formally confirming this (not sure that's possible though).   In practice, we differentiate subproblems based on their context and their role within the broader problem-solving framework (i.e. given previous subproblem-code pairs). For example, the same function for calculating a derivative could be used in different contexts: computing the force from a potential at step 3 in one main problem, and computing a velocity at step 5 in another main problem. These subproblems, although based on the same mathematical operation, are not considered overlapping because the physics and context are totally different.</p> </li> <li> <p>SciCode doesn\u2019t actually have other scientists try and complete the problems, so your notion of \u201chuman validation\u201d is somewhat weak (I think the validators just look at the problems and say \u201cyep this looks good\u201d or \u201cI think this is bad\u201d with feedback). Ideally, we would have both in-domain and out-of-domain scientists fully engage with the problem by writing and executing the code themselves. This would allow us to systematically record their mistakes, feedback, and discussions, resulting in an ideal version of both the question design and code solution. 
However, given the limited capacity of our scientists pool, we need to maximize the efficiency of contributions while still achieving a high-quality validation process. Here\u2019s what we believe is the most efficient approach:</p> <ul> <li>In-Domain Scientists: In-domain scientists focus on ensuring the scientific accuracy of the problem. For each major problem, they verify that the design is sound, check the subproblem methodology, and confirm that the solution correctly reproduces published results or conforms to established theoretical models. Their expertise ensures that the scientific underpinnings of the problem are accurate.</li> <li>Out-of-Domain Scientists: Out-of-domain scientists help by reviewing the question design and background information. The background is designed to augment graduate-level people who do not have specific domain expertise to successfully solve the problem. Therefore, while they may not be experts in the specific field, they can still ensure that the information provided is clear and sufficient to solve the problem.</li> <li>Leveraging GPT-4 for Reproduction: Instead of requiring human scientists to reproduce the problem, we delegate this task to GPT-4. This is both cost-effective and efficient. The scientists\u2019 primary responsibility then becomes performing error analysis on the outputs, identifying areas where the question design, solution code, or test cases may need refinement.</li> </ul> <p>This streamlined process enables us to make the most of our scientific expertise while still ensuring robust validation of the problems. By focusing human effort on critical evaluation points and using AI for reproduction tasks, we can iterate and validate problems with greater speed and precision.</p> </li> <li> <p>I don\u2019t think that SciCode validates/confirms that solutions to these problems don\u2019t exist online. 
For each main problem, our approach ensures that it reproduces results from a published paper or established theoretical model, using these references as a benchmark. This is a standard practice in scientists\u2019 research workflow, and these benchmark results are often accessible online. </p> <p>The challenge lies in designing specific question-code pairs that faithfully reproduce those results. These question-code pairs are either derived from our own research workflows or created from scratch for topics we consider fundamental or important to the domain.</p> <p>Take, for example, the 2D Ising model, where the transition temperature is kT/J = 2.269. While this result is easily found in this paper, reproducing it with the simulation method requires significant effort. </p> <p>By doing so, we ensure that the problems, while based on known results, still demand an additional level of effort and insight, making them meaningful and non-trivial challenges.</p> </li> <li> <p>What's the difference between the numerical tests (inputs/outputs), and \u201cdomain-specific tests\u201d? The difference between numerical tests (inputs/outputs) and domain-specific tests lies in how scientists assess the correctness of a problem.</p> <p>Numerical tests focus on data points, ensuring that the inputs and outputs match expected numerical values. For example, in a molecular dynamics problem, numerical tests would check the final positions and velocities of atoms under a specific potential in a closed box after some time t.</p> <p>Domain-specific tests, on the other hand, involve checking whether the solution follows the relevant physical laws or constraints. Scientists don\u2019t just stop at numerical data; We also verify whether any physical laws are violated. In the same molecular dynamics example, apart from verifying positions and velocities, we would also check the conservation of energy and momentum. 
In a closed box, total energy, angular momentum, and linear momentum should remain the same before and after time t.</p> <p>The domain-specific test is a critical part of validating correctness not only for SciCode but also for real research workflows in natural science domains.</p> </li> <li> <p>What were the kinds of updates made during in-domain validation?</p> <p>During in-domain validation, all aspects of the problem can be updated to ensure accuracy and alignment with domain-specific expectations:</p> <ul> <li>Question Design: Refining or rewriting the problem to better align with the desired scientific solution or to ensure clarity in how the problem is framed.</li> <li>Formulation of Background: Updating the derivation or contextual background to provide a more accurate or comprehensive explanation of the problem and its relevance to the domain.</li> <li>Methodology: Revising the methods used to solve the problem, which involves adjusting algorithms, simulation techniques, or approaches to ensure the solution is robust and scientifically valid.</li> <li>Domain-specific Tests: Updating the domain-specific checks that validate the fundamental physical laws or constraints.</li> </ul> <p>At this stage, we have even totally rewritten a couple of problems to ensure that the question design can be properly paired with the code, making the problem-solving process more effective and scientifically rigorous.</p> </li> <li> <p>How can I trust the out-of-domain validation?     The role of out-of-domain validation is not to ensure scientific correctness but rather to verify that the problem is presented in a way that is clear, complete, and accessible to someone outside the specific field. 
The primary focus of out-of-domain validators is to ensure that the combination of the question and its background information provides enough context for someone to solve the problem, even if they lack domain-specific expertise.</p> </li> <li> <p>It seems that in the prompt template definition, the prompts with and without backgrounds are assigned the other way around:   DEFAULT_PROMPT_TEMPLATE = Path(\"eval\", \"data\", \"background_comment_template.txt\").read_text()   BACKGOUND_PROMPT_TEMPLATE = Path(\"eval\", \"data\", \"multistep_template.txt\").read_text()   Are the numbers reported in the paper run with these prompts?</p> <p>Yes, DEFAULT_PROMPT_TEMPLATE is our standard setup where we ask the model to generate the related background itself. BACKGOUND_PROMPT_TEMPLATE is the template where we will put in the scientist-annotated background.</p> </li> <li> <p>For subproblems 13.6, 62.1, 76.2, it seems like the model-generated outputs are ignored and replaced with the files in the eval folder - is this how the evaluations were run in the paper? And why are these problems evaluated this way?     These three problem-code pairs are provided as given context in order to control uncertainty and reduce the degrees of freedom in the evaluation process. By doing so, we limit the model\u2019s randomness in problem-solving. Without this context, the evaluation would allow for too many possible solutions, leading to inconsistent results.</p> </li> <li> <p>In line 66, if self.previous_llm_code[prev_step] is None the previous steps are populated with saved model outputs after running them through extract_function_name and get_function_from_code; otherwise the previous steps are populated with extract_python_script. It doesn't seem like the first case is invoked except for the subproblems 13.6, 62.1, 76.2 - can you confirm that extract_python_script was used for the numbers in the paper?     
This setup is designed specifically to handle cases where the model is interrupted mid-step and needs to resume from that point.</p> </li> </ul>"},{"location":"leaderboard/","title":"Leaderboard","text":"<p>"},{"location":"leaderboard/#scicode-leaderboard","title":"SciCode Leaderboard","text":"Model Main Problem Resolve Rate OpenAI o1-preview 7.7% Claude3.5-Sonnet 4.6% Deepseek-Coder-v2 3.1% GPT-4o 1.5% GPT-4-Turbo 1.5% OpenAI o1-mini 1.5% Gemini 1.5 Pro 1.5% Claude3-Opus 1.5% Claude3-Sonnet 1.5% Qwen2-72B-Instruct 1.5% Llama-3.1-405B-Instruct 0% Llama-3.1-70B-Instruct 0% Mixtral-8x22B-Instruct 0% Llama-3-70B-Chat 0% <p>How to submit</p> <p>Want to submit your own model? Head over to the documentation.</p>"},{"location":"leaderboard_table/","title":"Leaderboard table","text":"date author model score 240712 scicode gpt4o 0.8 240712 scicode gpt4 0.8"},{"location":"problems/","title":"Problem List","text":""},{"location":"problems/#numerical-linear-algebra","title":"Numerical Linear Algebra","text":"<p>1_Conjugate_Gradient</p> <p>3_Gauss_Seidel</p> <p>4_Incomplete_Cholesky</p> <p>5_Lanczos</p> <p>9_Weighted_Jacobi</p> <p>29_Gram_Schmidt_orthogonalization</p> <p>31_independent_component_analysis</p> <p>74_Householder_QR</p>"},{"location":"problems/#computational-mechanics","title":"Computational Mechanics","text":"<p>18_NURBS</p> <p>24_Burgers_equation</p> <p>40_Spliting_Operator</p> <p>54_SUPG</p> <p>78_Chaotic_Dynamics_Pendulum</p>"},{"location":"problems/#computational-finance","title":"Computational Finance","text":"<p>63_Estimating_Stock_Option_Price</p>"},{"location":"problems/#condensed-matter-physics","title":"Condensed Matter Physics","text":"<p>17_linear_tetrahedron_method</p> <p>20_phonon_angular_momentum</p> <p><sup>*</sup>33_phase_diagram_chern_haldane_model</p> <p>38_Reciprocal_lattice_vector</p> <p>48_MEELS_conversion</p> <p><sup>*</sup>50_Replica_symmetry_breaking</p> <p>62_dmrg</p> <p>67_LEG_Dyson_equation_bulk</p> 
<p>69_LEG_Dyson_equation_semi_infinite</p> <p>72_ising_model</p> <p>73_Xray_conversion_II</p> <p>75_graphene_tight_binding</p>"},{"location":"problems/#optics","title":"Optics","text":"<p>2_Gaussian_Beam_Focus</p> <p>6_Spatial_filters_I</p> <p>7_Spatial_filters_II</p> <p>8_Spatial_filters_III</p> <p><sup>*</sup>14_Brownian_motion_in_the_optical_tweezer</p> <p>22_Beam_translation_reexpansion</p> <p>28_Gaussian_Beam_Intensity</p> <p><sup>*</sup>32_Multiparticle_dynamics_in_the_optical_tweezer_array</p> <p>37_ray_optics_spherical_aberration</p> <p>43_two_end_fiber_laser_generator</p>"},{"location":"problems/#quantum-informationcomputing","title":"Quantum Information/Computing","text":"<p>11_GADC_entanglement</p> <p>19_n_tangle</p> <p>23_Blahut_Arimoto</p> <p>59_VQE</p> <p>65_GHZ_protocol_fidelity</p> <p>71_GADC_rev_coherent_info</p>"},{"location":"problems/#computational-physics","title":"Computational Physics","text":"<p>13_Maxwell_Equation_Solver</p> <p>15_Crank_Nicolson_for_time_dependent_Schrodinger</p> <p>45_finite_difference_heat_equation</p> <p>52_Shooting_algo_H_atom</p> <p>57_1D_harmonic_oscillator_numerov_shooting</p>"},{"location":"problems/#astrophysics","title":"Astrophysics","text":"<p>49_nbody</p> <p>58_Tolman_Oppenheimer_Volkoff_star</p>"},{"location":"problems/#particle-physics","title":"Particle Physics","text":"<p><sup>*</sup>70_neutrino_oscillation</p>"},{"location":"problems/#quantum-chemistry","title":"Quantum Chemistry","text":"<p><sup>*</sup>12_Schrodinger_DFT_with_SCF</p> <p>30_helium_slater_jastrow_wavefunction</p> <p>46_helium_atom_vmc</p> <p>66_kolmogorov_crespi_potential</p> <p>68_helium_atom_dmc</p>"},{"location":"problems/#computational-chemistry","title":"Computational Chemistry","text":"<p>10_ewald_summation</p> <p>16_Davidson_method</p> <p>60_Widom_particle_insertion</p>"},{"location":"problems/#ecology","title":"Ecology","text":"<p>25_CRM_in_chemostat</p> <p>26_CRM_in_serial_dilution</p> 
<p>41_Structural_stability_in_serial_dilution</p> <p>53_Stochastic_Lotka_Volterra</p> <p>55_Swift_Hohenberg</p> <p>56_temporal_niches</p>"},{"location":"problems/#biochemistry","title":"Biochemistry","text":"<p>44_two_mer_entropy</p>"},{"location":"problems/#genetics","title":"Genetics","text":"<p>76_protein_dna_binding</p>"},{"location":"problems/#semiconductor-materials","title":"Semiconductor Materials","text":"<p>21_Absorption_coefficient_for_alloy_GaAlAs</p> <p>27_Design_trade_offs_for_high_speed_photodetectors</p> <p>34_PN_diode_band_diagram</p> <p><sup>*</sup>35_Quantum_Dot_Absorption_Spectrum</p> <p>36_Quasi_Fermi_levels_of_photo_resistor_out_of_equilibrium</p> <p>39_Reflection_spectra_for_a_Distributed_Bragg_Reflector</p> <p>42_The_threshold_current_for_multi_quantum_well_lasers</p>"},{"location":"problems/#molecular-modeling","title":"Molecular Modeling","text":"<p>47_Internal_Energy</p> <p>51_Simple_Molecular_Dynamics</p> <p>64_GCMC</p> <p>77_Berendsen_thermostat</p> <p>79_Nose_Hoover_chain_thermostat</p> <p>80_Anderson_thermostat</p> <p>*: Problems that are related to Nobel Prize-winning researches</p>"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"SciCode: A Research Coding Benchmark Curated by Scientists","text":"<p> Minyang Tian<sup>1,2*\u2021</sup>, Luyu Gao<sup>3*</sup>, Shizhuo Dylan Zhang<sup>1</sup>, Xinan Chen<sup>1\u2020</sup>, Cunwei Fan<sup>1\u2020</sup>, Xuefei Guo<sup>1\u2020</sup>, Roland Haas<sup>1\u2020</sup>, Pan Ji<sup>4\u2020</sup>, Kittithat Krongchon<sup>1\u2020</sup>, Yao Li<sup>1\u2020</sup>, Shengyan Liu<sup>1\u2020</sup>, Di Luo<sup>5,6,11\u2020</sup>, Yutao Ma<sup>7\u2020</sup>, Hao Tong<sup>1\u2020</sup>, Kha Trinh<sup>7\u2020</sup>, Chenyu Tian<sup>8\u2020</sup>, Zihan Wang<sup>1\u2020</sup>, Bohao Wu<sup>1\u2020</sup>, Yanyu Xiong<sup>9\u2020</sup>, Shengzhu Yin<sup>1\u2020</sup>, Minhui Zhu<sup>1\u2020</sup>, Kilian Lieret<sup>10</sup>, Yanxin Lu<sup>1</sup>, Genglin Liu<sup>1</sup>, Yufeng Du<sup>1</sup>, Tianhua Tao<sup>1</sup>, Ofir Press<sup>10</sup>, Jamie Callan<sup>3</sup>, Eliu Huerta<sup>1,2,7\u2021</sup>, Hao Peng<sup>1\u2021</sup> </p> <p> <sup>1</sup>University of Illinois Urbana-Champaign   <sup>2</sup>Argonne National Laboratory   <sup>3</sup>Carnegie Mellon University   <sup>4</sup>University of North Carolina at Chapel Hill   <sup>5</sup>Massachusetts Institute of Technology   <sup>6</sup>Harvard University   <sup>7</sup>University of Chicago   <sup>8</sup>University of Texas at Austin   <sup>9</sup>Stanford University   <sup>10</sup>Princeton University   <sup>11</sup>The NSF AI Institute for Artificial Intelligence and Fundamental Interactions   </p> <p> * Equal contribution lead authors. \u2020 Data curation, alphabetical order. 
\u2021 Corresponding to: {mtian8, haopeng}@illinois.edu, elihu@{anl.gov, uchicago.edu} </p> <ul> <li> <p> Paper</p> <p>Learn all the details</p> <p> Read the paper</p> </li> <li> <p> Dataset</p> <p>Browse all the problems</p> <p> Download Dataset</p> </li> <li> <p> Github Repo</p> <p>Learn how to evaluate your model</p> <p> Installation &amp; usage</p> </li> <li> <p> FAQ</p> <p> Read the FAQ</p> </li> <li> <p> Leaderboard</p> <p>How good are LMs at science, really? (Coming soon...)</p> <p> Browse the results</p> </li> </ul>"},{"location":"#introduction","title":"Introduction","text":"<p> SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of 16 subdomains from 6 domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is derived from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic, everyday scientific workflow: identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards becoming helpful assistants for scientists but also sheds light on the future building and evaluation of scientific AI. 
</p>"},{"location":"#overview","title":"Overview","text":"<p> SciCode sources challenging and realistic research-level coding problems across 6 natural science disciplines, covering a total of 16 subfields. This diverse selection ensures a comprehensive representation of the natural sciences, where extensive code development is essential. SciCode is mainly drawn from the scripts that scientists use in their everyday workflow. Many of these have been used in one or more publications, demonstrating their robustness and correctness.  Among various coding necessities, SciCode mainly focuses on: 1. Numerical methods. 2. Simulation of systems. 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LMs\u2019 science capability. The figure below is an example of the combination of 1 and 3.  In designing test cases for evaluation, we incorporate domain-specific test cases in addition to numerical cases. These tests are extracted from real scientific workflows: scientists must design domain-specific test cases to verify code accuracy by reproducing results published in papers or matching analytical solutions derived from theoretical models. Each problem goes through 3 rounds of validation (i.e., by in-domain scientists, out-of-domain scientists, and GPT-4) for quality control. 
</p> <p></p>"},{"location":"#benchmark-statistics","title":"Benchmark Statistics","text":"Fields Subfields Mathematics Numerical Linear Algebra (8), Computational Mechanics (5), Computational Finance (1) Physics Condensed Matter Physics (13), Optics (10), Quantum Information/Computing (6), Computational Physics (5), Astrophysics (2), Particle Physics (1) Chemistry Quantum Chemistry (5), Computational Chemistry (3) Biology Ecology (6), Biochemistry (1), Genetics (1) Material Science Semiconductor Materials (7), Molecular Modeling (6) <p>Left: Distribution of Main Problems   Right: Distribution of Subproblems</p> <p> We include several research problems that are built upon or reproduce methods used in Nobel Prize-winning studies to highlight current trends in scientific research: the self-consistent field (SCF) method for density functional theory (DFT) calculations (The Nobel Prize in Chemistry 1998), the PMNS matrix for neutrino oscillation in matter (The Nobel Prize in Physics 2015), the Haldane model for the anomalous quantum Hall effect (The Nobel Prize in Physics 2016), optical tweezer simulations for microscopic thermodynamics (The Nobel Prize in Physics 2018), and the replica method for spin glasses (The Nobel Prize in Physics 2021). </p>"},{"location":"#experiment-results","title":"Experiment Results","text":"<p> We evaluate our model using zero-shot prompts. We keep the prompts general and design different ones for different evaluation setups only to inform the model about the tasks. We keep prompts the same across models and fields, and they contain the model\u2019s main and sub-problem instructions and code for previous subproblems. The standard setup means the model is tested without background knowledge and carrying over generated solutions to previous subproblems. 
The scientists' annotated background provides the necessary knowledge and reasoning steps to solve the problems, shifting the evaluation\u2019s focus more towards the models\u2019 coding and instruction-following capabilities. </p> <p> </p>"},{"location":"#citation","title":"Citation","text":"<pre><code>@misc{tian2024scicode,\n    title={SciCode: A Research Coding Benchmark Curated by Scientists},\n    author={Minyang Tian and Luyu Gao and Shizhuo Dylan Zhang and Xinan Chen and Cunwei Fan and Xuefei Guo and Roland Haas and Pan Ji and Kittithat Krongchon and Yao Li and Shengyan Liu and Di Luo and Yutao Ma and Hao Tong and Kha Trinh and Chenyu Tian and Zihan Wang and Bohao Wu and Yanyu Xiong and Shengzhu Yin and Minhui Zhu and Kilian Lieret and Yanxin Lu and Genglin Liu and Yufeng Du and Tianhua Tao and Ofir Press and Jamie Callan and Eliu Huerta and Hao Peng},\n    year={2024},\n    eprint={2407.13168},\n    archivePrefix={arXiv},\n    primaryClass={cs.AI}\n}\n</code></pre>"},{"location":"_footer/","title":"footer","text":"<ul> <li> <p> Something broken?  Report bug</p> </li> <li> <p> Something unclear?  Ask question</p> </li> </ul>"},{"location":"example_problem/","title":"Example: Calculate Chern numbers for the Haldane Model","text":""},{"location":"example_problem/#main-problem-and-dependencies","title":"Main Problem and Dependencies","text":"<p>1. Generate an array of Chern numbers for the Haldane model on a hexagonal lattice by sweeping the following parameters: the on-site energy to next-nearest-neighbor coupling constant ratio (\\(m/t_2\\) from -6 to 6 with \\(N\\) samples) and the phase (\\(\\phi\\) from -\\(\\pi\\) to \\(\\pi\\) with \\(N\\) samples) values. 
Given the lattice spacing \\(a\\), the nearest-neighbor coupling constant \\(t_1\\), the next-nearest-neighbor coupling constant \\(t_2\\), the grid size \\(\\delta\\) for discretizing the Brillouin zone in the \\(k_x\\) and \\(k_y\\) directions (assuming the grid sizes are the same in both directions), and the number of sweeping grid points \\(N\\) for \\(m/t_2\\) and \\(\\phi\\).</p> <p><pre><code>'''\nInputs:\ndelta : float\n    The grid size in kx and ky axis for discretizing the Brillouin zone.\na : float\n    The lattice spacing, i.e., the length of one side of the hexagon.\nt1 : float\n    The nearest-neighbor coupling constant.\nt2 : float\n    The next-nearest-neighbor coupling constant.\nN : int\n    The number of sweeping grid points for both the on-site energy to next-nearest-neighbor coupling constant ratio and phase.\n\nOutputs:\nresults: matrix of shape(N, N)\n    The Chern numbers by sweeping the on-site energy to next-nearest-neighbor coupling constant ratio (m/t2) and phase (phi).\nm_values: array of length N\n    The swept on-site energy to next-nearest-neighbor coupling constant ratios.\nphi_values: array of length N\n    The swept phase values.\n'''\n</code></pre> <pre><code># Package Dependencies\nimport numpy as np\nimport cmath\nfrom math import pi, sin, cos, sqrt\n</code></pre></p>"},{"location":"example_problem/#subproblems","title":"Subproblems","text":"<p>1.1 Write a Haldane model Hamiltonian on a hexagonal lattice, given the following parameters: wavevector components \\(k_x\\) and \\(k_y\\) (momentum) in the x and y directions, lattice spacing \\(a\\), nearest-neighbor coupling constant \\(t_1\\), next-nearest-neighbor coupling constant \\(t_2\\), phase \\(\\phi\\) for the next-nearest-neighbor hopping, and the on-site energy \\(m\\).</p> <p>Scientists Annotated Background:</p> <p>Source: Haldane, F. D. M. (1988). Model for a quantum Hall effect without Landau levels: Condensed-matter realization of the\" parity anomaly\". 
Physical review letters, 61(18).</p> <p>We denote \\(\\{\\mathbf{a}_i\\}\\) are the vectors from a B site to its three nearest-neighbor A sites, and \\(\\{\\mathbf{b}_i\\}\\) are next-nearest-neighbor distance vectors, then we have</p> \\[ {\\mathbf{a}_1} = (0,a), \\] \\[ {\\mathbf{a}_2} = (\\sqrt 3 a/2, - a/2), \\] \\[ {\\mathbf{a}_3} = ( - \\sqrt 3 a/2, - a/2) \\] \\[ {\\mathbf{b}_1} = {\\mathbf{a}_2} - {\\mathbf{a}_3} = (\\sqrt 3 a,0), \\] \\[ {\\mathbf{b}_2} = {\\mathbf{a}_3} - {\\mathbf{a}_1} = ( - \\sqrt 3 a/2, - 3a/2), \\] \\[ {\\mathbf{b}_3} = {\\mathbf{a}_1} - {\\mathbf{a}_2} = ( - \\sqrt 3 a/2,3a/2) \\] <p>Then the Haldane model on a hexagonal lattice can be written as</p> \\[ H(k) = {d_0}I + {d_1}{\\sigma _1} + {d_2}{\\sigma _2} + {d_3}{\\sigma _3} \\] \\[{d_0} = 2{t_2}\\cos \\phi \\sum\\nolimits_i {\\cos (\\mathbf{k} \\cdot {\\mathbf{b}_i})} = 2{t_2}\\cos \\phi \\left[ {\\cos \\left( {\\sqrt 3 {k_x}a} \\right) + \\cos \\left( { - \\sqrt 3 {k_x}a/2 + 3{k_y}a/2} \\right) + \\cos \\left( { - \\sqrt 3 {k_x}a/2 - 3{k_y}a/2} \\right)} \\right] \\] \\[ {d_1} = {t_1}\\sum\\nolimits_i {\\cos (\\mathbf{k} \\cdot {\\mathbf{a}_i})}  = {t_1}\\left[ {\\cos \\left( {{k_y}a} \\right) + \\cos \\left( {\\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right) + \\cos \\left( { - \\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right)} \\right]\\\\ \\] \\[ {d_2} = {t_1}\\sum\\nolimits_i {\\sin (\\mathbf{k} \\cdot {\\mathbf{a}_i})}  = {t_1}\\left[ {\\sin \\left( {{k_y}a} \\right) + \\sin \\left( {\\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right) + \\sin \\left( { - \\sqrt 3 {k_x}a/2 - {k_y}a/2} \\right)} \\right] \\\\ \\] \\[ {d_3} = m - 2{t_2}\\sin \\phi \\sum\\nolimits_i {\\sin (\\mathbf{k} \\cdot {\\mathbf{b}_i})}  = m - 2{t_2}\\sin \\phi \\left[ {\\sin \\left( {\\sqrt 3 {k_x}a} \\right) + \\sin \\left( { - \\sqrt 3 {k_x}a/2 + 3{k_y}a/2} \\right) + \\sin \\left( { - \\sqrt 3 {k_x}a/2 - 3{k_y}a/2} \\right)} \\right] \\\\ \\] <p>where \\(\\sigma_i\\) are the Pauli matrices and \\(I\\) is the identity matrix. 
<pre><code>def calc_hamiltonian(kx, ky, a, t1, t2, phi, m):\n    \"\"\"\n    Function to generate the Haldane Hamiltonian with a given set of parameters.\n\n    Inputs:\n    kx : float\n        The x component of the wavevector.\n    ky : float\n        The y component of the wavevector.\n    a : float\n        The lattice spacing, i.e., the length of one side of the hexagon.\n    t1 : float\n        The nearest-neighbor coupling constant.\n    t2 : float\n        The next-nearest-neighbor coupling constant.\n    phi : float\n        The phase ranging from -\u03c0 to \u03c0.\n    m : float\n        The on-site energy.\n\n    Output:\n    hamiltonian : matrix of shape(2, 2)\n        The Haldane Hamiltonian on a hexagonal lattice.\n    \"\"\"\n</code></pre> <pre><code># test case 1\nkx = 1\nky = 1\na = 1\nt1 = 1\nt2 = 0.3\nphi = 1\nm = 1\nassert np.allclose(calc_hamiltonian(kx, ky, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># Test Case 2\nkx = 0\nky = 1\na = 0.5\nt1 = 1\nt2 = 0.2\nphi = 1\nm = 1\nassert np.allclose(calc_hamiltonian(kx, ky, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># Test Case 3\nkx = 1\nky = 0\na = 0.5\nt1 = 1\nt2 = 0.2\nphi = 1\nm = 1\nassert np.allclose(calc_hamiltonian(kx, ky, a, t1, t2, phi, m), target)\n</code></pre> 1.2 Calculate the Chern number using the Haldane Hamiltonian, given the grid size \\(\\delta\\) for discretizing the Brillouin zone in the \\(k_x\\) and \\(k_y\\) directions (assuming the grid sizes are the same in both directions), the lattice spacing \\(a\\), the nearest-neighbor coupling constant \\(t_1\\), the next-nearest-neighbor coupling constant \\(t_2\\), the phase \\(\\phi\\) for the next-nearest-neighbor hopping, and the on-site energy \\(m\\).</p> <p>Scientists Annotated Background:</p> <p>Source: Fukui, Takahiro, Yasuhiro Hatsugai, and Hiroshi Suzuki. 
\"Chern numbers in discretized Brillouin zone: efficient method of computing (spin) Hall conductances.\" Journal of the Physical Society of Japan 74.6 (2005): 1674-1677.</p> <p>Here we can discretize the two-dimensional Brillouin zone into grids with step \\(\\delta {k_x} = \\delta {k_y} = \\delta\\). If we define the U(1) gauge field on the links of the lattice as \\(U_\\mu (\\mathbf{k}_l) := \\frac{\\left\\langle n(\\mathbf{k}_l)\\middle|n(\\mathbf{k}_l + \\hat{\\mu})\\right\\rangle}{\\left|\\left\\langle n(\\mathbf{k}_l)\\middle|n(\\mathbf{k}_l + \\hat{\\mu})\\right\\rangle\\right|}\\), where \\(\\left|n(\\mathbf{k}_l)\\right\\rangle\\) is the eigenvector of Hamiltonian at \\(\\mathbf{k}_l\\), \\(\\hat{\\mu}\\) is a small displacement vector in the direction \\(\\mu\\) with magnitude \\(\\delta\\), and \\(\\mathbf{k}_l\\) is one of the momentum space lattice points \\(l\\). The corresponding curvature (flux) becomes</p> \\[ F_{xy}(\\mathbf{k}_l) := \\ln \\left[U_x(\\mathbf{k}_l)U_y(\\mathbf{k}_l+\\hat{x})U_x^{-1}(\\mathbf{k}_l+\\hat{y})U_y^{-1}(\\mathbf{k}_l)\\right] \\] <p>and the Chern number of a band can be calculated as</p> <p>$$ c = \\frac{1}{2\\pi i} \\Sigma_l F_{xy}(\\mathbf{k}_l), $$ where the summation is over all the lattice points \\(l\\). Note that the Brillouin zone of a hexagonal lattice with spacing \\(a\\) can be chosen as a rectangle with \\(0 \\le {k_x} \\le k_{x0} = 2\\sqrt 3 \\pi /(3a),0 \\le {k_y} \\le k_{y0} = 4\\pi /(3a)\\). 
<pre><code>def compute_chern_number(delta, a, t1, t2, phi, m):\n    \"\"\"\n    Function to compute the Chern number with a given set of parameters.\n\n    Inputs:\n    delta : float\n        The grid size in kx and ky axis for discretizing the Brillouin zone.\n    a : float\n        The lattice spacing, i.e., the length of one side of the hexagon.\n    t1 : float\n        The nearest-neighbor coupling constant.\n    t2 : float\n        The next-nearest-neighbor coupling constant.\n    phi : float\n        The phase ranging from -\u03c0 to \u03c0.\n    m : float\n        The on-site energy.\n\n    Output:\n    chern_number : float\n        The Chern number, a real number that should be close to an integer. The imaginary part is cropped out due to the negligible magnitude.\n    \"\"\"\n</code></pre></p> <pre><code># test case 1\ndelta = 2 * np.pi / 200\na = 1\nt1 = 4\nt2 = 1\nphi = 1\nm = 1\nassert np.allclose(compute_chern_number(delta, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># test case 2\ndelta = 2 * np.pi / 100\na = 1\nt1 = 1\nt2 = 0.3\nphi = -1\nm = 1\nassert np.allclose(compute_chern_number(delta, a, t1, t2, phi, m), target)\n</code></pre> <pre><code># test case 3\ndelta = 2 * np.pi / 100\na = 1\nt1 = 1\nt2 = 0.2\nphi = 1\nm = 1\nassert np.allclose(compute_chern_number(delta, a, t1, t2, phi, m), target)\n</code></pre> <p>1.3 Make a 2D array of Chern numbers by sweeping the parameters: the on-site energy to next-nearest-neighbor coupling ratio (\\(m/t_2\\) from -6 to 6 with \\(N\\) samples) and phase (\\(\\phi\\) from -\\(\\pi\\) to \\(\\pi\\) with \\(N\\) samples) values. Given the grid size \\(\\delta\\) for discretizing the Brillouin zone in the \\(k_x\\) and \\(k_y\\) directions (assuming the grid sizes are the same in both directions), the lattice spacing \\(a\\), the nearest-neighbor coupling constant \\(t_1\\), and the next-nearest-neighbor coupling constant \\(t_2\\). 
<pre><code>def compute_chern_number_grid(delta, a, t1, t2, N):\n    \"\"\"\n    Function to calculate the Chern numbers by sweeping the given set of parameters and returns the results along with the corresponding swept next-nearest-neighbor coupling constant and phase.\n\n    Inputs:\n    delta : float\n        The grid size in kx and ky axis for discretizing the Brillouin zone.\n    a : float\n        The lattice spacing, i.e., the length of one side of the hexagon.\n    t1 : float\n        The nearest-neighbor coupling constant.\n    t2 : float\n        The next-nearest-neighbor coupling constant.\n    N : int\n        The number of sweeping grid points for both the on-site energy to next-nearest-neighbor coupling constant ratio and phase.\n\n    Outputs:\n    results: matrix of shape(N, N)\n        The Chern numbers by sweeping the on-site energy to next-nearest-neighbor coupling constant ratio (m/t2) and phase (phi).\n    m_values: array of length N\n        The swept on-site energy to next-nearest-neighbor coupling constant ratios.\n    phi_values: array of length N\n        The swept phase values.\n    \"\"\"\n</code></pre></p>"},{"location":"example_problem/#domain-specific-test-cases","title":"Domain Specific Test Cases","text":"<p>Both the \\(k\\)-space and sweeping grid sizes are set to very rough values to make the computation faster, feel free to increase them for higher accuracy.</p> <p>At zero on-site energy, the Chern number is 1 for \\(\\phi &gt; 0\\), and the Chern number is -1 for \\(\\phi &lt; 0\\).</p> <p>For complementary plots, we can see that these phase diagrams are similar to the one in the original paper: Fig.2 in Haldane, F. D. M. (1988). To achieve a better match, decrease all grid sizes.</p> <p>Compare the following three test cases. 
We can find that the phase diagram is independent of the value of \\(t_1\\), and the ratio of \\(t_2/t_1\\), which is consistent with our expectations.</p> <p><pre><code># Test Case 1\ndelta = 2 * np.pi / 30\na = 1.0\nt1 = 4.0\nt2 = 1.0\nN = 40\n</code></pre> </p> <p><pre><code># Test Case 2\ndelta = 2 * np.pi / 30\na = 1.0\nt1 = 5.0\nt2 = 1.0\nN = 40\n</code></pre> </p> <p><pre><code># Test Case 3\ndelta = 2 * np.pi / 30\na = 1.0\nt1 = 1.0\nt2 = 0.2\nN = 40\n</code></pre> </p>"},{"location":"faq/","title":"FAQ","text":"<ul> <li> <p>How do you know that the subproblems are non-overlapping and complete? I assume this is up to the judgment of the question-writers and the in-domain validators, but it'd be nice if there was some way of more formally confirming this (not sure that's possible though).   In practice, we differentiate subproblems based on their context and their role within the broader problem-solving framework (i.e. given previous subproblem-code pairs). For example, the same function for calculating a derivative could be used in different contexts: computing the force from a potential at step 3 in one main problem, and computing a velocity at step 5 in another main problem. These subproblems, although based on the same mathematical operation, are not considered overlapping because the physics and context are totally different.</p> </li> <li> <p>SciCode doesn\u2019t actually have other scientists try and complete the problems, so your notion of \u201chuman validation\u201d is somewhat weak (I think the validators just look at the problems and say \u201cyep this looks good\u201d or \u201cI think this is bad\u201d with feedback). Ideally, we would have both in-domain and out-of-domain scientists fully engage with the problem by writing and executing the code themselves. This would allow us to systematically record their mistakes, feedback, and discussions, resulting in an ideal version of both the question design and code solution. 
However, given the limited capacity of our pool of scientists, we need to maximize the efficiency of contributions while still achieving a high-quality validation process. Here\u2019s what we believe is the most efficient approach:</p> <ul> <li>In-Domain Scientists: In-domain scientists focus on ensuring the scientific accuracy of the problem. For each main problem, they verify that the design is sound, check the subproblem methodology, and confirm that the solution correctly reproduces published results or conforms to established theoretical models. Their expertise ensures that the scientific underpinnings of the problem are accurate.</li> <li>Out-of-Domain Scientists: Out-of-domain scientists help by reviewing the question design and background information. The background is designed to enable graduate-level readers without specific domain expertise to solve the problem. Therefore, while they may not be experts in the specific field, they can still ensure that the information provided is clear and sufficient to solve the problem.</li> <li>Leveraging GPT-4 for Reproduction: Instead of requiring human scientists to reproduce the problem, we delegate this task to GPT-4. This is both cost-effective and efficient. The scientists\u2019 primary responsibility then becomes performing error analysis on the outputs, identifying areas where the question design, solution code, or test cases may need refinement.</li> </ul> <p>This streamlined process enables us to make the most of our scientific expertise while still ensuring robust validation of the problems. By focusing human effort on critical evaluation points and using AI for reproduction tasks, we can iterate and validate problems with greater speed and precision.</p> </li> <li> <p>I don\u2019t think that SciCode validates/confirms that solutions to these problems don\u2019t exist online. 
For each main problem, our approach ensures that it reproduces results from a published paper or established theoretical model, using these references as a benchmark. This is a standard practice in scientists\u2019 research workflow, and these benchmark results are often accessible online. </p> <p>The challenge lies in designing specific question-code pairs that faithfully reproduce those results. These question-code pairs are either derived from our own research workflows or created from scratch for topics we consider fundamental or important to the domain.</p> <p>Take, for example, the 2D Ising model, where the transition temperature is kT/J = 2.269. While this result is easily found in this paper, reproducing it with the simulation method requires significant effort. </p> <p>By doing so, we ensure that the problems, while based on known results, still demand an additional level of effort and insight, making them meaningful and non-trivial challenges.</p> </li> <li> <p>What's the difference between the numerical tests (inputs/outputs), and \u201cdomain-specific tests\u201d? The difference between numerical tests (inputs/outputs) and domain-specific tests lies in how scientists assess the correctness of a problem.</p> <p>Numerical tests focus on data points, ensuring that the inputs and outputs match expected numerical values. For example, in a molecular dynamics problem, numerical tests would check the final positions and velocities of atoms under a specific potential in a closed box after some time t.</p> <p>Domain-specific tests, on the other hand, involve checking whether the solution follows the relevant physical laws or constraints. Scientists don\u2019t just stop at numerical data; we also verify whether any physical laws are violated. In the same molecular dynamics example, apart from verifying positions and velocities, we would also check the conservation of energy and momentum. 
In a closed box, total energy, angular momentum, and linear momentum should remain the same before and after time t.</p> <p>The domain-specific test is a critical part of validating correctness not only for SciCode but also for real research workflows in natural science domains.</p> </li> <li> <p>What were the kinds of updates made during in-domain validation?</p> <p>During in-domain validation, all aspects of the problem can be updated to ensure accuracy and alignment with domain-specific expectations:</p> <ul> <li>Question Design: Refining or rewriting the problem to better align with the desired scientific solution or to ensure clarity in how the problem is framed.</li> <li>Formulation of Background: Updating the derivation or contextual background to provide a more accurate or comprehensive explanation of the problem and its relevance to the domain.</li> <li>Methodology: Revising the methods used to solve the problem, which involves adjusting algorithms, simulation techniques, or approaches to ensure the solution is robust and scientifically valid.</li> <li>Domain-specific Tests: Updating the domain-specific checks that validate the fundamental physical laws or constraints.</li> </ul> <p>At this stage, we have even totally rewritten a couple of problems to ensure that the question design can be properly paired with the code, making the problem-solving process more effective and scientifically rigorous.</p> </li> <li> <p>How can I trust the out-of-domain validation?     The role of out-of-domain validation is not to ensure scientific correctness but rather to verify that the problem is presented in a way that is clear, complete, and accessible to someone outside the specific field. 
The primary focus of out-of-domain validators is to ensure that the combination of the question and its background information provides enough context for someone to solve the problem, even if they lack domain-specific expertise.</p> </li> <li> <p>It seems that in the prompt template definition, the prompts with and without backgrounds are assigned the other way around:   DEFAULT_PROMPT_TEMPLATE = Path(\"eval\", \"data\", \"background_comment_template.txt\").read_text()   BACKGOUND_PROMPT_TEMPLATE = Path(\"eval\", \"data\", \"multistep_template.txt\").read_text()   Are the numbers reported in the paper run with these prompts?</p> <p>Yes, DEFAULT_PROMPT_TEMPLATE is our standard setup where we ask the model to generate the related background itself. BACKGOUND_PROMPT_TEMPLATE is the template where we will put in the scientist-annotated background.</p> </li> <li> <p>For subproblems 13.6, 62.1, 76.2, it seems like the model-generated outputs are ignored and replaced with the files in the eval folder - is this how the evaluations were run in the paper? And why are these problems evaluated this way?     These three problem-code pairs are provided as given context in order to control uncertainty and reduce the degrees of freedom in the evaluation process. By doing so, we limit the model\u2019s randomness in problem-solving. Without this context, the evaluation would allow for too many possible solutions, leading to inconsistent results.</p> </li> <li> <p>In line 66, if self.previous_llm_code[prev_step] is None the previous steps are populated with saved model outputs after running them through extract_function_name and get_function_from_code; otherwise the previous steps are populated with extract_python_script. It doesn't seem like the first case is invoked except for the subproblems 13.6, 62.1, 76.2 - can you confirm that extract_python_script was used for the numbers in the paper?     
This setup is designed specifically to handle cases where the model is interrupted mid-step and needs to resume from that point.</p> </li> </ul>"},{"location":"leaderboard/","title":"Leaderboard","text":"<p>"},{"location":"leaderboard/#scicode-leaderboard","title":"SciCode Leaderboard","text":"Model Main Problem Resolve Rate \ud83e\udd47OpenAI o1-preview 7.7% \ud83e\udd48Claude3.5-Sonnet 4.6% \ud83e\udd49Deepseek-Coder-v2 3.1% GPT-4o 1.5% GPT-4-Turbo 1.5% OpenAI o1-mini 1.5% Gemini 1.5 Pro 1.5% Claude3-Opus 1.5% Claude3-Sonnet 1.5% Qwen2-72B-Instruct 1.5% Llama-3.1-405B-Instruct 0% Llama-3.1-70B-Instruct 0% Mixtral-8x22B-Instruct 0% Llama-3-70B-Chat 0% <p>How to submit</p> <p>Want to submit your own model? Head over to the documentation.</p>"},{"location":"leaderboard_table/","title":"Leaderboard table","text":"date author model score 240712 scicode gpt4o 0.8 240712 scicode gpt4 0.8"},{"location":"problems/","title":"Problem List","text":""},{"location":"problems/#numerical-linear-algebra","title":"Numerical Linear Algebra","text":"<p>1_Conjugate_Gradient</p> <p>3_Gauss_Seidel</p> <p>4_Incomplete_Cholesky</p> <p>5_Lanczos</p> <p>9_Weighted_Jacobi</p> <p>29_Gram_Schmidt_orthogonalization</p> <p>31_independent_component_analysis</p> <p>74_Householder_QR</p>"},{"location":"problems/#computational-mechanics","title":"Computational Mechanics","text":"<p>18_NURBS</p> <p>24_Burgers_equation</p> <p>40_Spliting_Operator</p> <p>54_SUPG</p> <p>78_Chaotic_Dynamics_Pendulum</p>"},{"location":"problems/#computational-finance","title":"Computational Finance","text":"<p>63_Estimating_Stock_Option_Price</p>"},{"location":"problems/#condensed-matter-physics","title":"Condensed Matter Physics","text":"<p>17_linear_tetrahedron_method</p> <p>20_phonon_angular_momentum</p> <p><sup>*</sup>33_phase_diagram_chern_haldane_model</p> <p>38_Reciprocal_lattice_vector</p> <p>48_MEELS_conversion</p> <p><sup>*</sup>50_Replica_symmetry_breaking</p> <p>62_dmrg</p> <p>67_LEG_Dyson_equation_bulk</p> 
<p>69_LEG_Dyson_equation_semi_infinite</p> <p>72_ising_model</p> <p>73_Xray_conversion_II</p> <p>75_graphene_tight_binding</p>"},{"location":"problems/#optics","title":"Optics","text":"<p>2_Gaussian_Beam_Focus</p> <p>6_Spatial_filters_I</p> <p>7_Spatial_filters_II</p> <p>8_Spatial_filters_III</p> <p><sup>*</sup>14_Brownian_motion_in_the_optical_tweezer</p> <p>22_Beam_translation_reexpansion</p> <p>28_Gaussian_Beam_Intensity</p> <p><sup>*</sup>32_Multiparticle_dynamics_in_the_optical_tweezer_array</p> <p>37_ray_optics_spherical_aberration</p> <p>43_two_end_fiber_laser_generator</p>"},{"location":"problems/#quantum-informationcomputing","title":"Quantum Information/Computing","text":"<p>11_GADC_entanglement</p> <p>19_n_tangle</p> <p>23_Blahut_Arimoto</p> <p>59_VQE</p> <p>65_GHZ_protocol_fidelity</p> <p>71_GADC_rev_coherent_info</p>"},{"location":"problems/#computational-physics","title":"Computational Physics","text":"<p>13_Maxwell_Equation_Solver</p> <p>15_Crank_Nicolson_for_time_dependent_Schrodinger</p> <p>45_finite_difference_heat_equation</p> <p>52_Shooting_algo_H_atom</p> <p>57_1D_harmonic_oscillator_numerov_shooting</p>"},{"location":"problems/#astrophysics","title":"Astrophysics","text":"<p>49_nbody</p> <p>58_Tolman_Oppenheimer_Volkoff_star</p>"},{"location":"problems/#particle-physics","title":"Particle Physics","text":"<p><sup>*</sup>70_neutrino_oscillation</p>"},{"location":"problems/#quantum-chemistry","title":"Quantum Chemistry","text":"<p><sup>*</sup>12_Schrodinger_DFT_with_SCF</p> <p>30_helium_slater_jastrow_wavefunction</p> <p>46_helium_atom_vmc</p> <p>66_kolmogorov_crespi_potential</p> <p>68_helium_atom_dmc</p>"},{"location":"problems/#computational-chemistry","title":"Computational Chemistry","text":"<p>10_ewald_summation</p> <p>16_Davidson_method</p> <p>60_Widom_particle_insertion</p>"},{"location":"problems/#ecology","title":"Ecology","text":"<p>25_CRM_in_chemostat</p> <p>26_CRM_in_serial_dilution</p> 
<p>41_Structural_stability_in_serial_dilution</p> <p>53_Stochastic_Lotka_Volterra</p> <p>55_Swift_Hohenberg</p> <p>56_temporal_niches</p>"},{"location":"problems/#biochemistry","title":"Biochemistry","text":"<p>44_two_mer_entropy</p>"},{"location":"problems/#genetics","title":"Genetics","text":"<p>76_protein_dna_binding</p>"},{"location":"problems/#semiconductor-materials","title":"Semiconductor Materials","text":"<p>21_Absorption_coefficient_for_alloy_GaAlAs</p> <p>27_Design_trade_offs_for_high_speed_photodetectors</p> <p>34_PN_diode_band_diagram</p> <p><sup>*</sup>35_Quantum_Dot_Absorption_Spectrum</p> <p>36_Quasi_Fermi_levels_of_photo_resistor_out_of_equilibrium</p> <p>39_Reflection_spectra_for_a_Distributed_Bragg_Reflector</p> <p>42_The_threshold_current_for_multi_quantum_well_lasers</p>"},{"location":"problems/#molecular-modeling","title":"Molecular Modeling","text":"<p>47_Internal_Energy</p> <p>51_Simple_Molecular_Dynamics</p> <p>64_GCMC</p> <p>77_Berendsen_thermostat</p> <p>79_Nose_Hoover_chain_thermostat</p> <p>80_Anderson_thermostat</p> <p>*: Problems that are related to Nobel Prize-winning research</p>"}]}
\ No newline at end of file