Merge pull request #67 from iris-hep/update-IRIS-HEP-projects-0131

Update iris hep projects 0131
research-software-collaborations · Feb 1, 2024 · b47d9b0
2 parents 6981384 + 566cfe6
commit b47d9b0

Showing 20 changed files with 133 additions and 81 deletions.
diff --git a/projects/agc-julia-rntuple.yml b/projects/agc-julia-rntuple.yml
@@ -25,8 +25,10 @@ program:
   - IRIS-HEP fellow
 shortdescription: Implement an analysis pipeline for the Analysis Grand Challenge (AGC) using [JuliaHEP](https://github.com/JuliaHEP/) ecosystem.
 description: >
-  The project's main goal is to implement AGC pipeline using Julia to demonstrate usability and as a test of performance. New utility packages can be expected especially for systematics handling and out-of-core orchestration. (built on existing packages such as `FHist.jl` and `Dagger.jl`)
-  At the same time, the project can explore using `RNTuple` instead of `TTree` for AGC data storage. As the interface is exactly transparent, this goal mainly requires data conversion unless performance bugs are spotted. This will help inform the transition at LHC experiments in the near future (Run 4).
+  The project's main goal is to implement AGC pipeline using Julia to demonstrate usability and as a test of performance. New utility packages can be expected
+  especially for systematics handling and out-of-core orchestration. (built on existing packages such as `FHist.jl` and `Dagger.jl`) At the same time, the
+  project can explore using `RNTuple` instead of `TTree` for AGC data storage. As the interface is exactly transparent, this goal mainly requires data
+  conversion unless performance bugs are spotted. This will help inform the transition at LHC experiments in the near future (Run 4).
 contacts:
   - name: Jerry Ling
     email: [email protected]

diff --git a/projects/agc-physlite.yml b/projects/agc-physlite.yml
@@ -22,13 +22,13 @@ program:
   - IRIS-HEP fellow
 shortdescription: Create an Analysis Grand Challenge implementation using ATLAS PHYSLITE data
 description: >
-  The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC).
-  It captures relevant workflow aspects from data delivery to statistical inference.
-  The AGC has so far been based on publicly available Open Data from the CMS experiment.
-  The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs from the data formats used so far within the AGC.
-  This project involves implementing the capability to analyze PHYSLITE ATLAS data within the AGC workflow and optimizing the related performance under large volumes of data.
-  In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is expected to differ in some aspects from what the AGC has considered thus far.
-  This project will also investigate workflows to integrate the evaluation of such sources of uncertainty within a Python-based implementation of an AGC analysis task.
+  The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands
+  of the High-Luminosity LHC (HL-LHC). It captures relevant workflow aspects from data delivery to statistical inference. The AGC has so far been based on
+  publicly available Open Data from the CMS experiment. The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs
+  from the data formats used so far within the AGC. This project involves implementing the capability to analyze PHYSLITE ATLAS data within the AGC workflow and
+  optimizing the related performance under large volumes of data. In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is
+  expected to differ in some aspects from what the AGC has considered thus far. This project will also investigate workflows to integrate the evaluation of such
+  sources of uncertainty within a Python-based implementation of an AGC analysis task.
 contacts:
   - name: Matthew Feickert
     email: [email protected]

diff --git a/projects/agc-rdf.yml b/projects/agc-rdf.yml
@@ -26,10 +26,12 @@ program:
   - IRIS-HEP fellow
 shortdescription: Develop and test an analysis pipeline using ROOT's RDataFrame for the next iteration of the Analysis Grand Challenge
 description: >
-  The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the advantages of modern tools and technologies when applied to such tasks.
-  The next iteration of the AGC (v2) will put the capabilities of modern analysis interfaces such as Coffea and ROOT's RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine learning techniques.
-  The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their benchmarking on state-of-the-art analysis facilities.
-  The goal is to gain insights useful to guide the future design of both the analysis facilities and the applications that will be deployed on them.
+  The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the
+  advantages of modern tools and technologies when applied to such tasks. The next iteration of the AGC (v2) will put the capabilities of modern analysis
+  interfaces such as Coffea and ROOT's RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine
+  learning techniques. The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their
+  benchmarking on state-of-the-art analysis facilities. The goal is to gain insights useful to guide the future design of both the analysis facilities and the
+  applications that will be deployed on them.
 contacts:
   - name: Enrico Guiraud
     email: [email protected]

diff --git a/projects/agc-recast.yml b/projects/agc-recast.yml
@@ -22,17 +22,11 @@ commitment:
   - Full time
 shortdescription: Implement the CMS open data AGC analysis with RECAST and REANA
 description: >
-  [RECAST](https://iris-hep.org/projects/recast.html) is a platform for systematic
-  interpretation of LHC searches.
-  It reuses preserved analysis workflows from the LHC experiments, which is now
-  possible with containerization and tools such as [REANA](http://reanahub.io).
-  A yet unrealized component of the IRIS-HEP [Analysis Grand Challenge](https://agc.readthedocs.io/)
-  (AGC) is reuse and reinterpretation of the analysis.
-  This project would aim to preserve the AGC CMS open data analysis and the
-  accompanying distributed infrastructure and implement a RECAST workflow allowing
-  REANA integration with the AGC.
-  A key challenge of the project is creating a preservation scheme for the associated
-  Kubernetes distributed infrastructure.
+  [RECAST](https://iris-hep.org/projects/recast.html) is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from
+  the LHC experiments, which is now possible with containerization and tools such as [REANA](http://reanahub.io). A yet unrealized component of the IRIS-HEP
+  [Analysis Grand Challenge](https://agc.readthedocs.io/) (AGC) is reuse and reinterpretation of the analysis. This project would aim to preserve the AGC CMS
+  open data analysis and the accompanying distributed infrastructure and implement a RECAST workflow allowing REANA integration with the AGC. A key challenge of
+  the project is creating a preservation scheme for the associated Kubernetes distributed infrastructure.
 contacts:
   - name: Kyle Cranmer
     email: [email protected]

diff --git a/projects/cms-data-pop.yml b/projects/cms-data-pop.yml
@@ -23,7 +23,10 @@ commitment:
   - Full time
 shortdescription: Predict data popularity to improve its availability for physics analysis
 description: >
-  The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.
+  The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team
+  must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved
+  from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is
+  to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.
 contacts:
   - name: Dmytro Kovalskyi
     email: [email protected]

diff --git a/projects/cms-monit-micro-services.yml b/projects/cms-monit-micro-services.yml
@@ -26,7 +26,12 @@ program:
   - IRIS-HEP fellow
 shortdescription: To develop microservice architecture for CMS HTCondor Job Monitoring
 description: >
-  Current implementation of HTCondor Job Monitoring, internally known as Spider service, is a monolithic application which queries HTCondor Schedds periodically. This implementation does not allow deployment in modern Kubernetes infrastructures with advantages like auto-scaling, resilience, self-healing, and so on. However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of appropriate languages like GoLang when it has advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture and it will drop the work-power needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data.
+  Current implementation of HTCondor Job Monitoring, internally known as Spider service, is a monolithic application which queries HTCondor Schedds periodically.
+  This implementation does not allow deployment in modern Kubernetes infrastructures with advantages like auto-scaling, resilience, self-healing, and so on.
+  However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and
+  OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of appropriate languages like
+  GoLang when it has advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture and it will drop
+  the work-power needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data.
 contacts:
   - name: Brij Kishor Jashal
     email: [email protected]
diff --git a/projects/cms-t0-test.yml b/projects/cms-t0-test.yml
@@ -24,7 +24,10 @@ commitment:
   - Full time
 shortdescription: Improve functional testing before deployment of critical changes for CMS Tier-0
 description: >
-  The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline to execute a large-scale "replay" of the data processing using old data for the final functional testing before deployment of critical changes. The project is focusing on integration of unit tests and smaller functional tests in the integration pipeline to speed up testing and reduce resource utilization.
+  The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or
+  configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline to execute a large-scale "replay" of
+  the data processing using old data for the final functional testing before deployment of critical changes. The project is focusing on integration of unit
+  tests and smaller functional tests in the integration pipeline to speed up testing and reduce resource utilization.
 contacts:
   - name: Dmytro Kovalskyi
     email: [email protected]
@@ -34,4 +37,4 @@ contacts:
     email: [email protected]
 mentees:
   - name: Mycola Kolomiiets
-    link: https://iris-hep.org/fellows/MycolaKolomiiets.html
+    link: https://iris-hep.org/fellows/MycolaKolomiiets.html
diff --git a/projects/diff-geant.yml b/projects/diff-geant.yml
@@ -24,9 +24,9 @@ program:
   - IRIS-HEP fellow
 shortdescription: Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model.
 description: >
-  The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this 
-  Fellowship project is to develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic 
-  branching process that is modeling a particle shower spreading inside a detector material in three spatial dimensions.
+  The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this Fellowship project is to
+  develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic branching process that is modeling a particle shower
+  spreading inside a detector material in three spatial dimensions.
 contacts:
   - name: Lukas Heinrich
     email: [email protected]

diff --git a/projects/energy-cost-vre-coffea-casa.yml b/projects/energy-cost-vre-coffea-casa.yml
@@ -22,19 +22,16 @@ commitment:
   - Full time
 shortdescription: Implementing energy consumption benchmarks on different analysis platforms and facilities
 description: >
-  Benchmarks for software energy consumption are starting to appear
-  (see e.g. the [SCI score](https://github.com/Green-Software-Foundation/software_carbon_intensity/blob/main/Software_Carbon_Intensity/Software_Carbon_Intensity_Specification.md#quantification-method))
-  alongside more common performance benchmarks.
-  In this project, we will pilot the implementation of selected software energy consumption benchmarks
-  on two different facilities for user analysis:
+  Benchmarks for software energy consumption are starting to appear (see e.g. the [SCI
+  score](https://github.com/Green-Software-Foundation/software_carbon_intensity/blob/main/Software_Carbon_Intensity/Software_Carbon_Intensity_Specification.md#quantification-method))
+  alongside more common performance benchmarks. In this project, we will pilot the implementation of selected software energy consumption benchmarks on two
+  different facilities for user analysis:
      * the [Virtual Research Environment](https://indico.jlab.org/event/459/contributions/11671/),
        a prototype analysis platform for the European Open Science Cloud.
      * [Coffea-casa](https://coffea-casa.readthedocs.io/), a prototype Analysis
        Facility (AF), which provides services for "low-latency columnar analysis."
-  We will then test them with simple user software pipelines.
-  The candidate will work in collaboration with another IRIS-HEP fellow
-  investigating energy consumption benchmarks for ML algorithms,
-  and alongside a team of students and interns working on the selection and implementation of the benchmarks.
+  We will then test them with simple user software pipelines. The candidate will work in collaboration with another IRIS-HEP fellow investigating energy
+  consumption benchmarks for ML algorithms, and alongside a team of students and interns working on the selection and implementation of the benchmarks.
 contacts:
   - name: Caterina Doglioni
     email: [email protected]
diff --git a/projects/gnn-tracking.yml b/projects/gnn-tracking.yml
@@ -25,23 +25,25 @@ program:
 
 shortdescription: Reconstruct the trajectories of particle with graph neural networks
 description: |
-  In the GNN tracking project, we use [graph neural networks][gnn-wiki] (GNNs) to reconstruct trajectories ("tracks") of elementary particles traveling through a detector.
+  In the GNN tracking project, we use [graph neural networks][gnn-wiki] (GNNs) to reconstruct trajectories ("tracks") of elementary particles traveling through
+  a detector.
   This task is called ["tracking"][tracking-wiki] and is different from many other problems that involve trajectories:
   * there are several thousand particles that need to be tracked at once,
   * there is no time information (the particles travel too fast),
   * we do not observe a continuous trajectory but instead only around five points ("hits") along the way in different detector layers.
 
-  The task can be described as a combinatorically very challenging "connect-the-dots" problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories.
-  Expressed differently, each hit (containing not much more than the x/y/z coordinate) must be assigned to the particle/track it belongs to.
+  The task can be described as a combinatorically very challenging "connect-the-dots" problem, essentially turning a cloud of points (hits) in 3D space into a
+  set of O(1000) trajectories. Expressed differently, each hit (containing not much more than the x/y/z coordinate) must be assigned to the particle/track it
+  belongs to.
 
-  A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn't connect points that belong to the same particle.
-  In this way, only the individual trajectories remain as components of the initial fully connected graph.
-  However, this strategy does not seem to lead to perfect results in practice.
-  The approach of this project uses this strategy only as the first step to arrive at "small" graphs.
-  It then projects all hits into a learned latent space with the model learning to place hits of the same particle close to each other, such that the hits belonging to the same particle form clusters.
+  A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge
+  classifier to reject any edge that doesn't connect points that belong to the same particle. In this way, only the individual trajectories remain as components
+  of the initial fully connected graph. However, this strategy does not seem to lead to perfect results in practice. The approach of this project uses this
+  strategy only as the first step to arrive at "small" graphs. It then projects all hits into a learned latent space with the model learning to place hits of
+  the same particle close to each other, such that the hits belonging to the same particle form clusters.
 
-  The project code together with documentation and a reading list is available on [github][ghorganization] and uses [pytorch geometric][pyg].
-  See also [our GSoC proposal for the same project][gsoc-proposal], which lists prerequisites and possible tasks.
+  The project code together with documentation and a reading list is available on [github][ghorganization] and uses [pytorch geometric][pyg]. See also [our GSoC
+  proposal for the same project][gsoc-proposal], which lists prerequisites and possible tasks.
 
 
   [ghorganization]: https://github.com/gnn-tracking