# the top-level array contains semesters; each semester contains an array of talks
# - name: name of the semester
# when: any date within the semester (needed for correct ordering); date in format MM/DD/YY, e.g. 02/15/19
# where: text saying how to attend the talks (e.g. typically held at [SOME OFFICE] at [SOME TIME])
# responsible:
# name: full name of person responsible for seminars in the semester
# username: BU ID (username) of the responsible person; used for BU email (e.g. username@bu.edu)
# talks:
# - when: date of the talk; date in format MM/DD/YY, e.g. 02/15/19
# presenter: entire presenter block is OPTIONAL
# url: OPTIONAL. URL to presenter's homepage.
# name: presenter's full name
# affiliation: OPTIONAL. Presenter's affiliation.
# bio: OPTIONAL. Possibly big text piece, presenter's bio.
# title: OPTIONAL. Talk's title. (Ignored if "departmental" or "special")
# abstract: Possibly big text piece, talk's abstract. LaTeX snippets supported.
# departmental: OPTIONAL. If set, title and abstract are ignored and the talk is rendered as "CS Department Seminar".
# special: OPTIONAL. If set, this text is rendered (HTML markup, like bold / italic / anchor allowed) as is in navy blue.
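# Minimal illustrative sketch of the schema above, kept as comments so it is
# never parsed as data; every name, date, room, username, and URL below is a
# placeholder, not a real entry.
#
# - name: Example Semester
#   when: 01/15/25
#   where: |
#     Seminars are held in ROOM at TIME [unless otherwise noted].
#   responsible:
#     name: Jane Doe
#     username: jdoe
#   talks:
#     - when: 02/01/25
#       presenter:
#         url: https://example.org/
#         name: Jane Doe
#         affiliation: Example University
#         bio: |
#           One-paragraph presenter bio.
#       title: |
#         Example Talk Title
#       abstract: |
#         Short abstract; LaTeX snippets supported.
#     - when: 02/08/25
#       departmental: true   # assumed truthy value; the schema only says "if set"
#     - when: 02/15/25
#       special: |
#         <b>No seminar this week</b> (spring break).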
- name: Fall 2024 (Data Systems Seminar)
when: 09/09/24
where: |
Seminars are held on Mondays in CDS 950 at 12:00pm-1:30pm [unless otherwise noted].
responsible:
name: Teona Bagashvili
username: teona
talks:
- when: 09/04/24
presenter:
url: https://kylebd99.github.io/kyle.deeds.github.io/
name: Kyle Deeds
affiliation: University of Washington
bio: |
Kyle Deeds is currently a 5th year PhD student in the database lab at the University of Washington. He is advised by Dan Suciu and Magda Balazinska and interested in the intersection of databases, compilers, and HPC. Broadly, his work aims to make high performance computing accessible to domain experts through declarative programming coupled with automatic program optimization.
title: |
Galley: Modern Query Optimization for Declarative Sparse Tensor Programming <font color ="navy"><b>(12:00pm @ CDS 801)</b></font>
abstract: |
Modern computation has become dominated by the tensor programming abstraction. This framework allows users to write high performance programs for bulk computation via a high-level imperative interface. Recent work has extended this paradigm to sparse tensors (i.e. tensors where most entries are not explicitly represented) with the use of sparse tensor compilers. These systems excel at producing efficient code for computation over sparse tensors, which may be stored in a wide variety of formats. However, they require the user to manually choose the order of operations and the data formats at every step. Unfortunately, these decisions are both highly impactful and complicated, requiring significant effort to manually optimize. In this talk, I’ll present Galley, a system for declarative sparse tensor programming. Galley adapts the database community’s expertise in cost-based optimization to sparse tensor programs. First, it synthesizes a logical plan which determines high level structure and materialization strategy. Then, it lowers each piece of the logical plan to a physical implementation by determining loop orders, tensor formats, and merge algorithms. This plan is then efficiently executed by a sparse tensor compiler. Lastly, I’ll present early experimental results which show how Galley can produce up to 100x speedups over hand-optimized implementations for modestly complex tensor programs.
- when: 09/09/24
presenter:
url:
name: Anwesha Saha/Teona Bagashvili
affiliation: Boston University
bio:
title: |
First meeting of the semester and summer school updates
abstract:
- when: 09/16/24
presenter:
url: https://szeighami.github.io/
name: Sepanta Zeighami
affiliation: UC Berkeley
bio: |
Sepanta Zeighami is a postdoctoral scholar at UC Berkeley. Before that, he obtained his PhD from the University of Southern California. His research is focused on using -- and theoretically understanding -- AI methods for improving data systems, where he has extensively published in top-tier conferences in both AI and database communities, including in SIGMOD, VLDB, ICML, ICLR, etc. His research has been recognized by oral presentation and best paper awards.
title: |
Towards Practical Learned Database Operations with Theoretical Guarantees <font color ="navy"><b>(online @ CDS 801)</b></font>
abstract: |
Machine learning has shown significant practical benefits for performing various database operations such as indexing, cardinality estimation, and sorting. However, the theoretical characteristics of such learned database operations have not been well understood, limiting their applicability in real-world systems. In this talk, I will present our recent work on characterizing the performance benefits of learned database operations. I will show that, in the case of indexing, in static settings or when there is no distribution shift, learned methods can answer queries in O(log log n) expected time, but their performance can drop to O(log n) in the presence of distribution shift. Using insights from this result, I will then discuss our broader analysis framework, distribution learnability, through which we show various novel theoretical results for different learned database operations. Our results provide a better understanding of why and when learned methods outperform non-learned methods (and when they don’t) and take a first step towards practical learned database operations with theoretical guarantees. I will conclude the talk by presenting various open theoretical and practical problems to achieve that goal.
- when: 09/30/24
presenter:
url:
name: Yuanli Wang/Lei Huang
affiliation: Boston University
bio:
title: |
CAPSys: Contention-aware task placement for data stream processing
abstract:
- when: 10/07/24
presenter:
url: https://sites.bu.edu/casp/people/naima-abrar-shami/
name: Naima Abrar
affiliation: Boston University
bio:
title: |
PlatoGL: Effective and Scalable Deep Graph Learning System for Graph-enhanced Real-Time Recommendation
abstract:
- when: 10/15/24
presenter:
url:
name: Muhammad Faisal
affiliation: Boston University
bio:
title: |
Towards a scalable, generic, configurable Multi-Party Computation engine
- when: 10/21/24
presenter:
url: https://aneeshraman.net/
name: Aneesh Raman
affiliation: Boston University
bio:
title: |
Reducing the real storage overhead of learned indexes with wavelet trees
abstract:
- when: 10/28/24
presenter:
url: http://andrew.nerdnetworks.org/
name: Andrew Lamb
affiliation: InfluxDB
bio: Andrew currently works in Rust on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine and the Apache Arrow ecosystem. He serves on the Apache DataFusion PMC (2024 Chair) and the Apache Arrow PMC (2023 Chair), and actively contributes to the Apache Arrow DataFusion query engine and the Apache Arrow Rust implementation. He earned a BS and MEng in Course VI from MIT. More details are available at http://andrew.nerdnetworks.org/
title: |
Apache DataFusion: Design Choices when Building Modern Analytic Systems
video_url: https://www.youtube.com/watch?v=CpnxuBwHbUc
slides: andrew_lamb_slides.pdf
abstract: |
Andrew will discuss engineering choices made when building Apache DataFusion, an open source and extensible query engine used as the basis of many commercial and open source projects. He will cover rationale for decisions such as using Rust and building on standards such as Arrow and Parquet. He will also discuss DataFusion’s design philosophy of extensible APIs paired with simple default implementations. Finally, he will offer lessons learned and enumerate some things that worked well and what could have been improved.
- when: 10/31/24
presenter:
url: https://itshelenxu.github.io/
name: Helen Xu
affiliation: Georgia Institute of Technology
bio: Helen Xu is an assistant professor at Georgia Tech in the School of Computational Science and Engineering. Previously, she was the Grace Hopper Postdoctoral Scholar at Lawrence Berkeley National Laboratory. She completed her Ph.D. at MIT with Professor Charles E. Leiserson. Her main research interests are in parallel and cache-friendly algorithms and data structures.
title: |
Recent Advances on In-Memory Cache-Friendly Data Structures <font color ="navy"><b>(10am @ CDS 701)</b></font>
video_url: https://www.youtube.com/watch?v=cdq4ebPYUTQ
abstract: |
This talk will discuss optimizing different types of parallel in-memory indexes for improved spatial locality. Specifically, it will overview 1) BP-tree [VLDB 23], a cache-optimized concurrent in-memory index that improves range query performance without giving up on point queries, and 2) CPMA [PPoPP 24], an in-memory batch-parallel data structure built on the Packed Memory Array (PMA) that overcomes the point-range tradeoff.
- when: 11/04/24
presenter:
url:
name: Anwesha Saha
affiliation: Boston University
bio:
title: |
Bayesian Optimization for LSM Tuning
abstract:
- when: 11/19/24
presenter:
url:
name: George Neville-Neil
affiliation: Yale University
bio: George V. Neville-Neil works on networking and operating system code for fun and profit. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large code bases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. For nearly twenty years he has been the columnist better known as Kode Vicious. Since 2014 he has been an Industrial Visitor at the University of Cambridge where he is involved in several projects relating to computer security. He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth but has been deployed, as part of VxWorks, in NASA’s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City. He is currently a PhD student at Yale University working with Robert Soulé, Avi Silberschatz, and Peter Alvaro.
title: |
OSDB: Exposing the Operating System’s Inner Database <font color ="navy"><b>(12:30pm @ CAS211)</b></font>
abstract: |
Operating systems must provide functionality that closely resembles that of a data management system, but existing query mechanisms are ad-hoc and idiosyncratic. To address this problem, we argue for the adoption of a relational interface to the operating system kernel. While prior work has made similar proposals, our approach is unique in that it allows for incremental adoption over an existing, production-ready operating system. In this paper, we present progress on a prototype system called OSDB that embodies the incremental approach and discuss key aspects of the design, including the data model and concurrency control mechanisms. We present three example use cases: a network usage monitor, a load balancer, and file system checker, as well as experiments that demonstrate low overhead for our approach.
- when: 11/25/24
presenter:
url: https://aneeshraman.net/
name: Aneesh Raman
affiliation: Boston University
bio:
title: |
PhD prospectus dry run
abstract:
- when: 12/09/24
presenter:
url: https://cs-people.bu.edu/zczhu/
name: Zichen Zhu
affiliation: Boston University
bio:
title: |
Designing Instance-Optimal Data Systems in Modern Applications
abstract:
- name: Spring 2024
when: 01/18/24
where: |
Seminars are held on Thursdays in CDS 950 at 1:30pm [unless otherwise notified].
responsible:
name: Jinqi Lu
username: jinqilu
talks:
- when: 02/01/24
presenter:
url: https://kapilvaidya24.github.io/
name: Kapil Vaidya
affiliation: MIT
bio: |
Kapil Vaidya is an Applied Scientist at Amazon Redshift in the Learned Systems Group. He recently graduated from MIT, where he completed his PhD under the guidance of Professor Tim Kraska in the Data Systems Group. His research primarily revolves around the application of machine learning to data systems, with a special focus on the data structures and algorithms integral to these systems. Before that Kapil completed his Bachelors from IIT Bombay.
title: |
Sparse Numerical Array-Based Range Filters (SNARF) <font color ="navy"><b>(12:00pm @ CDS 1101 also part of BU Systems Seminar)</b></font>
abstract: |
We present Sparse Numerical Array-Based Range Filters (SNARF), a learned range filter that efficiently supports range queries for numerical data. SNARF creates a model of the data distribution to map the keys into a bit array which is stored in a compressed form. The model along with the compressed bit array which constitutes SNARF are used to answer membership queries. We evaluate SNARF on multiple synthetic and real-world datasets as a stand-alone filter and by integrating it into RocksDB. For range queries, SNARF provides up to 50x better false positive rate than state-of-the-art range filters, such as SuRF and Rosetta, with the same space usage. We also evaluate SNARF in RocksDB as a filter replacement for filtering requests before they access on-disk data structures. For RocksDB, SNARF can improve the execution time of the system up to 10x compared to SuRF and Rosetta for certain read-only workloads.
- when: 02/15/24
presenter:
url: https://www.lujinqi.com/
name: Jinqi Lu
affiliation: Boston University
bio: |
Jinqi is currently a CS Master's student at the BU DiSC Lab, advised by Prof. Athanassoulis. He possesses a wide range of interests, including Database Systems, Cloud, and Networks. Currently, he is engaged in the relational memory project and also contributes to the testing of computational and zoned storage devices.
title: |
Reading Group: Selection Pushdown in Column Stores using Bit Manipulation Instructions
abstract: |
Modern analytical database systems predominantly rely on column-oriented storage, which offers superior compression efficiency due to the nature of the columnar layout. This compression, however, creates challenges in decoding speed during query processing. Previous research has explored predicate pushdown on encoded values to avoid decoding, but these techniques are restricted to specific encoding schemes and predicates, limiting their practical use. In this paper, we propose a generic predicate pushdown approach that supports arbitrary predicates by leveraging selection pushdown to reduce decoding costs. At the core of our approach is a fast select operator capable of directly extracting selected encoded values without decoding, by using Bit Manipulation Instructions, an instruction set extension to the X86 architecture. We empirically evaluate the proposed techniques in the context of Apache Parquet using both micro-benchmarks and the TPC-H benchmark, and show that our techniques improve the query performance of Parquet by up to one order of magnitude with representative scan queries. Further experimentation using Apache Spark demonstrates speed improvements of up to 5.5X even for end-to-end queries involving complex joins.
- when: 02/21/24
presenter:
url: https://homes.cs.washington.edu/~suciu/
name: Dan Suciu
affiliation: University of Washington
bio: |
Dan Suciu is a Microsoft Endowed Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Suciu is conducting research in data management, on topics such as query optimization, probabilistic data, data pricing, parallel data processing, data security. He is a co-author of two books Data on the Web: from Relations to Semistructured Data and XML, 1999, and Probabilistic Databases, 2011. He received the ACM SIGMOD Codd Innovation Award, received several best paper awards and test of time awards, and is a Fellow of the ACM. Suciu is currently an associate editor for the Journal of the ACM. Suciu’s PhD students Gerome Miklau, Christopher Re and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016 respectively, and Nilesh Dalvi was a runner up in 2008.
title: |
Cardinality Estimation: Achilles’ Heel of Query Optimizers <font color ="navy"><b>(11:00am @ CDS 1750 also part of the CS Distinguished Colloquium)</b></font>
abstract: |
Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The estimator has access to some limited statistics on the database, such as table cardinalities, number of distinct values in various columns, sometimes histograms, and needs to estimate the output size of complex queries, often involving many joins and filter predicates. The cardinality estimator is a critical piece of any query optimizer, and is often the main culprit when the optimizer chooses a poor plan. In this talk I will briefly survey existing cardinality estimation methods, discussing their pros and cons, then I will introduce a new approach to the problem, which is based on computing an upper bound instead of an estimate. The upper bound is given by a mathematical formula, and can use the available database statistics in surprising ways.
- when: 02/22/24
presenter:
url: https://cs-people.bu.edu/teona/
name: Teona Bagashvili
affiliation: Boston University
title: |
Reading Group: ZNS SSDs
abstract: |
TBA
- when: 02/29/24
presenter:
url: https://sites.google.com/view/juhyoungmun/
name: Ju Hyoung Mun
affiliation: Boston University
title: |
Reading Group: Programmable DDRx Controllers
abstract: |
TBA
- when: 03/07/24
presenter:
url: https://cs-people.bu.edu/karatse/
name: Konstantinos Karatsenidis
affiliation: Boston University
title: |
Reading Group: LRU-C: Parallelizing Database I/Os for Flash SSDs
abstract: |
TBA
- when: 03/14/24
presenter:
url: https://gatterbauer.name/
name: Wolfgang Gatterbauer
affiliation: Northeastern University
bio: |
Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University. A major focus of his research is to extend the capabilities of modern data management systems in generic ways and to allow them to support novel functionalities that seem hard at first. He received an NSF Career award and, with his students and collaborators, a best paper award at EDBT 2021, best-of-conference selections for PODS 2021, SIGMOD 2017, WALCOM 2017, and VLDB 2015, and two out of three reproducibility awards for papers published at SIGMOD 2020.
<br/><br/><a href="https://gatterbauer.name">https://gatterbauer.name</a>
title: |
HITSnDIFFs: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property
abstract: |
We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the "truth"), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem "ability discovery" to emphasize the connection to and duality with the more well-studied problem of "truth discovery".
<br/><br/>To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call "HITSNDIFFS" (or HND), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), HND also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that HND produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm.
<br/><br/>Based on joint work with Zixuan Chen, Subhodeep Mitra and R. Ravi
<br/><br/><a href="https://arxiv.org/abs/2401.00013" target=_blank>https://arxiv.org/abs/2401.00013</a>
<br/><a href="https://github.com/northeastern-datalab/HITSnDIFFs/" target=_blank>https://github.com/northeastern-datalab/HITSnDIFFs/</a>
<br/><a href="https://www.youtube.com/watch?v=2pN1Nx7-sd0" target=_blank>https://www.youtube.com/watch?v=2pN1Nx7-sd0</a>
- when: 03/18/24
presenter:
url: https://aamodkh.github.io/
name: Aamod Khatiwada
affiliation: Northeastern University
bio: |
Aamod Khatiwada is a PhD Candidate (Computer Science, Databases) at Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. He is originally from Jhorahat, a town in the southeastern part of Nepal. His primary research focus is to develop novel techniques and algorithms that find and integrate open data tables in a principled way. In his free time, he enjoys watching cricket and soccer, and writing poems, stories, and news articles.
title: |
Table Discovery and Integration in Data Lakes <font color ="navy"><b>(5:00pm - 6:15pm @ HAR 306)</b></font>
abstract: |
Data lakes serve as repositories that house millions of tables covering a wide range of subjects. Data scientists use tables from data lakes to support decision-making processes, train machine learning models, perform statistical analysis, and more. However, the enormous size and heterogeneity of data lakes, together with missing or unreliable metadata, make it challenging for data scientists to find suitable tables for their analyses. Furthermore, after table discovery, another challenge is to integrate the discovered tables together to get a unified view of data, enabling queries that go beyond a single table. In this talk, I will present novel techniques for table discovery and integration tasks. First, I will overview SANTOS [1] which helps data scientists discover tables from a data lake that union with their query table and possibly extend it with more rows. Contrary to existing techniques that only use the semantics of columns to perform union search, we improve the search accuracy by also using semantic relationships between pairs of columns in a table. Second, I will cover ALITE (Align and Integrate) [2], the first proposal for scalable integration of tables that may have been discovered using join, union, or related table search. ALITE aligns the columns of discovered tables using holistic schema matching and then applies Full Disjunction, an associative version of outer join, to get an integrated table. I will also describe DIALITE [3], a novel table discovery pipeline that allows users to discover, integrate, and analyze open data tables. DIALITE is flexible such that the user can easily add and compare additional discovery and integration algorithms.
[1] Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald. SANTOS: Relationship-based Semantic Table Union Search. SIGMOD 2023
[2] Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller. Integrating Data Lake Tables. VLDB 2023.
[3] Aamod Khatiwada, Roee Shraga, Renée J. Miller. DIALITE: Discover, Align and Integrate Open Data Tables. SIGMOD-Companion 2023.
- when: 03/21/24
presenter:
url: https://cs-people.bu.edu/ndhuynh/
name: Andy Huynh
affiliation: Boston University
title: |
Reading Group: Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
- when: 04/04/24
presenter:
url: https://cs-people.bu.edu/zczhu/
name: Zichen Zhu
affiliation: Boston University
bio: |
Zichen Zhu is currently a 5th-year PhD candidate in MiDAS at Boston University, advised by Prof. Manos Athanassoulis. Prior to joining Boston University, he was an MPhil student in the Database Group, Department of Computer Science at the University of Hong Kong, advised by Prof. Reynold Cheng. His research interest focuses on designing workload-aware query optimization under memory constraints.
title: |
Dry Run for Zichen's Prospectus
- when: 04/11/24
presenter:
url: https://cs-people.bu.edu/weiran22/
name: Ran Wei
affiliation: Boston University
title: |
Reading Group: Improving LSM-Tree Based Key-Value Stores With Fine-Grained Compaction Mechanism
- when: 04/11/24
presenter:
url: https://cs-people.bu.edu/aarona/
name: Aaron Ang
affiliation: Boston University
title: |
Reading Group: Automatic SQL Error Mitigation in Oracle
- when: 04/18/24
presenter:
url: https://cs-people.bu.edu/papon/
name: Tarikul Islam Papon
affiliation: Boston University
title: |
Reading Group: WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs
- when: 05/02/24
presenter:
url: https://disc.bu.edu/
name: Tarikul Islam Papon / Zichen Zhu
affiliation: Boston University
title: |
Dry Run for SIGMOD
- when: 05/09/24
presenter:
url: https://ramananeesh.github.io/
name: Aneesh Raman
affiliation: Boston University
title: |
Reading Group: Learned Index with Precise Positions
- when: 05/16/24
presenter:
url: https://midas.bu.edu
name: Yanpeng Wei
affiliation: Boston University
title: |
Reading Group: Cabin: a Compressed Adaptive Binned Scan Index
- name: Fall 2023
when: 09/01/23
where: |
Seminars are held in CDS 950 at 11:30am [unless otherwise notified].
responsible:
name: Zichen Zhu
username: zczhu
talks:
- when: 10/18/23
presenter:
url: https://midas.bu.edu/seminar.html
name: C. Mohan
affiliation: IBM Fellow
bio: |
Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University, a Distinguished Visiting Professor at Tsinghua University in China, and a member of the inaugural Board of Governors of Digital University Kerala. He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2002-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. During the last many years, he focused on Blockchain, AI, Big Data and Cloud technologies (<a href="https://bit.ly/sigBcP">https://bit.ly/sigBcP</a>, <a href="https://bit.ly/CMoTalks">https://bit.ly/CMoTalks</a>). Since 2017, he has been an evangelist of permissioned blockchains and the myth buster of permissionless blockchains. During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. In 2019, he became an Honorary Advisor to the Tamil Nadu e-Governance Agency. In 2020, he joined the Advisory Board of the Kerala Blockchain Academy. Since 2016, Mohan has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. In 2023, he was named Distinguished Professor of Science of Hong Kong Baptist University. In 2021, he was inducted as a member of the inaugural Board of Governors of the new Indian university Digital University Kerala. Mohan has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He has also been a Consultant to the Microsoft Data Team in 2020.
title: |
Systems Design Dilemmas: Simplicity Versus Complexity <font color ="navy"><b>(CDS 17th floor, 11:00am)</b></font>
abstract: |
Based on my 4 decades long involvement in and exposure to numerous database systems, in this talk I will discuss how various factors led multiple systems to be evolved in ways that made them become more complex in terms of their design as well as their usability aspects. While simplicity of design and architecture might be a highly desirable objective, there will invariably be conflicting requirements that lead to complexity creeping in, at least with respect to the internals of such systems. Examples of such requirements relate to the cloud environment, performance, availability, heterogeneity in terms of supported hardware, operating systems and programming language environments, accommodation of multiple workload types, extensibility, etc. It is also the case that aiming for simplicity in one aspect of system design might cause complexity to be introduced in other areas. A case in point is the desire for the near elimination of the need for DBAs and tuning knobs in the cloud environment which makes the internals of such systems become much more complex so that they could dynamically detect and react to different workload types that such systems might be used for. While KISS (Keep It Simple, Stupid) might be a desirable design principle, following it mindlessly might result in the kiss of death with respect to performance and reliability!
- when: 10/18/23
presenter:
url: https://midas.bu.edu/seminar.html
name: C. Mohan
affiliation: IBM Fellow
bio: |
Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University, a Distinguished Visiting Professor at Tsinghua University in China, and a member of the inaugural Board of Governors of Digital University Kerala. He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2002-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. During the last many years, he focused on Blockchain, AI, Big Data and Cloud technologies (<a href="https://bit.ly/sigBcP">https://bit.ly/sigBcP</a>, <a href="https://bit.ly/CMoTalks">https://bit.ly/CMoTalks</a>). Since 2017, he has been an evangelist of permissioned blockchains and the myth buster of permissionless blockchains. During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. In 2019, he became an Honorary Advisor to the Tamil Nadu e-Governance Agency. In 2020, he joined the Advisory Board of the Kerala Blockchain Academy. Since 2016, Mohan has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. In 2023, he was named Distinguished Professor of Science of Hong Kong Baptist University. In 2021, he was inducted as a member of the inaugural Board of Governors of the new Indian university Digital University Kerala. Mohan has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He has also been a Consultant to the Microsoft Data Team in 2020.
title: |
A Technical Retrospect: Q&A with Mohan <font color ="navy"><b>(CDS 364, 2:30pm)</b></font>
abstract: |
In this talk, using my own life as an example, I discuss the excitement of doing research and developing advanced technologies. Starting with my undergraduate days as a chemical engineering student at IIT Madras, I narrate how my academic life changed when I got introduced to an IBM mainframe which led to my ultimately doing a PhD in computer science at University of Texas at Austin. Along the way, I talk about the frustrations of the prolonged hunt for a PhD thesis topic. Then, I trace my professional career at IBM which included working on hot topics as well as what were initially thought by others to be passé topics which turned out to be not so! I deal with the difficulties of getting papers published and doing technology transfer to new products as well as old ones. Having never been a manager during my entire 38.5 years IBM career, even though as an IBM Fellow I had been an IBM executive for 23 years, I bring a unique perspective on what it takes to be on a long-term purely technical career path. I discuss the DOs and DONTs of choosing technical topics to work on, and on being successful in having impact internally and outside one’s employer. Based in general on my extensive travels across the world and in particular on my experience of working as the IBM India Chief Scientist, while based in Bangalore for almost 3 years, I also address issues that arise in being part of a multinational corporation and trying to be successful in doing deeply technical work in a developing country. I also talk about how I am keeping myself active technically and professionally since my retirement from IBM in June 2020.
- when: 11/06/23
presenter:
url: https://cs.uwaterloo.ca/~xiaohu/
name: Xiao Hu
affiliation: University of Waterloo
bio: |
Xiao Hu is an incoming Assistant Professor in the Cheriton School of Computer Science at the University of Waterloo. Prior to that, she was a Visiting Faculty Scholar in the Discrete Algorithm Group at Google Research and a postdoctoral associate in the Department of Computer Science at Duke University. She received her Ph.D. in Computer Science and Engineering from HKUST, and a BE degree in Computer Software from Tsinghua University. Her research has focused on studying fundamental problems in database theory and their implications for practical systems.
title: |
Incremental View Maintenance for Conjunctive Queries
abstract: |
Incremental view maintenance (IVM) is a critical problem in the field of database management systems. It addresses the challenge of efficiently updating materialized views in response to changes in the underlying data, reducing the computational cost and ensuring up-to-date query results. In this context, conjunctive queries, a subset of relational algebra, provide a rich framework for expressing complex queries of both theoretical and practical interest. In this talk, I will introduce a general IVM framework for conjunctive queries, such that views can be maintained efficiently while supporting enumerating full query results as well as delta query results with delay guarantee. I will describe the intrinsic relationship between the sequence of updates and the maintenance cost. At last, I will briefly discuss some interesting questions when aggregation is involved.
- when: 11/30/23
presenter:
url: https://www.cs.cit.tum.de/dis/team/prof-dr-viktor-leis/
name: Viktor Leis
affiliation: Technical University of Munich
bio: |
Viktor Leis is a Professor of Computer Science at the Technical University of Munich (TUM). His research revolves around designing high-performance data management systems and includes core database systems topics such as query processing, query optimization, transaction processing, index structures, and storage. Another major research area is designing cloud-native, cost-efficient information systems. Viktor Leis received his doctoral degree in 2016 from TUM and was a professor at the Friedrich-Schiller-Universität Jena and Friedrich-Alexander-Universität Erlangen-Nürnberg before returning to TUM in 2022. He received the ACM SIGMOD dissertation award, the IEEE TCDE Rising Star Award, best paper awards at IEEE ICDE and ACM SIGMOD, and an ERC Starting Grant.
title: |
LeanStore: A High-Performance Storage Engine for NVMe SSDs and Multi-Core CPUs <font color ="navy"><b>(CAS 313, 12:30pm)</b></font>
abstract: |
The goal of the LeanStore project is to achieve robust performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for larger data sets. To achieve these goals we had to redesign all components of the traditional DBMS stack, including buffer management, indexing, concurrency control, I/O management, logging, recovery, and page replacement. In this talk, I will give an overview of these components and the evolution of the system over the past 6 years.
- when: 12/05/23
presenter:
url: https://www.eecs.tufts.edu/~johes/
name: Johes Bater
affiliation: Tufts University
bio: |
Johes Bater is an assistant professor of Computer Science at Tufts University and co-director of the Tufts Security and Privacy Lab. Before that, he was a postdoctoral researcher in the Database Group at Duke University, received his Ph.D. in computer science from Northwestern University, and completed his B.S. and M.S. in electrical engineering at Stanford University. His research centers on how to balance privacy, security, and utility to build fast, accurate database systems that support privacy-preserving analytics with provable security guarantees.
title: |
Building Systems That Protect Users and Their Data <font color ="navy"><b>(CAS 313, 12:30pm)</b></font>
abstract: |
Simply put, data security and privacy affects every single person in the world. Data breaches, where organizations compromise sensitive user data, are frighteningly common. Current approaches however, fail to properly safeguard security and privacy. Existing encrypted databases introduce data pipelines that leak enough side channel information that data breaches are an inevitability, while cryptographically secure solutions are so slow and cumbersome that they no longer serve their intended purpose. Moreover, many business models - such as targeted advertising - are predicated on their continued ability to analyze and make decisions based on private user data. Without a principled approach to protecting user privacy, it is unclear if this setup is sustainable. To address this problem, we need systems that treat security and privacy as first-class citizens in their system design.
In this talk, I will discuss how to build secure and private database management systems that balance the trade-offs among privacy, performance, and query result accuracy. This work spans numerous disciplines of computer science, including, 1) databases: modeling query workloads to estimate and optimize their privacy properties, runtime, and result accuracy across multiple settings, 2) security: extending secure computation protocols to evaluate SQL statements over the private data of many data owners without revealing data in plaintext, and 3) privacy: rigorously modeling the end-to-end information leakage profile of a privacy-preserving DBMS. The goal of this work is to bring together these disparate research areas to create usable query processing engines with strong privacy and security guarantees.
- when: 12/11/23
special: |
Internal Project Updates
- name: Spring 2023
when: 01/31/23
where: |
Seminars are held in CDS 950 at 11:30am [unless otherwise notified].
responsible:
name: Zichen Zhu
username: zczhu
talks:
- when: 02/10/23
presenter:
url: https://subhadeepsarkar.bitbucket.io/
name: Subhadeep Sarkar
affiliation: Boston University
bio: |
Subhadeep Sarkar is a post-doctoral associate at Boston University.
His current research interests are data systems, storage layouts,
access methods, and their intersection with data privacy. Prior to
this, Subhadeep spent two years as a post-doctoral researcher at
INRIA Rennes, France. He completed his Ph.D. in 2017 from IIT
Kharagpur. Subhadeep’s work appears at leading database and network
conferences and journals, including SIGMOD, VLDB, EDBT, ICDE, IEEE
Transactions on Cloud Computing, and IEEE Transactions on Computers.
In 2016, Subhadeep was selected as one of the most qualified
research scientists to attend the Heidelberg Laureate Forum.
title: |
Building Deletion-Compliant Data Systems
<font color ="navy"><b>(CDS 1001)</b></font>
abstract: |
Modern data systems are designed around the silent underlying
assumption of perpetual data retention. Systems are designed to offer
high performance for data ingestion and queries, while deletion of
data is realized by simply logically invalidating the target entries
(instead of modifying them in place). This has severe implications
on the privacy front, especially in light of the newly enforced
privacy regulations which are aimed toward enabling the users to
express their preferences about the deletion of their data. In this
talk, we present our vision of enabling privacy-through-deletion as
a design element in modern data systems empowering them with the
ability to ensure timely and persistent deletion of data, and we
envision doing so without hurting performance or increasing
operational cost.
- when: 02/17/23
presenter:
url: https://sites.google.com/view/juhyoungmun/
name: Juhyoung Mun
affiliation: Boston University
bio: |
Juhyoung Mun is currently a postdoctoral associate in the
Department of Computer Science at Boston University, working with
Prof. Manos Athanassoulis. She received her Ph.D. in Electronic
and Electrical Engineering at Ewha Womans University in Seoul, Korea.
Her research interest lies in the area of data systems and
networking, particularly, enhancing the lookup performance in
LSM-Trees and building data systems for complex hybrid
transactional/analytical workloads using hardware/software
co-design.
title: |
Relational Memory: Native In-Memory Accesses on Rows and Columns
abstract: |
Analytical database systems are typically designed to use a
column-first data layout to access only the desired fields. On the
other hand, storing data row-first works great for accessing,
inserting, or updating entire rows. Transforming rows to columns
at runtime is expensive, hence, many analytical systems ingest data
in row-first form and transform it in the background to columns to
facilitate future analytical queries. How will this design change
if we can always efficiently access only the desired set of columns?
To address this question, we present a radically new approach to
data transformation from rows to columns. We build upon recent
advancements in embedded platforms with re-programmable logic to
design native in-memory access on rows and columns. Our approach,
termed Relational Memory, relies on an FPGA-based accelerator that
sits between the CPU and main memory and transparently transforms
base data to any group of columns with minimal overhead at runtime.
This design allows accessing any group of columns as if it already
exists in memory. We implement and deploy Relational Memory in real
hardware, and we show that we can access the desired columns up to
1.63x faster than accessing them from their row-wise counterpart,
while matching the performance of a pure columnar access for low
projectivity, and outperforming it by up to 1.87x as projectivity
(and tuple reconstruction cost) increases. Moreover, our approach
can be easily extended to support offloading of a number of
operations to hardware, e.g., selection, group by, aggregation, and
joins, having the potential to vastly simplify the software logic
and accelerate the query execution.
- when: 03/03/23
presenter:
url: https://ramananeesh.github.io/
name: Aneesh Raman
affiliation: Boston University
bio: |
Aneesh Raman is a PhD student at Boston University and a member of
the DiSC Lab, working with Prof. Manos Athanassoulis. His research
focus is on Database systems - indexing and access methods, and his
interests include high performance computing, data mining and machine
learning. Prior to Boston University, he was at Purdue University Fort
Wayne, where he graduated with highest distinction with a Bachelor's
in Computer Science.
title: |
SWARE: Indexing for Near-Sorted Data
abstract: |
Indexing in data systems can be perceived as the process of adding
structure to the incoming, otherwise unsorted data. If the data
ingestion order matches the indexed attribute order, the index
ingestion cost is entirely redundant and can be avoided altogether.
In this work, we identify sortedness as a resource that can accelerate
index ingestion. We propose a new sortedness-aware (SWARE) design
paradigm that pays less indexing cost for near-sorted data. When
applied to a classical index like the B+-tree, SWARE offers a 9x
speedup for write intensive workloads and a 4x benefit in mixed
workloads by exploiting data sortedness.
- when: 03/17/23
presenter:
url: https://cs-people.bu.edu/papon/
name: Tarikul Islam Papon
affiliation: Boston University
bio: |
Papon is a Ph.D. candidate, part of DiSC Lab at BU, advised by Prof.
Athanassoulis. His research interests broadly include Data Systems,
specifically on hardware-aware data management challenges, stemming
from the evolution of storage and memory devices. Currently, he is
working on proposing a new design paradigm for interacting with durable
storage that takes into account performance asymmetry between read and
write operations, as well as the variable access concurrency that
different devices may support. Prior to joining BU, Papon served as a
lecturer for 4 years at the Department of Computer Science & Engineering
(CSE) in Bangladesh University of Engineering & Technology (BUET). He
completed his MSc and BSc degrees from the same department where he
worked on medical informatics using various machine learning and
embedded system techniques.
title: |
ACEing the Bufferpool Management Paradigm for Modern Storage Devices
abstract: |
Over the past few decades, solid-state drives (SSDs) have been
replacing hard disk drives (HDDs) due to their faster reads and writes,
and their superior random access performance. Further, when compared to HDDs,
SSDs have two fundamentally different properties: (i) read/write asymmetry
(writes are slower than reads) and (ii) access concurrency (multiple I/Os can
be executed in parallel to saturate the device bandwidth). However, several
database operators are designed without considering storage asymmetry and
concurrency resulting in device underutilization, which is typically addressed
opportunistically by device-specific tuning during deployment. As a key example
and the focus of our work, the bufferpool management of a Database Management
System (DBMS) is tightly connected to the underlying storage device, yet,
state-of-the-art approaches treat reads and writes equally, and do not
expressly exploit the device concurrency, leading to subpar performance.
In this paper, we propose a new Asymmetry & Concurrency-aware bufferpool
management (ACE) that batches writes based on device concurrency and performs
them in parallel to amortize the asymmetric write cost. In addition, ACE
performs parallel prefetching to exploit the device’s read concurrency. ACE
does not modify the existing bufferpool replacement policy, rather, it is a
wrapper that can be integrated with any replacement policy. We implement ACE in
PostgreSQL and evaluate its benefits using a synthetic benchmark and TPC-C for
several popular eviction policies (Clock Sweep, LRU, CFLRU, LRU-WSR). The ACE
counterparts of all four policies lead to significant performance improvements,
exhibiting up to 32.1% lower runtime for mixed workloads (33.8% for
write-intensive TPC-C transactions) with a negligible increase in total disk
writes and buffer misses, which shows that incorporating asymmetry and
concurrency in algorithm design leads to more faithful storage modeling and,
ultimately, to better device utilization.
- when: 04/14/23
presenter:
url: https://kvombatkere.github.io/
name: Karan Vombatkere
affiliation: Boston University
bio: |
Karan Vombatkere is a 2nd year PhD student in the Massive Data & Algorithms
(MiDAS) research group at BU, advised by Dr. Evimaria Terzi. His research
interests are primarily focused on algorithmic data mining and computational social science problems. His current research is on designing efficient approximation algorithms for submodular optimization problems, specifically the Team Formation problem.
title: |
Balancing Task Coverage and Expert Workload in Team Formation
abstract: |
In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.
- when: 04/18/23
presenter:
url: https://rmarcus.info/blog/
name: Ryan Marcus
affiliation: University of Pennsylvania
bio: |
Ryan Marcus is working on using machine learning to build the next generation
of data management tools that automatically adapt to new hardware and user
workloads, invent novel processing strategies, and understand user intention.
He is especially interested in query optimization, index structures,
intelligent clouds, programming language runtimes, program synthesis for data
processing, and applications of reinforcement learning to systems problems in
general.
title: |
LSI: A Learned Secondary Index Structure <font color ="navy"><b>(MCS B37, 12:30pm)</b></font>
abstract: |
Learned index structures have been shown to achieve favorable lookup performance
and space consumption compared to their traditional counterparts such as B-trees.
However, most learned index studies have focused on the primary indexing setting,
where the base data is sorted. In this work, we investigate whether learned indexes
sustain their advantage in the secondary indexing setting. We introduce Learned
Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted
data. LSI works by building a learned index over a permutation vector, which allows
binary search to be performed on the unsorted base data using random access. We
additionally augment LSI with a fingerprint vector to accelerate equality lookups.
We show that LSI achieves comparable lookup performance to state-of-the-art
secondary indexes while being up to 6× more space efficient.
- when: 04/21/23
presenter:
url: https://itrummer.github.io/
name: Immanuel Trummer
affiliation: Cornell University
bio: |
Immanuel Trummer is an assistant professor at Cornell University. His papers were selected
for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for
publication in CACM as a CACM Research Highlight. His online lecture introducing students to
database topics collected over a million views. He received the NSF CAREER Award and
multiple Google Faculty Research Awards.
title: |
Towards Highly Customizable Database Systems via Generative AI <font color ="navy"><b>(MCS B37, 12:30pm)</b></font>
- when: 04/25/23
presenter:
url: https://scholar.harvard.edu/sanket/home
name: Sanket Purandare
affiliation: Harvard University
bio: |
Sanket Purandare is a 4th year PhD student at the Data and Systems Lab (DASLab) at Harvard
University, advised by Prof. Stratos Idreos. His research interests lie in the area of systems for
deep learning. In particular, he works on the characterization of deep learning training and
inference workloads and optimizing the deep learning software infrastructure for hardware
efficiency by applying systems concepts from operating systems, compilers, and computer
architecture in a novel fashion. Currently, he is a Visiting Researcher with Meta Inc.
(formerly Facebook) and works with the PyTorch Compilers and Distributed Teams on building
the compiler stack for distributed training.
title: |
μ-TWO: Multi-Model Training with Orchestration and Memory Optimization <font color ="navy"><b>(CDS 701, 11:00am)</b></font>
abstract: |
Deep neural networks are becoming ever more useful for numerous applications. However, it is
increasingly challenging to utilize high-quality networks due to massive computational
costs. In fact, computational costs continuously increase due to the need for ever more
complex networks and due to the need to explore numerous network designs during the
development of an AI solution. This leads to several problems, including limited development
of and access to high-quality AI solutions, time-to-market limitations for organizations, and
negative environmental impact caused by increased use of resources. We observe that modern
GPUs, the substrate on which neural networks are developed, are severely underutilized. In fact,
the utilization factor is only on the order of 50% and drops as GPUs get ever faster. Fusing
operators across multiple models and using larger mini-batch sizes are two key methods for
increasing hardware utilization and training throughput. However, the large memory requirements
of training algorithms, coupled with the limited memory capacity of modern GPUs, pose a strict
restriction on the applicability of these methods. In this paper, we introduce μ-two, a new
scheduling algorithm that maximizes GPU utilization by using operator fusion and memory
optimization. The core novelty behind μ-two is its ability to schedule enough independent
computational operations from multiple models to utilize all of the available GPU's compute
capacity while, at the same time, using all compute time to overlap independent memory-swapping
operations. We show how to create μ-two schedules for arbitrary neural network and GPU
architectures. We integrated μ-two within the PyTorch compiler and demonstrate a 3× speed-up with
diverse network architectures and diverse NVIDIA GPUs, spanning vision, natural language
processing, and recommendation applications.
- when: 04/26/23
presenter:
url: https://users.cs.duke.edu/~sudeepa/
name: Sudeepa Roy
affiliation: Duke University
bio: |
Sudeepa Roy is an Associate Professor in Computer Science at Duke University. She works broadly in
data management, with a focus on the foundational aspects of big data analysis, which includes
causality and explanations for big data, data repair, query optimization, probabilistic databases,
and database theory. Before joining Duke in 2015, she did a postdoc at the University of Washington,
and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of the VLDB Early Career
Research Contributions Award, an NSF CAREER Award, and a Google Ph.D. fellowship in structured data.
She is a co-director of the Almost Matching Exactly (AME) lab for interpretable causal inference at
Duke (<a href="https://almost-matching-exactly.github.io/">https://almost-matching-exactly.github.io/</a>).
title: |
Connecting Causal Inference with Databases towards Actionable Data Analysis <font color ="navy"><b>(CDS 701, 11:00am)</b></font>
abstract: |
Causal analysis – i.e., estimating the effect of a "treatment" on an "outcome" – lays the foundation of
sound and robust policy making by providing a means to estimate the impact of a certain intervention on
the world, and is practically indispensable in health, medicine, social sciences, and other domains.
It forms the stepping stone for actionable data analysis that correlation, association, or model-based
prediction analysis cannot provide -- as cited widely in statistical studies, correlation does not imply
causation. In this talk, I will discuss some of our research on connecting traditional database
research with causal inference research from AI and Statistics, and exploring how they can
strengthen each other. First, I will present a "matching" framework for causal inference called "Almost
Matching Exactly", which rivals black box machine learning methods in estimation accuracy but has the
benefit of being interpretable and easier to troubleshoot, and also can use efficient SQL query
evaluations from databases for scalability. Then I will discuss how standard causal inference
techniques on homogeneous populations (single table) can be extended to "relational" (multi-table) data
frequently found in practice. Finally, I will talk about how such "relational causal inference" can be
applied to hypothetical reasoning in database research in terms of "what-if" and "how-to" queries, and
conclude with an overview of our ongoing research directions on actionable and interpretable data
analysis with causality and explanations.
- when: 05/05/23
presenter:
url: https://user.it.uu.se/~geofa117/
name: Georgios Fakas
affiliation: Uppsala University
bio: |
Georgios Fakas is an Associate Professor at the Department of Information Technology, Uppsala
University. He is currently leading the Uppsala DataBase Laboratory (UDBL) within the Department of
Information Technology. Before joining Uppsala University, he worked as a Research Fellow at the Hong
Kong University of Science and Technology (with Prof. Dimitris Papadias), Hong Kong University (with
Prof. Nikos Mamoulis), EPFL (Lausanne, Switzerland), UMIST (Manchester, UK), and the University of Cyprus
(Cyprus). Prof. Georgios Fakas also worked as a Senior Lecturer at Manchester Metropolitan University
(UK). He obtained his Ph.D. in 1998, his M.Phil. in 1996 and his B.Sc. in Computation in 1995; all
from the Department of Computation, UMIST, Manchester, UK.
title: |
Proportionality on Spatial Data with Context
abstract: |
More often than not, spatial objects are associated with some context, in the form of text, descriptive
tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2,
DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but
also the context of the objects, especially when the retrieved objects are too many and the query result
is overwhelming. In this article, we study the problem of selecting a subset of the query result, which
is the most representative. We argue that objects with similar context and nearby locations should
proportionally be represented in the selection. Proportionality dictates the pairwise comparison of all
retrieved objects and hence bears a high cost. We propose novel algorithms which greatly reduce the cost
of proportional object selection in practice. In addition, we propose pre-processing, pruning, and
approximate computation techniques whose combination reduces the computational cost of the algorithms
even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical
studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies
that proportional selection is preferable to random selection and to selection based on object
diversification.
- when: 06/02/23
presenter:
url: https://www.it.uu.se/katalog/tiazh991/
name: Tianru Zhang
affiliation: Uppsala University
bio: |
Tianru Zhang is a PhD student in the Department of Information Technology at Uppsala University. He received
his MSc degree from ENSAI (the National School of Statistics and Information Analysis, France), and his
bachelor's degree from the University of Science and Technology of China. Before joining Uppsala University,
he also worked as an assistant researcher at the Fujitsu R&D Center, Beijing. His research interests
include cloud computing, data management, and federated machine learning.
title: |
Efficient Hierarchical Storage Management Empowered by Reinforcement Learning <font color ="navy"><b>(Online only @ 11:30 AM)</b></font>
abstract: |
With the rapid development of big data and cloud computing, data management has become increasingly
challenging. Over the years, a number of frameworks for data management have become available. Most of them
are highly efficient, but ultimately create data silos. It becomes difficult to move and work coherently
with data as new requirements emerge. A possible solution is to use an intelligent hierarchical
(multi-tier) storage system (HSS). An HSS is a meta-solution that consists of different storage frameworks
organized as a jointly constructed storage pool. A built-in data migration policy that determines the
optimal placement of the datasets in the hierarchy is essential. Making placement decisions is a non-trivial
task, since decisions should be made according to the characteristics of the dataset, the status of the tiers in the hierarchy,
and access patterns. This paper presents an open-source hierarchical storage framework with a dynamic
migration policy based on reinforcement learning (RL). We present a mathematical model, a software
architecture, and implementations based on both simulations and a live cloud-based environment. We compare
the proposed RL-based strategy to a baseline of three rule-based policies, showing that the RL-based policy
achieves significantly higher efficiency and optimal data distribution in different scenarios.
- name: Fall 2022
when: 09/02/22
where: |
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
responsible:
name: Andy Huynh
username: ndhuynh
talks:
- when: 10/07/22
presenter:
url: https://mhmdsaiid.github.io/
name: Mohammed Saeed
affiliation: EURECOM
bio: |
Mohammed Saeed obtained a double engineering diploma from the Lebanese
University and Télécom Paristech. Since 2019, he has been a PhD student in
Computational Fact Checking at EURECOM (France), under the supervision
of Prof. Paolo Papotti. His research spans methods across Natural
Language Processing, Knowledge Graphs, and Deep Learning.
title: |
Explainable Checking for Factual Claims
abstract: |
Misinformation is an important problem but fact checkers are overwhelmed
by the amount of false content that is produced online every day. To
support fact checkers in their efforts, we are creating data-driven
verification methods that use data resources to assess claims and
explain their decisions. However, using structured data for facts
expressed in natural language is a complex task. In this talk, we
present a first solution that jointly learns representations of
structured and non-structured data. We translate text claims into SQL
queries on relational databases by exploiting text classifiers and
tentative execution of query candidates to narrow down the set of
alternatives. We then discuss how logical rules can be used to verify
claims in a checking task modeled as a probabilistic inference problem.
We show that this form of reasoning can be “learned” by
transformer-based language models. Experiments show that both methods
enable the efficient and effective labeling of claims with interpretable
explanations.
- name: Spring 2022
when: 02/02/22
where: |
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
responsible:
name: Andy Huynh
username: ndhuynh
talks:
- when: 02/11/22
presenter:
url: https://homepages.inf.ed.ac.uk/scohen/
name: Shay Cohen
affiliation: University of Edinburgh
bio: |
Shay Cohen is a Reader at the University of Edinburgh (School of
Informatics). Prior to this, he was a postdoctoral research scientist
in the Department of Computer Science at Columbia University, and held
an NSF/CRA Computing Innovation Fellowship. He received his B.Sc. and
M.Sc. from Tel Aviv University in 2000 and 2004, and his Ph.D. from
Carnegie Mellon University in 2011. His research interests span a
range of topics in natural language processing and machine learning,
with a focus on structured prediction (for example, parsing) and text
generation.
title: |
Document Summarisation: Modelling, Datasets and Verification of Content
abstract: |
Within Natural Language Processing, document summarisation is one of the
central problems. It has both short-term societal implications and
long-term implications in terms of the success of AI. I will describe
advances made in this area with respect to three different aspects:
methodology and modelling, dataset development and enforcing factuality
of summaries. In relation to modelling, I will show how reinforcement
learning can be used to directly maximise the metric by which the
summaries are being evaluated. With regards to dataset development, I
will describe a dataset that we released for summarisation, XSum, in
which a single sentence is used to describe the content of a whole
article. The dataset has become a standard benchmark for summarisation.
Finally, in relation to factuality, I will show how one can improve the
quantitative factuality of summaries by re-ranking them in a beam based
on a "verification" model.
- when: 03/11/22
presenter:
url: https://prashantpandey.github.io/
name: Prashant Pandey
affiliation: VMware
bio: |
Pandey is a Research Scientist at VMware Research. Previously, he did
postdocs at the University of California, Berkeley, and at Carnegie Mellon
University. He obtained his Ph.D. in Computer Science at Stony Brook
University in December 2018. His goal as a researcher is to advance
the theory and practice of resource-efficient data structures and
employ them to democratize complex and large-scale data analyses. He
designs and builds tools for large-scale data management problems
across computational biology, stream processing, and storage. He is
also the main contributor and maintainer of multiple open-source
software tools that are used by hundreds of users across academia and
industry. He won the prestigious Catacosinos Fellowship in 2018 and a
Best Paper award at FAST 2016, and was runner-up for Best Paper at FAST
2015. His work has appeared at top conferences and journals, such as
SIGMOD, FAST, SPAA, ESA, RECOMB, ISMB, Genome Biology, and Cell
Systems.
title: |
Data Systems at Scale: Scaling Up by Scaling Down and Out (to Disk)
abstract: |
Our ability to generate, acquire, and store data has grown exponentially
over the past decade making the scalability of data systems a major
challenge. This talk presents my work on two techniques to tackle the
scalability challenge: scaling down, i.e., shrinking data to fit in RAM,
and scaling out to disk, i.e., organizing data on disk so that the
application can still run fast. I will describe new compact and
I/O-efficient data structures and their applications in computational
biology, stream processing, and storage.
In computational biology, my work shows how to shrink genomic and
transcriptomic indexes by a factor of two while accelerating queries by
an order of magnitude compared to the state-of-the-art tools. In stream
processing, my work bridges the gap between the worlds of external
memory and stream processing to perform scalable and precise real-time
event-detection on massive streams. In file systems, my work improves
file-system random-write performance by an order of magnitude without
sacrificing sequential read/write performance.
- when: 03/18/22
presenter:
url: https://www.linkedin.com/in/ippokratis-pandis-03b1402/
name: Ippokratis Pandis
affiliation: Amazon AWS
bio: |
Ippokratis Pandis is a senior principal engineer at Amazon Web Services, currently working in Amazon Redshift.
Redshift is Amazon's fully managed, petabyte-scale data warehouse service. Previously, Ippokratis held
positions as a software engineer at Cloudera, where he worked on the Impala SQL-on-Hadoop query engine, and as a
member of the research staff at the IBM Almaden Research Center, where he worked on IBM DB2 BLU.
Ippokratis received his PhD from the Electrical and Computer Engineering department at Carnegie Mellon
University. He is the recipient of Best Demonstration awards at ICDE 2006 and SIGMOD 2011, and a Test-of-Time
award at EDBT 2019. He has served or is serving as PC chair of DaMoN 2014, DaMoN 2015, CloudDM 2016, HPTS 2019,
and ICDE Industrial 2022, as well as General Chair of SIGMOD 2023.
title: |
Amazon Redshift Reinvented <font color ="navy"><b>(talk @ 12:00 PM)</b></font>
abstract: |
In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon
Redshift, the first fully managed, petabyte-scale cloud data warehouse solution. Amazon Redshift made it simple
and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This
launch was a significant leap from the traditional on-premise data warehousing solutions which were expensive,
rigid (not elastic), and needed a lot of tribal knowledge to perform. Unsurprisingly, customers embraced Amazon
Redshift and it went on to become the fastest growing service in AWS. Today, tens of thousands of customers use
Amazon Redshift in AWS's global infrastructure of 25 launched Regions and 81 Availability Zones (AZs) to process
Exabytes of data daily.
The success of Amazon Redshift inspired a lot of innovation in the analytics industry
which in turn has benefited consumers. In the last few years, the use cases for Amazon Redshift have evolved and
in response, Amazon Redshift has delivered a series of innovations that continue to delight customers. In this
talk, we take a peek under the hood of Amazon Redshift, and give an overview of its architecture. We focus on
the core of the system and explain how Amazon Redshift maintains its differentiating industry-leading
performance and scalability. We discuss how Amazon Redshift extends beyond traditional data warehousing
workloads by integrating with the broader AWS ecosystem, making Amazon Redshift a one-stop solution for
analytics. We then talk about Amazon Redshift’s autonomics and Amazon Redshift Serverless. In particular, we
present how Redshift continuously monitors the system and uses machine learning to improve its performance and
operational health without the need for dedicated administration resources, in an easy-to-use offering.
- when: 04/22/22
presenter:
url: https://www.linkedin.com/in/venusatuluri/
name: Venu Satuluri
affiliation: Twitter
bio: |
Venu Satuluri is a Principal ML Engineer at Twitter. He has been
working on a variety of recommender systems and applied ML problems
over the last 10 years at Twitter. Venu received his Ph.D. focusing on
graph clustering and similarity search from the Ohio State University
in 2012.
title: |
SimClusters: Community-Based Representations for Heterogeneous
Recommendations at Twitter
<font color ="navy"><b>(talk @ 11:00 AM)</b></font>
abstract: |
Personalized recommendation products at Twitter target a variety of
items: Tweets, Events, Topics, Hashtags, and users. Each of these targets
varies in its cardinality (which affects the scale of the problem) and
their shelf life (which constrains the latency of generating the
recommendations). Although Twitter has built a variety of recommendation
systems over the past decade, solutions to the broader problem
were mostly tackled piecemeal. In this work, we present SimClusters, a
general-purpose representation layer based on overlapping communities
into which users as well as heterogeneous content can be captured as
sparse, interpretable vectors to support a multitude of recommendation
tasks. We propose a novel algorithm for community discovery based on
Metropolis-Hastings sampling, which is both more accurate and
significantly faster than off-the-shelf alternatives. SimClusters scales
to networks with billions of users and has been effective across a
variety of deployed applications at Twitter.
- when: 05/06/22
presenter:
url: https://raulcastrofernandez.com/
name: Raul Castro Fernandez
affiliation: University of Chicago
bio: |
I am interested in understanding the economics and value of data,
including the potential of data markets to unlock that value. The goal
of my research is to understand how to make the best use of data
possible. For that, I often build systems to share, discover, prepare,
integrate, and process data. I often use techniques from data
management, statistics, and machine learning. I am an assistant
professor in the Computer Science department at the University of
Chicago. Before UChicago, I did a postdoc at MIT with Sam Madden and
Mike Stonebraker. And before that, I completed my PhD at Imperial
College London with Peter Pietzuch.
title: |
Data Station: Delegated, Trustworthy, and Auditable Computation to
Enable Data-Sharing Consortia with a Data Escrow
abstract: |
Pooling and sharing data increases and distributes its value. But since
data cannot be revoked once shared, scenarios that require controlled
release of data for regulatory, privacy, and legal reasons default to
not sharing. Because selectively controlling what data to release is
difficult, the few data-sharing consortia that exist are often built
around data-sharing agreements resulting from long and tedious one-off
negotiations.
In this talk, I will present Data Station, a data escrow designed to
enable the formation of data-sharing consortia. Data owners share data
with the escrow knowing it will not be released without their consent.
Data users delegate their computation to the escrow. The data escrow
relies on delegated computation to execute queries without releasing the
data first. The Data Station leverages hardware enclaves to generate
trust among participants and exploits the centralization of data and
computation to generate an audit log.
Finally, I will contextualize the design and implementation of the Data Station
in a larger research agenda that explores “the value of data”, the “economics of
data”, and “data markets”.
- when: 05/13/22
presenter:
url: https://www.eurecom.fr/~appuswam/
name: Raja Appuswamy
affiliation: EURECOM
bio: |
Raja Appuswamy is an Assistant Professor in the Data Science
department at EURECOM--a Grande École located in the sunny Sophia
Antipolis tech-valley of southern France. Previously, he was a
Researcher and Visiting Professor at EPFL, Switzerland; a Visiting
Researcher in the Systems and Networking group at Microsoft Research,
Cambridge; and a Software Development Engineer in the Windows 7
kernel team at Microsoft, Redmond.
He received his Ph.D. in Computer Science from the Vrije Universiteit, Amsterdam,
where he worked under the guidance of Prof. Andrew S. Tanenbaum on designing and
implementing a new storage stack for the MINIX 3 microkernel operating system.