# the top-level array contains semesters; each semester contains an array of talks
# - name: name of the semester
# when: any date within the semester (needed for correct ordering); date in format MM/DD/YY, e.g. 02/15/19
# where: text saying how to attend the talks (e.g. typically held at [SOME OFFICE] at [SOME TIME])
# responsible:
# name: full name of person responsible for seminars in the semester
# username: BU ID (username) of the responsible person; used for BU email (e.g. username@bu.edu)
# talks:
# - when: date of the talk; date in format MM/DD/YY, e.g. 02/15/19
# presenter: entire presenter block is OPTIONAL
# url: OPTIONAL. URL to presenter's homepage.
# name: presenter's full name
# affiliation: OPTIONAL. Presenter's affiliation.
# bio: OPTIONAL. Possibly big text piece, presenter's bio.
# title: OPTIONAL. Talk's title. (Ignored if "departmental" or "special")
# abstract: Possibly big text piece, talk's abstract. LaTeX snippets supported.
# departmental: OPTIONAL. If set, title and abstract are ignored and the talk is rendered as "CS Department Seminar".
# special: OPTIONAL. If set, this text is rendered (HTML markup, like bold / italic / anchor allowed) as is in navy blue.
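# Minimal illustrative sketch of the schema above, kept as comments so it is
# never parsed as data; every name, date, room, username, and URL below is a
# placeholder, not a real entry.
#
# - name: Example Semester
#   when: 01/15/25
#   where: |
#     Seminars are held in ROOM at TIME [unless otherwise noted].
#   responsible:
#     name: Jane Doe
#     username: jdoe
#   talks:
#     - when: 02/01/25
#       presenter:
#         url: https://example.org/
#         name: Jane Doe
#         affiliation: Example University
#         bio: |
#           One-paragraph presenter bio.
#       title: |
#         Example Talk Title
#       abstract: |
#         Short abstract; LaTeX snippets supported.
#     - when: 02/08/25
#       departmental: true   # assumed truthy value; the schema only says "if set"
#     - when: 02/15/25
#       special: |
#         <b>No seminar this week</b> (spring break).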
- name: Fall 2024 (Data Systems Seminar)
when: 09/09/24
where: |
Seminars are held on Mondays in CDS 950 at 12:00pm-1:30pm [unless otherwise noted].
responsible:
name: Teona Bagashvili
username: teona
talks:
- when: 09/04/24
presenter:
url: https://kylebd99.github.io/kyle.deeds.github.io/
name: Kyle Deeds
affiliation: University of Washington
bio: |
Kyle Deeds is currently a 5th year PhD student in the database lab at the University of Washington. He is advised by Dan Suciu and Magda Balazinska and interested in the intersection of databases, compilers, and HPC. Broadly, his work aims to make high performance computing accessible to domain experts through declarative programming coupled with automatic program optimization.
title: |
Galley: Modern Query Optimization for Declarative Sparse Tensor Programming <font color ="navy"><b>(12:00pm @ CDS 801)</b></font>
abstract: |
Modern computation has become dominated by the tensor programming abstraction. This framework allows users to write high performance programs for bulk computation via a high-level imperative interface. Recent work has extended this paradigm to sparse tensors (i.e. tensors where most entries are not explicitly represented) with the use of sparse tensor compilers. These systems excel at producing efficient code for computation over sparse tensors, which may be stored in a wide variety of formats. However, they require the user to manually choose the order of operations and the data formats at every step. Unfortunately, these decisions are both highly impactful and complicated, requiring significant effort to manually optimize. In this talk, I’ll present Galley, a system for declarative sparse tensor programming. Galley adapts the database community’s expertise in cost-based optimization to sparse tensor programs. First, it synthesizes a logical plan which determines high level structure and materialization strategy. Then, it lowers each piece of the logical plan to a physical implementation by determining loop orders, tensor formats, and merge algorithms. This plan is then efficiently executed by a sparse tensor compiler. Lastly, I’ll present early experimental results which show how Galley can produce up to 100x speedups over hand-optimized implementations for modestly complex tensor programs.
- when: 09/09/24
presenter:
url:
name: Anwesha Saha/Teona Bagashvili
affiliation: Boston University
bio:
title: |
First meeting of the semester and summer school updates
abstract:
- when: 09/16/24
presenter:
url: https://szeighami.github.io/
name: Sepanta Zeighami
affiliation: UC Berkeley
bio: |
Sepanta Zeighami is a postdoctoral scholar at UC Berkeley. Before that, he obtained his PhD from the University of Southern California. His research is focused on using -- and theoretically understanding -- AI methods for improving data systems, where he has extensively published in top-tier conferences in both AI and database communities, including in SIGMOD, VLDB, ICML, ICLR, etc. His research has been recognized by oral presentation and best paper awards.
title: |
Towards Practical Learned Database Operations with Theoretical Guarantees <font color ="navy"><b>(online @ CDS 801)</b></font>
abstract: |
Machine learning has shown significant practical benefits for performing various database operations such as indexing, cardinality estimation, and sorting. However, the theoretical characteristics of such learned database operations have not been well understood, limiting their applicability in real-world systems. In this talk, I will present our recent work on characterizing the performance benefits of learned database operations. I will show that, in the case of indexing, in static settings or when there is no distribution shift, learned methods can answer queries in O(log log n) expected time, but their performance can drop to O(log n) in the presence of distribution shift. Using insights from this result, I will then discuss our broader analysis framework, distribution learnability, through which we show various novel theoretical results for different learned database operations. Our results provide a better understanding of why and when learned methods outperform non-learned methods (and when they don’t) and take a first step towards practical learned database operations with theoretical guarantees. I will conclude the talk by presenting various open theoretical and practical problems to achieve that goal.
- when: 09/30/24
presenter:
url:
name: Yuanli Wang/Lei Huang
affiliation: Boston University
bio:
title: |
CAPSys: Contention-aware task placement for data stream processing
abstract:
- when: 10/07/24
presenter:
url: https://sites.bu.edu/casp/people/naima-abrar-shami/
name: Naima Abrar
affiliation: Boston University
bio:
title: |
PlatoGL: Effective and Scalable Deep Graph Learning System for Graph-enhanced Real-Time Recommendation
abstract:
- when: 10/15/24
presenter:
url:
name: Muhammad Faisal
affiliation: Boston University
bio:
title: |
Towards a scalable, generic, configurable Multi-Party Computation engine
- when: 10/21/24
presenter:
url: https://aneeshraman.net/
name: Aneesh Raman
affiliation: Boston University
bio:
title: |
Reducing the real storage overhead of learned indexes with wavelet trees
abstract:
- when: 10/28/24
presenter:
url: http://andrew.nerdnetworks.org/
name: Andrew Lamb
affiliation: InfluxDB
bio: Andrew currently works in Rust on InfluxDB 3.0, focused on query processing, the Apache DataFusion query engine and the Apache Arrow ecosystem. He serves on the Apache DataFusion PMC (2024 Chair) and the Apache Arrow PMC (2023 Chair), and actively contributes to the Apache Arrow DataFusion query engine and the Apache Arrow Rust implementation. He earned a BS and MEng in Course VI from MIT. More details are available at http://andrew.nerdnetworks.org/
title: |
Apache DataFusion: Design Choices when Building Modern Analytic Systems
video_url: https://www.youtube.com/watch?v=CpnxuBwHbUc
slides: andrew_lamb_slides.pdf
abstract: |
Andrew will discuss engineering choices made when building Apache DataFusion, an open source and extensible query engine used as the basis of many commercial and open source projects. He will cover rationale for decisions such as using Rust and building on standards such as Arrow and Parquet. He will also discuss DataFusion’s design philosophy of extensible APIs paired with simple default implementations. Finally, he will offer lessons learned and enumerate some things that worked well and what could have been improved.
- when: 10/31/24
presenter:
url: https://itshelenxu.github.io/
name: Helen Xu
affiliation: Georgia Institute of Technology
bio: Helen Xu is an assistant professor at Georgia Tech in the School of Computational Science and Engineering. Previously, she was the Grace Hopper Postdoctoral Scholar at Lawrence Berkeley National Laboratory. She completed her Ph.D. at MIT with Professor Charles E. Leiserson. Her main research interests are in parallel and cache-friendly algorithms and data structures.
title: |
Recent Advances on In-Memory Cache-Friendly Data Structures <font color ="navy"><b>(10am @ CDS 701)</b></font>
video_url: https://www.youtube.com/watch?v=cdq4ebPYUTQ
abstract: |
This talk will discuss optimizing different types of parallel in-memory indexes for improved spatial locality. Specifically, it will overview 1) BP-tree [VLDB 23], a cache-optimized concurrent in-memory index that improves range query performance without giving up on point queries, and 2) CPMA [PPoPP 24], an in-memory batch-parallel data structure built on the Packed Memory Array (PMA) that overcomes the point-range tradeoff.
- when: 11/04/24
presenter:
url:
name: Anwesha Saha
affiliation: Boston University
bio:
title: |
Bayesian Optimization for LSM Tuning
abstract:
- when: 11/19/24
presenter:
url:
name: George Neville-Neil
affiliation: Yale University
bio: George V. Neville-Neil works on networking and operating system code for fun and profit. His areas of interest are computer security, operating systems, networking, time protocols, and the care and feeding of large code bases. He is the author of The Kollected Kode Vicious and co-author with Marshall Kirk McKusick and Robert N. M. Watson of The Design and Implementation of the FreeBSD Operating System. For nearly twenty years he has been the columnist better known as Kode Vicious. Since 2014 he has been an Industrial Visitor at the University of Cambridge where he is involved in several projects relating to computer security. He earned his bachelor’s degree in computer science at Northeastern University in Boston, Massachusetts, and is a member of ACM, the Usenix Association, and IEEE. His software not only runs on Earth but has been deployed, as part of VxWorks, in NASA’s missions to Mars. He is an avid bicyclist and traveler who currently lives in New York City. He is currently a PhD student at Yale University working with Robert Soulé, Avi Silberschatz, and Peter Alvaro.
title: |
OSDB: Exposing the Operating System’s Inner Database <font color ="navy"><b>(12:30pm @ CAS211)</b></font>
abstract: |
Operating systems must provide functionality that closely resembles that of a data management system, but existing query mechanisms are ad-hoc and idiosyncratic. To address this problem, we argue for the adoption of a relational interface to the operating system kernel. While prior work has made similar proposals, our approach is unique in that it allows for incremental adoption over an existing, production-ready operating system. In this paper, we present progress on a prototype system called OSDB that embodies the incremental approach and discuss key aspects of the design, including the data model and concurrency control mechanisms. We present three example use cases: a network usage monitor, a load balancer, and file system checker, as well as experiments that demonstrate low overhead for our approach.
- when: 11/25/24
presenter:
url: https://aneeshraman.net/
name: Aneesh Raman
affiliation: Boston University
bio:
title: |
PhD prospectus dry run
abstract:
- when: 12/09/24
presenter:
url: https://cs-people.bu.edu/zczhu/
name: Zichen Zhu
affiliation: Boston University
bio:
title: |
Designing Instance-Optimal Data Systems in Modern Applications
abstract:
- name: Spring 2024
when: 01/18/24
where: |
Seminars are held on Thursdays in CDS 950 at 1:30pm [unless otherwise notified].
responsible:
name: Jinqi Lu
username: jinqilu
talks:
- when: 02/01/24
presenter:
url: https://kapilvaidya24.github.io/
name: Kapil Vaidya
affiliation: MIT
bio: |
Kapil Vaidya is an Applied Scientist at Amazon Redshift in the Learned Systems Group. He recently graduated from MIT, where he completed his PhD under the guidance of Professor Tim Kraska in the Data Systems Group. His research primarily revolves around the application of machine learning to data systems, with a special focus on the data structures and algorithms integral to these systems. Before that Kapil completed his Bachelors from IIT Bombay.
title: |
Sparse Numerical Array-Based Range Filters (SNARF) <font color ="navy"><b>(12:00pm @ CDS 1101 also part of BU Systems Seminar)</b></font>
abstract: |
We present Sparse Numerical Array-Based Range Filters (SNARF), a learned range filter that efficiently supports range queries for numerical data. SNARF creates a model of the data distribution to map the keys into a bit array which is stored in a compressed form. The model along with the compressed bit array which constitutes SNARF are used to answer membership queries. We evaluate SNARF on multiple synthetic and real-world datasets as a stand-alone filter and by integrating it into RocksDB. For range queries, SNARF provides up to 50x better false positive rate than state-of-the-art range filters, such as SuRF and Rosetta, with the same space usage. We also evaluate SNARF in RocksDB as a filter replacement for filtering requests before they access on-disk data structures. For RocksDB, SNARF can improve the execution time of the system up to 10x compared to SuRF and Rosetta for certain read-only workloads.
- when: 02/15/24
presenter:
url: https://www.lujinqi.com/
name: Jinqi Lu
affiliation: Boston University
bio: |
Jinqi is currently a CS Master's student at the BU DiSC Lab, advised by Prof. Athanassoulis. He possesses a wide range of interests, including Database Systems, Cloud, and Networks. Currently, he is engaged in the relational memory project and also contributes to the testing of computational and zoned storage devices.
title: |
Reading Group: Selection Pushdown in Column Stores using Bit Manipulation Instructions
abstract: |
Modern analytical database systems predominantly rely on column-oriented storage, which offers superior compression efficiency due to the nature of the columnar layout. This compression, however, creates challenges in decoding speed during query processing. Previous research has explored predicate pushdown on encoded values to avoid decoding, but these techniques are restricted to specific encoding schemes and predicates, limiting their practical use. In this paper, we propose a generic predicate pushdown approach that supports arbitrary predicates by leveraging selection pushdown to reduce decoding costs. At the core of our approach is a fast select operator capable of directly extracting selected encoded values without decoding, by using Bit Manipulation Instructions, an instruction set extension to the X86 architecture. We empirically evaluate the proposed techniques in the context of Apache Parquet using both micro-benchmarks and the TPC-H benchmark, and show that our techniques improve the query performance of Parquet by up to one order of magnitude with representative scan queries. Further experimentation using Apache Spark demonstrates speed improvements of up to 5.5X even for end-to-end queries involving complex joins.
- when: 02/21/24
presenter:
url: https://homes.cs.washington.edu/~suciu/
name: Dan Suciu
affiliation: University of Washington
bio: |
Dan Suciu is a Microsoft Endowed Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Suciu is conducting research in data management, on topics such as query optimization, probabilistic data, data pricing, parallel data processing, data security. He is a co-author of two books Data on the Web: from Relations to Semistructured Data and XML, 1999, and Probabilistic Databases, 2011. He received the ACM SIGMOD Codd Innovation Award, received several best paper awards and test of time awards, and is a Fellow of the ACM. Suciu is currently an associate editor for the Journal of the ACM. Suciu’s PhD students Gerome Miklau, Christopher Re and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016 respectively, and Nilesh Dalvi was a runner up in 2008.
title: |
Cardinality Estimation: Achilles’ Heel of Query Optimizers <font color ="navy"><b>(11:00am @ CDS 1750 also part of the CS Distinguished Colloquium)</b></font>
abstract: |
Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The estimator has access to some limited statistics on the database, such as table cardinalities, number of distinct values in various columns, sometimes histograms, and needs to estimate the output size of complex queries, often involving many joins and filter predicates. The cardinality estimator is a critical piece of any query optimizer, and is often the main culprit when the optimizer chooses a poor plan. In this talk I will briefly survey existing cardinality estimation methods, discussing their pros and cons, then I will introduce a new approach to the problem, which is based on computing an upper bound instead of an estimate. The upper bound is given by a mathematical formula, and can use the available database statistics in surprising ways.
- when: 02/22/24
presenter:
url: https://cs-people.bu.edu/teona/
name: Teona Bagashvili
affiliation: Boston University
title: |
Reading Group: ZNS SSDs
abstract: |
TBA
- when: 02/29/24
presenter:
url: https://sites.google.com/view/juhyoungmun/
name: Ju Hyoung Mun
affiliation: Boston University
title: |
Reading Group: Programmable DDRx Controllers
abstract: |
TBA
- when: 03/07/24
presenter:
url: https://cs-people.bu.edu/karatse/
name: Konstantinos Karatsenidis
affiliation: Boston University
title: |
Reading Group: LRU-C: Parallelizing Database I/Os for Flash SSDs
abstract: |
TBA
- when: 03/14/24
presenter:
url: https://gatterbauer.name/
name: Wolfgang Gatterbauer
affiliation: Northeastern University
bio: |
Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University. A major focus of his research is to extend the capabilities of modern data management systems in generic ways and to allow them to support novel functionalities that seem hard at first. He received an NSF Career award and, with his students and collaborators, a best paper award at EDBT 2021, best-of-conference selections for PODS 2021, SIGMOD 2017, WALCOM 2017, and VLDB 2015, and two out of three reproducibility awards for papers published at SIGMOD 2020.
<br/><br/><a href="https://gatterbauer.name">https://gatterbauer.name</a>
title: |
HITSnDIFFs: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property
abstract: |
We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the "truth"), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem "ability discovery" to emphasize the connection to and duality with the more well-studied problem of "truth discovery".
<br/><br/>To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call "HITSNDIFFS" (or HND), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), HND also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that HND produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm.
<br/><br/>Based on joint work with Zixuan Chen, Subhodeep Mitra and R. Ravi
<br/><br/><a href="https://arxiv.org/abs/2401.00013" target=_blank>https://arxiv.org/abs/2401.00013</a>
<br/><a href="https://github.com/northeastern-datalab/HITSnDIFFs/" target=_blank>https://github.com/northeastern-datalab/HITSnDIFFs/</a>
<br/><a href="https://www.youtube.com/watch?v=2pN1Nx7-sd0" target=_blank>https://www.youtube.com/watch?v=2pN1Nx7-sd0</a>
- when: 03/18/24
presenter:
url: https://aamodkh.github.io/
name: Aamod Khatiwada
affiliation: Northeastern University
bio: |
Aamod Khatiwada is a PhD Candidate (Computer Science, Databases) at Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. He is originally from Jhorahat, a town in the southeastern part of Nepal. His primary research focus is to develop novel techniques and algorithms that find and integrate open data tables in a principled way. In his free time, he enjoys watching cricket and soccer, and writing poems, stories, and news articles.
title: |
Table Discovery and Integration in Data Lakes <font color ="navy"><b>(5:00pm - 6:15pm @ HAR 306)</b></font>
abstract: |
Data lakes serve as repositories that house millions of tables covering a wide range of subjects. Data scientists use tables from data lakes to support decision-making processes, train machine learning models, perform statistical analysis, and more. However, the enormous size and heterogeneity of data lakes, together with missing or unreliable metadata, make it challenging for data scientists to find suitable tables for their analyses. Furthermore, after table discovery, another challenge is to integrate the discovered tables together to get a unified view of data, enabling queries that go beyond a single table. In this talk, I will present novel techniques for table discovery and integration tasks. First, I will overview SANTOS [1] which helps data scientists discover tables from a data lake that union with their query table and possibly extend it with more rows. Contrary to existing techniques that only use the semantics of columns to perform union search, we improve the search accuracy by also using semantic relationships between pairs of columns in a table. Second, I will cover ALITE (Align and Integrate) [2], the first proposal for scalable integration of tables that may have been discovered using join, union, or related table search. ALITE aligns the columns of discovered tables using holistic schema matching and then applies Full Disjunction, an associative version of outer join, to get an integrated table. I will also describe DIALITE [3], a novel table discovery pipeline that allows users to discover, integrate, and analyze open data tables. DIALITE is flexible such that the user can easily add and compare additional discovery and integration algorithms.
[1] Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald. SANTOS: Relationship-based Semantic Table Union Search. SIGMOD 2023
[2] Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller. Integrating Data Lake Tables. VLDB 2023.
[3] Aamod Khatiwada, Roee Shraga, Renée J. Miller. DIALITE: Discover, Align and Integrate Open Data Tables. SIGMOD-Companion 2023.
- when: 03/21/24
presenter:
url: https://cs-people.bu.edu/ndhuynh/
name: Andy Huynh
affiliation: Boston University
title: |
Reading Group: Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
- when: 04/04/24
presenter:
url: https://cs-people.bu.edu/zczhu/
name: Zichen Zhu
affiliation: Boston University
bio: |
Zichen Zhu is currently a 5th-year PhD candidate in MiDAS at Boston University, advised by Prof. Manos Athanassoulis. Prior to joining Boston University, he was an MPhil student in the Database Group, Department of Computer Science at the University of Hong Kong, advised by Prof. Reynold Cheng. His research interest focuses on designing workload-aware query optimization under memory constraints.
title: |
Dry Run for Zichen's Prospectus
- when: 04/11/24
presenter:
url: https://cs-people.bu.edu/weiran22/
name: Ran Wei
affiliation: Boston University
title: |
Reading Group: Improving LSM-Tree Based Key-Value Stores With Fine-Grained Compaction Mechanism
- when: 04/11/24
presenter:
url: https://cs-people.bu.edu/aarona/
name: Aaron Ang
affiliation: Boston University
title: |
Reading Group: Automatic SQL Error Mitigation in Oracle
- when: 04/18/24
presenter:
url: https://cs-people.bu.edu/papon/
name: Tarikul Islam Papon
affiliation: Boston University
title: |
Reading Group: WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs
- when: 05/02/24
presenter:
url: https://disc.bu.edu/
name: Tarikul Islam Papon / Zichen Zhu
affiliation: Boston University
title: |
Dry Run for SIGMOD
- when: 05/09/24
presenter:
url: https://ramananeesh.github.io/
name: Aneesh Raman
affiliation: Boston University
title: |
Reading Group: Learned Index with Precise Positions
- when: 05/16/24
presenter:
url: https://midas.bu.edu
name: Yanpeng Wei
affiliation: Boston University
title: |
Reading Group: Cabin: a Compressed Adaptive Binned Scan Index
- name: Fall 2023
when: 09/01/23
where: |
Seminars are held in CDS 950 at 11:30am [unless otherwise notified].
responsible:
name: Zichen Zhu
username: zczhu
talks:
- when: 10/18/23
presenter:
url: https://midas.bu.edu/seminar.html
name: C. Mohan
affiliation: IBM Fellow
bio: |
Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University, a Distinguished Visiting Professor at Tsinghua University in China, and a member of the inaugural Board of Governors of Digital University Kerala. He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2002-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. During the last many years, he focused on Blockchain, AI, Big Data and Cloud technologies (<a href="https://bit.ly/sigBcP">https://bit.ly/sigBcP</a>, <a href="https://bit.ly/CMoTalks">https://bit.ly/CMoTalks</a>). Since 2017, he has been an evangelist of permissioned blockchains and the myth buster of permissionless blockchains. During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. In 2019, he became an Honorary Advisor to the Tamil Nadu e-Governance Agency. In 2020, he joined the Advisory Board of the Kerala Blockchain Academy. Since 2016, Mohan has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. In 2023, he was named Distinguished Professor of Science of Hong Kong Baptist University. In 2021, he was inducted as a member of the inaugural Board of Governors of the new Indian university Digital University Kerala. Mohan has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He has also been a Consultant to the Microsoft Data Team in 2020.
title: |
Systems Design Dilemmas: Simplicity Versus Complexity <font color ="navy"><b>(CDS 17th floor, 11:00am)</b></font>
abstract: |
Based on my 4 decades long involvement in and exposure to numerous database systems, in this talk I will discuss how various factors led multiple systems to be evolved in ways that made them become more complex in terms of their design as well as their usability aspects. While simplicity of design and architecture might be a highly desirable objective, there will invariably be conflicting requirements that lead to complexity creeping in, at least with respect to the internals of such systems. Examples of such requirements relate to the cloud environment, performance, availability, heterogeneity in terms of supported hardware, operating systems and programming language environments, accommodation of multiple workload types, extensibility, etc. It is also the case that aiming for simplicity in one aspect of system design might cause complexity to be introduced in other areas. A case in point is the desire for the near elimination of the need for DBAs and tuning knobs in the cloud environment which makes the internals of such systems become much more complex so that they could dynamically detect and react to different workload types that such systems might be used for. While KISS (Keep It Simple, Stupid) might be a desirable design principle, following it mindlessly might result in the kiss of death with respect to performance and reliability!
- when: 10/18/23
presenter:
url: https://midas.bu.edu/seminar.html
name: C. Mohan
affiliation: IBM Fellow
bio: |
Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University, a Distinguished Visiting Professor at Tsinghua University in China, and a member of the inaugural Board of Governors of Digital University Kerala. He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2002-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. During the last many years, he focused on Blockchain, AI, Big Data and Cloud technologies (<a href="https://bit.ly/sigBcP">https://bit.ly/sigBcP</a>, <a href="https://bit.ly/CMoTalks">https://bit.ly/CMoTalks</a>). Since 2017, he has been an evangelist of permissioned blockchains and the myth buster of permissionless blockchains. During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. In 2019, he became an Honorary Advisor to the Tamil Nadu e-Governance Agency. In 2020, he joined the Advisory Board of the Kerala Blockchain Academy. Since 2016, Mohan has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. In 2023, he was named Distinguished Professor of Science of Hong Kong Baptist University. In 2021, he was inducted as a member of the inaugural Board of Governors of the new Indian university Digital University Kerala. Mohan has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He has also been a Consultant to the Microsoft Data Team in 2020.
title: |
A Technical Retrospect: Q&A with Mohan <font color ="navy"><b>(CDS 364, 2:30pm)</b></font>
abstract: |
In this talk, using my own life as an example, I discuss the excitement of doing research and developing advanced technologies. Starting with my undergraduate days as a chemical engineering student at IIT Madras, I narrate how my academic life changed when I got introduced to an IBM mainframe which led to my ultimately doing a PhD in computer science at University of Texas at Austin. Along the way, I talk about the frustrations of the prolonged hunt for a PhD thesis topic. Then, I trace my professional career at IBM which included working on hot topics as well as what were initially thought by others to be passé topics which turned out to be not so! I deal with the difficulties of getting papers published and doing technology transfer to new products as well as old ones. Having never been a manager during my entire 38.5 years IBM career, even though as an IBM Fellow I had been an IBM executive for 23 years, I bring a unique perspective on what it takes to be on a long-term purely technical career path. I discuss the DOs and DONTs of choosing technical topics to work on, and on being successful in having impact internally and outside one’s employer. Based in general on my extensive travels across the world and in particular on my experience of working as the IBM India Chief Scientist, while based in Bangalore for almost 3 years, I also address issues that arise in being part of a multinational corporation and trying to be successful in doing deeply technical work in a developing country. I also talk about how I am keeping myself active technically and professionally since my retirement from IBM in June 2020.
- when: 11/06/23
presenter:
url: https://cs.uwaterloo.ca/~xiaohu/
name: Xiao Hu
affiliation: University of Waterloo
bio: |
Xiao Hu is an incoming Assistant Professor in the Cheriton School of Computer Science at the University of Waterloo. Prior to that, she was a Visiting Faculty Scholar in the Discrete Algorithm Group at Google Research and a postdoctoral associate in the Department of Computer Science at Duke University. She received her Ph.D. in Computer Science and Engineering from HKUST, and a BE degree in Computer Software from Tsinghua University. Her research has focused on studying fundamental problems in database theory and their implications for practical systems.
title: |
Incremental View Maintenance for Conjunctive Queries
abstract: |
Incremental view maintenance (IVM) is a critical problem in the field of database management systems. It addresses the challenge of efficiently updating materialized views in response to changes in the underlying data, reducing the computational cost and ensuring up-to-date query results. In this context, conjunctive queries, a subset of relational algebra, provide a rich framework for expressing complex queries of both theoretical and practical interest. In this talk, I will introduce a general IVM framework for conjunctive queries, such that views can be maintained efficiently while supporting enumerating full query results as well as delta query results with delay guarantee. I will describe the intrinsic relationship between the sequence of updates and the maintenance cost. At last, I will briefly discuss some interesting questions when aggregation is involved.
- when: 11/30/23
presenter:
url: https://www.cs.cit.tum.de/dis/team/prof-dr-viktor-leis/
name: Viktor Leis
affiliation: Technical University of Munich
bio: |
Viktor Leis is a Professor of Computer Science at the Technical University of Munich (TUM). His research revolves around designing high-performance data management systems and includes core database systems topics such as query processing, query optimization, transaction processing, index structures, and storage. Another major research area is designing cloud-native, cost-efficient information systems. Viktor Leis received his doctoral degree in 2016 from TUM and was a professor at the Friedrich-Schiller-Universität Jena and Friedrich-Alexander-Universität Erlangen-Nürnberg before returning to TUM in 2022. He received the ACM SIGMOD dissertation award, the IEEE TCDE Rising Star Award, best paper awards at IEEE ICDE and ACM SIGMOD, and an ERC Starting Grant.
title: |
LeanStore: A High-Performance Storage Engine for NVMe SSDs and Multi-Core CPUs <font color ="navy"><b>(CAS 313, 12:30pm)</b></font>
abstract: |
The goal of the LeanStore project is to achieve robust performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for larger data sets. To achieve these goals we had to redesign all components of the traditional DBMS stack, including buffer management, indexing, concurrency control, I/O management, logging, recovery, and page replacement. In this talk, I will give an overview of these components and the evolution of the system over the past 6 years.
- when: 12/05/23
presenter:
url: https://www.eecs.tufts.edu/~johes/
name: Johes Bater
affiliation: Tufts University
bio: |
Johes Bater is an assistant professor of Computer Science at Tufts University and co-director of the Tufts Security and Privacy Lab. Before that, he was a postdoctoral researcher in the Database Group at Duke University, received his Ph.D. in computer science from Northwestern University, and completed his B.S. and M.S. in electrical engineering at Stanford University. His research centers on how to balance privacy, security, and utility to build fast, accurate database systems that support privacy-preserving analytics with provable security guarantees.
title: |
Building Systems That Protect Users and Their Data <font color ="navy"><b>(CAS 313, 12:30pm)</b></font>
abstract: |
Simply put, data security and privacy affects every single person in the world. Data breaches, where organizations compromise sensitive user data, are frighteningly common. Current approaches however, fail to properly safeguard security and privacy. Existing encrypted databases introduce data pipelines that leak enough side channel information that data breaches are an inevitability, while cryptographically secure solutions are so slow and cumbersome that they no longer serve their intended purpose. Moreover, many business models - such as targeted advertising - are predicated on their continued ability to analyze and make decisions based on private user data. Without a principled approach to protecting user privacy, it is unclear if this setup is sustainable. To address this problem, we need systems that treat security and privacy as first-class citizens in their system design.
In this talk, I will discuss how to build secure and private database management systems that balance the trade-offs among privacy, performance, and query result accuracy. This work spans numerous disciplines of computer science, including, 1) databases: modeling query workloads to estimate and optimize their privacy properties, runtime, and result accuracy across multiple settings, 2) security: extending secure computation protocols to evaluate SQL statements over the private data of many data owners without revealing data in plaintext, and 3) privacy: rigorously modeling the end-to-end information leakage profile of a privacy-preserving DBMS. The goal of this work is to bring together these disparate research areas to create usable query processing engines with strong privacy and security guarantees.
- when: 12/11/23
special: |
Internal Project Updates
- name: Spring 2023
when: 01/31/23
where: |
Seminars are held in CDS 950 at 11:30am [unless otherwise notified].
responsible:
name: Zichen Zhu
username: zczhu
talks:
- when: 02/10/23
presenter:
url: https://subhadeepsarkar.bitbucket.io/
name: Subhadeep Sarkar
affiliation: Boston University
bio: |
Subhadeep Sarkar is a post-doctoral associate at Boston University.
His current research interests are data systems, storage layouts,
access methods, and their intersection with data privacy. Prior to
this, Subhadeep spent two years as a post-doctoral researcher at
INRIA Rennes, France. He completed his Ph.D. in 2017 from IIT
Kharagpur. Subhadeep’s work appears at leading database and network
conferences and journals, including SIGMOD, VLDB, EDBT, ICDE, IEEE
Transactions on Cloud Computing, and IEEE Transactions on Computers.
In 2016, Subhadeep was selected as one of the most qualified
research scientists to attend the Heidelberg Laureate Forum.
title: |
Building Deletion-Compliant Data Systems
<font color ="navy"><b>(CDS 1001)</b></font>
abstract: |
Modern data systems are designed around the silent underlying
assumption of perpetual data retention. Systems are designed to offer
high performance for data ingestion and queries, while deletion of
data is realized by simply logically invalidating the target entries
(instead of modifying them in place). This has severe implications
on the privacy front, especially in light of the newly enforced
privacy regulations which are aimed toward enabling the users to
express their preferences about the deletion of their data. In this
talk, we present our vision of enabling privacy-through-deletion as
a design element in modern data systems empowering them with the
ability to ensure timely and persistent deletion of data, and we
envision doing so without hurting performance or increasing
operational cost.
- when: 02/17/23
presenter:
url: https://sites.google.com/view/juhyoungmun/
name: Juhyoung Mun
affiliation: Boston University
bio: |
Juhyoung Mun is currently a postdoctoral associate in the
Department of Computer Science at Boston University, working with
Prof. Manos Athanassoulis. She received her Ph.D. in Electronic
and Electrical Engineering at Ewha Womans University in Seoul, Korea.
Her research interest lies in the area of data systems and
networking, particularly, enhancing the lookup performance in
LSM-Trees and building data systems for complex hybrid
transactional/analytical workloads using hardware/software
co-design.
title: |
Relational Memory: Native In-Memory Accesses on Rows and Columns
abstract: |
Analytical database systems are typically designed to use a
column-first data layout to access only the desired fields. On the
other hand, storing data row-first works great for accessing,
inserting, or updating entire rows. Transforming rows to columns
at runtime is expensive, hence, many analytical systems ingest data
in row-first form and transform it in the background to columns to
facilitate future analytical queries. How will this design change
if we can always efficiently access only the desired set of columns?
To address this question, we present a radically new approach to
data transformation from rows to columns. We build upon recent
advancements in embedded platforms with re-programmable logic to
design native in-memory access on rows and columns. Our approach,
termed Relational Memory, relies on an FPGA-based accelerator that
sits between the CPU and main memory and transparently transforms
base data to any group of columns with minimal overhead at runtime.
This design allows accessing any group of columns as if it already
exists in memory. We implement and deploy Relational Memory in real
hardware, and we show that we can access the desired columns up to
1.63x faster than accessing them from their row-wise counterpart,
while matching the performance of a pure columnar access for low
projectivity, and outperforming it by up to 1.87x as projectivity
(and tuple reconstruction cost) increases. Moreover, our approach
can be easily extended to support offloading of a number of
operations to hardware, e.g., selection, group by, aggregation, and
joins, having the potential to vastly simplify the software logic
and accelerate the query execution.
- when: 03/03/23
presenter:
url: https://ramananeesh.github.io/
name: Aneesh Raman
affiliation: Boston University
bio: |
Aneesh Raman is a PhD student at Boston University and a member of
the DiSC Lab, working with Prof. Manos Athanassoulis. His research
focus is on Database systems - indexing and access methods, and his
interests include high performance computing, data mining and machine
learning. Prior to Boston University, he was at Purdue University Fort
Wayne, where he graduated with highest distinction with a Bachelor's
in Computer Science.
title: |
SWARE: Indexing for Near-Sorted Data
abstract: |
Indexing in data systems can be perceived as the process of adding
structure to the incoming, otherwise unsorted data. If the data
ingestion order matches the indexed attribute order, the index
ingestion cost is entirely redundant and can be avoided altogether.
In this work, we identify sortedness as a resource that can accelerate
index ingestion. We propose a new sortedness-aware (SWARE) design
paradigm that pays less indexing cost for near-sorted data. When
applied to a classical index like the B+-tree, SWARE offers a 9x
speedup for write intensive workloads and a 4x benefit in mixed
workloads by exploiting data sortedness.
- when: 03/17/23
presenter:
url: https://cs-people.bu.edu/papon/
name: Tarikul Islam Papon
affiliation: Boston University
bio: |
Papon is a Ph.D. candidate, part of DiSC Lab at BU, advised by Prof.
Athanassoulis. His research interests broadly include Data Systems,
specifically on hardware-aware data management challenges, stemming
from the evolution of storage and memory devices. Currently, he is
working on proposing a new design paradigm for interacting with durable
storage that takes into account performance asymmetry between read and
write operations, as well as the variable access concurrency that
different devices may support. Prior to joining BU, Papon served as a
lecturer for 4 years at the Department of Computer Science & Engineering
(CSE) in Bangladesh University of Engineering & Technology (BUET). He
completed his MSc and BSc degrees from the same department where he
worked on medical informatics using various machine learning and
embedded system techniques.
title: |
ACEing the Bufferpool Management Paradigm for Modern Storage Devices
abstract: |
Over the past few decades, solid-state drives (SSDs) have been
replacing hard disk drives (HDDs) due to their faster reads and writes,
and their superior random access performance. Further, when compared to HDDs,
SSDs have two fundamentally different properties: (i) read/write asymmetry
(writes are slower than reads) and (ii) access concurrency (multiple I/Os can
be executed in parallel to saturate the device bandwidth). However, several
database operators are designed without considering storage asymmetry and
concurrency resulting in device underutilization, which is typically addressed
opportunistically by device-specific tuning during deployment. As a key example
and the focus of our work, the bufferpool management of a Database Management
System (DBMS) is tightly connected to the underlying storage device, yet,
state-of-the-art approaches treat reads and writes equally, and do not
expressly exploit the device concurrency, leading to subpar performance.
In this paper, we propose a new Asymmetry & Concurrency-aware bufferpool
management (ACE) that batches writes based on device concurrency and performs
them in parallel to amortize the asymmetric write cost. In addition, ACE
performs parallel prefetching to exploit the device’s read concurrency. ACE
does not modify the existing bufferpool replacement policy, rather, it is a
wrapper that can be integrated with any replacement policy. We implement ACE in
PostgreSQL and evaluate its benefits using a synthetic benchmark and TPC-C for
several popular eviction policies (Clock Sweep, LRU, CFLRU, LRU-WSR). The ACE
counterparts of all four policies lead to significant performance improvements,
exhibiting up to 32.1% lower runtime for mixed workloads (33.8% for
write-intensive TPC-C transactions) with a negligible increase in total disk
writes and buffer misses, which shows that incorporating asymmetry and
concurrency in algorithm design leads to more faithful storage modeling and,
ultimately, to better device utilization.
- when: 04/14/23
presenter:
url: https://kvombatkere.github.io/
name: Karan Vombatkere
affiliation: Boston University
bio: |
Karan Vombatkere is a 2nd year PhD student in the Massive Data & Algorithms
(MiDAS) research group at BU, advised by Dr. Evimaria Terzi. His research
interests are primarily focused on algorithmic data mining and computational social science problems. His current research is on designing efficient approximation algorithms for submodular optimization problems, specifically the Team Formation problem.
title: |
Balancing Task Coverage and Expert Workload in Team Formation
abstract: |
In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.
- when: 04/18/23
presenter:
url: https://rmarcus.info/blog/
name: Ryan Marcus
affiliation: University of Pennsylvania
bio: |
Ryan Marcus is working on using machine learning to build the next generation
of data management tools that automatically adapt to new hardware and user
workloads, invent novel processing strategies, and understand user intention.
He is especially interested in query optimization, index structures,
intelligent clouds, programming language runtimes, program synthesis for data
processing, and applications of reinforcement learning to systems problems in
general.
title: |
LSI: A Learned Secondary Index Structure <font color ="navy"><b>(MCS B37, 12:30pm)</b></font>
abstract: |
Learned index structures have been shown to achieve favorable lookup performance
and space consumption compared to their traditional counterparts such as B-trees.
However, most learned index studies have focused on the primary indexing setting,
where the base data is sorted. In this work, we investigate whether learned indexes
sustain their advantage in the secondary indexing setting. We introduce Learned
Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted
data. LSI works by building a learned index over a permutation vector, which allows
binary search to be performed on the unsorted base data using random access. We
additionally augment LSI with a fingerprint vector to accelerate equality lookups.
We show that LSI achieves comparable lookup performance to state-of-the-art
secondary indexes while being up to 6× more space efficient.
- when: 04/21/23
presenter:
url: https://itrummer.github.io/
name: Immanuel Trummer
affiliation: Cornell University
bio: |
Immanuel Trummer is an assistant professor at Cornell University. His papers were selected
for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for
publication in CACM as a CACM Research Highlight. His online lecture introducing students to
database topics collected over a million views. He received the NSF CAREER Award and
multiple Google Faculty Research Awards.
title: |
Towards Highly Customizable Database Systems via Generative AI <font color ="navy"><b>(MCS B37, 12:30pm)</b></font>
- when: 04/25/23
presenter:
url: https://scholar.harvard.edu/sanket/home
name: Sanket Purandare
affiliation: Harvard University
bio: |
Sanket Purandare is a 4th year PhD student at the Data and Systems Lab (DASLab) at Harvard
University, advised by Prof. Stratos Idreos. His research interests lie in the area of systems for
deep learning. In particular, he works on the characterization of deep learning training and
inference workloads and optimizing the deep learning software infrastructure for hardware
efficiency by applying systems concepts from operating systems, compilers, and computer
architecture in a novel fashion. Currently, he is a Visiting Researcher with Meta Inc.
(formerly Facebook) and works with the PyTorch Compilers and Distributed Teams on building
the compiler stack for distributed training.
title: |
μ-TWO: Multi-Model Training with Orchestration and Memory Optimization <font color ="navy"><b>(CDS 701, 11:00am)</b></font>
abstract: |
Deep neural networks are becoming ever more useful for numerous applications. However, it is
increasingly challenging to utilize high-quality networks due to massive computational
costs. In fact, computational costs continuously increase due to the need for ever more
complex networks and due to the need to explore numerous network designs during the
development of an AI solution. This leads to several problems, including limited development
of and access to high-quality AI solutions, time-to-market limitations for organizations, and
negative environmental impact caused by increased use of resources. We observe that modern
GPUs, the substrate on which neural networks are developed, are severely underutilized. In fact,
the utilization factor is only on the order of 50% and drops as GPUs get ever faster. Fusing
operators across multiple models and using larger mini-batch sizes are two key methods for
increasing hardware utilization and training throughput. However, the large memory requirements
of training algorithms, coupled with the limited memory capacity of modern GPUs, pose a strict
restriction on the applicability of these methods. In this paper, we introduce μ-two, a new
scheduling algorithm that maximizes GPU utilization by using operator fusion and memory
optimization. The core novelty behind μ-two is its ability to schedule enough independent
computational operations from multiple models to utilize all of the available GPU's compute
capacity while, at the same time, using all compute time to overlap independent memory-swapping
operations. We show how to create μ-two schedules for arbitrary neural network and GPU
architectures. We integrated μ-two within the PyTorch compiler and demonstrate a 3× speed-up with
diverse network architectures and diverse NVIDIA GPUs, spanning vision, natural language
processing, and recommendation applications.
- when: 04/26/23
presenter:
url: https://users.cs.duke.edu/~sudeepa/
name: Sudeepa Roy
affiliation: Duke University
bio: |
Sudeepa Roy is an Associate Professor in Computer Science at Duke University. She works broadly in
data management, with a focus on the foundational aspects of big data analysis, which includes
causality and explanations for big data, data repair, query optimization, probabilistic databases,
and database theory. Before joining Duke in 2015, she did a postdoc at the University of Washington,
and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of the VLDB Early Career
Research Contributions Award, an NSF CAREER Award, and a Google Ph.D. fellowship in structured data.
She is a co-director of the Almost Matching Exactly (AME) lab for interpretable causal inference at
Duke (<a href="https://almost-matching-exactly.github.io/">https://almost-matching-exactly.github.io/</a>).
title: |
Connecting Causal Inference with Databases towards Actionable Data Analysis <font color ="navy"><b>(CDS 701, 11:00am)</b></font>
abstract: |
Causal analysis – i.e., estimating the effect of a "treatment" on an "outcome" – lays the foundation of
sound and robust policy making by providing a means to estimate the impact of a certain intervention on
the world, and is practically indispensable in health, medicine, social sciences, and other domains.
It forms the stepping stone for actionable data analysis that correlation, association, or model-based
prediction analysis cannot provide -- as cited widely in statistical studies, correlation does not imply
causation. In this talk, I will discuss some of our research on connecting traditional database
research with causal inference research from AI and Statistics, and exploring how they can
strengthen each other. First, I will present a "matching" framework for causal inference called "Almost
Matching Exactly", which rivals black box machine learning methods in estimation accuracy but has the
benefit of being interpretable and easier to troubleshoot, and also can use efficient SQL query
evaluations from databases for scalability. Then I will discuss how standard causal inference
techniques on homogeneous populations (single table) can be extended to "relational" (multi-table) data
frequently found in practice. Finally, I will talk about how such "relational causal inference" can be
applied to hypothetical reasoning in database research in terms of "what-if" and "how-to" queries, and
conclude with an overview of our ongoing research directions on actionable and interpretable data
analysis with causality and explanations.
- when: 05/05/23
presenter:
url: https://user.it.uu.se/~geofa117/
name: Georgios Fakas
affiliation: Uppsala University
bio: |
Georgios Fakas is an Associate Professor at the Department of Information Technology, Uppsala
University. He is currently leading the Uppsala DataBase Laboratory (UDBL) within the Department of
Information Technology. Before joining Uppsala University, he worked as a Research Fellow at the Hong
Kong University of Science and Technology (with Prof. Dimitris Papadias), Hong Kong University (with
Prof. Nikos Mamoulis), EPFL (Lausanne, Switzerland), UMIST (Manchester, UK), and the University of Cyprus
(Cyprus). Prof. Georgios Fakas also worked as a Senior Lecturer at Manchester Metropolitan University
(UK). He obtained his Ph.D. in 1998, his M.Phil. in 1996 and his B.Sc. in Computation in 1995; all
from the Department of Computation, UMIST, Manchester, UK.
title: |
Proportionality on Spatial Data with Context
abstract: |
More often than not, spatial objects are associated with some context, in the form of text, descriptive
tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2,
DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but
also the context of the objects, especially when the retrieved objects are too many and the query result
is overwhelming. In this article, we study the problem of selecting a subset of the query result, which
is the most representative. We argue that objects with similar context and nearby locations should
proportionally be represented in the selection. Proportionality dictates the pairwise comparison of all
retrieved objects and hence bears a high cost. We propose novel algorithms which greatly reduce the cost
of proportional object selection in practice. In addition, we propose pre-processing, pruning, and
approximate computation techniques whose combination reduces the computational cost of the algorithms
even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical
studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies
that proportional selection is preferable to random selection and to selection based on object
diversification.
- when: 06/02/23
presenter:
url: https://www.it.uu.se/katalog/tiazh991/
name: Tianru Zhang
affiliation: Uppsala University
bio: |
Tianru Zhang is a PhD student in the Department of Information Technology at Uppsala University. He received
his MSc degree from ENSAI (the National School of Statistics and Information Analysis, France), and his
bachelor's degree from the University of Science and Technology of China. Before joining Uppsala University,
he also worked as an assistant researcher at the Fujitsu R&D Center, Beijing. His research interests
include cloud computing, data management, and federated machine learning.
title: |
Efficient Hierarchical Storage Management Empowered by Reinforcement Learning <font color ="navy"><b>(Online only @ 11:30 AM)</b></font>
abstract: |
With the rapid development of big data and cloud computing, data management has become increasingly
challenging. Over the years, a number of frameworks for data management have become available. Most of them
are highly efficient, but ultimately create data silos. It becomes difficult to move and work coherently
with data as new requirements emerge. A possible solution is to use an intelligent hierarchical
(multi-tier) storage system (HSS). An HSS is a meta-solution that consists of different storage frameworks
organized as a jointly constructed storage pool. A built-in data migration policy that determines the
optimal placement of the datasets in the hierarchy is essential. Making placement decisions is a non-trivial
task, since decisions should be made according to the characteristics of the dataset, the status of the tiers in the hierarchy,
and access patterns. This paper presents an open-source hierarchical storage framework with a dynamic
migration policy based on reinforcement learning (RL). We present a mathematical model, a software
architecture, and implementations based on both simulations and a live cloud-based environment. We compare
the proposed RL-based strategy to a baseline of three rule-based policies, showing that the RL-based policy
achieves significantly higher efficiency and optimal data distribution in different scenarios.
- name: Fall 2022
when: 09/02/22
where: |
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
responsible:
name: Andy Huynh
username: ndhuynh
talks:
- when: 10/07/22
presenter:
url: https://mhmdsaiid.github.io/
name: Mohammed Saeed
affiliation: EURECOM
bio: |
Mohammed Saeed obtained a double engineering diploma from the Lebanese
University and Télécom Paristech. Since 2019, he has been a PhD student in
Computational Fact Checking at EURECOM (France), under the supervision
of Prof. Paolo Papotti. His research spans methods across Natural
Language Processing, Knowledge Graphs, and Deep Learning.
title: |
Explainable Checking for Factual Claims
abstract: |
Misinformation is an important problem but fact checkers are overwhelmed
by the amount of false content that is produced online every day. To
support fact checkers in their efforts, we are creating data-driven
verification methods that use data resources to assess claims and
explain their decisions. However, using structured data for facts
expressed in natural language is a complex task. In this talk, we
present a first solution that jointly learns representations of
structured and non-structured data. We translate text claims into SQL
queries on relational databases by exploiting text classifiers and
tentative execution of query candidates to narrow down the set of
alternatives. We then discuss how logical rules can be used to verify
claims in a checking task modeled as a probabilistic inference problem.
We show that this form of reasoning can be “learned” by
transformer-based language models. Experiments show that both methods
enable the efficient and effective labeling of claims with interpretable
explanations.
- name: Spring 2022
when: 02/02/22
where: |
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
responsible:
name: Andy Huynh
username: ndhuynh
talks:
- when: 02/11/22
presenter:
url: https://homepages.inf.ed.ac.uk/scohen/
name: Shay Cohen
affiliation: University of Edinburgh
bio: |
Shay Cohen is a Reader at the University of Edinburgh (School of
Informatics). Prior to this, he was a postdoctoral research scientist
in the Department of Computer Science at Columbia University, and held
an NSF/CRA Computing Innovation Fellowship. He received his B.Sc. and
M.Sc. from Tel Aviv University in 2000 and 2004, and his Ph.D. from
Carnegie Mellon University in 2011. His research interests span a
range of topics in natural language processing and machine learning,
with a focus on structured prediction (for example, parsing) and text
generation.
title: |
Document Summarisation: Modelling, Datasets and Verification of Content
abstract: |
Within Natural Language Processing, document summarisation is one of the
central problems. It has both short-term societal implications and
long-term implications in terms of the success of AI. I will describe
advances made in this area with respect to three different aspects:
methodology and modelling, dataset development and enforcing factuality
of summaries. In relation to modelling, I will show how reinforcement
learning can be used to directly maximise the metric by which the
summaries are being evaluated. With regards to dataset development, I
will describe a dataset that we released for summarisation, XSum, in
which a single sentence is used to describe the content of a whole
article. The dataset has become a standard benchmark for summarisation.
Finally, in relation to factuality, I will show how one can improve the
quantitative factuality of summaries by re-ranking them in a beam based
on a "verification" model.
- when: 03/11/22
presenter:
url: https://prashantpandey.github.io/
name: Prashant Pandey
affiliation: VMware
bio: |
Pandey is a Research Scientist at VMware Research. Previously, he did
postdocs at the University of California, Berkeley, and at Carnegie Mellon
University. He obtained his Ph.D. in Computer Science at Stony Brook
University in December 2018. His goal as a researcher is to advance
the theory and practice of resource-efficient data structures and
employ them to democratize complex and large-scale data analyses. He
designs and builds tools for large-scale data management problems
across computational biology, stream processing, and storage. He is
also the main contributor and maintainer of multiple open-source
software tools that are used by hundreds of users across academia and
industry. He won the prestigious Catacosinos Fellowship in 2018 and a
Best Paper award at FAST 2016, and was runner-up for Best Paper at FAST
2015. His work has appeared at top conferences and journals, such as
SIGMOD, FAST, SPAA, ESA, RECOMB, ISMB, Genome Biology, and Cell
Systems.
title: |
Data Systems at Scale: Scaling Up by Scaling Down and Out (to Disk)
abstract: |
Our ability to generate, acquire, and store data has grown exponentially
over the past decade making the scalability of data systems a major
challenge. This talk presents my work on two techniques to tackle the
scalability challenge: scaling down, i.e., shrinking data to fit in RAM,
and scaling out to disk, i.e., organizing data on disk so that the
application can still run fast. I will describe new compact and
I/O-efficient data structures and their applications in computational
biology, stream processing, and storage.
In computational biology, my work shows how to shrink genomic and
transcriptomic indexes by a factor of two while accelerating queries by
an order of magnitude compared to the state-of-the-art tools. In stream
processing, my work bridges the gap between the worlds of external
memory and stream processing to perform scalable and precise real-time
event-detection on massive streams. In file systems, my work improves
file-system random-write performance by an order of magnitude without
sacrificing sequential read/write performance.
- when: 03/18/22
presenter:
url: https://www.linkedin.com/in/ippokratis-pandis-03b1402/
name: Ippokratis Pandis
affiliation: Amazon AWS
bio: |
Ippokratis Pandis is a senior principal engineer at Amazon Web Services, currently working in Amazon Redshift.
Redshift is Amazon's fully managed, petabyte-scale data warehouse service. Previously, Ippokratis held
positions as a software engineer at Cloudera, where he worked on the Impala SQL-on-Hadoop query engine, and as a
member of the research staff at the IBM Almaden Research Center, where he worked on IBM DB2 BLU.
Ippokratis received his PhD from the Electrical and Computer Engineering department at Carnegie Mellon
University. He is the recipient of Best Demonstration awards at ICDE 2006 and SIGMOD 2011, and a Test-of-Time
award at EDBT 2019. He has served or is serving as PC chair of DaMoN 2014, DaMoN 2015, CloudDM 2016, HPTS 2019,
and ICDE Industrial 2022, as well as General Chair of SIGMOD 2023.
title: |
Amazon Redshift Reinvented <font color ="navy"><b>(talk @ 12:00 PM)</b></font>
abstract: |
In 2013, Amazon Web Services revolutionized the data warehousing industry by launching Amazon
Redshift, the first fully managed, petabyte-scale cloud data warehouse solution. Amazon Redshift made it simple
and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This
launch was a significant leap from the traditional on-premise data warehousing solutions which were expensive,
rigid (not elastic), and needed a lot of tribal knowledge to perform. Unsurprisingly, customers embraced Amazon
Redshift and it went on to become the fastest growing service in AWS. Today, tens of thousands of customers use
Amazon Redshift in AWS's global infrastructure of 25 launched Regions and 81 Availability Zones (AZs) to process
Exabytes of data daily.
The success of Amazon Redshift inspired a lot of innovation in the analytics industry
which in turn has benefited consumers. In the last few years, the use cases for Amazon Redshift have evolved and
in response, Amazon Redshift has delivered a series of innovations that continue to delight customers. In this
talk, we take a peek under the hood of Amazon Redshift, and give an overview of its architecture. We focus on
the core of the system and explain how Amazon Redshift maintains its differentiating industry-leading
performance and scalability. We discuss how Amazon Redshift extends beyond traditional data warehousing
workloads by integrating with the broader AWS ecosystem, making Amazon Redshift a one-stop solution for
analytics. We then talk about Amazon Redshift’s autonomics and Amazon Redshift Serverless. In particular, we
present how Redshift continuously monitors the system and uses machine learning to improve its performance and
operational health without the need for dedicated administration resources, in an easy-to-use offering.
- when: 04/22/22
presenter:
url: https://www.linkedin.com/in/venusatuluri/
name: Venu Satuluri
affiliation: Twitter
bio: |
Venu Satuluri is a Principal ML Engineer at Twitter. He has been
working on a variety of recommender systems and applied ML problems
over the last 10 years at Twitter. Venu received his Ph.D. focusing on
graph clustering and similarity search from the Ohio State University
in 2012.
title: |
SimClusters: Community-Based Representations for Heterogeneous
Recommendations at Twitter
<font color ="navy"><b>(talk @ 11:00 AM)</b></font>
abstract: |
Personalized recommendation products at Twitter target a variety of
items: Tweets, Events, Topics, Hashtags, and users. Each of these targets
varies in its cardinality (which affects the scale of the problem) and
their shelf life (which constrains the latency of generating the
recommendations). Although Twitter has built a variety of recommendation
systems over the past decade, solutions to the broader problem
were mostly tackled piecemeal. In this work, we present SimClusters, a
general-purpose representation layer based on overlapping communities
into which users as well as heterogeneous content can be captured as
sparse, interpretable vectors to support a multitude of recommendation
tasks. We propose a novel algorithm for community discovery based on
Metropolis-Hastings sampling, which is both more accurate and
significantly faster than off-the-shelf alternatives. SimClusters scales
to networks with billions of users and has been effective across a
variety of deployed applications at Twitter.
- when: 05/06/22
presenter:
url: https://raulcastrofernandez.com/
name: Raul Castro Fernandez
affiliation: University of Chicago
bio: |
I am interested in understanding the economics and value of data,
including the potential of data markets to unlock that value. The goal
of my research is to understand how to make the best use of data
possible. For that, I often build systems to share, discover, prepare,
integrate, and process data. I often use techniques from data
management, statistics, and machine learning. I am an assistant
professor in the Computer Science department at the University of
Chicago. Before UChicago, I did a postdoc at MIT with Sam Madden and
Mike Stonebraker. And before that, I completed my PhD at Imperial
College London with Peter Pietzuch.
title: |
Data Station: Delegated, Trustworthy, and Auditable Computation to
Enable Data-Sharing Consortia with a Data Escrow
abstract: |
Pooling and sharing data increases and distributes its value. But since
data cannot be revoked once shared, scenarios that require controlled
release of data for regulatory, privacy, and legal reasons default to
not sharing. Because selectively controlling what data to release is
difficult, the few data-sharing consortia that exist are often built
around data-sharing agreements resulting from long and tedious one-off
negotiations.
In this talk, I will present Data Station, a data escrow designed to
enable the formation of data-sharing consortia. Data owners share data
with the escrow knowing it will not be released without their consent.
Data users delegate their computation to the escrow. The data escrow
relies on delegated computation to execute queries without releasing the
data first. The Data Station leverages hardware enclaves to generate
trust among participants and exploits the centralization of data and
computation to generate an audit log.
Finally, I will contextualize the design and implementation of the Data Station
in a larger research agenda that explores “the value of data”, the “economics of
data”, and “data markets”.
- when: 05/13/22
presenter:
url: https://www.eurecom.fr/~appuswam/
name: Raja Appuswamy
affiliation: EURECOM
bio: |
Raja Appuswamy is an Assistant Professor in the Data Science
department at EURECOM--a Grande École located in the sunny Sophia
Antipolis tech-valley of southern France. Previously, he was a
Researcher and Visiting Professor at EPFL, Switzerland; a Visiting
Researcher in the Systems and Networking group at Microsoft Research,
Cambridge; and a Software Development Engineer in the Windows 7
kernel team at Microsoft, Redmond.
He received his Ph.D. in Computer Science from the Vrije Universiteit, Amsterdam,
where he worked under the guidance of Prof. Andrew S. Tanenbaum on designing and
implementing a new storage stack for the MINIX 3 microkernel operating system.