Pixels 0.1.0

Latest
@bianhq bianhq released this 04 Aug 09:40
· 132 commits to master since this release
e5342d7

Main Features

  • Optimized columnar storage on HDFS, S3, MinIO, local FS, GCS, and Redis. Significantly outperforms Parquet and ORC.
  • A distributed in-memory columnar cache that further improves I/O performance for data analytics.
  • External query engine integrations for Trino, Presto, Hive, and DuckDB. Significantly improves query performance in these query engines.
  • Metadata (database schema and data catalog) management for data lakes and warehouses.
  • Internal query accelerator (Pixels-Turbo) in serverless computing environments, including AWS Lambda and vHive (K8S+Knative+Firecracker).
  • REST query API that exposes Pixels as a serverless analytics service for external users.

Release Notes

What's Changed

  • debug hive by @bianhq in #1
  • make pixels-hive usable by @bianhq in #7
  • Finish pixels-hive by @bianhq in #8
  • refine docs of pixels-hive. by @bianhq in #9
  • refine pixels-hive doc and fix pixels-load. by @bianhq in #10
  • Refine code and docs. by @bianhq in #11
  • add comments by @bianhq in #18
  • Rename the prefix of packages to io.pixelsdb by @ray6080 in #22
  • [Issue 24]: add balancers for pixels cache. by @bianhq in #25
  • [Issue #24]: move file if it is not local by @bianhq in #28
  • HOTFIX: upgrade fastjson to version 1.2.58 for security by @bianhq in #29
  • [Issue #30]: make NUMA interleaved. by @bianhq in #31
  • [Issue #30]: add start-vmtouch and make it NUMA interleaved. by @bianhq in #32
  • [HOTFIX]: remove mysql-connector from dependencies due to license conflicts. by @bianhq in #33
  • [HOTFIX]: remove or upgrade insecure packages from dependencies. by @bianhq in #34
  • [HOTFIX]: rollback jackson to 2.8.1. by @bianhq in #35
  • [Issue #36]: add copyright and license notice. by @bianhq in #37
  • Revert "[Issue #36]: add copyright and licence notice." by @bianhq in #38
  • [Issue #36]: add copyright and license notice. by @bianhq in #39
  • [Issue #36]: fix copyright and licence notice. by @bianhq in #40
  • [Issue #36]: rename LICENCE file. by @bianhq in #41
  • [Issue #36]: update NOTICE. by @bianhq in #42
  • HOTFIX: import external jars when starting pixels. by @bianhq in #43
  • HOTFIX: hive docs. by @bianhq in #45
  • [Issue #47]: using different lock files for coordinator and datanode daemons. by @bianhq in #48
  • [Issue #49] fix bugs in VarcharArrayBlock by @bianhq in #50
  • [Issue #52]: implement direct shared memory read. by @bianhq in #53
  • [Issue #52]: reduce memory copy in BooleanColumnReader and IntegerColumnReader. by @bianhq in #54
  • [Issue #55]: reduce memory copy in column readers. by @bianhq in #56
  • [HOTFIX]: bug in DynamicIntArray.toArray() and redundant memory copy. by @bianhq in #57
  • [Issue #58]: add close method to resources. by @bianhq in #59
  • Issue #58: add gc threshold to optimize gc for small queries. by @bianhq in #62
  • [Issue #44]: implement Etcd metadata store. by @bianhq in #65
  • [Issue #67] implement cache read / write coordination. by @bianhq in #68
  • [Issue #67] implement three-phase cache update. by @bianhq in #69
  • Hotfix: fix bugs in CacheWriter initialization and cache update. by @bianhq in #70
  • HOTFIX: clean unused exceptions. by @bianhq in #71
  • [Issue #72]: optimize memory allocation and access in PixelsCacheReader. by @bianhq in #73
  • [Issue #72]: fix bugs in pixels-cache and implement loading radix from index file. by @bianhq in #77
  • [Issue #67]: Implement cache read lease and optimize read performance. by @bianhq in #79
  • [Issue #78]: avoid cache probing on uncached tables in Presto. by @bianhq in #80
  • [Issue #78]: avoid cache probing on uncached tables in Hive. by @bianhq in #81
  • [Issue #83]: implement JIT splitting for ordered path. by @bianhq in #84
  • [Issue #85]: fix list table error when schema is empty. by @bianhq in #86
  • [Issue #87]: remove explicit gc from PixelsReader. by @bianhq in #89
  • [Issue #88]: fix MAX_READER_COUNT. by @bianhq in #90
  • [Issue #91]: use three bytes for cache reader count. by @bianhq in #92
  • [Issue #94]: Support Date and Time types. by @bianhq in #95
  • [Issue #98]: fix insert/update related metadata service and getLayout. by @bianhq in #101
  • [Issue #99]: fix null value storage. by @bianhq in #102
  • [Issue #103]: fix and enhance predicate processing. by @bianhq in #104
  • [Issue #105]: fix endless execution for select count(*). by @bianhq in #106
  • [Issue #100]: refine type management and add varchar/char support. by @bianhq in #107
  • [Issue #108]: replace hdfs FileSystem api with the unified Storage api. by @bianhq in #110
  • [Issue #108]: implement LocalFS and global auto-increment id. by @bianhq in #112
  • [Issue #113]: collect the cumulative memory usage in pixels record reader. by @bianhq in #116
  • [Issue #115]: replace message queue implementation. by @bianhq in #117
  • [Issue #114]: support S3 storage and asynchronous I/O scheduling. by @bianhq in #118
  • [Issue #120]: support configurable S3 clients. by @bianhq in #122
  • [Issue #121]: fix listing objects. by @bianhq in #123
  • [Issue #124]: refine read path. by @bianhq in #125
  • [Issue #126]: add rate limit and request retry policy. by @bianhq in #127
  • [Issue #128]: implement request diversion and refine java package layout. by @bianhq in #130
  • [Issue #131]: implement projections for compact layout. by @bianhq in #134
  • [Issue #133]: fix and refine retry policy. by @bianhq in #135
  • [Issue #136]: refine thread factory for async read using sync client. by @bianhq in #137
  • [Issue #132]: upgrade supported Presto version from 0.192 to 0.215. by @bianhq in #138
  • [Issue #142]: fix mbps rate-limit. by @bianhq in #143
  • [Issue #145]: fix date type for Presto-0.215. by @bianhq in #148
  • [Issue #149]: fix configuration and dependency. by @bianhq in #150
  • [Issue #153]: add adaptive reading method. by @bianhq in #154
  • [Issue #144]: fix scripts and finish docs. by @bianhq in #157
  • [Issue #156]: fix empty schema. by @bianhq in #159
  • [Issue #158]: support loading data from and to arbitrary storage. by @bianhq in #161
  • [Issue #160]: compact from and to arbitrary storage, including tail files. by @bianhq in #162
  • [Issue #163]: fix bounded varchar/char type support. by @bianhq in #165
  • [Issue #164]: define storage scheme in CREATE statement. by @bianhq in #166
  • [Issue #167]: fix retained size calculation of VarcharArrayBlock. by @bianhq in #168
  • [Issue #169]: add session properties about layout-path enabling to pixels-presto. by @bianhq in #171
  • [Issue #172]: implement record cursor and enhance record reader. by @bianhq in #173
  • [Issue #174]: implement transaction server and pass query (trans) id into record reader. by @bianhq in #176
  • [Issue #175]: enhance transaction info of queries and pass query id into I/O schedulers. by @bianhq in #177
  • [Issue #179]: fix show tables from information_schema. by @bianhq in #180
  • [Issue #178]: enable metrics server by configuration parameter. by @bianhq in #182
  • [Issue #183]: fix show columns from information_schema. by @bianhq in #184
  • [Issue #181]: fix start and stop of pixels-daemon. by @bianhq in #185
  • [Issue #186]: skip cache initialization if cache is disabled. by @bianhq in #187
  • [Issue #139]: stop retrying the request from terminated queries. by @bianhq in #188
  • [Issue #190]: fix data copying between S3 buckets. by @bianhq in #191
  • [Issue #194]: support views. by @bianhq in #195
  • [Issue #196]: refine type management and support decimal. by @bianhq in #197
  • [Issue #196]: fix DecimalColumnVector and use decimal in TPC-H schema. by @bianhq in #200
  • [Issue #192]: support multi-thread compaction. by @bianhq in #201
  • [Issue #192]: refine compact, S3, PixelsCompactor, and support multi-thread copying. by @bianhq in #202
  • [Issue #205]: fix integer overflow in request merging. by @bianhq in #206
  • [Issue #210]: some minor improvements. by @xxchan in #204
  • [Issue #208]: fix column statistics for decimal. by @bianhq in #211
  • [Issue #207]: fix the data type metadata in the file footer. by @bianhq in #212
  • [Issue #209]: clean code. by @bianhq in #213
  • [Issue #189]: support folders on S3. by @bianhq in #215
  • [Issue #189]: revise docs and scripts. by @bianhq in #216
  • [Issue #217]: split presto and hive integrations into sub-projects. by @bianhq in #218
  • [Issue #220] revise license, readme, and pom. by @bianhq in #221
  • [Issue #222]: disable mock file locations for storage systems that do not provide data locality. by @bianhq in #223
  • [Issue #170]: implemented scan operator. by @TiannanSha in #229
  • [Issue #224]: clean and refine cache key and cache entry implementation. by @bianhq in #230
  • [Issue #231]: update docs and comments. by @bianhq in #232
  • [Issue #225]: upgrade to Hadoop 3.3.1 and clean dependencies. by @bianhq in #234
  • [Issue #231]: enable storage schemes in configuration. by @bianhq in #235
  • [Issue #233]: fix log4j configurations. by @bianhq in #236
  • [Issue #193]: revise the README under modules. by @bianhq in #237
  • [Issue #170]: fix dependencies and logging for pixels-lambda. by @bianhq in #243
  • [Issue #238]: Add a script to install pixels by @xxchan in #239
  • [Issue #170]: implement filter. by @TiannanSha in #246
  • [Issue #170]: clean the unused files and reformat the code. by @bianhq in #247
  • [Issue #170]: update poms for pixels-lambda. by @bianhq in #248
  • [Issue #249]: remove InvalidActivityException. by @bianhq in #250
  • [Issue #245]: support reading remote config file. by @bianhq in #252
  • [Issue #170]: optimize Pixels S3 writer and lambda. by @bianhq in #253
  • [Issue #170]: implement table scan filter and refine scan worker. by @bianhq in #254
  • [Issue #170]: support direct write back to on-premise minio. by @bianhq in #255
  • [Issue #170]: support lambda scan. by @bianhq in #256
  • [Issue #170]: fix the discrete filter for string-based columns. by @bianhq in #257
  • [Issue #170]: fix S3 folder deletion. by @bianhq in #259
  • HOTFIX: refine comments. by @bianhq in #260
  • [Issue #261]: move table scan predicates into pixels-executor. by @bianhq in #262
  • [Issue #170]: implement hash partitioned join. by @bianhq in #263
  • [Issue #170]: implement broadcast join. by @bianhq in #264
  • [Issue #265]: fix reading row group number. by @bianhq in #266
  • [Issue #268]: fix null value check for join. by @bianhq in #272
  • [Issue #271]: improve discrete column filter. by @bianhq in #273
  • [Issue #270]: support full outer join. by @bianhq in #275
  • [Issue #170]: enhance joins and implement join tree executor. by @bianhq in #281
  • [Issue #170]: support join endian. by @bianhq in #282
  • [Issue #170]: disable left full outer broadcast join. by @bianhq in #283
  • [Issue #170]: fix join input splits generation, refine join inputs and join operator. by @bianhq in #284
  • [Issue #170]: fix and refine join inputs and join workers. by @bianhq in #285
  • [Issue #258]: implement table and column statistics. by @bianhq in #287
  • [Issue #258]: add join advisor and fix join execution. by @bianhq in #288
  • [Issue #258]: support multi-pipeline join and fix bugs. by @bianhq in #289
  • [Issue #258]: implement partitioned chain join. by @bianhq in #290
  • [Issue #258]: implement split size capping. by @bianhq in #291
  • [Issue #170]: add invoker factory and get worker name from config file. by @bianhq in #292
  • [Issue #170]: implement work exception handling and join output collection. by @bianhq in #293
  • [Issue #294]: fix blocking splits when lambda scan is enabled. by @bianhq in #295
  • [Issue #203]: support long decimal with 38 max digit precision and scale. by @bianhq in #296
  • [Issue #297]: fix timestamp type, stat recorders, null value filtering, and pixels-load. by @bianhq in #298
  • [Issue #258]: upgrade Prometheus dependencies. by @josephhany in #300
  • [Issue #170] implement aggregation execution. by @bianhq in #301
  • [Issue #170]: fix null-pointer in scan worker. by @bianhq in #302
  • [Issue #170]: fix column stats recorders and split size capping. by @bianhq in #303
  • [Issue #305]: support deleting more than 1000 files from S3. by @bianhq in #306
  • [Issue #170]: support scan projection in scan worker. by @bianhq in #307
  • [Issue #170]: implement min/max in column stats in metadata. by @bianhq in #308
  • [Issue #170]: fix aggregation worker. by @bianhq in #309
  • [Issue #170]: improve multi-thread copying. by @bianhq in #310
  • [Issue #170] add metadata cache and cost-based splits index. by @bianhq in #311
  • [Issue #170]: merge outputs in lambda worker. by @bianhq in #312
  • [Issue #170]: fix loading path. by @bianhq in #313
  • [Issue #170]: add row count broadcast threshold. by @bianhq in #314
  • [Issue #170]: join optimization for very large datasets. by @bianhq in #315
  • [Issue #170]: optimizations for large joins. by @bianhq in #316
  • [Issue #170]: optimizing hash functions. by @bianhq in #317
  • [Issue #170]: implement partition projection. by @bianhq in #318
  • [Issue #170]: improve execution pipeline. by @bianhq in #319
  • [Issue #170]: improve join algorithm selection and broadcast split size adjustment. by @bianhq in #321
  • [Issue #170]: add script to run before each new instance became in-service. by @TiannanSha in #322
  • [Issue #170]: using multi-thread for column encoding in Pixels writer. by @bianhq in #323
  • [Issue #170]: update spot scripts. by @bianhq in #324
  • [Issue #170] collect performance metrics from serverless workers. by @bianhq in #325
  • [Issue #170]: add trans concurrency and GC monitor, and tune log level. by @bianhq in #326
  • [Issue #170]: update aggregation plan and spot vm user data. by @bianhq in #327
  • [Issue #170]: improve get num partitions. by @bianhq in #329
  • [Issue #170]: remove existence check from workers. by @bianhq in #330
  • [Issue #214]: implement multi-thread S3 output stream. by @bianhq in #331
  • [Issue #214]: enable retry policy in S3OutputStream. by @bianhq in #332
  • [Issue #170]: improve dictionary encoding and metrics collection. by @bianhq in #333
  • [Issue #170]: add starling executor and fix the inputs of multi-pipeline broadcast join. by @bianhq in #334
  • [Issue #170] fix hang in partitioned join worker. by @bianhq in #335
  • [Issue #170]: optimize file existence checking in getFileSchema. by @bianhq in #336
  • [Issue #170]: support Redis storage. by @bianhq in #337
  • [Issue #170]: support default user in Redis. by @bianhq in #338
  • [Issue #170]: fix null value processing in aggregation. by @bianhq in #339
  • [Issue #170]: improve string comparison and aggregation. by @bianhq in #340
  • [Issue #170]: fix empty file problem in aggregation. by @bianhq in #341
  • [Issue #170]: add partitioning to aggregation and implement starling aggregation. by @bianhq in #342
  • [Issue #170]: add null fraction and cardinality statistics into pixels-load. by @bianhq in #343
  • [Issue #170]: support cardinality estimation for aggregation. by @bianhq in #344
  • [Issue #345]: fix double start of retry policy. by @bianhq in #346
  • [Issue #347]: reconnect to S3 when fail to get object. by @bianhq in #348
  • [Issue #170]: support count aggregation. by @bianhq in #349
  • [Issue #350]: remove request division. by @bianhq in #351
  • [Issue #352]: support google cloud storage. by @bianhq in #353
  • pixels partitioned cache protocol by @Yeeef in #355
  • [Issue #357]: fix compilation and dependency problem. by @bianhq in #358
  • [Issue #357] clean unused files. by @bianhq in #359
  • [Issue #170]: move the code related to query planning to pixels-optimizer. by @bianhq in #360
  • [Issue #170]: refine query queues. by @bianhq in #361
  • [Issue #357]: update readme for pixels-trino. by @bianhq in #362
  • [Issue #357]: refine auto scaling metrics. by @bianhq in #363
  • [Issue #357]: update metrics collector. by @bianhq in #364
  • [Issue #365]: fix statistics collection. by @bianhq in #366
  • [Issue #367]: fix BinaryColumnVector for dictionary encoding. by @bianhq in #368
  • [Issue #369]: implement dictionary-encoded column vector. by @bianhq in #370
  • [Issue #371]: update metadata for view creation in Trino. by @bianhq in #372
  • [Issue #373]: support direct read on localFS. by @bianhq in #375
  • [Issue #374]: fix the column vectors. by @bianhq in #376
  • [Issue #377]: fix ByteBufferInputStream. by @bianhq in #378
  • [Issue #379]: support configurable direct/non-direct I/O in LocalFS. by @bianhq in #382
  • [Issue #380]: rename pixels-optimizer to pixels-planner. by @bianhq in #383
  • [Issue #381]: move out pixels-load and pixels-tools. by @bianhq in #384
  • [Issue #386]: fix getRowNumber in PixelsRecordReaderImpl. by @bianhq in #387
  • [Issue #385]: refine docs and some comments, and fix timestamp format for AWS CloudWatch metrics. by @bianhq in #389
  • [Issue #388]: fix encoded column vector reading. by @bianhq in #390
  • [Issue #394] fix non-encoded integer column reading. by @bianhq in #395
  • [Issue #393] support mmap in local file systems. by @bianhq in #396
  • [Issue #397] update install.sh. by @bianhq in #398
  • [Issue #399] update docs and support listing the paths and statuses of the files in multiple directories. by @bianhq in #400
  • [Issue #401] support different input and output storage scheme in compactor and create parent dir automatically for local fs. by @bianhq in #402
  • [Issue #403] support async read on local fs. by @bianhq in #404
  • [Issue #405] refine configuration properties. by @bianhq in #406
  • [Issue #405] fix comments. by @bianhq in #407
  • [Issue #409]: add an http server that provides restful api. by @bianhq in #411
  • [Issue #408]: improve exception handling in pixels-daemon. by @bianhq in #412
  • [Issue #410] implement the REST API for SQL execution. by @bianhq in #413
  • [Issue #414] pixels reads replicated content on string column after the first row batch, when the column is not encoded. by @yuly16 in #415
  • [Issue #417] the data in integer and long column reader is aligned by @yuly16 in #418
  • [Issue #416] build pixels-turbo, split storage adapters, cf invokers, and scaling handlers into separate modules by @bianhq in #420
  • [Issue #421] init pixels-proxy. by @bianhq in #425
  • [Issue #419] fix timezone offset for date column. by @bianhq in #422
  • [pixels-cli] fix session properties for stat. by @bianhq in #427
  • [Issue #421] update docs and clean Date related code. by @bianhq in #428
  • [Issue #421] implement basic pixels-proxy. by @bianhq in #430
  • [Issue #429] upgrade Prometheus and exporters. by @bianhq in #432
  • [Issue #433] prepare for vhive integrations. by @bianhq in #434
  • [Issue #431] update docs, comments, and error print. by @bianhq in #440
  • [Issue #441] add grpc example into pixels-server. by @bianhq in #442
  • [Issue #443] add add-opens into manifest. by @bianhq in #444
  • [docs] split the main readme into docs. by @bianhq in #445
  • [docs] fix links. by @bianhq in #446
  • [docs] fix build instructions. by @bianhq in #447
  • [docs] refine readme. by @bianhq in #448
  • [docs] refine docs for Pixels Turbo. by @bianhq in #449
  • [Issue #450] fix lambda invoker unit tests and minor problems in lambda workers. by @bianhq in #451
  • [Issue #452] enable reading other storages than s3 in serverless workers. by @bianhq in #453
  • [docs] update pixels-turbo settings. by @bianhq in #454
  • SQLglot transpile integration by @voidforall in #455
  • Finish pixels parser by @voidforall in #456
  • [Issue #431] refine transaction protocol. by @bianhq in #458
  • [Issue #431] update docs and add file headers. by @bianhq in #461
  • [Issue #459] update row count automatically. by @bianhq in #462
  • [Issue #423] pixels vhive invoker and worker. by @zhaoshihan in #463
  • [Issue #431] finish pixels query server. by @bianhq in #464
  • [Issue #423] refine docs and add license headers. by @bianhq in #465
  • [Issue #466] fix storage endpoint config. by @bianhq in #467
  • [Issue #468] add operator name to the input of the cloud function workers. by @bianhq in #469
  • [Issue #468] improve operator name setting. by @bianhq in #470
  • [Issue #437] redesign the schema of metadata. by @bianhq in #474
  • [Issue #471] implement the C++ reader for pixels. by @yuly16 in #473
  • [docs] add Duckdb and C++ reader introduction. by @bianhq in #475
  • HOTFIX: catalog metadata id by @voidforall in #476
  • [docs] remove outdated information for statistics collection. by @bianhq in #478
  • [docs] revise data compaction. by @bianhq in #479
  • [Issue #423] modify settings for vHive worker & invoker by @zhaoshihan in #481
  • [Issue #482] fix incorrect join plan for broadcast join after complete broadcast chain join. by @bianhq in #483
  • [Issue #484] fix post partition for right side broadcast chain join. by @bianhq in #486
  • [Issue #487] add argument check in pixels-planner. by @bianhq in #488
  • [Issue #435] remove request id from scan output path. by @bianhq in #489
  • [docs] Update install.md: Maven version should be above or equal to 3.6. by @yuly16 in #494
  • [Issue #471] Integrate pixels reader c++. by @yuly16 in #496
  • [Issue #391] minor changes. by @bianhq in #497
  • [Issue #493] Align the location of column byte buffer in pxl file. by @yuly16 in #495
  • [Issue #498] remove the orders array from dictionary encoding. by @bianhq in #499
  • [Issue #498] fix wrong offset in buffer read. by @bianhq in #500
  • [Issue #485] clean metadata cache and make it transactional. by @bianhq in #501
  • [Issue #471] some fix for c++ reader. by @yuly16 in #502
  • [Issue #490] support relaxed and best-effort query execution. by @bianhq in #503
  • [docs] Update pixels-turbo/pixels-worker-vhive/README.md by @jasha64 in #507
  • [Issue #491] fix query service and revise comments. by @bianhq in #509
  • [Issue #498] remove the orders array from StringColumnReader in C++. by @yuly16 in #510
  • [Issue #512] Fix the out-of-range issue caused by splitting a row by @yuly16 in #513
  • [Issue #514] Change pixels c++ column vector from 4k alignment to 32 byte alignment by @yuly16 in #515
  • [Issue #516] enhance exception handling in base workers by @bianhq in #518
  • [Issue #519] fix wrong column chunk offsets in a multi-row-group file by @bianhq in #520
  • [Issue #517] load single tbl file into multiple paths by @bianhq in #523
  • [Issue #524] using full path as the key in file footer cache by @bianhq in #525
  • [Issue #522] add layout configurations into the file format by @bianhq in #526
  • [Issue #527] Fix the bug that iovecs might free twice by @yuly16 in #528
  • [Issue #521] support chunk aligned compact file by @bianhq in #532
  • [Issue #471] pixels reader c++ code refactor by @yuly16 in #533
  • [Issue #531] support configurable endianness by @bianhq in #534
  • [Issue #535] change project version to 0.1.0 by @bianhq in #536

New Contributors

  • @bianhq and @ray6080 were the first two authors of the project. They co-authored the basic framework of Pixels
  • @taoyouxian made contributions to the Presto and Hive integrations of Pixels
  • @mzp0514 made contributions to the initial implementation of data upserts in Pixels
  • @xxchan made contributions to the Trino integration of Pixels and the initial implementation of snapshot query execution
  • @TiannanSha made contributions to serverless query acceleration in AWS Lambda
  • @josephhany made contributions to exploring query cost estimation solutions
  • @Yeeef made contributions to extending in-memory columnar cache to SSDs
  • @yuly16 made contributions to the implementation of the C++ reader and DuckDB integration of Pixels, and improved the query performance on SSDs
  • @voidforall made contributions to the query service implementation and the hybrid query execution in DuckDB
  • @zhaoshihan made contributions to serverless query execution and performance profiling in vHive
  • @jasha64 made contributions to SSD performance benchmarking and serverless query execution in vHive

Full Changelog: https://github.com/pixelsdb/pixels/commits/v0.1.0