Main Features
- Optimized columnar storage on HDFS, S3, MinIO, local FS, GCS, and Redis. Significantly outperforms Parquet and ORC.
- A distributed in-memory columnar cache that further improves I/O performance for data analytics.
- External query engine integrations for Trino, Presto, Hive, and DuckDB. Significantly improves query performance in these query engines.
- Metadata (database schema and data catalog) management for data lakes and warehouses.
- Internal query accelerator (Pixels-Turbo) in serverless computing environments, including AWS Lambda and vHive (K8S+Knative+Firecracker).
- REST query API that exposes Pixels as a serverless analytics service for external users.
Release Notes
What's Changed
- debug hive by @bianhq in #1
- make pixels-hive usable by @bianhq in #7
- Finish pixels-hive by @bianhq in #8
- refine docs of pixels-hive. by @bianhq in #9
- refine pixels-hive doc and fix pixels-load. by @bianhq in #10
- Refine code and docs. by @bianhq in #11
- add comments by @bianhq in #18
- Rename the prefix of packages to io.pixelsdb by @ray6080 in #22
- [Issue #24]: add balancers for pixels cache. by @bianhq in #25
- [Issue #24]: move file if it is not local by @bianhq in #28
- HOTFIX: upgrade fastjson to version 1.2.58 for security by @bianhq in #29
- [Issue #30]: make NUMA interleaved. by @bianhq in #31
- [Issue #30]: add start-vmtouch and make it NUMA interleaved. by @bianhq in #32
- [HOTFIX]: remove mysql-connector from dependencies due to license conflicts. by @bianhq in #33
- [HOTFIX]: remove or upgrade insecure packages from dependencies. by @bianhq in #34
- [HOTFIX]: rollback jackson to 2.8.1. by @bianhq in #35
- [Issue #36]: add copyright and license notice. by @bianhq in #37
- Revert "[Issue #36]: add copyright and licence notice." by @bianhq in #38
- [Issue #36]: add copyright and license notice. by @bianhq in #39
- [Issue #36]: fix copyright and licence notice. by @bianhq in #40
- [Issue #36]: rename LICENCE file. by @bianhq in #41
- [Issue #36]: update NOTICE. by @bianhq in #42
- HOTFIX: import external jars when starting pixels. by @bianhq in #43
- HOTFIX: hive docs. by @bianhq in #45
- [Issue #47]: using different lock files for coordinator and datanode daemons. by @bianhq in #48
- [Issue #49] fix bugs in VarcharArrayBlock by @bianhq in #50
- [Issue #52]: implement direct shared memory read. by @bianhq in #53
- [Issue #52]: reduce memory copy in BooleanColumnReader and IntegerColumnReader. by @bianhq in #54
- [Issue #55]: reduce memory copy in column readers. by @bianhq in #56
- [HOTFIX]: bug in DynamicIntArray.toArray() and redundant memory copy. by @bianhq in #57
- [Issue #58]: add close method to resources. by @bianhq in #59
- Issue #58: add gc threshold to optimize gc for small queries. by @bianhq in #62
- [Issue #44]: implement Etcd metadata store. by @bianhq in #65
- [Issue #67] implement cache read / write coordination. by @bianhq in #68
- [Issue #67] implement three-phase cache update. by @bianhq in #69
- Hotfix: fix bugs in CacheWriter initialization and cache update. by @bianhq in #70
- HOTFIX: clean unused exceptions. by @bianhq in #71
- [Issue #72]: optimize memory allocation and access in PixelsCacheReader. by @bianhq in #73
- [Issue #72]: fix bugs in pixels-cache and implement loading radix from index file. by @bianhq in #77
- [Issue #67]: Implement cache read lease and optimize read performance. by @bianhq in #79
- [Issue #78]: avoid cache probing on uncached tables in Presto. by @bianhq in #80
- [Issue #78]: avoid cache probing on uncached tables in Hive. by @bianhq in #81
- [Issue #83]: implement JIT splitting for ordered path. by @bianhq in #84
- [Issue #85]: fix list table error when schema is empty. by @bianhq in #86
- [Issue #87]: remove explicit gc from PixelsReader. by @bianhq in #89
- [Issue #88]: fix MAX_READER_COUNT. by @bianhq in #90
- [Issue #91]: use three bytes for cache reader count. by @bianhq in #92
- [Issue #94]: Support Date and Time types. by @bianhq in #95
- [Issue #98]: fix insert/update related metadata service and getLayout. by @bianhq in #101
- [Issue #99]: fix null value storage. by @bianhq in #102
- [Issue #103]: fix and enhance predicate processing. by @bianhq in #104
- [Issue #105]: fix endless execution for `select count(*)`. by @bianhq in #106
- [Issue #100]: refine type management and add varchar/char support. by @bianhq in #107
- [Issue #108]: replace hdfs FileSystem api with the unified Storage api. by @bianhq in #110
- [Issue #108]: implement LocalFS and global auto-increment id. by @bianhq in #112
- [Issue #113]: collect the cumulative memory usage in pixels record reader. by @bianhq in #116
- [Issue #115]: replace message queue implementation. by @bianhq in #117
- [Issue #114]: support S3 storage and asynchronous I/O scheduling. by @bianhq in #118
- [Issue #120]: support configurable S3 clients. by @bianhq in #122
- [Issue #121]: fix listing objects. by @bianhq in #123
- [Issue #124]: refine read path. by @bianhq in #125
- [Issue #126]: add rate limit and request retry policy. by @bianhq in #127
- [Issue #128]: implement request diversion and refine java package layout. by @bianhq in #130
- [Issue #131]: implement projections for compact layout. by @bianhq in #134
- [Issue #133]: fix and refine retry policy. by @bianhq in #135
- [Issue #136]: refine thread factory for async read using sync client. by @bianhq in #137
- [Issue #132]: upgrade supported Presto version from 0.192 to 0.215. by @bianhq in #138
- [Issue #142]: fix mbps rate-limit. by @bianhq in #143
- [Issue #145]: fix date type for Presto-0.215. by @bianhq in #148
- [Issue #149]: fix configuration and dependency. by @bianhq in #150
- [Issue #153]: add adaptive reading method. by @bianhq in #154
- [Issue #144]: fix scripts and finish docs. by @bianhq in #157
- [Issue #156]: fix empty schema. by @bianhq in #159
- [Issue #158]: support loading data from and to arbitrary storage. by @bianhq in #161
- [Issue #160]: compact from and to arbitrary storage, including tail files. by @bianhq in #162
- [Issue #163]: fix bounded varchar/char type support. by @bianhq in #165
- [Issue #164]: define storage scheme in CREATE statement. by @bianhq in #166
- [Issue #167]: fix retained size calculation of VarcharArrayBlock. by @bianhq in #168
- [Issue #169]: add session properties about layout-path enabling to pixels-presto. by @bianhq in #171
- [Issue #172]: implement record cursor and enhance record reader. by @bianhq in #173
- [Issue #174]: implement transaction server and pass query (trans) id into record reader. by @bianhq in #176
- [Issue #175]: enhance transaction info of queries and pass query id into I/O schedulers. by @bianhq in #177
- [Issue #179]: fix show tables from information_schema. by @bianhq in #180
- [Issue #178]: enable metrics server by configuration parameter. by @bianhq in #182
- [Issue #183]: fix show columns from information_schema. by @bianhq in #184
- [Issue #181]: fix start and stop of pixels-daemon. by @bianhq in #185
- [Issue #186]: skip cache initialization if cache is disabled. by @bianhq in #187
- [Issue #139]: stop retrying the request from terminated queries. by @bianhq in #188
- [Issue #190]: fix data copying between S3 buckets. by @bianhq in #191
- [Issue #194]: support views. by @bianhq in #195
- [Issue #196]: refine type management and support decimal. by @bianhq in #197
- [Issue #196]: fix DecimalColumnVector and use decimal in TPC-H schema. by @bianhq in #200
- [Issue #192]: support multi-thread compaction. by @bianhq in #201
- [Issue #192]: refine compact, S3, PixelsCompactor, and support multi-thread copying. by @bianhq in #202
- [Issue #205]: fix integer overflow in request merging. by @bianhq in #206
- [Issue #210]: some minor improvements. by @xxchan in #204
- [Issue #208]: fix column statistics for decimal. by @bianhq in #211
- [Issue #207]: fix the data type metadata in the file footer. by @bianhq in #212
- [Issue #209]: clean code. by @bianhq in #213
- [Issue #189]: support folders on S3. by @bianhq in #215
- [Issue #189]: revise docs and scripts. by @bianhq in #216
- [Issue #217]: split presto and hive integrations into sub-projects. by @bianhq in #218
- [Issue #220] revise license, readme, and pom. by @bianhq in #221
- [Issue #222]: disable mock file locations for storage systems that do not provide data locality. by @bianhq in #223
- [Issue #170]: implemented scan operator. by @TiannanSha in #229
- [Issue #224]: clean and refine cache key and cache entry implementation. by @bianhq in #230
- [Issue #231]: update docs and comments. by @bianhq in #232
- [Issue #225]: upgrade to Hadoop 3.3.1 and clean dependencies. by @bianhq in #234
- [Issue #231]: enable storage schemes in configuration. by @bianhq in #235
- [Issue #233]: fix log4j configurations. by @bianhq in #236
- [Issue #193]: revise the README under modules. by @bianhq in #237
- [Issue #170]: fix dependencies and logging for pixels-lambda. by @bianhq in #243
- [Issue #238]: Add a script to install pixels by @xxchan in #239
- [Issue #170]: implement filter. by @TiannanSha in #246
- [Issue #170]: clean the unused files and reformat the code. by @bianhq in #247
- [Issue #170]: update poms for pixels-lambda. by @bianhq in #248
- [Issue #249]: remove InvalidActivityException. by @bianhq in #250
- [Issue #245]: support reading remote config file. by @bianhq in #252
- [Issue #170]: optimize Pixels S3 writer and lambda. by @bianhq in #253
- [Issue #170]: implement table scan filter and refine scan worker. by @bianhq in #254
- [Issue #170]: support direct write back to on-premise minio. by @bianhq in #255
- [Issue #170]: support lambda scan. by @bianhq in #256
- [Issue #170]: fix the discrete filter for string-based columns. by @bianhq in #257
- [Issue #170]: fix S3 folder deletion. by @bianhq in #259
- HOTFIX: refine comments. by @bianhq in #260
- [Issue #261]: move table scan predicates into pixels-executor. by @bianhq in #262
- [Issue #170]: implement hash partitioned join. by @bianhq in #263
- [Issue #170]: implement broadcast join. by @bianhq in #264
- [Issue #265]: fix reading row group number. by @bianhq in #266
- [Issue #268]: fix null value check for join. by @bianhq in #272
- [Issue #271]: improve discrete column filter. by @bianhq in #273
- [Issue #270]: support full outer join. by @bianhq in #275
- [Issue #170]: enhance joins and implement join tree executor. by @bianhq in #281
- [Issue #170]: support join endian. by @bianhq in #282
- [Issue #170]: disable left full outer broadcast join. by @bianhq in #283
- [Issue #170]: fix join input splits generation, refine join inputs and join operator. by @bianhq in #284
- [Issue #170]: fix and refine join inputs and join workers. by @bianhq in #285
- [Issue #258]: implement table and column statistics. by @bianhq in #287
- [Issue #258]: add join advisor and fix join execution. by @bianhq in #288
- [Issue #258]: support multi-pipeline join and fix bugs. by @bianhq in #289
- [Issue #258]: implement partitioned chain join. by @bianhq in #290
- [Issue #258]: implement split size capping. by @bianhq in #291
- [Issue #170]: add invoker factory and get worker name from config file. by @bianhq in #292
- [Issue #170]: implement work exception handling and join output collection. by @bianhq in #293
- [Issue #294]: fix blocking splits when lambda scan is enabled. by @bianhq in #295
- [Issue #203]: support long decimal with 38 max digit precision and scale. by @bianhq in #296
- [Issue #297]: fix timestamp type, stat recorders, null value filtering, and pixels-load. by @bianhq in #298
- [Issue #258]: upgrade Prometheus dependencies. by @josephhany in #300
- [Issue #170] implement aggregation execution. by @bianhq in #301
- [Issue #170]: fix null-pointer in scan worker. by @bianhq in #302
- [Issue #170]: fix column stats recorders and split size capping. by @bianhq in #303
- [Issue #305]: support deleting more than 1000 files from S3. by @bianhq in #306
- [Issue #170]: support scan projection in scan worker. by @bianhq in #307
- [Issue #170]: implement min/max in column stats in metadata. by @bianhq in #308
- [Issue #170]: fix aggregation worker. by @bianhq in #309
- [Issue #170]: improve multi-thread copying. by @bianhq in #310
- [Issue #170] add metadata cache and cost-based splits index. by @bianhq in #311
- [Issue #170]: merge outputs in lambda worker. by @bianhq in #312
- [Issue #170]: fix loading path. by @bianhq in #313
- [Issue #170]: add row count broadcast threshold. by @bianhq in #314
- [Issue #170]: join optimization for very large datasets. by @bianhq in #315
- [Issue #170]: optimizations for large joins. by @bianhq in #316
- [Issue #170]: optimizing hash functions. by @bianhq in #317
- [Issue #170]: implement partition projection. by @bianhq in #318
- [Issue #170]: improve execution pipeline. by @bianhq in #319
- [Issue #170]: improve join algorithm selection and broadcast split size adjustment. by @bianhq in #321
- [Issue #170]: add script to run before each new instance becomes in-service. by @TiannanSha in #322
- [Issue #170]: using multi-thread for column encoding in Pixels writer. by @bianhq in #323
- [Issue #170]: update spot scripts. by @bianhq in #324
- [Issue #170] collect performance metrics from serverless workers. by @bianhq in #325
- [Issue #170]: add trans concurrency and GC monitor, and tune log level. by @bianhq in #326
- [Issue #170]: update aggregation plan and spot vm user data. by @bianhq in #327
- [Issue #170]: improve get num partitions. by @bianhq in #329
- [Issue #170]: remove existence check from workers. by @bianhq in #330
- [Issue #214]: implement multi-thread S3 output stream. by @bianhq in #331
- [Issue #214]: enable retry policy in S3OutputStream. by @bianhq in #332
- [Issue #170]: improve dictionary encoding and metrics collection. by @bianhq in #333
- [Issue #170]: add starling executor and fix the inputs of multi-pipeline broadcast join. by @bianhq in #334
- [Issue #170] fix hang in partitioned join worker. by @bianhq in #335
- [Issue #170]: optimize file existence checking in getFileSchema. by @bianhq in #336
- [Issue #170]: support Redis storage. by @bianhq in #337
- [Issue #170]: support default user in Redis. by @bianhq in #338
- [Issue #170]: fix null value processing in aggregation. by @bianhq in #339
- [Issue #170]: improve string comparison and aggregation. by @bianhq in #340
- [Issue #170]: fix empty file problem in aggregation. by @bianhq in #341
- [Issue #170]: add partitioning to aggregation and implement starling aggregation. by @bianhq in #342
- [Issue #170]: add null fraction and cardinality statistics into pixels-load. by @bianhq in #343
- [Issue #170]: support cardinality estimation for aggregation. by @bianhq in #344
- [Issue #345]: fix double start of retry policy. by @bianhq in #346
- [Issue #347]: reconnect to S3 when fail to get object. by @bianhq in #348
- [Issue #170]: support count aggregation. by @bianhq in #349
- [Issue #350]: remove request division. by @bianhq in #351
- [Issue #352]: support google cloud storage. by @bianhq in #353
- pixels partitioned cache protocol by @Yeeef in #355
- [Issue #357]: fix compilation and dependency problem. by @bianhq in #358
- [Issue #357] clean unused files. by @bianhq in #359
- [Issue #170]: move the code related to query planning to pixels-optimizer. by @bianhq in #360
- [Issue #170]: refine query queues. by @bianhq in #361
- [Issue #357]: update readme for pixels-trino. by @bianhq in #362
- [Issue #357]: refine auto scaling metrics. by @bianhq in #363
- [Issue #357]: update metrics collector. by @bianhq in #364
- [Issue #365]: fix statistics collection. by @bianhq in #366
- [Issue #367]: fix BinaryColumnVector for dictionary encoding. by @bianhq in #368
- [Issue #369]: implement dictionary-encoded column vector. by @bianhq in #370
- [Issue #371]: update metadata for view creation in Trino. by @bianhq in #372
- [Issue #373]: support direct read on localFS. by @bianhq in #375
- [Issue #374]: fix the column vectors. by @bianhq in #376
- [Issue #377]: fix ByteBufferInputStream. by @bianhq in #378
- [Issue #379]: support configurable direct/non-direct I/O in LocalFS. by @bianhq in #382
- [Issue #380]: rename pixels-optimizer to pixels-planner. by @bianhq in #383
- [Issue #381]: move out pixels-load and pixels-tools. by @bianhq in #384
- [Issue #386]: fix getRowNumber in PixelsRecordReaderImpl. by @bianhq in #387
- [Issue #385]: refine docs and some comments, and fix timestamp format for AWS CloudWatch metrics. by @bianhq in #389
- [Issue #388]: fix encoded column vector reading. by @bianhq in #390
- [Issue #394] fix non-encoded integer column reading. by @bianhq in #395
- [Issue #393] support mmap in local file systems. by @bianhq in #396
- [Issue #397] update install.sh. by @bianhq in #398
- [Issue #399] update docs and support listing the paths and statuses of the files in multiple directories. by @bianhq in #400
- [Issue #401] support different input and output storage scheme in compactor and create parent dir automatically for local fs. by @bianhq in #402
- [Issue #403] support async read on local fs. by @bianhq in #404
- [Issue #405] refine configuration properties. by @bianhq in #406
- [Issue #405] fix comments. by @bianhq in #407
- [Issue #409]: add an http server that provides restful api. by @bianhq in #411
- [Issue #408]: improve exception handling in pixels-daemon. by @bianhq in #412
- [Issue #410] implement the REST API for SQL execution. by @bianhq in #413
- [Issue #414] pixels reads replicated content on string column after the first row batch, when the column is not encoded. by @yuly16 in #415
- [Issue #417] the data in integer and long column reader is aligned by @yuly16 in #418
- [Issue #416] build pixels-turbo, split storage adapters, cf invokers, and scaling handlers into separate modules by @bianhq in #420
- [Issue #421] init pixels-proxy. by @bianhq in #425
- [Issue #419] fix timezone offset for date column. by @bianhq in #422
- [pixels-cli] fix session properties for stat. by @bianhq in #427
- [Issue #421] update docs and clean Date related code. by @bianhq in #428
- [Issue #421] implement basic pixels-proxy. by @bianhq in #430
- [Issue #429] upgrade Prometheus and exporters. by @bianhq in #432
- [Issue #433] prepare for vhive integrations. by @bianhq in #434
- [Issue #431] update docs, comments, and error print. by @bianhq in #440
- [Issue #441] add grpc example into pixels-server. by @bianhq in #442
- [Issue #443] add add-opens into manifest. by @bianhq in #444
- [docs] split the main readme into docs. by @bianhq in #445
- [docs] fix links. by @bianhq in #446
- [docs] fix build instructions. by @bianhq in #447
- [docs] refine readme. by @bianhq in #448
- [docs] refine docs for Pixels Turbo. by @bianhq in #449
- [Issue #450] fix lambda invoker unit tests and minor problems in lambda workers. by @bianhq in #451
- [Issue #452] enable reading other storages than s3 in serverless workers. by @bianhq in #453
- [docs] update pixels-turbo settings. by @bianhq in #454
- SQLglot transpile integration by @voidforall in #455
- Finish pixels parser by @voidforall in #456
- [Issue #431] refine transaction protocol. by @bianhq in #458
- [Issue #431] update docs and add file headers. by @bianhq in #461
- [Issue #459] update row count automatically. by @bianhq in #462
- [Issue #423] pixels vhive invoker and worker. by @zhaoshihan in #463
- [Issue #431] finish pixels query server. by @bianhq in #464
- [Issue #423] refine docs and add license headers. by @bianhq in #465
- [Issue #466] fix storage endpoint config. by @bianhq in #467
- [Issue #468] add operator name to the input of the cloud function workers. by @bianhq in #469
- [Issue #468] improve operator name setting. by @bianhq in #470
- [Issue #437] redesign the schema of metadata. by @bianhq in #474
- [Issue #471] implement the C++ reader for pixels. by @yuly16 in #473
- [docs] add DuckDB and C++ reader introduction. by @bianhq in #475
- HOTFIX: catalog metadata id by @voidforall in #476
- [docs] remove outdated information for statistics collection. by @bianhq in #478
- [docs] revise data compaction. by @bianhq in #479
- [Issue #423] modify settings for vHive worker & invoker by @zhaoshihan in #481
- [Issue #482] fix incorrect join plan for broadcast join after complete broadcast chain join. by @bianhq in #483
- [Issue #484] fix post partition for right side broadcast chain join. by @bianhq in #486
- [Issue #487] add argument check in pixels-planner. by @bianhq in #488
- [Issue #435] remove request id from scan output path. by @bianhq in #489
- [docs] Update install.md: Maven version should be above or equal to 3.6. by @yuly16 in #494
- [Issue #471] Integrate pixels reader c++. by @yuly16 in #496
- [Issue #391] minor changes. by @bianhq in #497
- [Issue #493] Align the location of column byte buffer in pxl file. by @yuly16 in #495
- [Issue #498] remove the orders array from dictionary encoding. by @bianhq in #499
- [Issue #498] fix wrong offset in buffer read. by @bianhq in #500
- [Issue #485] clean metadata cache and make it transactional. by @bianhq in #501
- [Issue #471] some fix for c++ reader. by @yuly16 in #502
- [Issue #490] support relaxed and best-effort query execution. by @bianhq in #503
- [docs] Update pixels-turbo/pixels-worker-vhive/README.md by @jasha64 in #507
- [Issue #491] fix query service and revise comments. by @bianhq in #509
- [Issue #498] remove the orders array from StringColumnReader in C++. by @yuly16 in #510
- [Issue #512] Fix the out-of-range issue caused by splitting a row by @yuly16 in #513
- [Issue #514] Change pixels c++ column vector from 4k alignment to 32 byte alignment by @yuly16 in #515
- [Issue #516] enhance exception handling in base workers by @bianhq in #518
- [Issue #519] fix wrong column chunk offsets in a multi-row-group file by @bianhq in #520
- [Issue #517] load single tbl file into multiple paths by @bianhq in #523
- [Issue #524] using full path as the key in file footer cache by @bianhq in #525
- [Issue #522] add layout configurations into the file format by @bianhq in #526
- [Issue #527] Fix the bug that iovecs might free twice by @yuly16 in #528
- [Issue #521] support chunk aligned compact file by @bianhq in #532
- [Issue #471] pixels reader c++ code refactor by @yuly16 in #533
- [Issue #531] support configurable endianness by @bianhq in #534
- [Issue #535] change project version to 0.1.0 by @bianhq in #536
New Contributors
- @bianhq and @ray6080 were the first two authors of the project. They co-authored the basic framework of Pixels
- @taoyouxian made contributions to the Presto and Hive integrations of Pixels
- @mzp0514 made contributions to the initial implementation of data upserts in Pixels
- @xxchan made contributions to the Trino integration of Pixels and the initial implementation of snapshot query execution
- @TiannanSha made contributions to serverless query acceleration in AWS Lambda
- @josephhany made contributions to exploring query cost estimation solutions
- @Yeeef made contributions to extending in-memory columnar cache to SSDs
- @yuly16 made contributions to the implementation of the C++ reader and DuckDB integration of Pixels, and improved the query performance on SSDs
- @voidforall made contributions to the query service implementation and the hybrid query execution in DuckDB
- @zhaoshihan made contributions to serverless query execution and performance profiling in vHive
- @jasha64 made contributions to SSD performance benchmarking and serverless query execution in vHive
Full Changelog: https://github.com/pixelsdb/pixels/commits/v0.1.0