Release v0.7.0
Apache Uniffle (Incubating) Release v0.7.0
Highlight
- Better support for Spark AQE
- Leveraging the LOCAL_ORDER data distribution to improve performance
- Estimating assignment number of shuffle servers to improve user experience
- Better assignment policy of assigning adjacent partitions to the same shuffle server to improve performance
- Optimization of huge partition to improve stability of shuffle servers
- More bug fixes and usability improvements of K8S operator
- Add support of user quota management and more compression algorithms
- Add support of spark data eviction mechanism of stage level
- Some improvement of stability and performance
Changelog
- Add more badges in README by @kaijchen in #219
- Fix incorrect log format strings by @kaijchen in #220
- Change total lines badge url to sloc.xyz in README by @kaijchen in #222
- [MINOR] Fix warnings reported by lgtm by @kaijchen in #223
- [MINOR] Simplify creating buffer logic by @zuston in #227
- Support cancelling previous ci actions by @zuston in #225
- Use the conf of shuffleNodesNumber from jobs to be as checking factor by @zuston in #208
- Output the stderr and stdout to output file in startup script by @zuston in #226
- [ISSUE-48][FEATURE][FOLLOW UP] Add controller component by @wangao1236 in #214
- Add more metrics about requiring read memory by @zuston in #231
- Adjust the memory required times to match grpc max deadline conf by @zuston in #218
- [MINOR] Fix flaky test by @jerqi in #238
- [ISSUE-48][FEATURE][FOLLOW UP] Add yaml of components and crd exampes by @wangao1236 in #236
- Fix Flaky test GetShuffleReportForMultiPartTest by @leixm in #241
- Set the default disk capacity to the total space by @zuston in #237
- Add issue template by @jerqi in #8
- [MINOR] Fix inefficient map iteration by @kaijchen in #245
- Support deploy multiple shuffle servers in a single node by @xianjingfeng in #166
- Fast fail when reading failed in ComposedClientReadHandler by @zuston in #213
- Fix startup shell problem by @jerqi in #251
- New version 0.7.0-snapshot by @jerqi in #252
- [ISSUE-196] Fix flaky test about kerberos by @zuston in #250
- [ISSUE-48][FEATURE][FOLLOW UP] add unit test for validating rss objects by @wangao1236 in #248
- [Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy by @smallzhongfeng in #210
- [TYPO] Replace Chinese colon by ASCII colon by @kaijchen in #255
- Introduce startup-silent-period mechanism to avoid partial assignments by @zuston in #247
- Replace DISCLAIMER with DISCLAIMER-WIP by @jerqi in #258
- [ISSUE-244] Fix flaky test of CoordinatorGrpcTest.rpcMetricsTest by @zuston in #256
- Fix flaky test of ClientConfManagerTest by @smallzhongfeng in #260
- [Refactor] Optimize creating shuffle handlers by @zuston in #259
- Introduce data cleanup mechanism on stage level by @zuston in #249
- [ISSUE-48][FEATURE][FOLLOW UP] add docs for operator by @wangao1236 in #261
- Fix potenial missing reads of exclude nodes by @zuston in #269
- [ISSUE-257] RssMRUtils#getBlockId change the partitionId of int type to long by @fpkgithub in #266
- [ISSUE-273][BUG] Get shuffle result failed caused by concurrent calls to registerShuffle by @leixm in #274
- Add enum type test about case insensitive by @zuston in #280
- Support ZSTD by @zuston in #254
- [ISSUE-239][BUG] RssUtils#transIndexDataToSegments should consider the length of the data file by @leixm in #275
- Remove code quality badge and add release badge by @kaijchen in #284
- [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write by @xianjingfeng in #235
- Upgrade Github actions for Node.js 16 by @kaijchen in #292
- Fix NPE in WriteBufferManager.addRecord by @wForget in #296
- Fix AbstractStorage#containsWriteHandler by @xianjingfeng in #281
- Add more test cases on LocalStorageManager.selectStorage by @zuston in #298
- [ISSUE-137][Improvement][AQE] Sort MapId before the data are flushed by @zuston in #293
- [ISSUE-283][FEATURE] Support snappy compression/decompression by @amaliujia in #304
- [ISSUE-290] Make RpcNodePort and HttpNodePort optional by @amaliujia in #305
- [ISSUE-301][Subtask][Improvement][AQE] Merge continuous ShuffleDataSegment into single one by @zuston in #303
- Cleanup RuntimeException and fetchRemoteStorage logic in ClientUtils by @kaijchen in #295
- [ISSUE-135][FOLLOWUP][Improvement][AQE] Assign adjacent partitions to the same ShuffleServer by @leixm in #307
- Correct the contributing guide link in pull-request template by @zuston in #314
- Fix bug of "Comparison method violates its general contract" by @zuston in #315
- [AQE][LocalOrder] Fix potenial bug when merging continuous segments by @zuston in #318
- [AQE][LocalOrder] Fix wrong param of expectedTaskIds in LocalOrderSegmentSplit by @zuston in #319
- [Feature] Support the estimated number of ShuffleServers required. by @leixm in #322
- [Bug] Fix potenial bug when the index reading offset is greater than data length by @zuston in #320
- [ISSUE-154][Improvement] Support Empty assignment to Shuffle Server by @rhh777 in #325
- [Bug] Fix invalid owner of host path volumes by @wangao1236 in #330
- [ISSUE-309][FEATURE] Support ShuffleServer latency metrics. by @leixm in #327
- [ISSUE-329]Catch NPE in ShuffleTaskManager#addFinishedBlockIds by @xianjingfeng in #331
- [BUG] Fix wrong method name by @leixm in #335
- [ISSUE-328] Cleanup unused shuffle servers after stage completed by @xianjingfeng in #334
- [MINOR] Migrate RankValue to the package of the common class by @smallzhongfeng in #265
- [BUG] Fix incorrect spark metrics by @zuston in #324
- [Improvement][LocalOrder] Add tests about keeping consistent with FixedSize when no skew optimization by @zuston in #336
- [INFRA] Add k8s pipeline by @jerqi in #340
- Remove unused class of RssShuffleUtils by @zuston in #345
- [ISSUE-342][Improvement] Check Spark Serializer type by @chong0929 in #344
- [Feature] Support user's app quota level limit by @smallzhongfeng in #311
- [BUG][AQE][LocalOrder] Fix the bug of missed data due to block sorting by @zuston in #347
- [ISSUE-364] Fix
indexWriter
don't close if exception thrown when close dataWriter by @xianjingfeng in #349 - [BUG] Fix flaky test of AQESkewedJoinWithLocalOrderTest by @zuston in #350
- Add collaborators by @jerqi in #351
- [BUG][FOLLOWUP] Fix flaky test of AQESkewedJoinWithLocalOrderTest by @zuston in #355
- [BUG][AQE][LocalOrder] Remove check of discontinuous map task ids by @zuston in #354
- [Improvement] Task fast fail once blocks fail to send by @zuston in #332
- [ISSUE-228][FEATURE] Add a period local storage cleaner thread by @sfwang218 in #357
- [Improvement] Optimize the use of QuotaManager by @smallzhongfeng in #359
- [ISSUE-339] Optimize retry logic in send shuffle data by @xianjingfeng in #361
- [ISSUE-300] Make config type of RSS_CLIENT_TYPE as enum by @selectbook in #310
- [ISSUE-228] Fix the problem of protobuf-java incorrect dependency at compile time by @tsface in #362
- [ISSUE-124] Add fallback mechanism for blocks read inconsistent by @xianjingfeng in #276
- [BUG] Fix potential memory leak when encountering disk unhealthy by @zuston in #370
- [Improvement][AQE] Support getting memory data skip by upstream task ids by @zuston in #358
- [ISSUE-369] Don't throw exception if blocks are corrupted but have multi replicas by @xianjingfeng in #374
- [ISSUE-285][Improvement] Only use HDFS and LOCALFILE storageType in the test by @tiantingting5435 in #360
- Fix typo of PreferDiffHostAssignmentStrategy by @zuston in #379
- [ISSUE-376] Fix concurrency problems may occur when the ApplicationManager register app by @smallzhongfeng in #382
- [ISSUE-380] Refactor the flush process to fix fallback fail by @zuston in #383
- [Refactor] Make coordinator class more organized by @smallzhongfeng in #386
- [ISSUE-392] Fix the bug in the shuffle data cleanup checker that causes false reports of disk corruption by @zuston in #393
- [ISSUE-390] Print more infos after read finished by @xianjingfeng in #395
- [Improvement] Skip blocks when read from memory by @xianjingfeng in #294
- [Improvement] Small refactor for code quality by @advancedxy in #394
- [BUG] Fix incorrect block info statistics after read finished by @xianjingfeng in #401
- Revert "[Improvement] Skip blocks when read from memory (#294)" by @xianjingfeng in #403
- [ISSUE-388][ISSUE-244][Bug] Fix incorrect usage of
GRPCMetrics#setGauge
by @xianjingfeng in #404 - [MINOR] If there is no data flush to hdfs, return directly instead of throw exception by @xianjingfeng in #406
- Support writing multi files of single partition to improve speed in HDFS storage by @zuston in #396
- Fix incorrect metrics of event_queue_size and total_write_handler by @zuston in #411
- [Improvement] Support skip memory data when use multiple replicas by @xianjingfeng in #400
- [ISSUE-402] Flaky Test: QuorumTest#case1 by @Rembrant777 in #422
- Fix flaky test QuotaManagerTest#testDetectUserResource by @xianjingfeng in #421
- Improve README by @kaijchen in #427
- Remove unused data structure and method by @zuston in #429
- [FOLLOWUP] Remove unused methods in Storage interface by @zuston in #431
- Flaky Test: AppBalanceSelectStorageStrategyTest#selectStorageTest by @smallzhongfeng in #438
- [Minor] Move GrpcServerTest to common.rpc package by @kaijchen in #439
- [Minor] refactor test code by @advancedxy in #432
- [Minor][Improvement] Introduce FileWriter interface for Localfile/HDFS file writer by @zuston in #444
- [ISSUE-169] Support metric reporter and Support promethues push gateway by @xianjingfeng in #415
- [Improvement] Avoid selecting storage which has reached the high watermark by @zuston in #424
- [Bug] Fix potential negative preAllocatedSize variable by @advancedxy in #428
- [Feature] add a configuration to control shuffle data flush by @advancedxy in #445
- [Improvement] Refactor getPartitionRange to calculate range directly by @a-li in #447
- Fix Flaky Test: AppBalanceSelectStorageStrategyTest#selectStorageTest by @smallzhongfeng in #450
- Fix Flaky Test: QuotaManagerTest#testDetectUserResource by @smallzhongfeng in #453
- [ISSUE-451][Improvement] Read HDFS data files with random sequence to distribute pressure by @zuston in #452
- [ISSUE-455] Lazily create uncompressedData by @xianjingfeng in #457
- [ISSUE-378][HugePartition][Part-1] Record every partition data size for one app by @zuston in #458
- [ISSUE-456] Avoid removeResources for multiple times by @xianjingfeng in #459
- [ISSUE-461] Support Spark 3.3 by @kaijchen in #463
- [Deps] Bump slf4j to 1.7.36 to fix vulnerability in slf4j-log4j12 by @kaijchen in #464
- [ISSUE-448][Feature] shuffle server report storage info by @advancedxy in #449
- [ISSUE-472] Fix Flaky Test: LocalFileServerReadHandlerTest#testDataInconsistent by @zuston in #473
- [Improvement] Remove some unused empty server metrics by @zuston in #474
- [Improvement] Add more logs about data flush by @zuston in #482
- Fix potential race condition when registering remote storage info by @zuston in #481
- [Minor] Improve readability by replacing lambda with method reference by @iwangjie in #488
- [ISSUE-489][Minor] Cleanup some code by @iwangjie in #490
- [Test] Assume unknown blockID in LocalFileHandlerTestBase by @kaijchen in #478
- [ISSUE-475][Improvement] It's unnecessary to use ConcurrentHashMap for "partitionToBlockIds" in RssShuffleWriter by @zjf2012 in #480
- [ISSUE-378][HugePartition][Part-2] Introduce memory usage limit and data flush by @zuston in #471
- [ISSUE-484] Fix accidentally remove the storage of appId when unregistering partial shuffle in HdfsStorageManager by @zuston in #485
- [Test] Cleanup tests with Files#createTempDir() by @kaijchen in #492
- [Test] Add ConfigUtilsTest by @kaijchen in #500
- [Deps] Bump protobuf to 3.19.6 to address vulnerability by @kaijchen in #499
- [Minor] Make Constants final by @kaijchen in #501
- Cleanup UnitConverter and improve UnitConverterTest by @kaijchen in #504
- Fixes errors in doc header and operator install command by @zuston in #506
- [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition by @zuston in #494
- [ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions by @zuston in #505
- [Deps] Switch to slf4j-reload4j by @kaijchen in #508
- [ISSUE-512][Operator] Bump golang to 1.17 by @kaijchen in #515
- [ISSUE-514] Fix flaky test: ShuffleServerGrpcTest#clearResourceTest by @xianjingfeng in #516
- [ISSUE-507] Fix Flaky Test: ShuffleBufferManagerTest#cacheShuffleDataTest by @xianjingfeng in #511
- [ISSUE-509] Fix Flaky Test: ShuffleBufferManagerTest#shuffleFlushThreshold by @xianjingfeng in #510
- [ISSUE-479] [operator] refine operator's build system by @advancedxy in #491
- [ISSUE_479][operator][followup] use exec form instead of shell form in Dockerfile by @advancedxy in #518
- [SpotBugs] Set threshold to middle with exceptions by @kaijchen in #517
- [ISSUE-522] [operator] pass RSS_IP to coordinator container env by @advancedxy in #523
- [ISSUE-496][operator] infer resource request/limit from spec for init container by @advancedxy in #521
- [SpotBugs] Remove unread protected field in Checker by @kaijchen in #520
- [ISSUE-468] Put unavailable servers to the end of the list when sending shuffle data by @xianjingfeng in #470
- chore: add new collaborator by @advancedxy in #535
- [SpotBugs] Fix REC_CATCH_EXCEPTION by @kaijchen in #527
- [ISSUE-524][operator] Upgrading rss could also be deleted by @advancedxy in #531
- [ISSUE-553] Avoid removing buffer multiple times when clearing resources by @xianjingfeng in #534
- [ISSUE-469][operator] feat: supports adding labels to rss pods. by @wangao1236 in #528
- [SpotBugs] Fix UWF_UNWRITTEN_PUBLIC_OR_PROTECTED_FIELD by @kaijchen in #536
- Result of mkdirs() is ignored in LocalFileWriteHandler#createBasePath() by @kaijchen in #537
- [Minor] Optimize ShuffleServerInfo#hashCode by @xianjingfeng in #538
- [SpotBugs] Enable SC_START_IN_CTOR check by @kaijchen in #541
- [operator] fix error kind of ownerreference by @wangao1236 in #540
- [ISSUE-542] Ensure the elements of
StatusCode
andRssProtos.StatusCode
are the same by @xianjingfeng in #543 - Flaky Test: LowestIOSampleCostSelectStorageStrategyTest#selectStorageTest by @smallzhongfeng in #544
- [Improvement] Only report to the shuffle servers that owns the blocks by @zuston in #539
- [Minor] Cleanup "throws RuntimeException" by @kaijchen in #549
- [ISSUE-546] Replace ResponseStatusCode with StatusCode by @xianjingfeng in #547
- [ISSUE-476][FEATURE] Respect
spark.shuffle.compress
configuration in Uniffle by @zjf2012 in #495 - [ISSUE-545][operator] feat: support setting runtime class name and env for rss by @wangao1236 in #548
- [ISSUE #525][operator] refine svc creations by @advancedxy in #530
- [#552] docs: add more doc about spark.serializer requirement by @advancedxy in #556
- [#559] test: use withEnvironmentVariable to replace RssUtilsTest#setEnv by @advancedxy in #560
- [Followup #249] refactor: cleanup code and unify interfaces by @kaijchen in #558
- [#554] feat: infer rss base storage conf from env by @advancedxy in #555
- [MINOR] Remove commiters from collaborators by @jerqi in #563
- [#525][FOLLOWUP] fix: add omitempty tag by @advancedxy in #565
- [#545][FOLLOWUP] update rbac rule for webhook by @advancedxy in #566
- [#571] feat: skip init for empty writable base dir by @advancedxy in #573
- [#567] feat: add a shuffle-server metric about read_used_buffer_size by @zuston in #568
- [MINOR] test: fix assertion in tests by @kaijchen in #574
- [#575] refactor: replace switch-case with EnumMap in ComposedClientReadHandler by @kaijchen in #570
- [#580] chore: improve CI workflows by @kaijchen in #579
- [#410] feat: support the hot reload of coordinator's configuration by @jerqi in #572
- [MINOR] chore(deps): bump go-restful to 2.16.0 in operator by @dependabot in #577
- [#580] chore: move deploy/kubernetes to a standalone workflow by @kaijchen in #578
- [MINOR] refactor: simplify ShuffleWriteClientImpl#genServerToBlocks() by @kaijchen in #594
- [MINOR] chore: remove duplicated dependency in rss-client-mr by @kaijchen in #599
- [#580] chore: speed up CI workflows by @kaijchen in #602
- [MINOR] chore(operator): bump prometheus/client_golang to 1.11.1 by @dependabot in #601
- [MINOR] docs: correct the format of server_guide doc by @zuston in #608
- [FOLLOWUP] fix: don't recreate base dir if it's already existed by @advancedxy in #622
New Contributors
- @fpkgithub made their first contribution in #266
- @amaliujia made their first contribution in #304
- @rhh777 made their first contribution in #325
- @chong0929 made their first contribution in #344
- @sfwang218 made their first contribution in #357
- @selectbook made their first contribution in #310
- @tsface made their first contribution in #362
- @tiantingting5435 made their first contribution in #360
- @Rembrant777 made their first contribution in #422
- @a-li made their first contribution in #447
- @iwangjie made their first contribution in #488
- @zjf2012 made their first contribution in #480
- @dependabot made their first contribution in #577
Full Changelog: v0.6.1...v0.7.0-rc1