released this
19 May 11:03
Major Features and Improvements
- Support tf.int32 dtype using the feature_column API.
- Make the rules of export frequencies and versions the same as the rule of export keys.
- Optimize CUDA kernel implementation in GroupEmbedding.
- Support reading embedding files with mmap and madvise, and with direct I/O.
- Add double check in find_wait_free of lockless dense hashmap.
- Change Embedding init value of version in EV from 0 to -1.
- Keep the 'GetSnapshot()' interface backward compatible.
- Implement CPU GroupEmbedding lookup sparse Op.
- Make GroupEmbedding compatible with sequence feature_column interface.
- Fix sp_weights indices calculation error in GroupEmbedding.
- Add group_strategy to control parallelism of group_embedding.
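
As a rough illustration of the mmap and madvise item above, the sketch below reads a file through a read-only memory mapping and passes an access-pattern hint to the kernel. It uses only Python's standard `mmap` module and is not DeepRec's implementation; the helper names `read_with_mmap` and `demo` are hypothetical, and the hint is skipped on platforms where `madvise` is unavailable.

```python
import mmap
import os
import tempfile


def read_with_mmap(path, sequential=True):
    """Return the file's bytes, read through a read-only memory mapping."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return b""  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # madvise is only a hint to the kernel. It exists on Python 3.8+
            # and only on platforms that provide the madvise system call.
            name = "MADV_SEQUENTIAL" if sequential else "MADV_RANDOM"
            advice = getattr(mmap, name, None)
            if advice is not None and hasattr(mm, "madvise"):
                mm.madvise(advice)
            return mm[:]  # copy the mapped pages into a bytes object


def demo():
    """Write a small temporary file and check that the mapped read matches."""
    payload = bytes(range(256)) * 16
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        return read_with_mmap(path) == payload
    finally:
        os.remove(path)
```

A sequential hint suits a full scan of a file; a random hint (`sequential=False`) is the usual choice when only scattered offsets are touched.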
Graph & Grappler Optimization
- Support SparseTensor as placeholder in Sample-awared Graph Compression.
- Add Dice fusion grappler and ops.
- Enable MKL Matmul + Bias + LeakyRelu fusion.
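
For readers unfamiliar with Dice, the activation targeted by the Dice fusion item above, here is a minimal reference sketch of the formula as it is usually stated: a sigmoid gate over the batch-normalized input that blends `x` with `alpha * x`. It is plain Python for one feature over one batch, with `alpha` passed as a fixed number for simplicity. It only illustrates the math; how the fused grappler pass and ops are implemented in DeepRec is not shown here.

```python
import math


def dice(xs, alpha=0.0, eps=1e-9):
    """Apply Dice to a list of floats treated as one batch of a single feature."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    out = []
    for x in xs:
        # Gate p(x): sigmoid of the batch-normalized input.
        p = 1.0 / (1.0 + math.exp(-(x - mean) / math.sqrt(var + eps)))
        # Blend the identity branch with the alpha-scaled branch.
        out.append(p * x + (1.0 - p) * alpha * x)
    return out
```

Unfused, this is a chain of several elementwise ops (normalize, sigmoid, multiply, add), which is why it is a natural candidate for fusion into a single op.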
Runtime Optimization
- Avoid unnecessary polling in EventMgr.
- Reduce lock cost and memory usage in EventMgr when using multi-stream.
Ops & Hardware Acceleration
- Register GPU implementation of int64 type for Prod.
- Register GPU implementation of string type for Shape, ShapeN and ExpandDims.
- Optimize list of GPU SegmentReductionOps.
- Optimize zeros_like_impl by reducing calls to convert_to_tensor.
- Implement GPU version of SparseSlice Op.
- Delay Reshape when rank > 2 in keras.layers.Dense so that post op can be fused with MatMul.
- Implement setting max_num_threads hint to oneDNN at compile time.
- Implement TensorPackTransH2DOp to improve SmartStage performance on GPU.
- Add tensor shape meta-data support for ParquetDataset.
- Add arrow BINARY type support for ParquetDataset.
- Add Dice fusion to inference mode.
- Enable INFERENCE_MODE in processor.
- Support TensorRT 8.x in Inference.
- Add a configuration field to control whether TensorRT is enabled.
- Add flag for device_placement_optimization.
- Avoid clustering feature-column-related nodes when TensorRT is enabled.
- Optimize inference latency when loading an incremental checkpoint.
- Optimize performance by placing only TensorRT ops on the GPU device.
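
One way to read the keras.layers.Dense item in the list above: for inputs of rank greater than 2, the layer flattens to 2-D, runs MatMul, and reshapes back. If the reshape sits between MatMul and the bias/activation, the post-op cannot be fused with MatMul; moving the reshape to the end gives the same result with the post-op adjacent to MatMul. The pure-Python sketch below demonstrates that equivalence on nested lists. All function names are hypothetical, ReLU stands in for an arbitrary post-op, and this is an assumption about the intent of the change, not DeepRec's code.

```python
def matmul(a, w):
    """Multiply an [m, k] matrix by a [k, n] matrix (lists of lists)."""
    return [[sum(r[i] * w[i][j] for i in range(len(w)))
             for j in range(len(w[0]))] for r in a]


def bias_relu(rows, b):
    """Add bias b to every row, then apply ReLU (the post-op of MatMul)."""
    return [[max(v + bj, 0.0) for v, bj in zip(r, b)] for r in rows]


def unflatten(rows, batch, steps):
    """Reshape [batch * steps, n] back to [batch, steps, n]."""
    return [rows[i * steps:(i + 1) * steps] for i in range(batch)]


def dense_reshape_first(x, w, b):
    """MatMul -> Reshape -> BiasAdd + ReLU: the reshape separates the post-op."""
    batch, steps = len(x), len(x[0])
    flat = [row for sample in x for row in sample]
    y = unflatten(matmul(flat, w), batch, steps)
    return [bias_relu(sample, b) for sample in y]


def dense_reshape_delayed(x, w, b):
    """MatMul -> BiasAdd + ReLU -> Reshape: the post-op sits next to MatMul."""
    batch, steps = len(x), len(x[0])
    flat = [row for sample in x for row in sample]
    return unflatten(bias_relu(matmul(flat, w), b), batch, steps)
```

Both functions perform the same arithmetic in the same order, so their outputs match exactly; only the position of the reshape differs.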
Environment & Build
- Support CUDA 12.
- Move third-party dependencies from WORKSPACE to workspace.bzl.
- Update urls corresponding to colm, ragel, aliyun-oss-sdk and uuid.
- Fix constant op placing bug for device placement optimization.
- Fix NaN issue in the group_embedding API.
- Fix incompatibility between SOK and variables.
- Fix memory leak when updating the full model in serving.
- Fix the issue of 'cols_to_output_tensors' not being set in GroupEmbedding.
- Fix core dump when saving GPU EmbeddingVariable.
- Fix CUDA resource issue in KvResourceImportV3 kernel.
- Fix a bug in loading signature_def with coo_sparse and add a unit test.
- Fix the bug that the training ends early when the workqueue is enabled.
- Fix the control edge connection issue in device placement optimization.
- Modify GroupEmbedding related function usage.
- Update masknet example with layernorm.
Tool & Documents
- Add a tool for removing filtered features from a checkpoint.
- Add Arm Compute Library (ACL) user documents.
- Update Embedding Variable document to fix initializer config example.
- Update GroupEmbedding document.
- Update processor documents.
- Add user documents for Intel AMX.
- Add TensorRT usage documents.
- Update documents for ParquetDataset.
For more details on these features, see https://deeprec.readthedocs.io/zh/latest/
Release Images
CPU Image
GPU Image