Releases: sophgo/tpu-mlir
TPU-MLIR v1.7 Release
Change Log
New Features
- Added support for new operations including flash attention, custom op dynamic compile, and tpulang ops.
- Enabled AttnReorder and added support for dynamic indices in ops like onehot, scatterelements, and cumsum.
- Added --dump_dataframe option for bmodel_checker and support for transpose with order [1, 2, 3, 0].
- Introduced Watchpoint feature to TDB and added support for mixed-precision networks.
- Implemented optimizations for dma efficiency of flash attention and optimized backend for various models.
- Added support for local memory dump in pcie mode and added various quantization features like eva quant, swin quant, and detr quant.
- Enhanced multi-core support including support for LayerNorm and GroupNorm in coreParallel, and multi-core data slice in tensorLocation.
- Added new patterns for Cswin and Einsum operations.
- Improved support for LLM (Large Language Models) in bm1688.
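As an illustration of the transpose order [1, 2, 3, 0] mentioned in the list above, the following pure-Python sketch shows what that permutation does to a 4-D tensor stored in row-major order. It assumes the usual convention that output axis i takes input axis order[i]; the helper names `permute_shape` and `transpose` are hypothetical and are not part of TPU-MLIR.

```python
from itertools import product

def permute_shape(shape, order):
    """Shape after a transpose: output axis i takes input axis order[i]."""
    return tuple(shape[axis] for axis in order)

def transpose(flat, shape, order):
    """Transpose a row-major flat list; returns (new_flat, new_shape)."""
    out_shape = permute_shape(shape, order)
    # Row-major strides of the input tensor.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    out = []
    for out_idx in product(*(range(d) for d in out_shape)):
        # out_idx[i] is the coordinate along input axis order[i].
        offset = sum(out_idx[i] * strides[order[i]] for i in range(len(order)))
        out.append(flat[offset])
    return out, out_shape

shape = (2, 3, 4, 5)
data = list(range(2 * 3 * 4 * 5))
result, new_shape = transpose(data, shape, [1, 2, 3, 0])
print(new_shape)  # (3, 4, 5, 2)
```

With this order the leading axis moves to the innermost position, so consecutive output elements are one full input "batch" apart (60 elements in this example).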
Bug Fixes
- Fixed various bugs including kernel_module msg_id, SAM-VIT-encoder regression, and attention accuracy problems.
- Addressed logical issues in AddToScale pattern and issues in fp_forward.
- Resolved bugs in model info core dump, op's liveRange in coreParallel, and DevParallel bugs.
- Fixed issues in model combine with io alone and bugs in various ops like interp, RotaryPosEmbPattern, and efficient-lite4 permute.
Performance Improvements
- Improved the performance of TDB and the bmodel_checker for 1684x pcie.
- Optimized facenet and fixed performance issues of 1688 multicore.
- Enabled single-core mode optimizations where necessary.
Documentation and Testing
- Updated documentation, refined custom chapters, and ensured consistency in quick start docs.
- Added test cases for custom tpulang, multi-core with subnets, and custom cpuop.
- Fixed various documentation errors and updated the release note.
Other Changes
- Added restrictions to tpulang ops and net test cases.
- Adjusted descriptions and refined interfaces for better user experience.
- Updated backend .so files and addressed sensitive words in the codebase.
- Added support for int4 dtype in tpu_profile and ensured tool/scripts work in Python virtual environments.
Technical Preview
Features
- Added support for LLM Decoding by utilizing multi-cores to enhance processing efficiency.
- Introduced fx2mlir, a new functionality for enhanced MLIR conversion.
- Implemented nnvlc2.0 and nnvlc1.0 local activation and weight operations, respectively, for improved neural network performance.
- Enabled TPULANG support for operations like sort, argsort, and additional ops, enhancing the language's functionality and flexibility.
- Added cv186x support in run_sensitive_layer.py and for the TDB, expanding compatibility and debugging capabilities.
- Introduced new ops and features like Watchpoint in TDB and activation ops support for scale & zero_point, broadening the range of functionalities available in the tpu-mlir project.
- Supports BM1690.
- L2mem performs intermediate data exchange for active tensor.
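The scale & zero_point support mentioned in the list above refers to quantization parameters. As general background only, a minimal sketch of generic affine quantization follows; it is not TPU-MLIR's API. The function names, the int8 range, and the use of Python's built-in rounding are assumptions made for illustration.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an integer: q = round(x / scale) + zero_point, clamped.

    Assumption: Python's round() (round-half-to-even) stands in for whatever
    rounding mode a real backend uses.
    """
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Recover an approximate real value: x ~ (q - zero_point) * scale."""
    return (q - zero_point) * scale

q = quantize(1.0, scale=0.5, zero_point=3)
x = dequantize(q, scale=0.5, zero_point=3)
print(q, x)  # 5 1.0
```

Values outside the representable range saturate at qmin or qmax, which is why the choice of scale matters for accuracy.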
Bug Fixes
- Resolved a variety of bugs affecting backend processes, including issues with the 1684x backend, permutefuse2, permutemulconstswap, and more, improving overall stability and performance.
- Fixed several critical issues across tpulang, including errors in sort_by_key operation, reshape operations, where operation, and more, enhancing the language's reliability for developers.
- Addressed bugs in model processing, including fixes for concat logic, scale2conv, scale2conv3d, instance norm, and several more, ensuring smoother model optimization and execution.
- Corrected errors in the documentation, providing clearer and more accurate information for users and developers.
Documentation Updates
- Updated tpulang documentation to include new functionalities and optimizations, making it easier for users to understand and utilize the language effectively.
Performance Improvements
- Optimized TDB and bmodel_checker for 1684x pcie mode, significantly reducing processing times and enhancing efficiency for model analysis.
- Improved the efficiency of DMA in flash attention operations, ensuring faster data handling and processing.
- Enabled IO tag mode and refined address mode for better memory management and operational flexibility.
TPU-MLIR v1.6.1
Full Changelog: v1.6...v1.6.1
TPU-MLIR v1.6 release
Change Log
Bug Fixes
- Fixed documentation errors and added checks for documentation errors during build.
- Set workaround for ar.copy cycle issue to 0, avoiding potential data overwriting in inplacing operations.
- Addressed a bug in Caffe DetectionOutput and fixed a hang in cv186x.
- Corrected Mul buffer size alignment issues and various other buffer size corrections.
- Fixed issues with attention accuracy, RotaryPosEmbPattern, and op status validation before the matching process.
- Addressed a series of backend bugs, including daily build errors, performance declines, and incorrect return values.
- Fixed data_checker issues, api_conv bug, and a local slice calculation bug.
- Resolved incorrect affineMap for Pooling buffer and fixed reshape bug for inner products.
- Corrected Mul&Div dynamic support for local operations and fixed issues with Conv2d buffer size calculations.
- Addressed various matmul bugs, including fp8 support issues and quantization inconsistencies.
Features
- Enabled multicore optimizations and added support for multi-core model tests.
- Updated libbackend_1688.so and various backend updates for better performance and compatibility.
- Introduced groupParallel operation, support for dynamic input data generation.
- Added support for new patterns such as Permute fuse pattern and splitQuantizedMLP pattern.
- Implemented npz compare visualizer tool and added support for bm1688 backend.
- Added MatMul weight split case and improved permute performance.
- Added support for img2col pattern, attention interface, and several dialects for SG2260 operations.
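For readers unfamiliar with the img2col idea named in the list above, here is a minimal sketch of the general technique: each sliding window of the input is unfolded into a row, so the convolution becomes a matrix-vector product. It assumes a single channel, stride 1, and no padding; the function names are hypothetical and do not describe how the TPU-MLIR pattern is implemented.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw window (stride 1, no padding) into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_via_im2col(image, kernel):
    """Cross-correlation (deep-learning style 'conv') via im2col + dot products."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    cols = im2col(image, kh, kw)
    out_w = len(image[0]) - kw + 1
    flat_out = [sum(a * b for a, b in zip(row, flat_kernel)) for row in cols]
    # Fold the flat result back into a 2-D output map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_via_im2col(image, kernel))  # [[6, 8], [12, 14]]
```

The trade-off is extra memory for the unfolded matrix in exchange for reusing an optimized matrix multiply.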
Documentation Updates
- Updated release notes and resolved issues with document formatting.
- Standardized expression terminology and replaced sensitive words in documentation.
Performance Improvements
- Improved local softmax performance and optimized dataFlow checking in coreMatch.
- Enhanced performance for Vit L i8 4 batch operations and refined conv multi-core handling.
- Optimized VIT-B concurrency and addressed performance issues with MaxPool buffer sizes.
v1.6-beta.0
New Features
- Implemented SG2260 structureOp interface and structured transform, including a solver for finding transforms.
- Added OneHot converter and support for fp8 in the debugger.
- Supported MatMulOp for special cases broadcast in batch dims and added interface for attention.
- Provided "decompose linalg op" and "tile+fuse" pass for MatMul parallel supports more batch patterns.
- Unet single block test added.
- Implemented fp8 support for Matmul and other ops including addconst, subconst, mul, add, sub, and abs.
Performance Improvements
- Improved Matmul fp8 performance with new backend support.
- Enabled distribute MLP and attention with improved performance for cascade_net input/output names and order.
- Refactored tdb to improve disassembler serialize and resolve BM1688 decoding issue.
- Improved weight reorder for ConvOp and optimized permute of attention matmul.
Bug Fixes
- Resolved various bugs in MatMul, Conv, and other ops across multiple chipsets including SG2260, BM1688, and CV18xx.
- Fixed bugs related to ReduceOp, ArgOp, SliceOp, and others for better operation and tensor handling.
- Addressed issues in SAM, daily test, and tdb related to core operations and functionality.
- Fixed memory and data handling bugs for more accurate and stable operation of the models.
Documentation Updates
- Updated documentation to remove sensitive words and improve clarity and comprehensiveness.
Miscellaneous
- Enhanced various backend libraries and supported new ops and patterns for more efficient and versatile model handling.
- Improved scatterE and reduce dynamic shape_value handling for better model optimization.
- Refinements in graph optimization, permute parallel indexMapping, and related areas for improved model processing.
Technical Preview
TPU-MLIR Project Update
Bug Fixes and Dependency Updates
- Fix Dependency: Fixed the dependency of MLIRInputConversion.
- SDK Release Workflow: Fixed tpu-mlir tag for building and added workflow file for SDK release.
- Softplus LoweringINT8: Fixed 1684 Softplus LoweringINT8 issue.
- Slice Begin Index: Fixed bm1684 slice begin_index problem.
- Mul Conflict Resolution: Partially fixed the output data sign of mul conflict with chip restriction.
Feature Enhancements and Support
- Subgraph Split Support: Enhanced support for subgraph split.
- Quant IO List Note: Added quant io list note for better quantization handling.
- New Full Operation: Supported the aten::new_full operation.
- Torch Flip for bm1684x: Added support for torch.flip for bm1684x.
- Weight Input Shape Bind: Supported shape bind for weight input.
Updates and Implementations for Specific Operations
- Backend Update for sg2260: Updated sg2260 for backend for tag31.
- ScatterElements Implementation: Implemented ScatterElements for any axis.
- Unary Indexing Map: Added unary indexing map.
- Binary Indexing Map: Added binary (add/sub/mul/div/min/max) indexing map.
- Dynamic NMS Support: Featured support for dynamic nms for bm1684x.
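To make "ScatterElements for any axis" in the list above concrete, the sketch below is a small reference implementation of the ONNX ScatterElements semantics, restricted to 2-D nested lists and without a reduction mode. It is illustrative only: the function name is hypothetical and this is not the TPU-MLIR backend code.

```python
import copy

def scatter_elements_2d(data, indices, updates, axis=0):
    """ONNX-style ScatterElements on 2-D nested lists (no reduction).

    For every position (i, j) in `indices`, the output element whose
    coordinate along `axis` is indices[i][j] receives updates[i][j].
    """
    if axis < 0:
        axis += 2
    out = copy.deepcopy(data)
    for i, index_row in enumerate(indices):
        for j, idx in enumerate(index_row):
            if axis == 0:
                out[idx][j] = updates[i][j]
            else:
                out[i][idx] = updates[i][j]
    return out

zeros = [[0.0] * 3 for _ in range(3)]
indices = [[1, 0, 2], [0, 2, 1]]
updates = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
print(scatter_elements_2d(zeros, indices, updates, axis=0))
# [[2.0, 1.1, 0.0], [1.0, 0.0, 2.2], [0.0, 2.1, 1.2]]
```

Only the coordinate along `axis` is taken from `indices`; every other coordinate is the position of the update itself, which is what distinguishes the axis-0 and axis-1 cases.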
Codebase and Documentation Refinements
- Cleanup: Removed test/sg2260 dialect.
- Documentation Update: Updated nntoolchain README and lib.
- Codegen Documentation: Added documentation for codegen.
- Template Format Update: Updated import mlir file template format.
- Quick Start Docs Modification: Modified quick start docs for tpu-mlir.
Optimizations and Performance Improvements
- Kernel Module Usage: Reverted to using the old kernel module.
- MLIR Conv2D Optimization: Improved 1684 mlir conv2d with 3ic optimization.
- SWINT Quantization: Added swint quant for better performance.
- Opt Parameter Addition: Added an optimization parameter.
- Loop and Fusion Enhancements: Supported interchange of inner loop, padOp transform, tensor op collapse, fusion on linalg-on-tensor, etc.
Technical Preview
🐳 Docker Image Update
Changed required Docker image from sophgo/tpuc_dev:v2.2 to sophgo/tpuc_dev:v3.1, which is based on Ubuntu 22.04.
📖 Documentation
Updated docs to add more parameters in model deployment.
🐛 Bug Fixes
Fixed TPU-MLIR dialect Python binding for DEBUG mode.
Resolved backward training bug.
Addressed average pooling and max pooling issues.
Several other bug fixes related to Winograd inference, training, and more.
🚀 Feature Additions
Added Deconv3D backend support.
Support for dynamic tile added for bm1684x.
Added Winograd feature.
Several other feature additions, including dual-core support in debugger, MatMulSliceMerge support for int8/int4, and more.
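These notes do not describe the Winograd feature further. As general background, the sketch below shows the textbook 1-D Winograd F(2,3) transform, which produces two outputs of a 3-tap filter with four multiplications instead of six. The function names are hypothetical, and nothing here is taken from the TPU-MLIR implementation.

```python
def direct_f2_3(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, computed directly (6 multiplies)."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f2_3(d, g):
    """The same two outputs using the Winograd F(2,3) minimal-filtering form."""
    # Filter transform: depends only on the weights, so it can be precomputed.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four element-wise multiplications on the transformed input.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 1.0, 1.0]
print(direct_f2_3(d, g), winograd_f2_3(d, g))  # [6.0, 9.0] [6.0, 9.0]
```

The saving in multiplications comes at the cost of extra additions and some sensitivity to rounding, which is one reason such a feature tends to be optional.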
🔧 Code Maintenance
Code renaming and cleaning.
Regression adjustments and tests.
⚙️ Backend Updates
Backend updates for various architectures including BM1684 and sg2260.
Technical Preview
New Features and Enhancements
- Support for Various Operations: Added support for exp, erf, gelu, loopop, and other operations for specific platforms.
- Tooling and Visualization: Renamed profile.py, added visual tools for weights, and enhanced debugging capabilities.
- Model Support and Adjustments: Added daily release models, scripts, and support for specific model types like yolov8, yolov4s.
- Distribution and Multicore Support: Implemented distribution steps, multicore support, and group convolution transformation.
Bug Fixes and Resolutions
- Model and Parsing Fixes: Resolved issues in emvd models, parsing errors, slice bugs, and fixed specific issues in bm1684 and bm1686.
- Codegen and Canonicalization Fixes: Addressed type errors, canonicalization failures, and operand kind checks.
- Inference and Optimization Fixes: Fixed inference issues in max, where, and slice operations, and refined canonicalization.
Documentation and Cleanup
- Documentation Updates: Refined tpu-mlir docs, added supported ops document, and updated specific documents.
- Code Cleanup and Refactoring: Removed unnecessary code, reconstructed permute move canonicalization, and prepared for LLVM upgrade.
Other Changes
- Testing and Calibration: Added test cases, calibration updates, and support for regression and tag in TDB.
- Backend and Runtime Adjustments: Updated backend, added support for auto-increase op, and fixed minor bugs.
Technical Preview
Features:
BM1686: support post handle op, provided parallelOp codegen, add DivOp for f16/bf16.
BM1684: Support dynamic compilation load tensor from L2mem, implement GROUP_3D local layer function, support more dynamic ops, like MinConst, MaxConst, Lut; and some static ops, like deform_conv2d.
CV18XX: Support more ops like equalOp.
Support IfOp for f16/bf16/int8 mode.
Implement post process function of sensitive layer, unranked tensor and dynamic tensor at frontend, add empty and baddbmm torch converter/interpreter.
Support weight split when layer group if op is broadcastbinary, support parsing the ops of each layer in top.mlir, support int32 to i/u8 inference for model_runner.py.
Remove onnx-sim and use unranked_type for all ops.
Implement more graph optimizations: merge matmul + add to matmul if float type, fuse same operation pass, weight trans when permute+add.
Support more torch ops, like rmsnorm, ceil, remainder.
Other new operations: lowering of GatherElements, multi-input Add.
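One plausible reading of the "merge matmul + add to matmul" optimization in the list above is that a constant Add following a MatMul is folded into the MatMul's bias. The sketch below only demonstrates that the two forms compute the same result; the function names, and the assumption that the added operand is a per-column constant, are illustrative and not taken from the TPU-MLIR source.

```python
def matmul(x, w, bias=None):
    """Plain matrix multiply on nested lists, with an optional per-column bias."""
    cols = len(w[0])
    out = []
    for row in x:
        out_row = []
        for c in range(cols):
            acc = sum(row[k] * w[k][c] for k in range(len(w)))
            if bias is not None:
                acc += bias[c]
            out_row.append(acc)
        out.append(out_row)
    return out

def add_bias(y, b):
    """Separate Add op: broadcast a per-column vector over the rows of y."""
    return [[v + b[c] for c, v in enumerate(row)] for row in y]

x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
b = [10.0, 20.0, 30.0]

unfused = add_bias(matmul(x, w), b)   # two ops: MatMul followed by Add
fused = matmul(x, w, bias=b)          # one op: MatMul with bias
print(unfused == fused, fused)        # True [[11.0, 22.0, 38.0], [13.0, 24.0, 48.0]]
```

Fusing removes one op and one intermediate tensor from the graph while leaving the numerical result unchanged in this float example.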
Bug Fixes:
Fix chatglm2 rmsnorm untransformed problem, ScaleOp inference error, bmodel_dis format bin, shape inference of matmul, and subnet output order mismatch that caused an error in dynamic runtime.
Avoid duplicate name of inserted CastOp, distinguish caffe matmul shape.
Code Refactoring:
Use llvm::md5, llvm::sha256.
Use Clang to speed up code compilation.
Remove some unused header files.
Use rewriter.eraseOp instead of op->erase, use string to define padding mode.
Refine disassembler, refactor mix_precision.
Documentation Updates:
Update document version and change some model-zoo requirements.
Modified English part and modified developer_manual doc for visual.py part.
Testing and Verification:
Updated list of test models supported by BM1684X.
Technical Preview
Features:
Supported 'Conv3D', 'Pool3D', 'Pow2(n^x)', 'Softplus', 'GRU', 'Scale' for BM1684, more models available like wenet-encoder.
Supported some operations like 'DictConstruct', 'Sub', 'Ones_like', 'Zeros_like', 'ChannelShuffle', 'Activation', 'Conv3d', 'Compare', 'GroupNorm', 'InstanceNorm', 'Clamp' in PyTorchConverter.
New ONNX operations in OnnxConverter, like 'GridSample', 'CompareCst'.
Supported more dynamic operations like 'Arg', 'Active', 'Reduce', 'Min', 'Max' for BM1684.
Add depth2space to backward pass, 1684x yolov5 postprocess, CopyMultiUseWeight pattern before shape_infer.
Improved the type check logic for previous subnets, added some parallelism in learning quant.
Bug Fixes:
Fixed runtime issues: weight display problem in visual tool, model_deploy failure when --test_reference is none.
BM1684: fix8b large dilation weight reorder, MulConst, AddConst, SubConst local buffer size, mulshift local buffer.
BM1684X: 5dim broadcast add, attention and utest bug, scatternd support 5dim, YoloDetection inference bug, strideslice op need begin_mask/end_mask for dynamic shape.
CV18XX: fix gray fuse preprocess, fix TgScaleLutKernel pass.
OnnxConverter: convert_add_op fix broadcast channel when r_dim is 1, infer subgraph to get shape and fix attr:'axes' not in squeeze.
Others: fix sdk demo problem, hang caused by assert in cmodel, group overlap tensor id error, fix python array with random data.
Code Refactoring:
Redesigned subnet splitting, sorting, merging and running order.
Refine 18xx codegen, conv quantization, gather lowering and debugger's dictionary-structure.
Rename bdc to tiu.
Reset pattern of onnx subconst op.
Simplify layernorm to single output.
Documentation Updates:
Fix quick_start typo.
Update yolov3_tiny output_names.
Refine yolov5 postprocess chapter, cv18xx quick start doc.
Testing and Verification:
Update yolov3 regression test, bayer2RGB model sample, squeezenet_v1.1_cf.
Save a copy of bert_base 2.11 version config for cali.
Add timeout checkout and model test timeout for test.
Add many cv18xx model regression.
Align cv18xx detect samples and YOLODetection Func.