
Intel® Neural Speed v0.1 Release

@kevinintel released this 22 Dec 14:47

Highlights

  • Created the Neural Speed project, spun off from Intel Extension for Transformers

Features

  • Support GPTQ models
  • Enable beam-search post-processing
  • Add MX formats (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4)
  • Refactor the Transformers Extension for Low-bit Inference Runtime based on the latest jblas
  • Support tensor parallelism with jblas and shared memory
  • Improve performance on client CPUs
  • Enable streaming LLM for the runtime
  • Enhance QLoRA on CPU with an optimized dropout operator
  • Add a script for perplexity (PPL) evaluation, illustrated after this list
  • Refine the Python API; see the usage sketch after this list
  • Allow CompileBF16 on GCC 11
  • Support multi-round chat with ChatGLM2
  • Add shift-RoPE-based streaming LLM
  • Enable MHA fusion for LLMs
  • Support AVX_VNNI and AVX2
  • Optimize the QBits backend
  • Support GELU
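
For context on the PPL script: perplexity is the exponentiated average negative log-likelihood of a text under the model. Below is a minimal sketch of that computation using a generic Hugging Face causal LM; the repository's script may differ in windowing, datasets, and runtime backend.

```python
# Hedged sketch of perplexity (PPL) evaluation with a generic Transformers
# causal LM. The repository's evaluation script may differ in details.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model_name: str, text: str, window: int = 512) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tokenizer(text, return_tensors="pt").input_ids

    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:  # need at least one predicted token
            break
        with torch.no_grad():
            # Passing labels makes the model return the mean token NLL as .loss
            loss = model(chunk, labels=chunk).loss
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)
```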
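
The refined Python API keeps a Transformers-style interface while routing low-bit inference to the C++ runtime. A minimal usage sketch, assuming the `AutoModelForCausalLM` front end from Intel Extension for Transformers, which this project was spun off from (the model name, prompt, and generation settings are illustrative):

```python
# Hedged sketch of low-bit generation through the Transformers-like Python
# API; flags shown here are assumptions, not confirmed for this exact release.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # any supported HF causal LM
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit performs weight-only 4-bit quantization and dispatches
# generation to the optimized CPU runtime.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```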

Examples

  • Enable fine-tuning for Qwen-7B-Chat on CPU
  • Enable the Whisper C++ API
  • Apply the STS task to BAAI/BGE models; see the sketch after this list
  • Enable the Qwen graph
  • Enable instruction_tuning Stable Diffusion examples
  • Enable Mistral-7B
  • Enable Falcon-180B
  • Enable Baichuan/Baichuan2 examples
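
The STS (semantic textual similarity) task scores sentence pairs by the cosine similarity of their embeddings. A minimal sketch with a BAAI/BGE model; the model name and the sentence-transformers driver are illustrative, and the repository example may run the task through its own runtime instead.

```python
# Hedged sketch: STS with a BGE embedding model via sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
]
# Normalized embeddings make cosine similarity a plain dot product.
embeddings = model.encode(sentences, normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"STS cosine similarity: {score:.3f}")
```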

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • GCC 13.1, 11.1
  • CentOS 8.4 & Ubuntu 20.04 & Windows 10