
Intel® Neural Speed v0.1 Release

@kevinintel released this 22 Dec 14:47

Highlights

  • Created the Neural Speed project, spun off from Intel Extension for Transformers

Features

  • Support GPTQ models
  • Enable beam-search post-processing
  • Add MX formats (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4)
  • Refactor the Transformers Extension for Low-bit Inference Runtime based on the latest jblas
  • Support tensor parallelism with jblas and shared memory
  • Improve performance on client CPUs
  • Enable streaming LLM for the runtime
  • Enhance QLoRA on CPU with an optimized dropout operator
  • Add a script for perplexity (PPL) evaluation, illustrated after this list
  • Refine the Python API; see the usage sketch after this list
  • Allow CompileBF16 on GCC 11
  • Support multi-round chat with ChatGLM2
  • Add shift-RoPE-based streaming LLM
  • Enable MHA fusion for LLMs
  • Support AVX_VNNI and AVX2
  • Optimize the QBits backend
  • Support GELU
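
For context on the PPL script: perplexity is the exponentiated average negative log-likelihood of a text under the model. Below is a minimal sketch of that computation using a generic Hugging Face causal LM; the repository's script may differ in windowing, datasets, and runtime backend.

```python
# Hedged sketch of perplexity (PPL) evaluation with a generic Transformers
# causal LM. The repository's evaluation script may differ in details.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model_name: str, text: str, window: int = 512) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    ids = tokenizer(text, return_tensors="pt").input_ids

    total_nll, total_tokens = 0.0, 0
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:  # need at least one predicted token
            break
        with torch.no_grad():
            # Passing labels makes the model return the mean token NLL as .loss
            loss = model(chunk, labels=chunk).loss
        total_nll += loss.item() * (chunk.size(1) - 1)
        total_tokens += chunk.size(1) - 1
    return math.exp(total_nll / total_tokens)
```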
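
The refined Python API keeps a Transformers-style interface while routing low-bit inference to the C++ runtime. A minimal usage sketch, assuming the `AutoModelForCausalLM` front end from Intel Extension for Transformers, which this project was spun off from (the model name, prompt, and generation settings are illustrative):

```python
# Hedged sketch of low-bit generation through the Transformers-like Python
# API; flags shown here are assumptions, not confirmed for this exact release.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # any supported HF causal LM
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# load_in_4bit performs weight-only 4-bit quantization and dispatches
# generation to the optimized CPU runtime.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```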

Examples

  • Enable fine-tuning for Qwen-7B-Chat on CPU
  • Enable the Whisper C++ API
  • Apply the STS task to BAAI/BGE models; see the sketch after this list
  • Enable the Qwen graph
  • Enable instruction_tuning Stable Diffusion examples
  • Enable Mistral-7B
  • Enable Falcon-180B
  • Enable Baichuan/Baichuan2 examples
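
The STS (semantic textual similarity) task scores sentence pairs by the cosine similarity of their embeddings. A minimal sketch with a BAAI/BGE model; the model name and the sentence-transformers driver are illustrative, and the repository example may run the task through its own runtime instead.

```python
# Hedged sketch: STS with a BGE embedding model via sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
]
# Normalized embeddings make cosine similarity a plain dot product.
embeddings = model.encode(sentences, normalize_embeddings=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"STS cosine similarity: {score:.3f}")
```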

Validated Configurations

  • Python 3.9, 3.10, 3.11
  • GCC 13.1, 11.1
  • CentOS 8.4 & Ubuntu 20.04 & Windows 10