Merge remote-tracking branch 'upstream/master' into mitruska/default_…
…for_empty_reduce
mitruska committed Nov 15, 2024
2 parents 946022e + 3e63de0 commit 99c1e89
Showing 30 changed files with 725 additions and 661 deletions.
1 change: 1 addition & 0 deletions .github/actions/common/constants.py
@@ -16,6 +16,7 @@ class EventType(Enum):
'public_linux_ubuntu_24_04_x86_64_release',
'public_windows_vs2019_Release',
'public_windows_vs2019_Debug',
'public_manylinux2014_x86_64_release',
)
ProductType = Enum('ProductType', {t.upper(): t for t in productTypes})
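The functional `Enum` call above derives member names by upper-casing each product-type string. A minimal sketch of what the added entry yields, assuming a shortened `productTypes` tuple (the real tuple in `constants.py` is longer and is not reproduced here):

```python
from enum import Enum

# Illustrative subset only; the actual tuple in constants.py has more entries.
productTypes = (
    'public_windows_vs2019_Release',
    'public_windows_vs2019_Debug',
    'public_manylinux2014_x86_64_release',
)

# Same construction as in the diff: member name = upper-cased string, value = original string.
ProductType = Enum('ProductType', {t.upper(): t for t in productTypes})

# Access the newly added product type by its generated member name.
new_type = ProductType.PUBLIC_MANYLINUX2014_X86_64_RELEASE
print(new_type.value)

# Lookup by value works as well and returns the matching member.
by_value = ProductType('public_windows_vs2019_Release')
print(by_value.name)
```

Because names are upper-cased, mixed-case values such as `public_windows_vs2019_Release` remain recoverable through `.value`.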

2 changes: 1 addition & 1 deletion .github/workflows/job_tensorflow_layer_tests.yml
@@ -30,7 +30,7 @@ env:
jobs:
TensorFlow_Layer_Tests:
name: TensorFlow Layer Tests
timeout-minutes: 30
timeout-minutes: 45
runs-on: ${{ inputs.runner }}
container: ${{ fromJSON(inputs.container) }}
defaults:
47 changes: 46 additions & 1 deletion .github/workflows/manylinux_2014.yml
@@ -88,6 +88,7 @@ jobs:
options: -e SCCACHE_AZURE_BLOB_CONTAINER -e SCCACHE_AZURE_CONNECTION_STRING -e DOCKER_CONFIG -v ${{ github.workspace }}:${{ github.workspace }}
env:
CMAKE_BUILD_TYPE: 'Release'
ARCH: 'x86_64'
OPENVINO_REPO: ${{ github.workspace }}/src
INSTALL_DIR: ${{ github.workspace }}/install/openvino
INSTALL_WHEELS_DIR: ${{ github.workspace }}/install/wheels
@@ -99,6 +100,9 @@ jobs:
SCCACHE_SERVER_PORT: 35555
SCCACHE_CACHE_SIZE: 50G
SCCACHE_AZURE_KEY_PREFIX: manylinux_2014
ARTIFACTS_SHARE: "/mount/build-artifacts"
MANIFEST_PATH: ${{ github.workspace }}/manifest.yml
PRODUCT_TYPE: public_manylinux2014_x86_64_release

steps:
- name: Clone OpenVINO
@@ -109,6 +113,17 @@ jobs:

- name: System info
uses: ./src/.github/actions/system_info

- name: Generate product manifest and set CI_BUILD_NUMBER & CI_BUILD_DEV_TAG
id: create_manifest
uses: ./src/.github/actions/create_manifest
with:
repos: |
${{ env.OPENVINO_REPO }}
product_type: ${{ env.PRODUCT_TYPE }}
target_arch: ${{ env.ARCH }}
build_type: ${{ env.CMAKE_BUILD_TYPE }}
save_to: ${{ env.MANIFEST_PATH }}

- name: Create docker build cache
run: |
@@ -128,6 +143,8 @@ jobs:
-e SCCACHE_AZURE_KEY_PREFIX \
-e CMAKE_CXX_COMPILER_LAUNCHER \
-e CMAKE_C_COMPILER_LAUNCHER \
-e CI_BUILD_NUMBER \
-e CI_BUILD_DEV_TAG \
-w /work/src \
${{ fromJSON(needs.docker.outputs.images).ov_build.manylinux2014_x86_64 }} \
/bin/bash -c "
@@ -158,6 +175,8 @@ jobs:
-e SCCACHE_AZURE_KEY_PREFIX \
-e CMAKE_CXX_COMPILER_LAUNCHER \
-e CMAKE_C_COMPILER_LAUNCHER \
-e CI_BUILD_NUMBER \
-e CI_BUILD_DEV_TAG \
-w /work/src \
${{ fromJSON(needs.docker.outputs.images).ov_build.manylinux2014_x86_64 }} \
/bin/bash -c "
@@ -188,4 +207,30 @@ jobs:
with:
name: openvino_wheels
path: ${{ env.INSTALL_WHEELS_DIR }}/wheels/*.whl
if-no-files-found: 'error'
if-no-files-found: 'error'

- name: Store artifacts to a shared drive
id: store_artifacts
if: ${{ always() }}
uses: ./src/.github/actions/store_artifacts
with:
artifacts: |
${{ env.BUILD_DIR }}/openvino_package.tar.gz
${{ env.MANIFEST_PATH }}
${{ env.INSTALL_WHEELS_DIR }}/wheels
storage_dir: ${{ env.PRODUCT_TYPE }}
storage_root: ${{ env.ARTIFACTS_SHARE }}

Overall_Status:
name: ci/gha_overall_status_manylinux2014
needs: [Smart_CI, Build]
if: ${{ always() }}
runs-on: ubuntu-latest
steps:
- name: Check status of all jobs
if: >-
${{
contains(needs.*.result, 'failure') ||
contains(needs.*.result, 'cancelled')
}}
run: exit 1
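The `Overall_Status` job fails only when at least one job listed in `needs` ended with `failure` or `cancelled`. A minimal Python sketch of that condition, assuming job results arrive as a plain list of strings (the function name `overall_status_fails` is hypothetical and not part of the workflow):

```python
def overall_status_fails(results):
    """Mirror the workflow condition: fail if any needed job failed or was cancelled."""
    return 'failure' in results or 'cancelled' in results


# Results of Smart_CI and Build, in that order (illustrative values).
print(overall_status_fails(['success', 'success']))
print(overall_status_fails(['success', 'failure']))
print(overall_status_fails(['skipped', 'cancelled']))
```

Note that a `skipped` result alone does not trip the check, which matches the expression in the workflow.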
185 changes: 165 additions & 20 deletions docs/articles_en/learn-openvino/llm_inference_guide/genai-guide-npu.rst
@@ -9,70 +9,132 @@ This guide will give you extra details on how to utilize NPU with the GenAI flav
for information on how to start.

Prerequisites
#############
#####################

Install required dependencies:

.. code-block:: console
python -m venv npu-env
npu-env\Scripts\activate
pip install optimum-intel nncf==2.11 onnx==1.16.1
pip install nncf==2.12 onnx==1.16.1 optimum-intel==1.19.0
pip install --pre openvino openvino-tokenizers openvino-genai --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Export an LLM model via Hugging Face Optimum-Intel
##################################################

A chat-tuned TinyLlama model is used in this example. The following conversion & optimization
settings are recommended when using the NPU:
Since **symmetrically-quantized 4-bit (INT4) models are preferred for inference on NPU**, make sure to export
the model with the proper conversion and optimization settings.

.. code-block:: python
| You may export LLMs via Optimum-Intel, using one of two compression methods:
| **group quantization** - for both smaller and larger models,
| **channel-wise quantization** - remarkably effective, but recommended only for models exceeding 1 billion parameters.
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --group-size 128 --ratio 1.0 TinyLlama
You select one of the methods by setting the ``--group-size`` parameter to either ``128`` or ``-1``, respectively. See the following examples:

**For models exceeding 1 billion parameters**, it is recommended to use **channel-wise
quantization** that is remarkably effective. For example, you can try the approach with the
llama-2-7b-chat-hf model:
.. tab-set::

.. tab-item:: Group quantization

.. code-block:: console
:name: group-quant
optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --sym --ratio 1.0 --group-size 128 TinyLlama-1.1B-Chat-v1.0
.. tab-item:: Channel-wise quantization

.. tab-set::

.. tab-item:: Data-free quantization


.. code-block:: console
:name: channel-wise-data-free-quant
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --ratio 1.0 --group-size -1 Llama-2-7b-chat-hf
.. code-block:: python
.. tab-item:: Data-aware quantization

optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 Llama-2-7b-chat-hf
If you want to improve accuracy, make sure you:

1. Update NNCF: ``pip install nncf==2.13``
2. Use ``--scale-estimation --dataset=<dataset_name>`` and activation-aware weight quantization ``--awq``:

.. code-block:: console
:name: channel-wise-data-aware-quant
optimum-cli export openvino -m meta-llama/Llama-2-7b-chat-hf --weight-format int4 --sym --group-size -1 --ratio 1.0 --awq --scale-estimation --dataset=wikitext2 Llama-2-7b-chat-hf
.. important::

Remember that the negative value of ``-1`` is required here, not ``1``.
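The choice between the two methods comes down to the value passed to ``--group-size``. A minimal sketch of assembling the export command in Python, assuming a hypothetical helper `build_export_command`; it only builds the argument list from the flags shown in this guide and does not run `optimum-cli`:

```python
import shlex


def build_export_command(model_id, output_dir, channel_wise=False):
    """Assemble the optimum-cli arguments for symmetric INT4 export.

    channel_wise=True selects channel-wise quantization (--group-size -1);
    otherwise group quantization with a group size of 128 is used.
    """
    group_size = -1 if channel_wise else 128
    return [
        "optimum-cli", "export", "openvino",
        "-m", model_id,
        "--weight-format", "int4",
        "--sym",
        "--ratio", "1.0",
        "--group-size", str(group_size),
        output_dir,
    ]


cmd = build_export_command(
    "meta-llama/Llama-2-7b-chat-hf", "Llama-2-7b-chat-hf", channel_wise=True
)
# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids quoting mistakes around the negative value `-1`.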



You can also try using 4-bit (INT4)
`GPTQ models <https://huggingface.co/models?other=gptq,4-bit&sort=trending>`__,
which do not require specifying quantization parameters:

.. code-block:: console
optimum-cli export openvino -m TheBloke/Llama-2-7B-Chat-GPTQ
| Remember, NPU supports GenAI models quantized symmetrically to INT4.
| Below is a list of such models:
* meta-llama/Meta-Llama-3-8B-Instruct
* microsoft/Phi-3-mini-4k-instruct
* Qwen/Qwen2-7B
* mistralai/Mistral-7B-Instruct-v0.2
* openbmb/MiniCPM-1B-sft-bf16
* TinyLlama/TinyLlama-1.1B-Chat-v1.0
* TheBloke/Llama-2-7B-Chat-GPTQ
* Qwen/Qwen2-7B-Instruct-GPTQ-Int4


Run generation using OpenVINO GenAI
###################################

It is recommended to install the latest available
It is typically recommended to install the latest available
`driver <https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html>`__.

Use the following code snippet to perform generation with OpenVINO GenAI API:
Use the following code snippet to perform generation with OpenVINO GenAI API.
Note that **currently, the NPU pipeline supports greedy decoding only**. This means that
you need to add ``do_sample=False`` **to the** ``generate()`` **method:**

.. tab-set::

.. tab-item:: Python
:sync: py

.. code-block:: python
:emphasize-lines: 4
import openvino_genai as ov_genai
model_path = "TinyLlama"
pipe = ov_genai.LLMPipeline(model_path, "NPU")
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))
print(pipe.generate("The Sun is yellow because", max_new_tokens=100, do_sample=False))
.. tab-item:: C++
:sync: cpp

.. code-block:: cpp
:emphasize-lines: 7, 9
#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
int main(int argc, char* argv[]) {
std::string model_path = "TinyLlama";
ov::genai::LLMPipeline pipe(model_path, "NPU");
std::cout << pipe.generate("The Sun is yellow because", ov::genai::max_new_tokens(100));
ov::genai::GenerationConfig config;
config.do_sample=false;
config.max_new_tokens=100;
std::cout << pipe.generate("The Sun is yellow because", config);
}
Additional configuration options
################################

@@ -88,9 +150,9 @@ user explicitly sets a lower length limit for the response.
You may configure both the 'maximum input prompt length' and 'minimum response length' using
the following parameters:

* ``MAX_PROMPT_LEN``: Defines the maximum number of tokens that the LLM pipeline can process
for the input prompt (default: 1024).
* ``MIN_RESPONSE_LEN``: Defines the minimum number of tokens that the LLM pipeline will generate
* ``MAX_PROMPT_LEN`` - defines the maximum number of tokens that the LLM pipeline can process
for the input prompt (default: 1024),
* ``MIN_RESPONSE_LEN`` - defines the minimum number of tokens that the LLM pipeline will generate
in its response (default: 150).

Use the following code snippet to change the default settings:
@@ -113,10 +175,93 @@ Use the following code snippet to change the default settings:
ov::AnyMap pipeline_config = { { "MAX_PROMPT_LEN", 1024 }, { "MIN_RESPONSE_LEN", 512 } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
Cache compiled models
+++++++++++++++++++++

Specify the ``NPUW_CACHE_DIR`` option in ``pipeline_config`` to make the NPU pipeline
cache the compiled models. This shortens the initialization time of subsequent
pipeline runs:

.. tab-set::

.. tab-item:: Python
:sync: py

.. code-block:: python
pipeline_config = { "NPUW_CACHE_DIR": ".npucache" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
.. tab-item:: C++
:sync: cpp

.. code-block:: cpp
ov::AnyMap pipeline_config = { { "NPUW_CACHE_DIR", ".npucache" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
Disable memory allocation
+++++++++++++++++++++++++

If execution fails, either silently or with errors, try updating the NPU driver to
`32.0.100.3104 or newer <https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-windows.html>`__.
If updating is not possible, set the ``DISABLE_OPENVINO_GENAI_NPU_L0``
environment variable to disable NPU memory allocation, a feature that might be supported
only on newer drivers for Intel Core Ultra 200V processors.

Set the environment variable in a terminal:

.. tab-set::

.. tab-item:: Linux
:sync: linux

.. code-block:: console
export DISABLE_OPENVINO_GENAI_NPU_L0=1
.. tab-item:: Windows
:sync: win

.. code-block:: console
set DISABLE_OPENVINO_GENAI_NPU_L0=1
Performance modes
+++++++++++++++++++++

You can configure the NPU pipeline with the ``GENERATE_HINT`` option to switch
between two different performance modes:

* ``FAST_COMPILE`` (default) - enables fast compilation at the expense of performance,
* ``BEST_PERF`` - ensures the best possible performance at the cost of slower compilation.

Use the following code snippet:

.. tab-set::

.. tab-item:: Python
:sync: py

.. code-block:: python
pipeline_config = { "GENERATE_HINT": "BEST_PERF" }
pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
.. tab-item:: C++
:sync: cpp

.. code-block:: cpp
ov::AnyMap pipeline_config = { { "GENERATE_HINT", "BEST_PERF" } };
ov::genai::LLMPipeline pipe(model_path, "NPU", pipeline_config);
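The options described in this section are all passed through the same ``pipeline_config`` dictionary. A minimal sketch of collecting them in one place, assuming they can be combined in a single dictionary (not confirmed by this guide); the helper `build_npu_pipeline_config` is hypothetical and only builds the dictionary, so it runs without OpenVINO GenAI installed:

```python
VALID_GENERATE_HINTS = ("FAST_COMPILE", "BEST_PERF")


def build_npu_pipeline_config(max_prompt_len=1024, min_response_len=150,
                              cache_dir=None, generate_hint=None):
    """Build a pipeline_config dict from the options described in this guide.

    Defaults for MAX_PROMPT_LEN and MIN_RESPONSE_LEN follow the documented values.
    Optional keys are added only when explicitly requested.
    """
    if generate_hint is not None and generate_hint not in VALID_GENERATE_HINTS:
        raise ValueError(f"GENERATE_HINT must be one of {VALID_GENERATE_HINTS}")
    config = {
        "MAX_PROMPT_LEN": max_prompt_len,
        "MIN_RESPONSE_LEN": min_response_len,
    }
    if cache_dir is not None:
        config["NPUW_CACHE_DIR"] = cache_dir
    if generate_hint is not None:
        config["GENERATE_HINT"] = generate_hint
    return config


pipeline_config = build_npu_pipeline_config(cache_dir=".npucache",
                                            generate_hint="BEST_PERF")
print(pipeline_config)
# With OpenVINO GenAI installed and an NPU available, the dict is passed as in the snippets above:
# pipe = ov_genai.LLMPipeline(model_path, "NPU", pipeline_config)
```

Validating the hint up front surfaces a typo before the pipeline starts compiling.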
Additional Resources
####################

* :doc:`NPU Device <../../openvino-workflow/running-inference/inference-devices-and-modes/npu-device>`
* `OpenVINO GenAI Repo <https://github.com/openvinotoolkit/openvino.genai>`__
* `Neural Network Compression Framework <https://github.com/openvinotoolkit/nncf>`__
* `Neural Network Compression Framework <https://github.com/openvinotoolkit/nncf>`__
@@ -120,15 +120,24 @@ def process_coveo_meta(meta, url, link):

for namespace, values in meta:
namespace_element = ET.SubElement(url, namespace)
loc_element = url.find("loc")

for tag_name, tag_value in values.items():
if tag_name == 'ovdoctype':
processed_link = process_link(link)
ET.SubElement(namespace_element, tag_name).text = processed_link
else:
ET.SubElement(namespace_element, tag_name).text = process_link(link)
elif tag_name == 'ovcategory' and loc_element is not None:
ET.SubElement(namespace_element, tag_name).text = extract_link(loc_element.text)
elif tag_name == 'ovversion':
ET.SubElement(namespace_element, tag_name).text = tag_value

def process_link(link):
if '/' in link:
return link.split('/')[0].replace("-", " ")
return link.split('.html')[0].replace("-", " ")
return link.split('.html')[0].replace("-", " ")

def extract_link(link):
path = link.split("://")[-1]
segments = path.split('/')[1:]
if segments and segments[-1].endswith('.html'):
segments = segments[:-1]
return '|'.join(segments)
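To illustrate the two helpers, here is a self-contained sketch that copies them from the hunk above and runs them on sample inputs. The URL and relative links are hypothetical and chosen only to show the shape of the output:

```python
def process_link(link):
    # Keep the first path component (or the file stem) and turn hyphens into spaces.
    if '/' in link:
        return link.split('/')[0].replace("-", " ")
    return link.split('.html')[0].replace("-", " ")


def extract_link(link):
    # Drop the scheme and host, drop a trailing .html page, join the rest with '|'.
    path = link.split("://")[-1]
    segments = path.split('/')[1:]
    if segments and segments[-1].endswith('.html'):
        segments = segments[:-1]
    return '|'.join(segments)


# Hypothetical inputs for illustration.
doc_type = process_link("learn-openvino/llm_inference_guide/genai-guide-npu.html")
top_level = process_link("about-openvino.html")
category = extract_link(
    "https://example.com/2024/learn-openvino/llm_inference_guide/genai-guide-npu.html"
)
print(doc_type)
print(top_level)
print(category)
```

`extract_link` keeps every directory segment after the host, so the version segment of the URL, if present, becomes part of the resulting category string.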