This package provides a model hub for running inference on NeuronX Distributed (NxD). It includes examples that you can reference when you implement code that uses NxD Inference.
- `generation_demo.py`: a basic generation example for Llama.
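For orientation before reading the script itself, here is a minimal sketch of that flow: compile a Llama checkpoint for Neuron, load the traced model, and generate. The module paths, class names, and config fields below (`NeuronLlamaForCausalLM`, `HuggingFaceGenerationAdapter`, `NeuronConfig`, and so on) are assumptions about this package's API; treat this as a sketch and refer to `generation_demo.py` for the authoritative version.

```python
from transformers import AutoTokenizer, GenerationConfig

# NOTE: The module paths and class names below are assumptions about this
# package's API, shown for orientation only. See generation_demo.py for the
# authoritative script.
from neuronx_distributed_inference.models.config import NeuronConfig
from neuronx_distributed_inference.models.llama.modeling_llama import (
    LlamaInferenceConfig,
    NeuronLlamaForCausalLM,
)
from neuronx_distributed_inference.utils.hf_adapter import (
    HuggingFaceGenerationAdapter,
    load_pretrained_config,
)

model_path = "/home/ubuntu/model_hf/Llama-3.1-8B-Instruct/"
traced_model_path = "/home/ubuntu/traced_model/Llama-3.1-8B-Instruct/"

# Describe how to shard and trace the model for Neuron.
neuron_config = NeuronConfig(
    tp_degree=32,
    batch_size=1,
    max_context_length=32,
    seq_len=64,
)
config = LlamaInferenceConfig(
    neuron_config,
    load_config=load_pretrained_config(model_path),
)

# Compile the model, save the traced artifacts, and load them onto Neuron.
model = NeuronLlamaForCausalLM(model_path, config)
model.compile(traced_model_path)
model.load(traced_model_path)

# Generate through a HuggingFace-style generate() adapter.
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("I believe the meaning of life is", return_tensors="pt")
generation_model = HuggingFaceGenerationAdapter(model)
outputs = generation_model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    generation_config=GenerationConfig(do_sample=True, top_k=1, pad_token_id=2),
    max_length=neuron_config.seq_len,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```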
This package also includes an `inference_demo` console script that you can use to run inference. The script includes benchmarking and accuracy-checking features that are useful for verifying that your models and modules work correctly. After you install this package, run the inference demo with the `inference_demo` command. See the examples below for how to run it, or run `inference_demo --help` to view all available arguments.
The following example compiles and runs Llama 3.1 8B Instruct, checks output accuracy with token matching, and benchmarks performance:

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/Llama-3.1-8B-Instruct/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-3.1-8B-Instruct/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode token-matching \
    --benchmark
```
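For context, token matching (`--check-accuracy-mode token-matching`) passes only when the tokens generated on device are identical to a reference generation (for example, from the same model on CPU). The comparison reduces to something like the following plain-PyTorch sketch, which is illustrative only and not this package's implementation:

```python
import torch

def first_divergence(expected_ids: torch.Tensor, actual_ids: torch.Tensor) -> int:
    """Return the first position where the device tokens diverge from the
    reference tokens, or -1 if the sequences match token-for-token."""
    mismatches = (expected_ids != actual_ids).nonzero()
    return -1 if mismatches.numel() == 0 else int(mismatches[0])

expected = torch.tensor([101, 2009, 2003, 1037, 2204])  # reference (CPU) run
actual = torch.tensor([101, 2009, 2003, 1037, 2204])    # Neuron run
assert first_divergence(expected, actual) == -1
```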
This example runs a one-layer DBRX model and checks accuracy with logit matching, which compares output logits against a CPU reference instead of comparing sampled tokens:

```bash
inference_demo \
    --model-type dbrx \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/dbrx-1layer/ \
    --compiled-model-path /home/ubuntu/traced_model/dbrx-1layer-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 1024 \
    --seq-len 1152 \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 0 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is" \
    --check-accuracy-mode logit-matching
```
This example runs speculative decoding, using OpenLLaMA 3B as the draft model to speed up generation with the OpenLLaMA 7B target model. `--speculation-length 5` means the draft model proposes five tokens per step for the target model to verify, as sketched after this example:

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/open_llama_7b/ \
    --compiled-model-path /home/ubuntu/traced_model/open_llama_7b/ \
    --draft-model-path /home/ubuntu/model_hf/open_llama_3b/ \
    --compiled-draft-model-path /home/ubuntu/traced_model/open_llama_3b/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 1 \
    --max-context-length 32 \
    --seq-len 64 \
    --enable-bucketing \
    --speculation-length 5 \
    --no-trace-tokengen-model \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --check-accuracy-mode token-matching \
    --benchmark
```
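To make the speculation flags concrete, here is a toy sketch of greedy speculative decoding: the draft model proposes `--speculation-length` tokens, the target model verifies them, and generation keeps the verified prefix plus one corrected token. The two toy "models" below are stand-ins for illustration, not NxD Inference APIs:

```python
# Toy draft model: cheap and usually, but not always, right.
def draft_next(tokens):
    return (tokens[-1] + 1) % 50

# Toy target model: the reference; disagrees with the draft after multiples of 7.
def target_next(tokens):
    last = tokens[-1]
    return (last + 2) % 50 if last % 7 == 0 else (last + 1) % 50

def speculative_decode(prompt, max_new_tokens=8, speculation_length=5):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes `speculation_length` tokens autoregressively.
        draft = []
        for _ in range(speculation_length):
            draft.append(draft_next(tokens + draft))
        # 2. The target model verifies the proposals (on Neuron this is a single
        #    batched pass; emulated position by position here).
        accepted = []
        for i, token in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if token == expected:
                accepted.append(token)      # proposal verified, keep it
            else:
                accepted.append(expected)   # first mismatch: keep the target's
                break                       # token and discard the rest
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens]

print(speculative_decode([1, 2, 3]))
```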
This example runs a quantized Llama 2 7B model, loading per-channel symmetric quantized weights from the checkpoint at `--quantized-checkpoints-path`:

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --on-device-sampling \
    --enable-bucketing \
    --quantized \
    --quantized-checkpoints-path /home/ubuntu/model_hf/Llama-2-7b/model_quant.pt \
    --quantization-type per_channel_symmetric \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```
This example checks logit-matching accuracy with a custom divergence tolerance (`--divergence-difference-tol`) and a `--tol-map` that overrides the default tolerances:

```bash
inference_demo \
    --model-type llama \
    --task-type causal-lm \
    run \
    --model-path /home/ubuntu/model_hf/Llama-2-7b/ \
    --compiled-model-path /home/ubuntu/traced_model/Llama-2-7b-demo/ \
    --torch-dtype bfloat16 \
    --tp-degree 32 \
    --batch-size 2 \
    --max-context-length 32 \
    --seq-len 64 \
    --check-accuracy-mode logit-matching \
    --divergence-difference-tol 0.005 \
    --tol-map "{5: (1e-5, 0.02)}" \
    --enable-bucketing \
    --top-k 1 \
    --do-sample \
    --pad-token-id 2 \
    --prompt "I believe the meaning of life is" \
    --prompt "The color of the sky is"
```