Digging into RTX 2080 Ti Performance Optimization: Command Log for Running the Qwen LLM (Part 2)

一把老骨头 · Tips and Experience

Output from the llama.cpp benchmark run:
########################################################################

Find maximum number of model layers that can be written to your VRAM

########################################################################

Testing for: -ngl = 75
Testing for: -ngl = 112
Testing for: -ngl = 131
Testing for: -ngl = 140
Testing for: -ngl = 145
Testing for: -ngl = 147
Testing for: -ngl = 148
Testing for: -ngl = 149
Estimated max ngl = 149

Setting maximum -ngl to 149
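
The probe sequence in the log above (75, 112, 131, 140, 145, 147, 148, 149) is what a bisection over the layer count produces. The following is a minimal sketch of that idea, not the tool's actual code: the inclusive range 0 to 149 and the `fits(ngl)` predicate (standing in for "the model loaded and ran with this `-ngl`") are assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def find_max_ngl(
    fits: Callable[[int], bool], lo: int = 0, hi: int = 149
) -> Tuple[int, List[int]]:
    """Bisect for the largest ngl in [lo, hi] for which fits(ngl) is True.

    Assumes fits is monotonic (if n layers fit, fewer layers also fit)
    and that fits(lo) is True.
    """
    probes: List[int] = []
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        probes.append(mid)
        if fits(mid):
            lo = mid          # mid fits: the answer is mid or higher
        else:
            hi = mid - 1      # mid does not fit: the answer is below mid
    return lo, probes


# Hypothetical predicate: every probe fits, as in the log above.
max_ngl, probes = find_max_ngl(lambda ngl: True)
print(probes)
print(max_ngl)
```

When every probe succeeds, the search walks up to the top of the range, which matches the "Estimated max ngl = 149" line.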

Warmup performance history: [27.169813, 27.090305, 27.144696, 26.995448, 27.054011, 27.038189, 27.065603, 27.028237, 27.041289, 27.046516, 27.089866, 27.026862, 27.046862, 26.971674, 27.025514, 27.05764, 27.059785, 27.025481, 27.068116, 27.022433, 26.981009, 27.000351, 26.960303, 27.013747, 26.96748, 26.994402, 26.951979, 26.996994, 27.003112, 26.99966, 26.988973, 27.016048, 26.965876, 27.017555, 27.046099]

First stage: Initial exploration of parameter space

Best config Stage_1: {'batch': 15806, 'u_batch': 7127, 'threads': 24, 'gpu_layers': 95}
Best Stage_1 tg tokens/sec: 27.204362

Second stage: Grid search over categorical parameters

Best config Stage_2: {'flash_attn': 1, 'override_tensor': 'ffn_cpu_updown'}
Best Stage_2 tg tokens/sec: 27.271601

Third stage: Finetune final config

'gpu_layers': 115, 'flash_attn': 1, 'override_tensor': 'ffn_cpu_all'}
Best Stage_3 tg tokens/sec: 27.445436

You are ready to run a local llama-server:
If you launch llama-server, it will be listening at http://127.0.0.1:8080/ in your browser.

###################################################################

You can now launch an optimized llama-server.

just run next lines in your terminal:

###################################################################

LLAMA_BIN=/home/yblgt/llama.cpp/build/bin
MODEL=/data/models/qwen3.6-27b-mtp/Qwen3.6-27B-MTP-Q4_K_M.gguf

$LLAMA_BIN/llama-server --model $MODEL -t 20 --batch-size 1953 --ubatch-size 772 -ngl 115 --override-tensor "blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU" --flash-attn
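
To see which tensors the `--override-tensor` pattern would select, the regular expression can be tried against a few names. This is only a sketch: it assumes the part before `=CPU` is applied as a regex search against tensor names, it uses Python's `re` as a stand-in for the regex engine in llama.cpp, and the tensor names below are hypothetical examples. The pattern is written with `ffn_.*_exps`, the form echoed in the llama-bench table.

```python
import re

# Regex part of the --override-tensor argument (the text before "=CPU").
# The dots are left unescaped, as in the command, so each matches any character.
PATTERN = re.compile(r"blk.(?:[0-9]*[02468]).ffn_.*_exps.")


def offloaded_to_cpu(tensor_name: str) -> bool:
    """Return True if the pattern would send this tensor to the CPU buffer."""
    return PATTERN.search(tensor_name) is not None


# Hypothetical tensor names, for illustration only.
names = [f"blk.{i}.ffn_up_exps.weight" for i in range(6)]
names.append("blk.4.attn_q.weight")

selected = [n for n in names if offloaded_to_cpu(n)]
for name in selected:
    print(name)
```

With these sample names, only expert FFN tensors in even-numbered blocks are selected; attention tensors and odd-numbered blocks are left alone.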

########################################################

Benchmarking your OPTIMIZED configuration

Let's run the following line on terminal:

########################################################

/home/yblgt/llama.cpp/build/bin/llama-bench --model /data/models/qwen3.6-27b-mtp/Qwen3.6-27B-MTP-Q4_K_M.gguf -t 20 --batch-size 1953 --ubatch-size 772 -ngl 115 --flash-attn 1 -n 128 -p 256 -r 6 --no-warmup --progress --override-tensor "blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU"

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 22183 MiB):
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 22183 MiB
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | ot | test | t/s |

llama-bench: benchmark 1/2: starting
llama-bench: benchmark 1/2: prompt run 1/6
llama-bench: benchmark 1/2: prompt run 2/6
llama-bench: benchmark 1/2: prompt run 3/6
llama-bench: benchmark 1/2: prompt run 4/6
llama-bench: benchmark 1/2: prompt run 5/6
llama-bench: benchmark 1/2: prompt run 6/6
| qwen35 27B Q4_K - Medium | 15.35 GiB | 27.32 B | CUDA | 115 | 20 | 1953 | 772 | 1 | blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU | pp256 | 650.25 ± 14.02 |
llama-bench: benchmark 2/2: starting
llama-bench: benchmark 2/2: generation run 1/6
llama-bench: benchmark 2/2: generation run 2/6
llama-bench: benchmark 2/2: generation run 3/6
llama-bench: benchmark 2/2: generation run 4/6
llama-bench: benchmark 2/2: generation run 5/6
llama-bench: benchmark 2/2: generation run 6/6
| qwen35 27B Q4_K - Medium | 15.35 GiB | 27.32 B | CUDA | 115 | 20 | 1953 | 772 | 1 | blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU | tg128 | 27.18 ± 0.01 |

build: a957b7747 (9173)
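
Because the progress messages are interleaved with the table rows, it can help to pull the numbers out programmatically. The helper below is a hypothetical sketch (`parse_bench_row` is not part of llama.cpp); it assumes the row layout shown above, with the test name in the second-to-last cell and `mean ± std` in the last one.

```python
from typing import Dict, Union


def parse_bench_row(row: str) -> Dict[str, Union[str, float]]:
    """Extract the test name and tokens/s figures from one llama-bench table row."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    # Read from the end of the row, so extra columns (or a "|" inside the
    # override pattern) do not shift the fields we care about.
    mean, std = (float(x) for x in cells[-1].split("±"))
    return {"model": cells[0], "test": cells[-2], "tps": mean, "std": std}


row = (
    "| qwen35 27B Q4_K - Medium | 15.35 GiB | 27.32 B | CUDA | 115 | 20 | 1953 "
    "| 772 | 1 | blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU | tg128 | 27.18 ± 0.01 |"
)
result = parse_bench_row(row)
print(result["test"], result["tps"], result["std"])
```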

########################################################

Compare your previous results with NON-OPTIMIZED case

Let's run the following line on terminal:

Look for results in column 't/s' (tokens/s)

row tg128 --> reports on token generation speed

row pp256 --> reports on prompt processing speed

########################################################

/home/yblgt/llama.cpp/build/bin/llama-bench --model /data/models/qwen3.6-27b-mtp/Qwen3.6-27B-MTP-Q4_K_M.gguf -n 128 -p 256 -r 6 --no-warmup --progress

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 22183 MiB):
Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes, VRAM: 22183 MiB
| model | size | params | backend | ngl | test | t/s |

llama-bench: benchmark 1/2: starting
llama-bench: benchmark 1/2: prompt run 1/6
llama-bench: benchmark 1/2: prompt run 2/6
llama-bench: benchmark 1/2: prompt run 3/6
llama-bench: benchmark 1/2: prompt run 4/6
llama-bench: benchmark 1/2: prompt run 5/6
llama-bench: benchmark 1/2: prompt run 6/6
| qwen35 27B Q4_K - Medium | 15.35 GiB | 27.32 B | CUDA | 99 | pp256 | 637.51 ± 35.49 |
llama-bench: benchmark 2/2: starting
llama-bench: benchmark 2/2: generation run 1/6
llama-bench: benchmark 2/2: generation run 2/6
llama-bench: benchmark 2/2: generation run 3/6
llama-bench: benchmark 2/2: generation run 4/6
llama-bench: benchmark 2/2: generation run 5/6
llama-bench: benchmark 2/2: generation run 6/6
| qwen35 27B Q4_K - Medium | 15.35 GiB | 27.32 B | CUDA | 99 | tg128 | 26.97 ± 0.03 |

build: a957b7747 (9173)
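
To put the two runs side by side, the relative gain can be worked out from the `t/s` columns. The sketch below copies the figures from the two tables; the noise check (difference divided by the combined standard deviation) is a crude heuristic added here for illustration, not something llama-bench reports, and six runs per test is a small sample.

```python
import math
from typing import Tuple

# (mean, std) in tokens/s, copied from the llama-bench tables.
optimized = {"pp256": (650.25, 14.02), "tg128": (27.18, 0.01)}
baseline = {"pp256": (637.51, 35.49), "tg128": (26.97, 0.03)}


def compare(test: str) -> Tuple[float, float]:
    """Return (gain in percent, difference in units of combined std dev)."""
    m_opt, s_opt = optimized[test]
    m_base, s_base = baseline[test]
    gain_pct = 100.0 * (m_opt - m_base) / m_base
    spread = math.hypot(s_opt, s_base)  # sqrt(s_opt**2 + s_base**2)
    return gain_pct, (m_opt - m_base) / spread


for test in ("pp256", "tg128"):
    gain, z = compare(test)
    print(f"{test}: {gain:+.2f}% ({z:.1f} combined std devs)")
```

On these numbers the gains come out at roughly +2.0% for prompt processing and +0.8% for token generation. The prompt-processing difference sits inside the run-to-run spread, while the generation difference is small but well above it.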

Optimized command
# Notes: --threads-batch is best set to the same value as -t;
# --n-gpu-layers 115 replaces the earlier -1.
./build/bin/llama-server \
-m /data/models/qwen3.6-27b-mtp/Qwen3.6-27B-MTP-Q4_K_M.gguf \
--port 8000 --host 0.0.0.0 \
-t 20 \
--threads-batch 20 \
--n-gpu-layers 115 \
-c 196608 \
--batch-size 1953 \
--ubatch-size 772 \
--flash-attn on \
--override-tensor "blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU" \
--cont-batching \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.7 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--parallel 1 \
--temp 0.6 \
--mlock \
--no-warmup \
--prio 3
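
A shell line continuation cannot carry a trailing comment, so one way to keep a note beside each flag is to assemble the argument list in a small script. The sketch below uses only the flags and values from the command above; the `OPTIONS` table and `build_argv` helper are hypothetical names introduced for illustration.

```python
import shlex
from typing import List, Optional, Tuple

MODEL = "/data/models/qwen3.6-27b-mtp/Qwen3.6-27B-MTP-Q4_K_M.gguf"

# (flag, value, note); the note is documentation only and is never passed on.
OPTIONS: List[Tuple[str, Optional[str], str]] = [
    ("-m", MODEL, ""),
    ("--port", "8000", ""),
    ("--host", "0.0.0.0", ""),
    ("-t", "20", ""),
    ("--threads-batch", "20", "best kept equal to -t"),
    ("--n-gpu-layers", "115", "replaces the earlier -1"),
    ("-c", "196608", ""),
    ("--batch-size", "1953", ""),
    ("--ubatch-size", "772", ""),
    ("--flash-attn", "on", ""),
    ("--override-tensor", "blk.(?:[0-9]*[02468]).ffn_.*_exps.=CPU", ""),
    ("--cont-batching", None, ""),
    ("--spec-type", "draft-mtp", ""),
    ("--spec-draft-n-max", "2", ""),
    ("--spec-draft-p-min", "0.7", ""),
    ("--cache-type-k", "q4_0", ""),
    ("--cache-type-v", "q4_0", ""),
    ("--parallel", "1", ""),
    ("--temp", "0.6", ""),
    ("--mlock", None, ""),
    ("--no-warmup", None, ""),
    ("--prio", "3", ""),
]


def build_argv(server: str = "./build/bin/llama-server") -> List[str]:
    """Flatten OPTIONS into an argv list, skipping the value for bare switches."""
    argv = [server]
    for flag, value, _note in OPTIONS:
        argv.append(flag)
        if value is not None:
            argv.append(value)
    return argv


argv = build_argv()
print(shlex.join(argv))  # shell-quoted form, safe to paste into a terminal
```

Passing the same list to `subprocess.run(argv, check=True)` would launch the server without going through a shell, which also avoids quoting issues with the regex.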