Grok-2 On-Prem Deployment via SGLang¶
Author: Shivank Chaudhary
Published: September 14, 2025
xAI just released their massive Grok-2 model. This thirsty beast can eat a lot of GPU compute on your cluster, so be ready for the heat.
In the last blog we covered the deployment of the GPT-OSS 20B and 120B models with vLLM, but this time we'll be using the SGLang inference engine, because vLLM does not support this model as of today (support is in progress: https://github.com/vllm-project/vllm/issues/23557).
Grok-2: xAI’s Frontier Model¶
Grok-2, developed by xAI, represents cutting-edge capabilities in reasoning and knowledge synthesis. Key characteristics:
Size: Large-scale transformer architecture
Quantization: FP8 precision for memory efficiency
Serving Framework: SGLang for optimized inference
Let’s start with the prerequisites:
A K8s cluster with the NVIDIA GPU Operator installed
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-5969544f77-pmrt7 1/1 Running 0 9d
cert-manager cert-manager-cainjector-65967ff5cc-tbntm 1/1 Running 0 9d
cert-manager cert-manager-webhook-7c665868cb-n9n7r 1/1 Running 0 9d
addons gpu-feature-discovery-2c2bz 1/1 Running 0 19d
addons gpu-feature-discovery-js5k7 1/1 Running 0 45h
addons gpu-feature-discovery-ss2zk 1/1 Running 0 19d
addons gpu-operator-76db5d4656-27lmb 1/1 Running 0 19d
addons ingress-nginx-controller-677d485949-nnsmn 1/1 Running 0 13d
addons local-path-provisioner-596b4859f7-nhp8j 1/1 Running 0 13d
addons metallb-controller-54574fc5b7-2c625 1/1 Running 0 13d
addons metallb-speaker-2dv8n 4/4 Running 0 13d
addons metallb-speaker-k2r7z 4/4 Running 0 13d
addons metallb-speaker-rdqtb 4/4 Running 0 45h
addons nvidia-container-toolkit-daemonset-c6dsr 1/1 Running 0 45h
addons nvidia-container-toolkit-daemonset-wndt4 1/1 Running 0 19d
addons nvidia-container-toolkit-daemonset-xnlmw 1/1 Running 0 19d
addons nvidia-cuda-validator-44zsb 0/1 Completed 0 19d
addons nvidia-cuda-validator-fvf77 0/1 Completed 0 19d
addons nvidia-cuda-validator-mjm7r 0/1 Completed 0 45h
addons nvidia-dcgm-exporter-hflzf 1/1 Running 0 45h
addons nvidia-dcgm-exporter-knqbf 1/1 Running 0 19d
addons nvidia-dcgm-exporter-spst5 1/1 Running 0 19d
addons nvidia-device-plugin-daemonset-7lxtk 1/1 Running 0 45h
addons nvidia-device-plugin-daemonset-djz4l 1/1 Running 0 19d
addons nvidia-device-plugin-daemonset-phz6m 1/1 Running 0 19d
addons nvidia-driver-daemonset-5.15.0-140-generic-ubuntu22.04-4xqrc 2/2 Running 4 (19d ago) 19d
addons nvidia-driver-daemonset-5.15.0-140-generic-ubuntu22.04-p5tkn 2/2 Running 3 (19d ago) 19d
addons nvidia-driver-daemonset-5.15.0-140-generic-ubuntu22.04-s8djl 2/2 Running 5 (45h ago) 45h
addons nvidia-gpu-operator-node-feature-discovery-gc-85cbffc74d-4nxn4 1/1 Running 0 19d
addons nvidia-gpu-operator-node-feature-discovery-master-7f8d4b68582kn 1/1 Running 0 19d
addons nvidia-gpu-operator-node-feature-discovery-worker-mkxzq 1/1 Running 0 45h
addons nvidia-gpu-operator-node-feature-discovery-worker-v24vq 1/1 Running 0 19d
addons nvidia-gpu-operator-node-feature-discovery-worker-x8cxv 1/1 Running 0 19d
addons nvidia-mig-manager-2vzkf 1/1 Running 0 19d
addons nvidia-mig-manager-vs7w2 1/1 Running 0 45h
addons nvidia-mig-manager-xhdbk 1/1 Running 0 19d
addons nvidia-operator-validator-fc69v 1/1 Running 0 19d
addons nvidia-operator-validator-qjmkh 1/1 Running 0 45h
addons nvidia-operator-validator-v6jsq 1/1 Running 0 19d
compass-system compass-agent-678f4d9cbd-z55rv 1/1 Running 9 (13d ago) 19d
grok grok2-sglang-f68f88b9b-9bsvl 0/1 Init:0/1 0 2m20s
kube-system calico-kube-controllers-868cbf9cc-h7sz2 1/1 Running 0 19d
kube-system calico-node-8t486 1/1 Running 0 19d
kube-system calico-node-stkng 1/1 Running 0 45h
kube-system calico-node-xt56j 1/1 Running 0 19d
kube-system coredns-75bc46dc6c-6wvnt 1/1 Running 0 19d
kube-system coredns-75bc46dc6c-x4smv 1/1 Running 0 19d
kube-system etcd-hgx-gpu-compute-150 1/1 Running 0 19d
kube-system kube-apiserver-hgx-gpu-compute-150 1/1 Running 0 19d
kube-system kube-controller-manager-hgx-gpu-compute-150 1/1 Running 0 19d
kube-system kube-proxy-6gk57 1/1 Running 0 19d
kube-system kube-proxy-mqmsg 1/1 Running 0 19d
kube-system kube-proxy-qvkvn 1/1 Running 0 45h
kube-system kube-scheduler-hgx-gpu-compute-150 1/1 Running 0 19d
longhorn-system csi-attacher-5cfcfffdf-mpkfx 1/1 Running 1 (19d ago) 19d
longhorn-system csi-attacher-5cfcfffdf-mqpbr 1/1 Running 1 (19d ago) 19d
longhorn-system csi-attacher-5cfcfffdf-skjq6 1/1 Running 1 (19d ago) 19d
longhorn-system csi-provisioner-76bf7c68ff-4z8qc 1/1 Running 0 19d
longhorn-system csi-provisioner-76bf7c68ff-8mtx4 1/1 Running 0 19d
longhorn-system csi-provisioner-76bf7c68ff-vj5fb 1/1 Running 1 (19d ago) 19d
longhorn-system csi-resizer-75c4685b5b-46n5s 1/1 Running 0 19d
longhorn-system csi-resizer-75c4685b5b-nlc42 1/1 Running 0 19d
longhorn-system csi-resizer-75c4685b5b-qw2mp 1/1 Running 0 19d
longhorn-system csi-snapshotter-769588d6bb-czkkl 1/1 Running 0 19d
longhorn-system csi-snapshotter-769588d6bb-jdsz5 1/1 Running 0 19d
longhorn-system csi-snapshotter-769588d6bb-t49d9 1/1 Running 0 19d
longhorn-system engine-image-ei-b4bcf0a5-572vl 1/1 Running 0 19d
longhorn-system engine-image-ei-b4bcf0a5-nxp4p 1/1 Running 0 45h
longhorn-system engine-image-ei-b4bcf0a5-rglbk 1/1 Running 0 19d
longhorn-system instance-manager-2c1a767560aac05a8eebcfefc4fd72e4 1/1 Running 0 19d
longhorn-system instance-manager-6be5230635b99ba14cbcabdcc2b2f343 1/1 Running 0 45h
longhorn-system instance-manager-fd77d6a642bfebd11e64659fadc84aaf 1/1 Running 0 19d
longhorn-system longhorn-csi-plugin-7zcz4 3/3 Running 0 19d
longhorn-system longhorn-csi-plugin-92cw8 3/3 Running 0 19d
longhorn-system longhorn-csi-plugin-vrxjg 3/3 Running 1 (45h ago) 45h
longhorn-system longhorn-driver-deployer-58c9dd465-wth7d 1/1 Running 0 19d
longhorn-system longhorn-manager-2pm64 2/2 Running 0 45h
longhorn-system longhorn-manager-lfsfv 2/2 Running 0 19d
longhorn-system longhorn-manager-t9jlz 2/2 Running 0 19d
longhorn-system longhorn-ui-7f7b9f785f-879kc 1/1 Running 0 19d
longhorn-system longhorn-ui-7f7b9f785f-pq6m6 1/1 Running 0 19d
prometheus alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 12d
prometheus prometheus-grafana-bfc8bdb48-tvhzb 3/3 Running 0 12d
prometheus prometheus-kube-prometheus-operator-df5494fb5-fl7ds 1/1 Running 0 12d
prometheus prometheus-kube-state-metrics-86847bb8bc-rl6sk 1/1 Running 0 12d
prometheus prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 12d
prometheus prometheus-prometheus-node-exporter-7l6d8 1/1 Running 0 45h
prometheus prometheus-prometheus-node-exporter-cw5ft 1/1 Running 0 12d
prometheus prometheus-prometheus-node-exporter-j4tjb 1/1 Running 0 12d
Storage class configured
kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path rancher.io/local-path Delete WaitForFirstConsumer true 13d
longhorn (default) driver.longhorn.io Delete Immediate true 19d
longhorn-static driver.longhorn.io Delete Immediate true 19d
Check the available resources on the nodes (the Capacity and Allocatable sections of kubectl describe node):
Capacity:
cpu: 192
ephemeral-storage: 1844284980Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1056290728Ki
nvidia.com/gpu: 8
pods: 110
Allocatable:
cpu: 192
ephemeral-storage: 1699693034754
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1056188328Ki
nvidia.com/gpu: 8
pods: 110
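If you just want a per-node summary of GPUs instead of the full kubectl describe node output, a one-liner along these lines should work (assuming the standard nvidia.com/gpu resource name exposed by the device plugin; the column names are arbitrary):
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'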
Create a PVC for model weights storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grok2-weights-pvc
  namespace: grok
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 700Gi
  storageClassName: local-path # Use the appropriate storage class in your cluster
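Assuming you save the manifest above as grok2-weights-pvc.yaml (the filename is just an example), create the grok namespace and the claim, then check its status. Note that local-path uses WaitForFirstConsumer binding, so the PVC will stay Pending until the deployment below is scheduled:
kubectl create namespace grok
kubectl apply -f grok2-weights-pvc.yaml
kubectl get pvc -n grok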
Grok-2 Deployment Manifest¶
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grok2-sglang
  namespace: grok
  labels:
    app: grok2-sglang
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grok2-sglang
  template:
    metadata:
      labels:
        app: grok2-sglang
    spec:
      imagePullSecrets:
        - name: dockerhub-creds
      initContainers:
        - name: fetch-weights
          image: python:3.11-slim
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh", "-lc"]
          args:
            - |
              set -euo pipefail
              if [ -d "${MODEL_DIR}" ] && [ "$(ls -A ${MODEL_DIR})" ]; then
                echo "Weights already present. Skipping download."
                exit 0
              fi
              mkdir -p "${MODEL_DIR}"
              pip install --no-cache-dir --upgrade pip huggingface_hub hf_transfer protobuf
              hf download "${HF_REPO}" --local-dir "${MODEL_DIR}"
          env:
            - name: HF_REPO
              value: xai-org/grok-2
            - name: MODEL_DIR
              value: /models/grok-2
            - name: LOG_LEVEL
              value: DEBUG
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: weights
              mountPath: /models/grok-2
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      containers:
        - name: sglang
          image: lmsysorg/sglang:v0.5.2rc2-cu126
          imagePullPolicy: IfNotPresent
          command: ["python", "-m", "sglang.launch_server"]
          args:
            - --model-path
            - /models/grok-2
            - --tokenizer-path
            - /models/grok-2/tokenizer.tok.json
            - --tp
            - "8"
            - --quantization
            - fp8
            - --attention-backend
            - triton
            - --host
            - 0.0.0.0
            - --port
            - "30000"
          env:
            - name: LOG_LEVEL
              value: DEBUG
          ports:
            - containerPort: 30000
              protocol: TCP
          resources:
            requests:
              nvidia.com/gpu: "8"
            limits:
              nvidia.com/gpu: "8"
          volumeMounts:
            - name: weights
              mountPath: /models/grok-2
              readOnly: true
            - name: shm
              mountPath: /dev/shm
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      volumes:
        - name: weights
          persistentVolumeClaim:
            claimName: grok2-weights-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
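The Deployment alone only starts the server inside the pod. To reach port 30000 from other workloads (or through the MetalLB/ingress setup visible in the pod listing above), you will also want a Service. A minimal ClusterIP sketch; the Service name and type are assumptions, adjust them to your environment:
apiVersion: v1
kind: Service
metadata:
  name: grok2-sglang
  namespace: grok
spec:
  selector:
    app: grok2-sglang
  ports:
    - name: http
      port: 30000
      targetPort: 30000
      protocol: TCP
Apply both manifests (the filenames here are illustrative) and watch the pod. The fetch-weights init container pulls several hundred gigabytes of weights from Hugging Face (the PVC above is sized at 700Gi), so expect the Init:0/1 phase to last a while depending on your network:
kubectl apply -f grok2-sglang-deployment.yaml
kubectl apply -f grok2-sglang-service.yaml
kubectl get pods -n grok -w
kubectl logs -n grok deploy/grok2-sglang -c fetch-weights -f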
Logs¶
kubectl logs grok2-sglang-767c87b96-gvwxj -n grok
Defaulted container "sglang" out of: sglang, fetch-weights (init)
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
W0911 10:24:10.440000 1 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:10.440000 1 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-11 10:24:10] server_args=ServerArgs(model_path='/models/grok-2', tokenizer_path='/models/grok-2/tokenizer.tok.json', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization='fp8', quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.831, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=176157154, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/models/grok-2', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='triton', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', 
hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
[2025-09-11 10:24:11] Using default HuggingFace chat template with detected content format: string
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
W0911 10:24:17.065000 224 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:17.065000 224 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0911 10:24:19.370000 218 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.370000 218 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0911 10:24:19.445000 221 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.445000 221 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0911 10:24:19.455000 216 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.455000 216 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0911 10:24:19.570000 217 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.570000 217 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
W0911 10:24:19.668000 222 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.668000 222 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0911 10:24:19.724000 223 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.724000 223 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0911 10:24:19.729000 219 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.729000 219 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
W0911 10:24:19.765000 220 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0911 10:24:19.765000 220 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-11 10:24:20 TP0] Init torch distributed begin.
`torch_dtype` is deprecated! Use `dtype` instead!
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-09-11 10:24:21 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-09-11 10:24:23 TP0] Init torch distributed ends. mem usage=1.25 GB
[2025-09-11 10:24:24 TP0] Load weight begin. avail mem=77.43 GB
[2025-09-11 10:24:25 TP0] #parameters (analytical): 243.74 B, #parameters (actual): 269.56 B
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/18 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/18 [00:23<06:44, 23.81s/it]
Loading safetensors checkpoint shards: 11% Completed | 2/18 [00:31<03:46, 14.18s/it]
Loading safetensors checkpoint shards: 17% Completed | 3/18 [00:38<02:44, 10.94s/it]
Loading safetensors checkpoint shards: 22% Completed | 4/18 [00:39<01:41, 7.25s/it]
Loading safetensors checkpoint shards: 28% Completed | 5/18 [00:40<01:01, 4.73s/it]
Loading safetensors checkpoint shards: 33% Completed | 6/18 [00:44<00:54, 4.54s/it]
Loading safetensors checkpoint shards: 39% Completed | 7/18 [00:48<00:48, 4.44s/it]
Loading safetensors checkpoint shards: 61% Completed | 11/18 [00:49<00:11, 1.63s/it]
Loading safetensors checkpoint shards: 67% Completed | 12/18 [00:53<00:13, 2.21s/it]
Loading safetensors checkpoint shards: 72% Completed | 13/18 [00:53<00:08, 1.77s/it]
Loading safetensors checkpoint shards: 83% Completed | 15/18 [00:54<00:03, 1.13s/it]
[2025-09-11 10:25:26 TP3] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:26 TP4] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:26 TP2] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:27 TP6] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:31 TP1] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
Loading safetensors checkpoint shards: 89% Completed | 16/18 [01:07<00:07, 3.78s/it]
Loading safetensors checkpoint shards: 100% Completed | 18/18 [01:08<00:00, 2.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 18/18 [01:08<00:00, 3.78s/it]
[2025-09-11 10:25:33 TP0] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:33 TP0] Load weight end. type=Grok1ForCausalLM, dtype=torch.bfloat16, avail mem=40.69 GB, mem usage=36.73 GB.
[2025-09-11 10:25:37 TP5] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:42 TP7] #all_names: 835, #hit_names: 707, #missing_exclude_scales: 0
[2025-09-11 10:25:43 TP3] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP6] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP1] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP2] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP4] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP0] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP0] Memory pool end. avail mem=11.22 GB
[2025-09-11 10:25:43 TP5] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP7] KV Cache is allocated. #tokens: 903439, K size: 13.79 GB, V size: 13.79 GB
[2025-09-11 10:25:43 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=11.16 GB
[2025-09-11 10:25:44 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160]
Capturing batches (bs=160 avail_mem=10.92 GB): 0%| | 0/23 [00:00<?, ?it/s][2025-09-11 10:25:47 TP3] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP6] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP4] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP5] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP1] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP7] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP2] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
[2025-09-11 10:25:47 TP0] Config file not found at /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Fallback to triton version 3.1.0 and use MoE kernel config from /sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_1_0/E=8,N=2048,device_name=NVIDIA_H100_80GB_HBM3,dtype=fp8_w8a8.json. Performance might be sub-optimal!
Capturing batches (bs=1 avail_mem=10.28 GB): 100%|██████████| 23/23 [00:16<00:00, 1.42it/s]
[2025-09-11 10:26:00 TP3] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP1] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP7] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP4] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP5] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP6] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP0] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP2] Registering 2967 cuda graph addresses
[2025-09-11 10:26:00 TP0] Capture cuda graph end. Time elapsed: 17.13 s. mem usage=0.89 GB. avail mem=10.26 GB.
[2025-09-11 10:26:01 TP0] max_total_num_tokens=903439, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=3529, context_len=131072, available_gpu_mem=10.26 GB
[2025-09-11 10:26:01] INFO: Started server process [1]
[2025-09-11 10:26:01] INFO: Waiting for application startup.
[2025-09-11 10:26:01] INFO: Application startup complete.
[2025-09-11 10:26:01] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-09-11 10:26:02] INFO: 127.0.0.1:60028 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-11 10:26:02 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-11 10:26:05] INFO: 127.0.0.1:60032 - "POST /generate HTTP/1.1" 200 OK
[2025-09-11 10:26:05] The server is fired up and ready to roll!
[2025-09-11 10:26:21 TP0] Prefill batch. #new-seq: 1, #new-token: 15, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-11 10:26:22 TP0] Decode batch. #running-req: 1, #token: 48, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1.88, #queue-req: 0,
[2025-09-11 10:26:22] INFO: 10.46.124.148:54800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-12 07:18:12 TP0] Prefill batch. #new-seq: 1, #new-token: 12, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-12 07:18:14 TP0] Decode batch. #running-req: 1, #token: 19, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.00, #queue-req: 0,
[2025-09-12 07:18:14] INFO: 10.34.109.36:33844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-12 07:27:13 TP0] Prefill batch. #new-seq: 1, #new-token: 15, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-12 07:27:14 TP0] Decode batch. #running-req: 1, #token: 44, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.07, #queue-req: 0,
[2025-09-12 07:27:14 TP0] Decode batch. #running-req: 1, #token: 84, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.94, #queue-req: 0,
[2025-09-12 07:27:15 TP0] Decode batch. #running-req: 1, #token: 124, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.71, #queue-req: 0,
[2025-09-12 07:27:15 TP0] Decode batch. #running-req: 1, #token: 164, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.44, #queue-req: 0,
[2025-09-12 07:27:15 TP0] Decode batch. #running-req: 1, #token: 204, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.30, #queue-req: 0,
[2025-09-12 07:27:16 TP0] Decode batch. #running-req: 1, #token: 244, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.24, #queue-req: 0,
[2025-09-12 07:27:16 TP0] Decode batch. #running-req: 1, #token: 284, token usage: 0.00, cuda graph: True, gen throughput (token/s): 97.32, #queue-req: 0,
[2025-09-12 07:27:17 TP0] Decode batch. #running-req: 1, #token: 324, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.90, #queue-req: 0,
[2025-09-12 07:27:17 TP0] Decode batch. #running-req: 1, #token: 364, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.80, #queue-req: 0,
[2025-09-12 07:27:17 TP0] Decode batch. #running-req: 1, #token: 404, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.72, #queue-req: 0,
[2025-09-12 07:27:18 TP0] Decode batch. #running-req: 1, #token: 444, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.65, #queue-req: 0,
[2025-09-12 07:27:18 TP0] Decode batch. #running-req: 1, #token: 484, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.68, #queue-req: 0,
[2025-09-12 07:27:18] INFO: 10.34.109.36:40636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-12 07:38:09 TP0] Prefill batch. #new-seq: 1, #new-token: 16, #cached-token: 2, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-12 07:38:09 TP0] Decode batch. #running-req: 1, #token: 40, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.06, #queue-req: 0,
[2025-09-12 07:38:09 TP0] Decode batch. #running-req: 1, #token: 80, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.85, #queue-req: 0,
[2025-09-12 07:38:10 TP0] Decode batch. #running-req: 1, #token: 120, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.70, #queue-req: 0,
[2025-09-12 07:38:10 TP0] Decode batch. #running-req: 1, #token: 160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.47, #queue-req: 0,
[2025-09-12 07:38:10 TP0] Decode batch. #running-req: 1, #token: 200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.32, #queue-req: 0,
[2025-09-12 07:38:11 TP0] Decode batch. #running-req: 1, #token: 240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 98.22, #queue-req: 0,
[2025-09-12 07:38:11 TP0] Decode batch. #running-req: 1, #token: 280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 97.46, #queue-req: 0,
[2025-09-12 07:38:12 TP0] Decode batch. #running-req: 1, #token: 320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.91, #queue-req: 0,
[2025-09-12 07:38:12 TP0] Decode batch. #running-req: 1, #token: 360, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.80, #queue-req: 0,
[2025-09-12 07:38:13 TP0] Decode batch. #running-req: 1, #token: 400, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.76, #queue-req: 0,
[2025-09-12 07:38:13 TP0] Decode batch. #running-req: 1, #token: 440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.69, #queue-req: 0,
[2025-09-12 07:38:13 TP0] Decode batch. #running-req: 1, #token: 480, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.73, #queue-req: 0,
[2025-09-12 07:38:14 TP0] Decode batch. #running-req: 1, #token: 520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 96.45, #queue-req: 0,
[2025-09-12 07:38:14 TP0] Decode batch. #running-req: 1, #token: 560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 95.28, #queue-req: 0,
[2025-09-12 07:38:15 TP0] Decode batch. #running-req: 1, #token: 600, token usage: 0.00, cuda graph: True, gen throughput (token/s): 95.12, #queue-req: 0,
[2025-09-12 07:38:15 TP0] Decode batch. #running-req: 1, #token: 640, token usage: 0.00, cuda graph: True, gen throughput (token/s): 95.04, #queue-req: 0,
[2025-09-12 07:38:15 TP0] Decode batch. #running-req: 1, #token: 680, token usage: 0.00, cuda graph: True, gen throughput (token/s): 95.03, #queue-req: 0,
[2025-09-12 07:38:16 TP0] Decode batch. #running-req: 1, #token: 720, token usage: 0.00, cuda graph: True, gen throughput (token/s): 95.03, #queue-req: 0,
[2025-09-12 07:38:16 TP0] Decode batch. #running-req: 1, #token: 760, token usage: 0.00, cuda graph: True, gen throughput (token/s): 95.05, #queue-req: 0,
[2025-09-12 07:38:17 TP0] Decode batch. #running-req: 1, #token: 800, token usage: 0.00, cuda graph: True, gen throughput (token/s): 93.83, #queue-req: 0,
[2025-09-12 07:38:17 TP0] Decode batch. #running-req: 1, #token: 840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 93.47, #queue-req: 0,
[2025-09-12 07:38:18 TP0] Decode batch. #running-req: 1, #token: 880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 93.48, #queue-req: 0,
[2025-09-12 07:38:18 TP0] Decode batch. #running-req: 1, #token: 920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 93.51, #queue-req: 0,
[2025-09-12 07:38:18 TP0] Decode batch. #running-req: 1, #token: 960, token usage: 0.00, cuda graph: True, gen throughput (token/s): 93.49, #queue-req: 0,
[2025-09-12 07:38:19 TP0] Decode batch. #running-req: 1, #token: 1000, token usage: 0.00, cuda graph: True, gen throughput (token/s): 93.52, #queue-req: 0,
[2025-09-12 07:38:19 TP0] Decode batch. #running-req: 1, #token: 1040, token usage: 0.00, cuda graph: True, gen throughput (token/s): 92.94, #queue-req: 0,
[2025-09-12 07:38:20 TP0] Decode batch. #running-req: 1, #token: 1080, token usage: 0.00, cuda graph: True, gen throughput (token/s): 92.00, #queue-req: 0,
[2025-09-12 07:38:20 TP0] Decode batch. #running-req: 1, #token: 1120, token usage: 0.00, cuda graph: True, gen throughput (token/s): 91.99, #queue-req: 0,
[2025-09-12 07:38:21 TP0] Decode batch. #running-req: 1, #token: 1160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 92.02, #queue-req: 0,
[2025-09-12 07:38:21 TP0] Decode batch. #running-req: 1, #token: 1200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 92.03, #queue-req: 0,
[2025-09-12 07:38:21 TP0] Decode batch. #running-req: 1, #token: 1240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 92.02, #queue-req: 0,
[2025-09-12 07:38:22 TP0] Decode batch. #running-req: 1, #token: 1280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 92.04, #queue-req: 0,
[2025-09-12 07:38:23 TP0] Decode batch. #running-req: 1, #token: 1320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 33.16, #queue-req: 0,
[2025-09-12 07:38:24 TP0] Decode batch. #running-req: 1, #token: 1360, token usage: 0.00, cuda graph: True, gen throughput (token/s): 38.38, #queue-req: 0,
[2025-09-12 07:38:25 TP0] Decode batch. #running-req: 1, #token: 1400, token usage: 0.00, cuda graph: True, gen throughput (token/s): 90.65, #queue-req: 0,
[2025-09-12 07:38:25 TP0] Decode batch. #running-req: 1, #token: 1440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 90.67, #queue-req: 0,
[2025-09-12 07:38:25] INFO: 10.34.109.36:37208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
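Once the server reports it is "fired up and ready to roll", you can smoke-test the OpenAI-compatible endpoint directly. One option is a local port-forward against the Service sketched earlier (the Service name is an assumption); note that the served model name defaults to the model path, as shown in the server_args line above:
kubectl port-forward -n grok svc/grok2-sglang 30000:30000
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/models/grok-2",
        "messages": [{"role": "user", "content": "Give me a one-line summary of Grok-2."}],
        "max_tokens": 128
      }'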