
Use Nebius vLLM with Opencode

Use this example if you want to publish a Nebius-hosted vLLM service plan with omctl, create a GPU-backed instance, expose the model over the Omnistrate-managed DNS endpoint, and use that endpoint from Opencode.

This example builds on the account setup in Nebius account onboarding with CTL.

  • model: Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled
  • served model alias: qwen3.5-27b-claude-4.6-opus-reasoning-distilled
  • tokenizer override: Qwen/Qwen3.5-27B
  • public OpenAI-compatible vLLM API over HTTPS on the Omnistrate-managed DNS endpoint
  • tool-calling enabled so Opencode can use /v1/chat/completions
  • Omnistrate dashboards for both vLLM metrics and NVIDIA GPU metrics

1. Choose the spec

Clone the example from https://github.com/omnistrate-community/nebius-vllm-demo.

The repository contains two variants:

  • spec.yaml: single-node Nebius deployment on a single H200 GPU
  • spec-gpu-cluster.yaml: multi-GPU Nebius GPU-cluster deployment on H200 with tensor parallelism enabled

Use spec.yaml if you want the simplest path for Opencode.

Use spec-gpu-cluster.yaml only if you already have a Nebius GPU cluster and want to run the same model with 8 GPUs. Before building that spec, replace the placeholder GpuClusterID: compute-cluster-id with your real Nebius GPU cluster ID.

2. Build and release the service plan

Run the build from the directory that contains the spec you want to publish.

Single-GPU example:

omctl build -f spec.yaml \
  --product-name 'Nebius' \
  --spec-type ServicePlanSpec \
  --release-as-preferred

GPU-cluster example:

omctl build -f spec-gpu-cluster.yaml \
  --product-name 'Nebius' \
  --spec-type ServicePlanSpec \
  --release-as-preferred

environment  plan_id        plan_name                  release_description          service_id    service_name      version  version_set_status
-----------  -------------  -------------------------  ---------------------------  ------------  ----------------  -------  ------------------
Dev          pt-VRWPNLImF0  Nebius vLLM GPU Inference  Initial Release Version Set  s-GuWdEcfcP5  Nebius vLLM Demo  1.0      Preferred
Check the service plan result at: https://omnistrate.cloud/product-tier?serviceId=s-GuWdEcfcP5&environmentId=se-BGBkF9Nwqq
Access your SaaS offer at: https://saasportal.instance-w6vidhd14.hc-pelsk80ph.us-east-2.aws.f2e0a955bb84.cloud/service-plans?serviceId=s-GuWdEcfcP5&environmentId=se-BGBkF9Nwqq

This creates or updates the service named Nebius and releases the plan version as the preferred version in the target environment. In this example, the plan name inside the spec is Nebius vLLM GPU Inference.

3. Create a Nebius vLLM instance

Once the plan is released, create an instance in a region backed by a READY Nebius binding.

omctl instance create \
  --service=Nebius \
  --environment=dev \
  --plan='Nebius vLLM GPU Inference' \
  --resource=vllm \
  --cloud-provider=nebius \
  --region=<ready-nebius-region>

cloud_provider  environment  instance_id    plan                       region       resource  service           status     subscription_id  tags  version
--------------  -----------  -------------  -------------------------  -----------  --------  ----------------  ---------  ---------------  ----  -------
nebius          Dev          <instance-id>  Nebius vLLM GPU Inference  us-central1  vllm      Nebius vLLM Demo  DEPLOYING  sub-6OR7umU3Ei         1.0

If you are using a BYOA plan instead of a hosted plan, add --customer-account-id <customer-account-instance-id> to the create command.

Keep the returned instance ID. You will use it to inspect the endpoint, configure Opencode, and later delete the deployment.

If this is the first deployment in that Nebius tenant and region, expect the initial infrastructure bring-up to take longer than a normal redeploy.

You can track the instance status with:

omctl instance describe <instance-id>

# Filter using jq
omctl instance describe <instance-id> | jq '.consumptionResourceInstanceResult.status'
"RUNNING"

If something is not progressing, inspect the deployment with:

omctl instance debug <instance-id>

(screenshots: nebius instance debug, nebius instance debug events)

4. Get the generated inference endpoint

List the instance endpoints:

omctl instance list-endpoints <instance-id>

endpoint_name  endpoint_type  network_type  ports  resource_name  status   url
-------------  -------------  ------------  -----  -------------  -------  ---
api            additional     PUBLIC        443    vllm           HEALTHY  r-o3davudsia.<instance-id>.hc-xyxgpfjol.us-central1.nebius.f2e0a955bb84.cloud

You should see a public api endpoint pointing at the Omnistrate-managed DNS name for the vLLM service.

5. Verify the endpoint before wiring in Opencode

Use the generated endpoint directly over HTTPS. The ingress already serves TLS on port 443, so do not append a port.

Health check:

curl -i https://<endpoint>/health

List the served models:

curl https://<endpoint>/v1/models

You should see the served model ID:

qwen3.5-27b-claude-4.6-opus-reasoning-distilled
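For a scripted check, the `/v1/models` response follows the standard OpenAI list shape. A minimal Python sketch of the verification, using an illustrative response body rather than a live call:

```python
import json

# Illustrative /v1/models response body; in practice, fetch
# https://<endpoint>/v1/models and use the real response text.
body = '''{
  "object": "list",
  "data": [
    {"id": "qwen3.5-27b-claude-4.6-opus-reasoning-distilled", "object": "model"}
  ]
}'''

model_ids = [m["id"] for m in json.loads(body)["data"]]
print(model_ids)
assert "qwen3.5-27b-claude-4.6-opus-reasoning-distilled" in model_ids
```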

Run a basic chat completion:

curl -sS \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5-27b-claude-4.6-opus-reasoning-distilled",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function that reverses a linked list."
      }
    ],
    "max_tokens": 512,
    "temperature": 0
  }' \
  https://<endpoint>/v1/chat/completions
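The response is a standard OpenAI chat completion object, so the generated code lives under `choices[0].message.content`. A minimal Python sketch of the extraction, using an illustrative (not live) response body:

```python
import json

# Illustrative chat completion response; substitute the real body
# returned by https://<endpoint>/v1/chat/completions.
body = '''{
  "choices": [
    {"index": 0,
     "message": {"role": "assistant",
                 "content": "def reverse_linked_list(head):\\n    prev = None"},
     "finish_reason": "stop"}
  ]
}'''

reply = json.loads(body)["choices"][0]["message"]["content"]
print(reply)
```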

Because Opencode uses tool-enabled OpenAI-compatible chat requests, it is worth verifying tool calling too:

curl -sS \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen3.5-27b-claude-4.6-opus-reasoning-distilled",
    "messages": [
      {
        "role": "user",
        "content": "What is 2 + 2? Use the tool if needed."
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "echo",
          "description": "Echo the provided text",
          "parameters": {
            "type": "object",
            "properties": {
              "text": {
                "type": "string"
              }
            },
            "required": ["text"]
          }
        }
      }
    ],
    "tool_choice": "auto",
    "max_tokens": 256
  }' \
  https://<endpoint>/v1/chat/completions
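When the model opts to use the tool, the assistant message carries a `tool_calls` array with JSON-encoded arguments instead of plain `content`. A minimal Python sketch of dispatching such a call; the response body is illustrative, and the local `echo` function mirrors the tool declared in the request above:

```python
import json

def echo(text: str) -> str:
    """Local implementation of the 'echo' tool declared in the request."""
    return text

# Illustrative tool-call response; substitute the real API response body.
body = '''{
  "choices": [
    {"message": {
       "role": "assistant",
       "tool_calls": [
         {"id": "call_1",
          "type": "function",
          "function": {"name": "echo",
                       "arguments": "{\\"text\\": \\"2 + 2 = 4\\"}"}}
       ]},
     "finish_reason": "tool_calls"}
  ]
}'''

message = json.loads(body)["choices"][0]["message"]
for call in message.get("tool_calls", []):
    # Arguments arrive as a JSON-encoded string, not a parsed object.
    args = json.loads(call["function"]["arguments"])
    if call["function"]["name"] == "echo":
        result = echo(**args)
        print(result)
```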

The shipped spec already enables the vLLM flags required for this flow, including --enable-auto-tool-choice, --tool-call-parser qwen3_coder, and --reasoning-parser qwen3.

If this request returns a 400 complaining that auto tool choice is not enabled, the instance is still running an older plan version or an older chart release.

6. Configure Opencode

Point Opencode at the Nebius vLLM endpoint by editing ~/.config/opencode/opencode.json.

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "nebius-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Nebius vLLM Qwen 3.5 Distilled",
      "options": {
        "baseURL": "https://<endpoint>/v1"
      },
      "models": {
        "qwen3.5-27b-claude-4.6-opus-reasoning-distilled": {
          "name": "qwen3.5-27b-claude-4.6-opus-reasoning-distilled"
        }
      }
    }
  }
}

Notes:

  • Replace <endpoint> with the DNS name returned by omctl instance list-endpoints.
  • Do not append :443 or :8000; the ingress already exposes the service over HTTPS on the default port.
  • If your endpoint requires authentication, add options.apiKey or options.headers.
  • If you want Opencode to understand model limits more precisely, add a limit block under the model entry.
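As a sketch of that last point, a limit block nested under the model entry might look like the fragment below; the field names follow my reading of the Opencode config schema and the token counts are illustrative, so check your model's actual context window before copying them:

```json
"models": {
  "qwen3.5-27b-claude-4.6-opus-reasoning-distilled": {
    "name": "qwen3.5-27b-claude-4.6-opus-reasoning-distilled",
    "limit": {
      "context": 131072,
      "output": 8192
    }
  }
}
```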

After saving the config, start Opencode and run:

/models

Select:

  • provider: nebius-vllm
  • model: qwen3.5-27b-claude-4.6-opus-reasoning-distilled

At that point Opencode will use your Nebius-hosted vLLM endpoint for chat and tool-enabled coding requests.

(screenshot: Opencode integration)

7. Inspect the Omnistrate dashboards

This example includes Omnistrate dashboard integrations for both vLLM and NVIDIA GPU telemetry.

Open the dashboards with:

omctl instance dashboard <instance-id>

Out of the box, the dashboards expose:

  • vLLM request concurrency, throughput, latency, KV cache usage, and prefix cache hit rate
  • NVIDIA GPU inventory, utilization, framebuffer usage, PCIe throughput, thermals, power, and error signals

(screenshots: Nebius dashboards, Nebius GPU dashboard, Nebius vLLM dashboard)

8. Delete the instance

Delete the deployment when you are finished:

omctl instance delete <instance-id>

Add -y if you want to skip the confirmation prompt:

omctl instance delete <instance-id> -y