Asoba Ona Terminal

EnergyAnalyst-v0.1 — AWS deployment and simple chat UI

This guide shows how to deploy asoba/EnergyAnalyst-v0.1 on AWS and wire up a minimal chat interface that inherits this documentation site’s styling.

1) Quick architecture

graph TB
    subgraph "Static Hosting"
        A["GitHub Pages<br/>Documentation Site"]
    end
    
    subgraph "AWS Infrastructure"
        B["API Gateway / ALB<br/>Load Balancer"]
        C["EC2 GPU Instance<br/>g5.xlarge"]
        D["TGI/vLLM Container<br/>Model Server"]
    end
    
    subgraph "Model Storage"
        E["Hugging Face Hub<br/>asoba/EnergyAnalyst-v0.1"]
    end
    
    A -->|HTTPS Request| B
    B -->|Private HTTPS| C
    C -->|Runs| D
    D -->|Downloads/Inference| E

Tip — choose your front door: the architecture above shows API Gateway or an Application Load Balancer in front of the GPU instance. For a minimal deployment, the Nginx reverse proxy from section 3.1 can serve as the public entry point, and you can add a managed load balancer later (section 6).
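
The steps below assume a GPU instance (g5.xlarge, as in the diagram) is already running and reachable over SSH, with the model port open to your IP. If you still need to launch one, here is a minimal AWS CLI sketch; the AMI ID, key pair name, and security group are placeholders you must substitute, and a Deep Learning AMI saves the Docker/NVIDIA setup in the next step:

# Launch a g5.xlarge with a 200 GB gp3 root volume (all IDs below are placeholders).
# The security group must allow SSH (22) and the model/proxy ports you plan to expose.
aws ec2 run-instances \
  --image-id ami-XXXXXXXXXXXXXXXXX \
  --instance-type g5.xlarge \
  --key-name my-keypair \
  --security-group-ids sg-XXXXXXXXXXXXXXXXX \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=energyanalyst}]'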

SSH in and install Docker plus the NVIDIA container toolkit (skip this step if you launched a Deep Learning AMI with Docker and the NVIDIA runtime preconfigured):

# Docker
sudo apt-get update -y
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
sudo systemctl enable --now docker

# NVIDIA container runtime (for GPU access in Docker)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update -y && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Login again so your user is in the docker group
exit

Log back in so the docker group membership takes effect, then set your Hugging Face token:

export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
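
Before pulling a large model image, it is worth confirming that containers can actually see the GPU. A quick sanity check (the CUDA image tag is only an example; any recent CUDA base image works):

# Should print the nvidia-smi GPU table; if it errors, revisit the toolkit install above
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi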

3) Start a text-generation server

Option A — Hugging Face Text Generation Inference (TGI):

docker run --gpus all --rm -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id asoba/EnergyAnalyst-v0.1 \
  --num-shard 1 \
  --max-input-length 2048 \
  --max-total-tokens 3072
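
The first start downloads the model weights from the Hub, which can take several minutes. TGI exposes simple health and info endpoints you can poll from the instance to know when it is ready (paths per current TGI releases; adjust if your version differs):

# 200 from /health means the server is ready to serve; /info echoes the loaded model id
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
curl -s http://localhost:8080/info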

Option B — vLLM (OpenAI-compatible server):

docker run --gpus all --rm -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model asoba/EnergyAnalyst-v0.1 \
  --dtype bfloat16 \
  --max-model-len 3072
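
Because vLLM exposes an OpenAI-compatible API, a quick readiness check is to list the served models once the container finishes loading the weights:

# Should return a JSON model list containing "asoba/EnergyAnalyst-v0.1"
curl -s http://localhost:8000/v1/models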

3.1) Quick path (≈15 min): Single EC2 with Nginx rate limiting

Goal: add a lightweight reverse proxy with basic rate limiting in front of TGI (adjust ports for vLLM).

1) Start the model server (from section 3, Option A):

docker run --gpus all --rm -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id asoba/EnergyAnalyst-v0.1 \
  --num-shard 1 \
  --max-input-length 2048 \
  --max-total-tokens 3072

2) Create nginx.conf:

worker_processes auto;
events { worker_connections 1024; }
http {
  limit_req_zone $binary_remote_addr zone=ip_rl:10m rate=30r/m;
  upstream tgi { server 127.0.0.1:8080; }

  server {
    listen 80;
    add_header Access-Control-Allow-Origin "https://asobacloud.github.io" always;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
    add_header Access-Control-Allow-Headers "Content-Type, Authorization, x-api-key" always;

    location / {
      # Answer CORS preflight directly instead of proxying it to TGI
      if ($request_method = 'OPTIONS') { return 204; }
      limit_req zone=ip_rl burst=10 nodelay;
      limit_req_status 429;   # nginx rejects with 503 by default; 429 lets the error_page mapping below fire
      proxy_pass http://tgi;
    }

    error_page 429 = @limited;
    location @limited { return 429; }
  }
}
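
Optionally validate the file before starting the proxy; this runs nginx -t against the same config the container will mount:

# "syntax is ok" / "test is successful" means the file parses cleanly
docker run --rm -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro nginx:1.25-alpine nginx -t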

3) Run Nginx (host networking keeps it simple):

docker run -d --name gateway --network host \
  -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro \
  nginx:1.25-alpine

4) Point the chat UI (section 5) at the proxy by setting API_BASE = 'http://<EC2_PUBLIC_IP>' (port 80, where Nginx listens) instead of the direct TGI port 8080.

5) For stronger protections (per-API-key limits, 429 UI handling), see section 10.
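
A quick way to confirm the limiter works (assuming the limit_req_status 429 line above and port 80 open in the security group) is to fire more requests than the 30 requests/minute plus burst allowance and watch the status codes flip to 429. TGI's lightweight /health path keeps the probe cheap:

# Expect 200s for roughly the first dozen requests, then 429s once the burst is exhausted
for i in $(seq 1 40); do
  curl -s -o /dev/null -w "%{http_code}\n" http://<EC2_PUBLIC_IP>/health
done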

4) Test the endpoint

TGI example:

curl -s http://<EC2_PUBLIC_IP>:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "You are a regulatory compliance expert.\n\nInstruction: List key compliance requirements for utility-scale solar projects.",
    "parameters": {"max_new_tokens": 256, "temperature": 0.7}
  }'
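
If jq is available, the generated text can be pulled straight out of the response; the field name is the same one the chat UI reads in section 5:

# Prints only the model's completion text
curl -s http://<EC2_PUBLIC_IP>:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"List key compliance requirements for utility-scale solar projects.","parameters":{"max_new_tokens":128}}' \
  | jq -r '.generated_text'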

vLLM example:

curl -s http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "asoba/EnergyAnalyst-v0.1",
    "messages": [
      {"role": "system", "content": "You are a regulatory compliance expert."},
      {"role": "user", "content": "List key compliance requirements for utility-scale solar projects."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

5) Minimal chat UI (matches this site’s look)

Drop this HTML into a new page (e.g., chat.html in this repo), or embed the widget section inside any doc page. It inherits the site’s typography and spacing.

<div class="chat-widget">
  <h2>EnergyAnalyst Chat</h2>
  <div id="chat-log" style="background:#fff;border:1px solid #e2e8f0;border-radius:8px;padding:12px;max-height:380px;overflow:auto"></div>
  <div style="display:flex;gap:8px;margin-top:10px">
    <input id="chat-input" type="text" placeholder="Ask about energy policy…" style="flex:1;padding:10px;border:1px solid #e2e8f0;border-radius:6px" />
    <button id="chat-send" style="background:#4551bf;color:#fff;border:none;border-radius:6px;padding:10px 14px;cursor:pointer">Send</button>
  </div>
  <small style="color:#64748b">Model: asoba/EnergyAnalyst-v0.1</small>
</div>
<script>
  const API_BASE = 'http://<EC2_PUBLIC_IP>:8080'; // TGI
  // For vLLM (OpenAI API), set API_BASE to 'http://<IP>:8000/v1' and adjust fetch below.

  async function askTGI(prompt) {
    const res = await fetch(`${API_BASE}/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        inputs: `You are a regulatory compliance expert.\n\n### Instruction:\n${prompt}\n\n### Response:`,
        parameters: { max_new_tokens: 256, temperature: 0.7 }
      })
    });
    if (!res.ok) throw new Error('Request failed');
    const data = await res.json();
    // TGI returns {generated_text: ...} or, in some versions, an array of such objects
    return (data.generated_text ?? data[0]?.generated_text ?? JSON.stringify(data));
  }

  // UI wiring
  const log = document.getElementById('chat-log');
  const input = document.getElementById('chat-input');
  const send = document.getElementById('chat-send');

  function append(role, text) {
    const row = document.createElement('div');
    row.style.margin = '8px 0';
    row.innerHTML = `<strong style="color:${role==='You' ? '#1e293b' : '#4551bf'}">${role}:</strong> <span>${text.replace(/</g,'&lt;')}</span>`;
    log.appendChild(row);
    log.scrollTop = log.scrollHeight;
  }

  async function handleSend() {
    const q = input.value.trim();
    if (!q) return;
    append('You', q);
    input.value = '';
    try {
      const a = await askTGI(q);
      append('EnergyAnalyst', a);
    } catch (e) {
      append('Error', 'Failed to reach model API');
    }
  }

  send.addEventListener('click', handleSend);
  input.addEventListener('keydown', (e) => { if (e.key === 'Enter') handleSend(); });
</script>

Notes: the widget is served from GitHub Pages over HTTPS, so the browser will block calls to a plain http:// endpoint (mixed content) and will also require CORS headers that allow the docs-site origin. With the Nginx proxy in front of the model server, those headers look like:

add_header Access-Control-Allow-Origin "https://asobacloud.github.io" always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
add_header Access-Control-Allow-Headers "Content-Type, Authorization" always;

6) Make it production-ready (optional)

7) Local usage (reference)
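
This assumes transformers and a recent PyTorch are installed in your local environment, for example:

# Versions are not pinned here; any recent transformers/torch pairing should work
pip install transformers torch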

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("asoba/EnergyAnalyst-v0.1")
tokenizer = AutoTokenizer.from_pretrained("asoba/EnergyAnalyst-v0.1")

prompt = (
  "You are a regulatory compliance expert.\n\n" 
  "### Instruction:\nWhat are the key compliance requirements for utility-scale solar projects?\n\n"
  "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

8) Instance sizing tips

9) Security

That’s it—after step 3 you’ll have a live endpoint; add the chat widget to any page in this docs site to talk to EnergyAnalyst-v0.1.

10) Lightweight rate limiting (client + server)

If you want a simple deployment that still protects against bots and abusive usage, add a tiny reverse proxy with rate limiting and a small client-side throttle. This avoids heavy infrastructure while providing practical safeguards.

Server: Nginx reverse proxy with rate limiting

Run Nginx in front of TGI or vLLM and enforce limits per IP and per API key. Example using Docker Compose with TGI (adapt to vLLM by changing the upstream and port):

# docker-compose.yml (example)
version: "3.8"
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:2.0.4
    command: [
      "--model-id","asoba/EnergyAnalyst-v0.1",
      "--num-shard","1",
      "--max-input-length","2048",
      "--max-total-tokens","3072"
    ]
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    # Do not publish; reachable by service name from gateway

  gateway:
    image: nginx:1.25-alpine
    depends_on:
      - tgi
    ports:
      - "80:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
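
Once nginx/nginx.conf from the next step is in place, the stack can be brought up with Docker Compose v2; a minimal sketch:

# HF_TOKEN is passed into the tgi service via the environment mapping above
export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
docker compose up -d
docker compose logs -f tgi   # watch the model download and server start-up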

Create nginx/nginx.conf with sensible defaults, CORS for this docs site, and rate limiting:

worker_processes auto;
events { worker_connections 1024; }
http {
  # Token-bucket per-IP and per-API-key
  limit_req_zone $binary_remote_addr zone=ip_rl:10m rate=30r/m;   # 30 requests/min per IP
  map $http_x_api_key $api_key { default "anon"; "~.+" $http_x_api_key; }
  limit_req_zone $api_key zone=key_rl:10m rate=60r/m;            # 60 requests/min per key
  limit_conn_zone $binary_remote_addr zone=addr_conn:10m;        # concurrent connection cap

  upstream tgi { server tgi:80; }

  server {
    listen 80;

    # CORS: allow only the docs site (adjust to your domain)
    set $allowed_origin "https://asobacloud.github.io";
    add_header Access-Control-Allow-Origin $allowed_origin always;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
    add_header Access-Control-Allow-Headers "Content-Type, Authorization, x-api-key" always;

    # Answer CORS preflight directly. The add_header directives above also apply to this 204;
    # they cannot live inside the "if" block because add_header is not valid in a server-level "if".
    if ($request_method = 'OPTIONS') {
      return 204;
    }

    # Basic abuse controls
    client_max_body_size 128k;        # prompt size guard
    limit_conn addr_conn 5;           # concurrent connections per IP
    limit_req zone=ip_rl  burst=10 nodelay;
    limit_req zone=key_rl burst=20 nodelay;
    limit_req_status  429;            # nginx rejects with 503 by default; 429 lets @ratelimited below fire
    limit_conn_status 429;

    # Help clients back off when limited
    add_header Retry-After 15 always;

    location / {
      proxy_pass http://tgi;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_read_timeout 600s;
    }

    error_page 429 = @ratelimited;
    location @ratelimited {
      default_type application/json;
      return 429 '{"error":"rate_limited","message":"Too many requests. Please slow down."}';
    }
  }
}
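
A quick probe from your workstation confirms the limiter and the JSON 429 body. Note that both zones apply, so from a single IP the stricter per-IP limit (30 r/m) trips first even when an x-api-key header is supplied; distinct key values get their own bucket, and requests without a key share the "anon" bucket:

# Expect 200s until the burst is exhausted, then 429s (drop -o /dev/null to see the JSON error body)
for i in $(seq 1 60); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -H 'x-api-key: demo-key-1' \
    http://<EC2_PUBLIC_IP>/health
done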

Notes:

Optional hardening:

Client: throttle requests and handle 429

Add a small token-bucket throttle to the chat UI and surface server 429 responses. Insert the following into the earlier chat.html script (it replaces the existing handleSend and askTGI and adds the limiter helper):

<script>
  // Simple token bucket: 10 requests/min, refills continuously
  const rl = {
    capacity: 10,
    tokens: 10,
    refillMs: 6000, // 1 token every 6s
    last: Date.now()
  };
  function tryConsume() {
    const now = Date.now();
    const elapsed = now - rl.last;
    const refill = Math.floor(elapsed / rl.refillMs);
    if (refill > 0) {
      rl.tokens = Math.min(rl.capacity, rl.tokens + refill);
      rl.last = rl.last + refill * rl.refillMs;
    }
    if (rl.tokens <= 0) return false;
    rl.tokens -= 1;
    return true;
  }

  async function handleSend() {
    const q = input.value.trim();
    if (!q) return;
    if (!tryConsume()) {
      append('Info', 'You are sending messages too quickly. Please wait a few seconds.');
      return;
    }
    append('You', q);
    input.value = '';
    send.disabled = true;
    try {
      const a = await askTGI(q);
      append('EnergyAnalyst', a);
    } catch (e) {
      if (e?.message?.includes('429')) {
        append('Error', 'Rate limited by server. Please wait and try again.');
      } else {
        append('Error', 'Failed to reach model API');
      }
    } finally {
      send.disabled = false;
    }
  }

  // Intercept 429 on fetch
  async function askTGI(prompt) {
    const res = await fetch(`${API_BASE}/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        inputs: `You are a regulatory compliance expert.\n\n### Instruction:\n${prompt}\n\n### Response:`,
        parameters: { max_new_tokens: 256, temperature: 0.7 }
      })
    });
    if (res.status === 429) throw new Error('429');
    if (!res.ok) throw new Error('Request failed');
    const data = await res.json();
    return (data.generated_text ?? data[0]?.generated_text ?? JSON.stringify(data));
  }
</script>

Optional bot friction (client-only, zero backend changes):

This combination provides practical protection with minimal moving parts: a tiny Nginx container to enforce server-side fairness and a lightweight client throttle to improve UX and deter basic automation.