EnergyAnalyst-v0.1 — AWS deployment and simple chat UI
This guide shows how to deploy asoba/EnergyAnalyst-v0.1
on AWS and wire up a minimal chat interface that inherits this documentation site’s styling.
- Model: Mistral-7B-v0.3 (LoRA fine-tuned)
- License: Apache-2.0
- Best for: Energy policy and regulatory compliance analysis
1) Quick architecture
graph TB
subgraph "Static Hosting"
A["GitHub Pages<br/>Documentation Site"]
end
subgraph "AWS Infrastructure"
B["API Gateway / ALB<br/>Load Balancer"]
C["EC2 GPU Instance<br/>g5.xlarge"]
D["TGI/vLLM Container<br/>Model Server"]
end
subgraph "Model Storage"
E["Hugging Face Hub<br/>asoba/EnergyAnalyst-v0.1"]
end
A -->|HTTPS Request| B
B -->|Private HTTPS| C
C -->|Runs| D
D -->|Downloads/Inference| E
- Keep this site static (GitHub Pages or S3).
- Host the model server on an AWS GPU EC2 instance using a container runtime.
- Optionally front with an ALB or API Gateway + ACM TLS; enable CORS for the docs origin.
Tip — choose your front door:
- Lightweight (single EC2): run an on-instance Nginx reverse proxy for CORS and basic rate limiting. Fastest to set up. See “3.1 Quick path (≈15 min)”.
- Managed: use ALB or API Gateway with ACM TLS and add AWS WAF/usage plans for rate limiting and bot mitigation. Best for production scale.
2) Recommended EC2 setup (simple and fast)
- Instance type: g5.xlarge (or g4dn.xlarge) with a 100–200 GB gp3 root volume
- AMI: Ubuntu 22.04 or the AWS Deep Learning AMI (comes with NVIDIA drivers preinstalled)
- Security group: open TCP 80 or 8080 from your office/VPN or the ALB only
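If you prefer the CLI to the console, a launch along these lines works. The AMI, key pair, security group, and subnet IDs below are placeholders you must replace, and the root device name can differ between AMIs, so treat this as a sketch rather than a copy-paste command:
# Sketch only: launch a g5.xlarge with a 200 GB gp3 root volume.
# Replace the AMI, key pair, security group, and subnet IDs with your own;
# /dev/sda1 matches Ubuntu AMIs and may differ for other images.
aws ec2 run-instances \
  --image-id ami-0XXXXXXXXXXXXXXXX \
  --instance-type g5.xlarge \
  --key-name my-keypair \
  --security-group-ids sg-0XXXXXXXXXXXXXXXX \
  --subnet-id subnet-0XXXXXXXXXXXXXXXX \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":200,"VolumeType":"gp3"}}]'

# Allow TCP 8080 only from your office/VPN range (placeholder CIDR)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0XXXXXXXXXXXXXXXX \
  --protocol tcp --port 8080 \
  --cidr 203.0.113.0/24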
SSH in and install Docker + NVIDIA toolkit (skip if using DLAMI with Docker+NVIDIA preconfigured):
# Docker
sudo apt-get update -y
sudo apt-get install -y docker.io
sudo usermod -aG docker $USER
sudo systemctl enable --now docker
# NVIDIA container runtime (for GPU access in Docker)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update -y && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Login again so your user is in the docker group
exit
Log back in, set your Hugging Face token:
export HF_TOKEN=hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
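Optionally, confirm that containers can actually see the GPU before pulling the model. The one-liner below assumes the public CUDA base image on Docker Hub:
# Should print an nvidia-smi table showing the A10G; if it fails, revisit the NVIDIA toolkit setup above
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi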
3) Start a text-generation server
Option A — Hugging Face Text Generation Inference (TGI):
docker run --gpus all --rm -p 8080:80 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.0.4 \
--model-id asoba/EnergyAnalyst-v0.1 \
--num-shard 1 \
--max-input-length 2048 \
--max-total-tokens 3072
- API base: http://<EC2_PUBLIC_IP>:8080
- Health check: GET /health returns 200 once the model is loaded; GET /info returns model metadata as JSON
- Inference: POST /generate (TGI) with a JSON payload
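Before wiring up a client, you can sanity-check the server from a terminal. The routes below are the standard TGI health and info endpoints, but verify them against the TGI version you deploy:
# Returns 200 once the model has finished loading
curl -i http://<EC2_PUBLIC_IP>:8080/health

# Returns JSON with the model id, dtype, and token limits
curl -s http://<EC2_PUBLIC_IP>:8080/info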
Option B — vLLM (OpenAI-compatible server):
docker run --gpus all --rm -p 8000:8000 \
-e HF_TOKEN=$HF_TOKEN \
ghcr.io/vllm-project/vllm-openai:latest \
--model asoba/EnergyAnalyst-v0.1 \
--dtype bfloat16 \
--max-model-len 3072
- API base: http://<EC2_PUBLIC_IP>:8000/v1
- Inference: POST /v1/chat/completions
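As a quick sanity check, the OpenAI-compatible server lists the models it is serving:
# Should list asoba/EnergyAnalyst-v0.1 as a served model
curl -s http://<EC2_PUBLIC_IP>:8000/v1/models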
3.1) Quick path (≈15 min): Single EC2 with Nginx rate limiting
Goal: add a lightweight reverse proxy with basic rate limiting in front of TGI (adjust ports for vLLM).
1) Start the model server (from section 3, Option A):
docker run --gpus all --rm -p 8080:80 \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
ghcr.io/huggingface/text-generation-inference:2.0.4 \
--model-id asoba/EnergyAnalyst-v0.1 \
--num-shard 1 \
--max-input-length 2048 \
--max-total-tokens 3072
2) Create nginx.conf:
worker_processes auto;
events { worker_connections 1024; }
http {
limit_req_zone $binary_remote_addr zone=ip_rl:10m rate=30r/m;
upstream tgi { server 127.0.0.1:8080; }
server {
listen 80;
add_header Access-Control-Allow-Origin "https://asobacloud.github.io" always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
add_header Access-Control-Allow-Headers "Content-Type, Authorization, x-api-key" always;
location / {
limit_req zone=ip_rl burst=10 nodelay;
proxy_pass http://tgi;
}
error_page 429 = @limited;
location @limited { return 429; }
}
}
3) Run Nginx (host networking keeps it simple):
docker run -d --name gateway --network host \
-v $(pwd)/nginx.conf:/etc/nginx/nginx.conf:ro \
nginx:1.25-alpine
4) Point the chat UI (section 5) to the proxy by setting:
const API_BASE = 'http://<EC2_PUBLIC_IP>';
5) For stronger protections (per-API-key limits, 429 UI handling), see section 10.
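To confirm the limiter is actually engaging, fire a short burst at the proxy and watch the status codes. This is only a smoke test, not a load test, and the exact number of 200s depends on how fast each request completes:
# Fire 20 quick requests at the proxy; once the per-IP burst (10) is spent,
# the remaining requests should come back as 429 instead of 200
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST http://<EC2_PUBLIC_IP>/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"ping","parameters":{"max_new_tokens":1}}'
done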
4) Test the endpoint
TGI example:
curl -s http://<EC2_PUBLIC_IP>:8080/generate \
-H 'Content-Type: application/json' \
-d '{
"inputs": "You are a regulatory compliance expert.\n\nInstruction: List key compliance requirements for utility-scale solar projects.",
"parameters": {"max_new_tokens": 256, "temperature": 0.7}
}'
vLLM example:
curl -s http://<EC2_PUBLIC_IP>:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "asoba/EnergyAnalyst-v0.1",
"messages": [
{"role": "system", "content": "You are a regulatory compliance expert."},
{"role": "user", "content": "List key compliance requirements for utility-scale solar projects."}
],
"temperature": 0.7,
"max_tokens": 256
}'
5) Minimal chat UI (matches this site’s look)
Drop this HTML into a new page (e.g., chat.html
in this repo), or embed the widget section inside any doc page. It inherits the site’s typography and spacing.
<div class="chat-widget">
<h2>EnergyAnalyst Chat</h2>
<div id="chat-log" style="background:#fff;border:1px solid #e2e8f0;border-radius:8px;padding:12px;max-height:380px;overflow:auto"></div>
<div style="display:flex;gap:8px;margin-top:10px">
<input id="chat-input" type="text" placeholder="Ask about energy policy…" style="flex:1;padding:10px;border:1px solid #e2e8f0;border-radius:6px" />
<button id="chat-send" style="background:#4551bf;color:#fff;border:none;border-radius:6px;padding:10px 14px;cursor:pointer">Send</button>
</div>
<small style="color:#64748b">Model: asoba/EnergyAnalyst-v0.1</small>
</div>
<script>
const API_BASE = 'http://<EC2_PUBLIC_IP>:8080'; // TGI
// For vLLM (OpenAI API), set API_BASE to 'http://<IP>:8000/v1' and adjust fetch below.
async function askTGI(prompt) {
const res = await fetch(`${API_BASE}/generate`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
inputs: `You are a regulatory compliance expert.\n\n### Instruction:\n${prompt}\n\n### Response:`,
parameters: { max_new_tokens: 256, temperature: 0.7 }
})
});
if (!res.ok) throw new Error('Request failed');
const data = await res.json();
// Depending on the TGI version/route, the response is {generated_text: ...} or an array of such objects
return (data.generated_text ?? data[0]?.generated_text ?? JSON.stringify(data));
}
// UI wiring
const log = document.getElementById('chat-log');
const input = document.getElementById('chat-input');
const send = document.getElementById('chat-send');
function append(role, text) {
const row = document.createElement('div');
row.style.margin = '8px 0';
row.innerHTML = `<strong style="color:${role==='You' ? '#1e293b' : '#4551bf'}">${role}:</strong> <span>${text.replace(/</g,'&lt;')}</span>`;
log.appendChild(row);
log.scrollTop = log.scrollHeight;
}
async function handleSend() {
const q = input.value.trim();
if (!q) return;
append('You', q);
input.value = '';
try {
const a = await askTGI(q);
append('EnergyAnalyst', a);
} catch (e) {
append('Error', 'Failed to reach model API');
}
}
send.addEventListener('click', handleSend);
input.addEventListener('keydown', (e) => { if (e.key === 'Enter') handleSend(); });
</script>
Notes:
- If hosting the model behind an ALB or a different domain, enable CORS (e.g., via Nginx):
add_header Access-Control-Allow-Origin "https://asobacloud.github.io" always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
add_header Access-Control-Allow-Headers "Content-Type, Authorization" always;
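You can verify the CORS setup from a terminal by simulating the browser's preflight request. The origin matches the docs site used throughout this guide, and the port/path assume the Nginx proxy from sections 3.1 or 10:
# The Access-Control-Allow-* headers should appear in the response for the docs origin
curl -i -X OPTIONS http://<EC2_PUBLIC_IP>/generate \
  -H 'Origin: https://asobacloud.github.io' \
  -H 'Access-Control-Request-Method: POST' \
  -H 'Access-Control-Request-Headers: Content-Type'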
6) Make it production-ready (optional)
- Put the container behind an ALB with an HTTPS listener (ACM certificate)
- Restrict inbound access (security group) to the ALB only
- Auto-restart the container with a systemd unit, or use ECS/EKS for orchestration
- Add CloudWatch logs for the container
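A minimal sketch of a systemd unit that keeps the TGI container running across reboots and crashes; the unit name, env file path, and flags are illustrative and should be adapted to your setup:
# Write an illustrative unit file; /etc/energyanalyst.env should contain HF_TOKEN=... (root-readable only)
sudo tee /etc/systemd/system/energyanalyst.service > /dev/null <<'EOF'
[Unit]
Description=EnergyAnalyst TGI container
After=docker.service
Requires=docker.service

[Service]
Restart=always
EnvironmentFile=/etc/energyanalyst.env
ExecStartPre=-/usr/bin/docker rm -f energyanalyst
ExecStart=/usr/bin/docker run --name energyanalyst --gpus all -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id asoba/EnergyAnalyst-v0.1 --num-shard 1 \
  --max-input-length 2048 --max-total-tokens 3072
ExecStop=/usr/bin/docker stop energyanalyst

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload && sudo systemctl enable --now energyanalyst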
7) Local usage (reference)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("asoba/EnergyAnalyst-v0.1")
tokenizer = AutoTokenizer.from_pretrained("asoba/EnergyAnalyst-v0.1")
prompt = (
"You are a regulatory compliance expert.\n\n"
"### Instruction:\nWhat are the key compliance requirements for utility-scale solar projects?\n\n"
"### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)  # do_sample=True so temperature is actually applied
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
8) Instance sizing tips
- g5.xlarge (NVIDIA A10G, 24 GB) is sufficient for int8/FP16 single-shard serving
- Prefer bfloat16 if supported; quantization (e.g., AWQ/GGUF) can reduce memory
- Increase --max-total-tokens cautiously to avoid OOM
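If VRAM is tight, TGI can also quantize weights at load time; the bitsandbytes flag below is one option it supports, but throughput and quality trade-offs are model-dependent, so treat this as an experiment rather than a recommendation:
# Same as Option A, but quantizing weights at load time to reduce VRAM (expect lower throughput)
docker run --gpus all --rm -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:2.0.4 \
  --model-id asoba/EnergyAnalyst-v0.1 \
  --quantize bitsandbytes \
  --max-input-length 2048 \
  --max-total-tokens 3072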
9) Security
- Keep your HF token secret; use SSM Parameter Store or Secrets Manager
- Allow only your docs origin via CORS; use HTTPS end-to-end
- Rotate AMI and patch regularly; limit SSH access (e.g., SSM Session Manager)
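For example, the token can live in SSM Parameter Store as a SecureString and be pulled at startup instead of being pasted into the shell; the parameter name below is arbitrary, and the instance needs an IAM role allowing ssm:GetParameter:
# One-time: store the token encrypted (SecureString)
aws ssm put-parameter \
  --name /energyanalyst/hf-token \
  --type SecureString \
  --value "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

# On the instance: fetch it when starting the model server
export HF_TOKEN=$(aws ssm get-parameter \
  --name /energyanalyst/hf-token \
  --with-decryption \
  --query Parameter.Value \
  --output text)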
That’s it—after step 3 you’ll have a live endpoint; add the chat widget to any page in this docs site to talk to EnergyAnalyst-v0.1.
10) Lightweight rate limiting (client + server)
If you want a simple deployment that still protects against bots and abusive usage, add a tiny reverse proxy with rate limiting and a small client-side throttle. This avoids heavy infrastructure while providing practical safeguards.
Server: Nginx reverse proxy with rate limiting
Run Nginx in front of TGI or vLLM and enforce limits per IP and per API key. Example using Docker Compose with TGI (adapt to vLLM by changing the upstream and port):
# docker-compose.yml (example)
version: "3.8"
services:
tgi:
image: ghcr.io/huggingface/text-generation-inference:2.0.4
command: [
"--model-id","asoba/EnergyAnalyst-v0.1",
"--num-shard","1",
"--max-input-length","2048",
"--max-total-tokens","3072"
]
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
deploy:
resources:
reservations:
devices:
- capabilities: [gpu]
# Do not publish; reachable by service name from gateway
gateway:
image: nginx:1.25-alpine
depends_on:
- tgi
ports:
- "80:80"
volumes:
- ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
Create nginx/nginx.conf
with sensible defaults, CORS for this docs site, and rate limiting:
worker_processes auto;
events { worker_connections 1024; }
http {
# Token-bucket per-IP and per-API-key
limit_req_zone $binary_remote_addr zone=ip_rl:10m rate=30r/m; # 30 requests/min per IP
map $http_x_api_key $api_key { default "anon"; "~.+" $http_x_api_key; }
limit_req_zone $api_key zone=key_rl:10m rate=60r/m; # 60 requests/min per key
limit_conn_zone $binary_remote_addr zone=addr_conn:10m; # concurrent connection cap
upstream tgi { server tgi:80; }
server {
listen 80;
# CORS: allow only the docs site (adjust to your domain)
set $allowed_origin "https://asobacloud.github.io";
if ($request_method = 'OPTIONS') {
add_header Access-Control-Allow-Origin $allowed_origin always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
add_header Access-Control-Allow-Headers "Content-Type, Authorization, x-api-key" always;
return 204;
}
add_header Access-Control-Allow-Origin $allowed_origin always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
add_header Access-Control-Allow-Headers "Content-Type, Authorization, x-api-key" always;
# Basic abuse controls
client_max_body_size 128k; # prompt size guard
limit_conn addr_conn 5; # concurrent connections per IP
limit_req zone=ip_rl burst=10 nodelay;
limit_req zone=key_rl burst=20 nodelay;
# Help clients back off when limited
add_header Retry-After 15 always;
location / {
proxy_pass http://tgi;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_read_timeout 600s;
}
error_page 429 = @ratelimited;
location @ratelimited {
default_type application/json;
return 429 '{"error":"rate_limited","message":"Too many requests. Please slow down."}';
}
}
}
Notes:
- For vLLM, change the upstream to upstream vllm { server vllm:8000; } and use proxy_pass http://vllm;
- Containers in the same Compose file share a network; keep TGI unexposed and reference tgi:80 from Nginx. If you need host access for debugging, temporarily add ports: ["8080:80"] to tgi, but keep the upstream as tgi:80.
- To prefer API-key fairness, lower the per-IP rate and keep a higher per-key rate.
- Bind the public port only on the proxy and firewall direct access to the model.
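To exercise the per-key zone specifically, repeat the burst test with an x-api-key header; the key value below is arbitrary. Note that with this config requests are counted in both the per-IP and per-key zones, so a 429 means one of the two buckets is exhausted:
# Burst with an API key; whichever limit (per-IP or per-key) is hit first returns 429
for i in $(seq 1 30); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST http://<EC2_PUBLIC_IP>/generate \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: demo-key-123' \
    -d '{"inputs":"ping","parameters":{"max_new_tokens":1}}'
done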
Optional hardening:
- Put the proxy behind an AWS ALB or Cloudflare; both add DDoS/bot filtering with minimal setup.
- Add simple input caps: enforce max_new_tokens at the model server and keep client_max_body_size small.
Client: throttle requests and handle 429
Add a small token-bucket throttle to the chat UI and surface server 429 responses. Insert the following into the earlier chat.html script (replace the existing handleSend and askTGI with these versions, and add the limiter helper):
<script>
// Simple token bucket: 10 requests/min, refills continuously
const rl = {
capacity: 10,
tokens: 10,
refillMs: 6000, // 1 token every 6s
last: Date.now()
};
function tryConsume() {
const now = Date.now();
const elapsed = now - rl.last;
const refill = Math.floor(elapsed / rl.refillMs);
if (refill > 0) {
rl.tokens = Math.min(rl.capacity, rl.tokens + refill);
rl.last = rl.last + refill * rl.refillMs;
}
if (rl.tokens <= 0) return false;
rl.tokens -= 1;
return true;
}
async function handleSend() {
const q = input.value.trim();
if (!q) return;
if (!tryConsume()) {
append('Info', 'You are sending messages too quickly. Please wait a few seconds.');
return;
}
append('You', q);
input.value = '';
send.disabled = true;
try {
const a = await askTGI(q);
append('EnergyAnalyst', a);
} catch (e) {
if (e?.message?.includes('429')) {
append('Error', 'Rate limited by server. Please wait and try again.');
} else {
append('Error', 'Failed to reach model API');
}
} finally {
send.disabled = false;
}
}
// Intercept 429 on fetch
async function askTGI(prompt) {
const res = await fetch(`${API_BASE}/generate`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
inputs: `You are a regulatory compliance expert.\n\n### Instruction:\n${prompt}\n\n### Response:`,
parameters: { max_new_tokens: 256, temperature: 0.7 }
})
});
if (res.status === 429) throw new Error('429');
if (!res.ok) throw new Error('Request failed');
const data = await res.json();
return (data.generated_text ?? data[0]?.generated_text ?? JSON.stringify(data));
}
</script>
Optional bot friction (client-only, zero backend changes):
- Disable the Send button for 1–2 seconds after each request.
- Add a simple honeypot input (hidden field) and ignore submissions that fill it.
- If you can accept a managed service, add Cloudflare Turnstile to the page and only enable Send after a valid token is present. For robust verification, your proxy must validate the token server-side (requires a small verification endpoint or using Cloudflare in front of your origin).
This combination provides practical protection with minimal moving parts: a tiny Nginx container to enforce server-side fairness and a lightweight client throttle to improve UX and deter basic automation.