Create a REST API for the Microsoft BitNet b1.58 model and integrate it with Open WebUI

I'm currently writing this as a bit of a vent. I typically use Ollama models, but I discovered someone's post on X (formerly Twitter) about a Microsoft model that supposedly runs well on CPU alone, with even better performance on systems like the M2 chip.

That model is microsoft/BitNet b1.58 2B4T. After seeing the news in April 2025, I waited to see if anyone would implement it in Ollama. I saw people asking about it too, but there was still no update.

I waited until June 2025 and still nothing. Oh well, let me find a way to run it myself from the code then. I set a simple goal: put the model in a container and find something that can expose an endpoint for Open WebUI (a ChatGPT-like web chat interface). This blog documents my experience with that experiment.

If you want to read the Thai version (read it here)

Getting Ready to Run microsoft/BitNet

- Linux: (I actually tried this in Docker)

I take a Python image and install the libraries needed for development:

# Use official Python 3.12 image
FROM python:3.12-slim

# Install system dependencies for PyTorch and build tools
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        curl \
        ca-certificates \
        libopenblas-dev \
        libomp-dev \
        libssl-dev \
        libffi-dev \
        wget \
        && rm -rf /var/lib/apt/lists/*

# (Optional) Set a working directory
WORKDIR /app

# Copy your requirements.txt if you have one
COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt

Create a requirements.txt file:

fastapi==0.110.2
uvicorn[standard]==0.29.0
transformers==4.52.4
torch==2.7.0
numpy==1.26.4
accelerate==0.29.0

Run it normally (standard execution)

# Build the image (note: Docker image names must be lowercase)
docker build -t python-bitnet .

# Run the container with port forwarding and your code mounted
docker run -it -p 8888:8888 -v "$PWD":/app python-bitnet /bin/bash

Use a DevContainer (this is the method I tried) - a minimal config sketch follows below.
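
Since the post doesn't include the DevContainer config itself, here is a minimal .devcontainer/devcontainer.json sketch, assuming the Dockerfile above sits in the project root; the name, forwarded port, and extension choices are my own:

{
  "name": "bitnet-dev",
  "build": { "dockerfile": "../Dockerfile", "context": ".." },
  "forwardPorts": [8888],
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}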

- Windows: This one has quite a few steps, for those who like challenges

I say it's challenging because I tried it and got stuck for 2 weeks lol. On Linux, it's done in a flash. For anyone who wants to try, you need to have the following

  • For Visual Studio, you need to install the additional C++ components (the C++ build tools, i.e. the "Desktop development with C++" workload).
  • Running a regular PowerShell won't work - you have to run the build in Developer Command Prompt for VS 2022 or Developer PowerShell for VS 2022.
    A regular terminal doesn't set all the environment variables like Path properly, so you'll encounter errors like
Error C1083: Cannot open include file: 'algorithm': No such file or directory

Even if you run vcvarsall.bat x64 yourself, it's hit or miss - sometimes it works, sometimes it doesn't.

Note: vcvarsall.bat is located in "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvarsall.bat"

  • Add the Python lib directory to your path > if you don't include it, the linker will crash with
fatal error LNK1104: cannot open file 'python312.lib'
Add the Python libs directory to the path so the C++ libraries used for AI inference can link against Python - see the example below.
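
For example (my own illustration - the Python install path will differ on your machine), in the Developer Command Prompt you can point the linker at the Python libs folder before building:

REM Example only - adjust the path to your own Python 3.12 installation
set LIB=%LIB%;C:\Users\<you>\AppData\Local\Programs\Python\Python312\libs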

And after the environment is ready

  • Set up a virtual environment
# Set ENV
python3 -m venv bitnet-env
# or
python -m venv bitnet-env
  • Activate the virtual environment
# Linux
source bitnet-env/bin/activate

# Windows - Powershell
.\bitnet-env\Scripts\Activate.ps1
# Windows - CMD
.\bitnet-env\Scripts\activate.bat
  • Install required libraries according to requirements.txt - if you're using the Linux/Docker approach, these dependencies are already included in the Container image.
fastapi==0.110.2
uvicorn[standard]==0.29.0
transformers==4.52.4
torch==2.7.0
numpy==1.26.4
accelerate==0.29.0
pip install --upgrade pip && pip install -r requirements.txt

Then install transformers from the specific commit below - note that this overrides the transformers version pinned in requirements.txt:

pip install git+https://github.com/huggingface/transformers.git@096f25ae1f501a084d8ff2dcaf25fbc2bd60eba4
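
As a quick sanity check (my own snippet, not part of the original setup), you can confirm which versions ended up in the environment and whether CUDA is visible - CPU-only is fine:

import torch
import transformers

print("torch:", torch.__version__)                    # expected 2.7.0 from requirements.txt
print("transformers:", transformers.__version__)      # should reflect the pinned git build
print("CUDA available:", torch.cuda.is_available())   # False is fine - the model also runs on CPU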

Writing Code to Use the Model from Hugging Face

After resolving all the environment issues, let's get to coding. I mentioned wanting to connect it with Open WebUI, so I made two versions: a command-line version and an API version.

- Command Line

I tried writing the code using:

  • Transformers: for loading pre-trained models from Hugging Face
  • PyTorch: for the model inference behind transformers (my machine's specs are enough for inference, but not for training/fine-tuning). PyTorch shows up in several places, such as:
    - bfloat16: half the memory of FP32 with reduced precision (same 16-bit size as FP16, but it trades mantissa bits for FP32's exponent range)
    - return_tensors="pt": specifies the PyTorch tensor format
    - to(model.device): enables GPU acceleration if CUDA is available
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/bitnet-b1.58-2B-4T"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    force_download=True,
)

# Apply the chat template + Role
messages = [
    {"role": "system", "content": "You are a Senior Programmer."},
    {"role": "user", "content": "Can you help me with a coding problem?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
chat_input = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
chat_outputs = model.generate(**chat_input, max_new_tokens=50)
response = tokenizer.decode(chat_outputs[0][chat_input['input_ids'].shape[-1]:], skip_special_tokens=True)
print("\nAssistant Response:", response)

The command-line version - you'll see that Windows has many limitations.

Another version adds a loop and keeps asking continuously until you type "Thank you BITNET" - you can see the source code here. A rough sketch of that loop follows below.
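
The sketch below reuses the tokenizer and model loaded in the snippet above; the exit phrase comes from the post, while the rest is my own simplification rather than the repo's exact code:

# Interactive loop: keep chatting until the user types the exit phrase
messages = [{"role": "system", "content": "You are a Senior Programmer."}]

while True:
    user_input = input("You: ")
    if user_input.strip() == "Thank you BITNET":
        break
    messages.append({"role": "user", "content": user_input})

    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    chat_input = tokenizer(prompt, return_tensors="pt").to(model.device)
    chat_outputs = model.generate(**chat_input, max_new_tokens=256)
    reply = tokenizer.decode(
        chat_outputs[0][chat_input["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )
    print("Assistant:", reply)
    messages.append({"role": "assistant", "content": reply})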

- API

Note: I didn't research which libraries exist that could make our API connect to Open WebUI directly. Initially, I checked which connection standards Open WebUI supports - for the text prompt part, it has OpenAI and Ollama sections.

I chose to go with the OpenAI API, because when I played with the dotnet Semantic Kernel before it used the /v1/chat/completions pattern, so I started from there and added it in Open WebUI to see which paths it hits in our code.

From what I tested, there are at minimum 3 API endpoints that Open WebUI calls on our side:

  • /v1/chat/completions
  • /v1/models
  • /health

For /v1/chat/completions, I just kept adding things based on what it complained about, plus asking AI, until I completed all 3 endpoints like this:

import time
import uuid
from datetime import datetime
from typing import Dict, List, Optional

import torch
from fastapi import FastAPI
from fastapi.responses import JSONResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load model and tokenizer at startup
model_id = "microsoft/bitnet-b1.58-2B-4T"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    force_download=True,
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    max_new_tokens: Optional[int] = 700

class Choice(BaseModel):
    index: int
    message: Dict[str, str]
    finish_reason: str

class ChatResponse(BaseModel):
    id: str
    object: str
    created: int
    model: str
    choices: List[Choice]

@app.post("/v1/chat/completions", response_model=ChatResponse)
async def chat_completions(request: ChatRequest):
    # Prepare prompt using chat template
    prompt = tokenizer.apply_chat_template(
        [msg.dict() for msg in request.messages],
        tokenize=False,
        add_generation_prompt=True
    )
    chat_input = tokenizer(prompt, return_tensors="pt").to(model.device)
    chat_outputs = model.generate(**chat_input, max_new_tokens=request.max_new_tokens)
    response = tokenizer.decode(
        chat_outputs[0][chat_input['input_ids'].shape[-1]:], 
        skip_special_tokens=True
    )
    # Return response in OpenAI-compatible format
    return ChatResponse(
        id=f"chatcmpl-{uuid.uuid4().hex[:12]}",
        object="chat.completion",
        created=int(time.time()),
        model=model_id,
        choices=[
            Choice(
                index=0,
                message={"role": "assistant", "content": response},
                finish_reason="stop"
            )
        ]
    )

@app.get("/")
def root():
    """Root endpoint with API info"""
    return JSONResponse({
        "message": "OpenAI-Compatible API for Open WebUI",
        "version": "1.0.0",
        "endpoints": {
            "models": "/v1/models",
            "chat": "/v1/chat/completions",
            "health": "/health"
        }
    })

@app.get("/health")
def health_check():
    """Health check endpoint"""
    return JSONResponse({"status": "healthy", "timestamp": datetime.now().isoformat()})

@app.get("/v1/models")
def list_models():
    """List available models"""
    return JSONResponse({
        "data": [
            {
                "id": model_id,
                "object": "model",
                "created": datetime.now().isoformat(),
                "owned_by": "microsoft",
                "permission": []
            }
        ]
    })
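
The post doesn't show the command that serves this app, so here is how I would run and poke it; the file name main.py and port 8000 are my own assumptions:

# Start the API (file name and port are assumptions)
uvicorn main:app --host 0.0.0.0 --port 8000

# Quick checks from another terminal
curl http://localhost:8000/health
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

Open WebUI then talks to these same endpoints through its OpenAI-compatible connection, pointed at this server's base URL.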

When actually using it, I packaged it into Docker. During the build, I was shocked by the size - almost 10 GB.

I tried using it for real and connected it to Open WebUI - sometimes it gives okay answers, sometimes it hallucinates lol.

But what's for sure is that the CPU usage shoots up lol.

That concludes my rough trial of running the model. If I find something better, I'll write another blog post - feel free to reach out with suggestions. Oh, and don't force it on Windows; mine ended up getting squeezed into WSL2 anyway. Taking an old notebook and installing Linux to make a local AI inference engine is still faster.

For all the code, I've uploaded it to Git: https://github.com/pingkunga/python_microsoft_bitnet-b1.58_sample
