Quickstart: Self-Host and Run DeepSeek R1
By Alex Owen
DeepSeek R1 Has Changed the Game:
DeepSeek R1 has emerged as an incredibly important model: it is the first open-source model to scale inference-time compute. R1 and its distilled siblings introduce two new capabilities to the AI community. They expose their chain-of-thought tokens, which foundation model APIs otherwise hide, and they are the first open-source models you can fine-tune or distill for reasoning use cases.
The model's architectural improvements deliver higher throughput, which translates to a lower cost per token, and its inference prices undercut many foundation model APIs.
This will kick off a wave of improvement on difficult tasks as teams scale inference-time compute.
Why Self-Hosting is More Important Than Ever
The demand for self-hosting is now higher than ever for teams that want to run full R1 inference themselves. Here's why:
- There are concerns over privacy and data security for the popular overseas hosting platforms.
- R1 models now generate reasoning traces and very long outputs, on top of being very large (671 billion parameters for the full R1 model), which stresses GPU memory requirements on even the largest nodes (see the back-of-envelope sketch after this list).
- For inference services, running requests from many different customers on the same node is hard to manage and causes high variance in performance. Self-hosting offers more inference stability for applications where performance and latency are important.
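To make that memory pressure concrete, here is a rough back-of-envelope sketch. The figures are approximations (the released R1 weights are FP8, roughly one byte per parameter, and each H200 has 141 GB of HBM), and a real deployment also needs headroom for activations and the KV cache:

import math

# Rough GPU memory math for the full DeepSeek-R1 model (approximate figures only).
params_b = 671                    # model size in billions of parameters
bytes_per_param = 1               # FP8 release: roughly 1 byte per parameter
weights_gb = params_b * bytes_per_param      # ~671 GB just for the weights

h200_hbm_gb = 141                 # HBM capacity per H200 GPU
node_gb = 8 * h200_hbm_gb         # 1128 GB on an 8xH200 node

print(f"Weights: ~{weights_gb} GB")
print(f"8xH200 node: {node_gb} GB")
print(f"Left over for KV cache and long reasoning outputs: ~{node_gb - weights_gb} GB")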
Shadeform provides instant access to affordable, secure, and enterprise ready compute by aggregating supply from 20+ clouds into a centralized marketplace, making it easier than ever for customers to self host at the best prices.
Templates Make Deployment Even Easier
Setting up the correct configurations to deploy these models can be quite tricky, so we have created "templates": a way to re-use deployment configs across the GPU clouds on our platform.
We have templates for the R1 models already made, so deploying DeepSeek R1 is just a click or two away.
How to Run DeepSeek-R1:
We've gone ahead and created a template that is ready for 1-click deployment on an 8xH200 node. With this template, we use vLLM to serve the model with the following configuration (a sketch of the underlying launch command follows this list):
- We're serving the full deepseek-ai/DeepSeek-R1 model.
- We're deploying on an 8xH200 node for the highest memory capacity, splitting the model across the 8 GPUs with --tensor-parallel-size 8.
- We're enabling --trust-remote-code so vLLM can run the custom code the model needs to set up its weights and architecture.
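For reference, the template's serving setup boils down to launching vLLM's OpenAI-compatible server with those flags. A minimal sketch, assuming vLLM is already installed on the node (the template wires this up for you, and its exact invocation may differ):

import subprocess

# Launch vLLM's OpenAI-compatible server with the configuration described above:
# the full R1 model, tensor parallelism across 8 GPUs, and trust-remote-code enabled.
subprocess.run(
    [
        "vllm", "serve", "deepseek-ai/DeepSeek-R1",
        "--tensor-parallel-size", "8",
        "--trust-remote-code",
    ],
    check=True,
)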
To deploy this template, simply click "Deploy Template", select the lowest priced 8xH200 node available, and click "Deploy".
Once we've deployed, we're ready to point our SDKs at our inference endpoint!
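Before wiring up an SDK, it's worth a quick sanity check that the server is reachable. A minimal sketch, assuming the default vLLM port of 8000 and the your-ip-address placeholder used in the examples below:

import requests

# vLLM's OpenAI-compatible server lists its loaded models at /v1/models.
# Replace your-ip-address with the public IP of your deployed instance.
resp = requests.get("http://your-ip-address:8000/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())  # should include "deepseek-ai/DeepSeek-R1"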
How to interact with R1 Models:
There are now two different types of tokens output for a single inference call: "thinking" tokens and normal output tokens. For your use case, you might want to split them up.
Splitting these tokens up lets you easily access and record the "thinking" tokens that, until now, have been hidden by foundation reasoning models. This is particularly useful for anyone looking to fine-tune R1 while preserving its reasoning capabilities.
The code snippets below show how to do this with the AI SDK, LangChain, and OpenAI's JavaScript and Python SDKs.
AI-SDK:
import { createOpenAI } from '@ai-sdk/openai';
import { generateText, wrapLanguageModel, extractReasoningMiddleware } from 'ai';

// Create OpenAI provider instance with custom settings
const openai = createOpenAI({
  baseURL: "http://your-ip-address:8000/v1",
  apiKey: "not-needed",
  compatibility: 'compatible'
});

// Create base model
const baseModel = openai.chat('deepseek-ai/DeepSeek-R1');

// Wrap model with reasoning middleware to split out <think> tokens
const model = wrapLanguageModel({
  model: baseModel,
  middleware: [extractReasoningMiddleware({ tagName: 'think' })]
});

async function main() {
  try {
    const { reasoning, text } = await generateText({
      model,
      prompt: "Explain quantum mechanics to a 7 year old"
    });

    console.log("\n\nTHINKING\n\n");
    console.log(reasoning?.trim() || '');
    console.log("\n\nRESPONSE\n\n");
    console.log(text.trim());
  } catch (error) {
    console.error("Error:", error);
  }
}

main();
OpenAI JS SDK:
import OpenAI from 'openai';
import { fileURLToPath } from 'url';

function extractFinalResponse(text) {
  // Split the output into the thinking section and the final response
  if (text.includes("</think>")) {
    const [thinkingText, responseText] = text.split("</think>");
    return {
      thinking: thinkingText.replace("<think>", ""),
      response: responseText
    };
  }
  return {
    thinking: null,
    response: text
  };
}

async function callLocalModel(prompt) {
  // Create client pointing to the self-hosted vLLM server
  const client = new OpenAI({
    baseURL: "http://your-ip-address:8000/v1", // Self-hosted vLLM server
    apiKey: "not-needed" // API key is not needed for a self-hosted server
  });

  try {
    // Call the model
    const response = await client.chat.completions.create({
      model: "deepseek-ai/DeepSeek-R1",
      messages: [
        { role: "user", content: prompt }
      ],
      temperature: 0.7, // Optional: adjust temperature
      max_tokens: 8000 // Optional: adjust response length
    });

    // Extract just the final response after thinking
    const fullResponse = response.choices[0].message.content;
    return extractFinalResponse(fullResponse);
  } catch (error) {
    console.error("Error calling local model:", error);
    throw error;
  }
}

// Example usage
async function main() {
  try {
    const { thinking, response } = await callLocalModel("how would you explain quantum computing to a six year old?");
    console.log("\n\nTHINKING\n\n");
    console.log(thinking);
    console.log("\n\nRESPONSE\n\n");
    console.log(response);
  } catch (error) {
    console.error("Error in main:", error);
  }
}

// Only run main() when executed directly (ES module equivalent of the CommonJS require.main check)
const isMainModule = process.argv[1] === fileURLToPath(import.meta.url);
if (isMainModule) {
  main();
}

export { callLocalModel, extractFinalResponse };
LangChain:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from typing import Optional, Tuple
from langchain.schema import BaseOutputParser

class R1OutputParser(BaseOutputParser[Tuple[Optional[str], str]]):
    """Parser for DeepSeek R1 model output that includes thinking and response sections."""

    def parse(self, text: str) -> Tuple[Optional[str], str]:
        """Parse the model output into thinking and response sections.

        Args:
            text: Raw text output from the model

        Returns:
            Tuple containing (thinking_text, response_text)
            - thinking_text will be None if no thinking section is found
        """
        if "</think>" in text:
            # Split on </think> tag
            parts = text.split("</think>")
            # Extract thinking text (remove <think> tag)
            thinking_text = parts[0].replace("<think>", "").strip()
            # Get response text
            response_text = parts[1].strip()
            return thinking_text, response_text
        # If no thinking tags found, return None for thinking and full text as response
        return None, text.strip()

    @property
    def _type(self) -> str:
        """Return type key for serialization."""
        return "r1_output_parser"

def main(prompt_text):
    # Initialize the model, pointed at the self-hosted vLLM endpoint
    model = ChatOpenAI(
        base_url="http://your-ip-address:8000/v1",
        api_key="not-needed",
        model_name="deepseek-ai/DeepSeek-R1",
        max_tokens=8000
    )

    # Create prompt template
    prompt = ChatPromptTemplate.from_messages([
        ("user", "{input}")
    ])

    # Create parser
    parser = R1OutputParser()

    # Create chain
    chain = (
        {"input": RunnablePassthrough()}
        | prompt
        | model
        | parser
    )

    # Example usage
    thinking, response = chain.invoke(prompt_text)
    print("\nTHINKING:\n")
    print(thinking)
    print("\nRESPONSE:\n")
    print(response)

if __name__ == "__main__":
    main("How do you write a symphony?")
OpenAI Python SDK:
from openai import OpenAI
from typing import Optional, Tuple

def extract_final_response(text: str) -> Tuple[Optional[str], str]:
    """Split the model output into (thinking, response) around the </think> tag."""
    if "</think>" in text:
        all_text = text.split("</think>")
        thinking_text = all_text[0].replace("<think>", "")
        response_text = all_text[1]
        return thinking_text, response_text
    return None, text

def call_deepseek(prompt: str) -> Tuple[Optional[str], str]:
    # Create client pointing to the self-hosted vLLM server
    client = OpenAI(
        base_url="http://your-ip-address:8000/v1",  # Self-hosted vLLM server
        api_key="not-needed"  # API key is not needed for a self-hosted server
    )

    # Call the model
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,  # Optional: adjust temperature
        max_tokens=8000   # Optional: adjust response length
    )

    # Extract just the final response after thinking
    full_response = response.choices[0].message.content
    return extract_final_response(full_response)

# Example usage
thinking, response = call_deepseek("what is the meaning of life?")
print("\n\nTHINKING\n\n")
print(thinking)
print("\n\nRESPONSE\n\n")
print(response)
Other DeepSeek Models:
With the R1 release, DeepSeek published the full R1 model plus a series of distilled models. These are based on popular open-source models of varying sizes (Qwen and Llama) and trained by distillation from the full R1 model itself.
We have already created templates for the Llama-3.1-8B and Qwen-2.5-32B distills so you can deploy them on Shadeform with a few clicks. They offer great tradeoffs between accuracy, performance, and memory requirements. Below are the different R1 models and the suggested GPU configurations, which account for the increased memory usage of thinking tokens.
Model | Recommended GPU Config | --tensor-parallel-size | Notes |
---|---|---|---|
DeepSeek-R1-Distill-Qwen-1.5B | 1x L40S, A6000, or A4000 | 1 | This model is very small; depending on your latency/throughput and output-length needs, you should be able to get good performance on less powerful cards. |
DeepSeek-R1-Distill-Qwen-7B | 1x L40S | 1 | Similar in performance to the 8B version, with more memory left over for outputs. |
DeepSeek-R1-Distill-Llama-8B | 1x L40S | 1 | Great performance for this size of model. Deployable via this template. |
DeepSeek-R1-Distill-Qwen-14B | 1x A100/H100 (80GB) | 1 | A great in-between for the 8B and the 32B models. |
DeepSeek-R1-Distill-Qwen-32B | 2x A100/H100 (80GB) | 2 | This is a great model to use if you don't want to host the full R1 model. Deployable via this template. |
DeepSeek-R1-Distill-Llama-70B | 4x A100/H100 | 4 | Based on the Llama-70B model and architecture. |
deepseek-ai/DeepSeek-V3 | 8x A100/H100, or 8x H200 | 8 | Base model for DeepSeek-R1; doesn't use Chain of Thought, so memory requirements are lower. |
DeepSeek-R1 | 8x H200 | 8 | The full R1 model. |
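Like the full model, the distilled models can be served with vLLM using the tensor-parallel sizes from the table above. A minimal sketch for the 32B distill on a 2-GPU node (assumes vLLM is installed; the Shadeform templates handle this for you):

import subprocess

# Serve DeepSeek-R1-Distill-Qwen-32B across 2 GPUs, per the table above.
# Swap the model name and --tensor-parallel-size value for the other distills.
subprocess.run(
    [
        "vllm", "serve", "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "--tensor-parallel-size", "2",
    ],
    check=True,
)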
Recap
DeepSeek-R1 represents a shift in how we deploy language models:
- We now need more memory in our deployments to handle much longer outputs
- We can scale reasoning tokens and inference time to increase accuracy on the most complicated tasks
- We can keep our reasoning traces for analysis and future fine-tuning
- We can finetune these base models for our use cases
- We have easier Reinforcement Learning via GRPO to accelerate time to market
Shadeform's enterprise compute platform makes it dead simple to access and deploy these models across multiple clouds, ensuring you get the best experience and price across all providers. We're excited about what this all means for use cases throughout AI. Stay tuned as we dive deeper into DeepSeek and fine-tuning!