Rerank documents for relevance scoring with TokModel

The /v1/rerank endpoint takes a query and a list of documents, then returns those documents sorted by relevance score. Reranking is typically used as a second-stage filter after an initial vector search retrieves a broad set of candidate documents. The reranker produces more precise relevance signals than embedding similarity alone, which improves the quality of context passed to a language model.

Authentication

Include your API key in every request:

Authorization: Bearer YOUR_API_KEY
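
To keep the key out of source code, you could read it from the environment and build the headers once. The following is a minimal sketch: the variable name `TOKMODEL_API_KEY` and the helper `auth_headers` are illustrative assumptions, not names defined by TokModel.

```python
import os


def auth_headers(env_var: str = "TOKMODEL_API_KEY") -> dict:
    """Build request headers from an API key stored in the environment.

    The environment variable name is a hypothetical choice for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```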

Send a rerank request

Provide a query string, a documents array, and a model. TokModel returns the documents re-ordered from most to least relevant, each annotated with a relevance_score.

curl https://tokmodel.com/v1/rerank \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cohere/rerank-english-v3.0",
    "query": "How do I reset my password?",
    "documents": [
      "You can reset your password from the account settings page.",
      "Our API supports OAuth 2.0 and API key authentication.",
      "To change your password, go to Settings > Security and click Reset Password.",
      "Billing questions can be directed to support@example.com."
    ]
  }'
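
The same request can be sketched in Python using only the standard library. The helper names `build_rerank_request` and `send` are assumptions made for this example; the URL, headers, and body fields mirror the curl call above. Preparing the request separately from sending it makes the payload easy to inspect before any network call happens.

```python
import json
import urllib.request

RERANK_URL = "https://tokmodel.com/v1/rerank"  # endpoint from the curl example


def build_rerank_request(api_key: str, model: str, query: str,
                         documents: list) -> urllib.request.Request:
    """Prepare (but do not send) a POST request equivalent to the curl call."""
    payload = {"model": model, "query": query, "documents": documents}
    return urllib.request.Request(
        RERANK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body; HTTP errors raise HTTPError."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_rerank_request(...))` should return a dictionary shaped like the example response, assuming the endpoint behaves as documented here.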

Example response

The results array is sorted by relevance_score in descending order. The index field is the zero-based position of the document in the original input array, so you can use it to look up the matching input record.

{
  "id": "rerank-xyz789",
  "model": "cohere/rerank-english-v3.0",
  "results": [
    {
      "index": 2,
      "relevance_score": 0.9873,
      "document": {
        "text": "To change your password, go to Settings > Security and click Reset Password."
      }
    },
    {
      "index": 0,
      "relevance_score": 0.9541,
      "document": {
        "text": "You can reset your password from the account settings page."
      }
    },
    {
      "index": 1,
      "relevance_score": 0.1203,
      "document": {
        "text": "Our API supports OAuth 2.0 and API key authentication."
      }
    },
    {
      "index": 3,
      "relevance_score": 0.0412,
      "document": {
        "text": "Billing questions can be directed to support@example.com."
      }
    }
  ],
  "usage": {
    "total_tokens": 98
  }
}
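
Because each result carries the `index` of its input document, you can join reranked results back to your own records, for example rows that hold an ID or URL alongside the text. A minimal sketch, assuming `originals` is the list you sent as `documents`, in the same order; `attach_originals` is a hypothetical helper name.

```python
def attach_originals(results: list, originals: list) -> list:
    """Pair each reranked result with the input record it refers to.

    `results` is the "results" array from the response; `originals` is the
    list that was passed as "documents", in the same order.
    """
    paired = []
    for result in results:
        index = result["index"]
        if not 0 <= index < len(originals):
            raise IndexError(f"result index {index} is outside the input list")
        paired.append({
            "score": result["relevance_score"],
            "original": originals[index],
        })
    return paired
```

This lookup is also how you would recover the text when the response does not include it.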

Use the relevance_score to decide how many documents to forward to the language model. A common pattern is to keep only results above a threshold (e.g. 0.5) or to take the top-k regardless of score.
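
Both selection strategies can be combined in one small helper. This is a sketch under the assumption that `results` is already sorted by `relevance_score` in descending order, as in the example response; `select_context` is a hypothetical name, and the 0.5 threshold is only the illustrative value from the text, not a recommended setting.

```python
from typing import Optional


def select_context(results: list, threshold: Optional[float] = None,
                   top_k: Optional[int] = None) -> list:
    """Filter reranked results by score, then cap how many are kept.

    Assumes `results` is sorted by relevance_score, highest first.
    """
    kept = results
    if threshold is not None:
        # Drop anything the reranker scored below the cutoff
        kept = [r for r in kept if r["relevance_score"] >= threshold]
    if top_k is not None:
        # Keep at most top_k of the remaining results
        kept = kept[:top_k]
    return kept
```

With the scores from the example response, a threshold of 0.5 keeps the documents at indexes 2 and 0 and drops the other two.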

Key parameters

Parameter	Type	Description
`model`	string	The reranking model to use, e.g. `cohere/rerank-english-v3.0`.
`query`	string	The search query to rank documents against.
`documents`	array	List of strings (or objects with a `text` key) to rank.
`top_n`	integer	Return only the top N results. Defaults to all documents.
`return_documents`	boolean	Include document text in the response. Default `true`.
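
The parameters in the table can be assembled into a request body as sketched below. The helper `build_payload` and its client-side checks are assumptions made for this example; the server's own validation rules are not described here. Optional fields are left out when unset, so the documented defaults apply.

```python
from typing import Optional


def build_payload(model: str, query: str, documents: list,
                  top_n: Optional[int] = None,
                  return_documents: Optional[bool] = None) -> dict:
    """Assemble a rerank request body from the documented parameters."""
    for position, doc in enumerate(documents):
        is_text = isinstance(doc, str)
        is_object = isinstance(doc, dict) and isinstance(doc.get("text"), str)
        if not (is_text or is_object):
            raise ValueError(
                f"documents[{position}] must be a string or an object with a 'text' key"
            )
    payload = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        if top_n < 1:
            raise ValueError("top_n must be a positive integer")
        payload["top_n"] = top_n
    if return_documents is not None:
        payload["return_documents"] = return_documents
    return payload
```

For instance, `build_payload(model, query, docs, top_n=3, return_documents=False)` asks for the three best matches without echoing their text back; you would then use each result's `index` to find the text in `docs`.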

Use reranking in a RAG pipeline

A typical RAG pipeline retrieves more documents than it can fit in the context window, then reranks them to keep only the most relevant ones.

python

import requests
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://tokmodel.com/v1",
)

# Step 1: Retrieve candidate documents from your vector store
# (Replace with your actual retrieval logic)
candidate_docs = [
    "Password resets are handled via the Security settings panel.",
    "Our SLA guarantees 99.9% uptime for all paid plans.",
    "To reset your password, visit Settings > Security > Reset Password.",
    "You can export your data from the Account > Data Export page.",
    "Contact support if you have not received your password reset email.",
]
user_query = "How do I reset my password?"

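# Step 2: Rerank the candidates
rerank_response = requests.post(
    "https://tokmodel.com/v1/rerank",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "model": "cohere/rerank-english-v3.0",
        "query": user_query,
        "documents": candidate_docs,
        "top_n": 3,
    },
    timeout=30,  # avoid hanging indefinitely on a stalled connection
)
# Fail fast on HTTP errors instead of hitting a KeyError on "results" below
rerank_response.raise_for_status()

# Each result includes the document text because return_documents defaults to true
top_docs = [r["document"]["text"] for r in rerank_response.json()["results"]]
context = "\n\n".join(top_docs)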

# Step 3: Generate an answer using only the top-ranked context
chat_response = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": f"Answer the user's question using only the context below.\n\nContext:\n{context}",
        },
        {"role": "user", "content": user_query},
    ],
)

print(chat_response.choices[0].message.content)

Retrieve 20–50 candidates from your vector store, pass them to the reranker, and use only the top 3–5 results as context. Because the reranker produces more precise relevance signals than embedding similarity alone, this pattern usually gives the model better context than passing raw vector search results directly.
