
How to Optimize Token Usage with Haystack (Step by Step)

📖 5 min read · 921 words · Updated Apr 4, 2026


In this guide, we’re building a question-answering application with Haystack while keeping token usage in check. This matters because careful token management directly reduces API costs and improves the speed of your retrieval and reading pipelines.

Prerequisites

  • Python 3.8+
  • pip install farm-haystack[all] (this guide uses the Haystack 1.x API; the 2.x package, haystack-ai, has a different import layout)
  • pip install transformers >= 4.24.0
  • Basic understanding of Python and REST APIs

Step 1: Setting Up Haystack

First off, you’ll need to install Haystack along with some additional libraries for your application. You can run the following command:

pip install "farm-haystack[all]" "transformers>=4.24.0"

Now, here’s the deal: Haystack has become incredibly popular, with roughly 24,700 stars and 2,700 forks on GitHub. That’s a lot of developers trusting this framework, and you should too, especially when you consider its steady stream of updates. If you hit an error regarding version compatibility, check your Python version and make sure you have compatible versions of farm-haystack and transformers installed.
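If you do hit a version mismatch, a quick sanity check can save you a debugging session. The helper below is hypothetical (not part of Haystack) and only handles plain numeric versions like "4.24.0", not pre-release suffixes:

```python
def version_at_least(installed: str, minimum: str) -> bool:
    """Compare dotted numeric version strings, e.g. is '4.26.1' >= '4.24.0'?"""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(installed) >= parse(minimum)

# Example: check the transformers version this guide assumes.
print(version_at_least("4.26.1", "4.24.0"))  # newer than the minimum
print(version_at_least("4.9.2", "4.24.0"))   # too old
```

In a real project you would read the installed version from `transformers.__version__` (or use `importlib.metadata.version`) instead of hard-coding it.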

Step 2: Import Necessary Libraries

Now that you’ve got Haystack set up, let’s import the necessary libraries into our script. This is simpler than picking the right movie to watch for date night:

from haystack.document_stores import InMemoryDocumentStore
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import DensePassageRetriever, FARMReader

These imports are crucial. The Document Store manages your documents, while the Pipeline wires the components together and executes your queries. If you encounter import errors, check the spelling of the module paths and confirm you installed farm-haystack (the Haystack 1.x package these imports come from), or consult the documentation.

Step 3: Set Up a Document Store

Next, we’ll set up a Document Store to hold our data. Here’s a simple example:

document_store = InMemoryDocumentStore()
documents = [{"content": "Haystack is an open-source framework.", "meta": {"source": "haystack-docs"}}]
document_store.write_documents(documents)

This sets up your document store in memory. It’s fast, but if you’re dealing with large datasets, consider a persistent option such as ElasticsearchDocumentStore or FAISSDocumentStore. Running into memory errors? Your dataset may be too large for the InMemoryDocumentStore, so switch to one of those persistent backends.
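Before committing to an in-memory store, it can help to gauge how big your corpus actually is. This is a rough stdlib-only sketch that only counts the string payload of each document, not the full interpreter overhead, so treat it as a lower bound:

```python
import sys

def rough_corpus_bytes(documents: list) -> int:
    """Rough in-memory footprint of a list of {'content': ...} docs (string payload only)."""
    return sum(sys.getsizeof(doc.get("content", "")) for doc in documents)

docs = [{"content": "Haystack is an open-source framework.", "meta": {"source": "haystack-docs"}}]
print(rough_corpus_bytes(docs), "bytes")
```

If the total runs into the gigabytes, that’s your cue to move to a persistent backend before you hit memory errors.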

Step 4: Initialize the Retriever and Reader

Now you need to initialize a retriever and a reader. This is where the magic happens:

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
document_store.update_embeddings(retriever)  # compute DPR embeddings for the stored documents
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

Make sure your models are downloaded; the first run fetches them from the Hugging Face Hub and caches them locally. You can monitor token usage by keeping track of the model’s context size and the number of documents retrieved. If model loading fails, verify the model names and check that your environment has network access for the initial download.
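To keep an eye on token usage without loading a tokenizer, a back-of-the-envelope estimate is often enough. The 1.3 tokens-per-word factor below is a common rule of thumb for English text with subword tokenizers, not an exact count; for exact numbers, use the reader model’s own tokenizer:

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token count: whitespace-split words scaled by a subword-expansion factor."""
    return round(len(text.split()) * tokens_per_word)

print(estimate_tokens("Haystack is an open-source framework."))
```

Run this over your documents before indexing to spot outliers that would blow past the reader’s input limit.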

Step 5: Build the Pipeline

We’ll now construct a pipeline that links these components together:

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

This is essential because it defines how the components interact. You may face issues if one of your nodes isn’t correctly configured. Ensure that you’ve completed the previous steps without skipping any, or you’ll be troubleshooting for hours, like the time I accidentally initialized a thread pool without a shutdown function and learned that lesson the hard way.

Step 6: Query the Pipeline

Now let’s query our pipeline. This part is where you’ll see the real token usage:

query = "What is Haystack?"
result = pipeline.run(query=query, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}})

The parameters control how many documents to retrieve and how many answers to return. Lowering the Retriever’s top_k is your biggest lever for token usage: every extra retrieved document adds its full text to the reader’s input. If you get unexpected results, tweak these parameters to balance answer quality against token cost.
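The relationship between top_k and token usage is easy to reason about with a small model. The numbers below (8 tokens per query, 180 tokens per document) are illustrative assumptions, not measurements:

```python
def reader_input_tokens(query_tokens: int, top_k: int, avg_doc_tokens: int) -> int:
    """Tokens fed to the reader grow linearly with the retriever's top_k."""
    return query_tokens + top_k * avg_doc_tokens

# Illustrative budget: an 8-token query over documents averaging 180 tokens each.
for k in (1, 3, 5, 10):
    print(f"top_k={k}: ~{reader_input_tokens(8, k, avg_doc_tokens=180)} tokens")
```

Dropping top_k from 5 to 3 here cuts the reader’s input by roughly 360 tokens per query, which compounds quickly at scale.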

The Gotchas

Here are a few pitfalls to watch for:

  • Memory issues: Overloading the Document Store can lead to memory errors. Test with smaller datasets first.
  • Model size: Large models may not fit in your environment. Always check compatibility with your hardware.
  • Parameter tuning: Incorrect parameters can lead to excessive token usage or insufficient answers. Always validate your settings based on the model’s performance.
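One practical guard against both the memory and token pitfalls above is to split long documents into bounded chunks before indexing. This is a minimal word-based sketch; the 100-word cap is an arbitrary example value, and Haystack’s own PreProcessor offers a more complete version of this idea:

```python
def chunk_by_words(text: str, max_words: int = 100) -> list:
    """Split text into fixed-size word windows so no single document is oversized."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

long_text = " ".join(["word"] * 250)
chunks = chunk_by_words(long_text, max_words=100)
print(len(chunks), "chunks")  # 250 words -> chunks of 100, 100, 50
```

Smaller chunks mean each retrieved hit carries fewer tokens into the reader, at the cost of possibly splitting an answer across chunk boundaries.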

Full Code

Here’s a complete working example for reference:

from haystack.document_stores import InMemoryDocumentStore
from haystack.pipelines import ExtractiveQAPipeline
from haystack.nodes import DensePassageRetriever, FARMReader

# Set up the Document Store
document_store = InMemoryDocumentStore()
documents = [{"content": "Haystack is an open-source framework.", "meta": {"source": "haystack-docs"}}]
document_store.write_documents(documents)

# Initialize the retriever and reader
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
document_store.update_embeddings(retriever)  # required so DPR can search the stored documents
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# Build the pipeline
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# Query the pipeline
query = "What is Haystack?"
result = pipeline.run(query=query, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}})

print(result)

What’s Next

After optimizing token usage with Haystack, consider exploring how to integrate this setup with an actual frontend application. This could include serving it through a Flask or FastAPI endpoint for web-based applications. Just don’t forget to handle the request/response lifecycle properly.
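A framework-agnostic way to think about that lifecycle is a plain handler function you can later mount on a Flask or FastAPI route. Everything here is a sketch; the field names and status codes are assumptions, and the pipeline call is left as a comment since it depends on the setup above:

```python
def answer_endpoint(payload: dict) -> dict:
    """Validate the request body before running the (expensive) pipeline."""
    query = str(payload.get("query", "")).strip()
    if not query:
        # Reject early: never spend tokens on an empty or malformed request.
        return {"status": 400, "error": "'query' is required"}
    # In a real app you would now call the pipeline built in this guide, e.g.:
    # result = pipeline.run(query=query, params={"Retriever": {"top_k": 5}})
    return {"status": 200, "query": query, "answers": []}

print(answer_endpoint({"query": "What is Haystack?"})["status"])
```

Validating input before invoking the pipeline is itself a token optimization: bad requests cost nothing.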

FAQ

1. How do I monitor my token usage?

You can monitor your token usage by logging each API request and recording the number of tokens processed per call. Python’s built-in logging module, or an external monitoring solution, can help scaffold this out.
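As a starting point, a minimal sketch using the standard library might look like this. The whitespace-based count is an approximation (see the estimation caveats earlier in this guide), and the running total lives in a module-level variable purely for brevity:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("token-usage")

_total_tokens = 0  # running total across all calls (demo only; not thread-safe)

def log_token_usage(text: str) -> int:
    """Log a rough per-call token count and accumulate a running total."""
    global _total_tokens
    used = len(text.split())
    _total_tokens += used
    logger.info("call used ~%d tokens (running total: %d)", used, _total_tokens)
    return used

log_token_usage("What is Haystack?")
```

Wrap your `pipeline.run` calls with something like this and the totals will surface cost regressions long before the invoice does.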

2. What if my model doesn’t return expected results?

If the results aren’t as expected, review the input prompts and ensure they’re clear. Sometimes, restructuring your query or adjusting the top_k parameters can make a huge difference.

3. Are there limits on the free usage tier?

Haystack itself is open source and imposes no usage tier, but any hosted model API or platform you use alongside it likely does. Check the provider’s documentation for token limits; they typically vary between free and paid tiers.


✍️ Written by Jake Chen, AI technology writer and researcher.