OpenAI GPT-4 Turbo's 128k token context has a 4k completion limit

By Christian Prokopp on 2023-11-23

Recently, OpenAI released the GPT-4 Turbo preview with a 128k-token context window at its DevDay. That addresses a serious limitation for Retrieval Augmented Generation (RAG) applications, which I described in detail for Llamar.ai. A 128k context, i.e. 2¹⁷ = 131,072 tokens, amounts to nearly 200 pages of text, assuming approximately 0.75 words per token and 500 words per page.
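The back-of-envelope arithmetic can be checked in a couple of lines, using the same assumptions as above (0.75 words per token, 500 words per page):

```python
# How many pages fit in a 128k-token context window?
context_tokens = 2**17           # "128k" = 131,072 tokens
words = context_tokens * 0.75    # assumed ~0.75 words per token
pages = words / 500              # assumed ~500 words per page
print(f"{context_tokens} tokens ≈ {words:.0f} words ≈ {pages:.0f} pages")
# → 131072 tokens ≈ 98304 words ≈ 197 pages
```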


While writing some code against the gpt-4-1106-preview model via the API, I noticed that long responses never exceed 4,096 completion tokens. Responses cut off mid-sentence or mid-word even when the total, i.e. input plus completion, is well under 128k tokens. A quick search in the OpenAI forum revealed that others observe the same behaviour: the model does not produce more than 4,096 completion tokens.
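You can detect the cut-off programmatically: when a response hits the output ceiling, the API reports a finish reason of "length" rather than "stop". The helper below is a hypothetical convenience of mine, not part of the OpenAI SDK; the commented-out usage sketch assumes the official Python SDK and a configured API key:

```python
# Detect when a completion was cut off by the observed 4,096-token output cap.
OUTPUT_TOKEN_CAP = 4096  # completion limit observed for gpt-4-1106-preview

def was_truncated(finish_reason: str, completion_tokens: int) -> bool:
    """True if the response hit the completion-token ceiling."""
    return finish_reason == "length" and completion_tokens >= OUTPUT_TOKEN_CAP

# Usage with the official SDK (requires OPENAI_API_KEY; shown for context):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4-1106-preview",
#     messages=[{"role": "user", "content": "Write a very long report ..."}],
# )
# if was_truncated(resp.choices[0].finish_reason, resp.usage.completion_tokens):
#     print("Response was cut off at the completion cap")
```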

The larger context window makes it much easier to maintain context in lengthy conversations. RAG applications benefit from more detailed in-context learning and a higher chance of having relevant text in context. However, the completion cap is a serious limitation for applications that require extensive outputs, for example data generation or conversion. I wish OpenAI had been more proactive in listing this limitation in its DevDay announcement or the API description; it is mentioned only in the model description in the documentation.

Lastly, it is a reminder to never assume; always check, and use logs and metrics whenever possible. Biases and issues can creep in from unexpected vectors.


Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at christian@bolddata.biz for inquiries.