Prompt Compression — A Hands-On Guide to LLMLingua-2

Devesh Surve
5 min read · Aug 5, 2024

--

So a really interesting paper I read recently was in the domain of LLMs, specifically about prompt compression (https://arxiv.org/abs/2403.12968). I was quite fascinated by it, so I decided to write a simple explanation and tutorial on it, since I didn't find any good ones online.

And of course, like any normal person, your first question would be:

What is Prompt Compression?

In simple words: as LLMs grow more powerful and complex, the length and intricacy of the prompts used to query them have also increased. This has led to significant computational demands and higher costs.

Prompt compression is the process of reducing the length of input prompts to LLMs while retaining essential information.

Why Does Prompt Compression Matter at Scale?

This technique is crucial because:

  • Efficiency: Shorter prompts lead to faster inference times, reducing the time it takes for an LLM to generate a response.
  • Cost Savings: Many LLM services, such as those provided by OpenAI, charge users based on the number of tokens processed. Compressing prompts can significantly lower these costs.
  • Improved Performance: In some cases, long prompts can overwhelm LLMs, causing them to lose contextual accuracy. Compressed prompts can help maintain or even enhance the model’s performance by focusing on the most relevant information.

Okay, So What is LLMLingua?

Now you're asking the right questions. LLMLingua is a framework developed by researchers at Tsinghua University and Microsoft, aimed at improving the efficiency of prompt compression. LLMLingua uses a compact, well-trained language model (e.g., GPT-2 small or LLaMA-7B) to identify and remove non-essential tokens from prompts. This enables efficient inference with large language models (LLMs), with up to 20x compression and minimal performance loss claimed. The latest version, LLMLingua-2, introduces a task-agnostic approach that is faster and more efficient than previous methods.

Key Features of LLMLingua-2:

  • Task-Agnostic Compression: Unlike task-aware methods that tailor compression to specific tasks, LLMLingua-2 can be applied broadly across various applications.
  • Bidirectional Transformer Encoder: Uses a lightweight bidirectional transformer model to capture essential information from both directions of the text (see the simplified sketch after this list).
  • Lower Latency: Designed to be computationally efficient, resulting in lower latency during inference.
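
To make the token-classification idea a bit more concrete, here is a rough, simplified sketch of how a bidirectional encoder can score each token for a keep/drop decision. This is only my illustration of the principle using the Hugging Face transformers API, not the library's actual implementation; details such as the label index are assumptions you should verify against model.config.id2label. The llmlingua package, shown later in this guide, wraps all of this for you.

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# The LLMLingua-2 checkpoint is a token-classification model: every token
# gets a probability of being kept. (Illustrative sketch only.)
model_name = "microsoft/llmlingua-2-xlm-roberta-large-meetingbank"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Good morning everyone, let us please move on to agenda item three."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Assumption: index 1 is the "keep" class; check model.config.id2label.
keep_prob = torch.softmax(logits, dim=-1)[0, :, 1]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
kept = [tok for tok, p in zip(tokens, keep_prob) if p > 0.5]
print(kept)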

So I decided to write a small hands-on guide about it after referring to their documentation.

I have also created a Colab notebook for this guide. Here's the link: https://colab.research.google.com/drive/1jwtJ0p_qrRNY9lOy48lybYEcx6TkJD31?usp=sharing

Here are some of the key code snippets:

How to Use LLMLingua-2: A Demo

To demonstrate the use of LLMLingua-2, let’s go through a step-by-step guide on setting it up and applying it to both in-domain and out-of-domain datasets.

Step 1: Installation and Setup

First, ensure you have the necessary libraries installed. You can install the LLMLingua package, the OpenAI client, and the Hugging Face datasets library as follows:

!pip install llmlingua openai datasets

Step 2: Initializing the Prompt Compressor

There are two variants of this model available on Hugging Face, as well as a hosted Space you can try. Do check those out:

LLMLingua models on Hugging Face

from llmlingua import PromptCompressor

# Load the LLMLingua-2 compressor (an XLM-RoBERTa-large model fine-tuned on MeetingBank)
llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
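
Before compressing a long transcript, it can be reassuring to sanity-check the compressor on a short snippet first. The sentence below is just an illustrative example of my own:

# Quick sanity check on a short piece of text (illustrative example)
demo_text = (
    "The meeting was called to order at 9 a.m. Council members discussed "
    "the budget, parking enforcement, and the new library branch."
)
result = llm_lingua.compress_prompt(demo_text, rate=0.5)
print(result["compressed_prompt"])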

Step 3: Loading a Dataset

Next, we load a test dataset using the Hugging Face datasets library.

from datasets import load_dataset

# Load the MeetingBank test split and build a prompt: meeting transcript + question
dataset = load_dataset("huuuyeah/meetingbank", split="test")
context = dataset[0]["transcript"]
question = "What is the agenda item three resolution 31669 about?\nAnswer:"
prompt = "\n\n".join([context, question])
print(len(prompt))

As you see, the entire prompt is about 7000 characters.
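
Since OpenAI bills by tokens rather than characters, it is also worth checking the token count. Here is an optional extra (not in the original notebook) using the tiktoken library:

import tiktoken  # pip install tiktoken

# Token count, not character count, is what actually drives cost and latency
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(encoding.encode(prompt)), "tokens")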

Step 4: Initializing the OpenAI API Client

from openai import OpenAI

client = OpenAI(api_key="<Replace with your key>")

# Ask the question against the full, uncompressed prompt
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="gpt-3.5-turbo",
    max_tokens=100,
    temperature=0,
    top_p=1,
    n=1,
)
print(chat_completion.choices[0].message.content)

Step 5: Compressing the Prompt Using LLMLingua

We compress the text in the context variable with llm_lingua.compress_prompt(…). Here is what each argument does:

  • rate=0.33: the compression rate. A rate of 0.33 means the function will attempt to reduce the text to 33% of its original size, effectively compressing it by 67%.
  • force_tokens=["!", ".", "?", "\n"]: ensures that these tokens or characters remain in the compressed text.
  • drop_consecutive=True: drops consecutive duplicate tokens.
compressed_prompt = llm_lingua.compress_prompt(
    context,
    rate=0.33,
    force_tokens=["!", ".", "?", "\n"],
    drop_consecutive=True,
)
shorter_prompt = "\n\n".join([compressed_prompt["compressed_prompt"], question])
print(len(shorter_prompt))

And you see that suddenly our prompt has been reduced by 67%!
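
The compress_prompt call returns more than just the compressed text; the result dictionary also reports statistics such as token counts and the achieved compression. The exact field names can differ between llmlingua versions, so the simplest approach is to inspect the dictionary yourself:

# Print every field the compressor returns, except the (long) compressed text itself
for key, value in compressed_prompt.items():
    if key != "compressed_prompt":
        print(f"{key}: {value}")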

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": shorter_prompt,
        }
    ],
    model="gpt-3.5-turbo",
    max_tokens=100,
    temperature=0,
    top_p=1,
    n=1,
)
print(chat_completion.choices[0].message.content)

And yet, the answer remains the same!
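
Since we call the chat completions endpoint several times in this guide, a small helper function (my own addition, not part of LLMLingua or the notebook) makes it easy to compare the two answers side by side:

def ask(client, user_prompt, model="gpt-3.5-turbo", max_tokens=100):
    # Send a single user prompt and return the model's reply text
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": user_prompt}],
        model=model,
        max_tokens=max_tokens,
        temperature=0,
        top_p=1,
        n=1,
    )
    return response.choices[0].message.content

# Compare the answers for the full and the compressed prompt
print("Full prompt answer:      ", ask(client, prompt))
print("Compressed prompt answer:", ask(client, shorter_prompt))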

Step 6: Loading Another Sample from the LongBench Dataset (Out-of-Domain Example)

We start by defining prompt templates for different datasets and tasks.

# Prompt template and maximum generation length for the NarrativeQA task
dataset2prompt = {
    "narrativeqa": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nStory: {context}\n\nNow, answer the question based on the story as concisely as you can, using a single phrase if possible. Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:",
}
dataset2maxlen = {
    "narrativeqa": 128,
}

task = "narrativeqa"
dataset = load_dataset("THUDM/LongBench", task, split="test")
sample = dataset[3]
context = sample["context"]
prompt_format = dataset2prompt[task]
max_gen = int(dataset2maxlen[task])
new_prompt = prompt_format.format(**sample)

Again, our prompt is very long in terms of character count.
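
A quick length check, mirroring Step 3 (this line is my addition):

print(len(new_prompt))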

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": new_prompt,
        }
    ],
    model="gpt-3.5-turbo",
    max_tokens=100,
    temperature=0,
    top_p=1,
    n=1,
)
print(chat_completion.choices[0].message.content)

And the answer to our prompt is “To persuade Socrates to escape.”

Step 7: Compressing with Respect to Target Tokens

Here, instead of a compression rate, we specify a target number of tokens.

compressed_prompt = llm_lingua.compress_prompt(
    context,
    target_token=3000,
    force_tokens=["!", ".", "?", "\n"],
    drop_consecutive=True,
)

Again, we see that our prompt has become a third of what it was!

sample["context"] = compressed_prompt["compressed_prompt"]
new_smaller_prompt = prompt_format.format(**sample)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": new_smaller_prompt,
}
],
model="gpt-3.5-turbo",
max_tokens=100,
temperature=0,
top_p=1,
n=1
)
print(chat_completion.choices[0].message.content)

And yet, our answer remains correct!

Benefits of Using LLMLingua-2

  • Scalability: LLMLingua-2’s ability to compress prompts effectively allows for scalable LLM applications.
  • Cost Efficiency: By reducing the number of tokens, LLMLingua-2 helps lower the operational costs associated with using LLM services (see the rough cost estimate after this list).
  • Performance Enhancement: Compressed prompts can help maintain or improve the accuracy and relevance of LLM responses, especially for models that struggle with long contexts.
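
As a back-of-the-envelope illustration of the cost point: the numbers below are made up for the example, so check your provider's current per-token pricing before relying on them.

# Hypothetical pricing and traffic figures, purely for illustration
price_per_million_input_tokens = 0.50  # dollars; check current rates
requests_per_day = 100_000
original_tokens = 2_000
compressed_tokens = 660  # roughly what rate=0.33 would leave

def daily_cost(tokens_per_request):
    # Daily input-token spend at the assumed price and traffic
    return tokens_per_request * requests_per_day * price_per_million_input_tokens / 1_000_000

print(f"Daily input cost, original:   ${daily_cost(original_tokens):,.2f}")
print(f"Daily input cost, compressed: ${daily_cost(compressed_tokens):,.2f}")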

Limitations and Considerations

While LLMLingua-2 is highly effective for general prompt compression, task-specific methods might still outperform it in specialized scenarios. For instance, task-aware compression methods like LongLLMLingua leverage additional context-specific information, which can result in better performance for certain tasks.

Conclusion

Prompt compression is a critical innovation in optimizing the use of large language models. LLMLingua-2 stands out as a powerful tool that provides significant advantages in terms of speed, efficiency, and cost savings. By incorporating this method into your LLM workflows, you can achieve better performance and scalability, making it an invaluable asset in the era of advanced AI applications.

For more detailed information and to access the source code for LLMLingua-2, visit the GitHub repository and the arXiv paper.

If you read all the way to the end, I'd really appreciate it if you gave the article a clap. Follow for more such content!
