
Trunq: a prompt compression engine

November 2, 2025

I've been working on a token-aware text compression service called trunq. This obviously makes it suitable for LLM prompt compression, but the underlying logic actually has very little to do with LLMs and everything to do with classical NLP.

Here, we'll talk about the theoretical background of trunq: bits of math, linguistics, and algorithms. For general reference, see the repo/docs.

I think about this quote a lot:

Outside of the many mistakes and obfuscations I have introduced, nothing here is original. - Stephen DeBacker

It's true of most of an engineer's output, and certainly of trunq. These techniques are well-known and long-studied, but here we're only concerned with how to synthesize them in an interesting and relevant way.

Fundamentally - and this is the strategy trunq uses - you can prune a long text by performing three operations:

  1. Break it into smaller units (segmentation)
  2. Decide which units to keep (vectorization, MMR packing)
  3. Keep them (stitching)

We break down each operation below.

Segmentation

Sentence boundary disambiguation

The first question to ask is what size segment to pick - we'll start with sentences. Certainly, recalling the application, there are many situations where you might care about sentence-level segments:

You are a FAANG recruiter. Can you review my resume for SWE jobs?
Summarize this article in two paragraphs. Make sure it's clear and easy to understand.

Segmenting text at this granularity is called sentence boundary disambiguation (SBD), which is a pretentious term for chasing edge cases. Treating periods as sentence delimiters is a good start, but how about false positives in code blocks or URLs or ellipses or decimals or IP addresses or version numbers or...?

My point is you could spend all day coming up with such cases, and on hour 24 you'd still be finding new ones. But the strategy is simple: whatever your ruleset is, precompute those "no-split" spans. Do an initial pass through the text and find all (start, end) pairs, so you never define a boundary within them on later passes.

Then do a standard left-right two-pointer pass. When the second pointer hits an "ender" candidate character, check that segmenting there is safe - it doesn't fall within a marked span, it's not an abbreviation, etc. If so, emit a Segment and reset the starting pointer. Note also that we should consolidate terminal punctuation because prompts can contain "!!" and "?!."
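
Here's a minimal sketch of that two-pass approach. The patterns, the abbreviation list, and returning plain strings instead of Segment objects are simplifications of mine; trunq's actual ruleset is far more complete.

```python
import re

# Hypothetical, heavily trimmed ruleset: a real one also covers code blocks,
# IP addresses, ellipses, and so on.
NO_SPLIT_PATTERNS = [
    re.compile(r"https?://\S+"),                                 # URLs
    re.compile(r"\d+\.\d+(\.\d+)*"),                             # decimals, versions
    re.compile(r"\b(e\.g|i\.e|etc|Dr|Mr|Ms)\.", re.IGNORECASE),  # abbreviations
]
ENDERS = set(".!?")

def no_split_spans(text):
    """First pass: collect (start, end) spans where boundaries are forbidden."""
    return [(m.start(), m.end()) for pat in NO_SPLIT_PATTERNS for m in pat.finditer(text)]

def split_sentences(text):
    """Second pass: two-pointer scan that splits on safe 'ender' characters."""
    spans = no_split_spans(text)

    def protected(i):
        return any(start <= i < end for start, end in spans)

    segments, seg_start, i = [], 0, 0
    while i < len(text):
        if text[i] in ENDERS and not protected(i):
            # Consolidate runs of terminal punctuation like "?!" or "!!".
            while i + 1 < len(text) and text[i + 1] in ENDERS:
                i += 1
            segment = text[seg_start:i + 1].strip()
            if segment:
                segments.append(segment)
            seg_start = i + 1
        i += 1
    tail = text[seg_start:].strip()
    if tail:
        segments.append(tail)
    return segments

print(split_sentences("Use Python 3.11. See https://example.com for docs. Ready?!"))
# ['Use Python 3.11.', 'See https://example.com for docs.', 'Ready?!']
```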

That's SBD. The efficacy is a function of your ruleset completeness. See here for mine.

A more interesting question

is how to increase granularity. It doesn't really make sense to keep an entire sentence if only part of it is useful. Word-level segmentation is probably a non-starter if we want the compressed prompt to be coherent English (function words like "the" would be pruned first), so we need something between words and sentences in length.

Clauses - informally, phrases containing a subject and verb - are a natural choice.

I liked that sushi place, though it gave me food poisoning.

Both halves are clauses, but the second is dependent (i.e., not a sentence by itself). Not all LLM prompts are as friendly as this one, though. We won't be able to isolate the clauses by looking for a comma delimiter (certainly not in the prompts I write).

We need to expand our search to conjunctions, which connect phrases. "Though" is the conjunction above. Other examples include and, although, whereas, unless, etc. Again, the efficacy is a function of this list's completeness. So the heuristic is: split on a comma, dash, semicolon, etc., plus right before a conjunction.

You may notice that this strategy includes false positives. Consider this sentence:

Think about the pros and cons.

The segments are "Think about the pros" and "and cons." Except "pros and cons" should be a package deal, so what happened?

"And" is a coordinating conjunction; that is, it joins elements of the same grammatical type. Here, "pros" and "cons" are both nouns, so "pros and cons" is grammatical. Similarly, "I studied hard, and I passed the test" is valid. Other conjunctions like "though" are called subordinating conjunctions because they connect dependent clauses (not a standalone sentence) with independent ones.

There are probably a million rabbit holes we could go down involving semantic analysis, but we'll use a lightweight heuristic. Before splitting on a coordinating conjunction, check that the preceding and following segments are at least some arbitrary length. This is based on the assumption that clauses are longer than most words and phrases because there's more stuff in them.
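
As an illustration, here's roughly what that heuristic could look like. The conjunction lists and the four-token minimum are arbitrary stand-ins, not trunq's real rulesets.

```python
# Hypothetical, abbreviated lists.
SUBORDINATORS = {"though", "although", "whereas", "unless", "because", "while"}
COORDINATORS = {"and", "but", "or", "so"}
MIN_CLAUSE_TOKENS = 4  # arbitrary guard: both sides must look clause-sized

def split_clauses(sentence):
    """Split on comma/dash/semicolon and right before conjunctions; only split
    on a coordinating conjunction when both sides pass the length check."""
    tokens = sentence.split()
    boundaries = [0]
    for i in range(1, len(tokens)):
        word = tokens[i].strip(",;:-").lower()
        after_delimiter = tokens[i - 1].endswith((",", ";", "-", ":"))
        if word in SUBORDINATORS or after_delimiter:
            boundaries.append(i)
        elif word in COORDINATORS:
            left, right = i - boundaries[-1], len(tokens) - i - 1
            if left >= MIN_CLAUSE_TOKENS and right >= MIN_CLAUSE_TOKENS:
                boundaries.append(i)
    boundaries.append(len(tokens))
    return [" ".join(tokens[a:b]) for a, b in zip(boundaries, boundaries[1:]) if b > a]

print(split_clauses("I liked that sushi place, though it gave me food poisoning."))
# ['I liked that sushi place,', 'though it gave me food poisoning.']
print(split_clauses("Think about the pros and cons."))
# ['Think about the pros and cons.']  # the length guard blocks the bad split
```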

Now we have chunks of our prompt, some of which we want to keep. Let's identify them.

Vectorization

The goal of vectorization is to convert language into some information-rich numerical format. We could use embedding models for this, but they're expensive to compute and have low explainability, so let's look for a workable alternative.

In my sophomore year at Michigan, I took a web systems course where we built a Google clone. We took a bunch of pages and combined their PageRank scores with a metric called tf-idf computed using the page and an input query. tf-idf expresses the relevance of certain words to a document, making it relevant to search engine ranking, but we can use it here to roughly represent which segments have similar content.

We start by tokenizing the segments: replace all non-alphanumeric characters with whitespace, then split per word. Those are the tokens (words)!1

Also, since the segments are treated as documents, we use the terms interchangeably hereafter. After tokenizing all $N$ segments, define $\mathrm{df}(w)$ as the number containing token $w$. Then define the inverse document frequency (idf) score of $w$ as:

$$\mathrm{idf}(w) = \log\left(\frac{N + 1}{\mathrm{df}(w) + 1}\right) + 1.$$

At the core of this quantity is the ratio of total segments to segments containing $w$. Take a second to appreciate how this is definitionally the inverse document frequency. We add 1 to the numerator and denominator to prevent $\log(0)$ and division by zero, respectively. Intuitively, idf captures how rare each token is across all segments.
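
To put numbers on it, suppose (hypothetically) a prompt with $N = 10$ segments and a natural log:

$$\mathrm{idf}(w_{\text{common}}) = \log\left(\frac{11}{10}\right) + 1 \approx 1.10, \qquad \mathrm{idf}(w_{\text{rare}}) = \log\left(\frac{11}{2}\right) + 1 \approx 2.70,$$

where $w_{\text{common}}$ appears in 9 of the segments and $w_{\text{rare}}$ in only 1. Rarer tokens earn meaningfully larger weights.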

Compute the term frequency (tf) as the count of a token $w$ in a segment $d$, denoted $\mathrm{tf}(w, d)$.

Now recall that our goal is to vectorize the segments. Build the vector for segment $d$ as follows:

$$v_d[w] = \mathrm{idf}(w) \cdot \mathrm{tf}(w, d).$$

This quantity is large when $w$ appears often in $d$ and rarely anywhere else. As with embeddings, we can and will compare segments based on the distance of their vectors in space. We'll use cosine similarity, so let's also follow the standard practice of normalizing all vectors:

$$v_d = \frac{v_d}{\sqrt{\sum_w v_d[w]^2}}$$

so that the cosine of the angle between two vectors reduces to their dot product.
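
Here's a compact sketch of the whole vectorization step, using sparse dicts (my choice for readability, not necessarily how trunq stores its vectors):

```python
import math
import re
from collections import Counter

def tokenize(segment):
    """Replace non-alphanumeric characters with whitespace, then split."""
    return re.sub(r"[^0-9A-Za-z]+", " ", segment).split()

def tfidf_vectors(segments):
    """One L2-normalized sparse tf-idf vector per segment."""
    docs = [Counter(tokenize(s)) for s in segments]
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(doc.keys())  # count segments containing each token
    idf = {w: math.log((n + 1) / (df[w] + 1)) + 1 for w in df}

    vectors = []
    for doc in docs:
        v = {w: idf[w] * tf for w, tf in doc.items()}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vectors.append({w: x / norm for w, x in v.items()})
    return vectors

def cosine(u, v):
    """Dot product; equals cosine similarity since the vectors are normalized."""
    return sum(x * v.get(w, 0.0) for w, x in u.items())
```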

Great! Shall we do something with these vectors?

MMR packing

Maximal marginal relevance (MMR) was introduced by Carbonell and Goldstein in 1998 to optimize search engine results. Their idea was to score a document based not just on its relevance to a user's query but also its diversity with respect to already-retrieved documents. This sentence from the paper is particularly salient:

Summaries need to avoid redundancy, as it defeats the purpose of summarization.

In this step, trunq uses MMR to identify the segments that maximize relevance and diversity.

Now you might ask: relevance to what? This is a fair and subtle question. Prompt "compression" implies some loss of information, but we need to know what's considered valuable information to minimize loss of it.

In trunq's case, relevance can be specified by the user or defined with respect to the prompt itself. From my experimentation, sending the previous prompt as the reference query works well. That string is vectorized for cosine similarity comparison with segments.

But when no query is specified, we synthesize the query vector from the document itself. The simplest way is to take the centroid of all segment vectors: their normalized average. Intuitively, this is the "average segment" in the prompt - an easy summarization heuristic.

Each segment's cosine similarity to this centroid gives a relevance score. We simply pick the segments that maximize it.
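
A sketch of that fallback, reusing the `tfidf_vectors` and `cosine` helpers from the previous snippet:

```python
import math
from collections import Counter

def centroid(vectors):
    """Normalized average of the segment vectors: the 'average segment'."""
    totals = Counter()
    for v in vectors:
        totals.update(v)  # Counter.update sums values per token
    avg = {w: x / len(vectors) for w, x in totals.items()}
    norm = math.sqrt(sum(x * x for x in avg.values())) or 1.0
    return {w: x / norm for w, x in avg.items()}

# vectors = tfidf_vectors(segments)
# query = centroid(vectors)                      # synthesized query vector
# relevance = [cosine(v, query) for v in vectors]
```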

Dually, as Carbonell advises, we must also minimize redundancy between segments. We optimize with respect to both criteria using the MMR score. Denote the query vector by $q$, the set of segments by $R$, and the set of selected ones by $S$.

Then for $v \in R$, we can express the MMR score of $v$ as follows:

$$\mathrm{score}_{MMR}(v) = \lambda\, v\cdot q - (1 - \lambda)\max_{d\in S}(v\cdot d)$$

The second term computes the maximum similarity between a candidate segment and the already-chosen ones. By penalizing similarity to existing choices, we discourage overlap between segments. The weighting parameter $\lambda$ determines whether to prioritize diversity or relevance.

We iterate through the segments, greedily selecting the ones with highest MMR score.2
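
In code, the greedy loop might look like this. It reuses `cosine` from the vectorization sketch; the signature, `token_counts` (per-segment token counts from whatever tokenizer defines the budget), and the budget handling are my assumptions rather than trunq's API.

```python
def mmr_select(vectors, query, token_counts, budget, lam=0.7, seed=None):
    """Greedily add the highest-MMR segment until the token budget is spent.

    `seed` lets the caller pre-populate S with must-keep segment indices.
    """
    selected = list(seed or [])
    used = sum(token_counts[i] for i in selected)
    remaining = [i for i in range(len(vectors)) if i not in selected]

    while remaining:
        def mmr(i):
            relevance = cosine(vectors[i], query)
            redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr)
        remaining.remove(best)
        if used + token_counts[best] <= budget:
            selected.append(best)
            used += token_counts[best]
    return sorted(selected)  # restore document order for stitching
```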

One valuable thing to allow is initialization of $S$. Consider that some groups of segments should be atomic: either all or none are included. Code blocks spanning multiple segments, for instance. We can't haphazardly erase bits of code. Or perhaps we want to allow the user to specify which keywords they want to keep in the compressed prompt.

As long as the "must-keep" segments are within the token budget, we add them to $S$. Future selections are then made around this scaffolding.
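
With the sketch above, that seeding might look like this (the indices and budget are invented for illustration):

```python
# Segments 3-4 form a multi-segment code block we refuse to break apart.
must_keep = [3, 4]
chosen = mmr_select(vectors, query, token_counts, budget=200, seed=must_keep)
```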

Stitching

This is the simplest stage of the base trunq pipeline. The selected segments are ordered and concatenated, separated by spaces; some, like headings, get trailing newlines.

Finally, adding delimiters risks affecting tokenization, so we run a final pass to make sure the compressed prompt fits within the hard token budget.
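
One simple way to implement that pass, where `count_tokens` stands in for whatever tokenizer defines the hard budget and `headings` marks segments that should end in a newline (both assumptions of mine, not trunq's actual behavior):

```python
def stitch(segments, chosen, count_tokens, budget, headings=frozenset()):
    """Join chosen segments in document order, then enforce the hard token budget."""
    kept = sorted(chosen)
    while kept:
        parts = [segments[i] + ("\n" if i in headings else " ") for i in kept]
        text = "".join(parts).strip()
        if count_tokens(text) <= budget:
            return text
        kept.pop()  # joining can shift token counts; drop the last segment and retry
    return ""
```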


That's a quick introduction to trunq. One thing I like about this project is the number of rabbit holes it exposes. We haven't talked about advanced stitching techniques, nor how to segment code blocks, nor alternative vectorization methods, etc.

The tradeoffs were also very fun to think about. I made several decisions favoring speed and resource efficiency using heuristics, but who knows? Perhaps a lightweight SLM or embedding model would make it more robust for production.

Moving forward, I'll do further work in some areas mentioned in this post (see footnotes too), but feel free to email me or open GitHub issues with suggestions.


Footnotes

  1. This is one aspect of trunq that could benefit from further experimentation - is its simplicity a virtue? N-grams are promising: technical prompts often use multi-word key terms (API key, multi-shot, etc.). Conversely, I don't think stemming and lemmatization would do much: consider that most LLM prompts are short and lacking morphological variation. Either way, empirically, the lexical tokenization works.

  2. A natural question is whether we can apply knapsack here, treating the uniformity penalty as the weight and relevance as the value. My initial thought is no: the penalty is path-dependent in the sense that it depends on which segments have been chosen already. The DP memo is then parameterized by the subset of currently chosen segments, which is exponential. But please email me if you have any ideas to make this work.