Note: Get nerd sniped, for the love of god! Also, if you like this kind of work and your organization does something similar, consider hiring me? I'm aggressively looking for a job and would love to chat.
Cadence - Evolving Code with LLMs for NP-Hard Problems
Introduction
There are very few papers that have moved me as much as AlphaEvolve. The paper pushed past the RL-pilled frontier of the current AI research landscape and proved - by discovering novel algorithms - just how much performance we can still squeeze out of LLMs by improving solutions to hard problems and evolving them to discover new ones. Here's what @karpathy has to say about this style of improvement, and it is more or less what AlphaEvolve does:
Scaling up RL is all the rage right now. I’m fairly certain RL will continue to yield more intermediate gains, but I also don’t expect it to be the full story. … There’s significantly more bits of supervision we extract per rollout via a review/reflect stage of human mechanism of improvement along the lines of “what went well? what didn’t go so well? what should I try next time?” etc. and the lessons from this stage feel explicit, like a new string to be added to the system prompt for the future, optionally to be distilled into weights (/intuition) later a bit like sleep. … Example algorithm: given a task, do a few rollouts, stuff them all into one context window (along with the reward in each case), use a meta-prompt to review/reflect on what went well or not to obtain string “lesson”, to be added to system prompt (or more generally modify the current lessons database). Many blanks to fill in, many tweaks possible, not obvious.
When I was going through the paper, something struck me. The paradigms introduced in AlphaEvolve can be used to make progress on optimization problems where there is no exact "algorithmic breakthrough", but rather a heuristic or cleverness breakthrough - which, in my humble opinion, is something that can't be learned through the kind of pattern recognition prevalent in LLMs.
NP-hard problems have no known polynomial-time solutions (assuming P ≠ NP). They are practically unsolvable for larger instances and call for approaches cleverer than brute force. Since exact algorithms are computationally prohibitive, the two main strategies for tackling NP-hard problems are approximation and heuristics. Neither front is technically easy for LLMs (or any other system) to excel on, as both require extensive knowledge of the problem domain and often manual tuning. So, with all these questions and opinions in my mind, I set out on a journey to find out how much truth there is in my claim.
Cadence is an attempt to understand evolutionary problem solving by studying how solutions evolve over generations of improvement, and whether this can lead to new, improved solutions for NP-hard problems. I'm going to follow paradigms similar to those of AlphaEvolve. The solutions generated by LLMs feed directly into the idea of asymmetric verifiability: a TSP tour is much easier to verify than to find. Asymmetric verification and programming problems go hand in hand - it's tedious to read code and check its correctness, but if you have test cases with ample coverage, you can quickly check whether any given solution is right.
For the purpose of this blog, I chose the Travelling Salesman Problem and analyzed how its solutions evolve. TSP is a classical optimization problem whose exact solutions traditionally require exponential time to compute (but far less time to verify!). According to Jason Wei, all tasks that satisfy the properties below will be solved by AI (Verifier's Law):
- Objective truth: everyone agrees what good solutions are.
- Fast to verify: any given solution can be verified in a few seconds.
- Scalable to verify: many solutions can be verified simultaneously.
- Low noise: verification is as tightly correlated to the solution quality as possible.
- Continuous reward: it’s easy to rank the goodness of many solutions for a single problem.
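TSP checks all of these boxes. To make the verification side concrete, here is a minimal sketch of a tour verifier and scorer (the function name and signature are illustrative, not taken from the Cadence codebase):

import math

def verify_and_score_tour(tour, cities):
    # Illustrative verifier: checking a tour is O(n), while finding the best one is NP-hard.
    # `cities` is a list of (x, y) points; `tour` is a permutation of city indices.
    n = len(cities)
    if sorted(tour) != list(range(n)):
        raise ValueError("infeasible tour: a city is missing or visited twice")
    total = 0.0
    for i in range(n):
        x1, y1 = cities[tour[i]]
        x2, y2 = cities[tour[(i + 1) % n]]  # wrap back to the starting city
        total += math.hypot(x2 - x1, y2 - y1)
    return total

Every candidate program can be scored this way in milliseconds, which is exactly the asymmetry Verifier's Law relies on.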
My first hunch was that, given the stochastic nature of LLMs and their reliance on pattern finding, this is a really hard problem for them to make progress on, and they may well fail at designing approximation and heuristic strategies. While LLMs have demonstrated the ability to discover new mathematical constructions for the Cap Set problem, those results came from huge experiments, often requiring resources far beyond the budget of an individual. So the second thing I want to test is whether scaling improves the quality of solutions for TSP, and whether we can extrapolate the results of huge experiments from a sequence of far smaller, constrained ones - something that was explored in Scaling Scaling Laws with Board Games. Therefore, the aim of this blog is to prove/disprove the following hypotheses:
Hypothesis 1: How good are LLMs at designing approximation and heuristic strategies for the Travelling Salesman Problem?
Hypothesis 2: Can we draw conclusions about how this approach scales on TSP from a series of constrained experiments?
Before moving forward with what I found, let us look into the architectural design of Cadence.
Architectural Overview - Cadence
Cadence implements an evolutionary system that uses large language models (LLMs) to iteratively generate, mutate, and improve programs for solving computational problems. The current implementation focuses on optimizing solutions to the Traveling Salesman Problem (TSP). At its core, Cadence operates on a continuous feedback loop, where programs are treated as evolving entities. The system maintains a population of programs, each representing a potential solution to the target computational problem. The evolutionary process is driven by the performance of these programs on a fixed test suite, with better-performing programs being favored for reproduction and further refinement. The unique aspect of Cadence lies in its use of LLMs as the primary mechanism for generating new program variations and mutations, guided by explicit instructions and the context of previous program generations.
The system evolves programs over generations using the following loop:
- Sample a parent program and its previously generated children.
- Construct a prompt that includes the parent, children, and instructions.
- Use an LLM to generate modified versions of marked code blocks.
- Apply the generated diffs to produce a child program.
- Evaluate the child program’s performance on a fixed test suite.
- Log and store the program and its performance in a database.
- Periodically promote the best-performing program to guide future generations.
- Optionally mutate the instructions used in prompts to encourage better code.
Here’s a more detailed overview of the steps involved in it:
- Parent and Children Sampling: A parent program is selected from the current population, along with its previously generated children. This provides the LLM with historical context about the program's lineage and evolutionary trajectory. Parent and child programs are stored in a SQLite database and sampled from there.
- Prompt Construction: A comprehensive prompt is dynamically constructed. It encapsulates the parent program's code, relevant sections from its children, a set of instructions designed to guide the LLM in generating modifications, and - every set number of generations - a feedback lesson. The instructions are critical for directing the LLM towards specific improvements or exploratory behaviors.
- LLM-driven Code Generation: The constructed prompt is fed to an LLM. The LLM then generates modified versions of designated code blocks within the parent program. This step leverages the LLM's understanding of programming logic and its ability to produce syntactically correct and semantically meaningful code variations. In this blog, we've used gemini-2.0-flash for code generation.
- Diff Application and Child Program Production: The modifications generated by the LLM, often in the form of code differences (diffs), are applied to the parent program. This process yields a new 'child' program, which inherits characteristics from its parent but incorporates the LLM-suggested changes. For now, this works on functions rather than entire files (similar to what was done in FunSearch).
- Performance Evaluation: Each newly generated child program undergoes rigorous evaluation against a predefined, fixed test suite. For the Traveling Salesman Problem, this involves assessing the quality (e.g., shortest path length) of the solution produced by the program. The evaluation employs multi-seed deterministic metrics to ensure consistency and reliability of performance measurements. This feeds into the idea of asymmetric verifiability.
- Program Promotion: Periodically, programs that demonstrate superior performance are promoted within the population. These 'elite' programs are given higher priority for selection as parents in subsequent generations, thereby guiding the evolutionary process towards more optimal solutions. The promotion interval is set via ELITISM_INTERVAL in the config file.
- Meta-prompting (Optional Instruction Mutation): To foster diversity and prevent premature convergence, Cadence incorporates an optional meta-prompting mechanism. This involves periodically mutating the instructions provided to the LLM. This self-adaptive instruction evolution allows Cadence to explore different problem-solving strategies and potentially discover novel approaches that might not be evident with static instructions. We extract lessons after every set number of generations and feed them back to the LLM.
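To make the loop above concrete, here is a self-contained toy version of the select, mutate, evaluate, promote cycle. In Cadence the mutation step is an LLM rewriting code blocks of the parent program; here it is stubbed out with a random two-city swap purely so the skeleton runs end to end:

import math
import random

def tour_length(tour, cities):
    # Total length of a closed tour over (x, y) city coordinates.
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def mutate(tour):
    # Stand-in for the LLM step: swap two positions in the parent's tour.
    a, b = random.sample(range(len(tour)), 2)
    child = tour[:]
    child[a], child[b] = child[b], child[a]
    return child

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(20)]
parent = list(range(len(cities)))                  # baseline: visit cities in input order
best, best_cost = parent, tour_length(parent, cities)

for generation in range(200):
    child = mutate(best)                           # sample a parent, propose a variation
    cost = tour_length(child, cities)              # evaluate on the fixed instance
    if cost < best_cost:                           # promote the better candidate
        best, best_cost = child, cost

print(f"cost after 200 generations: {best_cost:.3f}")

Cadence replaces the random swap with an LLM-generated diff over the program itself, and the single best-so-far tour with a database of parent programs, children, and distilled lessons.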
Experiment and Results for Hypothesis 1
The Travelling Salesman Problem (TSP) is a classic combinatorial optimization problem: given a list of cities and their pairwise distances, what is the shortest tour that visits each city exactly once and returns to the starting point?
While TSP is known to be computationally hard, approximate solutions and heuristic methods are widely used in real-world settings. My central question for this hypothesis is:
Can Large Language Models (LLMs), guided through structured prompting and iterative feedback, learn to generate effective heuristic solutions to the TSP problem?
To explore this, we evolve TSP programs using an LLM-based system, letting it observe, modify, and evaluate code over multiple generations.
Experiment
Setup
We define a minimal evolutionary system with the following loop:
- Baseline Program: A trivial solution (e.g., visiting cities in order) is used as a starting point.
- Prompt Generation: The LLM is given the current program and asked to improve it.
- Code Generation: The LLM returns code blocks representing a new strategy.
- Evaluation: Each program is scored based on the total cost (distance) over multiple random city configurations.
- Feedback: After a set number of generations, the LLM extracts a lesson from the strategies tried so far and feeds it back into the system prompt.
This process is repeated over N=30 generations. Lessons are distilled every few generations using a meta-prompt that summarizes what strategies worked well (or didn't).
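The knobs for this run are just a couple of constants (the names below are illustrative; the values match the experiment described in this post, with the lesson interval set to 5):

# Hypothesis 1 experiment settings (illustrative names; values as used in this post).
NUM_GENERATIONS = 30   # total generations to evolve
LESSON_INTERVAL = 5    # distill a lesson from the run history every 5 generations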
Code
Here is an excerpt from the generation loop:
# Run a single generation
result = run_generation(
    generation,
    parent_program,
    inspirations,
    LESSON_INTERVAL=cfg.LESSON_INTERVAL,
)
And the evaluation phase:
child_result = execute(child_program_code, task)
add(program_code=child_program_code, metric=child_result["cost"])
If generation % LESSON_INTERVAL == 0, we extract insights:
lesson = get_lesson_from_history(EXPERIMENT_LOG, previous_lesson)
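get_lesson_from_history is essentially a meta-prompt over the experiment log. Here is a hedged sketch of what it might look like (the actual prompt and log schema in the repository differ; call_llm stands in for the gemini-2.0-flash client):

def get_lesson_from_history(experiment_log, previous_lesson):
    # Sketch only: assumes each log entry records a generation number, a cost,
    # and a short description of the strategy that generation tried.
    history = "\n".join(
        f"gen {entry['generation']}: cost={entry['cost']:.2f}, strategy={entry['strategy']}"
        for entry in experiment_log
    )
    meta_prompt = (
        "You are reviewing an evolutionary run that improves TSP heuristics.\n"
        f"Previous lesson: {previous_lesson}\n"
        f"Run history:\n{history}\n"
        "In two or three sentences, state what worked, what failed, "
        "and what to try in the next generations."
    )
    return call_llm(meta_prompt)  # call_llm is a placeholder for the LLM client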
Metrics
- Cost: Total distance of tour across seeds.
- Feasibility: Whether all cities are visited once and only once.
- Improvement: Drop in cost over generations.
- Novelty (qualitative): Changes in structure of code and algorithms used.
Sample Prompts
Initial prompt (simplified) - this is the prompt we started with:
Improve the following TSP solution written in Python.
Do not remove any cities, and make sure to return a valid tour.
After a set number of generations, here's what the feedback looked like:
Farthest insertion and limited 2-opt improved feasibility but not cost effectively; next explore metaheuristics like simulated annealing or genetic algorithms with the nearest neighbor or farthest insertion solution as a starting point, and evaluate the impact of different cooling schedules or population sizes on solution quality and computational cost.
Nearest neighbor and stochastic 2-opt did not achieve feasibility; therefore, implement constraint programming or Lagrangian relaxation specifically designed for feasibility and then integrate with a guided local search, such as ALNS or a penalty-based method, to optimize the solution.
Lagrangian relaxation and local search alone are insufficient for achieving feasibility; incorporate constraint-specific repair mechanisms within the Lagrangian framework to actively guide the search toward satisfying the problem's constraints before cost optimization.
Results
Here are the results from the experiment (30 generations, with feedback every 5 generations):
- Cost vs Generation: the cost spikes randomly during the run, then drops below the baseline and stays there.
- Lesson Prompt Influence: generations following meta-prompt edits showed improvement after a few generations. I believe more conclusive results would emerge with more iterations.
Example comparison:
Generation 0:
def tsp(cities): return list(range(len(cities)))
And finally, after 30 generations, the solution looks something like this:
Generation 30:
import numpy as np
import random

def tsp(cities):
    num_cities = len(cities)
    cities_np = np.array(cities)

    def calculate_distance(city1, city2):
        return np.sum((cities_np[city1] - cities_np[city2])**2)**0.5

    def calculate_tour_distance(tour):
        distance = 0
        for i in range(len(tour) - 1):
            distance += calculate_distance(tour[i], tour[i+1])
        distance += calculate_distance(tour[-1], tour[0])
        return distance

    def nearest_neighbor_heuristic():
        start_city = 0
        tour = [start_city]
        unvisited_cities = set(range(num_cities))
        unvisited_cities.remove(start_city)
        current_city = start_city
        while unvisited_cities:
            nearest_city = None
            min_distance = float('inf')
            for city in unvisited_cities:
                distance = calculate_distance(current_city, city)
                if distance < min_distance:
                    min_distance = distance
                    nearest_city = city
            tour.append(nearest_city)
            unvisited_cities.remove(nearest_city)
            current_city = nearest_city
        return tour

    tour = nearest_neighbor_heuristic()
    return tour
Conclusion
From this small experiment, we can conclude that LLMs exhibit the ability to learn heuristics through structured prompting and iterative lesson feedback. Even with minimal explicit instruction, the model developed:
- Greedy strategies (nearest neighbor)
- Loop structures for exploration
- Valid city permutations
However, performance plateaus without stronger meta-reasoning or a curriculum. This, surprisingly, validates the hypothesis that LLMs can design heuristics that actually improve solutions over generations of evolution, but effectiveness depends on how well the prompt structure and feedback mechanism are crafted.
Experiment and Results for Hypothesis 2
The Travelling Salesman Problem (TSP) becomes increasingly difficult as the number of cities increases. Classic heuristics like Nearest Neighbor (NN) scale poorly, often producing suboptimal solutions for large instances.
This experiment asks:
Can a code-evolving LLM-based system generalize to larger TSP instances when trained only on smaller ones?
This aligns with recent research in the “Scaling Laws” literature - particularly in board games and program synthesis - where systems trained under resource constraints exhibit generalization when scaled up, or break down in predictable ways. So, the goal of this experiment is:
To compare how the LLM-evolved TSP heuristics scale in performance across different input sizes, relative to a classic heuristic (Nearest Neighbor), and to extract any scaling trends.
Design Overview
- City Sizes: We sweep over increasing problem sizes (e.g. 10, 15, 20, 25, 30).
- Baseline: For each size, we run the Nearest Neighbor algorithm across multiple seeded instances to compute an average cost and standard deviation.
- LLM Evolution: For the same size, we run a full multi-generation evolutionary loop (5 generations, with 1 feedback lesson), where the LLM is prompted to improve on prior programs.
Each generation in the LLM loop includes:
- Sampling a parent program
- Prompting the LLM to improve it (optionally with “lessons”)
- Applying code diffs and evaluating performance
- Storing metrics and hashing the program
- Extracting lessons every few generations to steer the LLM more effectively
Key Code Snippets
1. Baseline Cost Computation
import numpy as np

def baseline_cost(size, seeds=3):
    # Average Nearest Neighbor tour cost over several seeded instances of a given size.
    scores = []
    for seed in range(seeds):
        cities = generate_test_instance(n=size, seed=seed)
        tour = nearest_neighbor(cities)
        scores.append(compute_total_distance(tour, cities))
    return np.mean(scores), np.std(scores)
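The sweep then just calls this for every size in the experiment (a usage sketch; cfg.SIZES is assumed to hold the city counts listed above):

# Collect the NN baseline (mean cost, std) for each problem size in the sweep.
baselines = {size: baseline_cost(size) for size in cfg.SIZES}  # e.g. (10, 15, 20, 25, 30)
for size, (mean_cost, std_cost) in baselines.items():
    print(f"n={size}: NN baseline {mean_cost:.2f} ± {std_cost:.2f}")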
2. LLM Evolution for a Given Size
def run_size_experiment(size, generations, lesson_interval):
    task = TSPTask(n_cities=size)
    ...
    for gen in trange(1, generations):
        parent, inspirations = sample(...)
        prompt = build(parent, inspirations, lesson)
        diffs = generate(prompt)
        child_code = apply_diff(parent[3], diffs)
        metric = execute(child_code, task)
        ...
3. Plotting Results
ax1.plot(cfg.SIZES, final_costs["llm"], marker="s", linestyle="--", label="LLM Evolution")
ax2.plot(cfg.SIZES, rels, marker="o", color="purple")
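The rels series on the second axis is just the percentage improvement of the evolved heuristic over the NN baseline (a sketch; final_costs["nn"] is assumed to hold the baseline means):

# Relative improvement of the LLM-evolved heuristic over the NN baseline, in percent.
rels = [
    100.0 * (nn - llm) / nn
    for nn, llm in zip(final_costs["nn"], final_costs["llm"])
]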
Results
- Graph 1: Average Tour Cost vs City Count
- Graph 2: LLM Relative Improvement over NN (%)
Key observations from the scaling summary:
City Count | NN Avg Cost | LLM Cost | Relative Improvement |
---|---|---|---|
10 | 321.83 | 340.28 | -5.7% |
15 | 373.64 | 340.28 | 8.9% |
20 | 410.70 | 320.79 | 21.9% |
25 | 495.49 | 321.23 | 35.2% |
30 | 540.10 | 329.15 | 39.1% |
Insights
- As the city count increases, the relative improvement of the LLM-evolved heuristic over the Nearest Neighbor baseline grows, suggesting an emergent scalability in the LLM's learned heuristic.
- Periodic lesson extraction helps the system retain and reuse successful strategies, which contributed to the increase in relative improvement.
Conclusion
This validates my hypothesis - LLM-driven evolution follows scaling laws, and given the right lessons, it actually improves solutions over the baselines!
Future Work
It took well over 3 months to understand what it takes to implement AlphaEvolve - what goes into implementing a paper and how to test your own hypotheses. Reading papers is easy, implementing them is hard, but using the insights from a paper to draw your own conclusions is really, really hard. I hope this encourages readers to work on problems on their own and let their imagination run wild. The ability to run experiments and draw your own conclusions is a very valuable skill that will serve you well if you go down the path of research. There are, however, a lot of rough edges around the project, and I would really appreciate readers submitting PRs and designing their own experiments. Cadence makes this really easy; the instructions are in the project's README.
Want to get involved in the project? Read the docs, play around with Cadence, and make a PR!
Note: If you find work like this interesting and your organization does something similar, consider hiring me? I'm on the job market and would love to chat.