Creating Eval Cases for LLM-Generated Code

Written by Ruslan Galinskii | Sep 11, 2025 3:00:00 PM

By Ruslan Galinskii, Co-Head of Development at OneTick

Based on our experience, it’s not just about writing tests anymore. It’s about accepting that things won’t always work, and learning to focus on steady improvements instead of perfect results.

Introduction

These days, you’re very lucky if you work with fully open-source code — modern LLMs and the tools around them can help a lot with code generation.

Most are not so lucky. At OneTick, we have our own API/query language — onetick-py — that allows us and our customers to “speak” to market data. We also want to provide LLM-based code generation for our query language. However, this gets tricky: out-of-the-box LLMs don’t know about it. And even if they can find it on the internet, there aren’t many examples or use cases publicly available.

That’s exactly why we’re actively building our own coding assistant. We wrote more about it here. We’re working as a team and doing all the usual things you’d expect for such assistants: augmenting the system prompt with a knowledge base, adding RAG, turning a simple assistant into a multi-node graph with different purposes, keeping a human in the loop, etc.

And — obviously and abruptly — we hit the big question: how do we know if it’s actually getting better? There’s a standard answer: evaluation set.

So, I’ll briefly share our experience creating an evaluation set for this kind of task, and what we’ve learned so far.

What Does Our Eval Case Look Like?

We use end-to-end eval cases for the generated code:

Every eval case has its own static, pre-prepared dataset that we pass as input (we’re working on a code generator for a query language that operates on data). We prepare data right in a test module for better visibility.
Every eval case includes a specific query request. For example: What was the AAPL price at yesterday’s market open on the NYSE?
For each eval case, we write our own reference function, which we run against the prepared data to ensure consistency — and to verify that the answer can actually be produced.
And finally, the core of the eval case: we call our coding assistant to generate code, execute it with the prepared data, and check the result. We call it a submission — that’s exactly what we try to evaluate.

We use pytest as the framework for running these tests. It offers an easy-to-read API for defining tests and assertions, and provides convenient ways to create and apply reusable resources called fixtures. However, over time, having multiple interdependent fixtures can become difficult to manage and may make it harder to understand what’s actually being used in the tests.

Here’s an example of what it looks like:

"""
The eval case checks our coding assistant’s ability to generate point-in-time (PIT) logic using the onetick-py API.
Specifically, to find the price of an instrument on a given market at a specific point in time in the past.
"""
import pytest
import pandas as pd
import onetick.py as otp
from datetime import datetime, timedelta

def get_last_thursday(today):
# if today is Thursday, return week ago, because GPT4 interprets "last Thursday" as "a week ago"
offset = (today.weekday() - 4) % 7 + 1
last_thursday = today - timedelta(days=offset)
return last_thursday

@pytest.fixture
def query():
""" The fixture defines a user's request """
return "What was the AAPL market price at noon last Thursday?"

@pytest.fixture
def session(today):
""" Here we prepare data for tests """
with otp.TestSession() as session:
# Create trades ticks
trades = otp.Ticks(
{
'offset': [
otp.Hour(11) + otp.Minute(51),
otp.Hour(11) + otp.Minute(59), # <-- correct resulting tick
# add this tick to check that we don't pick any other ticks
otp.Hour(11) + otp.Minute(60) + otp.Second(1),
otp.Hour(11) + otp.Minute(62),
otp.Hour(11) + otp.Minute(63)
],
'PRICE': [13.68, 13.72, 13.74, 13.69, 13.65]
},
)
# Create and add trades data to the database
db = otp.DB("NYSE_TAQ")
db.add(trades, tick_type='TRD', date=get_last_thursday(today))
session.use(db)
yield session

def reference_function():
""" Our manually written implementation according to a request """
date = get_last_thursday(datetime.now().date())
data = otp.DataSource(
db="NYSE_TAQ",
symbol="AAPL",
back_to_first_tick=otp.Day(1)
)
df = otp.run(
otp.agg.last_tick().apply(data),
start=date + otp.Hour(12),
end=date + otp.Hour(12),
)
# Return df with price at noon
return df

@pytest.mark.reference
def test_reference_function(session):
result = reference_function()
print(result)
assert len(result) == 1
assert result['PRICE'].iloc[0] == pytest.approx(13.72)

@pytest.mark.submission
def test_submission_function(target_function_gen, session, execution_number):
"""
The `target_function_gen` fixture asks our coding assistant to generate code based on the user's request, provided by the `query` fixture.
It returns a callable function with no parameters.
The `execution_number` fixture controls the number of attempts to run this test.
We manage it externally via config and use it to obtain statistical results and assess robustness.
"""
result = target_function_gen()
assert len(result) == 1
assert result['PRICE'].iloc[0] == pytest.approx(13.72)

There are a few other common fixtures that came up out of necessity — mainly to bind data with the corresponding implementation — but those are just technical implementation details, so let’s skip them; nothing interesting there.

Well, once the test structure is defined, everything else is fairly standard: write the cases and run pytest to check the results. It might seem simple, but our experience shows it’s not quite that easy. Below, I described the challenges we faced while preparing the evaluation set.

First Problem: Randomness

The first problem is that the generated code can be random in terms of inputs and outputs. That’s why we clearly state in the system prompt that the generated code must be wrapped in a function with no parameters, the function should have a fixed, predefined name, and the output should be a DataFrame.

This makes it easy to understand the entry point of the generated function and how to call it.

Sure, we’re probably losing some flexibility around parameterization — but that’s intentional. We aim to specify everything needed to run the code up front.

Even with these restrictions, there’s still some flexibility in the structure of the resulting DataFrame. Since we want to assert the output, it helps to know the name of the column to check.

That’s why we introduced a special optional fixture that lets us extend the prompt and guide the assistant to put the result into a specific column.

Second problem: Tests Will Fail

I’d say this one is more of a psychological problem — at least for me. Personally, I’m used to passing tests when I work on code. Or, to put it more strongly: I’m used to committing code only when tests are green.

But when working with eval cases for code generation, you have to come to terms with committing code where tests fail — specifically, eval tests. And it happens for many reasons:

Sometimes LLMs generate code that’s almost correct from a logic perspective, or
Sometimes they generate syntactically correct code that’s logically wrong, or
Sometimes a user’s request (query) might have different interpretations, or…

and then there are all the other “or”-s that come to mind:

The system prompt isn’t perfect
The assistant can’t handle complex logic
Knowledge is missing
RAG isn’t doing its job
Or maybe the model just isn’t the right one

And the truth is, you can’t fix all of these things at once. Some you simply can’t fix at all — like the LLM’s inherent limitations. And you never really know where the boundary is between what’s possible and what’s just outside its skill set.

That’s why at some point, you have to stop tweaking an eval case — especially if it always fails — write down your observations and thoughts on what could be improved, and commit it as is.

And honestly, for me, this brings in a whole new feeling to software development: You have to admit that it’s time to stop — that you’ve spent enough time on it. There’s no clear stopping rule; you have to decide it yourself.

And you get a new, weird feeling when committing: like you’ve lost. Like you couldn’t make it work. That’s why I say it’s more of a psychological problem — because in the standard (or previous) software development lifecycle, I never experienced this kind of thing, especially not on short-term tasks.

In the meantime, it introduces a new perspective into the software development cycle: usually, tests provided with the code are meant to detect regressions, but this kind of evaluation test is designed to measure improvements.

Third Problem: Robustness

The assistant can sometimes generate correct code — and sometimes not.

That pushed us to add a fixture (execution_number you can find in code above) that runs each eval case multiple times, to check and report on its robustness.

It also gives a new perspective on testing: a statistical one.

Fourth Problem: LLMs are really smart

LLMs are really smart — they can easily solve simple, generated tasks. If you’ve got a comprehensive agent inside, with multiple nodes, reflection, and other fancy techniques, it can handle even more.

It’s easy to come up with totally different (from a logic perspective) simple cases. But it’s very challenging to invent a variety of non-trivial coding tasks. And that’s crucial — because similar tasks are essentially useless for eval cases and will bias the final score.

We try to tackle the diversity problem by asking different people to come up with queries. But even that isn’t easy — it requires people who are not only skilled, but who also understand the domain (which, in our case, is not trivial). We’ve also tried leveraging different sources that provide us with non-trivial cases, including some non-obvious ones.

For example, certain product UI features (and their underlying queries) for drilling down into data haven’t changed for years — which suggests stability and consistent usage. If our UI supports features like filtering, aggregations, and other data operations, it means we’ve already evolved toward certain patterns. We can reverse-engineer those patterns into corresponding natural language queries.

That’s just one example. In reality, we try to use a mix of sources and people — because it leads to more diverse cases. And we have to keep in mind: the more cases you have, the better your understanding of code generation capabilities.

Reporting and Analysis

We use integration with Langfuse to collect run results for further analysis, since Langfuse allows us to build charts and drill down into any failure — all the way to the exact prompts sent to the LLM provider.

We’ll write a separate article soon on how we collect and analyze the results of our eval cases.

Conclusion

Here are the key takeaways from our experience working on eval cases for code generation:

Code generation evaluation is a challenging and mind-shifting task — but it’s essential. Without a solid evaluation set, you can’t really tell whether any change is an improvement or not.
When working with LLMs, you can’t expect deterministic results. That’s why any evaluation becomes statistical. Every eval case should be run multiple times.
An eval case is not a unit test — it can fail, and sometimes it fails every time. And that’s actually useful: those failing cases highlight areas for potential future improvements.
It’s important to understand why they fail. Publishing your code, examples, use cases, and articles becomes extremely important — especially when dealing with a proprietary API.
Modern LLM providers mostly train on public data and use public sources to augment context during inference. That’s what pushed us to make our API documentation public — and even consider open-sourcing it (which we plan to do soon).

Want to hear more about OneTick's ongoing development in LLM-generated code? Subscribe to the OneTick Blog and keep an eye out for more details in an upcoming webinar.

Until then,

Ruslan Galinskii

OneTick Co-Head of Development

View full post