# How to Have LLMs Write More Efficient Code Using Optimal Data Structures
Software engineer and entrepreneur based in San Francisco.
You've probably noticed it too. You ask an LLM to write some code, and it works perfectly... on your test data of 100 items. Then you deploy to production with 100,000 items and suddenly your server is on fire.
I've been using Claude, ChatGPT, and other LLMs extensively for code generation, and I've noticed they have the same bad habits many junior engineers have: defaulting to arrays when sets would be better, writing nested loops when a single pass would suffice, and generating N+1 SQL queries like our database won't ever have more than 100 rows.
LLMs are trained on a massive corpus of code where correctness is prioritized over performance. Most tutorial code, Stack Overflow answers, and GitHub repositories optimize for readability and "getting it working" rather than efficiency. As Martin Kleppmann notes in Designing Data-Intensive Applications, "premature optimization" warnings have led to a generation of developers who never optimize at all.
After spending months refactoring LLM-generated code and studying both Designing Data-Intensive Applications and A Common-Sense Guide to Data Structures and Algorithms, I've identified the most common performance mistakes LLMs make—and more importantly, how to get them to write efficient code from the start.
## 1: Arrays Instead of Sets

This is by far the most common mistake I see. LLMs love arrays, even when you're doing membership checks.
// LLM typically generates this
const userIds = [1, 2, 3, 4, 5, ...]; // thousands of IDs
const bannedUsers = [101, 102, 103, ...]; // hundreds of banned IDs
function isUserBanned(userId: number): boolean {
return bannedUsers.includes(userId); // O(n) lookup!
}
// Processing users
userIds.forEach(userId => {
if (isUserBanned(userId)) { // O(n) * O(m) = O(n*m)
console.log(`User ${userId} is banned`);
}
});
// Better approach
const userIds = [1, 2, 3, 4, 5, ...];
const bannedUsersSet = new Set([101, 102, 103, ...]); // Set for O(1) lookups
function isUserBanned(userId: number): boolean {
return bannedUsersSet.has(userId); // O(1) lookup!
}
userIds.forEach(userId => {
if (isUserBanned(userId)) { // O(n) * O(1) = O(n)
console.log(`User ${userId} is banned`);
}
});
// Optimal for large-scale processing
const userIds = [1, 2, 3, 4, 5, ...];
const bannedUsersSet = new Set([101, 102, 103, ...]);
// Process in batches to optimize memory access patterns
const BATCH_SIZE = 1000;
const results: number[] = [];
for (let i = 0; i < userIds.length; i += BATCH_SIZE) {
const batch = userIds.slice(i, i + BATCH_SIZE);
const bannedInBatch = batch.filter(id => bannedUsersSet.has(id));
results.push(...bannedInBatch);
}
console.log(`Found ${results.length} banned users`);
Modern CPUs have multiple levels of cache (L1, L2, L3). When you process data in batches that fit in cache, you get significantly better performance due to spatial locality. This is especially important when dealing with millions of items. The batch approach also makes it easier to parallelize work across multiple threads or workers.
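That last point is easy to sketch. Assuming the per-batch work is genuinely asynchronous (a worker-thread call, an API request, a database query), a small helper like this fans batches out concurrently; the helper name and batch size are illustrative, not part of the example above:

```typescript
// Illustrative helper: split work into batches and process them concurrently.
// processBatch is whatever async, per-batch operation you have (a worker-thread
// call, an API request, a DB query, ...).
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  processBatch: (batch: T[]) => Promise<R[]>
): Promise<R[]> {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  // Each batch is still a tight, cache-friendly loop; the batches themselves run concurrently.
  const results = await Promise.all(batches.map(batch => processBatch(batch)));
  return results.flat();
}

// Usage with the banned-user check above:
// const banned = await processInBatches(userIds, 1000, async batch =>
//   batch.filter(id => bannedUsersSet.has(id))
// );
```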
## 2: N+1 Database Queries

LLMs are notorious for generating code that makes database queries inside loops. This is perhaps the most performance-destroying pattern in web applications.
// LLM often generates this disaster
async function getUsersWithPosts(userIds: number[]) {
const results = [];
for (const userId of userIds) {
const user = await db.query("SELECT * FROM users WHERE id = ?", [userId]);
const posts = await db.query("SELECT * FROM posts WHERE user_id = ?", [userId]);
results.push({
...user,
posts,
});
}
return results; // Two queries per user: the classic N+1 problem!
}
// Better approach with a single query
async function getUsersWithPosts(userIds: number[]) {
const query = `
SELECT
u.id, u.name, u.email,
p.id as post_id, p.title, p.content
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
WHERE u.id IN (${userIds.map(() => "?").join(",")})
`;
const rows = await db.query(query, userIds);
// Group results by user
const userMap = new Map();
rows.forEach(row => {
if (!userMap.has(row.id)) {
userMap.set(row.id, {
id: row.id,
name: row.name,
email: row.email,
posts: [],
});
}
if (row.post_id) {
userMap.get(row.id).posts.push({
id: row.post_id,
title: row.title,
content: row.content,
});
}
});
return Array.from(userMap.values());
}
// Optimal approach for production
const GET_USERS_WITH_POSTS = `
SELECT
u.id, u.name, u.email,
COALESCE(
JSON_AGG(
JSON_BUILD_OBJECT(
'id', p.id,
'title', p.title,
'content', p.content
) ORDER BY p.created_at DESC
) FILTER (WHERE p.id IS NOT NULL),
'[]'::json
) as posts
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
WHERE u.id = ANY($1::int[])
GROUP BY u.id
`;
// Use prepared statement with connection pool
async function getUsersWithPosts(userIds: number[]) {
const client = await pool.connect();
try {
const result = await client.query(GET_USERS_WITH_POSTS, [userIds]);
return result.rows.map(row => ({
...row,
posts: row.posts || [],
}));
} finally {
client.release();
}
}
## 3: String Concatenation in Loops

LLMs love string concatenation in loops, apparently unaware that strings are immutable in most languages.
// LLM typically writes this
function generateCSV(data) {
let csv = "id,name,email\n";
for (const row of data) {
csv += `${row.id},${row.name},${row.email}\n`; // Creates new string each time!
}
return csv;
}
// Better approach
function generateCSV(data) {
const header = "id,name,email";
const rows = data.map(row => `${row.id},${row.name},${row.email}`);
return [header, ...rows].join("\n");
}
// Production-ready streaming approach
const { Transform } = require("stream");
class CSVTransform extends Transform {
constructor(options) {
super(options);
this.isFirstChunk = true;
}
_transform(chunk, encoding, callback) {
if (this.isFirstChunk) {
this.push("id,name,email\n");
this.isFirstChunk = false;
}
const row = JSON.parse(chunk);
const escaped = [row.id, this.escapeCSV(row.name), this.escapeCSV(row.email)];
this.push(escaped.join(",") + "\n");
callback();
}
escapeCSV(field) {
if (field == null) return "";
const str = String(field);
if (str.includes('"') || str.includes(",") || str.includes("\n")) {
return `"${str.replace(/"/g, '""')}"`;
}
return str;
}
}
// Usage with streaming
async function generateLargeCSV(dataStream, outputStream) {
return dataStream.pipe(new CSVTransform()).pipe(outputStream);
}
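A quick usage sketch with made-up rows: the transform calls `JSON.parse` on each incoming chunk, so the simplest way to feed it is one stringified record per chunk.

```typescript
const { Readable } = require("stream");
const fs = require("fs");

// Hypothetical data; in practice this would come from a DB cursor or another stream
const rows = [
  { id: 1, name: "Ada Lovelace", email: "ada@example.com" },
  { id: 2, name: 'Grace "Amazing" Hopper', email: "grace@example.com" }, // exercises escapeCSV
];

// One JSON string per chunk, so each _transform call sees a complete record
const source = Readable.from(rows.map(r => JSON.stringify(r)));
generateLargeCSV(source, fs.createWriteStream("users.csv"));
```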
## 4: Chatty APIs

LLMs often design REST APIs that require multiple round trips when a single request would suffice.
// LLM often suggests this pattern
app.get("/api/user/:id", async (req, res) => {
const user = await getUser(req.params.id);
res.json(user);
});
app.get("/api/user/:id/posts", async (req, res) => {
const posts = await getUserPosts(req.params.id);
res.json(posts);
});
app.get("/api/user/:id/followers", async (req, res) => {
const followers = await getUserFollowers(req.params.id);
res.json(followers);
});
// Client needs 3 requests to show a profile page!
// Better approach with field selection
app.get("/api/user/:id", async (req, res) => {
const fields = req.query.fields?.split(",") || ["basic"];
const userId = req.params.id;
const result: any = {};
if (fields.includes("basic") || fields.includes("*")) {
result.user = await getUser(userId);
}
if (fields.includes("posts") || fields.includes("*")) {
result.posts = await getUserPosts(userId);
}
if (fields.includes("followers") || fields.includes("*")) {
result.followers = await getUserFollowers(userId);
}
res.json(result);
});
// Client can request: /api/user/123?fields=basic,posts,followers
// Optimal approach with dataloader pattern
const DataLoader = require("dataloader");
// Batch and cache user loads
const userLoader = new DataLoader(async (userIds: number[]) => {
const query = `
SELECT
u.*,
array_agg(DISTINCT p.*) FILTER (WHERE p.id IS NOT NULL) as posts,
array_agg(DISTINCT f.*) FILTER (WHERE f.id IS NOT NULL) as followers
FROM users u
LEFT JOIN posts p ON p.user_id = u.id
LEFT JOIN followers f ON f.following_id = u.id
WHERE u.id = ANY($1::int[])
GROUP BY u.id
`;
const result = await db.query(query, [userIds]);
const userMap = new Map(result.rows.map(r => [r.id, r]));
return userIds.map(id => userMap.get(id));
});
app.get("/api/user/:id", async (req, res) => {
const userId = parseInt(req.params.id);
const cached = await redis.get(`user:${userId}`);
if (cached) {
return res.json(JSON.parse(cached));
}
const userData = await userLoader.load(userId);
// Cache for 5 minutes
await redis.setex(`user:${userId}`, 300, JSON.stringify(userData));
res.json(userData);
});
## 5: Loading Entire Files into Memory

LLMs often load entire datasets into memory when streaming would be more appropriate.
// LLM typically suggests this
async function processLargeFile(filePath) {
const data = fs.readFileSync(filePath, "utf-8"); // Loads entire file!
const lines = data.split("\n");
const results = lines.map(line => {
const parsed = JSON.parse(line);
return {
id: parsed.id,
total: parsed.items.reduce((sum, item) => sum + item.price, 0),
};
});
return results;
}
// Better approach with readline
const fs = require("fs");
const readline = require("readline");
async function processLargeFile(filePath) {
const results = [];
const rl = readline.createInterface({
input: fs.createReadStream(filePath),
crlfDelay: Infinity,
});
for await (const line of rl) {
if (line.trim()) {
const parsed = JSON.parse(line);
results.push({
id: parsed.id,
total: parsed.items.reduce((sum, item) => sum + item.price, 0),
});
}
}
return results;
}
For truly huge inputs, the optimal version goes one step further and moves the parsing onto worker threads:

1. Lazily parse lines.
2. Distribute chunks across a fixed worker pool via `worker_threads`.
3. `JSON.stringify` each transformed chunk to the output stream.
4. Gracefully terminate workers in `_final()`.
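Here's a rough sketch of that outline. It assumes the stream runs in object mode, that each chunk written to it is an array of JSON lines, and it reuses the id/total computation from the example above; output order is not guaranteed.

```typescript
import { Transform, TransformCallback } from "stream";
import { Worker } from "worker_threads";

// Worker body inlined via eval for brevity; a separate file is more typical.
const WORKER_SOURCE = `
const { parentPort } = require("worker_threads");
parentPort.on("message", (lines) => {
  const out = lines.map((line) => {
    const parsed = JSON.parse(line);
    return { id: parsed.id, total: parsed.items.reduce((s, i) => s + i.price, 0) };
  });
  parentPort.postMessage(out);
});
`;

class WorkerPoolTransform extends Transform {
  private pool: { worker: Worker; queue: Array<(results: any[]) => void> }[];
  private next = 0;
  private pending: Promise<void>[] = [];

  constructor(poolSize = 4) {
    super({ objectMode: true });
    this.pool = Array.from({ length: poolSize }, () => {
      const worker = new Worker(WORKER_SOURCE, { eval: true });
      const queue: Array<(results: any[]) => void> = [];
      // One listener per worker; replies come back in the order batches were sent
      worker.on("message", results => queue.shift()?.(results));
      return { worker, queue };
    });
  }

  _transform(batch: string[], _enc: string, callback: TransformCallback) {
    // Step 2: round-robin each batch across the fixed pool
    const slot = this.pool[this.next++ % this.pool.length];
    this.pending.push(
      new Promise<void>(resolve => {
        slot.queue.push(results => {
          // Step 3: stringify each transformed record onto the output stream
          for (const r of results) this.push(JSON.stringify(r) + "\n");
          resolve();
        });
      })
    );
    slot.worker.postMessage(batch);
    callback(); // accept the next batch immediately; output order is not guaranteed
  }

  _final(callback: TransformCallback) {
    // Step 4: wait for in-flight batches, then shut the workers down gracefully
    Promise.all(this.pending)
      .then(() => Promise.all(this.pool.map(s => s.worker.terminate())))
      .then(() => callback(), callback);
  }
}
```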
## 6: Type-Safe Data Shapes
LLMs often gloss over TypeScript's power, falling back to `any` and leaving you to discover type errors at runtime. A better pattern is to **derive runtime validation from compile-time types**.
```typescript
import { z } from 'zod';
// Compile-time shape
interface User {
id: number;
name: string;
email: string;
}
type UserMap = Record<User['id'], User>; // O(1) access by id
// Runtime validator derived from the same shape
const UserSchema = z.object({
id: z.number().int(),
name: z.string(),
email: z.string().email()
});
// Safer lookup helper
function buildUserMap(users: User[]): UserMap {
users.forEach(u => UserSchema.parse(u));
return Object.fromEntries(users.map(u => [u.id, u]));
}
```
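For example, assuming a hypothetical `/api/users` endpoint, you validate the untrusted payload once at the boundary and work with the typed map everywhere else:

```typescript
// Hypothetical endpoint; any unknown payload works the same way
async function loadUsers(): Promise<UserMap> {
  const payload: unknown = await fetch("/api/users").then(r => r.json());
  const users = z.array(UserSchema).parse(payload); // throws on malformed data
  return buildUserMap(users);
}
```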
## 7: Missing Database Indexes

Even when an LLM generates the right JOIN, it rarely suggests the index that makes it fast.
-- Before: 1.2s sequential scan
EXPLAIN ANALYZE SELECT * FROM posts WHERE user_id = 42;
-- Fix: composite index
CREATE INDEX CONCURRENTLY idx_posts_user_created ON posts(user_id, created_at DESC);
-- After: 3ms using idx_posts_user_created
Key takeaways:

- Run `EXPLAIN` on generated SQL.
- Index your `WHERE` + `ORDER BY` patterns.

## 8: Filtering in Memory Instead of Indexing

LLMs love to load entire datasets into memory and then filter them at request time. This works fine for toy examples but becomes a performance disaster as data grows.
// LLM typically generates this
async function findBookmarksByTag(tag: string) {
const allBookmarks = await getAllBookmarksFromSource(); // Loads 10,000+ items
return allBookmarks.filter(bookmark => {
return bookmark.tags.some(t => t.toLowerCase() === tag.toLowerCase());
}); // O(n) scan for every request
}
// Better approach with tag index
type TagIndex = Map<string, Bookmark[]>;
let tagIndex: TagIndex;
async function buildTagIndex() {
const allBookmarks = await getAllBookmarksFromSource();
const index = new Map<string, Bookmark[]>();
for (const bookmark of allBookmarks) {
for (const tag of bookmark.tags) {
const normalizedTag = tag.toLowerCase();
if (!index.has(normalizedTag)) {
index.set(normalizedTag, []);
}
index.get(normalizedTag)!.push(bookmark);
}
}
tagIndex = index;
}
// O(1) lookup at request time
async function findBookmarksByTag(tag: string): Promise<Bookmark[]> {
if (!tagIndex) await buildTagIndex();
return tagIndex.get(tag.toLowerCase()) || [];
}
// Optimal approach with search index
import lunr from "lunr";
const searchIndex = lunr(function () {
this.ref("id");
this.field("title");
this.field("tags");
allBookmarks.forEach(doc => {
this.add({
...doc,
tags: doc.tags.join(" "),
});
});
});
const bookmarkMap = new Map(allBookmarks.map(b => [String(b.id), b])); // lunr refs are strings
async function findBookmarksByTag(tag: string): Promise<Bookmark[]> {
const results = searchIndex.search(`tags:${tag}`);
return results.map(result => bookmarkMap.get(result.ref)!);
}
## 9: Caches Without Eviction

LLMs understand caching in principle but often implement it as a simple `Map` with no eviction policy, leading to unbounded memory growth.
// LLM often generates this
const cache = new Map();
function getExpensiveData(id: string) {
if (cache.has(id)) {
return cache.get(id);
}
const data = computeExpensiveData(id);
cache.set(id, data); // Never evicts, grows forever!
return data;
}
// Better approach with size limit
const cache = new Map();
const MAX_CACHE_SIZE = 1000;
function getExpensiveData(id: string) {
if (cache.has(id)) {
return cache.get(id);
}
if (cache.size >= MAX_CACHE_SIZE) {
const oldestKey = cache.keys().next().value;
cache.delete(oldestKey); // FIFO eviction
}
const data = computeExpensiveData(id);
cache.set(id, data);
return data;
}
// Optimal approach with LRU eviction
import { LRUCache } from "lru-cache";
const cache = new LRUCache({
max: 500, // Max number of items
ttl: 1000 * 60 * 5, // 5 minute TTL
});
function getExpensiveData(id: string) {
const cachedValue = cache.get(id);
if (cachedValue) {
return cachedValue;
}
const data = computeExpensiveData(id);
cache.set(id, data);
return data;
}
An LRU cache evicts the item that hasn't been accessed for the longest time. This is perfect for caching user sessions, configuration data, or any resource where recent access predicts future access. A simple bounded map with FIFO eviction might discard a popular, frequently-accessed item just because it was one of the first to be added.
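If you want to see the mechanics without a library, here's a minimal sketch that leans on the fact that JavaScript Maps iterate in insertion order:

```typescript
// Minimal LRU on a plain Map: re-inserting on every read keeps hot keys at the
// back of the iteration order, so the front is always the least recently used.
class SimpleLRU<K, V> {
  private map = new Map<K, V>();
  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key); // refresh recency: move the key to the back
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    else if (this.map.size >= this.maxSize) {
      // Evict the least recently used key (the first one in iteration order)
      this.map.delete(this.map.keys().next().value!);
    }
    this.map.set(key, value);
  }
}
```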
After months of iteration, here's my prompt template that consistently produces more efficient code:
Write a function to [task description].
Performance requirements:
- Expected data size: [e.g., "10k-100k items"]
- Access pattern: [e.g., "frequent lookups by ID"]
- Constraints: [e.g., "must handle streaming data", "limited memory"]
Optimize for [time complexity/space complexity/both].
Avoid O(n²) algorithms and prefer O(n log n) or better.
Use appropriate data structures (Set for lookups, Map for key-value, etc).
Bad Prompt: "Write a function to find common elements in two arrays"
LLM Output: Almost always produces nested loops with O(n*m) complexity
Good Prompt: "Write a function to find common elements in two arrays. Arrays may contain 10k-100k items. Optimize for time complexity. Use appropriate data structures for efficient lookups."
LLM Output: Usually produces a Set-based solution with O(n+m) complexity
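For reference, that Set-based answer typically looks something like this:

```typescript
// Build a Set from one array (O(n)), then filter the other with O(1) lookups (O(m))
function findCommonElements<T>(a: T[], b: T[]): T[] {
  const lookup = new Set(a);
  // Wrap the result in [...new Set(...)] if duplicates in `b` should be removed
  return b.filter(item => lookup.has(item));
}
```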
Based on A Common-Sense Guide to Data Structures and Algorithms, here's when to use what:
| Operation | Wrong Choice | Right Choice | Time Complexity |
|---|---|---|---|
| Lookup by value | Array | Set/Map | O(n) → O(1) |
| Finding duplicates | Nested loops | Set | O(n²) → O(n) |
| Priority queue | Sorted array | Heap | O(n) → O(log n) |
| Range queries | Hash table | B-tree/Skip list | O(n) → O(log n) |
| LRU Cache | Array | LinkedHashMap | O(n) → O(1) |
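The heap row is the one not shown elsewhere in this post, so here's a minimal sketch of a binary min-heap used to keep the top K values in O(n log k) instead of fully sorting:

```typescript
// Minimal binary min-heap over numbers (array-backed)
class MinHeap {
  private heap: number[] = [];
  get size() { return this.heap.length; }
  peek(): number | undefined { return this.heap[0]; }

  push(value: number): void {
    this.heap.push(value);
    let i = this.heap.length - 1;
    while (i > 0) { // sift up until the parent is no larger
      const parent = (i - 1) >> 1;
      if (this.heap[parent] <= this.heap[i]) break;
      [this.heap[parent], this.heap[i]] = [this.heap[i], this.heap[parent]];
      i = parent;
    }
  }

  pop(): number | undefined {
    const top = this.heap[0];
    const last = this.heap.pop();
    if (this.heap.length > 0 && last !== undefined) {
      this.heap[0] = last;
      let i = 0;
      while (true) { // sift down toward the smaller child
        const left = 2 * i + 1, right = 2 * i + 2;
        let smallest = i;
        if (left < this.heap.length && this.heap[left] < this.heap[smallest]) smallest = left;
        if (right < this.heap.length && this.heap[right] < this.heap[smallest]) smallest = right;
        if (smallest === i) break;
        [this.heap[smallest], this.heap[i]] = [this.heap[i], this.heap[smallest]];
        i = smallest;
      }
    }
    return top;
  }
}

// Keep the K largest values in O(n log k) by evicting the current minimum
function topK(values: number[], k: number): number[] {
  const heap = new MinHeap();
  for (const v of values) {
    heap.push(v);
    if (heap.size > k) heap.pop();
  }
  const out: number[] = [];
  while (heap.size > 0) out.push(heap.pop()!); // ascending order
  return out;
}
```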
After analyzing real codebases, the key insight is simple: most performance problems are solved by choosing the right data structure for your access patterns, not by micro-optimizing algorithms.
- Did I miss any critical data-structure pitfalls?
- Are there scenarios where Sets/Maps are not the right fix?
- Which concurrency patterns trip up LLMs most often?
LLMs are incredible tools for generating code quickly, but they inherit the biases of their training data—which is full of tutorial-level, unoptimized code. By understanding common performance pitfalls and being specific about your performance requirements in prompts, you can get LLMs to generate code that's not just correct, but actually efficient.
Remember: The difference between O(n) and O(n²) might not matter at n=100, but at n=100,000, it's the difference between 100ms and 3 hours.
Next time you're using an LLM for code generation, take an extra moment to specify your performance requirements. Your future self (and your AWS bill) will thank you.