
Importing recipes from files

My love for cooking comes from my grandfather. He was obsessed with desserts, and I "helped" him every chance I got. Along with the hobby, I inherited his collection of cookbooks, scribbled notes, and handwritten recipe cards.

For years I've wanted to digitize all his recipes so I can pass them to my kids. But typing them out by hand is both tedious and error-prone.

Why not just take a photo and import them? There's a catch: dealing with file uploads from users requires careful consideration.

Why file uploads are risky

Uploading a binary file is effectively asking your server to run a complex parser on untrusted input. Image formats are not simple blobs; they are structured programs that describe how memory should be allocated, how bytes should be interpreted, and which decoding paths should be taken. When that interpretation happens in native code, bugs translate directly into memory corruption and, in the worst case, remote code execution (for example, ImageTragick in ImageMagick, CVE-2016-3714).

History backs this up. ImageMagick has suffered dozens of critical vulnerabilities over the years. PDFs can embed active content. Compressed archives can expand catastrophically. Any file that is parsed and not just stored is an attack vector.

Beyond security, there's also:

  • Resource exhaustion. Someone uploads 25MB files in a loop. Your server starts swapping, response times degrade, other users suffer.
  • Storage growth. Users abandon uploads constantly. Each orphaned file sits in your storage indefinitely, quietly costing money.
  • Abuse vectors. Bad actors will find creative ways to use your upload endpoint for things you never intended.

This is a small project. I am the security team. Whatever solution I build needs to be safe by default and require minimal ongoing maintenance.

The naive approach

The obvious way to handle file uploads:

┌─────────┐   file bytes   ┌─────────┐   file bytes   ┌─────┐
│ Browser │ ─────────────▶ │   API   │ ─────────────▶ │ S3  │
└─────────┘                └─────────┘                └─────┘
                                │
                                │ parse, validate,
                                │ resize, scan...
                                ▼
                           ┌─────────┐
                           │   RCE   │
                           └─────────┘

Every byte flows through your server. You parse it, validate it, maybe resize images or scan for viruses. Each of those steps is an opportunity for something to go wrong. One malformed file and suddenly someone's running code on your server.

Design goals

Before diving into the solution, let me be clear about the goals:

  1. Files never touch my servers during upload. If malicious bytes never hit our infrastructure, we can't be exploited by them.
  2. Abandoned files clean themselves up. No cron jobs checking for orphans. No manual intervention. The system should be self-healing.
  3. Abuse is structurally difficult. Not just rate limits that can be worked around, but architectural decisions that make abuse impractical.
  4. Nothing to maintain. Once deployed, this should run indefinitely without intervention.

It turns out S3 and some cryptographic signing can achieve most of this.

The solution

Here's the approach in a nutshell:

  1. Browser uploads directly to S3 using presigned URLs. Our API never sees the bytes.
  2. A cryptographically signed token references the upload. No database entries to maintain.
  3. Abandoned files auto-delete after 24 hours. S3 lifecycle rules, zero code.
  4. Processing happens in isolated Lambdas. If something goes wrong, it's contained.

Let me explain how each piece works.

The upload

The key insight: what if file bytes never touched our servers at all?

S3 presigned URLs let you grant temporary, scoped access to upload a specific file. The browser uploads directly to S3. Our API just generates the permission slip and never sees the actual bytes.

We use presigned POST rather than PUT because POST lets us set strict conditions:

  • Declared content type. The upload must claim a specific MIME type.
  • Exact file size match. Can't claim 1KB then upload 25MB.
  • 5-minute expiry. Can't stockpile upload URLs.

S3 doesn't verify actual file contents, only headers. Someone could claim image/jpeg while uploading something else. We use the declared MIME type for routing (images go to vision API, PDFs go to document processing, etc.), not for security. If someone uploads garbage with a fake MIME type, it just gets sent to the wrong parser which fails gracefully.

The file size requirement is the important one. The user tells us how big their file is upfront. We generate a presigned URL that only accepts that exact size. Lie about it? S3 rejects the upload. No bytes ever reach us to verify.
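Here's roughly what generating that permission slip could look like. This is a minimal sketch in Python with boto3; the bucket name, key scheme, and function name are illustrative, not the actual implementation:

import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "recipes-uploads"  # hypothetical bucket name


def create_upload_slot(user_id: str, content_type: str, file_size: int) -> dict:
    """Generate a presigned POST that only accepts this exact file."""
    key = f"uploads/{user_id}/{uuid.uuid4().hex}"
    return s3.generate_presigned_post(
        Bucket=BUCKET,
        Key=key,
        Fields={"Content-Type": content_type},
        Conditions=[
            {"Content-Type": content_type},                  # declared MIME type must match
            ["content-length-range", file_size, file_size],  # exact size: no more, no less
        ],
        ExpiresIn=300,  # 5 minutes
    )

The response contains the POST URL plus the form fields (policy, signature, key) that the browser has to echo back verbatim; any upload that violates the conditions is rejected by S3 itself.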

┌──────────┐   1. "I want to upload      ┌──────────┐
│ Browser  │      a 3MB JPEG"            │   API    │
│          │ ──────────────────────────▶ │          │
└──────────┘                             └──────────┘
     │                                        │
     │                                        │ 2. validate user,
     │                                        │    generate presigned URL
     │                                        │    (3MB, image/jpeg, 5min)
     │                                        ▼
     │           3. presigned URL        ┌──────────┐
     │ ◀──────────────────────────────── │   API    │
     │                                   └──────────┘
     │
     │ 4. upload directly to S3
     │    (3MB of bytes)
     ▼
┌─────────────────┐
│       S3        │
│   uploads/xxx   │
└─────────────────┘
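In the browser this is just a multipart form POST. To exercise the same flow from a script or test, the shape is roughly this (again a sketch, using the requests library):

import requests


def upload_with_presigned_post(presigned: dict, path: str) -> None:
    """POST the file straight to S3 using the url + fields returned by the API."""
    with open(path, "rb") as fh:
        response = requests.post(
            presigned["url"],
            data=presigned["fields"],  # policy, signature, key, Content-Type...
            files={"file": fh},        # the bytes themselves, sent as the "file" field
        )
    response.raise_for_status()        # S3 answers 204 No Content on success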

Referencing the uploaded file

Once someone has a presigned URL, they can upload to that S3 key. But how do they reference it later? I could create a database entry with a unique ID, but that means state to create, query, and eventually clean up. Instead, I use a cryptographically signed token:

┌─────────────────────────────────────────────────────┐
│                    SIGNED TOKEN                     │
├─────────────────────────────────────────────────────┤
│ {                                                   │
│   "key": "uploads/usr_xxx/abc123.jpg",              │
│   "metadata": {                                     │
│     "filename": "grandma_recipe.jpg"                │
│   }                                                 │
│ }                                                   │
├─────────────────────────────────────────────────────┤
│ signature: HMAC-SHA256(payload, SECRET_KEY)         │
│ expires: 10 minutes                                 │
└─────────────────────────────────────────────────────┘

The token is minimal: just the S3 key and metadata (like the original filename for display). We don't need content_type or file_size since S3 already enforces those during upload. At validation time, a HEAD request tells us whether the file exists and what its actual properties are.

This token travels with the presigned URL. When the user triggers extraction, they send back the token. We verify the signature, confirm the file exists in S3, and proceed. Any tampering breaks the signature. Old tokens expire. There's no way to claim files you didn't upload.
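A sketch of how minting and verifying such a token might look, using a plain HMAC over a small JSON payload (the helper names and exact token layout here are illustrative):

import base64
import hashlib
import hmac
import json
import time

SECRET_KEY = b"change-me"  # hypothetical; loaded from configuration in practice
TOKEN_TTL = 600            # 10 minutes


def mint_token(key: str, filename: str) -> str:
    """Sign the S3 key + display metadata; no database row needed."""
    payload = json.dumps({
        "key": key,
        "metadata": {"filename": filename},
        "exp": int(time.time()) + TOKEN_TTL,
    }).encode()
    sig = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return b".".join(map(base64.urlsafe_b64encode, (payload, sig))).decode()


def verify_token(token: str) -> dict:
    """Reject anything tampered with or older than the TTL."""
    payload_b64, sig_b64 = token.encode().split(b".")
    payload = base64.urlsafe_b64decode(payload_b64)
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.urlsafe_b64decode(sig_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(payload)
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims

When the user triggers extraction, verify_token runs first; a head_object call on the key then confirms the file actually landed in S3 and reports its real size and content type.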

Self-cleaning storage

Abandoned files are inevitable. Users close tabs, lose connection, change their mind. Without cleanup, storage costs grow indefinitely.

The solution requires zero code: S3 lifecycle rules.

S3 BUCKET
┌───────────────────────────────────────────────┐
│                                               │
│  uploads/                                     │
│  ├── usr_abc/file1.jpg ─┐                     │
│  ├── usr_def/file2.pdf  │  ⏰ 24h lifetime    │
│  └── usr_ghi/file3.png ─┘     auto-delete     │
│                                               │
│  profiles/                                    │
│  └── usr_abc/                                 │
│      └── extraction/                          │
│          └── ext_123/                         │
│              └── file.jpg   ← permanent       │
│                                               │
└───────────────────────────────────────────────┘

LIFECYCLE RULE:
┌─────────────────────────────────────────┐
│ prefix: uploads/                        │
│ action: delete after 1 day              │
└─────────────────────────────────────────┘

All uploads go to an uploads/ prefix. S3 automatically deletes anything in that prefix after 24 hours. When extraction actually happens, we move the file to its permanent location. Successful uploads get preserved, abandoned ones vanish automatically.
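The lifecycle rule is a one-time bucket setting, and the "move to its permanent location" step is a copy followed by a delete. A boto3 sketch with illustrative names:

import boto3

s3 = boto3.client("s3")
BUCKET = "recipes-uploads"  # hypothetical


def configure_lifecycle() -> None:
    """Set once at deploy time: S3 deletes anything under uploads/ after one day."""
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-abandoned-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": "uploads/"},
                "Expiration": {"Days": 1},
            }]
        },
    )


def promote(upload_key: str, permanent_key: str) -> None:
    """On successful extraction, move the file out of the auto-expiring prefix."""
    s3.copy_object(
        Bucket=BUCKET,
        Key=permanent_key,  # e.g. a key under the permanent profiles/ prefix
        CopySource={"Bucket": BUCKET, "Key": upload_key},
    )
    s3.delete_object(Bucket=BUCKET, Key=upload_key)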

No cron jobs. No database queries looking for orphans. No maintenance.

Rate limiting

Traditional rate limiting (X requests per minute) can be worked around with patience or multiple accounts. I wanted something more structural.

Before generating a presigned URL, we check how many pending uploads the user has in S3. More than 16 files sitting in their uploads folder? We reject new requests. They can wait for the 24-hour cleanup or complete some extractions.

There's a subtle issue: what if someone requests 16 presigned URLs without actually uploading? The count check happens before the upload, so they could stockpile URLs. The fix: when we generate a presigned URL, we immediately "touch" the key with an empty (0-byte) file. This consumes their quota instantly. When they actually upload, their file replaces the placeholder. When they trigger extraction, we check the file isn't empty. If it's still 0 bytes, we reject it.
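A sketch of that check-and-touch sequence (the limit of 16 is from above; bucket and helper names are illustrative):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "recipes-uploads"  # hypothetical
MAX_PENDING = 16


def pending_uploads(user_id: str) -> int:
    """Count whatever is currently sitting in the user's uploads/ prefix."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"uploads/{user_id}/")
    return resp.get("KeyCount", 0)


def reserve_slot(user_id: str, key: str) -> None:
    """Consume quota immediately by touching the key with a 0-byte placeholder."""
    if pending_uploads(user_id) >= MAX_PENDING:
        raise RuntimeError("too many pending uploads")
    s3.put_object(Bucket=BUCKET, Key=key, Body=b"")


def ready_for_extraction(key: str) -> bool:
    """A placeholder that was never replaced by a real upload is still 0 bytes."""
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
    except ClientError:
        return False
    return head["ContentLength"] > 0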

This creates natural backpressure. You can't accumulate unlimited upload URLs. You can't use us as free file hosting because files expire. The architecture itself resists abuse.

The processing

We use OpenAI's vision API to extract recipes from images. That means uploading files to OpenAI's storage. Even though we're not parsing files ourselves, we'd still be loading file bytes into our API process to forward them to OpenAI. That's memory pressure and potential attack surface we don't need.

The solution: an isolated Lambda function that handles the OpenAI upload. Our API never touches the file bytes at all.

API                      Lambda                      S3                  OpenAI
 │                         │                          │                     │
 │ generate presigned URL  │                          │                     │
 │─────────────────────────┼─────────────────────────▶│                     │
 │                         │                          │                     │
 │ invoke(url, filename)   │                          │                     │
 │────────────────────────▶│                          │                     │
 │                         │                          │                     │
 │                         │ GET presigned URL        │                     │
 │                         │─────────────────────────▶│                     │
 │                         │                          │                     │
 │                         │ stream bytes             │                     │
 │                         │◀─────────────────────────│                     │
 │                         │                          │                     │
 │                         │ upload to OpenAI         │                     │
 │                         │──────────────────────────┼────────────────────▶│
 │                         │                          │                     │
 │                         │ file_id                  │                     │
 │                         │◀─────────────────────────┼─────────────────────│
 │                         │                          │                     │
 │ {openai_file_id: ...}   │                          │                     │
 │◀────────────────────────│                          │                     │

The Lambda is minimal. Written in Go for fast cold starts (~10ms), it streams bytes directly from S3 to OpenAI without buffering the entire file. Critically, it runs with minimal permissions:

  • No AWS IAM policy (receives presigned URLs instead)
  • OpenAI API key scoped to files:write only
  • Ephemeral execution environment

If the Lambda were somehow compromised through a malicious file, the blast radius is minimal. The attacker could upload files to our OpenAI account. That's it. No database access, no user data, no AWS credentials.
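For illustration, here's the rough shape of that handler. This is a simplified Python sketch to match the other snippets; the actual function is Go and streams the body instead of buffering it, and the OpenAI purpose value shown is an assumption:

import os

import requests

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # key scoped to files:write


def handler(event, context):
    """Fetch the object via its presigned URL and forward it to OpenAI's Files API."""
    presigned_url = event["presigned_url"]
    filename = event["filename"]

    # No AWS credentials needed here: the presigned URL is the only capability we hold.
    source = requests.get(presigned_url, timeout=30)
    source.raise_for_status()

    resp = requests.post(
        "https://api.openai.com/v1/files",
        headers={"Authorization": f"Bearer {OPENAI_API_KEY}"},
        files={"file": (filename, source.content)},
        data={"purpose": "vision"},  # assumed purpose; depends on how the file is consumed
        timeout=60,
    )
    resp.raise_for_status()
    return {"openai_file_id": resp.json()["id"]}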

What's next

This design eliminates several concerns: no file bytes touch our API servers, no file parsing near our infrastructure, no orphan detection jobs, no unbounded storage growth. The system is simple and self-maintaining.

It also sets us up for future features:

Audio import. Same presigned upload flow, just with MP3 files and a transcription step before extraction. Recording someone as they explain a recipe, then getting it transcribed and structured automatically.

Document processing. Word docs or Excel spreadsheets. Each format gets its own isolated Lambda for processing. DOCX files are ZIP archives containing XML, and XML parsers have their own history of vulnerabilities (XXE attacks, billion laughs). Better to handle that in an ephemeral container that can't touch anything important.

Smarter cropping. Select the recipe you want from a multi-recipe photo. Pick the right page from a long PDF. The upload infrastructure stays the same; we just add more sophisticated extraction.

The hard part was getting the secure upload pipeline right. Each new file type just plugs into the same pattern: presigned uploads, signed tokens, self-cleaning storage, isolated processing.


Last week, I photographed one of my grandfather's handwritten recipe cards, his chuchitos (that's what we call profiteroles). Uploaded it, watched the extraction run, and there it was, ready to cook!

Author: Jorge Bastida
Published: December 21, 2024
RFD: #0006

If you'd like to discuss this RFD, share your thoughts, or simply chat about it, feel free to reach out to me. To stay up to date with the project's development, you can follow me on X.