The recipe is right there. You can see it. Two cups of flour, a teaspoon of baking powder, the whole thing. But between you and those instructions sit 30 cookies, 4 banners, 2 newsletter popups, a 2000-word essay about someone's trip to Tuscany, and enough ads to fund a small nation. You want to save this recipe; it actually looks good, and you'd like to make it next week... but saving a bookmark means saving all that bloat too. And who knows if that page will even exist next year?
Recipes have authors. Someone developed that technique, tested those proportions, wrote those instructions. If we extract a recipe, we want to preserve that connection.
Every imported recipe should include the original author and a link back to the source. This serves two purposes: credit to the creator, and a reference for the user to revisit the original whenever they want.
# Chocolate and cherry cake
- Author: Mary Berry
- Source: [bbc.co.uk](https://www.bbc.co.uk/food/recipes/fabulous_chocolate_and_74789)
- 375g dark chocolate
- 375g butter
- 6 eggs
...
Yes, the recipe will live in your library. It is yours to edit and adapt. But its origins travel with it. If you want to see Mary's original photos, read her tips, or check something we might have missed during the import, the link will be right there.
We're mindful that imported recipes exist in a space that requires care. Fork.club is a private-by-default platform: your library is yours, not a public repository.
We've made deliberate choices about how and when recipes can be shared to ensure attribution is respected and the system isn't abused. More on that in RFD #0008: Sharing recipes.
In RFD #0006, I described how we handle file uploads: presigned URLs, self-cleaning storage, isolated processing. The insight that shaped this system was treating uploads as inherently dangerous payloads that should never touch our servers directly.
URLs present a similar challenge, but inverted. Instead of users uploading files to us, we're downloading files from arbitrary locations on the internet. The same principles apply: we can't trust what arrives, we need to process it carefully, and we should be prepared for anything.
Seems obvious, but the key realization was this: a URL is just a file that hasn't been fetched yet. Once you think about it that way, the architecture becomes clear. We fetch the content, store it as an extraction file with metadata about its origin, and process it through the same pipeline as any other file. Whether the recipe came from your grandmother's handwritten card or from a food blog, by the time it enters our extraction pipeline, it's just content waiting to be understood.
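In code, that framing is almost literal. A minimal sketch, assuming an httpx-based fetcher (the `ExtractionFile` shape and the `ingest_url` name are illustrative, not our actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

import httpx


@dataclass
class ExtractionFile:
    """A fetched payload, queued into the same pipeline as uploaded files."""
    content: bytes
    content_type: str
    source_url: str
    fetched_at: datetime


def ingest_url(url: str) -> ExtractionFile:
    # Untrusted content from an arbitrary location: fetch with a strict
    # timeout, follow redirects, and fail loudly on anything but a 2xx.
    response = httpx.get(url, follow_redirects=True, timeout=10.0)
    response.raise_for_status()
    return ExtractionFile(
        content=response.content,
        content_type=response.headers.get("content-type", "text/html"),
        source_url=url,
        fetched_at=datetime.now(timezone.utc),
    )
```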
Not all websites are created equal. Some recipe sites have been built with care, embedding structured data that makes extraction trivial. Others are hostile environments: walls of ads, popups, and SEO-optimized prose burying the actual recipe.
Rather than picking a single extraction strategy, we run multiple approaches in parallel and let users choose the best result. Each method has different strengths, and sometimes one captures details that another misses.
URL arrives
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Run in parallel: │
│ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ 1. JSON-LD │ │ 2. recipe-scrapers │ │
│ │ Schema.org data │ │ DOM patterns │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Both failed? │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ 3. HTML Digestion + LLM │ │
│ │ Find recipe content via LCA, send to language model │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ User reviews available extractions and picks the best one │
└─────────────────────────────────────────────────────────────────┘
When both JSON-LD and recipe-scrapers succeed, we present both options. The JSON-LD extraction might have better metadata (cook time, yield, cuisine), while recipe-scrapers might format the instructions more cleanly. Users can compare and choose, or we can suggest the one that looks most complete.
The expensive LLM call only happens when the first two methods fail entirely. This keeps most extractions fast and cheap.
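Here's a sketch of that orchestration, assuming each strategy is an async function (`extract_json_ld`, `extract_recipe_scrapers`, `extract_with_llm`, and `digest_html` are hypothetical names standing in for the pieces described below):

```python
import asyncio


async def extract_all(html: str, url: str) -> list[dict]:
    # Run the two cheap strategies concurrently; either may fail
    # or return None when the page gives it nothing to work with.
    results = await asyncio.gather(
        extract_json_ld(html),
        extract_recipe_scrapers(html, url),
        return_exceptions=True,
    )
    candidates = [r for r in results if r and not isinstance(r, Exception)]

    # The LLM call is a fallback, never the default path.
    if not candidates:
        candidates.append(await extract_with_llm(digest_html(html)))

    return candidates  # presented to the user, who picks the best one
```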
The best recipe websites include machine-readable structured data using the schema.org Recipe vocabulary. This data is embedded in the page as JSON-LD, typically in a <script> tag that browsers ignore but crawlers love.
{
"@type": "Recipe",
"name": "Classic Chocolate Chip Cookies",
"recipeIngredient": [
"2 1/4 cups all-purpose flour",
"1 tsp baking soda",
"1 cup butter, softened"
],
"recipeInstructions": [...]
}
When we find JSON-LD Recipe data, extraction is straightforward: the title, ingredients, instructions, yield, cuisine, and author are all explicitly labeled. No guessing, no heuristics, no AI. Just parse the JSON and map it to our format.
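The parsing itself is a few lines. A sketch with BeautifulSoup (real pages complicate this with `@graph` wrappers and `@type` arrays, which a production parser has to unwrap):

```python
import json

from bs4 import BeautifulSoup


def find_json_ld_recipe(html: str) -> dict | None:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # malformed JSON-LD blocks are common; skip them
        # A page may embed a single object or a list of objects.
        for node in data if isinstance(data, list) else [data]:
            if isinstance(node, dict) and node.get("@type") == "Recipe":
                return node
    return None
```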
Google incentivizes sites to include this markup through rich search results, so most serious recipe publishers support it. The data tends to be complete and well-structured, though sometimes it diverges slightly from what's actually displayed on the page.
The open-source recipe-scrapers library, maintained by hhursev, provides site-specific parsers for over a thousand popular recipe websites: AllRecipes, Food Network, BBC Good Food, Bon Appetit, Serious Eats, and hundreds of others.
Each scraper knows exactly where that site puts its ingredients, how it formats instructions, and where to find the author. Because it reads directly from the DOM, it captures exactly what users see on the page, including formatting details that might not make it into the JSON-LD.
This approach has an elegant property: it improves over time. When a popular site changes its layout, the community updates the scraper. When a new cooking site gains popularity, someone adds support for it. Our extraction gets better without us writing a single line of code.
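Using the library is refreshingly boring. A sketch, hedged because the API has shifted between versions (recent releases take pre-fetched HTML rather than fetching for you):

```python
from recipe_scrapers import scrape_html


def extract_with_scrapers(html: str, url: str) -> dict | None:
    try:
        # The org_url tells the library which site-specific parser to use.
        scraper = scrape_html(html, org_url=url)
    except Exception:
        return None  # no parser registered for this site
    return {
        "title": scraper.title(),
        "author": scraper.author(),
        "ingredients": scraper.ingredients(),
        "instructions": scraper.instructions(),
        "yields": scraper.yields(),
    }
```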
If at some point I manage to make this project net-positive, I'll try to support hhursev's endeavour.
When both JSON-LD and recipe-scrapers come up empty, we need a different approach. We could just throw the entire HTML at an LLM and ask it to find the recipe, but HTML is verbose, and sending all of that wastes tokens and confuses the model.
Instead, we digest the HTML first: intelligently extracting the recipe-relevant portions before sending anything to the LLM. This is where things get interesting.
Consider what a recipe page actually contains. Somewhere in that HTML is a list of ingredients and a set of instructions. These are surrounded by everything else: headers, footers, sidebars, advertisements, user comments, "you might also like" sections.
The challenge is to find the signal in all that noise. The approach we developed was inspired by this article from Ben Awad, author of Saffron, and works like this:
First, we strip elements that never contain recipe content: <script>, <style>, <nav>, <header>, <footer>, <iframe>, <aside>. This eliminates navigation menus, tracking code, embedded widgets, and footer links in one pass.
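With BeautifulSoup, that first pass is a single loop:

```python
from bs4 import BeautifulSoup

NEVER_RECIPE = ["script", "style", "nav", "header", "footer", "iframe", "aside"]


def strip_non_content(html: str) -> BeautifulSoup:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(NEVER_RECIPE):  # soup(names) is shorthand for find_all
        tag.decompose()  # remove the element and everything inside it
    return soup
```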
Here's the core insight: recipes contain ingredients, and ingredients are drawn from a relatively finite vocabulary. "Flour" is an ingredient. "Subscribe" is not. "Butter" appears in recipes. "Advertisement" does not.
We maintain a dictionary of thousands of common ingredient words across multiple languages. Then we walk the entire DOM tree, scoring each node by counting how many ingredient words appear in its text content.
<div class="sidebar"> score: 0
<p>Subscribe to our newsletter!</p>
<ul class="ingredients-list"> score: 8
<li>2 cups all-purpose flour</li> flour ✓
<li>1 tsp baking soda</li> baking soda ✓
<li>1 cup butter, softened</li> butter ✓
<li>3/4 cup sugar</li> sugar ✓
...
The ingredient list lights up with matches. The sidebar, newsletter signup, and author bio score near zero.
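A sketch of the scoring walk. The dictionary here is truncated to a handful of words, and a real scorer also has to discount huge containers (say, by normalizing against text length) so that `<body>` doesn't trivially outscore the ingredient list:

```python
from bs4 import BeautifulSoup, Tag

# A tiny slice of the real dictionary, which holds thousands of
# ingredient words across multiple languages.
INGREDIENT_WORDS = {"flour", "butter", "sugar", "egg", "salt", "baking soda"}


def score_node(node: Tag) -> int:
    """Count how many distinct ingredient words appear in the node's text."""
    text = node.get_text(" ", strip=True).lower()
    return sum(word in text for word in INGREDIENT_WORDS)


def high_scoring_nodes(soup: BeautifulSoup, threshold: int = 3) -> list[Tag]:
    # find_all(True) visits every element in the tree.
    return [node for node in soup.find_all(True) if score_node(node) >= threshold]
```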
High-scoring nodes tell us where ingredients are, but ingredients alone don't make a recipe. We need the surrounding context: the instructions, the section headings, perhaps some notes. The question becomes: what's the smallest container that holds all the important content?
We use a classic algorithm: find the Lowest Common Ancestor (LCA) of all high-scoring nodes. If the ingredients are scattered across several <li> elements, we find the <ul> that contains them all. If there are multiple sections (ingredients for the sauce, ingredients for the pasta), we find the <div> or <article> that wraps both.
┌─────────────────┐
│ <article> │ ← LCA (this becomes our extract)
│ ┌───────────┐ │
│ │ <h2> │ │
│ │ Sauce │ │
│ └───────────┘ │
│ ┌───────────┐ │
│ │ <ul> │ │ ← high scoring
│ │ tomatoes │ │
│ │ garlic │ │
│ └───────────┘ │
│ ┌───────────┐ │
│ │ <h2> │ │
│ │ Pasta │ │
│ └───────────┘ │
│ ┌───────────┐ │
│ │ <ul> │ │ ← high scoring
│ │ spaghetti │ │
│ │ salt │ │
│ └───────────┘ │
└─────────────────┘
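Finding the LCA of a set of DOM nodes reduces to comparing root-to-node paths. A sketch:

```python
from bs4 import Tag


def ancestor_path(node: Tag) -> list[Tag]:
    """Path from the document root down to the node itself."""
    return list(reversed([node, *(p for p in node.parents if isinstance(p, Tag))]))


def lowest_common_ancestor(nodes: list[Tag]) -> Tag:
    # Walk all root-to-node paths in lockstep; the deepest level where
    # every path agrees is the smallest container holding all of them.
    paths = [ancestor_path(n) for n in nodes]
    lca = paths[0][0]  # the shared document root
    for level in zip(*paths):
        if all(tag is level[0] for tag in level):
            lca = level[0]
        else:
            break
    return lca
```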
Once we have our candidate container, we strip it down further:
Remove all attributes. Classes, IDs, data attributes, event handlers, all gone. The LLM doesn't care that something was class="recipe-ingredient-item-v2". It just needs the content and structure.
Remove empty branches. If a <div> contains nothing but whitespace after we've stripped its contents, delete it. This removes decorative wrappers, empty ad containers, and skeleton elements.
Collapse unnecessary nesting. Five nested <div> elements containing a single <p> becomes one <div> with that <p>. This reduces token count without losing semantic meaning.
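A sketch of those three passes over the candidate container:

```python
from bs4 import Tag


def clean(container: Tag) -> Tag:
    # Pass 1: drop every attribute; the model needs content and structure only.
    for tag in [container, *container.find_all(True)]:
        tag.attrs = {}

    # Pass 2: delete branches with no text, leaves first, so that
    # wrappers emptied by this pass are caught in the same sweep.
    for tag in reversed(container.find_all(True)):
        if not tag.get_text(strip=True):
            tag.decompose()

    # Pass 3: collapse chains of single-child <div>s down to one wrapper,
    # again innermost first so whole chains fold in a single sweep.
    for tag in reversed(container.find_all("div")):
        kids = [c for c in tag.children if isinstance(c, Tag) or str(c).strip()]
        if len(kids) == 1 and isinstance(kids[0], Tag) and kids[0].name == "div":
            kids[0].unwrap()  # promote the grandchildren, drop the wrapper

    return container
```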
The result is minimal, clean HTML containing only the recipe-relevant content. What started as a 200KB page becomes a 2KB extract.
Now we can send this digested HTML to the language model. With all the noise removed, the model sees something like:
<article>
<h2>Chocolate Chip Cookies</h2>
<ul>
<li>2 cups flour</li>
<li>1 tsp baking soda</li>
<li>1 cup butter</li>
</ul>
<ol>
<li>Preheat oven to 375°F</li>
<li>Mix dry ingredients</li>
<li>Cream butter and sugar</li>
</ol>
</article>
The extraction prompt asks for a structured recipe. The model identifies the title, extracts ingredients with their quantities and units, orders the instructions, and returns clean, normalized recipe data. What would have been a confusing mess becomes a straightforward task.
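As a sketch, using the OpenAI client (the model name and prompt wording are placeholders, not our production prompt):

```python
import json

from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract the recipe from the HTML below. Respond with JSON using the "
    'keys "title", "ingredients" (a list of strings with quantity, unit, '
    'and name) and "instructions" (an ordered list of steps).\n\nHTML:\n'
)


def extract_with_llm(digested_html: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable model works
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT + digested_html}],
    )
    return json.loads(response.choices[0].message.content)
```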
This article focuses on extracting recipes from standard web pages. Social media platforms like YouTube, Instagram, and TikTok present their own unique challenges: video transcripts, aggressive bot detection, client-side rendering, and content buried in JavaScript state. We'll cover those platform-specific pipelines in a future RFD. (See #0009)
Running multiple extraction methods in parallel and letting users choose might seem like overkill, but it reflects reality: no single method is perfect for every site.
JSON-LD is authoritative but sometimes incomplete, or subtly different from what's displayed. recipe-scrapers captures exactly what users see, but only works on known sites. The LLM approach handles anything, but costs more and can occasionally hallucinate details. By presenting options, we let users pick the extraction that best matches what they saw on the page.
Most recipes come from established food blogs where both JSON-LD and recipe-scrapers succeed. In these cases, the extractions are usually identical, and we can auto-select. The choice interface matters most for edge cases: sites with quirky formatting, incomplete structured data, or layouts that confuse one method but not another.
The node scoring approach for LLM fallback is surprisingly robust. Recipes really do contain ingredient words at unusual density. Even on pages with aggressive SEO content ("these chocolate chip cookies are the BEST chocolate chip cookies, better than any chocolate chip cookies you've ever tasted"), the actual ingredient list still scores highest because it contains "flour," "sugar," "butter," and "chocolate" in close proximity.
And the LCA algorithm elegantly handles the structural diversity of recipe pages. Whether ingredients are in a <ul>, a <table>, or a series of <p> tags, we find the right container.
There's something deeply satisfying about watching a cluttered recipe page transform into clean, readable text.
Author: Jorge Bastida
Published: December 31, 2025
RFD: #0007
If you'd like to discuss this RFD, share your thoughts, or simply chat about it, feel free to reach out to me. To stay up to date with the project's development, you can follow me on X.