
Tool Recommendations via Grounded Search: How LucidFlow Avoids the LLM Hallucination Trap

Ask a frontier LLM which tool to use for a given process, and the answer will be confidently wrong in subtle ways: the tool existed three years ago, the pricing is out of date, the URL is a hallucination. LucidFlow's recommendation layer solves this with grounded search: Gemini 3.1 Pro plus live Google Search, and a fallback chain that degrades transparently rather than silently.


The hallucination problem with LLM tool recommendations

Ask a generic LLM which tool to use for a specific business process and you get confident recommendations with a predictable failure pattern: the model recommends tools that were best-in-class two or three years ago (the vintage of its training data), quotes pricing that is out of date or fabricated, and cites product URLs that either redirect, 404, or point to a completely different product. Even frontier models do this, because the recommendation task itself is inherently temporal: the right answer changes every quarter as new tools ship, incumbents raise prices, and categories consolidate. No amount of scale fixes a training-data vintage problem.

The concrete failure mode we audited in the LucidFlow Knowledge Base before shipping the grounded-search layer: models would recommend RPA platforms from 2022 for tasks where a 2025-vintage AI-native competitor has explicitly taken the category, quote yearly pricing based on 2023 public pricing pages that have since moved behind enterprise-sales walls, and confidently cite URLs that were the vendor's product page three years ago but are now a redirect to a generic homepage. The recommendations read as authoritative because the model is good at writing authoritative prose. They fail as actual product-selection guidance.

The three-tier fallback chain

Grounded search is the primary provider, but it cannot be the only one. Real-time web search can fail for mundane reasons: API rate limits, transient network errors, timeouts on unusually broad queries. The recommendation layer needs to remain useful even when the primary fails, so it implements an explicit three-tier fallback chain, with each tier chosen to preserve as much recommendation quality as possible while degrading transparently when the higher tier is unavailable.

  1. Primary: Gemini 3.1 Pro + Google Search grounding. Real-time web data, auditable via grounding metadata. This is the default path and handles the vast majority of production calls.
  2. Fallback: Grok. Uses training data rather than real-time search, so it may have some incumbency bias, but it is still a capable recommender and its training data is relatively recent. Activated when the Gemini primary call fails with a retryable error. A Grok availability check gates this: if Grok is also unavailable, the chain skips to the last-resort tier.
  3. Last resort: graceful degradation. Returns an empty tool list with an explicit 'tools unavailable' flag rather than fabricating recommendations. The UI then surfaces this state explicitly, showing the rest of the transformation analysis (maturity levels, savings, ROI) without tool suggestions. This is the right behaviour because a missing recommendation is more honest than a hallucinated one.
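The three tiers above can be sketched in a few lines. This is a minimal illustration with stubbed providers, not LucidFlow's actual implementation: the function names (`callGeminiGrounded`, `isGrokAvailable`, `callGrok`), the `RetryableError` class, and the result shape are all assumptions made for the sketch. The stubs simulate the interesting path: the Gemini primary fails with a retryable error and the chain falls through to Grok.

```typescript
// Sketch of the three-tier fallback chain. All names are illustrative.
type Result = { tools: string[]; toolsUnavailable: boolean };

class RetryableError extends Error {}

// Stub providers: Gemini fails retryably, Grok is available and succeeds.
async function callGeminiGrounded(_processId: string): Promise<string[]> {
  throw new RetryableError('rate limited');
}
async function isGrokAvailable(): Promise<boolean> {
  return true;
}
async function callGrok(_processId: string): Promise<string[]> {
  return ['consolidated-platform-x'];
}

async function recommendTools(processId: string): Promise<Result> {
  try {
    // Tier 1: grounded search (real-time web data)
    return { tools: await callGeminiGrounded(processId), toolsUnavailable: false };
  } catch (err) {
    if (err instanceof RetryableError && (await isGrokAvailable())) {
      // Tier 2: training-data fallback, gated by an availability check
      return { tools: await callGrok(processId), toolsUnavailable: false };
    }
  }
  // Tier 3: empty list with an explicit flag, never fabricated recommendations
  return { tools: [], toolsUnavailable: true };
}
```

Note that tier 3 is reached both when the error is non-retryable and when Grok is down: the caller always gets a well-formed result and decides how to surface the `toolsUnavailable` state.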

Why recommendations are process-level, not task-level

A subtle architectural choice in the tool-recommendation layer: it makes ONE LLM call per process, not one call per task. The system prompt explicitly frames the model's job as thinking holistically: 'These tasks form a given workflow. Can ONE platform handle multiple tasks?' The prompt actively pushes the model toward consolidated platforms: 'A consolidated platform that is good enough at each task beats 4 best-in-class tools that require complex integration.' This is the enterprise best practice, and it falls out of the design rather than being applied after the fact.

The practical consequence: a process with ten automatable tasks usually comes back with two or three tool recommendations rather than ten, each recommendation mapping to multiple tasks. A modern AI-native accounts-payable platform might cover five tasks with one subscription; a document-understanding API plus a workflow engine might cover the rest. This is closer to how a real transformation actually gets implemented: teams buy platforms, not point tools, and it surfaces the integration cost trade-off explicitly rather than hiding it behind a long list of individual-tool recommendations.

First-principles thinking baked into the prompt

One more underrated detail: the prompt instructs the model to consider the entire process holistically, including tasks marked not_automatable. A task might be 'not_automatable' in the static pattern-matcher's view but eliminatable by a modern platform: an automated-approval system removes the need for manual review entirely, for example. The prompt allows the model to include 'not_automatable' tasks in a recommendation and explain in the reasoning why the tool can eliminate the task. This catches the case that task-level recommendation would always miss: sometimes the right answer is not automating the task but removing it.

What a recommendation actually looks like

Every recommendation returned by the grounded-search path conforms to the same Zod-validated schema. The shape is worth reading once because it shows what the platform actually commits to surfacing for each recommendation: not claims in prose, but structured fields.

import { z } from 'zod';

// HTTPS-only URL field: non-HTTPS values collapse to an empty string
// rather than propagating insecure links into the product.
const httpsUrlSchema = z
  .string()
  .transform((url) => (url.startsWith('https://') ? url : ''));

const toolRecommendationSchema = z.object({
  recommendations: z.array(
    z.object({
      taskIds: z.array(z.string()),
      toolName: z.string(),
      description: z.string(),
      reasoning: z.string(),
      monthlyPrice: z.object({ min: z.number(), max: z.number() }),
      pricingModel: z.string(), // "per seat" | "flat rate" | "usage-based" | ...
      url: httpsUrlSchema, // HTTPS only; non-HTTPS falls back to empty string
      maturity: z.enum(['new', 'established']),
      alternatives: z.array(
        z.object({
          name: z.string(),
          url: httpsUrlSchema,
          monthlyPrice: z.object({ min: z.number(), max: z.number() }),
          reasoning: z.string(),
        })
      ),
    })
  ),
});

Three properties worth highlighting in that schema. First, the URL field uses HTTPS URL validation: a transform that returns an empty string for non-HTTPS URLs rather than propagating insecure links into the product. Second, the pricing model is a free-text string rather than an enum because pricing models vary in ways an enum cannot capture (per-seat-with-volume-discounts, per-execution-with-monthly-cap, usage-based-with-minimum, etc.). Third, the alternatives array is required and non-empty by convention: the prompt instructs the model to include one to two alternatives with their own price ranges and reasoning, because a recommendation without alternatives is a sales pitch rather than an analysis. The schema does not force non-empty alternatives, but the prompt design ensures they arrive.
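That last gap, where the schema permits empty alternatives while the prompt is expected to fill them, suggests a cheap post-validation check. A minimal sketch, assuming a helper of this shape exists downstream of schema validation (the name and types here are illustrative, not LucidFlow's code):

```typescript
// Flag recommendations that arrived without alternatives, since the schema
// alone does not enforce them (illustrative helper, not production code).
type RecWithAlternatives = { toolName: string; alternatives: { name: string }[] };

function withoutAlternatives(recs: RecWithAlternatives[]): string[] {
  return recs.filter((r) => r.alternatives.length === 0).map((r) => r.toolName);
}
```

A check like this turns the prompt-level convention into something observable: a non-empty return value is a signal that the generation drifted from the prompt's requirements.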

Frequently asked questions

How does grounded search handle pricing that changes between the recommendation and implementation?

Pricing drift is real: vendors raise prices, change tiers, move capabilities between tiers. The recommendation captures the pricing at the moment the grounded search was performed, and surfaces it as a range (min/max monthly cost) rather than a point estimate to absorb the normal churn within a tier. If you run the same recommendation call six months later, you will get current pricing; the earlier recommendation is not auto-updated in storage. For programmes that plan more than two or three months out, the practical move is to re-run the recommendation closer to the implementation date to get fresh pricing before commitment.

Can I trust the URLs in recommendations, or should I verify each one?

Verify each one before acting. The HTTPS URL validation guards against non-HTTPS URLs, but it does not guarantee the URL is a valid product page: the model can still return a plausible-looking URL that no longer exists. The grounding metadata in the logs shows which URLs the model actually retrieved during the search, and these are more trustworthy than any URL that does not appear in the grounding sources. For enterprise purchasing decisions, the right verification is to click through each recommended URL and each alternative URL before including the recommendation in a signed-off plan. The grounded-search layer gets you closer to the right answer than an ungrounded LLM, but it is not a substitute for the last-mile check.
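The grounding-metadata comparison described above can be partially automated. A hedged sketch: hostname-level matching is an assumption made here for robustness (exact-URL matching would be too strict), and the flat list of grounding URLs stands in for whatever structure the real grounding metadata has, which this post does not show:

```typescript
// Flag recommended URLs whose hostname never appears among the
// grounding-metadata sources actually retrieved during the search.
function unsupportedByGrounding(recommended: string[], groundingUrls: string[]): string[] {
  const groundedHosts = new Set(groundingUrls.map((u) => new URL(u).hostname));
  // Skip '' (the HTTPS-validation fallback value) and keep only URLs
  // whose host was never retrieved during grounding.
  return recommended.filter((u) => u !== '' && !groundedHosts.has(new URL(u).hostname));
}
```

Anything this returns is exactly the "does not appear in the grounding sources" case: a candidate for manual click-through before the recommendation lands in a signed-off plan.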

What happens if Gemini grounded search surfaces a tool that does not actually exist?

Rare, because the search is real-time and the grounding chunks show what was actually retrieved, but still possible when the model over-generalises from a partial match. The Zod schema validation catches shape errors but not factual errors. The protection is the grounding-metadata log: if a reviewer finds a recommendation with no supporting grounding chunk, that is a red flag. A second protection is the alternatives array: a fabricated recommendation typically comes with fabricated alternatives that are easier to detect than the fabricated primary. In production, this failure mode has been rare enough that the main quality lever is prompt tuning rather than architectural change.

Why not use ChatGPT or Claude with browsing for the same purpose?

Either could work in principle: the approach is model-agnostic, not Gemini-specific. The LucidFlow stack chose Gemini 3.1 Pro because its grounded-search API is the most mature integration of real-time search into the generation loop as of 2026, with transparent grounding metadata that the logs can capture. The fallback to Grok is in place partly to remain resilient to any single-provider outage. Adding OpenAI or Anthropic grounding as a third provider is on the roadmap but not required for the current quality bar.

Do the recommendations ever include free or open-source tools?

Yes, when they are genuinely state-of-the-art for the task. The prompt does not filter by licence model; it filters by being 'commercially available, stable enough for enterprise use, and state-of-the-art'. An open-source project with enterprise support (a managed hosted version, a services layer) can and does get recommended. A purely open-source project with no enterprise deployment story tends not to, because the reasoning field has to justify production readiness and the absence of enterprise support makes that harder to write honestly. The resulting mix tends to be commercial SaaS with occasional open-source-with-managed-hosting entries.

How is this different from just asking a model with web browsing enabled?

Two structural differences. First, the prompt is specifically engineered for enterprise tool selection: it enforces a schema, requires alternatives, insists on HTTPS URLs, demands pricing-page verification, and biases toward consolidated platforms over point tools. A generic 'browse and recommend' prompt does none of this. Second, the fallback chain and the schema validation are production plumbing that turns a best-effort generation into a reliable service. An ad-hoc browsing LLM is useful for research; a production recommendation layer has to handle timeouts, schema failures, and provider outages without silently returning low-quality answers. The engineering around the model is at least as much of the value as the model itself.

Related articles

What Is BPMN? The Complete 2026 Guide to Business Process Model and Notation
AI Process Transformation: From Manual Workflows to Autonomous Agents, Without the Gap Year in Between
Five Business Processes Every SMB Should Automate with AI First, and Why in That Order

Ready to Build Your AI Transformation Plan?

Upload any process document and co-build an AI transformation plan with real tool recommendations and ROI projections — in minutes, not weeks.

Try LucidFlow Free