Data Sovereignty and AI Process Tools: Fourteen Questions to Ask Any Vendor Before You Upload a Single Document
Most AI vendor evaluations are procurement theater: SOC 2 logos and generic privacy promises. These fourteen questions cut through it. Use them before you upload one document, not after.
The four question categories that matter, and the rest that do not
AI vendor evaluations tend to drown in generic security checklists that were written for traditional SaaS tools in 2015. Most of that checklist is either obsolete or already covered by SOC 2. The questions that actually discriminate between serious AI vendors and everyone else fall into four specific categories, and they are the questions the weak vendors struggle to answer clearly.
The categories: data residency (where your data physically lives), training (whether your data is used to improve the vendor's models), retention and deletion (how long data stays and how you get it out), and access and breach (who inside the vendor can see it and what happens if that goes wrong). Every other question about the vendor is either downstream of these four or is a distraction.
- Category 1: data residency (questions 1 to 4). Where, under whose jurisdiction, with what sub-processors.
- Category 2: training and model handling (questions 5 to 8). Is your data used to train, how is context isolated, what about the vector store.
- Category 3: retention and deletion (questions 9 to 11). How long, how fast can you get it out, what proof of deletion.
- Category 4: access, audit, breach (questions 12 to 14). Who sees it internally, what logs exist, what happens when something goes wrong.
Questions 1 to 4 on data residency
Data residency is the first wall of the evaluation because it is the easiest one for vendors to answer clearly, and the clarity of the answer is a signal. Vendors who waffle here are telling you they have not thought about it, or they have thought about it and do not like the answer they have to give.
Question 1: In which countries is our data physically stored at rest?
The good answer: specific regions, specific cloud providers, specific data centers if relevant. For an EU customer: "Your data is stored in the Frankfurt and Paris regions of AWS, with backups in the Dublin region". The vague answer: "We use enterprise-grade cloud infrastructure". The second answer is a fail.
Question 2: In which countries is our data processed, as distinct from stored?
Processing includes the inference calls to the AI model, which often route through different regions than storage. A vendor can store in Frankfurt and route inference through a US region, which creates a cross-border data transfer every time the AI is used. The good answer names both locations separately and explains the legal basis for any transfer.
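One way to make the storage versus processing distinction concrete is to record the vendor's answers to questions 1 and 2 as separate fields and flag any region outside the jurisdiction you require. The following is a minimal sketch; the region labels, the `EU_REGIONS` set, and the function name are illustrative assumptions, not any cloud provider's identifiers or API.

```python
from dataclasses import dataclass

# Hypothetical labels for regions the customer treats as "in the EU".
# Real region identifiers depend on the vendor's cloud provider.
EU_REGIONS = {"eu-frankfurt", "eu-paris", "eu-dublin"}


@dataclass(frozen=True)
class ResidencyAnswer:
    """A vendor's answers to questions 1 and 2, recorded separately."""
    storage_regions: frozenset
    processing_regions: frozenset


def cross_border_regions(answer: ResidencyAnswer, allowed: set) -> set:
    """Return every storage or processing region outside the allowed set."""
    used = answer.storage_regions | answer.processing_regions
    return set(used) - allowed


answer = ResidencyAnswer(
    storage_regions=frozenset({"eu-frankfurt"}),
    processing_regions=frozenset({"us-east"}),  # inference routed through a US region
)
flagged = cross_border_regions(answer, EU_REGIONS)
print(flagged)  # {'us-east'}
```

A non-empty result does not make the vendor unusable; it tells you which transfers need a named legal basis.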
Question 3: Do you offer EU-only or customer-chosen regional deployment?
For regulated industries (healthcare, financial services, legal) or EU public-sector customers, regional pinning is often a hard requirement. The good answer: "Yes, on our Pro tier and above, all storage and processing is in the EU region of your choice". The marginal answer: "We can discuss custom arrangements on Enterprise". The bad answer: "All our infrastructure is US-based". The last one may still be acceptable for internal use cases with no PII, but you should know.
Question 4: Who are your sub-processors, and under what jurisdiction?
Every AI vendor has sub-processors: the underlying LLM provider (OpenAI, Anthropic, Google), the hosting provider, any specialized services. The vendor must maintain a public or on-request list. For EU customers, each sub-processor's jurisdiction determines whether standard contractual clauses or adequacy decisions apply. A good answer includes the full list; a bad answer hedges around "trusted partners".
Questions 5 to 8 on training and model handling
Training questions are where vendor answers most often collapse, because the honest answer is often that the underlying LLM provider's policy governs, not the vendor's own. The vendor is not always lying when they say "we do not train on your data", but the full picture usually requires two levels of answer: what the vendor does, and what the underlying model provider does.
Question 5: Is our data used to train, fine-tune, or improve your models?
The only acceptable answer for business use cases with any sensitive data is a clear "no, by default, for all customers". Not "no if you opt out" (which is a 2022-era answer). Not "no unless anonymized" (which is usually impossible to verify). Not "only with your consent" (which is a loophole). A clear default "no" with the commitment in the DPA is the bar.
Question 6: Does your upstream LLM provider train on our data?
The vendor must have a contract with OpenAI, Anthropic, Google, or whoever they use, and that contract must prohibit training on passed-through customer data. The good answer names the provider, references the zero-training API tier, and provides the line in the underlying ToS. The bad answer is "we trust our provider", which tells you nothing.
Question 7: How is context isolated between customers at inference time?
Every AI tool has some form of context or memory. RAG pipelines have vector stores. Chat tools have conversation history. Agent tools have tool-call logs. The question is whether these are strictly scoped per customer (or per workspace, or per user) at every layer, including caches and indexes. The good answer is specific: "Each workspace has its own vector namespace, keyed on workspace ID, with no cross-workspace retrieval possible". The bad answer is "our architecture ensures isolation".
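To show what "keyed on workspace ID" means in practice, here is a toy in-memory sketch of a vector store in which every read and write requires a workspace ID. The class and method names are hypothetical and do not correspond to any real vector database; a production system would enforce the same scoping in its index, cache, and query layers.

```python
import math
from collections import defaultdict


class NamespacedVectorStore:
    """Toy store: one independent list of (doc_id, vector) pairs per workspace."""

    def __init__(self):
        self._namespaces = defaultdict(list)

    def add(self, workspace_id, doc_id, vector):
        self._namespaces[workspace_id].append((doc_id, vector))

    def search(self, workspace_id, query, k=3):
        """Return up to k doc IDs from this workspace only, by cosine similarity."""
        # .get() avoids creating an empty namespace for unknown workspaces.
        candidates = self._namespaces.get(workspace_id, [])
        ranked = sorted(candidates, key=lambda item: _cosine(query, item[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


store = NamespacedVectorStore()
store.add("workspace-a", "a-contract", [1.0, 0.0])
store.add("workspace-b", "b-payroll", [1.0, 0.0])

# Identical vectors, but each workspace only ever sees its own document.
results_a = store.search("workspace-a", [1.0, 0.0])
results_b = store.search("workspace-b", [1.0, 0.0])
print(results_a, results_b)  # ['a-contract'] ['b-payroll']
```

The design point is that there is no search method without a workspace argument, so cross-workspace retrieval is not expressible rather than merely discouraged.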
Question 8: If we fine-tune a model on our data, what happens to that fine-tuned model?
If the vendor offers fine-tuning, the fine-tuned weights are your data. They must be scoped to your account, usable only by your account, and deletable on request. A particularly bad answer here: "Fine-tuned models benefit all customers through improved base models". That is the vendor laundering your data into their competitive moat.
Questions 9 to 11 on retention and deletion
Retention questions are the ones your compliance team needs for the DPA, and the ones your ops team needs for the day you decide to switch vendors. Both audiences need specific numbers and specific mechanisms, not good intentions.
Question 9: What is the default retention period for customer data, and is it configurable?
The good answer is a specific duration (e.g., "30 days for logs, indefinite for customer-created artifacts until deletion requested") with configurability on Pro and Enterprise tiers. Retention should be minimized for logs, especially prompt and completion logs that contain customer content. Long retention of prompt logs is a common source of shadow data accumulation.
Question 10: How do we export all our data if we decide to leave?
The good answer: documented self-service export in machine-readable formats (JSON, CSV, standard document formats), delivered within 48 hours maximum, covering all artifacts the customer created. The bad answer: "Contact support, we can discuss export options". Lock-in by obfuscation is a red flag you must surface before signing, not after.
Question 11: When we request deletion, what exactly gets deleted, and on what timeline?
"Deletion" has variable meanings in cloud systems. Ask specifically: are backups purged within the retention window, or does the 30-day standard backup cycle mean residual copies persist for up to 30 days? Are vector embeddings derived from your data deleted alongside raw data? Are logs in observability systems (Datadog, Splunk) scrubbed? The good vendor has a written data deletion procedure and can quote the end-to-end timeline. The vague vendor says "within a reasonable timeframe".
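The end-to-end timeline is set by the slowest system that holds a copy. A minimal sketch of that arithmetic follows; the system names and per-system delays are made-up placeholders for the figures a vendor would quote, with only the 30-day backup cycle taken from the text above.

```python
from datetime import date, timedelta

# Hypothetical purge delays in days, as a vendor might quote them.
PURGE_DELAY_DAYS = {
    "primary_database": 1,
    "vector_embeddings": 1,
    "observability_logs": 14,
    "backups": 30,  # the 30-day backup cycle
}


def deletion_complete_on(requested, delays):
    """Date by which the slowest system has purged the data."""
    return requested + timedelta(days=max(delays.values()))


def slowest_systems(delays):
    """Systems that determine the end-to-end timeline."""
    worst = max(delays.values())
    return sorted(name for name, days in delays.items() if days == worst)


done = deletion_complete_on(date(2025, 3, 1), PURGE_DELAY_DAYS)
print(done.isoformat(), slowest_systems(PURGE_DELAY_DAYS))  # 2025-03-31 ['backups']
```

If the vendor cannot fill in a table like `PURGE_DELAY_DAYS` for every system that stores your content, it has no end-to-end timeline to quote.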
Questions 12 to 14 on access, audit, and breach
The last three questions are about what happens when humans get involved: vendor employees accessing customer data, auditors verifying the controls, and the inevitable incident response. These are the questions most likely to surface a gap between the vendor's marketing and their actual operation.
Question 12: Who inside your company can access our data, and under what circumstances?
The good answer: a minimal named role (e.g., "on-call SRE") with access gated by customer-initiated support tickets, logged in real time, and reviewed quarterly. Engineers do not have standing access to customer data. Support staff can see metadata, not content, unless the customer explicitly authorizes it for a specific ticket. The bad answer: "Our employees are bound by confidentiality obligations". That is the floor, not a policy.
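The difference between standing and just-in-time access can be sketched as a small broker that only issues a time-boxed grant when a customer-initiated ticket is open, and logs every request whether or not it is approved. All names here (`AccessBroker`, `AccessGrant`, the ticket IDs, the one-hour default) are illustrative assumptions, not a description of any vendor's system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class AccessGrant:
    employee: str
    ticket_id: str
    expires_at: datetime


@dataclass
class AccessBroker:
    """Grants time-boxed access tied to an open ticket and logs every decision."""
    open_tickets: set
    audit_log: list = field(default_factory=list)

    def request(self, employee: str, ticket_id: str, now: datetime,
                ttl: timedelta = timedelta(hours=1)) -> Optional[AccessGrant]:
        approved = ticket_id in self.open_tickets
        # Denied requests are logged too, so reviewers can see access attempts.
        self.audit_log.append({
            "time": now.isoformat(),
            "employee": employee,
            "ticket_id": ticket_id,
            "approved": approved,
        })
        if not approved:
            return None
        return AccessGrant(employee, ticket_id, expires_at=now + ttl)


broker = AccessBroker(open_tickets={"TICKET-1"})
now = datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc)
granted = broker.request("sre-on-call", "TICKET-1", now)
denied = broker.request("engineer", "TICKET-999", now)
print(granted is not None, denied is None, len(broker.audit_log))  # True True 2
```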
Question 13: What audit logs exist, can we access them, and for how long?
Audit logs should capture every access to customer data, every administrative action, every API call with customer scope. Customers on Enterprise tier should be able to pull their own logs via API or export for their own SIEM. A six-month retention is the floor; twelve is better. "We have comprehensive logging" without a customer-accessible interface is a checkbox answer, not a control.
Question 14: What is your breach notification timeline and process?
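For EU customers, GDPR requires the data controller (you) to notify the supervisory authority within 72 hours of becoming aware of a personal data breach, and requires the processor (the vendor) to notify the controller without undue delay. Your 72-hour clock therefore depends on how fast the vendor tells you. For customers in regulated industries, shorter contractual timelines often apply. The vendor should commit to notifying you within 24 to 72 hours of discovery of a confirmed breach affecting your data, with a documented process for providing the information you need to run your own breach analysis. Longer timelines, or "we will notify if legally required", are inadequate.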
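As a worked example of the timeline arithmetic, the sketch below computes the latest notification time under a contractual window and checks the window against the 72-hour ceiling used in this article. The function names and the sample discovery time are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def notification_deadline(discovered_at, window_hours):
    """Latest time the vendor may notify under a contractual window."""
    return discovered_at + timedelta(hours=window_hours)


def meets_bar(window_hours, max_hours=72):
    """True if the contractual window is no longer than the 72-hour ceiling."""
    return 0 < window_hours <= max_hours


discovered = datetime(2025, 3, 1, 9, 0, tzinfo=timezone.utc)
deadline = notification_deadline(discovered, 48)
print(deadline.isoformat(), meets_bar(48), meets_bar(96))
# 2025-03-03T09:00:00+00:00 True False
```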
The five dealbreakers you should never compromise on
For most of the fourteen questions, acceptable answers can reasonably vary by vendor and tier. You can accept US-only data residency for an internal-only use case with no regulated data. You can accept 30-day backup residuals as part of standard cloud hygiene. You can accept shared vector indexes if they are strictly namespaced. But a few answers are always dealbreakers, and if any of them show up you walk away regardless of the price.
- The vendor trains on your data by default, and opt-out is either not available or not in the DPA.
- The vendor's underlying LLM provider retains a broader training right than the vendor's ToS admits.
- There is no documented self-service data export, or export is gated behind a negotiation or fee.
- Employee access to customer data is standing rather than just-in-time and logged.
- Breach notification is "as required by law" with no contractual 24 to 72-hour commitment.
Any one of these is sufficient to walk. Two of them and you should be questioning why the vendor is still in market. The price of finding out later, once your data is already in their system, is always higher than the price of a tougher procurement conversation now.
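The five dealbreakers can be encoded as a simple screen that runs before any scoring of the remaining answers. This is a sketch under assumed field names; how you record a vendor's answers is up to your own procurement process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VendorAnswers:
    """Hypothetical record of the answers that map to the five dealbreakers."""
    trains_by_default_without_dpa_opt_out: bool
    upstream_training_right_exceeds_tos: bool
    self_service_export_without_fee: bool
    standing_employee_access: bool
    contractual_breach_notice_hours: Optional[int]  # None means "as required by law"


def dealbreakers(v: VendorAnswers) -> list:
    """Return the dealbreakers this vendor triggers; an empty list means none."""
    found = []
    if v.trains_by_default_without_dpa_opt_out:
        found.append("trains on customer data by default")
    if v.upstream_training_right_exceeds_tos:
        found.append("upstream provider retains broader training right")
    if not v.self_service_export_without_fee:
        found.append("no documented self-service export")
    if v.standing_employee_access:
        found.append("standing employee access")
    hours = v.contractual_breach_notice_hours
    if hours is None or hours > 72:
        found.append("no contractual 24 to 72-hour breach notice")
    return found


vendor = VendorAnswers(False, False, True, True, None)
issues = dealbreakers(vendor)
print("walk away" if issues else "continue evaluation", issues)
```

Returning the list of triggered items rather than a single boolean keeps the reason for walking away on record for whoever reviews the decision later.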
How to read a DPA in ten minutes
A Data Processing Agreement is twenty to sixty pages of lawyer language that looks intimidating and is actually formulaic. A competent operator who knows which five sections to read, and what to look for in each, can evaluate any DPA in ten minutes. You do not need a lawyer for the first pass; you need one for the edge cases the first pass identifies.
- Processing purposes: read the list of specific purposes for which the vendor processes your data. "Service improvement" and "internal analytics" are the phrases that can hide training uses.
- Sub-processors: find the sub-processor list (usually an annex). Check each name. Any surprise is a red flag.
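- Cross-border transfers: find the transfer mechanism (SCCs, an adequacy decision, or Binding Corporate Rules). If you are in the EU and the transfer is to a non-adequate country, SCCs must be explicitly referenced. Consent of the data subject is a narrow derogation, not a basis for routine transfers.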
- Retention and deletion: find the retention table. Compare with the vendor's marketing claims. Any discrepancy is a red flag.
- Breach notification: find the timeline and the notification method. Anything longer than 72 hours or less specific than "direct email to a customer contact" is inadequate.
If all five sections are clean and specific, the DPA is probably fine and you spend the remaining time on the business terms (pricing, SLA, exit). If any of the five has a vague or off-market clause, that is where the legal review needs to concentrate. Paying a lawyer to review a good DPA end-to-end is usually a waste. Paying a lawyer to negotiate the two sections that actually have issues is money well spent.
Frequently asked questions
What about non-EU vendors claiming GDPR equivalence?
The standard mechanisms for routine transfers under GDPR are adequacy decisions (a short list of countries), Standard Contractual Clauses, or Binding Corporate Rules. "Equivalence" as a marketing term has no legal meaning. A vendor claiming it is either using shorthand for SCCs (ask them to confirm in writing) or is misrepresenting their legal posture. Require the DPA to specifically name the mechanism being used for any transfer.
Does SOC 2 cover AI-specific risks?
Only partially. SOC 2 confirms that documented security controls are in place and followed. It does not inspect AI-specific risks: model training boundaries, prompt injection protections, output logging policies, or vector store isolation. Treat SOC 2 as a baseline that says the vendor takes security seriously, then ask the AI-specific questions separately. For AI process tools, SOC 2 plus the fourteen questions is the right bar.
What if a vendor says 'we do not train on your data' but also has ambiguous ToS language?
The ToS wins, not the sales deck. If the public marketing says one thing and the actual contract says another, you will not be able to enforce the marketing language in any future dispute. Demand that the no-training commitment be explicit in the DPA or master services agreement. If the vendor refuses, the ambiguity is a deliberate choice and you should treat the training question as unanswered.
Can we trust a vendor that uses a third-party LLM provider we already trust?
No. Trust is not transitive: the LLM provider's trustworthiness does not pass through to the vendor built on top of it. Even with a well-regarded underlying LLM provider, the vendor's application layer can log prompts and completions, store them in unexpected places, and use them for its own purposes. The LLM provider's compliance posture does not automatically extend to everything the vendor's application does on top. Ask the vendor the questions anyway.
Do we need all fourteen answers before signing, or can we start with a pilot?
For a genuine pilot with synthetic data or de-identified test data, you can defer some questions to the commercial agreement stage. Even then, the category 1 and 2 questions (residency and training) are the minimum you verify before any upload; the others you can negotiate through DPA drafting. For a pilot that touches real customer data or real internal operational data, you need all fourteen answers before anything gets uploaded.