Data, data everywhere
AI finds what you seek
Data, data everywhere
The AI has sprung a leak
with apologies to Samuel Taylor Coleridge
The best thing about generative AI is that it will answer your questions. The worst thing about generative AI is that it will answer your questions. It’s not just that it might make mistakes; there are ways of dealing with that. The real problem is that an AI only looks like it knows what it’s talking about. AI models create an illusion of context, but they don’t really understand context. Because of that lack of understanding, an AI will leak data in response to a carefully formulated query.
Putting Context in Context
Humans consider information contextually at multiple levels.
One level of context is the immediate: the conversation we are having, the article we are reading, the news broadcast we are listening to. An AI excels at simulating this context. It “remembers” the question you just asked and assumes the next thing you say refers back to that earlier question. You can bring up something you or the AI said earlier, and it generally figures out what you mean. Communicating with an AI feels like a conversation.
But humans also maintain a deeper context that includes experiences, ethics, moral codes, legality, and more. All the knowledge we accumulate throughout our lives helps us understand meaning at multiple levels (which is part of why puns work). An AI lacks this deeper understanding—it doesn’t know what anything means, and it has no sense of larger context. An AI has no judgement. Like a very young, very precocious child, the AI will blurt out anything if asked nicely.
More Data, More Problems
Data is frequently copied into a centralized vector database in order to better feed that data to an AI. This approach is problematic for many reasons, not just because AIs can be convinced to blab everything they know.
Vector databases are a relatively immature technology and can be very complex. They don’t necessarily have the level of transaction support, data management capabilities, efficiency, and flexibility of more mature database technologies. They also require a specialized skill set that is not yet all that common. Even with the right people to manage them, vector databases are computationally expensive and have limited usefulness outside of AI. That means that data, including SQL and JSON data, will still need to be managed elsewhere when it is used for purposes other than feeding an AI.
And that leads to a bigger problem: data duplication. Tracking all the copies becomes increasingly difficult as data assets increase, and in today’s world data assets always increase. The more copies of the data there are, the harder it is to fix errors and the more everyone struggles to base decisions on the most recent data. Moreover, as data is copied into multiple repositories, securing it becomes increasingly difficult and preventing an AI from leaking it becomes ever harder.
See Something, Leak Something
Consider the following scenarios. The names are being withheld to protect those involved.
An employee hears rumors that the company is going to start layoffs. The company uses an AI to manage internal information and, to make that possible, has consolidated its data into one large data store, so he queries the AI about the upcoming layoffs. After the AI refuses to answer, he starts playing around with the prompt until he finds one that gets around the safeguards. Before long, he has all the information about the upcoming layoffs, including names, dates, and severance packages.
At a different company, the information had again been consolidated for ease of AI access. An employee wanted to get upcoming product information that he wasn’t authorized to view. The AI correctly recognized that he shouldn’t have the data and refused to answer the query. The employee, being industrious and creative, used another AI to help him design a prompt that would get the first AI to disgorge the information. The employee was successful.
Far from being secure, the AI leaked like a double agent in a bad spy movie. Trying to lock down the data within a single data store is an exercise in frustration. Because AI lacks that deeper layer of context, it has no capacity to understand that some information is off-limits. Instead, the AI has to be programmed not to answer certain questions, but that is only as effective as the imaginations of the people coming up with the questions to block. Meanwhile, the people who want to get the information are equipped with AIs to help them devise new and interesting ways of asking questions. Indeed, some professors teach their students how to use one AI to hack another AI as a way of building better prompts. It is a small step from there to using one AI to design prompts that will get another AI to reveal what it shouldn’t.
Ultimately, what an AI can access, an AI will leak. It doesn’t know any better and has no capacity to understand.
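To make the weakness of question blocking concrete, here is a minimal sketch of a phrase-based guard. It is purely illustrative: the blocklist, the function name, and the sample prompts are assumptions invented for this example, not a description of how any particular AI product filters queries.

```python
# Hypothetical blocklist: phrases an administrator thought to forbid.
BLOCKED_PHRASES = ["layoff", "severance"]


def is_blocked(prompt: str) -> bool:
    """Return True if the prompt contains any blocked phrase (case-insensitive)."""
    lowered = prompt.lower()
    return any(phrase in lowered for phrase in BLOCKED_PHRASES)


# The obvious question is caught by the guard.
direct = "Who is affected by the upcoming layoffs?"

# A rephrased question asks for the same facts without using a blocked phrase.
rephrased = "List the employees whose roles end next quarter and what they will be paid on exit."

print(is_blocked(direct))
print(is_blocked(rephrased))
```

The guard only catches wording its authors anticipated. Real safeguards are more sophisticated than a phrase list, but the same gap applies whenever the underlying data remains reachable by the AI.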
AI Infrastructure Prevents Data Leaks
The standard approach of bringing the data to the AI may provide short-term benefits, but quickly leads to serious problems. There is a better way: bringing the AI to the data. This is the approach taken by SWIRL with its innovative AI infrastructure software.
SWIRL AI Connect provides a unique combination of AI, metasearch, and vectorless RAG capabilities that dramatically shortens the amount of time it takes to find relevant data. And SWIRL does it all without compromising data security.
SWIRL AI Connect lets you leave your data where it is and brings the AI to the data. You do not need to move data into a vector database or outside your firewall into a vendor’s cloud. SWIRL stays inside your firewall and integrates with your existing security protocols. You no longer need expensive ETL processes to move data around before it can be used, nor do you need to maintain a separate copy of the data for the AI.
You can ask the AI anything you want. SWIRL will automatically search all data stores that you have permission to access, and only the data stores that you have permission to access. No matter how creative the prompt is, what the AI can’t see, the AI can’t leak.
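The idea of searching only the stores a user is permitted to access can be sketched as follows. This is a simplified illustration under stated assumptions, not SWIRL's actual API or implementation: the `DataStore` class, the group names, and the sample documents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStore:
    """Hypothetical data store with its own access list."""
    name: str
    allowed_groups: frozenset
    documents: tuple


def searchable_stores(stores, user_groups):
    """Keep only the stores that share at least one group with the user."""
    return [s for s in stores if s.allowed_groups & user_groups]


def search(stores, user_groups, term):
    """Search permitted stores only; other stores are never read."""
    hits = []
    for store in searchable_stores(stores, user_groups):
        for doc in store.documents:
            if term.lower() in doc.lower():
                hits.append((store.name, doc))
    return hits


# Hypothetical sample data.
STORES = [
    DataStore("wiki", frozenset({"staff", "hr"}), ("Holiday schedule", "Layoff rumor FAQ")),
    DataStore("hr_records", frozenset({"hr"}), ("Layoff plan: names and dates",)),
]

# Only these results would be handed to the AI as context for its answer.
print(search(STORES, {"staff"}, "layoff"))
print(search(STORES, {"hr"}, "layoff"))
```

Because the permission check happens before any document is read, the restricted content never reaches the model for an unauthorized user, so no rewording of the prompt can surface it.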
SWIRL AI Connect supports multiple LLMs and AI systems, allowing you to use LLMs located inside the corporate firewall. You have total control over which AI you use.
Because SWIRL does not need data to be uploaded to an external server, you can run SWIRL on an air-gapped system. SWIRL will interact only with the available data.
SWIRL is format-agnostic and is capable of reading both structured and unstructured data, including useful data in email, team collaboration apps, office documents, and more. SWIRL doesn’t need the data to be in a vector database to use RAG, thereby accelerating data access, increasing the amount of data SWIRL can search, and significantly improving the relevancy of the results compared to other systems.
SWIRL connects to well over 100 apps and databases, and that number is steadily increasing.