Unlocking AI Value: The Role of Metadata

You can’t get value out of your data from AI tools if AI can’t understand your data!

Companies are racing to deploy AI agents and large language models, yet many overlook a critical factor: AI is only as reliable as the metadata it receives. Without clear, maintained metadata, even advanced tools produce incorrect joins, misinterpret fields, and deliver flawed insights.

The Model Context Protocol (MCP)—the open standard launched by Anthropic in November 2024—changes the game. MCP servers provide secure, standardized connections between AI clients (like Claude) and enterprise systems, including databases and data catalogs. This lets AI query data in real time, much like a human analyst. However, MCP servers deliver full value only when backed by high-quality metadata.

Companies must now treat metadata maintenance as core infrastructure. Fully populating and updating up to three layers, depending on your databases and platforms (INFORMATION_SCHEMA, ISO/IEC 11179, and enterprise data catalogs), has become non-negotiable for maximizing AI and MCP effectiveness.

Layer 1: INFORMATION_SCHEMA – The Structural Foundation

The ANSI/ISO SQL standard INFORMATION_SCHEMA offers machine-readable details on tables, columns, data types, and—crucially—foreign-key constraints. These constraints define precise join paths, such as orders.customer_id = customers.id.

Many teams ignore this built-in resource, leaving foreign keys undocumented and comments empty. As a result, AI agents generate faulty SQL with wrong joins.

Solution: Treat schema hygiene as routine. Enforce foreign keys, add clear column comments, and securely expose INFORMATION_SCHEMA through your MCP server. This low-effort, low-cost step dramatically improves query accuracy.
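As a minimal sketch of why this matters: the foreign-key metadata in INFORMATION_SCHEMA can be turned directly into the join predicates an AI agent needs. The SQL below follows the ANSI standard views, though exact view and column names vary slightly by database, and the mocked rows stand in for real query results.

```python
# Sketch: derive explicit join paths from INFORMATION_SCHEMA foreign-key
# metadata. Column names follow the ANSI standard views; check your
# database's documentation for its exact variant.

FK_QUERY = """
SELECT kcu.table_name, kcu.column_name,
       kcu2.table_name AS ref_table, kcu2.column_name AS ref_column
FROM information_schema.referential_constraints rc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name = rc.constraint_name
JOIN information_schema.key_column_usage kcu2
  ON kcu2.constraint_name = rc.unique_constraint_name
"""

def join_paths(fk_rows):
    """Render rows from the query above as join predicates an AI agent
    (or a human analyst) can use directly."""
    return [f"{t}.{c} = {rt}.{rc}" for t, c, rt, rc in fk_rows]

# Mocked query result, illustrating the orders/customers example:
rows = [("orders", "customer_id", "customers", "id")]
print(join_paths(rows))  # ['orders.customer_id = customers.id']
```

If the foreign keys are never declared, this query returns nothing, and the agent is back to guessing.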

Layer 2: ISO/IEC 11179 – The Semantic Foundation

Structure shows how tables connect. Semantics explain why a field exists and how to use it correctly.

ISO/IEC 11179, the international standard for metadata registries, defines data elements with business definitions, value domains, stewardship, and usage rules. A vague status_code becomes clearly documented as “order lifecycle stage: 01=Pending, 02=Shipped, 03=Cancelled—do not use for payment status.”

Without this layer, AI misinterprets fields and applies incorrect logic. When MCP servers pull ISO 11179 definitions alongside schema data, agents gain true business context, not just raw structure.

Maintaining a semantic registry (often inside your catalog) is now essential. It equips AI to ask and receive authoritative answers about field meaning and rules.
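To make the idea concrete, an ISO 11179-style data element can be modeled as a small record carrying a definition, steward, value domain, and usage rules. The field names below are illustrative, not the standard's formal metamodel.

```python
from dataclasses import dataclass, field

# Illustrative sketch of an ISO/IEC 11179-style data element entry.
# Field names are simplified, not the standard's formal metamodel.

@dataclass
class DataElement:
    name: str
    definition: str
    steward: str
    value_domain: dict = field(default_factory=dict)  # permitted code -> meaning
    usage_rules: list = field(default_factory=list)

    def validate(self, value: str) -> bool:
        """Check a value against the permitted value domain."""
        return value in self.value_domain

# The status_code example from above, now machine-checkable:
status_code = DataElement(
    name="status_code",
    definition="Order lifecycle stage",
    steward="order-management-team",
    value_domain={"01": "Pending", "02": "Shipped", "03": "Cancelled"},
    usage_rules=["Do not use for payment status"],
)

print(status_code.validate("02"))  # True
print(status_code.validate("99"))  # False
```

An MCP server can return exactly this kind of record alongside the raw schema, so the agent receives both structure and meaning in one response.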

Layer 3: The Enterprise Data Catalog – The Unified AI Brain

Modern data catalogs (such as OpenMetadata, DataHub, Collibra, or Alation) combine INFORMATION_SCHEMA structure, ISO 11179 semantics, lineage, usage stats, ownership, quality scores, and sensitivity tags.

Leading catalogs now integrate natively with MCP. AI agents can discover popular joins, check data quality, respect PII policies, and access the business glossary—all in one place.

A stale or incomplete catalog turns MCP into a fast pipeline for poor context. Rich, living metadata transforms it into a powerful enabler of trustworthy AI.

The Payoff: Higher ROI from AI

Organizations investing in metadata should see clear gains:

  • AI-generated SQL succeeds far more often on the first attempt.
  • Built-in governance reduces compliance risks.
  • Onboarding for new agents and analysts accelerates dramatically.
  • Safe cross-silo analysis becomes possible because context is reliable.

Neglecting metadata wastes AI investments and slows production deployments. MCP amplifies both the benefits of good metadata and the costs of poor metadata.

90-Day Action Plan

  1. Weeks 1–2: Audit and strengthen INFORMATION_SCHEMA. Add comments, enforce keys, and expose it via MCP.
  2. Weeks 3–6: Populate ISO 11179 definitions for priority domains (customer, finance, operations). Assign stewards.
  3. Weeks 7–10: Upgrade to an MCP-compatible catalog. Import schema and semantic data; activate usage analytics.
  4. Weeks 11–12: Test with live AI agents. Compare query success rates with and without enriched metadata.
  5. Ongoing: Automate updates through CI/CD, dbt, and quality gates. Include metadata tasks in team goals.

Most required capabilities already exist in your environment. The work is mainly about consistent, disciplined use.

Final Word

MCP and AI agents are powerful multipliers—but they multiply the quality of your metadata. To unlock accurate, auditable, high-impact AI that truly drives business value, stop treating metadata as optional documentation.

Fully maintain your INFORMATION_SCHEMA, ISO/IEC 11179 registry, and data catalog (as applicable, of course). Expose them through MCP servers. Do it now, so your investments in AI can pay off.

Your AI systems—and your company—will benefit immediately.

(This blog post was formatted and developed with the assistance of Grok)

Vulnerability Scanning in the Cloud – Part 1

This is the first in what may become a series of posts regarding vulnerability scanning in the cloud and some of the related challenges and helpful tips.

I started looking around for good sources or posts on the topic of vulnerability scanning in the cloud, specifically an infrastructure-as-a-service (IaaS) scenario for private or public cloud. I didn’t find anything.

How it starts

When you get the email or call that goes something like, “Hey, what are we scanning in our <vendor name here> cloud?” In a perfect world you just say, “When we worked with everybody to set up and plan our cloud usage, we had vulnerability scanning designs built in from the beginning. We are good.”

Or, you start sweating and realize nobody ever brought you in to any planning or discussions and there are already large cloud deployments that you aren’t scanning.

Or maybe you are a consultant or service provider going in to an environment and setting up vulnerability scanning in a customer cloud. These posts should be helpful for people that are in the planning stages or trying to “catch up” with their cloud groups or customers.

Dynamic nature of the cloud

Most clouds give you the ability to dynamically provision systems and services, and as a result, you dynamically provision IP addresses. Sometimes these IP addresses are from a certain range, and often, especially for Internet facing systems, these IP addresses are from a large pool of addresses shared with other customers.

In these large dynamic ranges, it is common for the IP address you used today to be used by another customer tomorrow.

This dynamic nature is great for operations, but it can cause some challenges in tracking assets.

Asset management is different

Traditional vulnerability management has been very tied to IP addresses and/or DNS names. In cloud scenarios, assets are often temporary, or may not have DNS names. Sometimes your DNS names for PaaS-type services are provisioned by the cloud provider, with little or no control from your IT group.

Most cloud providers have their own unique identifiers for assets. These unique identifiers are what need to be used for asset tracking; IP addresses, and sometimes DNS names, are just transient metadata for your asset.

Also, cloud has different types of “objects” that can be given IP addresses, beyond traditional compute system interfaces. Certain services dedicated to your tenancy/account can be provisioned from a PaaS solution, and they get their own IP address. Are these your assets? Many times you may have some control over the content and data on these services even though you don’t manage most of the underlying solution.

In general, the whole approach to asset management in cloud is that your assets are tracked by the cloud provider, and you use their APIs to query and gather information on your assets.

Your vulnerability analysis and asset analysis need to become dynamic, based on the data returned from your asset queries. This is definitely not a bad thing. Most big companies struggle with solid asset management because there are always ways to circumvent traditional asset management. (This is why network-traffic-based asset management is becoming so popular.)

Now, with cloud, as long as you are using the API and know what tenancies you have, you can get a good list of assets. However, this list is short-lived, so you need to consistently query the APIs to keep it current. Some cloud providers can provide a “push” notification or “diffs” of what has come online or gone away in X amount of time. I think that is the future best practice of cloud asset management: real-time visibility into what is coming and going.
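The query-and-diff loop can be sketched in a few lines. The inventory shape and asset IDs below are made up; real data would come from your provider's list-assets API. Note how an IP address gets reused by a different asset between snapshots, which is exactly why the provider's asset ID, not the IP, has to be the key.

```python
# Sketch: diff two asset inventory snapshots keyed by the cloud
# provider's unique asset ID (never by IP address).

def diff_inventories(previous: dict, current: dict):
    """Return assets that appeared, disappeared, or changed IP
    between two inventory snapshots keyed by provider asset ID."""
    added = {aid: current[aid] for aid in current.keys() - previous.keys()}
    removed = {aid: previous[aid] for aid in previous.keys() - current.keys()}
    moved = {
        aid: (previous[aid]["ip"], current[aid]["ip"])
        for aid in current.keys() & previous.keys()
        if current[aid]["ip"] != previous[aid]["ip"]
    }
    return added, removed, moved

# Hypothetical snapshots: note 203.0.113.9 is reused by a new asset.
yesterday = {"i-111": {"ip": "10.0.0.5"}, "i-222": {"ip": "203.0.113.9"}}
today = {"i-111": {"ip": "10.0.0.7"}, "i-333": {"ip": "203.0.113.9"}}

added, removed, moved = diff_inventories(yesterday, today)
print(sorted(added))    # ['i-333']
print(sorted(removed))  # ['i-222']
print(moved)            # {'i-111': ('10.0.0.5', '10.0.0.7')}
```

Running this on a schedule approximates the "push/diff" capability until your provider offers it natively.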

 

Capacity is costly

One major concept and value of cloud is only using and paying for capacity you need.

When it comes to information technology, this “costly capacity” in IaaS essentially comes down to:

  1. Network usage (sending data over the network)
  2. Storage usage (disk space, object space, etc.)
  3. Compute usage (CPU)

Classic vulnerability scanning can typically be performed in two different ways:

  1. Either scanning over the network from a scanning system, or
  2. By installing a local agent/daemon/service on the host that reports up the vulnerability data.

Both of these approaches use all three types of capacity mentioned above in your cloud, but mostly network and CPU.

Scanning over the network — Network Usage

Your cloud vendor’s software-defined networking can have huge capacity, or it could remind you of early-’90s home networking.

One of the major considerations for network based scanning is determining where your bottlenecks are going to be.

  • Do you have virtual gateways or bandwidth caps?
  • Do you have packet rate caps?
  • Are you trying to scan across regions or networks that may be geographically dispersed, with high latency and/or low bandwidth?

Cloud networking doesn’t just “work”; in many cases it is far more sensitive than physical networks. You need to carefully look at the network topology for your cloud implementations and base scanner placement on your topology and bottleneck locations. Depending on your network security stack, you may even need or want to avoid scanning across those stacks.

Agents

Agent-based scanning is becoming one of the preferred options in some cloud IaaS implementations, because every host can simply report up its vulnerability data when it comes online. This is a nice approach if you have good cooperation from your infrastructure groups to allow your agent to be deployed to all systems.

However, agents likely will not be able to go on every type of resource or service with an IP, such as third-party virtual appliances. You will still need network scanning to inspect some virtual systems or resource types, such as PaaS-deployed services.

Most agents also typically lack the ability to see services from the perspective of the “network,” which is often where the most risk resides. For example, they can’t talk to all the services and see the ciphers or configurations being exposed to network clients.
 
So, regardless of what you may have been told, there is no cloud- or vendor-provided vulnerability scan agent that will give you full visibility into your cloud resources. You still need network scans.
 

Even though agents won’t solve all your problems, you probably won’t hit packet rate caps or throughput issues with them, since they mostly just push up their data in one stream on a regular schedule. So agents can allow you to avoid some of the network issues you might hit otherwise.

 
Here are some questions you need to consider for vulnerability scanning in the cloud:
 
  • How much CPU impact will there be from network scanning or agent scanning? The act of scanning will use some capacity.
 
  • Should you size your cloud capacity to allow for vulnerability management? (yes)
 
In summary, vulnerability management in the cloud is different.
 
Why?
 
  • Dynamic assets.
  • API-driven asset management.
  • Cloud has more “things” as a service than one solution can handle:
    • Container services
    • PaaS
    • Functions/serverless
    • SaaS/services

How to handle vulnerability management in the cloud?

  • Take a look at all the services your cloud provider offers that you are planning to use.
  • Create an approach for each type of scenario and thing that will be used.
  • Some cloud providers are starting to build some vulnerability management natively into their platforms. Leverage these native integrations as much as possible.

Scanning Large Container Registries

As container technology adoption grows, the need to provide governance and inspection of these containers and platforms also grows.

One of the nice things about container images is that they are easier to analyze than a traditional application (which may be spread across many directories and files) since everything you need to analyze exists in that container image somewhere.

Container vulnerabilities bring a converged vulnerability footprint of both application and operating system package vulnerabilities. This means your container needs to be treated like an application in some respects, but you also need to analyze the dependencies that are alongside the application inside the container, which are often Linux packages in the case of Linux-based containers.

Most of the container scanning solutions out there are fairly immature, in that they still mostly treat containers like virtual machines. They ask the container to dump out its package list (dependencies) and create a finding if a package is not at the latest version. Unfortunately, this approach completely ignores the application and/or application runtime itself in many cases. As container scanning solutions mature, they will need to differentiate themselves by how well they can analyze the applications and application runtimes that exist in containers.

Given this lack of toolset convergence, one good approach is to:

  • Scan & analyze application artifacts before they are allowed to be layered onto a container, then
  • Scan the container itself after it is built.

This way you are covering both the application and its dependencies.

Some challenges with scanning container repositories and registries:

  • Huge registries and/or repositories of container images.

Some large registries may have hundreds or thousands of different repositories. Each repository could have hundreds of container images. This can easily lead to registries that have tens or hundreds of thousands of container images. I imagine we will soon see registries with millions of container images if they don’t already exist.

Most container scanners know not to rescan things they have already seen, but the first scan on large registries can take a very long time in many cases.

This huge volume of containers can cause a few challenges, and here are some ideas on how to overcome those challenges.

  • Your repo/registry scanner must be designed to scale out or up to handle tens of thousands of containers. This usually means…
  • The container scanner backend must track the container layer hashes and container image hashes to know what it has not already scanned. It obviously shouldn’t rescan layers or images it has already scanned.
  • The container scanner backend must be able to handle multiple concurrent scans against multiple images or repositories, and it should be able to scale up if needed. This means your scanner backend design has to handle multiple concurrent scanners and distribute work between them properly.
  • The container scanner should implement shortcuts to know if it has already scanned images from a registry without necessarily checking every layer and image hash. If you pull down a registry manifest with 10,000 images, the next time you pull the manifest you should diff the two manifests to determine which images are “new” and scan those first.
  • A good approach is for container scanner companies to “pre-load” containers and container layers from public registries. This way you may be able to avoid even having to scan many of the layers of the containers.
  • Container scanners should natively support the main container registries in cloud providers like Azure, Google, etc., by knowing how to use their APIs well enough to access the container registries and repositories they provide.
  • A container scanner should usually scan in a LIFO approach, taking newer images first. This can be difficult because container tags and version tags are not very structured. You can try to scan all “latest” tags first. One field I think would be valuable to add to the Docker registry manifest is the timestamp of the image. Since tags are not structured enough to be reliable, you could use the timestamp or epoch to at least know when the container was last modified or placed in a repo.
  • You want to use the LIFO approach because newer containers are the ones most likely to be used, and the ones that need to be analyzed as part of CI/CD integrations.
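Several of the ideas above (layer-hash deduplication, manifest diffing, and newest-first ordering) can be sketched together. The manifest shape, digests, and the pushed-at epoch field are illustrative, not a real registry API.

```python
# Sketch: plan what to scan from a registry manifest, newest image
# first, skipping layers whose digests we have already scanned.
# The manifest/layer shapes here are made up for illustration.

def plan_scans(manifest, seen_layers):
    """Return (image_digest, unseen_layers) work items, newest image
    first, deduplicating layers across images as we go."""
    plan = []
    for image in sorted(manifest, key=lambda i: i["pushed_at"], reverse=True):
        unseen = [layer for layer in image["layers"] if layer not in seen_layers]
        if unseen:
            plan.append((image["digest"], unseen))
            seen_layers.update(unseen)  # never scan the same layer twice
    return plan

manifest = [
    {"digest": "sha256:aaa", "layers": ["l1", "l2"], "pushed_at": 100},
    {"digest": "sha256:bbb", "layers": ["l2", "l3"], "pushed_at": 200},
]
seen = {"l1"}  # e.g. pre-loaded from a public base-image registry

print(plan_scans(manifest, seen))
# [('sha256:bbb', ['l2', 'l3'])] -- the older image needs no new work
```

The older image drops out entirely because its layers are covered by the pre-loaded set and the newer image's scan, which is the whole point of tracking work by layer hash rather than by image.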

Those are my thoughts on scanning large container registries and repositories. Do you have any thoughts on optimizing container scanning for large registries? I imagine similar work has been done on different types of file or artifact scanning in the past. It seems like we always try to “reinvent the wheel” in security products for some reason.

3 Types of Vulnerability Scanning – Pros and Cons

The 3 Main Types of Vulnerability Scanning Approaches

 

There are three major types of vulnerability scanning you can use on your networks. Most large organizations will have to use all three (or at least a couple) of these methods.

  • Unauthenticated Network Based Scanning

  • Authenticated Network Based Scanning

  • Agent Based Scanning

This post will go over the differences between these methods and explain why a combination of methods is typically needed. (This covers standard network and host scanning; containers will be covered in a different post.) Yes, passive network scanning exists too, but I don’t feel knowledgeable enough on it yet to speak to it.

Back in 2011 I posted a quick explanation of some of the differences between authenticated and unauthenticated scans. Not much (if anything) has changed since then regarding the differences between those two types of scans. However, I will add some more detail in this post.

Unauthenticated Network Based Scanning

These are scans that you run from a system with “scan engine” software or from an appliance-type system. These scans run across a network, targeted at other systems, without knowing anything about the targets other than the IP address or DNS name.

No credentials are provided in these types of scans.

The unauthenticated scan has to mostly guess at everything it tells you about the target system, because all it can do is probe the ports and services you have open and try to get them to give up information.

  • Cons – More false positives. (It is guessing.)
  • Cons – Less detailed information. (It is still guessing.)
  • Cons – May require more network connections than authenticated scans.
  • Cons – You are more likely to impact legacy services or applications that do not have authentication or input sanitization.
  • Cons – You have to maintain access to your targets through firewalls, IDS, IPS, etc.
  • Cons – You have to manage a scanner system (or systems).
  • Pros – Shows only the highest-risk issues.
  • Pros – Gives you a good view of the least capability an attacker on your network will have. Any script kiddie will be able to see anything an unauthenticated scan shows you.
  • Pros – Is usually faster than an authenticated scan.
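To make the “guessing” concrete, here is a minimal banner grab, which is the essence of an unauthenticated check: connect to a port and infer what is running from whatever the service volunteers. It is demonstrated against a throwaway local service announcing a made-up version string, standing in for a real daemon.

```python
import socket
import threading

def grab_banner(host, port, timeout=2.0):
    """Connect and read whatever the service sends first. This is all
    the information an unauthenticated scanner has to reason from."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.settimeout(timeout)
        try:
            return s.recv(256).decode(errors="replace").strip()
        except socket.timeout:
            return ""

# Throwaway local "service" announcing a fabricated version string:
def fake_service(srv):
    conn, _ = srv.accept()
    conn.sendall(b"SSH-2.0-ExampleServer_1.0\r\n")
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=fake_service, args=(srv,), daemon=True).start()

banner = grab_banner("127.0.0.1", srv.getsockname()[1])
srv.close()
print(banner)  # the scanner must now infer version and vulnerabilities
```

Everything the scanner reports from here is inference from that one string, which is why false positives and thin detail are inherent to this method.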

Authenticated Network Based Scanning

These are scans that you run from a system with “scan engine” software or from an appliance-type system. These scans run across a network, targeted at other systems, but provide login credentials that allow the network scanner to get a command shell (or similar access) so it can simply run commands and check settings on the targeted system. This allows much more accurate and detailed information to be returned.

You will never get 100% authenticated-scanning success on large networks because of the variety of system types and authentication methods required. You will probably not be able to get into every appliance, printer, IoT device, etc., so 100% is not typically a realistic goal for diverse environments.

  • Pros – Fewer false positives. (Much less guessing.)
  • Pros – More detailed information. (Again, it doesn’t have to guess anymore.)
    • You can now see things like missing patches, specific OS versions, and locally installed third-party client software versions.
  • Pros – May require fewer network connections than unauthenticated scans.
  • Pros – You are less likely to impact third-party legacy services or applications that do not have authentication or input sanitization, because the scanner doesn’t have to guess about the service.
  • Pros – You can now gather configuration information off the system to help feed a CMDB or perform configuration baseline checks. You are now a configuration-checking tool and not just a vulnerability-checking tool.
  • Cons – Still has most of the same impacts on custom-written socket servers/services.
  • Cons – You are now awash in a sea of vulnerability data about each system.
  • Cons – Risk assessment requires more analysis, because instead of a handful of findings from an unauthenticated vulnerability scan, you may now have 30-40 findings.
  • Cons – Is often slower than an unauthenticated scan, because it is running specific commands from a shell on the system and waiting for the returns. This is not always the case, and in some cases authentication may speed up scans.
  • Cons – You have to maintain access to your targets through firewalls, IDS, IPS, etc.
  • Cons – You have to manage a scanner system (or systems).

Agent Based Scanning

Agent-based scanning requires the installation of a daemon/agent on Linux and Unix systems, or a “service” on Windows systems. I will refer to this as an “agent” from now on.

The agent is installed locally on the targeted systems, runs on a schedule, and reports the data up to a centralized system or SaaS service. Vulnerability scan agents are usually fairly lightweight, but the different variations and vendors all have their own quirks. I highly recommend you perform testing on a variety of systems, and talk to existing clients with similar environments who use the vendor’s agents, before going with this approach.

One of the big pitfalls with an agent is that it cannot fully exercise the target system’s network stack the way a network-based scanner can. So if you have an nginx service that is misconfigured, the agent likely won’t report that as an issue, while a network-based vulnerability scan would.

This lack of capability to simulate a network client is the big gap in agent functionality. As a result, you cannot truly get a “full” vulnerability picture without running at least an additional network based scan. In some cases, the agent data may be good enough, but that is a decision up to each organization.

Agents are good solutions for systems like mobile laptops that may rarely be on the corporate network, or for systems like some public cloud scenarios, where you can’t maintain full network scanner access across a network to the target host.

  • Pros – Fewer false positives. (Much less guessing; the agent is installed on the system and just asks for the information.)
  • Pros – More detailed information. (Again, it doesn’t have to guess anymore.)
    • You can now see things like missing patches, specific OS versions, and locally installed third-party client software versions.
  • Pros – Requires far fewer network connections. Usually just an outbound push of data.
  • Pros – The system with the agent can report up its data from anywhere to your SaaS backend, or potentially into an internet-connected backend if that is your design scenario. The scanner simply resides with each host.
  • Pros – You are less likely to impact third-party legacy services or applications that do not have authentication or input sanitization, because the agent doesn’t talk to the network stack and services like a network client.
  • Pros – You can now gather configuration information off the system to help feed a CMDB or perform configuration baseline checks. You are now a configuration-checking tool and not just a vulnerability-checking tool.
  • Pros – You have to maintain far less network access (usually just an outbound connection). IDS, IPS, WAFs, etc. don’t matter anymore.
  • Cons – You are now awash in a sea of vulnerability data about each system.
  • Cons – Risk assessment requires more analysis, because instead of a handful of findings from an unauthenticated vulnerability scan, you may now have 30-40 findings.
  • Cons – You now have an agent and a piece of software on every target system that you (or some team) have to own and manage. Since every company has slightly different ways this is done, it adds a layer of complexity and overhead compared to running a scan across the network.
  • Cons – You now have to manage an agent, and you are now a customer and user of every target system.
  • Cons – Your agent may (will) get blamed, sometimes rightly so, for impacting performance on a system.

So what is the best solution?

Like almost everything in IT and IT Security, the best solution depends on your requirements. Most larger organizations want the verbose data that an authenticated scan or agents provide.

With most people using laptops these days, classic network based vulnerability scanning is going to miss a lot of assets that an agent will be able to cover.

Datacenter implementations may be covered fine with authenticated scanning; in that scenario, not having to manage an agent or be called in to every performance issue (because you have something running on the system) may reduce headaches.

Public IaaS hosts may require unauthenticated scanning from an Internet-based scanner, plus an agent on the host, to get the full picture.

Ultimately, the right approach is the one that meets your requirements and fits within your funding and capabilities.

Payment Card Security In The News

On Feb 4th, 2014, I gave a high level presentation to our Northwest Arkansas ISSA chapter regarding Payment Card Security. Unfortunately, the roads were icy that day, so there were only a few of us in attendance.

I felt like this was a presentation that both technical and non-technical attendees would find interesting due to all of the credit card security topics that had been in the news over the holidays.

Below is a LibreOffice Impress document with the contents of the presentation.

Payment_Card_Security_Feb_2014

When Is the Best Time To Run Vulnerability Scans?

It Depends…

There are several factors to consider when determining the times to run vulnerability scans.

Is this the first time you have run this scan?

Is the scan going to run against an ecommerce site?

Do you have standing approval from your operational areas to run a scan?

Do you have security monitoring and logging systems that will alert on the scanning?

Contact the administrators of your websites to determine the best times to run a vulnerability scan.

Most site admins will know their peak periods of website activity; it is best to avoid those periods for routine scanning, simply because scans increase the load on the site.

Scans can often cause increased error logging and alerting, so you need to be extra diligent and careful the first time you run scans. Assume that you may break things the first time.

  • Talk to the stakeholders for the systems you are scanning to determine the best time to scan.
  • Notify the stakeholders and any support areas that may be involved if there are issues or alerts generated by the scan.
  • Follow your normal change control management procedures and treat initial scans like a system change.

One piece of information your stakeholders will need is the source addresses your scans will originate from. They may want to whitelist or ignore those IP addresses in their monitoring.

If you are able to perform vulnerability scanning on your network and e-commerce sites without anybody noticing, then you likely have a gap in your ability to detect malicious scanning also. 🙂