MWMS Deep Search Quality And Observability Framework

System: MWMS

Document Type: Framework

Status: Active

Authority: HeadOffice

Applies To: Research Brain, Data Brain, Affiliate Brain, Ads Brain, Content Brain, Experimentation Brain, HeadOffice Intelligence, AI Agent Operations, Future AI Employees

Primary Location: MCR

Future Operational Destination: mwmsbrain.site, mwmsheadofficebrain.site, Future AI Employee Dashboards

Parent Page: HeadOffice

Source Of Truth: MCR

Course Source: Matt Pocock AIhero, Build DeepSearch In TypeScript; and AI Automations by Jack, Fact Checking, Browser Copilot, Multi-Model Research, Client Intelligence Research, and Evidence Synthesis Block

Absorption Status: Approved For Integration

Version: v1.1

Last Reviewed: 2026-06-21

Purpose

This framework defines how MWMS evaluates, monitors, improves, and governs Deep Search style AI Employees.

A Deep Search AI Employee is any AI workflow that:

researches external information

searches the web

inspects sources

crawls pages

evaluates evidence

checks factual claims

compares supporting and opposing evidence

produces recommendations

creates decision-ready intelligence

prepares meeting research

validates business information

supports client intelligence

This framework ensures that Deep Search outputs are not judged by appearance, confidence, source quantity, model agreement, or “it sounds good” alone.

MWMS Deep Search outputs must be judged by measurable quality standards including:

factuality

claim precision

relevance

source quality

source independence

freshness

evidence sufficiency

contradiction handling

confidence calibration

cost control

latency

traceability

business usefulness

This framework also defines how observability, metadata, database activity, tool use, crawler activity, claim evaluation, source relationships, model review, and evaluation results must be captured so HeadOffice can understand:

what happened

why it happened

which evidence was used

whether evidence supported the conclusion

whether sources were genuinely independent

whether the output was useful

whether the AI Employee should be trusted

whether more research is required

whether the workflow should be improved, restricted, or escalated

Scope

This framework applies to any MWMS system, Brain, AI Employee, workflow, dashboard, browser copilot, extension, or automation that performs:

research

retrieval

fact checking

claim validation

evidence extraction

source analysis

external intelligence generation

company research

client research

market research

policy checking

competitive research

multi-model synthesis

decision preparation

This includes:

Research Brain search workflows

Affiliate Brain offer research

Ads Brain market and compliance research

Content Brain topic and SEO research

HeadOffice newsletter intelligence

Data Brain source and signal validation

Experimentation Brain test analysis

AI Employee task execution

client-facing AI research tools

AIBS research and reporting systems

Deep Search agents

fact-checking browser copilots

Chrome extension research tools

pre-call research systems

automated client intelligence reports

multi-model research workflows

any system that uses search, crawling, scraping, source inspection, or external knowledge retrieval

This framework does not define exact implementation code, TypeScript architecture, Langfuse setup, Evalite setup, crawler packages, browser extension code, search provider configuration, or vendor-specific tooling.

Those belong to developer implementation notes.

This framework defines the MWMS operating standard.

Core Principle

A Deep Search AI Employee is only useful if its output is:

factually grounded

claim specific

relevant to the task

based on inspectable sources

supported by sufficiently independent evidence

current enough for the decision being made

clear about disagreement

clear about missing evidence

traceable from request to result

measurable against success criteria

cost controlled

fast enough for the use case

logged for review

improvable through Kaizen

The model response is not the system.

The full system includes:

user request

Brain or Employee assignment

claim or question definition

prompt

model call

search query

tool call

source discovery

source selection

crawler or scraper action

evidence extraction

supporting evidence

challenging evidence

source relationship analysis

database read or write

observability trace

evaluation score

confidence assessment

final output

human review

stored decision record

improvement loop

Definition Of Deep Search In MWMS

Deep Search is the structured process of moving beyond a shallow AI answer by combining:

question clarification

claim decomposition

source inspection

evidence extraction

source comparison

contradiction analysis

synthesis

evaluation

decision-ready output

A proper MWMS Deep Search workflow should include:

understanding the task

identifying whether the task is time sensitive

identifying the exact claims that require verification

creating one or more search paths

retrieving possible sources

selecting sources worth inspecting

opening or crawling selected sources

extracting useful evidence

checking freshness and relevance

identifying source dependence

finding supporting and opposing evidence

distinguishing fact from interpretation

synthesising findings

producing a decision-ready answer

logging the full process

evaluating the output against success criteria

routing improvements through Kaizen

A shallow AI answer is not Deep Search.

A search-result summary is not Deep Search.

A scraped page alone is not Deep Search.

A source list without evidence comparison is not Deep Search.

Multiple models repeating similar conclusions is not independent verification.

Deep Search requires source-backed reasoning and measurable quality control.

Claim Decomposition Requirement

Deep Search should begin by identifying the exact claim or decision question.

Broad requests should be decomposed into testable claims where appropriate.

Examples

Broad request:

“Is this company trustworthy?”

Possible claims:

the company is legally registered

the company has operated for a stated period

the company has credible customer feedback

the company has unresolved regulatory action

the company makes verifiable performance claims

Broad request:

“Does this product work?”

Possible claims:

the product contains the stated ingredients

the stated mechanism is scientifically plausible

clinical evidence exists

the claimed outcome matches the evidence

the product is safe for the intended user

Claim Record Standard

Each material claim should record:

Claim ID:

Original Claim:

Normalised Claim:

Claim Type:

Claim Origin:

Originating URL Or Source:

Claimant:

Date Claimed:

Time Sensitivity:

Required Evidence Standard:

Current Status:

Rule

The system must validate the actual claim being made.

It must not replace a difficult claim with an easier adjacent question.
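
As an illustration of the Claim Record Standard, the following Python sketch shows one way a broad request could be decomposed into claim records. It is a sketch under stated assumptions, not a defined MWMS schema: the class name `ClaimRecord`, the helper `decompose`, the `C-001` style identifiers, and all default values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ClaimRecord:
    """Hypothetical record mirroring the Claim Record Standard fields."""

    claim_id: str
    original_claim: str
    normalised_claim: str
    claim_type: str = "unclassified"
    claim_origin: str = "user request"
    originating_source: Optional[str] = None
    claimant: Optional[str] = None
    date_claimed: Optional[date] = None
    time_sensitivity: str = "unknown"
    required_evidence_standard: str = "unspecified"
    current_status: str = "open"


def decompose(request: str, testable_claims: List[str]) -> List[ClaimRecord]:
    """Create one record per testable claim, keeping the original wording."""
    return [
        ClaimRecord(
            claim_id=f"C-{number:03d}",  # illustrative ID format only
            original_claim=request,
            normalised_claim=claim,
        )
        for number, claim in enumerate(testable_claims, start=1)
    ]
```

Keeping the original wording next to each normalised claim makes it possible to check later that the claim evaluated is the claim that was actually made.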

Claim Types

Possible claim types include:

factual claim

numerical claim

historical claim

current-status claim

causal claim

comparative claim

predictive claim

legal or regulatory claim

medical or safety claim

performance claim

marketing claim

opinion presented as fact

Different claim types require different evidence standards.

Date Awareness Requirement

Any AI Employee working with external information must be date aware.

The AI Employee must understand:

the current date

whether the user request depends on current information

whether the source may be outdated

whether the answer changes over time

whether the source has a publication date, update date, or no visible date

whether recency should affect confidence

whether older sources remain valid or should be treated as historical

whether the event date differs from the publication date

whether later corrections or updates exist

Date awareness is mandatory for:

Affiliate offer research

Google Ads policy research

compliance checks

tool reviews

AI platform updates

pricing checks

newsletter intelligence

market trend research

search engine or platform behaviour

product availability

legal, financial, medical, or policy-related topics

current events

competitor monitoring

business leadership

company status

software capabilities

If freshness matters and the AI Employee cannot confirm current information, the output must state uncertainty and reduce confidence.

Source Freshness Rules

Each source used by a Deep Search workflow should be evaluated for freshness.

Freshness assessment should consider:

publication date

event date

updated date

retrieval date

whether the page appears active

whether the topic is time sensitive

whether newer conflicting information may exist

whether the source is evergreen or unstable

whether the source is official, secondary, outdated, archived, promotional, or user generated

whether the source has been corrected

whether the source is referring to older evidence

Stable Information

Stable information may use older sources if the concept does not change often.

Examples:

general frameworks

historical facts

evergreen business principles

basic technical concepts

Moderately Changing Information

Moderately changing information should prefer newer sources.

Examples:

tool features

platform workflows

pricing pages

SEO practices

marketing channel tactics

company services

Highly Time-Sensitive Information

Highly time-sensitive information must use current sources wherever possible.

Examples:

policy changes

affiliate payouts

product availability

ads platform rules

compliance rules

current events

laws and regulations

software versions

AI model capabilities

market trends

corporate leadership

financial performance

Rule

Freshness must be evaluated against the claim, not merely against the age of the webpage.
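
The rule above can be sketched as a check that compares a source date with how quickly the subject of the claim changes. This is a minimal Python sketch: the function name, the three volatility labels, and especially the age limits in days are placeholder assumptions, not MWMS policy values.

```python
from datetime import date
from typing import Dict, Optional

# Placeholder age limits in days. Real limits would be set per claim type.
MAX_AGE_DAYS: Dict[str, Optional[int]] = {
    "stable": None,               # older sources may remain valid
    "moderate": 365,              # assumed limit, for illustration only
    "highly_time_sensitive": 90,  # assumed limit, for illustration only
}


def freshness_status(volatility: str, source_date: Optional[date], today: date) -> str:
    """Judge a source date against the volatility of the claim it is used for."""
    if source_date is None:
        return "undated"
    limit = MAX_AGE_DAYS[volatility]
    if limit is None:
        return "acceptable"
    age_days = (today - source_date).days
    return "acceptable" if age_days <= limit else "stale_for_claim"
```

The same page can therefore be acceptable for a historical claim and stale for a current-status claim, which is the distinction the rule draws.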

Retrieval Quality Rule

Deep Search quality depends on retrieval quality before reasoning quality.

A weak retrieval layer produces weak answers, even if the model sounds confident.

The retrieval layer must be evaluated for:

search query quality

claim coverage

number of search attempts

diversity of sources

source reliability

source relevance

source freshness

source independence

page accessibility

extracted content quality

failure handling

duplicate source handling

evidence sufficiency

supporting and opposing evidence

A Deep Search AI Employee should not rely only on search snippets when a decision requires deeper evidence.

Where appropriate, the system should inspect selected sources directly.

Search Path Standard

A strong Deep Search workflow may use several search paths.

Possible paths include:

direct claim search

official source search

primary evidence search

independent verification search

contradictory evidence search

recent update search

regulatory search

expert analysis search

user experience search

archived evidence search

Rule

One search query should not be treated as sufficient merely because it produced a plausible answer.

Crawler And Source Inspection Rules

When a crawler, scraper, browser tool, or source inspection tool is used, the AI Employee must treat retrieved page content as evidence, not automatic truth.

Crawler and scraper workflows should record:

source URL

source title

retrieval time

publication date where available

access status

extraction success or failure

extracted content summary

content length or completeness

whether the content appeared usable

whether important content may have been hidden, blocked, or missing

whether the source should be trusted

whether the source was used in the final answer

which claim the source relates to

whether the source supports, challenges, or contextualises the claim

Crawler failures must not be hidden.

If source inspection fails, the AI Employee should take one or more of the following actions:

try another source

use the search result only with reduced confidence

escalate to human review

state that evidence was insufficient
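
A minimal Python sketch of recording an inspection attempt and choosing a visible fallback is shown here. The record carries only a subset of the fields listed above, and the order in which the fallback options are tried is an assumption made for this example; the framework itself does not prescribe an order. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InspectionRecord:
    """Hypothetical subset of the crawler record fields."""

    source_url: str
    retrieval_time: str        # ISO 8601 timestamp
    access_status: str         # e.g. "ok", "blocked", "timeout"
    extraction_succeeded: bool
    claim_id: str
    used_in_final_answer: bool = False
    failure_reason: Optional[str] = None


def fallback_action(
    record: InspectionRecord,
    alternatives_left: int,
    snippet_available: bool,
    high_risk: bool,
) -> str:
    """Pick an explicit next step so that a failure is never silently ignored."""
    if record.extraction_succeeded:
        return "use_extracted_evidence"
    if alternatives_left > 0:
        return "try_another_source"
    if high_risk:
        return "escalate_to_human_review"
    if snippet_available:
        return "use_search_result_with_reduced_confidence"
    return "state_insufficient_evidence"
```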

Source Reliability Classification

MWMS Deep Search workflows should classify sources where possible.

Official Source

Examples:

vendor documentation

platform policy pages

government pages

official product pages

company announcements

regulatory databases

Default trust level:

High for what the organisation officially states, but still checked for bias, scope, and recency.

Primary Evidence Source

Examples:

research paper

court record

financial filing

official dataset

direct interview

original report

technical specification

Default trust level:

High where authentic, relevant, and correctly interpreted.

Expert Source

Examples:

industry analysis

recognised experts

specialist publications

technical blogs from credible practitioners

Default trust level:

Medium to high.

Independent News Or Analysis Source

Examples:

reputable journalism

independent investigation

established research organisation

Default trust level:

Medium to high depending on evidence and subject.

Commercial Source

Examples:

affiliate sales pages

product reviews with monetisation

vendor comparison pages

agency landing pages

Default trust level:

Medium to low unless corroborated.

User Generated Source

Examples:

forums

comments

social media posts

Default trust level:

Signal only, not proof.

Unknown Or Low Trust Source

Examples:

scraped reposts

thin content sites

anonymous blogs

outdated pages

AI-generated content farms

Default trust level:

Low.
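
The classification above can be represented as a simple lookup. In this Python sketch the key names and the function `default_trust` are hypothetical; the trust labels are taken from the defaults listed above, and treating an unrecognised class as low trust is an assumption.

```python
from typing import Dict

# Default trust levels as listed in the classification above.
DEFAULT_TRUST: Dict[str, str] = {
    "official": "high",
    "primary_evidence": "high",
    "expert": "medium_to_high",
    "independent_news_or_analysis": "medium_to_high",
    "commercial": "medium_to_low",
    "user_generated": "signal_only",
    "unknown_or_low_trust": "low",
}


def default_trust(source_class: str) -> str:
    """Return the starting trust level; unrecognised classes are treated as low."""
    return DEFAULT_TRUST.get(source_class, "low")
```

The result is only a starting point. Bias, scope, recency, and corroboration still need to be checked for each source.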

Source Independence Standard

Multiple sources do not necessarily equal multiple independent confirmations.

Sources may repeat:

the same press release

the same research paper

the same anonymous claim

the same syndicated article

the same company statement

the same data provider

the same unverified social post

The system should identify where several sources depend on one original source.

Source relationship fields should include:

Original Source ID:

Derived From Source ID:

Syndicated: Yes / No

Shared Evidence Base: Yes / No

Independent Reporting: Yes / No / Unclear

Rule

Ten pages repeating one report are one evidence chain, not ten independent confirmations.
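
Counting independent evidence chains can be sketched by following each source back to the source it was derived from. This Python sketch assumes a simple mapping from a source ID to its "Derived From Source ID" (or `None` for an original source); the function names and IDs are hypothetical.

```python
from typing import Dict, Optional


def chain_root(source_id: str, derived_from: Dict[str, Optional[str]]) -> str:
    """Follow 'derived from' links until an original source is reached."""
    seen = set()
    current = source_id
    while derived_from.get(current) is not None:
        if current in seen:
            # Circular links are a data error; pick one stable representative.
            return min(seen)
        seen.add(current)
        current = derived_from[current]
    return current


def count_independent_chains(derived_from: Dict[str, Optional[str]]) -> int:
    """Count distinct original sources behind all recorded sources."""
    return len({chain_root(source_id, derived_from) for source_id in derived_from})
```

Sources that share a root form one evidence chain, so the count reflects how many origins stand behind the evidence rather than how many pages were collected.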

Evidence Role Standard

Each source used should be assigned an evidence role.

Possible roles include:

supports claim

challenges claim

partially supports claim

provides context

defines terminology

supplies original data

repeats another source

offers expert interpretation

provides user experience signal

does not materially support conclusion

Rule

A citation is not automatically supporting evidence.

Evidence Extraction Standard

The system should extract the exact evidence relevant to the claim.

Each evidence record should include:

Evidence ID:

Claim ID:

Source ID:

Evidence Summary:

Evidence Location:

Publication Or Event Date:

Evidence Role:

Evidence Strength:

Source Reliability:

Independent Evidence: Yes / No / Unclear

Contradiction Status:

Used In Final Conclusion: Yes / No

Rule

The final answer should be traceable to specific evidence, not merely to a source homepage.
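
A minimal Python sketch of an evidence record, together with a tally of the evidence roles that a final conclusion actually relies on, is given here. It carries only a subset of the fields above, and the class and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class EvidenceRecord:
    """Hypothetical subset of the evidence record fields."""

    evidence_id: str
    claim_id: str
    source_id: str
    evidence_location: str  # a passage or section, not just a homepage
    evidence_role: str      # e.g. "supports claim", "challenges claim"
    used_in_final_conclusion: bool


def role_counts(records: List[EvidenceRecord], claim_id: str) -> Counter:
    """Tally evidence roles for one claim, counting only evidence that was used."""
    return Counter(
        record.evidence_role
        for record in records
        if record.claim_id == claim_id and record.used_in_final_conclusion
    )
```

A tally of this kind makes it visible when a conclusion cites several sources but only one of them actually supports the claim.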

Contradiction And Disagreement Standard

Deep Search must actively look for material disagreement.

Possible disagreement types include:

factual disagreement between sources

different measurement methods

different time periods

different definitions

official claim versus independent evidence

research disagreement

outdated versus current information

partial evidence presented as complete evidence

When disagreement exists, the output should explain:

what the sources disagree about

which sources are stronger

whether the disagreement can be resolved

whether both claims may be true under different conditions

what further evidence is needed

Rule

The system must not hide disagreement merely to produce a cleaner answer.

Fact Check Verdict Standard

Claim-checking workflows should use controlled verdicts.

Recommended verdicts:

Supported

Strong available evidence supports the claim.

Mostly Supported

The central claim is supported, but qualifications or minor inaccuracies exist.

Partially Supported

Some elements are supported, but the complete claim is not.

Misleading

The claim uses true elements but creates a materially incorrect impression.

Unsupported

Sufficient supporting evidence was not found.

Contradicted

Strong evidence conflicts with the claim.

Outdated

The claim may have been accurate previously but is no longer current.

Unverifiable

Available evidence is insufficient to reach a reliable conclusion.

Opinion Or Prediction

The statement cannot be treated as a settled factual claim.

Rule

The system should not force every claim into true or false.
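
A controlled verdict set can be sketched as an enumeration, so that labels outside the recommended list are rejected instead of stored. The Python names `Verdict` and `parse_verdict` are hypothetical; the verdict labels are the ones recommended above.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "Supported"
    MOSTLY_SUPPORTED = "Mostly Supported"
    PARTIALLY_SUPPORTED = "Partially Supported"
    MISLEADING = "Misleading"
    UNSUPPORTED = "Unsupported"
    CONTRADICTED = "Contradicted"
    OUTDATED = "Outdated"
    UNVERIFIABLE = "Unverifiable"
    OPINION_OR_PREDICTION = "Opinion Or Prediction"


def parse_verdict(label: str) -> Verdict:
    """Accept only controlled verdicts; raises ValueError for any other label."""
    return Verdict(label.strip())
```

For example, `parse_verdict("Misleading")` returns `Verdict.MISLEADING`, while a bare "True" raises `ValueError` and would need to be mapped to one of the controlled verdicts first.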

Confidence Calibration Standard

Confidence must reflect evidence quality, not model certainty.

Confidence should consider:

claim clarity

source reliability

source independence

evidence strength

evidence coverage

freshness

contradictions

missing information

tool failures

Confidence Levels

High Confidence

Strong, relevant, current, and sufficiently independent evidence exists.

Moderate Confidence

Useful evidence exists but has limitations, dependence, age, or unresolved gaps.

Low Confidence

Evidence is weak, incomplete, indirect, old, or materially disputed.

Insufficient Evidence

A reliable conclusion cannot be reached.

Rule

Confidence must decrease when:

sources are not independent

important pages cannot be inspected

the topic is highly time sensitive

primary evidence is missing

credible sources conflict

the claim is broader than the evidence
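
The rule above can be sketched as a function that lowers a starting confidence level for each condition that applies. This Python sketch makes two assumptions that the framework does not state: each condition lowers confidence by exactly one level, and the lowest reachable level is Insufficient Evidence. All names are hypothetical.

```python
from typing import List

# Ordered from strongest to weakest, following the Confidence Levels above.
LEVELS: List[str] = ["high", "moderate", "low", "insufficient_evidence"]


def calibrate(
    base_level: str,
    *,
    sources_not_independent: bool = False,
    pages_not_inspectable: bool = False,
    highly_time_sensitive: bool = False,
    primary_evidence_missing: bool = False,
    credible_sources_conflict: bool = False,
    claim_broader_than_evidence: bool = False,
) -> str:
    """Lower confidence one level per triggered condition (assumed step size)."""
    reductions = sum(
        [
            sources_not_independent,
            pages_not_inspectable,
            highly_time_sensitive,
            primary_evidence_missing,
            credible_sources_conflict,
            claim_broader_than_evidence,
        ]
    )
    position = min(LEVELS.index(base_level) + reductions, len(LEVELS) - 1)
    return LEVELS[position]
```

The point of the sketch is that confidence is derived from recorded evidence conditions and can only move down when they apply, whatever certainty the model itself expresses.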

Research Stage Separation

Strong Deep Search should separate:

retrieval

evidence extraction

fact checking

interpretation

recommendation

final synthesis

The system must not collapse all stages into one unobservable model response when the task is important.

Observability Requirement

Every serious Deep Search AI Employee must be observable.

Observability means the system can answer:

What was requested?

Who requested it?

Which Brain handled it?

Which AI Employee handled it?

Which claims were evaluated?

Which model was used?

What prompt was sent?

What tools were called?

What searches were performed?

What sources were found?

What sources were inspected?

What sources were rejected?

What evidence supported each claim?

What evidence challenged each claim?

Which sources shared the same original evidence?

What database records were read?

What database records were written?

What failed?

What cost was incurred?

How long did it take?

What verdict was produced?

What confidence level was produced?

What decision was made?

Where was the final output stored?

What should be improved next?

Observability must cover the full workflow, not just the model call.

Full Workflow Trace Standard

A Deep Search workflow should be traceable from beginning to end.

The ideal trace path is:

User Request

→ Brain Assignment

→ AI Employee Assignment

→ Task Or Thread ID

→ Claim Definition

→ Prompt

→ Model Call

→ Search Query

→ Tool Call

→ Source Result

→ Source Classification

→ Crawler Or Scraper Action

→ Extracted Evidence

→ Supporting And Challenging Evidence

→ Source Independence Review

→ Database Read

→ Database Write

→ Evaluation Score

→ Verdict

→ Confidence

→ Final Output

→ HeadOffice Review

→ Kaizen Improvement Log

No major Deep Search action should exist without a parent task, thread, workflow run, offer, experiment, source record, claim record, or report.

No orphaned AI output.
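
The "no orphaned output" rule can be sketched as a check that every logged action carries at least one parent reference. The key names in this Python sketch are hypothetical and would depend on the actual trace schema.

```python
from typing import Tuple

# Hypothetical key names for the parent records named in the rule above.
PARENT_KEYS: Tuple[str, ...] = (
    "task_id",
    "thread_id",
    "workflow_run_id",
    "offer_id",
    "experiment_id",
    "source_record_id",
    "claim_id",
    "report_id",
)


def is_orphaned(action: dict) -> bool:
    """Return True when an action has no non-empty parent reference."""
    return not any(action.get(key) for key in PARENT_KEYS)
```

A check like this could run when a trace is written, so that orphaned actions are flagged at the time they occur rather than discovered during review.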

Required Observability Metadata

Deep Search traces should include operational metadata wherever possible.

Recommended metadata fields:

Brain Name

AI Employee Name

Workflow Type

Task ID

Thread ID

Claim ID

Claim Type

User Or Operator

Client Or Account If Relevant

Source Record ID

Evidence Record ID

Offer ID If Relevant

Experiment ID If Relevant

Newsletter ID If Relevant

Campaign ID If Relevant

Priority

Urgency

Verdict

Confidence

Model Used

Tool Used

Search Provider Used

Crawler Or Scraper Used

Number Of Searches

Number Of Sources Found

Number Of Sources Inspected

Number Of Independent Evidence Chains

Supporting Evidence Count

Challenging Evidence Count

Cost Estimate

Latency

Success Status

Failure Reason

Retry Count

Escalation Status

Decision Outcome

Final Storage Location

Review Status

Kaizen Note

Technical logs without business and evidence metadata are not enough for MWMS.

HeadOffice must be able to understand the business meaning of the trace.

Database Call Observability

Deep Search observability must include important database activity.

The system should log or trace:

database reads

database writes

task updates

claim record creation

claim status updates

queue status changes

source record creation

source record updates

evidence record creation

source relationship creation

result storage

duplicate detection

failed inserts

missing records

permission failures

status transitions

event log creation

This matters because many AI failures are not model failures.

They may be:

missing data

wrong record linkage

duplicate writes

broken task state

failed source storage

incorrect user ownership

outdated cached records

queue routing failure

claim-to-evidence mismatch

source duplication

A final answer should never be trusted if the supporting database workflow is broken.

Tool Call Observability

Every tool used by a Deep Search AI Employee should be visible and reviewable.

Tool traces should include:

tool name

tool purpose

input arguments

execution status

output summary

error message if failed

latency

retry count

whether the output was used

whether the output changed the final answer

whether the tool was authorised for that AI Employee

This applies to:

search tools

crawler tools

scraper tools

browser tools

database tools

file tools

email tools

calendar tools

analytics tools

ad platform tools

browser copilots

future MCP tools

future WordPress or Supabase tools
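
One way to make tool calls visible is to wrap each tool so that every call appends a trace entry, whether it succeeds or fails. The following Python sketch is an assumption about how this could look: the in-memory `TRACE` list stands in for a real observability backend, and `traced_tool`, `example_search`, and `example_blocked_crawl` are hypothetical names. It covers only some of the trace fields listed above.

```python
import functools
import time
from typing import Any, Callable, List

TRACE: List[dict] = []  # stand-in for a real trace store


def traced_tool(name: str, purpose: str) -> Callable:
    """Decorator that records one trace entry per tool call."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            entry = {
                "tool_name": name,
                "tool_purpose": purpose,
                "input_arguments": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                entry["execution_status"] = "success"
                entry["output_summary"] = repr(result)[:200]
                return result
            except Exception as exc:
                entry["execution_status"] = "failed"
                entry["error_message"] = str(exc)
                raise  # the failure is recorded, not hidden
            finally:
                entry["latency_seconds"] = time.perf_counter() - start
                TRACE.append(entry)

        return wrapper

    return decorator


@traced_tool("example_search", "find candidate sources")
def example_search(query: str) -> List[str]:
    # Stand-in for a real search provider call.
    return ["https://example.com/a"]


@traced_tool("example_blocked_crawl", "inspect a source page")
def example_blocked_crawl(url: str) -> str:
    # Stand-in for a crawler call that fails.
    raise RuntimeError("blocked")
```

Because the entry is appended in the `finally` clause, a failed call leaves the same kind of record as a successful one, which keeps crawler and tool failures reviewable.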

Browser Copilot Research Standard

A browser copilot may capture:

highlighted claim

current page URL

page title

selected text

user instruction

The backend research workflow should then:

normalise the claim

preserve the originating source

search independent sources

inspect evidence

compare support and contradiction

return a controlled verdict

show confidence

display source references

state limitations

The browser interface must not perform the entire evidence judgment through hidden frontend logic.

Rule

The browser copilot is a research access surface.

It is not the source of truth.

Multi-Model Research Standard

Multiple language models may be used to generate:

different interpretations

different research paths

different questions

different challenge perspectives

alternative syntheses

Multiple models do not automatically provide independent factual verification.

They may share:

similar training data

similar search results

similar assumptions

the same source set

the same prompt bias

Multi-model workflows should therefore:

separate evidence retrieval from interpretation

give each model a defined role

preserve each model’s output

record the evidence each model used

identify agreement

identify disagreement

identify omissions

use synthesis rather than simple voting

retain minority evidence where relevant

route unresolved conflicts for additional research

Rule

Model consensus is not evidence consensus.
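
The difference between model agreement and evidence agreement can be sketched by comparing the evidence each model relied on as well as its verdict. The input shape and every name in this Python sketch are assumptions made for illustration.

```python
from typing import Dict


def compare_model_outputs(outputs: Dict[str, dict]) -> dict:
    """Compare verdicts and evidence sets across models without voting.

    Each value in `outputs` is assumed to hold a "verdict" string and a list
    of "evidence_ids" the model relied on.
    """
    verdicts = {model: output["verdict"] for model, output in outputs.items()}
    evidence_sets = [frozenset(output["evidence_ids"]) for output in outputs.values()]
    distinct_verdicts = set(verdicts.values())
    shared = frozenset.intersection(*evidence_sets) if evidence_sets else frozenset()
    combined = frozenset().union(*evidence_sets)
    models_agree = len(distinct_verdicts) == 1
    return {
        "verdicts": verdicts,  # every model's output is preserved
        "models_agree": models_agree,
        "agreement_rests_on_same_evidence": models_agree and len(set(evidence_sets)) == 1,
        "shared_evidence_ids": sorted(shared),
        "all_evidence_ids": sorted(combined),  # minority evidence is retained
        "needs_more_research": len(distinct_verdicts) > 1,
    }
```

When `agreement_rests_on_same_evidence` is true, the agreement adds no independent confirmation: the models reached the same verdict from the same evidence.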

Research Roles

Possible multi-model roles include:

Primary Researcher

Finds and summarises relevant evidence.

Challenge Researcher

Searches for contradictions, weaknesses, and missing evidence.

Source Reviewer

Evaluates reliability, freshness, and independence.

Domain Interpreter

Explains the subject-specific meaning of the evidence.

Synthesiser

Combines findings without hiding disagreement.

Decision Reviewer

Tests whether the final recommendation is justified.

Cost And Latency Tracking

Deep Search can become expensive if uncontrolled.

Each production-level Deep Search AI Employee should track:

cost per query

cost per claim evaluated

cost per user

cost per workflow run

cost per source inspected

cost per independent evidence chain

cost per successful answer

cost per failed answer

total daily cost

total monthly cost

average response time

slowest workflow paths

tool latency

database latency

model latency

crawler latency

Cost and latency are not only technical metrics.

They affect:

business viability

user trust

scaling

product packaging

research depth

service pricing
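
Several of the tracked figures can be derived from per-run records. This Python sketch assumes each workflow run is logged with a cost, a count of claims evaluated, a success flag, and a latency; the field names are hypothetical, and "cost per successful answer" is interpreted here as the average cost of the runs that succeeded.

```python
from typing import List, Optional


def _mean(values: List[float]) -> Optional[float]:
    """Average of a list, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


def cost_summary(runs: List[dict]) -> dict:
    """Derive basic cost and latency figures from workflow run records."""
    total_cost = sum(run["cost"] for run in runs)
    claims = sum(run["claims_evaluated"] for run in runs)
    return {
        "total_cost": total_cost,
        "cost_per_workflow_run": _mean([run["cost"] for run in runs]),
        "cost_per_claim_evaluated": total_cost / claims if claims else None,
        "cost_per_successful_answer": _mean(
            [run["cost"] for run in runs if run["succeeded"]]
        ),
        "cost_per_failed_answer": _mean(
            [run["cost"] for run in runs if not run["succeeded"]]
        ),
        "average_response_seconds": _mean([run["latency_seconds"] for run in runs]),
    }
```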

Evaluation Requirement

Every serious Deep Search AI Employee should have repeatable evaluation tests.

Manual judgement alone is not enough.

Evals should test whether the AI Employee produces outputs that meet MWMS standards over time.

Evaluation should apply before:

giving an AI Employee more autonomy

using the Employee for high-value decisions

scaling client-facing workflows

changing models

changing prompts

changing tools

changing search providers

changing crawler behaviour

changing source rules

changing verdict logic

automating downstream actions

Success Criteria Requirement

Every Deep Search AI Employee must have success criteria.

Success criteria define what “good” means.

Without success criteria, MWMS cannot know whether an AI Employee is:

improving

drifting

wasting cost

missing evidence

overstating certainty

creating risk

Deep Search success criteria should include:

factual accuracy

claim coverage

relevance

source quality

source independence

source visibility

freshness

evidence sufficiency

contradiction handling

completeness

clarity

actionability

decision usefulness

speed

cost control

error rate

confidence calibration

verdict accuracy

compliance safety

repeatability

human review usefulness

Base Success Criteria For Deep Search Outputs

A Deep Search output should be judged against the following base criteria.

Factual

The answer should be supported by evidence, not model confidence alone.

Claim Accurate

The system should evaluate the actual claim rather than a simplified substitute.

Relevant

The answer should directly address the task, decision, or question.

Sourced

The answer should make clear what evidence was used.

Independently Supported

Where independent verification is required, the output should not rely only on repeated versions of one source.

Up To Date

The answer should use information current enough for the decision.

Contradiction Aware

The answer should reveal credible disagreement.

Complete Enough

The answer should include enough information to support a decision while avoiding unnecessary filler.

Clear

The answer should be understandable by the intended operator.

Actionable

The answer should help MWMS decide what to do next.

Cost Controlled

The workflow should not use excessive model, tool, or crawler cost for the value of the task.

Fast Enough

The response time should fit the use case.

Traceable

The process should be reviewable after the fact.

Safe

The output should avoid compliance, policy, legal, privacy, or reputational risk.

Confidence Calibrated

The confidence should match the strength of the evidence.

Improvable

The workflow should leave enough recorded evidence to support future Kaizen improvement.

Deep Search Evaluation Categories

MWMS should evaluate Deep Search AI Employees across six categories.

Category 1: Claim Quality

Questions:

Was the claim clearly defined?

Was the correct claim evaluated?

Was the claim decomposed appropriately?

Was claim scope preserved?

Category 2: Evidence Quality

Questions:

Did the AI inspect enough sources?

Were the sources reliable?

Were the sources current?

Were the sources genuinely independent?

Were weak sources filtered out?

Were supporting and opposing sources identified?

Was the final answer grounded?

Category 3: Reasoning Quality

Questions:

Did the AI interpret the evidence correctly?

Did it avoid overclaiming?

Did it separate fact from inference?

Did it acknowledge uncertainty?

Did it explain contradictions?

Did it select an appropriate verdict?

Did it reach a useful conclusion?

Category 4: Operational Quality

Questions:

Did the workflow run without error?

Were tool calls successful?

Were database records saved correctly?

Were outputs connected to the right task, claim, thread, or evidence record?

Was the process visible to HeadOffice?

Category 5: Business Quality

Questions:

Did the answer help MWMS make a better decision?

Did it save time?

Did it identify risk?

Did it reveal an opportunity?

Did it support revenue, compliance, efficiency, or system improvement?

Category 6: Cost And Scaling Quality

Questions:

Was the output worth the cost?

Did the AI use too many searches?

Did the crawler inspect too many pages?

Did retries create waste?

Can the workflow scale safely?

Suggested Scoring Model

Each Deep Search output may be scored from 1 to 5 across key criteria.

Score 1

Failed or unusable.

Score 2

Weak and requires significant human correction.

Score 3

Acceptable but not strong.

Score 4

Strong and useful.

Score 5

Excellent and reusable.

Suggested scoring fields:

claim precision

factual accuracy

relevance

source quality

source independence

freshness

evidence sufficiency

contradiction handling

verdict quality

confidence calibration

completeness

decision usefulness

traceability

cost efficiency

speed

safety

A score below 3 in:

claim precision

factual accuracy

source quality

source independence where required

safety

verdict quality

decision usefulness

should trigger review.
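
The review trigger can be sketched as a check over the scored fields. In this Python sketch the field keys and function name are hypothetical, and leaving unscored fields unflagged is an assumption; the threshold of 3 and the list of fields come from the text above.

```python
from typing import Dict, List, Tuple

# Fields in which a score below 3 should trigger review.
REVIEW_FIELDS: Tuple[str, ...] = (
    "claim_precision",
    "factual_accuracy",
    "source_quality",
    "source_independence",
    "safety",
    "verdict_quality",
    "decision_usefulness",
)


def fields_triggering_review(
    scores: Dict[str, int], independence_required: bool = True
) -> List[str]:
    """List the review-critical fields that scored below 3."""
    flagged = []
    for name in REVIEW_FIELDS:
        if name == "source_independence" and not independence_required:
            continue  # only counted where independent verification is required
        score = scores.get(name)
        if score is not None and score < 3:  # unscored fields are skipped here
            flagged.append(name)
    return flagged
```

A low score in a field outside this list, such as speed, does not trigger review by itself under this rule.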

Failure Conditions

A Deep Search output should be treated as failed or as requiring review if:

no sources were used when sources were required

the wrong claim was evaluated

sources were outdated for a time-sensitive topic

the answer made unsupported claims

the crawler failed but the answer acted as if it succeeded

search snippets were treated as full evidence

the output did not answer the actual task

the system could not trace tool calls

the system could not trace database activity

the answer used low-trust sources without warning

multiple dependent sources were represented as independent confirmation

credible contradictory evidence was ignored

the cost was excessive for the task value

the output created compliance or reputational risk

the AI Employee showed high confidence with weak evidence

the verdict was more certain than the evidence allowed

the final answer could not be linked to a task, claim, thread, source, evidence record, or workflow record

Human Review Requirements

Human review is required when:

the decision has financial risk

the decision has compliance risk

the output affects a campaign launch

the output affects affiliate offer selection

the output recommends budget changes

the output recommends public claims

the output uses weak or conflicting sources

the output depends on current policy or regulation

the output contains medical, legal, financial, or safety conclusions

the AI Employee's confidence is low

the evaluation score is below threshold

the system detects missing trace data

the claim is materially disputed

the system could not inspect primary evidence

Human review should record one or more of the following outcomes:

approved

rejected

needs more research

claim reframing required

park for later

route to another Brain

create task

update framework

add Kaizen note

HeadOffice Governance Role

HeadOffice owns this framework.

HeadOffice is responsible for:

defining success criteria

approving Deep Search AI Employees

reviewing observability outputs

monitoring cost and latency

reviewing eval results

identifying drift

routing failures to the correct Brain

deciding when an AI Employee can gain more autonomy

deciding when an AI Employee must be restricted

ensuring all Deep Search workflows support MWMS business goals

HeadOffice must not rely on polished AI answers without traceability.

A useful answer without evidence is not enough.

A confident answer without observability is not enough.

Several sources without independence are not enough.

Agreement among several models is not enough.

A fast answer that is wrong is not enough.

A cheap answer that creates risk is not enough.

Relationship To Other MWMS Standards

This framework supports and should align with:

MWMS AI Agent Operations Core

MWMS AI Tool Permission And Access Framework

MWMS AI Agent Deployment Readiness Checklist

MWMS AI Workflow Pipeline Standard

MWMS AI Schema And Decision Ready Output Framework

MWMS AI Output Validation Standard

MWMS Agentic Reporting Standard

MWMS Supabase Event Schema

MWMS Brain Room Architecture

HeadOffice Operational Intelligence Framework

HeadOffice Newsletter Intelligence Operating Protocol

Research Brain Source Evaluation Framework

Data Brain Measurement Integrity Framework

Experimentation Brain Canon

MWMS Kaizen Continuous Improvement Loop

MWMS System Change Log

MWMS Source Visibility And Evidence Display Standard

MWMS Independent Model Review And Rescue Routing Framework

MWMS Research Synthesis Documentation And Distribution Framework

MWMS External Knowledge Engine And Reasoning Agent Separation Framework

This framework does not replace those standards.

It provides the quality and observability layer for Deep Search style AI work.

Routing Rules

Deep Search findings should route according to their business function.

Source Quality Issue

Primary Destination: Research Brain

Data Integrity Issue

Primary Destination: Data Brain

AI Employee Failure

Primary Destination: AI Agent Operations

Cost Issue

Primary Destination: Finance Brain or HeadOffice

Experiment Insight

Primary Destination: Experimentation Brain

Affiliate Opportunity

Primary Destination: Affiliate Brain

Ad Or Compliance Issue

Primary Destination: Ads Brain or Risk Brain

Newsletter Signal

Primary Destination: HeadOffice Intelligence

Framework Improvement

Primary Destination: MCR

UI Visibility Issue

Primary Destination: Relevant Brain Site or Product Brain

Repeated Failure Pattern

Primary Destination: Kaizen Log and HeadOffice Review

Unresolved Claim

Primary Destination: Research Brain Review Queue

Contradictory Evidence

Primary Destination: Research Brain or Specialist Human Review

Weak Source Independence

Primary Destination: Additional Research Queue
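
The routing rules above can be held as a simple lookup table. The following is a minimal sketch, transcribing the destinations from this section. Where a rule names two destinations, both are listed as candidates and the sketch does not distinguish "or" from "and". The fallback to HeadOffice for unknown finding types is an assumption, not a rule stated in this framework.

```python
# Finding type -> candidate destinations, transcribed from the routing rules.
ROUTING = {
    "source quality issue": ["Research Brain"],
    "data integrity issue": ["Data Brain"],
    "ai employee failure": ["AI Agent Operations"],
    "cost issue": ["Finance Brain", "HeadOffice"],
    "experiment insight": ["Experimentation Brain"],
    "affiliate opportunity": ["Affiliate Brain"],
    "ad or compliance issue": ["Ads Brain", "Risk Brain"],
    "newsletter signal": ["HeadOffice Intelligence"],
    "framework improvement": ["MCR"],
    "ui visibility issue": ["Relevant Brain Site", "Product Brain"],
    "repeated failure pattern": ["Kaizen Log", "HeadOffice Review"],
    "unresolved claim": ["Research Brain Review Queue"],
    "contradictory evidence": ["Research Brain", "Specialist Human Review"],
    "weak source independence": ["Additional Research Queue"],
}

# Assumption: unknown finding types escalate to the framework owner.
FALLBACK = ["HeadOffice"]


def route(finding_type: str) -> list[str]:
    """Look up destinations, ignoring case and extra whitespace."""
    key = " ".join(finding_type.lower().split())
    return list(ROUTING.get(key, FALLBACK))
```

Keeping the table as data means a routing change is a one-line edit that can be recorded in the change log.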

Kaizen Loop

Every Deep Search AI Employee should feed a Kaizen loop.

After meaningful runs, the system should record:

what worked

what failed

what was unclear

which claims were poorly defined

which sources were weak

which sources were dependent

which contradictions were missed

what cost too much

what took too long

what should be improved

whether prompts need refinement

whether tools need refinement

whether search paths need refinement

whether verdict logic needs improvement

whether confidence rules need updating

whether success criteria need updating

whether the AI Employee is ready for more autonomy

The Kaizen loop runs in four steps:

Reflect

→ Reduce

→ Refine

→ Record

The goal is not only to judge individual outputs.

The goal is to improve the system over time.

Minimum Viable Implementation

The first implementation of this framework does not require a complex observability platform.

MWMS can begin with:

task IDs

thread IDs

claim IDs

source records

evidence records

event logs

model used

tool used

sources inspected

verdict

confidence

final output

status

human review result

Kaizen note
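
The starting list above fits in a single record per run. The following is a minimal sketch, assuming one JSON line per run written to a plain log. The class name, field names, and example values are hypothetical. The sketch shows the shape of a minimum viable record, not a schema mandated by this framework.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class DeepSearchRunRecord:
    """One record per run; field names are illustrative only."""
    task_id: str
    thread_id: str
    claim_id: str
    model_used: str
    tools_used: list[str]
    sources_inspected: list[str]
    verdict: str
    confidence: float
    final_output: str
    status: str
    source_record_ids: list[str] = field(default_factory=list)
    evidence_record_ids: list[str] = field(default_factory=list)
    events: list[dict] = field(default_factory=list)
    human_review_result: Optional[str] = None
    kaizen_note: str = ""


def to_log_line(record: DeepSearchRunRecord) -> str:
    """Serialise the record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

A flat record like this can be appended to a file or inserted into a single database table, and can be split into separate source and evidence tables later.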

As the system matures, MWMS can add:

full trace logging

external observability tools

model cost tracking

crawler performance metrics

database call traces

automated evals

AI Employee scorecards

source relationship graphs

claim-to-evidence mapping

contradiction alerts

confidence calibration reports

HeadOffice monitoring dashboards

Start simple.

Do not delay governance because tooling is not perfect.

Future System Enhancements

Future versions may include:

MWMS Deep Search Source Record Schema

MWMS Claim And Evidence Record Schema

MWMS Source Independence Graph

MWMS AI Employee Eval Registry

MWMS Deep Search Scorecard

HeadOffice AI Trace Dashboard

Research Brain Source Quality Dashboard

Affiliate Brain Offer Evidence Trail

Ads Brain Compliance Evidence Trail

automated source freshness checking

model comparison evals

cost per AI Employee reporting

confidence calibration reports

failure pattern detection

claim contradiction detection

source dependence detection

AI Employee promotion or restriction rules

Drift Protection

This framework guards against the following forms of drift:

treating model output as truth

trusting AI confidence without evidence

relying only on search snippets

ignoring source freshness

counting repeated sources as independent evidence

treating model agreement as factual verification

hiding contradictory evidence

forcing uncertain claims into true or false

hiding tool calls from operators

ignoring database workflow failures

judging AI Employees by vibes

scaling AI Employees without evals

allowing cost to grow invisibly

creating orphaned outputs

separating AI output from business usefulness

using observability as technical logging only

building Deep Search without HeadOffice oversight

Drift Signals

Watch for:

“Three models agreed, so it must be true.”

“There are ten sources saying the same thing.”

“The sources all link back to one report, but that is fine.”

“The search snippet is enough.”

“The page looks credible.”

“The model is highly confident.”

“We do not need the original source.”

“There was no contradictory evidence in the first search.”

“We need a yes or no answer.”

“The claim is mostly right, so mark it true.”

“We can hide the uncertainty from the client.”

“The browser extension already fact checked it.”

“The AI researcher produced a polished report.”

Rule

When these drift signals appear, return to claim definition, primary evidence, source independence, contradiction analysis, verdict discipline, confidence calibration, and traceability.
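
The drift signals about source counts can be checked rather than argued. The following is a minimal sketch, assuming each source record can carry a pointer to the source it was derived from. The derived_from mapping and the source names are hypothetical. Sources that trace back to the same origin are grouped into one evidence chain, so ten articles repeating one report count once.

```python
def root_of(source_id: str, derived_from: dict[str, str]) -> str:
    """Follow derived_from links until reaching a source with no parent."""
    seen = set()
    current = source_id
    # The seen set only prevents an infinite loop on circular citations.
    while current in derived_from and current not in seen:
        seen.add(current)
        current = derived_from[current]
    return current


def independent_chains(
    source_ids: list[str], derived_from: dict[str, str]
) -> dict[str, list[str]]:
    """Group sources by the origin they ultimately trace back to."""
    chains: dict[str, list[str]] = {}
    for source_id in source_ids:
        chains.setdefault(root_of(source_id, derived_from), []).append(source_id)
    return chains


# Four sources, but three of them lead back to the same report.
sources = ["blog_a", "news_b", "news_c", "regulator_d"]
links = {"blog_a": "report_x", "news_b": "report_x", "news_c": "news_b"}
chains = independent_chains(sources, links)
print(len(sources), "sources,", len(chains), "independent chains")
```

The number of chains, not the number of sources, is the figure that should inform confidence.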

Architectural Intent

The architectural intent of this framework is to make Deep Search a governed MWMS capability rather than a loose AI feature.

MWMS is not building simple chatbots.

MWMS is building business-aligned AI Employees that can:

research

verify

challenge

reason

report

improve

For that to work, every serious AI Employee must be:

observable

measurable

claim aware

source aware

date aware

source-independence aware

contradiction aware

cost aware

workflow aware

reviewable

improvable

This framework ensures Deep Search becomes part of the MWMS intelligence layer, not just another tool.

Final Standard

The MWMS final standard is:

No serious MWMS Deep Search, fact-checking, external research, browser copilot, research assistant, or evidence-based intelligence system should be trusted until its task definition, claim structure, search process, source inspection, evidence extraction, source reliability, source independence, contradiction handling, verdict logic, confidence calibration, observability, evaluation, human review, and Kaizen requirements are defined.

A valid MWMS Deep Search system must define:

task purpose

claim or decision question

claim types

time sensitivity

search paths

source classes

source inspection rules

evidence records

supporting evidence

challenging evidence

source independence

freshness rules

contradiction handling

verdict options

confidence rules

observability metadata

database records

tool traces

evaluation criteria

failure conditions

human review points

cost limits

latency expectations

routing rules

Kaizen process

That is the MWMS Deep Search Quality And Observability standard.
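
The list of required definitions can double as a readiness checklist. The following is a minimal sketch, assuming a Deep Search system is described by a hypothetical dictionary keyed by the item names above. It reports which items are still undefined and does not judge the quality of the definitions that are present.

```python
# The 25 items a valid MWMS Deep Search system must define, as listed above.
REQUIRED_DEFINITIONS = (
    "task purpose", "claim or decision question", "claim types",
    "time sensitivity", "search paths", "source classes",
    "source inspection rules", "evidence records", "supporting evidence",
    "challenging evidence", "source independence", "freshness rules",
    "contradiction handling", "verdict options", "confidence rules",
    "observability metadata", "database records", "tool traces",
    "evaluation criteria", "failure conditions", "human review points",
    "cost limits", "latency expectations", "routing rules", "Kaizen process",
)


def undefined_items(definition: dict) -> list[str]:
    """Return required items that are missing or left empty."""
    return [item for item in REQUIRED_DEFINITIONS if not definition.get(item)]


def meets_final_standard(definition: dict) -> bool:
    """True only when every required item has a non-empty definition."""
    return not undefined_items(definition)
```

A check like this could run as part of a deployment readiness review before an AI Employee is approved.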

MWMS System Change Log

Version: v1.1

Date: 2026-06-21

Author: HeadOffice

Change

Updated the MWMS Deep Search Quality And Observability Framework from v1.0 to v1.1 using the AI Automations by Jack transcript block covering:

browser-based fact checking

claim validation

independent web research

source comparison

confidence scoring

fact-check verdicts

client research automation

multi-model research

research verification

source URL preservation

business meeting preparation

automated evidence synthesis

Expanded the framework beyond general research quality and technical observability to include formal claim-level research governance.

Added new standards covering:

claim decomposition

claim records

claim types

search path diversity

source independence

source relationship mapping

evidence roles

evidence extraction

claim-to-evidence traceability

supporting evidence

challenging evidence

contradiction handling

controlled fact-check verdicts

confidence calibration

browser copilot research

multi-model research roles

model consensus versus evidence consensus

Added controlled verdict classes:

• Supported

• Mostly Supported

• Partially Supported

• Misleading

• Unsupported

• Contradicted

• Outdated

• Unverifiable

• Opinion Or Prediction
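
The controlled verdict classes can be enforced in code so that free-text verdicts such as "True" or "False" are rejected. The following is a minimal sketch. The enum values are the nine classes listed above, while the class name Verdict and the function name parse_verdict are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "Supported"
    MOSTLY_SUPPORTED = "Mostly Supported"
    PARTIALLY_SUPPORTED = "Partially Supported"
    MISLEADING = "Misleading"
    UNSUPPORTED = "Unsupported"
    CONTRADICTED = "Contradicted"
    OUTDATED = "Outdated"
    UNVERIFIABLE = "Unverifiable"
    OPINION_OR_PREDICTION = "Opinion Or Prediction"


def parse_verdict(label: str) -> Verdict:
    """Accept only a controlled verdict label; reject anything else."""
    try:
        return Verdict(label.strip())
    except ValueError:
        allowed = ", ".join(v.value for v in Verdict)
        raise ValueError(
            f"'{label}' is not a controlled verdict. Allowed: {allowed}"
        ) from None
```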

Expanded observability metadata to include:

• Claim ID

• Claim Type

• Evidence Record ID

• Verdict

• Number Of Searches

• Sources Found

• Sources Inspected

• Independent Evidence Chains

• Supporting Evidence Count

• Challenging Evidence Count

Added explicit doctrine that:

• multiple sources may represent one evidence chain

• multiple models agreeing does not constitute independent verification

• a citation is not automatically supporting evidence

• uncertainty must not be hidden to produce a cleaner answer

• browser copilots are access surfaces, not truth systems

Change Impact Declaration

This update materially strengthens the Deep Search governance standard without changing its primary authority.

HeadOffice remains the framework owner.

Research Brain remains the primary operating destination for source and evidence quality.

The update does not authorise autonomous AI Employees to make final high-risk legal, medical, financial, compliance, safety, or material business decisions.

The update requires stronger human review where:

• source independence is weak

• evidence is contradictory

• primary evidence cannot be inspected

• claims are highly time sensitive

• confidence is low

• consequences are material

Pages Created

• None

Pages Updated

• MWMS Deep Search Quality And Observability Framework updated from v1.0 to v1.1

Pages Deprecated

• None

Standalone Pages Not Created

The following standalone pages were not created because their durable intelligence is governed within this updated framework:

• MWMS AI Fact Checking Framework

• MWMS Browser Fact Check Copilot Framework

• MWMS Claim Verification Framework

• MWMS Source Independence Framework

• MWMS Multi Model Research Framework

• MWMS Evidence Chain Analysis Framework

• MWMS Fact Check Confidence Framework

• MWMS Automated Client Research Framework

Registries Requiring Update

• MCR Page Registry

• HeadOffice Page Registry

• Research Brain Page Registry where this framework is operationally referenced

• MCR Copy Map where the framework copy and version are recorded

Canon Version Update Required

No immediate HeadOffice Canon or Research Brain Canon version change is required unless either Canon directly records framework versions or contains research quality rules that conflict with this update.

The new claim, evidence, source independence, and verdict controls should be included during the next scheduled HeadOffice and Research Brain Canon alignment review.

Change Log Entry Required

Yes.

The v1.1 update must be recorded in:

• MWMS System Change Log

• MCR Page Registry change history where applicable

• HeadOffice Page Registry change history where applicable

• Research Brain Page Registry change history where applicable

Strategic Absorption Result

The AI Automations by Jack material concerning browser fact checking, automated company research, multi-model analysis, source verification, and client intelligence research has been absorbed into the existing MWMS Deep Search Quality And Observability Framework.

The absorption preserves the durable research and evidence architecture while rejecting:

• source-count theatre

• model-consensus theatre

• forced true-or-false verdicts

• unsupported confidence scoring

• hidden contradictory evidence

• search snippets being treated as complete evidence

• browser extensions being treated as independent truth systems

• polished research reports being accepted without traceability

The resulting v1.1 framework establishes that MWMS Deep Search must be:

• claim-specific

• evidence-led

• source-inspectable

• independence-aware

• contradiction-aware

• date-aware

• verdict-controlled

• confidence-calibrated

• observable

• evaluable

• human-reviewable

• improvable
