MWMS AI Employee Evaluation Scorecard Standard

System: MWMS
Document Type: Standard
Status: Draft For MCR
Authority: HeadOffice
Applies To: All MWMS Brains, AI Employees, Deep Search Workflows, Research Workflows, Newsletter Intelligence, Affiliate Evaluation, Ads Review, Content Workflows, Experimentation Brain, Future Client Facing AI Systems
Primary Location: MCR
Future Operational Destination: mwmsbrain.site, mwmsheadofficebrain.site, AI Employee Dashboards, HeadOffice Evaluation Dashboards
Parent Page: HeadOffice
Source Of Truth: MCR
Related Frameworks: MWMS Deep Search Quality And Observability Framework, MWMS AI Observability Metadata Standard
Course Source: Matt Pocock AIhero Build DeepSearch In TypeScript
Absorption Status: Approved For Integration

Purpose

The purpose of this standard is to define how MWMS evaluates AI Employees using structured, repeatable, and business-aligned scorecards.

MWMS must not judge AI Employees only by whether an answer sounds confident, polished, or impressive.

An AI Employee must be evaluated against defined criteria, including:

factual accuracy
answer relevancy
source quality
freshness
structure compliance
decision usefulness
safety
cost efficiency
speed
traceability
regression protection
human review usefulness

This standard ensures MWMS can measure whether an AI Employee is trustworthy, improving, drifting, failing, or ready for greater autonomy.

Scope

This standard applies to all AI Employees and AI-assisted workflows across MWMS.

This includes:

HeadOffice AI decision support
Research Brain source analysis
Affiliate Brain offer evaluation
Ads Brain compliance review
Content Brain research and drafting
Data Brain validation
Experimentation Brain test analysis
Finance Brain budget and cost analysis
Newsletter Intelligence extraction
Brain Room AI replies
Dev Console assistance
Deep Search agents
future client-facing AI systems
future AIBS business assistants
future AI dashboards and evaluation systems

This standard does not define exact code, Evalite implementation, Langfuse setup, TypeScript architecture, or specific scoring library syntax.

It defines the MWMS evaluation governance standard.

Core Rule

Every serious AI Employee must be evaluated against clear success criteria before MWMS relies on it for important decisions.

If an AI Employee cannot be tested, scored, reviewed, and improved, it should not be trusted with meaningful business responsibility.

The MWMS rule is:

If we cannot evaluate the AI Employee, we cannot trust the AI Employee.

Definition Of An AI Employee Evaluation Scorecard

An AI Employee Evaluation Scorecard is a structured assessment system that measures the quality, reliability, usefulness, and safety of an AI Employee’s output.

A scorecard may include:

deterministic checks
LLM-as-a-judge evaluations
human review
source checks
metadata checks
cost checks
regression checks
confidence checks
business usefulness checks

The scorecard must evaluate the AI Employee’s actual work, not just its final wording.

Why MWMS Needs Evaluation Scorecards

MWMS is building a system of AI Employees, not casual chatbots.

AI Employees may support:

research
campaign decisions
offer evaluation
compliance review
content strategy
data interpretation
newsletter intelligence
client reporting
business decision-making

Poor AI output can cause:

wasted time
bad offer choices
weak campaigns
compliance risk
false confidence
poor client outcomes
cost blowouts
wrong routing
bad data records
broken automation
loss of trust in the MWMS system

Evaluation scorecards prevent MWMS from scaling weak AI behaviour.

Evaluation Philosophy

MWMS evaluation should follow five principles.

1. Evaluate Real Work

AI Employees should be tested against tasks that resemble real MWMS workflows.

Generic examples are not enough.

2. Separate Structure From Judgement

Some quality checks are deterministic.

Some require judgement.

MWMS should use both.

3. Score With Defined Categories

Do not rely only on vague 1 to 10 ratings.

Use clear judgement categories first, then convert them into scores where needed.

4. Preserve Failure Cases

Failures should become regression tests so the same failure does not return later.

5. Improve Through Kaizen

Evaluation exists to improve the system, not only to criticise outputs.

Evaluation Types

MWMS should use three main evaluation types.

1. Deterministic Evaluations

Deterministic evaluations are simple, repeatable checks.

They are similar to unit tests.

They are useful when the expected behaviour is clear.

Examples:

Did the output include required fields?
Did the output include sources?
Did the output include a decision?
Did the output include a confidence score?
Did the output include a risk level?
Did the output include a next action?
Did the output use valid JSON?
Did the output follow the required format?
Did the output include a date awareness note where needed?
Did the output avoid banned wording?
Did the output route to the correct Brain?
Did the output include a Kaizen note?

Deterministic Evaluation Rule

Use deterministic evaluations first wherever possible.

They are cheaper, faster, and more consistent than LLM-based judgement.
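
As a non-normative illustration of the checks listed above, the following TypeScript sketch tests a raw output string for valid JSON, required fields, and at least one source. The field names (decision, confidence, riskLevel, nextAction, sources) and the function name runDeterministicChecks are assumptions made for this example. This standard does not prescribe an output schema or an implementation.

```typescript
// Hypothetical required fields; a real list would come from the workflow's output standard.
const REQUIRED_FIELDS = ["decision", "confidence", "riskLevel", "nextAction", "sources"];

interface CheckResult {
  name: string;
  passed: boolean;
}

// Runs cheap, repeatable checks against a raw output string.
function runDeterministicChecks(rawOutput: string): CheckResult[] {
  // Check 1: the output must parse as JSON and be a plain object.
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    return [{ name: "valid-json-object", passed: false }];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return [{ name: "valid-json-object", passed: false }];
  }
  const record = parsed as Record<string, unknown>;
  const results: CheckResult[] = [{ name: "valid-json-object", passed: true }];

  // Check 2: every required field must be present and non-empty.
  for (const field of REQUIRED_FIELDS) {
    const value = record[field];
    const present = value !== undefined && value !== null && value !== "";
    results.push({ name: "has-" + field, passed: present });
  }

  // Check 3: at least one source must be listed.
  const sources = record["sources"];
  results.push({
    name: "has-at-least-one-source",
    passed: Array.isArray(sources) && sources.length > 0,
  });
  return results;
}

// Usage: an output that is valid JSON but omits most required fields.
const checks = runDeterministicChecks('{"decision": "reject"}');
const failedChecks = checks.filter((c) => !c.passed).map((c) => c.name);
```

Because each check returns a named result, a failed run can report exactly which requirement was missed without any model call.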

2. LLM As Judge Evaluations

LLM-as-a-judge evaluations use another model to judge qualities that cannot be easily checked with simple rules.

Examples:

Is the answer factually accurate?
Is the answer relevant?
Did the answer overclaim?
Did the answer use the evidence correctly?
Is the recommendation justified?
Is the answer decision-ready?
Is the confidence level reasonable?
Is the tone appropriate?
Is the output safe?
Does the answer match the user’s real intent?

LLM Judge Rule

LLM judges should be used for judgement-heavy criteria, not simple required-field checks.

Do not waste judge calls on things deterministic tests can handle.

3. Human Review Evaluations

Human review is required for higher-risk decisions.

Examples:

campaign launch decisions
compliance-sensitive recommendations
budget decisions
offer approval
client-facing recommendations
legal, policy, medical, or financial risk areas
low-confidence AI outputs
weak-source outputs
failed or partial traces
disputed outputs
repeated AI Employee failures

Human Review Rule

LLM-as-a-judge can assist review, but it does not replace human oversight for high-risk MWMS decisions.

Score Categories Before Numeric Scores

MWMS should avoid vague judge prompts like:

Rate this answer from 1 to 10.

Instead, the judge should choose from structured categories.

Example factuality categories:

Category	Meaning	Suggested Score
Fully Supported	All important claims are supported by evidence	5
Mostly Supported	Main claims are supported, minor issues exist	4
Partially Supported	Some claims are supported, but gaps exist	3
Weakly Supported	Evidence is thin or unclear	2
Unsupported	Major claims lack evidence	1
Incorrect	Answer contradicts evidence	0

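Named categories with written meanings give the judge a fixed rubric, so scores are more consistent across runs and easier to audit than a free-form number.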
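
A minimal sketch of converting a judge's category choice into a score, assuming the judge returns the category label as plain text. The labels and scores are copied from the example factuality table above. The names FACTUALITY_SCORES and scoreFactuality are hypothetical.

```typescript
// Category labels and suggested scores from the example factuality table.
const FACTUALITY_SCORES = {
  "Fully Supported": 5,
  "Mostly Supported": 4,
  "Partially Supported": 3,
  "Weakly Supported": 2,
  "Unsupported": 1,
  "Incorrect": 0,
} as const;

type FactualityCategory = keyof typeof FACTUALITY_SCORES;

// Type guard: true only when the label is one of the fixed categories.
function isFactualityCategory(label: string): label is FactualityCategory {
  return Object.prototype.hasOwnProperty.call(FACTUALITY_SCORES, label);
}

// Converts the judge's chosen label into a numeric score.
// Returns null when the reply is outside the fixed category list,
// so a malformed judge reply is never silently scored.
function scoreFactuality(judgeLabel: string): number | null {
  const label = judgeLabel.trim();
  return isFactualityCategory(label) ? FACTUALITY_SCORES[label] : null;
}
```

Returning null for an unrecognised label keeps a free-form reply such as "7/10" out of the score history instead of guessing what the judge meant.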

Core Scorecard Categories

Every serious AI Employee should eventually be evaluated across these categories.

1. Factual Accuracy

Measures whether the output is true and evidence-supported.

Questions:

Are the claims correct?
Are claims supported by sources or records?
Did the AI avoid hallucination?
Did it separate facts from assumptions?
Did it handle uncertainty honestly?

2. Answer Relevancy

Measures whether the output answered the actual task.

Questions:

Did the answer address the user’s question?
Did it stay on task?
Did it avoid irrelevant filler?
Did it solve the operational problem?
Did it produce the expected type of output?

3. Source Quality

Measures whether the sources used were suitable.

Questions:

Were sources official, credible, or appropriate?
Were weak sources flagged?
Were commercial or biased sources handled carefully?
Were conflicting sources identified?
Were sources actually inspected where needed?

4. Freshness

Measures whether the information was current enough.

Questions:

Did the AI understand whether the task was time-sensitive?
Were current sources used where needed?
Were outdated sources flagged?
Was the current date considered?
Was uncertainty stated when freshness was unknown?

5. Structure Compliance

Measures whether the output followed the required MWMS format.

Questions:

Were required sections included?
Were required fields present?
Was output valid JSON where needed?
Was the correct page/output standard followed?
Was the required decision or next action included?

6. Decision Usefulness

Measures whether the output helps MWMS act.

Questions:

Did it produce a clear decision?
Did it help move the workflow forward?
Did it identify risk?
Did it identify opportunity?
Did it give a useful next step?
Would HeadOffice or the relevant Brain actually use it?

7. Safety And Compliance

Measures whether the output avoids business, policy, legal, platform, or reputational risk.

Questions:

Did it avoid unsafe claims?
Did it avoid prohibited or risky recommendations?
Did it flag uncertainty?
Did it avoid overpromising?
Did it respect platform or compliance boundaries?

8. Confidence Calibration

Measures whether the confidence level matches the evidence.

Questions:

Was confidence too high?
Was confidence too low?
Did weak evidence reduce confidence?
Did source conflict reduce confidence?
Did failed tools reduce confidence?
Was human review triggered when confidence was low?

9. Cost Efficiency

Measures whether the output was worth the cost.

Questions:

Did the workflow use too many model calls?
Did it inspect too many sources?
Did it retry unnecessarily?
Was the cost reasonable for the task value?
Did failed outputs waste budget?

10. Speed

Measures whether the workflow was fast enough for the use case.

Questions:

Was latency acceptable?
Were slow tools identified?
Were database delays visible?
Was the crawler too slow?
Did the response time match the task priority?

11. Traceability

Measures whether the workflow can be reviewed after the fact.

Questions:

Was the task ID captured?
Was the Brain captured?
Was the AI Employee captured?
Were model calls logged?
Were tool calls logged?
Were source records linked?
Were database writes visible?
Was the final output stored?

12. Kaizen Value

Measures whether the result produced learning for system improvement.

Questions:

Did the output reveal a failure pattern?
Did it create a useful improvement note?
Did it improve future evals?
Did it expose prompt, tool, source, or workflow weakness?
Can the case become a regression test?

Suggested Core Scorecard

Category	Score Range	Required For
Factual Accuracy	0–5	Research, Deep Search, Newsletter, Affiliate, Ads
Answer Relevancy	0–5	All AI Employees
Source Quality	0–5	Research, Deep Search, Affiliate, Ads, Content
Freshness	0–5	Time-sensitive workflows
Structure Compliance	Pass/Fail or 0–5	All structured workflows
Decision Usefulness	0–5	HeadOffice, Affiliate, Ads, Finance, Experimentation
Safety And Compliance	0–5	Ads, Content, Client-facing, Compliance-sensitive workflows
Confidence Calibration	0–5	All serious AI Employees
Cost Efficiency	0–5	Production workflows
Speed	0–5	Production workflows
Traceability	0–5	All AI Employees
Kaizen Value	0–5	All improvement workflows

Score Meaning

Score	Meaning
5	Excellent. Meets or exceeds MWMS standard
4	Strong. Minor issues only
3	Acceptable. Usable with caution
2	Weak. Needs revision or review
1	Poor. Not suitable for use
0	Failed. Incorrect, unsafe, irrelevant, or unusable

Pass Fail Rules

Some criteria should be hard pass/fail checks.

Examples:

valid JSON required
required fields present
no prohibited claims
source required
decision required
confidence score required
task ID required
output saved correctly
human review triggered when required

If a hard pass/fail requirement fails, the output should not be treated as complete even if other scores are high.
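
The rule above can be sketched as a gate that looks only at the hard checks. The type and function names below are assumptions for illustration, as are the check names in the usage example.

```typescript
interface EvaluatedOutput {
  // Hard pass/fail checks keyed by check name (names are illustrative).
  hardChecks: { [checkName: string]: boolean };
  // Numeric category scores on the 0 to 5 scale.
  scores: { [category: string]: number };
}

// Returns the names of failed hard checks. Scores are deliberately ignored here:
// a high score must never compensate for a failed hard requirement.
function failedHardChecks(output: EvaluatedOutput): string[] {
  return Object.keys(output.hardChecks).filter(
    (checkName) => output.hardChecks[checkName] !== true
  );
}

// An output counts as complete only when no hard check has failed.
function isComplete(output: EvaluatedOutput): boolean {
  return failedHardChecks(output).length === 0;
}

// Usage: strong scores, but the task ID was never recorded.
const example: EvaluatedOutput = {
  hardChecks: { validJson: true, sourcePresent: true, taskIdPresent: false },
  scores: { factualAccuracy: 5, answerRelevancy: 5 },
};
const complete = isComplete(example); // false
```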

Minimum Passing Standard

For normal internal MWMS AI work:

Factual Accuracy: minimum 3
Answer Relevancy: minimum 3
Structure Compliance: pass
Traceability: minimum 3
Decision Usefulness: minimum 3 where applicable

For higher-risk workflows:

Factual Accuracy: minimum 4
Source Quality: minimum 4
Safety And Compliance: minimum 4
Traceability: minimum 4
Human Review: required

For client-facing workflows:

Factual Accuracy: minimum 4
Answer Relevancy: minimum 4
Structure Compliance: pass
Safety And Compliance: minimum 4
Confidence Calibration: minimum 4
Human Review: required unless explicitly approved otherwise
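
One way to make these thresholds checkable is to hold them as data and compare a scorecard against the tier that applies. The sketch below transcribes the minimums from the three lists above. The tier keys, category keys, and function name are assumptions, and the "unless explicitly approved otherwise" exception for client-facing work is not modelled.

```typescript
type WorkflowTier = "internal" | "higherRisk" | "clientFacing";

interface TierRule {
  minimums: { [category: string]: number };
  // Categories that only need to meet the minimum when a score exists ("where applicable").
  whereApplicable: string[];
  structureMustPass: boolean;
  humanReviewRequired: boolean;
}

// Thresholds transcribed from the lists above; the key names are assumptions.
const TIER_RULES: Record<WorkflowTier, TierRule> = {
  internal: {
    minimums: { factualAccuracy: 3, answerRelevancy: 3, traceability: 3, decisionUsefulness: 3 },
    whereApplicable: ["decisionUsefulness"],
    structureMustPass: true,
    humanReviewRequired: false,
  },
  higherRisk: {
    minimums: { factualAccuracy: 4, sourceQuality: 4, safetyAndCompliance: 4, traceability: 4 },
    whereApplicable: [],
    structureMustPass: false, // not listed for this tier above
    humanReviewRequired: true,
  },
  clientFacing: {
    minimums: { factualAccuracy: 4, answerRelevancy: 4, safetyAndCompliance: 4, confidenceCalibration: 4 },
    whereApplicable: [],
    structureMustPass: true,
    humanReviewRequired: true,
  },
};

interface Scorecard {
  scores: { [category: string]: number };
  structurePassed: boolean;
  humanReviewed: boolean;
}

// Returns a list of unmet requirements; an empty list means the minimum standard is met.
function unmetRequirements(card: Scorecard, tier: WorkflowTier): string[] {
  const rule = TIER_RULES[tier];
  const unmet: string[] = [];
  for (const category of Object.keys(rule.minimums)) {
    const minimum = rule.minimums[category];
    const score = card.scores[category];
    if (score === undefined) {
      // A missing score only fails when the category is mandatory for this tier.
      if (rule.whereApplicable.indexOf(category) === -1) {
        unmet.push(category + ": no score recorded");
      }
    } else if (minimum !== undefined && score < minimum) {
      unmet.push(category + ": " + score + " is below minimum " + minimum);
    }
  }
  if (rule.structureMustPass && !card.structurePassed) {
    unmet.push("structureCompliance: failed");
  }
  if (rule.humanReviewRequired && !card.humanReviewed) {
    unmet.push("humanReview: required but not done");
  }
  return unmet;
}
```

Keeping the thresholds in one table means a change to the standard becomes a data change, not a code change.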

Automatic Review Triggers

Human review should be triggered when:

factual accuracy score is below 4 on a high-risk workflow
answer relevancy is below 3
source quality is below 3
freshness is unknown on a time-sensitive topic
safety score is below 4
traceability is below 3
confidence is high but evidence is weak
sources conflict
tool calls failed
database writes failed
cost was excessive
the output affects budget, campaigns, compliance, or client-facing decisions
the AI Employee repeats a previous failure
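
The triggers above can be expressed as a list of named rules evaluated against a run summary. This is a sketch under stated assumptions: the RunSummary shape is hypothetical, and conditions such as "cost was excessive" or "confidence is high but evidence is weak" are taken as flags set upstream, because this standard does not define how they are detected.

```typescript
interface RunSummary {
  highRisk: boolean;
  timeSensitive: boolean;
  freshnessKnown: boolean;
  scores: {
    factualAccuracy: number;
    answerRelevancy: number;
    sourceQuality: number;
    safety: number;
    traceability: number;
  };
  // Flags assumed to be set upstream; detection logic is outside this sketch.
  confidenceHighButEvidenceWeak: boolean;
  sourcesConflict: boolean;
  toolCallsFailed: boolean;
  databaseWritesFailed: boolean;
  costExcessive: boolean;
  affectsSensitiveDecision: boolean; // budget, campaigns, compliance, or client-facing
  repeatsPreviousFailure: boolean;
}

interface ReviewTrigger {
  reason: string;
  applies: (run: RunSummary) => boolean;
}

// One entry per trigger in the list above.
const REVIEW_TRIGGERS: ReviewTrigger[] = [
  { reason: "factual accuracy below 4 on a high-risk workflow", applies: (r) => r.highRisk && r.scores.factualAccuracy < 4 },
  { reason: "answer relevancy below 3", applies: (r) => r.scores.answerRelevancy < 3 },
  { reason: "source quality below 3", applies: (r) => r.scores.sourceQuality < 3 },
  { reason: "freshness unknown on a time-sensitive topic", applies: (r) => r.timeSensitive && !r.freshnessKnown },
  { reason: "safety score below 4", applies: (r) => r.scores.safety < 4 },
  { reason: "traceability below 3", applies: (r) => r.scores.traceability < 3 },
  { reason: "confidence high but evidence weak", applies: (r) => r.confidenceHighButEvidenceWeak },
  { reason: "sources conflict", applies: (r) => r.sourcesConflict },
  { reason: "tool calls failed", applies: (r) => r.toolCallsFailed },
  { reason: "database writes failed", applies: (r) => r.databaseWritesFailed },
  { reason: "cost was excessive", applies: (r) => r.costExcessive },
  { reason: "affects budget, campaigns, compliance, or client-facing decisions", applies: (r) => r.affectsSensitiveDecision },
  { reason: "repeats a previous failure", applies: (r) => r.repeatsPreviousFailure },
];

// Returns every reason that applies; an empty list means no trigger fired.
function humanReviewReasons(run: RunSummary): string[] {
  return REVIEW_TRIGGERS.filter((t) => t.applies(run)).map((t) => t.reason);
}
```

Returning the reasons, not just a yes or no, gives the reviewer the context for why the output was escalated.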

Failure Conditions

An AI Employee output should be marked failed if:

it does not answer the task
it fabricates facts
it uses no evidence when evidence is required
it cites weak sources as strong proof
it ignores source freshness
it gives a decision without enough support
it produces invalid structured output
it omits required fields
it fails to route correctly
it hides uncertainty
it overstates confidence
it creates compliance risk
it cannot be traced
it cannot be linked to a task, thread, source, or workflow
it repeats a known regression failure

Evaluation Dataset Standard

MWMS evaluation datasets must reflect real MWMS work.

Datasets should include:

real user requests
real newsletter examples
real offer evaluation cases
real source inspection examples
real ad compliance examples
real campaign decision examples
real content research examples
real system failures
real edge cases
realistic synthetic cases only where real data is not available

Generic examples are not enough.

The dataset should test the AI Employee against the work it is actually expected to perform.

Dataset Difficulty Rule

Good eval datasets should include difficult cases.

Examples:

ambiguous requests
stale information traps
conflicting sources
weak evidence
missing source dates
multi-step research tasks
compliance-sensitive claims
source-quality edge cases
expensive workflow traps
irrelevant but factual answer traps
wrong Brain routing traps
output-format traps
confidence calibration traps

The goal is not to make the AI Employee look good.

The goal is to expose failure before the system depends on it.

Dataset Types

MWMS should organise AI Employee eval datasets into three main groups.

1. Dev Dataset

The Dev Dataset is used while improving an AI Employee.

Purpose:

prompt improvement
tool improvement
source rule improvement
workflow refinement
experimentation
debugging

Examples:

new offer evaluation examples
new newsletter extraction examples
new Deep Search research tasks
new source quality cases
new output format tests

Dev datasets can change often.

2. CI Dataset

The CI Dataset is a smaller must-pass test set.

Purpose:

catch obvious breakages
protect critical output structure
ensure basic behaviour remains stable
validate required fields
test essential safety rules

Examples:

valid JSON test
source required test
confidence required test
decision required test
no banned claims test
correct Brain routing test
required metadata test

CI datasets should stay small enough to run frequently.

3. Regression Dataset

The Regression Dataset contains previous failures that must not return.

Purpose:

prevent old mistakes from coming back
preserve lessons learned
convert failures into system protection
support Kaizen improvement

Examples:

AI hallucinated affiliate metrics
AI approved a weak offer
AI ignored source freshness
AI failed to cite sources
AI routed a task to the wrong Brain
AI missed a compliance risk
AI produced invalid JSON
AI gave high confidence with weak evidence
AI used outdated policy information
AI created an unsupported recommendation

Regression datasets are among the most valuable MWMS assets.

Every serious failure should be considered for regression capture.
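
A minimal sketch of regression capture, assuming each past failure is stored with the input that caused it and a check that detects whether the failure has returned. The RegressionCase shape, the case ids, and the "Sources:" marker are hypothetical. The produceOutput parameter stands in for the AI Employee and is synchronous only to keep the example short.

```typescript
interface RegressionCase {
  id: string;
  input: string; // the task that originally failed
  failureReason: string; // what went wrong the first time
  passes: (output: string) => boolean; // true when the old failure has not returned
}

// Re-runs every captured case and returns the ids of failures that have returned.
function findReturnedFailures(
  cases: RegressionCase[],
  produceOutput: (input: string) => string
): string[] {
  return cases.filter((c) => !c.passes(produceOutput(c.input))).map((c) => c.id);
}

function isValidJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Two cases drawn from the example failures listed above.
const regressionSuite: RegressionCase[] = [
  {
    id: "reg-001",
    input: "Summarise this offer as JSON",
    failureReason: "AI produced invalid JSON",
    passes: isValidJson,
  },
  {
    id: "reg-002",
    input: "Summarise this offer as JSON",
    failureReason: "AI failed to cite sources",
    // Assumes outputs mark citations with the text "Sources:".
    passes: (output) => output.indexOf("Sources:") !== -1,
  },
];
```

Storing the original failure reason alongside the check preserves the lesson as well as the test.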

Evaluation Data Flywheel

MWMS should use an evaluation data flywheel.

The loop is:

AI Employee performs work
→ traces and metadata are captured
→ human or system feedback identifies quality issues
→ failures are added to eval datasets
→ evals expose prompt, tool, source, or workflow weakness
→ system is improved
→ AI Employee performs better
→ new traces create more learning

This aligns directly with the MWMS Kaizen loop.

Reflect
→ Reduce
→ Refine
→ Record

The more MWMS uses its AI Employees, the stronger the evaluation system becomes.

Factuality Evaluation Standard

Factuality evaluation checks whether the output is true and supported.

Suggested judgement categories:

Category	Meaning
Fully Supported	All key claims are supported by reliable evidence
Mostly Supported	Main claims are supported, minor gaps exist
Partially Supported	Some claims are supported, but important gaps exist
Weakly Supported	Evidence is thin, indirect, or questionable
Unsupported	Key claims lack evidence
Incorrect	Output conflicts with available evidence

Factuality should consider:

source reliability
source freshness
claim support
unsupported assumptions
conflicting evidence
hallucinated specifics
overconfident wording

Answer Relevancy Evaluation Standard

Answer relevancy checks whether the output answered the actual task.

Suggested judgement categories:

Category	Meaning
Directly Relevant	Fully answers the user’s task
Mostly Relevant	Answers the main task with minor drift
Partially Relevant	Some useful content but misses important parts
Weakly Relevant	Mostly background or loosely related
Irrelevant	Does not answer the task
Misaligned	Answers the wrong question or workflow

Relevancy should consider:

user intent
Brain context
required output type
decision need
operational usefulness
whether the answer moves the workflow forward

An answer can be factual and still fail relevancy.

Source Quality Evaluation Standard

Source quality checks whether the evidence was suitable.

Suggested judgement categories:

Category	Meaning
Strong Sources	Official, current, credible, and directly relevant
Good Sources	Mostly credible with minor limitations
Acceptable Sources	Usable but requires caution
Weak Sources	Thin, biased, outdated, or indirect
Insufficient Sources	Not enough evidence
No Sources	Source requirement failed

Source quality should consider:

source type
trust rating
freshness
relevance
bias
commercial motive
corroboration
whether source was inspected directly

Freshness Evaluation Standard

Freshness checks whether the information is current enough.

Suggested judgement categories:

Category	Meaning
Current	Suitable for time-sensitive use
Recent Enough	Acceptable for the task
Possibly Outdated	Use with caution
Outdated	Should not be used for current decisions
Unknown	Date could not be confirmed
Not Applicable	Topic is evergreen or historical

Freshness matters especially for:

policies
laws
platform rules
affiliate payouts
product pricing
tool features
campaign performance
AI model capabilities
current events
market trends

Structure Compliance Evaluation Standard

Structure compliance checks whether the output followed MWMS rules.

Suggested checks:

required title present
required sections present
required metadata present
correct Brain routing included
decision included
next action included
confidence included
risk level included
valid JSON where required
no prohibited title format
follows full page output standard where requested
follows MCR page structure where required

Structure compliance should use deterministic checks wherever possible.

Decision Usefulness Evaluation Standard

Decision usefulness checks whether the output helps MWMS take action.

Suggested judgement categories:

Category	Meaning
Decision Ready	Clear recommendation and next action
Useful	Helps decision-making but needs minor review
Partially Useful	Contains useful information but lacks decision clarity
Weak	Requires major human interpretation
Not Useful	Does not support action
Risky	Could lead to poor decision

Decision usefulness is especially important for:

HeadOffice Intelligence
Affiliate Brain
Ads Brain
Finance Brain
Experimentation Brain
Strategy Brain

Safety And Compliance Evaluation Standard

Safety and compliance checks whether the output creates risk.

Suggested judgement categories:

Category	Meaning
Safe	No obvious compliance or reputational risk
Mostly Safe	Minor caution needed
Review Needed	Could create risk if used directly
Risky	Contains questionable claims or advice
Unsafe	Should not be used
Escalate	Requires human or specialist review

This applies especially to:

ad copy
health claims
financial claims
legal claims
income claims
product claims
client-facing outputs
compliance-sensitive campaigns

Confidence Calibration Evaluation Standard

Confidence calibration checks whether the stated confidence matches the evidence.

Suggested judgement categories:

Category	Meaning
Well Calibrated	Confidence matches evidence
Slightly High	Confidence is a little stronger than evidence supports
Slightly Low	Confidence is too cautious but safe
Overconfident	Confidence is too high for evidence
Underconfident	Confidence is too low for strong evidence
Misleading	Confidence could cause bad decisions

Confidence must be reduced when:

sources are weak
sources are outdated
source freshness is unknown
tools failed
database writes failed
evidence conflicts
output is incomplete
assumptions are required
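
This standard says confidence must be reduced under these conditions but does not say by how much. The sketch below is one possible approach, assuming a 0 to 100 confidence scale and a flat 10-point penalty per condition. Both the scale and the penalty are assumptions made for illustration, as are all names in the code.

```typescript
// The eight conditions listed above, as boolean flags (names are assumptions).
interface ConfidenceConditions {
  sourcesWeak: boolean;
  sourcesOutdated: boolean;
  freshnessUnknown: boolean;
  toolsFailed: boolean;
  databaseWritesFailed: boolean;
  evidenceConflicts: boolean;
  outputIncomplete: boolean;
  assumptionsRequired: boolean;
}

// ASSUMPTION: a flat 10-point penalty per condition on a 0 to 100 scale.
// The standard requires a reduction but does not define its size.
const PENALTY_PER_CONDITION = 10;

// Lowers the stated confidence once for each condition that applies,
// then clamps the result to the 0 to 100 range.
function adjustConfidence(statedConfidence: number, c: ConfidenceConditions): number {
  const flags = [
    c.sourcesWeak,
    c.sourcesOutdated,
    c.freshnessUnknown,
    c.toolsFailed,
    c.databaseWritesFailed,
    c.evidenceConflicts,
    c.outputIncomplete,
    c.assumptionsRequired,
  ];
  const triggered = flags.filter((flag) => flag).length;
  const adjusted = statedConfidence - triggered * PENALTY_PER_CONDITION;
  return Math.max(0, Math.min(100, adjusted));
}
```

A real implementation might weight conditions differently, for example treating conflicting evidence as more serious than an unknown date.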

AI Employee Scorecard Profiles

Each AI Employee should eventually have a defined scorecard profile.

A profile should include:

AI Employee name
owning Brain
workflow types
required scorecard categories
required deterministic checks
required judge evals
minimum passing thresholds
human review triggers
regression dataset rules
Kaizen routing rules

Example:

AI Employee	Required Scorecard Focus
Newsletter Intelligence Extractor	relevance, routing, structure, signal quality
Research Brain Source Analyst	factuality, source quality, freshness
Affiliate Offer Evaluator	decision usefulness, source quality, risk, factuality
Ads Compliance Reviewer	safety, policy accuracy, claim risk
Content Brain Research Assistant	relevance, source quality, usefulness
HeadOffice Decision Assistant	decision usefulness, traceability, confidence
Dev Console Helper	relevance, technical accuracy, safety
Client Facing AI Assistant	factuality, safety, tone, confidence, human review
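
A scorecard profile could be held as a typed record with one field per item in the profile list above. The sketch below is illustrative only: the field names, the example values for the Research Brain Source Analyst, and the helper categoriesMissingThresholds are all assumptions.

```typescript
// One field per item in the profile list above (field names are assumptions).
interface ScorecardProfile {
  aiEmployeeName: string;
  owningBrain: string;
  workflowTypes: string[];
  requiredCategories: string[];
  requiredDeterministicChecks: string[];
  requiredJudgeEvals: string[];
  minimumThresholds: { [category: string]: number };
  humanReviewTriggers: string[];
  regressionDatasetRule: string;
  kaizenRoutingRule: string;
}

// Returns required categories that have no minimum threshold defined,
// so an incomplete profile is caught before it is used.
function categoriesMissingThresholds(profile: ScorecardProfile): string[] {
  return profile.requiredCategories.filter(
    (category) => profile.minimumThresholds[category] === undefined
  );
}

// Example profile; every value here is illustrative, not part of the standard.
const sourceAnalystProfile: ScorecardProfile = {
  aiEmployeeName: "Research Brain Source Analyst",
  owningBrain: "Research Brain",
  workflowTypes: ["source analysis"],
  requiredCategories: ["factuality", "sourceQuality", "freshness"],
  requiredDeterministicChecks: ["source-present", "confidence-present"],
  requiredJudgeEvals: ["factuality", "sourceQuality"],
  minimumThresholds: { factuality: 4, sourceQuality: 4 }, // freshness left out on purpose
  humanReviewTriggers: ["source quality below 3", "sources conflict"],
  regressionDatasetRule: "consider every serious failure for regression capture",
  kaizenRoutingRule: "route improvement notes to the owning Brain",
};

const gaps = categoriesMissingThresholds(sourceAnalystProfile); // ["freshness"]
```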

Workflow Scorecard Profiles

Different workflows require different scorecard depth.

Workflow Type	Required Evaluation Depth
Simple internal chat reply	light
Dev Console support	medium
Newsletter extraction	medium
Newsletter routing decision	medium to high
Deep Search research	high
Affiliate offer evaluation	high
Ads compliance review	high
Campaign recommendation	high
Budget recommendation	high
Client-facing output	very high

Promotion And Restriction Rules

AI Employees can earn more autonomy only when evaluation results support it.

Promotion Conditions

An AI Employee may be considered for more autonomy when:

repeated eval scores are strong
regression failures are low
human review approval rate is high
trace quality is strong
cost is controlled
source quality is consistent
safety issues are rare
confidence is well calibrated

Restriction Conditions

An AI Employee should be restricted when:

repeated factuality failures occur
regression failures return
structure compliance fails often
source quality is weak
confidence is poorly calibrated
cost becomes excessive
human review rejection rate is high
compliance risks appear
traceability is incomplete

Reporting Requirements

HeadOffice should eventually be able to review AI Employee evaluation performance.

Suggested reporting fields:

AI Employee name
Brain
number of evaluated runs
average factuality score
average relevancy score
average source quality score
average safety score
average decision usefulness score
pass rate
fail rate
human review rate
regression failure count
cost per successful output
most common failure reason
Kaizen actions created
promotion or restriction recommendation
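
Several of the reporting fields above are simple aggregates over evaluated runs. The sketch below computes a subset of them. The EvaluatedRun shape and function name are hypothetical, costs are held in cents to avoid rounding noise, and "cost per successful output" is assumed to mean total spend, including failed runs, divided by the number of passed runs.

```typescript
interface EvaluatedRun {
  passed: boolean;
  factualityScore: number; // 0 to 5
  costCents: number;
  failureReason?: string;
}

interface EmployeeReport {
  evaluatedRuns: number;
  averageFactuality: number;
  passRate: number;
  failRate: number;
  costPerSuccessfulOutputCents: number | null;
  mostCommonFailureReason: string | null;
}

function buildReport(runs: EvaluatedRun[]): EmployeeReport {
  const total = runs.length;
  const passedCount = runs.filter((r) => r.passed).length;
  const totalCost = runs.reduce((sum, r) => sum + r.costCents, 0);
  const factualitySum = runs.reduce((sum, r) => sum + r.factualityScore, 0);

  // Count failure reasons and keep the most frequent one.
  const counts: { [reason: string]: number } = {};
  let topReason: string | null = null;
  let topCount = 0;
  for (const run of runs) {
    if (run.passed || run.failureReason === undefined) continue;
    const count = (counts[run.failureReason] || 0) + 1;
    counts[run.failureReason] = count;
    if (count > topCount) {
      topCount = count;
      topReason = run.failureReason;
    }
  }

  return {
    evaluatedRuns: total,
    averageFactuality: total === 0 ? 0 : factualitySum / total,
    passRate: total === 0 ? 0 : passedCount / total,
    failRate: total === 0 ? 0 : (total - passedCount) / total,
    // ASSUMPTION: total spend (including failed runs) divided by passed runs.
    costPerSuccessfulOutputCents: passedCount === 0 ? null : totalCost / passedCount,
    mostCommonFailureReason: topReason,
  };
}
```

Charging failed runs against successful ones makes wasted spend visible in a single number.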

Relationship To Observability Metadata

This standard depends on the MWMS AI Observability Metadata Standard.

Evaluation becomes stronger when traces include:

Brain
AI Employee
task ID
workflow type
model
tools
sources
database records
cost
latency
confidence
review status
decision outcome
Kaizen notes

Without metadata, evaluation is limited.

Without evaluation, metadata becomes passive logging.

Together, metadata and scorecards create accountable AI Employees.

Relationship To Deep Search Quality

This standard supports the MWMS Deep Search Quality And Observability Framework.

Deep Search AI Employees should be scored especially on:

factuality
source quality
freshness
answer relevancy
source inspection
traceability
cost efficiency
decision usefulness

Deep Search should not be trusted if it cannot show evidence, source freshness, and evaluation results.

Relationship To Experimentation Brain

This standard supports Experimentation Brain by turning AI Employee improvement into measurable tests.

Evaluation scorecards create:

baseline performance
test conditions
pass/fail thresholds
prompt experiment evidence
tool experiment evidence
model comparison evidence
regression protection
improvement tracking

AI Employee changes should eventually be treated like experiments where possible.

Relationship To Kaizen

This standard supports the MWMS Kaizen loop.

Each evaluation should help identify:

what worked
what failed
what should be reduced
what should be refined
what should be recorded
what should become a regression test
what should become a prompt improvement
what should become a tool improvement
what should become a workflow improvement

Evaluation is one of the main ways MWMS turns AI failure into system growth.

Minimum Starting Implementation

MWMS does not need to implement the full scorecard system immediately.

The first practical implementation should include:

required field checks
source-present checks
decision-present checks
confidence-present checks
valid structure checks
basic factuality judge
basic answer relevancy judge
human review status
failure reason
Kaizen note
regression case capture

This is enough to start improving AI Employee quality without slowing development.

Future Enhancements

Future enhancements may include:

MWMS AI Employee Eval Registry
MWMS Eval Dataset Registry
MWMS Regression Failure Library
MWMS AI Employee Performance Dashboard
MWMS HeadOffice Evaluation Dashboard
MWMS AI Employee Promotion And Restriction Standard
MWMS Model Comparison Evaluation Standard
MWMS Prompt Optimisation Evaluation Protocol
MWMS Deep Search Source Record Standard
MWMS Client Facing AI Quality Assurance Standard

Drift Protection

This standard prevents the following drift:

judging AI by vibes
trusting confident outputs without evidence
improving prompts randomly
repeating old AI failures
scaling untested AI Employees
using only easy test cases
relying only on human memory
confusing factuality with relevancy
confusing structure compliance with quality
ignoring cost and latency
ignoring regression failures
allowing AI Employees to gain autonomy without proof
losing Kaizen learning from failures

If an AI Employee cannot pass the required scorecard for its role, it should not be trusted with higher responsibility.

Architectural Intent

The architectural intent of this standard is to turn MWMS AI Employees into measurable workers.

A human worker is judged by output quality, reliability, accuracy, cost, improvement, and business usefulness.

AI Employees should be judged the same way.

MWMS should not scale AI Employees because they sound smart.

MWMS should scale AI Employees because they have been evaluated, reviewed, improved, and proven useful.

This standard creates the evaluation layer required for a governable AI workforce.

Change Log

v1.0 Initial Draft

Created the MWMS AI Employee Evaluation Scorecard Standard based on absorbed insights from Matt Pocock AIhero Build DeepSearch In TypeScript.

Integrated principles from course blocks covering:

making AI systems testable
deterministic evaluations
LLM-as-a-judge evaluations
factuality scoring
answer relevancy scoring
dataset creation
dev, CI, and regression dataset organisation
hard case evaluation
data flywheel improvement
prompt optimisation through eval results
AI Employee confidence calibration
regression protection
Kaizen improvement routing

Established this standard as the MWMS governance page for evaluating AI Employee quality, reliability, safety, usefulness, and readiness for increased autonomy.