MWMS AI Employee Evaluation Scorecard Standard

System: MWMS
Document Type: Standard
Status: Draft For MCR
Authority: HeadOffice
Applies To: All MWMS Brains, AI Employees, Deep Search Workflows, Research Workflows, Newsletter Intelligence, Affiliate Evaluation, Ads Review, Content Workflows, Experimentation Brain, Future Client Facing AI Systems
Primary Location: MCR
Future Operational Destination: mwmsbrain.site, mwmsheadofficebrain.site, AI Employee Dashboards, HeadOffice Evaluation Dashboards
Parent Page: HeadOffice
Source Of Truth: MCR
Related Frameworks: MWMS Deep Search Quality And Observability Framework, MWMS AI Observability Metadata Standard
Course Source: Matt Pocock AIhero Build DeepSearch In TypeScript
Absorption Status: Approved For Integration


Purpose

The purpose of this standard is to define how MWMS evaluates AI Employees using structured, repeatable, and business-aligned scorecards.

MWMS must not judge AI Employees only by whether an answer sounds confident, polished, or impressive.

An AI Employee must be evaluated against defined criteria, including:

  • factual accuracy
  • answer relevancy
  • source quality
  • freshness
  • structure compliance
  • decision usefulness
  • safety
  • cost efficiency
  • speed
  • traceability
  • regression protection
  • human review usefulness

This standard ensures MWMS can measure whether an AI Employee is trustworthy, improving, drifting, failing, or ready for greater autonomy.


Scope

This standard applies to all AI Employees and AI-assisted workflows across MWMS.

This includes:

  • HeadOffice AI decision support
  • Research Brain source analysis
  • Affiliate Brain offer evaluation
  • Ads Brain compliance review
  • Content Brain research and drafting
  • Data Brain validation
  • Experimentation Brain test analysis
  • Finance Brain budget and cost analysis
  • Newsletter Intelligence extraction
  • Brain Room AI replies
  • Dev Console assistance
  • Deep Search agents
  • future client-facing AI systems
  • future AIBS business assistants
  • future AI dashboards and evaluation systems

This standard does not define exact code, Evalite implementation, Langfuse setup, TypeScript architecture, or specific scoring library syntax.

It defines the MWMS evaluation governance standard.


Core Rule

Every serious AI Employee must be evaluated against clear success criteria before MWMS relies on it for important decisions.

If an AI Employee cannot be tested, scored, reviewed, and improved, it should not be trusted with meaningful business responsibility.

The MWMS rule is:

If we cannot evaluate the AI Employee, we cannot trust the AI Employee.


Definition Of An AI Employee Evaluation Scorecard

An AI Employee Evaluation Scorecard is a structured assessment system that measures the quality, reliability, usefulness, and safety of an AI Employee’s output.

A scorecard may include:

  • deterministic checks
  • LLM-as-a-judge evaluations
  • human review
  • source checks
  • metadata checks
  • cost checks
  • regression checks
  • confidence checks
  • business usefulness checks

The scorecard must evaluate the AI Employee’s actual work, not just its final wording.


Why MWMS Needs Evaluation Scorecards

MWMS is building a system of AI Employees, not casual chatbots.

AI Employees may support:

  • research
  • campaign decisions
  • offer evaluation
  • compliance review
  • content strategy
  • data interpretation
  • newsletter intelligence
  • client reporting
  • business decision-making

Poor AI output can cause:

  • wasted time
  • bad offer choices
  • weak campaigns
  • compliance risk
  • false confidence
  • poor client outcomes
  • cost blowouts
  • wrong routing
  • bad data records
  • broken automation
  • loss of trust in the MWMS system

Evaluation scorecards prevent MWMS from scaling weak AI behaviour.


Evaluation Philosophy

MWMS evaluation should follow five principles.

1. Evaluate Real Work

AI Employees should be tested against tasks that resemble real MWMS workflows.

Generic examples are not enough.

2. Separate Structure From Judgement

Some quality checks are deterministic.

Some require judgement.

MWMS should use both.

3. Score With Defined Categories

Do not rely only on vague 1 to 10 ratings.

Use clear judgement categories first, then convert them into scores where needed.

4. Preserve Failure Cases

Failures should become regression tests so the same failure does not return later.

5. Improve Through Kaizen

Evaluation exists to improve the system, not only to criticise outputs.


Evaluation Types

MWMS should use three main evaluation types.


1. Deterministic Evaluations

Deterministic evaluations are simple, repeatable checks.

They are similar to unit tests.

They are useful when the expected behaviour is clear.

Examples:

  • Did the output include required fields?
  • Did the output include sources?
  • Did the output include a decision?
  • Did the output include a confidence score?
  • Did the output include a risk level?
  • Did the output include a next action?
  • Did the output use valid JSON?
  • Did the output follow the required format?
  • Did the output include a date awareness note where needed?
  • Did the output avoid banned wording?
  • Did the output route to the correct Brain?
  • Did the output include a Kaizen note?

Deterministic Evaluation Rule

Use deterministic evaluations first wherever possible.

They are cheaper, faster, and more consistent than LLM-based judgement.
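
As an illustration only, a deterministic check can be a plain function over the raw output. The sketch below assumes the output is a JSON object; the field names (decision, confidence, riskLevel, nextAction, sources) are hypothetical and are not a schema defined by this standard.

```typescript
// Hypothetical required fields; the real MWMS output schema is not defined here.
const REQUIRED_FIELDS = ["decision", "confidence", "riskLevel", "nextAction", "sources"] as const;

interface DeterministicResult {
  validJson: boolean;
  missingFields: string[];
  passed: boolean;
}

// Parses the raw output and reports which required fields are absent or empty.
function checkRequiredFields(rawOutput: string): DeterministicResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    // Invalid JSON fails every field check.
    return { validJson: false, missingFields: [...REQUIRED_FIELDS], passed: false };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    // Valid JSON, but not an object that could carry the fields.
    return { validJson: true, missingFields: [...REQUIRED_FIELDS], passed: false };
  }
  const record = parsed as Record<string, unknown>;
  const missingFields = REQUIRED_FIELDS.filter((field) => {
    const value = record[field];
    if (value === undefined || value === null || value === "") return true;
    return Array.isArray(value) && value.length === 0; // e.g. an empty sources list
  });
  return { validJson: true, missingFields, passed: missingFields.length === 0 };
}
```

A check like this needs no model call, so it can run on every output before any judge evaluation is attempted.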


2. LLM As Judge Evaluations

LLM-as-a-judge evaluations use another model to judge qualities that cannot be easily checked with simple rules.

Examples:

  • Is the answer factually accurate?
  • Is the answer relevant?
  • Did the answer overclaim?
  • Did the answer use the evidence correctly?
  • Is the recommendation justified?
  • Is the answer decision-ready?
  • Is the confidence level reasonable?
  • Is the tone appropriate?
  • Is the output safe?
  • Does the answer match the user’s real intent?

LLM Judge Rule

LLM judges should be used for judgement-heavy criteria, not simple required-field checks.

Do not waste judge calls on things deterministic tests can handle.


3. Human Review Evaluations

Human review is required for higher-risk decisions.

Examples:

  • campaign launch decisions
  • compliance-sensitive recommendations
  • budget decisions
  • offer approval
  • client-facing recommendations
  • legal, policy, medical, or financial risk areas
  • low-confidence AI outputs
  • weak-source outputs
  • failed or partial traces
  • disputed outputs
  • repeated AI Employee failures

Human Review Rule

LLM-as-a-judge can assist review, but it does not replace human oversight for high-risk MWMS decisions.


Score Categories Before Numeric Scores

MWMS should avoid vague judge prompts like:

Rate this answer from 1 to 10.

Instead, the judge should choose from structured categories.

Example factuality categories:

Category | Meaning | Suggested Score
Fully Supported | All important claims are supported by evidence | 5
Mostly Supported | Main claims are supported, minor issues exist | 4
Partially Supported | Some claims are supported, but gaps exist | 3
Weakly Supported | Evidence is thin or unclear | 2
Unsupported | Major claims lack evidence | 1
Incorrect | Answer contradicts evidence | 0

Named categories with fixed meanings make judge scores more consistent and easier to audit than a free-form number.
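
One possible way to convert a judge's category into a score is a fixed lookup. The labels and scores below are copied from the factuality example above; the function names are hypothetical, and returning null for an unrecognised label is an assumption about how bad judge output might be handled.

```typescript
// Category labels and scores taken from the factuality example in this section.
const FACTUALITY_SCORES = {
  "Fully Supported": 5,
  "Mostly Supported": 4,
  "Partially Supported": 3,
  "Weakly Supported": 2,
  "Unsupported": 1,
  "Incorrect": 0,
} as const;

type FactualityCategory = keyof typeof FACTUALITY_SCORES;

// Type guard: true only for labels that appear in the lookup.
function isFactualityCategory(label: string): label is FactualityCategory {
  return Object.prototype.hasOwnProperty.call(FACTUALITY_SCORES, label);
}

// Converts a judge's label into a score. Unknown labels return null so they
// can be flagged for review instead of being silently scored.
function scoreFactuality(judgeLabel: string): number | null {
  const label = judgeLabel.trim();
  return isFactualityCategory(label) ? FACTUALITY_SCORES[label] : null;
}
```

Because the judge only chooses a label, the numeric scale can be changed later without rewriting the judge prompt.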


Core Scorecard Categories

Every serious AI Employee should eventually be evaluated across these categories.

1. Factual Accuracy

Measures whether the output is true and evidence-supported.

Questions:

  • Are the claims correct?
  • Are claims supported by sources or records?
  • Did the AI avoid hallucination?
  • Did it separate facts from assumptions?
  • Did it handle uncertainty honestly?

2. Answer Relevancy

Measures whether the output answered the actual task.

Questions:

  • Did the answer address the user’s question?
  • Did it stay on task?
  • Did it avoid irrelevant filler?
  • Did it solve the operational problem?
  • Did it produce the expected type of output?

3. Source Quality

Measures whether the sources used were suitable.

Questions:

  • Were sources official, credible, or appropriate?
  • Were weak sources flagged?
  • Were commercial or biased sources handled carefully?
  • Were conflicting sources identified?
  • Were sources actually inspected where needed?

4. Freshness

Measures whether the information was current enough.

Questions:

  • Did the AI understand whether the task was time-sensitive?
  • Were current sources used where needed?
  • Were outdated sources flagged?
  • Was the current date considered?
  • Was uncertainty stated when freshness was unknown?

5. Structure Compliance

Measures whether the output followed the required MWMS format.

Questions:

  • Were required sections included?
  • Were required fields present?
  • Was output valid JSON where needed?
  • Was the correct page/output standard followed?
  • Was the required decision or next action included?

6. Decision Usefulness

Measures whether the output helps MWMS act.

Questions:

  • Did it produce a clear decision?
  • Did it help move the workflow forward?
  • Did it identify risk?
  • Did it identify opportunity?
  • Did it give a useful next step?
  • Would HeadOffice or the relevant Brain actually use it?

7. Safety And Compliance

Measures whether the output avoids business, policy, legal, platform, or reputational risk.

Questions:

  • Did it avoid unsafe claims?
  • Did it avoid prohibited or risky recommendations?
  • Did it flag uncertainty?
  • Did it avoid overpromising?
  • Did it respect platform or compliance boundaries?

8. Confidence Calibration

Measures whether the confidence level matches the evidence.

Questions:

  • Was confidence too high?
  • Was confidence too low?
  • Did weak evidence reduce confidence?
  • Did source conflict reduce confidence?
  • Did failed tools reduce confidence?
  • Was human review triggered when confidence was low?

9. Cost Efficiency

Measures whether the output was worth the cost.

Questions:

  • Did the workflow use too many model calls?
  • Did it inspect too many sources?
  • Did it retry unnecessarily?
  • Was the cost reasonable for the task value?
  • Did failed outputs waste budget?

10. Speed

Measures whether the workflow was fast enough for the use case.

Questions:

  • Was latency acceptable?
  • Were slow tools identified?
  • Were database delays visible?
  • Was the crawler too slow?
  • Did the response time match the task priority?

11. Traceability

Measures whether the workflow can be reviewed after the fact.

Questions:

  • Was the task ID captured?
  • Was the Brain captured?
  • Was the AI Employee captured?
  • Were model calls logged?
  • Were tool calls logged?
  • Were source records linked?
  • Were database writes visible?
  • Was the final output stored?

12. Kaizen Value

Measures whether the result produced learning for system improvement.

Questions:

  • Did the output reveal a failure pattern?
  • Did it create a useful improvement note?
  • Did it improve future evals?
  • Did it expose prompt, tool, source, or workflow weakness?
  • Can the case become a regression test?

Suggested Core Scorecard

Category | Score Range | Required For
Factual Accuracy | 0–5 | Research, Deep Search, Newsletter, Affiliate, Ads
Answer Relevancy | 0–5 | All AI Employees
Source Quality | 0–5 | Research, Deep Search, Affiliate, Ads, Content
Freshness | 0–5 | Time-sensitive workflows
Structure Compliance | Pass/Fail or 0–5 | All structured workflows
Decision Usefulness | 0–5 | HeadOffice, Affiliate, Ads, Finance, Experimentation
Safety And Compliance | 0–5 | Ads, Content, Client-facing, Compliance-sensitive workflows
Confidence Calibration | 0–5 | All serious AI Employees
Cost Efficiency | 0–5 | Production workflows
Speed | 0–5 | Production workflows
Traceability | 0–5 | All AI Employees
Kaizen Value | 0–5 | All improvement workflows

Score Meaning

Score | Meaning
5 | Excellent. Meets or exceeds MWMS standard
4 | Strong. Minor issues only
3 | Acceptable. Usable with caution
2 | Weak. Needs revision or review
1 | Poor. Not suitable for use
0 | Failed. Incorrect, unsafe, irrelevant, or unusable

Pass Fail Rules

Some criteria should be hard pass/fail checks.

Examples:

  • valid JSON required
  • required fields present
  • no prohibited claims
  • source required
  • decision required
  • confidence score required
  • task ID required
  • output saved correctly
  • human review triggered when required

If a hard pass/fail requirement fails, the output should not be treated as complete even if other scores are high.
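
A minimal sketch of this gate, assuming each hard check has already been run and recorded as a name plus a pass flag. The type and function names are hypothetical.

```typescript
interface HardCheck {
  name: string;
  passed: boolean;
}

interface ScoredOutput {
  hardChecks: HardCheck[];
  scores: Record<string, number>; // graded 0-5 categories
}

// Returns "incomplete" when any hard check fails. The graded scores are
// deliberately never consulted, so high scores cannot override a hard failure.
function completionStatus(output: ScoredOutput): { status: "complete" | "incomplete"; failedChecks: string[] } {
  const failedChecks = output.hardChecks.filter((check) => !check.passed).map((check) => check.name);
  return { status: failedChecks.length === 0 ? "complete" : "incomplete", failedChecks };
}
```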


Minimum Passing Standard

For normal internal MWMS AI work:

  • Factual Accuracy: minimum 3
  • Answer Relevancy: minimum 3
  • Structure Compliance: pass
  • Traceability: minimum 3
  • Decision Usefulness: minimum 3 where applicable

For higher-risk workflows:

  • Factual Accuracy: minimum 4
  • Source Quality: minimum 4
  • Safety And Compliance: minimum 4
  • Traceability: minimum 4
  • Human Review: required

For client-facing workflows:

  • Factual Accuracy: minimum 4
  • Answer Relevancy: minimum 4
  • Structure Compliance: pass
  • Safety And Compliance: minimum 4
  • Confidence Calibration: minimum 4
  • Human Review: required unless explicitly approved otherwise
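
The numeric minimums above can be expressed as data and checked mechanically. The sketch below is one hedged way to do that; the tier and key names are hypothetical. Structure Compliance (pass/fail), Human Review, and the "where applicable" Decision Usefulness minimum are not numeric thresholds that always apply, so they are left out and assumed to be handled separately.

```typescript
type Category =
  | "factualAccuracy"
  | "answerRelevancy"
  | "sourceQuality"
  | "safetyAndCompliance"
  | "confidenceCalibration"
  | "traceability";

type RiskTier = "internal" | "higherRisk" | "clientFacing";

// Minimum scores copied from the lists above; key names are illustrative.
const MINIMUMS: Record<RiskTier, Partial<Record<Category, number>>> = {
  internal: { factualAccuracy: 3, answerRelevancy: 3, traceability: 3 },
  higherRisk: { factualAccuracy: 4, sourceQuality: 4, safetyAndCompliance: 4, traceability: 4 },
  clientFacing: { factualAccuracy: 4, answerRelevancy: 4, safetyAndCompliance: 4, confidenceCalibration: 4 },
};

// Returns the categories that fall below the tier's minimum.
// A category that has a minimum but no score counts as a shortfall.
function findShortfalls(tier: RiskTier, scores: Partial<Record<Category, number>>): Category[] {
  const minimums = MINIMUMS[tier];
  return (Object.keys(minimums) as Category[]).filter((category) => {
    const minimum = minimums[category];
    const score = scores[category];
    return minimum !== undefined && (score === undefined || score < minimum);
  });
}
```

An empty result means the numeric minimums are met; it does not mean the output has passed its hard checks or its human review requirement.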

Automatic Review Triggers

Human review should be triggered when:

  • factual accuracy score is below 4 on a high-risk workflow
  • answer relevancy is below 3
  • source quality is below 3
  • freshness is unknown on a time-sensitive topic
  • safety score is below 4
  • traceability is below 3
  • confidence is high but evidence is weak
  • sources conflict
  • tool calls failed
  • database writes failed
  • cost was excessive
  • the output affects budget, campaigns, compliance, or client-facing decisions
  • the AI Employee repeats a previous failure
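
Several of these triggers can be evaluated as simple rules over a run summary. The sketch below covers only the triggers that reduce to a score or a flag; judgement-based triggers such as excessive cost or repeated failures are omitted. All field names are hypothetical.

```typescript
interface RunSignals {
  highRisk: boolean;
  timeSensitive: boolean;
  freshnessKnown: boolean;
  factualAccuracy: number;
  answerRelevancy: number;
  sourceQuality: number;
  safety: number;
  traceability: number;
  sourcesConflict: boolean;
  toolCallsFailed: boolean;
  databaseWritesFailed: boolean;
}

// Returns the reasons a run should be sent to human review.
// An empty array means none of the rule-based triggers fired.
function humanReviewReasons(run: RunSignals): string[] {
  const rules: Array<[boolean, string]> = [
    [run.highRisk && run.factualAccuracy < 4, "factual accuracy below 4 on a high-risk workflow"],
    [run.answerRelevancy < 3, "answer relevancy below 3"],
    [run.sourceQuality < 3, "source quality below 3"],
    [run.timeSensitive && !run.freshnessKnown, "freshness unknown on a time-sensitive topic"],
    [run.safety < 4, "safety score below 4"],
    [run.traceability < 3, "traceability below 3"],
    [run.sourcesConflict, "sources conflict"],
    [run.toolCallsFailed, "tool calls failed"],
    [run.databaseWritesFailed, "database writes failed"],
  ];
  return rules.filter(([triggered]) => triggered).map(([, reason]) => reason);
}
```

Returning reasons instead of a single boolean keeps the trigger visible to the reviewer and to later Kaizen analysis.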

Failure Conditions

An AI Employee output should be marked failed if:

  • it does not answer the task
  • it fabricates facts
  • it uses no evidence when evidence is required
  • it cites weak sources as strong proof
  • it ignores source freshness
  • it gives a decision without enough support
  • it produces invalid structured output
  • it omits required fields
  • it fails to route correctly
  • it hides uncertainty
  • it overstates confidence
  • it creates compliance risk
  • it cannot be traced
  • it cannot be linked to a task, thread, source, or workflow
  • it repeats a known regression failure

Evaluation Dataset Standard

MWMS evaluation datasets must reflect real MWMS work.

Datasets should include:

  • real user requests
  • real newsletter examples
  • real offer evaluation cases
  • real source inspection examples
  • real ad compliance examples
  • real campaign decision examples
  • real content research examples
  • real system failures
  • real edge cases
  • realistic synthetic cases only where real data is not available

Generic examples are not enough.

The dataset should test the AI Employee against the work it is actually expected to perform.


Dataset Difficulty Rule

Good eval datasets should include difficult cases.

Examples:

  • ambiguous requests
  • stale information traps
  • conflicting sources
  • weak evidence
  • missing source dates
  • multi-step research tasks
  • compliance-sensitive claims
  • source-quality edge cases
  • expensive workflow traps
  • irrelevant but factual answer traps
  • wrong Brain routing traps
  • output-format traps
  • confidence calibration traps

The goal is not to make the AI Employee look good.

The goal is to expose failure before the system depends on it.


Dataset Types

MWMS should organise AI Employee eval datasets into three main groups.


1. Dev Dataset

The Dev Dataset is used while improving an AI Employee.

Purpose:

  • prompt improvement
  • tool improvement
  • source rule improvement
  • workflow refinement
  • experimentation
  • debugging

Examples:

  • new offer evaluation examples
  • new newsletter extraction examples
  • new Deep Search research tasks
  • new source quality cases
  • new output format tests

Dev datasets can change often.


2. CI Dataset

The CI Dataset is a smaller must-pass test set.

Purpose:

  • catch obvious breakages
  • protect critical output structure
  • ensure basic behaviour remains stable
  • validate required fields
  • test essential safety rules

Examples:

  • valid JSON test
  • source required test
  • confidence required test
  • decision required test
  • no banned claims test
  • correct Brain routing test
  • required metadata test

CI datasets should stay small enough to run frequently.


3. Regression Dataset

The Regression Dataset contains previous failures that must not return.

Purpose:

  • prevent old mistakes from coming back
  • preserve lessons learned
  • convert failures into system protection
  • support Kaizen improvement

Examples:

  • AI hallucinated affiliate metrics
  • AI approved a weak offer
  • AI ignored source freshness
  • AI failed to cite sources
  • AI routed a task to the wrong Brain
  • AI missed a compliance risk
  • AI produced invalid JSON
  • AI gave high confidence with weak evidence
  • AI used outdated policy information
  • AI created an unsupported recommendation

Regression datasets are one of the most valuable MWMS assets.

Every serious failure should be considered for regression capture.
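
Regression capture can be as small as turning a failed run into a stored case. The sketch below assumes a failed run carries a task ID, the AI Employee name, the original input, and a failure reason; these shapes and the "regression-" ID prefix are hypothetical.

```typescript
interface FailedRun {
  taskId: string;
  aiEmployee: string;
  input: string;
  failureReason: string;
}

interface RegressionCase {
  id: string;
  aiEmployee: string;
  input: string;
  failureReason: string;
  capturedAt: string; // ISO 8601 timestamp
}

// Converts a failed run into a regression case that can be replayed later.
function toRegressionCase(run: FailedRun, capturedAt: Date = new Date()): RegressionCase {
  return {
    id: `regression-${run.taskId}`,
    aiEmployee: run.aiEmployee,
    input: run.input,
    failureReason: run.failureReason,
    capturedAt: capturedAt.toISOString(),
  };
}

// Returns a new dataset with the case added, unless a case with the same id
// is already present, so the same failure is not captured twice.
function addRegressionCase(dataset: readonly RegressionCase[], candidate: RegressionCase): RegressionCase[] {
  return dataset.some((existing) => existing.id === candidate.id) ? [...dataset] : [...dataset, candidate];
}
```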


Evaluation Data Flywheel

MWMS should use an evaluation data flywheel.

The loop is:

AI Employee performs work
→ traces and metadata are captured
→ human or system feedback identifies quality issues
→ failures are added to eval datasets
→ evals expose prompt, tool, source, or workflow weakness
→ system is improved
→ AI Employee performs better
→ new traces create more learning

This aligns directly with the MWMS Kaizen loop.

Reflect
→ Reduce
→ Refine
→ Record

The more MWMS uses its AI Employees, the stronger the evaluation system becomes.


Factuality Evaluation Standard

Factuality evaluation checks whether the output is true and supported.

Suggested judgement categories:

Category | Meaning
Fully Supported | All key claims are supported by reliable evidence
Mostly Supported | Main claims are supported, minor gaps exist
Partially Supported | Some claims are supported, but important gaps exist
Weakly Supported | Evidence is thin, indirect, or questionable
Unsupported | Key claims lack evidence
Contradicted | Output conflicts with available evidence

Factuality should consider:

  • source reliability
  • source freshness
  • claim support
  • unsupported assumptions
  • conflicting evidence
  • hallucinated specifics
  • overconfident wording

Answer Relevancy Evaluation Standard

Answer relevancy checks whether the output answered the actual task.

Suggested judgement categories:

Category | Meaning
Directly Relevant | Fully answers the user’s task
Mostly Relevant | Answers the main task with minor drift
Partially Relevant | Some useful content but misses important parts
Weakly Relevant | Mostly background or loosely related
Irrelevant | Does not answer the task
Misaligned | Answers the wrong question or workflow

Relevancy should consider:

  • user intent
  • Brain context
  • required output type
  • decision need
  • operational usefulness
  • whether the answer moves the workflow forward

An answer can be factual and still fail relevancy.


Source Quality Evaluation Standard

Source quality checks whether the evidence was suitable.

Suggested judgement categories:

Category | Meaning
Strong Sources | Official, current, credible, and directly relevant
Good Sources | Mostly credible with minor limitations
Acceptable Sources | Usable but requires caution
Weak Sources | Thin, biased, outdated, or indirect
Insufficient Sources | Not enough evidence
No Sources | Source requirement failed

Source quality should consider:

  • source type
  • trust rating
  • freshness
  • relevance
  • bias
  • commercial motive
  • corroboration
  • whether source was inspected directly

Freshness Evaluation Standard

Freshness checks whether the information is current enough.

Suggested judgement categories:

Category | Meaning
Current | Suitable for time-sensitive use
Recent Enough | Acceptable for the task
Possibly Outdated | Use with caution
Outdated | Should not be used for current decisions
Unknown | Date could not be confirmed
Not Applicable | Topic is evergreen or historical

Freshness matters especially for:

  • policies
  • laws
  • platform rules
  • affiliate payouts
  • product pricing
  • tool features
  • campaign performance
  • AI model capabilities
  • current events
  • market trends

Structure Compliance Evaluation Standard

Structure compliance checks whether the output followed MWMS rules.

Suggested checks:

  • required title present
  • required sections present
  • required metadata present
  • correct Brain routing included
  • decision included
  • next action included
  • confidence included
  • risk level included
  • valid JSON where required
  • no prohibited title format
  • follows full page output standard where requested
  • follows MCR page structure where required

Structure compliance should use deterministic checks wherever possible.


Decision Usefulness Evaluation Standard

Decision usefulness checks whether the output helps MWMS take action.

Suggested judgement categories:

Category | Meaning
Decision Ready | Clear recommendation and next action
Useful | Helps decision-making but needs minor review
Partially Useful | Contains useful information but lacks decision clarity
Weak | Requires major human interpretation
Not Useful | Does not support action
Risky | Could lead to poor decision

Decision usefulness is especially important for:

  • HeadOffice Intelligence
  • Affiliate Brain
  • Ads Brain
  • Finance Brain
  • Experimentation Brain
  • Strategy Brain

Safety And Compliance Evaluation Standard

Safety and compliance checks whether the output creates risk.

Suggested judgement categories:

Category | Meaning
Safe | No obvious compliance or reputational risk
Mostly Safe | Minor caution needed
Review Needed | Could create risk if used directly
Risky | Contains questionable claims or advice
Unsafe | Should not be used
Escalate | Requires human or specialist review

This applies especially to:

  • ad copy
  • health claims
  • financial claims
  • legal claims
  • income claims
  • product claims
  • client-facing outputs
  • compliance-sensitive campaigns

Confidence Calibration Evaluation Standard

Confidence calibration checks whether the stated confidence matches the evidence.

Suggested judgement categories:

Category | Meaning
Well Calibrated | Confidence matches evidence
Slightly High | Confidence is a little stronger than evidence supports
Slightly Low | Confidence is too cautious but safe
Overconfident | Confidence is too high for evidence
Underconfident | Confidence is too low for strong evidence
Misleading | Confidence could cause bad decisions

Confidence must be reduced when:

  • sources are weak
  • sources are outdated
  • source freshness is unknown
  • tools failed
  • database writes failed
  • evidence conflicts
  • output is incomplete
  • assumptions are required

AI Employee Scorecard Profiles

Each AI Employee should eventually have a defined scorecard profile.

A profile should include:

  • AI Employee name
  • owning Brain
  • workflow types
  • required scorecard categories
  • required deterministic checks
  • required judge evals
  • minimum passing thresholds
  • human review triggers
  • regression dataset rules
  • Kaizen routing rules

Example:

AI Employee | Required Scorecard Focus
Newsletter Intelligence Extractor | relevance, routing, structure, signal quality
Research Brain Source Analyst | factuality, source quality, freshness
Affiliate Offer Evaluator | decision usefulness, source quality, risk, factuality
Ads Compliance Reviewer | safety, policy accuracy, claim risk
Content Brain Research Assistant | relevance, source quality, usefulness
HeadOffice Decision Assistant | decision usefulness, traceability, confidence
Dev Console Helper | relevance, technical accuracy, safety
Client Facing AI Assistant | factuality, safety, tone, confidence, human review
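
A scorecard profile could be represented as a typed record with one field per item in the list above. The sketch below is an assumption about shape only: the property names, the threshold values, and the example check and trigger strings are placeholders, not values set by this standard. Only the AI Employee name and its scorecard focus come from the example above.

```typescript
interface ScorecardProfile {
  aiEmployeeName: string;
  owningBrain: string;
  workflowTypes: string[];
  requiredCategories: string[];
  deterministicChecks: string[];
  judgeEvals: string[];
  minimumThresholds: Record<string, number>;
  humanReviewTriggers: string[];
  regressionDatasetRule: string;
  kaizenRoutingRule: string;
}

// Illustrative profile; thresholds and rule text are placeholders.
const researchSourceAnalyst: ScorecardProfile = {
  aiEmployeeName: "Research Brain Source Analyst",
  owningBrain: "Research Brain",
  workflowTypes: ["source analysis"],
  requiredCategories: ["factuality", "source quality", "freshness"],
  deterministicChecks: ["sources present", "confidence present"],
  judgeEvals: ["factuality", "source quality"],
  minimumThresholds: { factuality: 4, "source quality": 4 },
  humanReviewTriggers: ["sources conflict", "freshness unknown"],
  regressionDatasetRule: "capture every serious failure",
  kaizenRoutingRule: "route improvement notes to the owning Brain",
};

// Sanity check: lists thresholds set for categories the profile does not require.
function thresholdsOutsideProfile(profile: ScorecardProfile): string[] {
  return Object.keys(profile.minimumThresholds).filter(
    (category) => profile.requiredCategories.indexOf(category) === -1,
  );
}
```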

Workflow Scorecard Profiles

Different workflows require different scorecard depth.

Workflow Type | Required Evaluation Depth
Simple internal chat reply | light
Dev Console support | medium
Newsletter extraction | medium
Newsletter routing decision | medium to high
Deep Search research | high
Affiliate offer evaluation | high
Ads compliance review | high
Campaign recommendation | high
Budget recommendation | high
Client-facing output | very high

Promotion And Restriction Rules

AI Employees can earn more autonomy only when evaluation results support it.

Promotion Conditions

An AI Employee may be considered for more autonomy when:

  • repeated eval scores are strong
  • regression failures are low
  • human review approval rate is high
  • trace quality is strong
  • cost is controlled
  • source quality is consistent
  • safety issues are rare
  • confidence is well calibrated

Restriction Conditions

An AI Employee should be restricted when:

  • repeated factuality failures occur
  • regression failures return
  • structure compliance fails often
  • source quality is weak
  • confidence is poorly calibrated
  • cost becomes excessive
  • human review rejection rate is high
  • compliance risks appear
  • traceability is incomplete

Reporting Requirements

HeadOffice should eventually be able to review AI Employee evaluation performance.

Suggested reporting fields:

  • AI Employee name
  • Brain
  • number of evaluated runs
  • average factuality score
  • average relevancy score
  • average source quality score
  • average safety score
  • average decision usefulness score
  • pass rate
  • fail rate
  • human review rate
  • regression failure count
  • cost per successful output
  • most common failure reason
  • Kaizen actions created
  • promotion or restriction recommendation
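
Several of these fields are simple aggregates over evaluated runs. The sketch below computes a subset of them under stated assumptions: cost is stored in whole cents, "cost per successful output" is taken to mean total cost across all runs divided by the number of passing runs, and all type and field names are hypothetical.

```typescript
interface EvaluatedRun {
  passed: boolean;
  humanReviewed: boolean;
  factuality: number; // 0-5
  costCents: number;
  failureReason?: string;
}

interface EvaluationReport {
  evaluatedRuns: number;
  passRate: number;
  failRate: number;
  humanReviewRate: number;
  averageFactuality: number;
  costPerSuccessfulOutputCents: number | null; // null when nothing passed
  mostCommonFailureReason: string | null;
}

// Aggregates runs into a report; returns null when there is nothing to report.
function buildReport(runs: EvaluatedRun[]): EvaluationReport | null {
  if (runs.length === 0) return null;
  const total = runs.length;
  const passedCount = runs.filter((run) => run.passed).length;
  const totalCost = runs.reduce((sum, run) => sum + run.costCents, 0);

  // Count failure reasons and keep the most frequent one.
  const reasonCounts: Record<string, number> = {};
  for (const run of runs) {
    if (run.failureReason !== undefined) {
      reasonCounts[run.failureReason] = (reasonCounts[run.failureReason] ?? 0) + 1;
    }
  }
  let mostCommonFailureReason: string | null = null;
  let highestCount = 0;
  for (const reason of Object.keys(reasonCounts)) {
    const count = reasonCounts[reason] ?? 0;
    if (count > highestCount) {
      mostCommonFailureReason = reason;
      highestCount = count;
    }
  }

  return {
    evaluatedRuns: total,
    passRate: passedCount / total,
    failRate: (total - passedCount) / total,
    humanReviewRate: runs.filter((run) => run.humanReviewed).length / total,
    averageFactuality: runs.reduce((sum, run) => sum + run.factuality, 0) / total,
    costPerSuccessfulOutputCents: passedCount === 0 ? null : totalCost / passedCount,
    mostCommonFailureReason,
  };
}
```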

Relationship To Observability Metadata

This standard depends on the MWMS AI Observability Metadata Standard.

Evaluation becomes stronger when traces include:

  • Brain
  • AI Employee
  • task ID
  • workflow type
  • model
  • tools
  • sources
  • database records
  • cost
  • latency
  • confidence
  • review status
  • decision outcome
  • Kaizen notes

Without metadata, evaluation is limited.

Without evaluation, metadata becomes passive logging.

Together, metadata and scorecards create accountable AI Employees.


Relationship To Deep Search Quality

This standard supports the MWMS Deep Search Quality And Observability Framework.

Deep Search AI Employees should be scored especially on:

  • factuality
  • source quality
  • freshness
  • answer relevancy
  • source inspection
  • traceability
  • cost efficiency
  • decision usefulness

Deep Search should not be trusted if it cannot show evidence, source freshness, and evaluation results.


Relationship To Experimentation Brain

This standard supports Experimentation Brain by turning AI Employee improvement into measurable tests.

Evaluation scorecards create:

  • baseline performance
  • test conditions
  • pass/fail thresholds
  • prompt experiment evidence
  • tool experiment evidence
  • model comparison evidence
  • regression protection
  • improvement tracking

AI Employee changes should eventually be treated like experiments where possible.


Relationship To Kaizen

This standard supports the MWMS Kaizen loop.

Each evaluation should help identify:

  • what worked
  • what failed
  • what should be reduced
  • what should be refined
  • what should be recorded
  • what should become a regression test
  • what should become a prompt improvement
  • what should become a tool improvement
  • what should become a workflow improvement

Evaluation is one of the main ways MWMS turns AI failure into system growth.


Minimum Starting Implementation

MWMS does not need to implement the full scorecard system immediately.

The first practical implementation should include:

  • required field checks
  • source-present checks
  • decision-present checks
  • confidence-present checks
  • valid structure checks
  • basic factuality judge
  • basic answer relevancy judge
  • human review status
  • failure reason
  • Kaizen note
  • regression case capture

This is enough to start improving AI Employee quality without slowing development.


Future Enhancements

Future enhancements may include:

  • MWMS AI Employee Eval Registry
  • MWMS Eval Dataset Registry
  • MWMS Regression Failure Library
  • MWMS AI Employee Performance Dashboard
  • MWMS HeadOffice Evaluation Dashboard
  • MWMS AI Employee Promotion And Restriction Standard
  • MWMS Model Comparison Evaluation Standard
  • MWMS Prompt Optimisation Evaluation Protocol
  • MWMS Deep Search Source Record Standard
  • MWMS Client Facing AI Quality Assurance Standard

Drift Protection

This standard prevents the following drift:

  • judging AI by vibes
  • trusting confident outputs without evidence
  • improving prompts randomly
  • repeating old AI failures
  • scaling untested AI Employees
  • using only easy test cases
  • relying only on human memory
  • confusing factuality with relevancy
  • confusing structure compliance with quality
  • ignoring cost and latency
  • ignoring regression failures
  • allowing AI Employees to gain autonomy without proof
  • losing Kaizen learning from failures

If an AI Employee cannot pass the required scorecard for its role, it should not be trusted with higher responsibility.


Architectural Intent

The architectural intent of this standard is to turn MWMS AI Employees into measurable workers.

A human worker is judged by output quality, reliability, accuracy, cost, improvement, and business usefulness.

AI Employees should be judged the same way.

MWMS should not scale AI Employees because they sound smart.

MWMS should scale AI Employees because they have been evaluated, reviewed, improved, and proven useful.

This standard creates the evaluation layer required for a governable AI workforce.


Change Log

v1.0 Initial Draft

Created the MWMS AI Employee Evaluation Scorecard Standard based on absorbed insights from Matt Pocock AIhero Build DeepSearch In TypeScript.

Integrated principles from course blocks covering:

  • making AI systems testable
  • deterministic evaluations
  • LLM-as-a-judge evaluations
  • factuality scoring
  • answer relevancy scoring
  • dataset creation
  • dev, CI, and regression dataset organisation
  • hard case evaluation
  • data flywheel improvement
  • prompt optimisation through eval results
  • AI Employee confidence calibration
  • regression protection
  • Kaizen improvement routing

Established this standard as the MWMS governance page for evaluating AI Employee quality, reliability, safety, usefulness, and readiness for increased autonomy.