System: MWMS
Document Type: Standard
Status: Draft For MCR
Authority: HeadOffice
Applies To: All MWMS Brains, AI Employees, Deep Search Workflows, Research Workflows, Newsletter Intelligence, Affiliate Evaluation, Ads Review, Content Workflows, Experimentation Brain, Future Client Facing AI Systems
Primary Location: MCR
Future Operational Destination: mwmsbrain.site, mwmsheadofficebrain.site, AI Employee Dashboards, HeadOffice Evaluation Dashboards
Parent Page: HeadOffice
Source Of Truth: MCR
Related Frameworks: MWMS Deep Search Quality And Observability Framework, MWMS AI Observability Metadata Standard
Course Source: Matt Pocock AIhero Build DeepSearch In TypeScript
Absorption Status: Approved For Integration
Purpose
The purpose of this standard is to define how MWMS evaluates AI Employees using structured, repeatable, and business-aligned scorecards.
MWMS must not judge AI Employees only by whether an answer sounds confident, polished, or impressive.
An AI Employee must be evaluated against defined criteria, including:
- factual accuracy
- answer relevancy
- source quality
- freshness
- structure compliance
- decision usefulness
- safety
- cost efficiency
- speed
- traceability
- regression protection
- human review usefulness
This standard ensures MWMS can measure whether an AI Employee is trustworthy, improving, drifting, failing, or ready for greater autonomy.
Scope
This standard applies to all AI Employees and AI-assisted workflows across MWMS.
This includes:
- HeadOffice AI decision support
- Research Brain source analysis
- Affiliate Brain offer evaluation
- Ads Brain compliance review
- Content Brain research and drafting
- Data Brain validation
- Experimentation Brain test analysis
- Finance Brain budget and cost analysis
- Newsletter Intelligence extraction
- Brain Room AI replies
- Dev Console assistance
- Deep Search agents
- future client-facing AI systems
- future AIBS business assistants
- future AI dashboards and evaluation systems
This standard does not define exact code, Evalite implementation, Langfuse setup, TypeScript architecture, or specific scoring library syntax.
It defines the MWMS evaluation governance standard.
Core Rule
Every serious AI Employee must be evaluated against clear success criteria before MWMS relies on it for important decisions.
If an AI Employee cannot be tested, scored, reviewed, and improved, it should not be trusted with meaningful business responsibility.
The MWMS rule is:
If we cannot evaluate the AI Employee, we cannot trust the AI Employee.
Definition Of An AI Employee Evaluation Scorecard
An AI Employee Evaluation Scorecard is a structured assessment system that measures the quality, reliability, usefulness, and safety of an AI Employee’s output.
A scorecard may include:
- deterministic checks
- LLM-as-a-judge evaluations
- human review
- source checks
- metadata checks
- cost checks
- regression checks
- confidence checks
- business usefulness checks
The scorecard must evaluate the AI Employee’s actual work, not just its final wording.
Why MWMS Needs Evaluation Scorecards
MWMS is building a system of AI Employees, not casual chatbots.
AI Employees may support:
- research
- campaign decisions
- offer evaluation
- compliance review
- content strategy
- data interpretation
- newsletter intelligence
- client reporting
- business decision-making
Poor AI output can cause:
- wasted time
- bad offer choices
- weak campaigns
- compliance risk
- false confidence
- poor client outcomes
- cost blowouts
- wrong routing
- bad data records
- broken automation
- loss of trust in the MWMS system
Evaluation scorecards prevent MWMS from scaling weak AI behaviour.
Evaluation Philosophy
MWMS evaluation should follow five principles.
1. Evaluate Real Work
AI Employees should be tested against tasks that resemble real MWMS workflows.
Generic examples are not enough.
2. Separate Structure From Judgement
Some quality checks are deterministic.
Some require judgement.
MWMS should use both.
3. Score With Defined Categories
Do not rely only on vague 1 to 10 ratings.
Use clear judgement categories first, then convert them into scores where needed.
4. Preserve Failure Cases
Failures should become regression tests so the same failure does not return later.
5. Improve Through Kaizen
Evaluation exists to improve the system, not only to criticise outputs.
Evaluation Types
MWMS should use three main evaluation types.
1. Deterministic Evaluations
Deterministic evaluations are simple, repeatable checks.
They are similar to unit tests.
They are useful when the expected behaviour is clear.
Examples:
- Did the output include required fields?
- Did the output include sources?
- Did the output include a decision?
- Did the output include a confidence score?
- Did the output include a risk level?
- Did the output include a next action?
- Did the output use valid JSON?
- Did the output follow the required format?
- Did the output include a date awareness note where needed?
- Did the output avoid banned wording?
- Did the output route to the correct Brain?
- Did the output include a Kaizen note?
Deterministic Evaluation Rule
Use deterministic evaluations first wherever possible.
They are cheaper, faster, and more consistent than LLM-based judgement.
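The following TypeScript sketch is illustrative only and is not part of this standard. It shows how a few of the deterministic checks above (valid JSON, required fields, sources present) might be expressed. The field names `decision`, `confidence`, `riskLevel`, `nextAction`, and `sources` are assumptions made for the example, not MWMS-defined keys.

```typescript
// Hypothetical field names for illustration; MWMS has not defined these keys.
const REQUIRED_FIELDS: string[] = ["decision", "confidence", "riskLevel", "nextAction", "sources"];

interface DeterministicResult {
  validJson: boolean;
  missingFields: string[];
  passed: boolean;
}

// Runs cheap, repeatable checks on the raw output before any judge is called.
function runDeterministicChecks(rawOutput: string): DeterministicResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawOutput);
  } catch {
    // Not valid JSON: every required field is treated as missing.
    return { validJson: false, missingFields: REQUIRED_FIELDS.slice(), passed: false };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    // Valid JSON, but not an object that could hold the required fields.
    return { validJson: true, missingFields: REQUIRED_FIELDS.slice(), passed: false };
  }
  const record = parsed as Record<string, unknown>;
  const missingFields = REQUIRED_FIELDS.filter((field) => {
    const value = record[field];
    if (value === undefined || value === null || value === "") return true;
    // An empty array (for example, no sources) counts as missing.
    return Array.isArray(value) && value.length === 0;
  });
  return { validJson: true, missingFields, passed: missingFields.length === 0 };
}
```

Because the function is pure and needs no model call, a check like this could run on every output at negligible cost.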
2. LLM As Judge Evaluations
LLM-as-a-judge evaluations use another model to judge qualities that cannot be easily checked with simple rules.
Examples:
- Is the answer factually accurate?
- Is the answer relevant?
- Did the answer overclaim?
- Did the answer use the evidence correctly?
- Is the recommendation justified?
- Is the answer decision-ready?
- Is the confidence level reasonable?
- Is the tone appropriate?
- Is the output safe?
- Does the answer match the user’s real intent?
LLM Judge Rule
LLM judges should be used for judgement-heavy criteria, not simple required-field checks.
Do not waste judge calls on things deterministic tests can handle.
3. Human Review Evaluations
Human review is required for higher-risk decisions.
Examples:
- campaign launch decisions
- compliance-sensitive recommendations
- budget decisions
- offer approval
- client-facing recommendations
- legal, policy, medical, or financial risk areas
- low-confidence AI outputs
- weak-source outputs
- failed or partial traces
- disputed outputs
- repeated AI Employee failures
Human Review Rule
LLM-as-a-judge can assist review, but it does not replace human oversight for high-risk MWMS decisions.
Score Categories Before Numeric Scores
MWMS should avoid vague judge prompts like:
Rate this answer from 1 to 10.
Instead, the judge should choose from structured categories.
Example factuality categories:
| Category | Meaning | Suggested Score |
|---|---|---|
| Fully Supported | All important claims are supported by evidence | 5 |
| Mostly Supported | Main claims are supported, minor issues exist | 4 |
| Partially Supported | Some claims are supported, but gaps exist | 3 |
| Weakly Supported | Evidence is thin or unclear | 2 |
| Unsupported | Major claims lack evidence | 1 |
| Incorrect | Answer contradicts evidence | 0 |
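Named categories with explicit meanings give the judge a fixed rubric, which makes scoring more consistent and easier to audit than a free-form rating.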
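As a rough sketch, the category table above could be turned into a judge prompt and a score lookup along these lines. The category labels and scores come from the table; the function names and prompt wording are assumptions for illustration, and no particular model client or eval library is implied.

```typescript
// Labels and scores copied from the example factuality table above.
const FACTUALITY_SCORES: Record<string, number> = {
  "Fully Supported": 5,
  "Mostly Supported": 4,
  "Partially Supported": 3,
  "Weakly Supported": 2,
  "Unsupported": 1,
  "Incorrect": 0,
};

// Builds a judge prompt that asks for exactly one category label
// instead of a free-form 1 to 10 rating. The wording is hypothetical.
function buildFactualityJudgePrompt(answer: string, evidence: string): string {
  const labels = Object.keys(FACTUALITY_SCORES).join(", ");
  return [
    "Judge whether the answer is supported by the evidence.",
    `Reply with exactly one of: ${labels}.`,
    `Evidence:\n${evidence}`,
    `Answer:\n${answer}`,
  ].join("\n\n");
}

// Converts the judge's label into a numeric score. Unknown labels return
// null so they can be sent to review rather than silently scored.
function scoreFactualityLabel(label: string): number | null {
  const score = FACTUALITY_SCORES[label.trim()];
  return typeof score === "number" ? score : null;
}
```

Returning `null` for an unrecognised label is a design choice for this sketch: a judge reply that falls outside the rubric is treated as an evaluation failure, not as a low score.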
Core Scorecard Categories
Every serious AI Employee should eventually be evaluated across these categories.
1. Factual Accuracy
Measures whether the output is true and evidence-supported.
Questions:
- Are the claims correct?
- Are claims supported by sources or records?
- Did the AI avoid hallucination?
- Did it separate facts from assumptions?
- Did it handle uncertainty honestly?
2. Answer Relevancy
Measures whether the output answered the actual task.
Questions:
- Did the answer address the user’s question?
- Did it stay on task?
- Did it avoid irrelevant filler?
- Did it solve the operational problem?
- Did it produce the expected type of output?
3. Source Quality
Measures whether the sources used were suitable.
Questions:
- Were sources official, credible, or appropriate?
- Were weak sources flagged?
- Were commercial or biased sources handled carefully?
- Were conflicting sources identified?
- Were sources actually inspected where needed?
4. Freshness
Measures whether the information was current enough.
Questions:
- Did the AI understand whether the task was time-sensitive?
- Were current sources used where needed?
- Were outdated sources flagged?
- Was the current date considered?
- Was uncertainty stated when freshness was unknown?
5. Structure Compliance
Measures whether the output followed the required MWMS format.
Questions:
- Were required sections included?
- Were required fields present?
- Was output valid JSON where needed?
- Was the correct page/output standard followed?
- Was the required decision or next action included?
6. Decision Usefulness
Measures whether the output helps MWMS act.
Questions:
- Did it produce a clear decision?
- Did it help move the workflow forward?
- Did it identify risk?
- Did it identify opportunity?
- Did it give a useful next step?
- Would HeadOffice or the relevant Brain actually use it?
7. Safety And Compliance
Measures whether the output avoids business, policy, legal, platform, or reputational risk.
Questions:
- Did it avoid unsafe claims?
- Did it avoid prohibited or risky recommendations?
- Did it flag uncertainty?
- Did it avoid overpromising?
- Did it respect platform or compliance boundaries?
8. Confidence Calibration
Measures whether the confidence level matches the evidence.
Questions:
- Was confidence too high?
- Was confidence too low?
- Did weak evidence reduce confidence?
- Did source conflict reduce confidence?
- Did failed tools reduce confidence?
- Was human review triggered when confidence was low?
9. Cost Efficiency
Measures whether the output was worth the cost.
Questions:
- Did the workflow use too many model calls?
- Did it inspect too many sources?
- Did it retry unnecessarily?
- Was the cost reasonable for the task value?
- Did failed outputs waste budget?
10. Speed
Measures whether the workflow was fast enough for the use case.
Questions:
- Was latency acceptable?
- Were slow tools identified?
- Were database delays visible?
- Was the crawler too slow?
- Did the response time match the task priority?
11. Traceability
Measures whether the workflow can be reviewed after the fact.
Questions:
- Was the task ID captured?
- Was the Brain captured?
- Was the AI Employee captured?
- Were model calls logged?
- Were tool calls logged?
- Were source records linked?
- Were database writes visible?
- Was the final output stored?
12. Kaizen Value
Measures whether the result produced learning for system improvement.
Questions:
- Did the output reveal a failure pattern?
- Did it create a useful improvement note?
- Did it improve future evals?
- Did it expose prompt, tool, source, or workflow weakness?
- Can the case become a regression test?
Suggested Core Scorecard
| Category | Score Range | Required For |
|---|---|---|
| Factual Accuracy | 0–5 | Research, Deep Search, Newsletter, Affiliate, Ads |
| Answer Relevancy | 0–5 | All AI Employees |
| Source Quality | 0–5 | Research, Deep Search, Affiliate, Ads, Content |
| Freshness | 0–5 | Time-sensitive workflows |
| Structure Compliance | Pass/Fail or 0–5 | All structured workflows |
| Decision Usefulness | 0–5 | HeadOffice, Affiliate, Ads, Finance, Experimentation |
| Safety And Compliance | 0–5 | Ads, Content, Client-facing, Compliance-sensitive workflows |
| Confidence Calibration | 0–5 | All serious AI Employees |
| Cost Efficiency | 0–5 | Production workflows |
| Speed | 0–5 | Production workflows |
| Traceability | 0–5 | All AI Employees |
| Kaizen Value | 0–5 | All improvement workflows |
Score Meaning
| Score | Meaning |
|---|---|
| 5 | Excellent. Meets or exceeds MWMS standard |
| 4 | Strong. Minor issues only |
| 3 | Acceptable. Usable with caution |
| 2 | Weak. Needs revision or review |
| 1 | Poor. Not suitable for use |
| 0 | Failed. Incorrect, unsafe, irrelevant, or unusable |
Pass Fail Rules
Some criteria should be hard pass/fail checks.
Examples:
- valid JSON required
- required fields present
- no prohibited claims
- source required
- decision required
- confidence score required
- task ID required
- output saved correctly
- human review triggered when required
If a hard pass/fail requirement fails, the output should not be treated as complete even if other scores are high.
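A minimal sketch of this rule, assuming hard checks are recorded as named pass/fail gates alongside the numeric scores. The type and function names are hypothetical.

```typescript
interface HardGate {
  name: string;
  passed: boolean;
}

interface CompletionVerdict {
  complete: boolean;
  failedGates: string[];
  averageScore: number;
}

// An output is complete only when every hard gate passes.
// The average score is reported but never overrides a failed gate.
function evaluateCompletion(gates: HardGate[], scores: number[]): CompletionVerdict {
  const failedGates = gates.filter((gate) => !gate.passed).map((gate) => gate.name);
  const averageScore =
    scores.length === 0 ? 0 : scores.reduce((sum, score) => sum + score, 0) / scores.length;
  return { complete: failedGates.length === 0, failedGates, averageScore };
}
```

In this sketch an output scoring 5 in every category is still incomplete if, for example, the task ID gate failed.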
Minimum Passing Standard
For normal internal MWMS AI work:
- Factual Accuracy: minimum 3
- Answer Relevancy: minimum 3
- Structure Compliance: pass
- Traceability: minimum 3
- Decision Usefulness: minimum 3 where applicable
For higher-risk workflows:
- Factual Accuracy: minimum 4
- Source Quality: minimum 4
- Safety And Compliance: minimum 4
- Traceability: minimum 4
- Human Review: required
For client-facing workflows:
- Factual Accuracy: minimum 4
- Answer Relevancy: minimum 4
- Structure Compliance: pass
- Safety And Compliance: minimum 4
- Confidence Calibration: minimum 4
- Human Review: required unless explicitly approved otherwise
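The thresholds above could be captured as configuration, for example as in the sketch below. The minimum values are taken from the lists above; the tier names, key names, and the handling of missing scores are assumptions made for this example.

```typescript
type RiskTier = "internal" | "higherRisk" | "clientFacing";
type CategoryKey =
  | "factualAccuracy" | "answerRelevancy" | "sourceQuality" | "safetyAndCompliance"
  | "traceability" | "decisionUsefulness" | "confidenceCalibration";
type CategoryScores = Partial<Record<CategoryKey, number>>;

interface TierRule {
  minimums: CategoryScores;
  optional: CategoryKey[]; // checked only when a score is supplied ("where applicable")
  structureMustPass: boolean;
  humanReviewRequired: boolean;
}

const TIER_RULES: Record<RiskTier, TierRule> = {
  internal: {
    minimums: { factualAccuracy: 3, answerRelevancy: 3, traceability: 3, decisionUsefulness: 3 },
    optional: ["decisionUsefulness"],
    structureMustPass: true,
    humanReviewRequired: false,
  },
  higherRisk: {
    minimums: { factualAccuracy: 4, sourceQuality: 4, safetyAndCompliance: 4, traceability: 4 },
    optional: [],
    structureMustPass: false, // not listed for this tier in the text above
    humanReviewRequired: true,
  },
  clientFacing: {
    minimums: { factualAccuracy: 4, answerRelevancy: 4, safetyAndCompliance: 4, confidenceCalibration: 4 },
    optional: [],
    structureMustPass: true,
    humanReviewRequired: true, // "unless explicitly approved otherwise" is not modelled here
  },
};

interface MinimumCheck {
  passed: boolean;
  shortfalls: string[];
  humanReviewRequired: boolean;
}

// Assumption: a required category with no score counts as a shortfall.
function checkMinimums(tier: RiskTier, scores: CategoryScores, structurePassed: boolean): MinimumCheck {
  const rule = TIER_RULES[tier];
  const shortfalls: string[] = [];
  if (rule.structureMustPass && !structurePassed) shortfalls.push("structureCompliance");
  const keys = Object.keys(rule.minimums) as CategoryKey[];
  for (const key of keys) {
    const minimum = rule.minimums[key];
    const score = scores[key];
    if (minimum === undefined) continue;
    if (score === undefined) {
      if (rule.optional.indexOf(key) === -1) shortfalls.push(key);
    } else if (score < minimum) {
      shortfalls.push(key);
    }
  }
  return { passed: shortfalls.length === 0, shortfalls, humanReviewRequired: rule.humanReviewRequired };
}
```

Keeping the thresholds in one table-like constant means a change to the standard would only require editing the data, not the checking logic.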
Automatic Review Triggers
Human review should be triggered when:
- factual accuracy score is below 4 on a high-risk workflow
- answer relevancy is below 3
- source quality is below 3
- freshness is unknown on a time-sensitive topic
- safety score is below 4
- traceability is below 3
- confidence is high but evidence is weak
- sources conflict
- tool calls failed
- database writes failed
- cost was excessive
- the output affects budget, campaigns, compliance, or client-facing decisions
- the AI Employee repeats a previous failure
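One way to make these triggers testable is to express each as a named condition over a run summary, as in the sketch below. The `ReviewInput` shape is hypothetical; the conditions mirror the list above.

```typescript
// Hypothetical summary of one evaluated run. Scores use the 0 to 5 scale.
interface ReviewInput {
  highRisk: boolean;
  timeSensitive: boolean;
  factualAccuracy: number;
  answerRelevancy: number;
  sourceQuality: number;
  safety: number;
  traceability: number;
  freshnessKnown: boolean;
  confidenceHighButEvidenceWeak: boolean;
  sourcesConflict: boolean;
  toolCallsFailed: boolean;
  databaseWritesFailed: boolean;
  costExcessive: boolean;
  affectsSensitiveDecision: boolean; // budget, campaigns, compliance, or client-facing
  repeatsPreviousFailure: boolean;
}

// Returns the name of every trigger that fired; an empty list means
// none of the automatic triggers applied to this run.
function collectReviewTriggers(run: ReviewInput): string[] {
  const checks: Array<[boolean, string]> = [
    [run.highRisk && run.factualAccuracy < 4, "factual accuracy below 4 on a high-risk workflow"],
    [run.answerRelevancy < 3, "answer relevancy below 3"],
    [run.sourceQuality < 3, "source quality below 3"],
    [run.timeSensitive && !run.freshnessKnown, "freshness unknown on a time-sensitive topic"],
    [run.safety < 4, "safety score below 4"],
    [run.traceability < 3, "traceability below 3"],
    [run.confidenceHighButEvidenceWeak, "confidence high but evidence weak"],
    [run.sourcesConflict, "sources conflict"],
    [run.toolCallsFailed, "tool calls failed"],
    [run.databaseWritesFailed, "database writes failed"],
    [run.costExcessive, "cost was excessive"],
    [run.affectsSensitiveDecision, "affects budget, campaigns, compliance, or client-facing decisions"],
    [run.repeatsPreviousFailure, "repeats a previous failure"],
  ];
  return checks.filter((check) => check[0]).map((check) => check[1]);
}
```

Returning the trigger names, not just a boolean, gives the human reviewer the reason the run was escalated.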
Failure Conditions
An AI Employee output should be marked failed if:
- it does not answer the task
- it fabricates facts
- it uses no evidence when evidence is required
- it cites weak sources as strong proof
- it ignores source freshness
- it gives a decision without enough support
- it produces invalid structured output
- it omits required fields
- it fails to route correctly
- it hides uncertainty
- it overstates confidence
- it creates compliance risk
- it cannot be traced
- it cannot be linked to a task, thread, source, or workflow
- it repeats a known regression failure
Evaluation Dataset Standard
MWMS evaluation datasets must reflect real MWMS work.
Datasets should include:
- real user requests
- real newsletter examples
- real offer evaluation cases
- real source inspection examples
- real ad compliance examples
- real campaign decision examples
- real content research examples
- real system failures
- real edge cases
- realistic synthetic cases only where real data is not available
Generic examples are not enough.
The dataset should test the AI Employee against the work it is actually expected to perform.
Dataset Difficulty Rule
Good eval datasets should include difficult cases.
Examples:
- ambiguous requests
- stale information traps
- conflicting sources
- weak evidence
- missing source dates
- multi-step research tasks
- compliance-sensitive claims
- source-quality edge cases
- expensive workflow traps
- factually correct but irrelevant answer traps
- wrong Brain routing traps
- output-format traps
- confidence calibration traps
The goal is not to make the AI Employee look good.
The goal is to expose failure before the system depends on it.
Dataset Types
MWMS should organise AI Employee eval datasets into three main groups.
1. Dev Dataset
The Dev Dataset is used while improving an AI Employee.
Purpose:
- prompt improvement
- tool improvement
- source rule improvement
- workflow refinement
- experimentation
- debugging
Examples:
- new offer evaluation examples
- new newsletter extraction examples
- new Deep Search research tasks
- new source quality cases
- new output format tests
Dev datasets can change often.
2. CI Dataset
The CI Dataset is a smaller must-pass test set.
Purpose:
- catch obvious breakages
- protect critical output structure
- ensure basic behaviour remains stable
- validate required fields
- test essential safety rules
Examples:
- valid JSON test
- source required test
- confidence required test
- decision required test
- no banned claims test
- correct Brain routing test
- required metadata test
CI datasets should stay small enough to run frequently.
3. Regression Dataset
The Regression Dataset contains previous failures that must not return.
Purpose:
- prevent old mistakes from coming back
- preserve lessons learned
- convert failures into system protection
- support Kaizen improvement
Examples:
- AI hallucinated affiliate metrics
- AI approved a weak offer
- AI ignored source freshness
- AI failed to cite sources
- AI routed a task to the wrong Brain
- AI missed a compliance risk
- AI produced invalid JSON
- AI gave high confidence with weak evidence
- AI used outdated policy information
- AI created an unsupported recommendation
Regression datasets are one of the most valuable MWMS assets.
Every serious failure should be considered for regression capture.
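A possible shape for regression capture is sketched below. The `EvalCase` and `FailedRun` types and the `serious` flag are assumptions; in practice the decision to capture a failure may involve human judgement, which this sketch reduces to a boolean.

```typescript
type DatasetKind = "dev" | "ci" | "regression";

// Hypothetical shape of one eval case.
interface EvalCase {
  id: string;
  dataset: DatasetKind;
  input: string;
  expectation: string;
  failureReason?: string;
}

interface FailedRun {
  taskId: string;
  input: string;
  failureReason: string;
  serious: boolean; // stand-in for the "considered for regression capture" decision
}

// Adds a regression case for a serious failure unless that task is already captured.
// Returns the original array unchanged when nothing is added.
function captureRegression(existing: EvalCase[], run: FailedRun): EvalCase[] {
  if (!run.serious) return existing;
  const id = `regression-${run.taskId}`;
  if (existing.some((evalCase) => evalCase.id === id)) return existing;
  const newCase: EvalCase = {
    id,
    dataset: "regression",
    input: run.input,
    expectation: `Must not repeat: ${run.failureReason}`,
    failureReason: run.failureReason,
  };
  return existing.concat([newCase]);
}
```

Deriving the case ID from the task ID keeps capture idempotent, so re-processing the same failed run does not create duplicate cases.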
Evaluation Data Flywheel
MWMS should use an evaluation data flywheel.
The loop is:
AI Employee performs work
→ traces and metadata are captured
→ human or system feedback identifies quality issues
→ failures are added to eval datasets
→ evals expose prompt, tool, source, or workflow weakness
→ system is improved
→ AI Employee performs better
→ new traces create more learning
This aligns directly with the MWMS Kaizen loop:
Reflect
→ Reduce
→ Refine
→ Record
The more MWMS uses its AI Employees, the stronger the evaluation system becomes.
Factuality Evaluation Standard
Factuality evaluation checks whether the output is true and supported.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Fully Supported | All key claims are supported by reliable evidence |
| Mostly Supported | Main claims are supported, minor gaps exist |
| Partially Supported | Some claims are supported, but important gaps exist |
| Weakly Supported | Evidence is thin, indirect, or questionable |
| Unsupported | Key claims lack evidence |
| Incorrect | Output contradicts available evidence |
Factuality should consider:
- source reliability
- source freshness
- claim support
- unsupported assumptions
- conflicting evidence
- hallucinated specifics
- overconfident wording
Answer Relevancy Evaluation Standard
Answer relevancy checks whether the output answered the actual task.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Directly Relevant | Fully answers the user’s task |
| Mostly Relevant | Answers the main task with minor drift |
| Partially Relevant | Some useful content but misses important parts |
| Weakly Relevant | Mostly background or loosely related |
| Irrelevant | Does not answer the task |
| Misaligned | Answers the wrong question or workflow |
Relevancy should consider:
- user intent
- Brain context
- required output type
- decision need
- operational usefulness
- whether the answer moves the workflow forward
An answer can be factual and still fail relevancy.
Source Quality Evaluation Standard
Source quality checks whether the evidence was suitable.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Strong Sources | Official, current, credible, and directly relevant |
| Good Sources | Mostly credible with minor limitations |
| Acceptable Sources | Usable but requires caution |
| Weak Sources | Thin, biased, outdated, or indirect |
| Insufficient Sources | Not enough evidence |
| No Sources | Source requirement failed |
Source quality should consider:
- source type
- trust rating
- freshness
- relevance
- bias
- commercial motive
- corroboration
- whether the source was inspected directly
Freshness Evaluation Standard
Freshness checks whether the information is current enough.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Current | Suitable for time-sensitive use |
| Recent Enough | Acceptable for the task |
| Possibly Outdated | Use with caution |
| Outdated | Should not be used for current decisions |
| Unknown | Date could not be confirmed |
| Not Applicable | Topic is evergreen or historical |
Freshness matters especially for:
- policies
- laws
- platform rules
- affiliate payouts
- product pricing
- tool features
- campaign performance
- AI model capabilities
- current events
- market trends
Structure Compliance Evaluation Standard
Structure compliance checks whether the output followed MWMS rules.
Suggested checks:
- required title present
- required sections present
- required metadata present
- correct Brain routing included
- decision included
- next action included
- confidence included
- risk level included
- valid JSON where required
- no prohibited title format
- follows full page output standard where requested
- follows MCR page structure where required
Structure compliance should use deterministic checks wherever possible.
Decision Usefulness Evaluation Standard
Decision usefulness checks whether the output helps MWMS take action.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Decision Ready | Clear recommendation and next action |
| Useful | Helps decision-making but needs minor review |
| Partially Useful | Contains useful information but lacks decision clarity |
| Weak | Requires major human interpretation |
| Not Useful | Does not support action |
| Risky | Could lead to poor decision |
Decision usefulness is especially important for:
- HeadOffice Intelligence
- Affiliate Brain
- Ads Brain
- Finance Brain
- Experimentation Brain
- Strategy Brain
Safety And Compliance Evaluation Standard
Safety and compliance checks whether the output creates risk.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Safe | No obvious compliance or reputational risk |
| Mostly Safe | Minor caution needed |
| Review Needed | Could create risk if used directly |
| Risky | Contains questionable claims or advice |
| Unsafe | Should not be used |
| Escalate | Requires human or specialist review |
This applies especially to:
- ad copy
- health claims
- financial claims
- legal claims
- income claims
- product claims
- client-facing outputs
- compliance-sensitive campaigns
Confidence Calibration Evaluation Standard
Confidence calibration checks whether the stated confidence matches the evidence.
Suggested judgement categories:
| Category | Meaning |
|---|---|
| Well Calibrated | Confidence matches evidence |
| Slightly High | Confidence is a little stronger than evidence supports |
| Slightly Low | Confidence is a little more cautious than evidence supports, but safe |
| Overconfident | Confidence is too high for evidence |
| Underconfident | Confidence is too low for strong evidence |
| Misleading | Confidence could cause bad decisions |
Confidence must be reduced when:
- sources are weak
- sources are outdated
- source freshness is unknown
- tools failed
- database writes failed
- evidence conflicts
- output is incomplete
- assumptions are required
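The reduction rule can be sketched as follows. This standard only states that confidence must be reduced; the 0 to 100 scale and the flat 10-point penalty per condition are assumptions chosen to keep the example simple, not MWMS-defined values.

```typescript
// One flag per condition in the list above.
interface EvidenceConditions {
  weakSources: boolean;
  outdatedSources: boolean;
  freshnessUnknown: boolean;
  toolsFailed: boolean;
  databaseWritesFailed: boolean;
  evidenceConflicts: boolean;
  outputIncomplete: boolean;
  assumptionsRequired: boolean;
}

// Assumption: confidence is a 0 to 100 value and each active condition
// removes a flat 10 points. Real weights would need to be agreed separately.
const PENALTY_PER_CONDITION = 10;

interface ConfidenceAdjustment {
  adjusted: number;
  reasons: string[];
}

// Lowers the stated confidence for every condition that applies and
// records which conditions caused the reduction.
function adjustConfidence(stated: number, conditions: EvidenceConditions): ConfidenceAdjustment {
  const keys = Object.keys(conditions) as Array<keyof EvidenceConditions>;
  const reasons = keys.filter((key) => conditions[key]);
  const adjusted = Math.max(0, stated - reasons.length * PENALTY_PER_CONDITION);
  return { adjusted, reasons };
}
```

Comparing the stated confidence with the adjusted value is one possible input to the calibration categories above: a large gap suggests the output was overconfident.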
AI Employee Scorecard Profiles
Each AI Employee should eventually have a defined scorecard profile.
A profile should include:
- AI Employee name
- owning Brain
- workflow types
- required scorecard categories
- required deterministic checks
- required judge evals
- minimum passing thresholds
- human review triggers
- regression dataset rules
- Kaizen routing rules
Example:
| AI Employee | Required Scorecard Focus |
|---|---|
| Newsletter Intelligence Extractor | relevance, routing, structure, signal quality |
| Research Brain Source Analyst | factuality, source quality, freshness |
| Affiliate Offer Evaluator | decision usefulness, source quality, risk, factuality |
| Ads Compliance Reviewer | safety, policy accuracy, claim risk |
| Content Brain Research Assistant | relevance, source quality, usefulness |
| HeadOffice Decision Assistant | decision usefulness, traceability, confidence |
| Dev Console Helper | relevance, technical accuracy, safety |
| Client Facing AI Assistant | factuality, safety, tone, confidence, human review |
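A scorecard profile could be represented as a typed record, for example as sketched below. The interface fields follow the profile list above. The example values for the Research Brain Source Analyst are placeholders for illustration, apart from the scorecard focus, which comes from the table.

```typescript
interface ScorecardProfile {
  employeeName: string;
  owningBrain: string;
  workflowTypes: string[];
  requiredCategories: string[];
  deterministicChecks: string[];
  judgeEvals: string[];
  minimumThresholds: Record<string, number>;
  humanReviewTriggers: string[];
  regressionDatasetRule: string;
  kaizenRoutingRule: string;
}

// Placeholder values; only the three required categories come from the table above.
const sourceAnalystProfile: ScorecardProfile = {
  employeeName: "Research Brain Source Analyst",
  owningBrain: "Research Brain",
  workflowTypes: ["source analysis"],
  requiredCategories: ["factuality", "source quality", "freshness"],
  deterministicChecks: ["sources present", "required fields present"],
  judgeEvals: ["factuality", "source quality"],
  minimumThresholds: { "factuality": 4, "source quality": 4, "freshness": 3 },
  humanReviewTriggers: ["sources conflict", "freshness unknown"],
  regressionDatasetRule: "capture every serious failure",
  kaizenRoutingRule: "route improvement notes to the owning Brain",
};

// Lists required categories that have no minimum threshold defined,
// which would make the profile impossible to enforce.
function findProfileGaps(profile: ScorecardProfile): string[] {
  return profile.requiredCategories.filter(
    (category) => typeof profile.minimumThresholds[category] !== "number"
  );
}
```

A check like `findProfileGaps` could run whenever a profile is created or edited, so that no required category is left without a passing threshold.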
Workflow Scorecard Profiles
Different workflows require different scorecard depth.
| Workflow Type | Required Evaluation Depth |
|---|---|
| Simple internal chat reply | light |
| Dev Console support | medium |
| Newsletter extraction | medium |
| Newsletter routing decision | medium to high |
| Deep Search research | high |
| Affiliate offer evaluation | high |
| Ads compliance review | high |
| Campaign recommendation | high |
| Budget recommendation | high |
| Client-facing output | very high |
Promotion And Restriction Rules
AI Employees can earn more autonomy only when evaluation results support it.
Promotion Conditions
An AI Employee may be considered for more autonomy when:
- repeated eval scores are strong
- regression failures are low
- human review approval rate is high
- trace quality is strong
- cost is controlled
- source quality is consistent
- safety issues are rare
- confidence is well calibrated
Restriction Conditions
An AI Employee should be restricted when:
- repeated factuality failures occur
- regression failures return
- structure compliance fails often
- source quality is weak
- confidence is poorly calibrated
- cost becomes excessive
- human review rejection rate is high
- compliance risks appear
- traceability is incomplete
Reporting Requirements
HeadOffice should eventually be able to review AI Employee evaluation performance.
Suggested reporting fields:
- AI Employee name
- Brain
- number of evaluated runs
- average factuality score
- average relevancy score
- average source quality score
- average safety score
- average decision usefulness score
- pass rate
- fail rate
- human review rate
- regression failure count
- cost per successful output
- most common failure reason
- Kaizen actions created
- promotion or restriction recommendation
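Several of these fields can be derived from stored run results. The sketch below computes a subset of them. The `EvaluatedRun` shape, tracking cost in whole cents, and dividing total spend (including failed runs) by the number of passing runs are all assumptions made for this example.

```typescript
// Hypothetical record of one evaluated run.
interface EvaluatedRun {
  passed: boolean;
  factuality: number; // 0 to 5
  costCents: number; // assumption: cost tracked in whole cents
  humanReviewed: boolean;
  failureReason?: string;
}

interface EmployeeEvalReport {
  evaluatedRuns: number;
  averageFactuality: number;
  passRate: number;
  failRate: number;
  humanReviewRate: number;
  costPerSuccessfulOutputCents: number | null; // null when no run passed
  mostCommonFailureReason: string | null;
}

function buildEvalReport(runs: EvaluatedRun[]): EmployeeEvalReport {
  const total = runs.length;
  const rate = (count: number): number => (total === 0 ? 0 : count / total);
  const passedCount = runs.filter((run) => run.passed).length;
  const reviewedCount = runs.filter((run) => run.humanReviewed).length;
  const totalCost = runs.reduce((sum, run) => sum + run.costCents, 0);
  const totalFactuality = runs.reduce((sum, run) => sum + run.factuality, 0);

  // Count failure reasons, then pick the most frequent one.
  const reasonCounts: Record<string, number> = {};
  for (const run of runs) {
    if (!run.passed && run.failureReason) {
      reasonCounts[run.failureReason] = (reasonCounts[run.failureReason] || 0) + 1;
    }
  }
  let topReason: string | null = null;
  let topCount = 0;
  for (const reason of Object.keys(reasonCounts)) {
    const count = reasonCounts[reason] || 0;
    if (count > topCount) {
      topReason = reason;
      topCount = count;
    }
  }

  return {
    evaluatedRuns: total,
    averageFactuality: rate(totalFactuality),
    passRate: rate(passedCount),
    failRate: rate(total - passedCount),
    humanReviewRate: rate(reviewedCount),
    // Failed runs still cost money, so total spend is divided by passing runs only.
    costPerSuccessfulOutputCents: passedCount === 0 ? null : totalCost / passedCount,
    mostCommonFailureReason: topReason,
  };
}
```

Charging failed runs against successful outputs makes wasted spend visible, which connects this report to the Cost Efficiency category.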
Relationship To Observability Metadata
This standard depends on the MWMS AI Observability Metadata Standard.
Evaluation becomes stronger when traces include:
- Brain
- AI Employee
- task ID
- workflow type
- model
- tools
- sources
- database records
- cost
- latency
- confidence
- review status
- decision outcome
- Kaizen notes
Without metadata, evaluation is limited.
Without evaluation, metadata becomes passive logging.
Together, metadata and scorecards create accountable AI Employees.
Relationship To Deep Search Quality
This standard supports the MWMS Deep Search Quality And Observability Framework.
Deep Search AI Employees should be scored especially on:
- factuality
- source quality
- freshness
- answer relevancy
- source inspection
- traceability
- cost efficiency
- decision usefulness
Deep Search should not be trusted if it cannot show evidence, source freshness, and evaluation results.
Relationship To Experimentation Brain
This standard supports Experimentation Brain by turning AI Employee improvement into measurable tests.
Evaluation scorecards create:
- baseline performance
- test conditions
- pass/fail thresholds
- prompt experiment evidence
- tool experiment evidence
- model comparison evidence
- regression protection
- improvement tracking
AI Employee changes should eventually be treated like experiments where possible.
Relationship To Kaizen
This standard supports the MWMS Kaizen loop.
Each evaluation should help identify:
- what worked
- what failed
- what should be reduced
- what should be refined
- what should be recorded
- what should become a regression test
- what should become a prompt improvement
- what should become a tool improvement
- what should become a workflow improvement
Evaluation is one of the main ways MWMS turns AI failure into system growth.
Minimum Starting Implementation
MWMS does not need to implement the full scorecard system immediately.
The first practical implementation should include:
- required field checks
- source-present checks
- decision-present checks
- confidence-present checks
- valid structure checks
- basic factuality judge
- basic answer relevancy judge
- human review status
- failure reason
- Kaizen note
- regression case capture
This is enough to start improving AI Employee quality without slowing development.
Future Enhancements
Future enhancements may include:
- MWMS AI Employee Eval Registry
- MWMS Eval Dataset Registry
- MWMS Regression Failure Library
- MWMS AI Employee Performance Dashboard
- MWMS HeadOffice Evaluation Dashboard
- MWMS AI Employee Promotion And Restriction Standard
- MWMS Model Comparison Evaluation Standard
- MWMS Prompt Optimisation Evaluation Protocol
- MWMS Deep Search Source Record Standard
- MWMS Client Facing AI Quality Assurance Standard
Drift Protection
This standard prevents the following drift:
- judging AI by vibes
- trusting confident outputs without evidence
- improving prompts randomly
- repeating old AI failures
- scaling untested AI Employees
- using only easy test cases
- relying only on human memory
- confusing factuality with relevancy
- confusing structure compliance with quality
- ignoring cost and latency
- ignoring regression failures
- allowing AI Employees to gain autonomy without proof
- losing Kaizen learning from failures
If an AI Employee cannot pass the required scorecard for its role, it should not be trusted with higher responsibility.
Architectural Intent
The architectural intent of this standard is to turn MWMS AI Employees into measurable workers.
A human worker is judged by output quality, reliability, accuracy, cost, improvement, and business usefulness.
AI Employees should be judged the same way.
MWMS should not scale AI Employees because they sound smart.
MWMS should scale AI Employees because they have been evaluated, reviewed, improved, and proven useful.
This standard creates the evaluation layer required for a governable AI workforce.
Change Log
v1.0 Initial Draft
Created the MWMS AI Employee Evaluation Scorecard Standard based on absorbed insights from Matt Pocock AIhero Build DeepSearch In TypeScript.
Integrated principles from course blocks covering:
- making AI systems testable
- deterministic evaluations
- LLM-as-a-judge evaluations
- factuality scoring
- answer relevancy scoring
- dataset creation
- dev, CI, and regression dataset organisation
- hard case evaluation
- data flywheel improvement
- prompt optimisation through eval results
- AI Employee confidence calibration
- regression protection
- Kaizen improvement routing
Established this standard as the MWMS governance page for evaluating AI Employee quality, reliability, safety, usefulness, and readiness for increased autonomy.