MWMS Data Extraction And Actor Infrastructure Framework

System: MWMS

Document Type: Operating Framework

Authority Level: MCR Source Of Truth

Status: Active

Version: v1.2

Primary Location: MCR

Future Operational Destination: Research Brain, Data Brain, AIBS Brain, Affiliate Brain, PPL Brain, Content Brain, Sales Brain, Experimentation Brain, HeadOffice Brain, Compliance Brain, Risk Brain, Automation Brain

Parent Page: Research Brain

Owner: Martyn

Developer Boundary: Do Not Touch M’s Active Build Areas Unless Specifically Assigned

Source Of Truth: MCR

Last Reviewed: 2026-06-21

Source / Origin: AI Automations by Jack Commercialisation Block, Apify Masterclass, Lead Generation Systems, Productised AIOS Service Packaging, Case Study Pattern Library, RSS Extraction, Website Crawling, YouTube Transcript Capture, Email And Meeting Intelligence, Browser Capture, RAG Intake, Research Automation, Keyword Driven Lead Discovery, Contact Extraction, Scrape Or Reject Review Gates, Prospect Enrichment, And Downstream Outreach Preparation Material

MWMS Classification: Research Brain Operating Framework / Data Extraction Standard / Actor Infrastructure Framework / Source Intake Governance / Candidate Discovery And Permitted Use Standard / Structured Intelligence Pipeline

Primary Brain: Research Brain

Supporting Brains: Data Brain, AIBS Brain, Affiliate Brain, PPL Brain, Content Brain, Sales Brain, Experimentation Brain, HeadOffice Brain, Compliance Brain, Risk Brain, Automation Brain

Related Pages: Research Brain Canon, Data Brain Canon, MWMS Search Scrape Summarise Evidence Pipeline Standard, MWMS Source Visibility And Evidence Display Standard, MWMS Deep Search Quality And Observability Framework, MWMS Research Synthesis Documentation And Distribution Framework, MWMS Outbound Lead Enrichment And Cold Outreach Governance Framework, MWMS AIOS Lead Capture And Conversion Infrastructure Framework, MWMS AI Assisted Outreach And Sales Follow Up Automation Framework, MWMS Personalised Visual Sales Asset Production And Governance Framework, MWMS Client Context Isolation And Privacy Boundary Standard, MWMS AI Automation Security And Risk Checklist, MWMS AI Tool Permission And Access Framework

Purpose

The purpose of the MWMS Data Extraction And Actor Infrastructure Framework is to define how MWMS uses structured web data extraction, actor-based automation, scraping systems, APIs, feeds, transcripts, files, browser capture, enrichment workflows, and research pipelines to support better decisions across the MWMS ecosystem.

This framework exists because Research Brain and Data Brain must become stronger than manual searching.

MWMS cannot rely only on:

random Google searches

scattered course notes

manual copying

isolated spreadsheets

unverified summaries

single-source assumptions

unowned scraping workflows

tool-specific actor configurations

The framework defines how authorised source data becomes:

structured evidence

clean records

enriched intelligence

scored opportunities

reviewable candidates

permitted downstream inputs

dashboards

reports

Brain requests

commercial opportunities

client AIOS inputs

The core purpose is:

Turn authorised external and internal source data into structured, source-visible MWMS intelligence that can be searched, scored, compared, reviewed, routed, tested, and used by the correct Brain.

Core Doctrine

The MWMS doctrine is:

Data extraction is not valuable by itself.

Data extraction becomes valuable when it feeds a decision, dashboard, offer, campaign, experiment, report, client system, or governed knowledge base.

A scraper is not the asset.

An actor is not the asset.

An API is not the asset.

An RSS feed is not the asset.

A transcript is not the asset.

A crawler is not the asset.

A lead list is not the asset.

The asset is the structured intelligence produced from authorised source evidence.

MWMS should never build data extraction systems merely because they are technically possible.

Every extraction workflow must answer:

What decision will this support?

Which Brain needs the data?

What source is being captured?

Are we permitted to access and use it?

What fields are needed?

How fresh must the data be?

Is this one-time capture or continuous monitoring?

What action may the data trigger?

How will the raw evidence be preserved?

How will the data be cleaned?

How will changes be detected?

How will duplicates be handled?

How will the data be stored?

How will the data be scored?

What compliance risks exist?

What downstream use is permitted?

What human review is required?

What dashboard, report, or knowledge system will show the value?

What is the cost of maintaining the data flow?

Strategic Importance

The long-term strategic value of this framework is not scraping.

It is reliable external intelligence infrastructure.

That infrastructure can support:

market research

offer discovery

competitor intelligence

lead discovery

client intelligence

content research

affiliate research

PPL research

review mining

trend monitoring

price monitoring

product monitoring

local business intelligence

research evidence

sales relevance

AIOS opportunity discovery

The strongest extraction systems combine:

clear business purpose

authorised sources

stable actors

raw evidence preservation

structured schemas

normalisation

identity resolution

deduplication

enrichment

freshness control

scoring

human review

permitted-use decisions

Brain routing

observability

Core Definitions

A data extraction workflow is a repeatable process that collects information from an authorised external or internal source and converts it into structured data or source-linked evidence.

An actor is a reusable automation component that performs a defined extraction or processing job, such as:

scraping a website

collecting listings

extracting reviews

monitoring a page

retrieving a feed

capturing transcripts

normalising records

enriching a dataset

An actor infrastructure layer is the system that stores, runs, monitors, versions, and routes these extraction components.

A research pipeline is the full pathway from:

source selection

capture

raw evidence preservation

cleaning

normalisation

enrichment

scoring

candidate review

permitted-use decision

storage

comparison

dashboarding

Brain routing

A source intake pipeline is the governed path through which authorised information enters MWMS.

Change detection is the comparison of newly captured source data with previous source records to identify material differences.

Raw evidence is the original captured material before interpretation, summarisation, or cleansing changes its presentation.

A candidate record is an extracted entity that may be relevant to a downstream purpose but has not yet been accepted for that purpose.

A permitted-use decision is the explicit determination of whether an accepted record may be used for research, analysis, reporting, personalisation, sales preparation, outreach, client delivery, or another named activity.

MWMS Definition

The MWMS Data Extraction And Actor Infrastructure Framework is:

Research Brain and Data Brain’s standard for converting authorised web data, feeds, transcripts, files, emails, meetings, browser captures, marketplace signals, competitor intelligence, lead data, review data, product data, and market signals into structured, governed, reusable intelligence pipelines that support MWMS decisions, dashboards, offers, campaigns, knowledge systems, and client AIOS systems.

Scope

This framework applies to:

Research Brain market research

Data Brain structured intelligence

Affiliate Brain product research

PPL offer research

AIBS client research

competitor monitoring

offer intelligence

ad intelligence

review mining

lead enrichment

Google Maps-style business research

keyword-driven prospect discovery

search-result extraction

contact discovery

website contact extraction

social profile extraction

marketplace scraping

e-commerce product monitoring

real estate data extraction

social proof monitoring

price monitoring

ranking and visibility monitoring

AI tool monitoring

newsletter intelligence enrichment

case study extraction

client intelligence reports

content opportunity systems

data-backed dashboards

actor-based SaaS or micro-app infrastructure

RSS ingestion

website crawling

sitemap capture

YouTube transcript capture

uploaded file extraction

email intelligence intake

meeting transcript intake

support conversation capture

browser-selected text capture

form data intake

CRM imports

API feeds

webhook events

scheduled source monitoring

This framework applies whenever MWMS uses extraction, capture, crawling, feeds, transcripts, files, or scraping to support a business decision.

Core Principle

The core principle is:

Extract only what MWMS is authorised to access and can use, structure, verify, govern, retain, and act on.

A data extraction workflow should not create a pile of raw data.

It should create usable intelligence.

Usable intelligence means:

purpose-led

authorised

structured

cleaned

timestamped

source-linked

identity-linked where required

deduplicated

change-aware

scored where useful

human-reviewable

downstream-use controlled

routed to the right Brain

connected to a decision

displayed where useful

retained appropriately

deletable

not over-collected

The MWMS Data Extraction And Actor Infrastructure Model

Every extraction and intake system should be designed across seventeen layers:

Intelligence Need Layer

Source Authority And Permission Layer

Source Selection Layer

Capture Method Layer

Actor And Automation Layer

Raw Evidence Layer

Data Schema Layer

Cleaning And Normalisation Layer

Identity And Deduplication Layer

Enrichment Layer

Change Detection And Freshness Layer

Scoring And Classification Layer

Storage And Retention Layer

Dashboard And Report Layer

Brain Routing Layer

Governance And Compliance Layer

Candidate Discovery And Permitted Downstream Use Layer

  1. Intelligence Need Layer

The first step is not choosing a scraper.

The first step is identifying the intelligence need.

Intelligence Need Questions

Ask:

What are we trying to learn?

Which Brain needs the answer?

What decision depends on this?

Is this for affiliate, PPL, AIBS, content, ads, research, client work, sales, or HeadOffice?

Is this one-time research or recurring monitoring?

How fresh must the data be?

What fields are needed?

What evidence must be preserved?

What output is required?

What will happen if the data confirms the hypothesis?

What will happen if the data contradicts the hypothesis?

What downstream use may be requested?

Example Intelligence Needs

Find local businesses with poor follow-up signals.

Identify companies with weak website conversion paths.

Monitor competitor offers.

Track new product or pricing changes.

Find content gaps.

Extract review themes.

Build a qualified market map.

Identify potential AIOS clients.

Rule

No extraction should begin without a defined business question.

  2. Source Authority And Permission Layer

The system must determine whether MWMS may access and use the source.

Source Authority Questions

Ask:

Is the source public?

Is authentication required?

Do we have permission?

Do platform terms restrict extraction?

Does the source contain personal data?

Is the intended use different from the source’s original purpose?

Does the source contain restricted or sensitive information?

Does the client own or authorise the source?

Is the source licensed?

Can the data be retained?

Can it be reused?

Can it be used for outreach?

Permission Status

Approved

Approved With Conditions

Review Required

Restricted

Prohibited

Unknown

Rule

Technical accessibility does not equal permission.
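
A minimal Python sketch of this rule. The status names come from the Permission Status list above; the choice that only Approved and Approved With Conditions allow capture to start, and the function name `may_capture`, are assumptions for illustration.

```python
from enum import Enum


class PermissionStatus(Enum):
    APPROVED = "Approved"
    APPROVED_WITH_CONDITIONS = "Approved With Conditions"
    REVIEW_REQUIRED = "Review Required"
    RESTRICTED = "Restricted"
    PROHIBITED = "Prohibited"
    UNKNOWN = "Unknown"


# Assumption: only these two statuses allow capture to start.
CAPTURE_ALLOWED = {
    PermissionStatus.APPROVED,
    PermissionStatus.APPROVED_WITH_CONDITIONS,
}


def may_capture(status: PermissionStatus, technically_accessible: bool) -> bool:
    """Return True only when the source is reachable AND permission is recorded."""
    # Technical accessibility is necessary but never sufficient.
    return technically_accessible and status in CAPTURE_ALLOWED
```

A source that loads in a browser but has status Unknown returns False, so the workflow stops until the status is resolved.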

  3. Source Selection Layer

Sources should be selected according to the intelligence need.

Possible sources include:

official websites

search results

directories

marketplaces

feeds

sitemaps

public profiles

review platforms

social pages

news sources

documents

transcripts

emails

meetings

client systems

APIs

files

browser captures

Source Selection Criteria

authority

relevance

coverage

freshness

stability

cost

permission

reliability

extractability

change frequency

Rule

Use the strongest practical sources, not merely the easiest one.

  4. Capture Method Layer

Possible capture methods include:

manual capture

API

feed

actor

scraper

crawler

browser capture

file upload

transcript retrieval

email intake

webhook

database export

The method should be selected according to:

source

volume

frequency

structure

cost

reliability

permission

maintenance burden

Rule

The capture method must fit the source and decision.

  5. Actor And Automation Layer

Each actor should perform a clear job.

Actor examples include:

search-result actor

website crawler

contact extractor

review scraper

feed reader

transcript retriever

file parser

price monitor

change detector

normaliser

enrichment actor

Actor Requirements

Actor name

owner

purpose

source

inputs

outputs

version

frequency

cost

permissions

failure handling

destination

last tested

Rule

Actors should be reusable, observable, and replaceable.

  6. Raw Evidence Layer

Raw evidence must be preserved where the decision may require audit, verification, or reprocessing.

Raw evidence may include:

source HTML

original JSON

feed item

file

email

transcript

screenshot

browser capture

API response

Raw Evidence Fields

Source ID:

Source URL:

Captured At:

Published At:

Updated At:

Capture Method:

Actor Version:

Raw Location:

Hash:

Access Status:

Rule

Cleaning must not erase the ability to inspect the original evidence.
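
One possible shape for a raw evidence record, sketched in Python. The fields mirror the Raw Evidence Fields list above; the use of SHA-256 for the hash, the `record_capture` helper, and the idea that `raw_location` is a path or object key are assumptions, not a prescribed MWMS implementation.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawEvidence:
    """One preserved capture; frozen so later cleaning cannot alter it."""
    source_id: str
    source_url: str
    captured_at: datetime
    published_at: Optional[datetime]
    updated_at: Optional[datetime]
    capture_method: str
    actor_version: str
    raw_location: str  # hypothetical: path or object key of the untouched capture
    content_hash: str  # SHA-256 of the raw bytes (assumed hash choice)
    access_status: str


def record_capture(source_id: str, source_url: str, raw_bytes: bytes,
                   raw_location: str, capture_method: str, actor_version: str,
                   access_status: str = "Approved") -> RawEvidence:
    """Build the evidence record at capture time, before any cleaning runs."""
    return RawEvidence(
        source_id=source_id,
        source_url=source_url,
        captured_at=datetime.now(timezone.utc),
        published_at=None,  # left empty unless the source states it
        updated_at=None,
        capture_method=capture_method,
        actor_version=actor_version,
        raw_location=raw_location,
        content_hash=hashlib.sha256(raw_bytes).hexdigest(),
        access_status=access_status,
    )
```

Cleaned records can then carry the `source_id` and `content_hash`, which keeps the original capture inspectable after processing.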

  7. Data Schema Layer

Unstructured captured data must become structured before it can support MWMS decisions.

Schema fields depend on use case.

Example Fields For AIBS Lead Discovery

candidate_id

business_name

website

domain

industry

location

contact_name

contact_role

email

phone

social_profiles

review_rating

review_count

booking_link_present

chatbot_present

response_gap_signal

source_url

extracted_at

lead_score

risk_notes

candidate_status

permitted_use_status

Rule

Fields must have clear definitions and data types.
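
A short Python sketch of what "clear definitions and data types" could look like for a subset of the example fields. The type assignments in `LEAD_SCHEMA` and the `schema_errors` helper are illustrative assumptions; the field names are taken from the list above.

```python
from typing import Any, Dict, List

# Hypothetical type assignments for a subset of the example fields.
LEAD_SCHEMA: Dict[str, type] = {
    "candidate_id": str,
    "business_name": str,
    "domain": str,
    "review_rating": float,
    "review_count": int,
    "booking_link_present": bool,
    "source_url": str,
    "candidate_status": str,
    "permitted_use_status": str,
}


def schema_errors(record: Dict[str, Any]) -> List[str]:
    """List every missing or wrongly typed field instead of failing on the first."""
    errors: List[str] = []
    for name, expected in LEAD_SCHEMA.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            found = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {found}")
    return errors
```

An empty list means the record matches the schema; anything else can be routed to review rather than silently stored.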

  8. Cleaning And Normalisation Layer

Cleaning may include:

removing duplicates

normalising phone numbers

normalising URLs

standardising categories

cleaning names

removing irrelevant records

removing broken records

validating required fields

checking timestamps

checking source links

detecting incomplete records

separating text from HTML

removing spam results

language detection

speaker label normalisation

transcript formatting

date normalisation

Cleaning Status Values

Raw

Cleaning

Cleaned

Partially Cleaned

Rejected

Review Needed

Rule

MWMS must distinguish raw evidence from cleaned data.
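
The distinction between raw and cleaned data can be sketched in Python as a function that returns a cleaned copy and never modifies the raw record. The specific cleaning steps, the required-field check, and the name `clean_record` are assumptions; the status values come from the Cleaning Status Values list above.

```python
import re
from typing import Dict


def clean_record(raw: Dict[str, str]) -> Dict[str, str]:
    """Return a cleaned copy; the raw dict passed in is left untouched."""
    cleaned = dict(raw)
    # Collapse repeated whitespace in the business name.
    cleaned["business_name"] = " ".join(raw.get("business_name", "").split())
    # Keep only digits and a leading-plus style character in phone numbers.
    cleaned["phone"] = re.sub(r"[^\d+]", "", raw.get("phone", ""))
    # Assumed required fields: a name and a source link.
    required_ok = bool(cleaned["business_name"]) and bool(raw.get("source_url"))
    cleaned["cleaning_status"] = "Cleaned" if required_ok else "Review Needed"
    return cleaned
```

Because the raw record is preserved as captured, a cleaning rule can be changed later and re-run against the original values.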

  9. Identity And Deduplication Layer

Extracted records must be matched to the correct entity.

Entity types may include:

business

person

client

product

offer

website

article

video

email

meeting

document

listing

campaign

source

Deduplication Methods

stable source ID

URL normalisation

content hash

file hash

email message ID

transcript ID

business identifier

domain

verified email

phone number

title plus publication date

source plus external record ID

Deduplication Questions

Is this record genuinely new?

Is it an updated version?

Is it a duplicate from another source?

Is it syndicated content?

Does it belong to an existing entity?

Does the new record replace the old record?

Rule

Duplicate data should not inflate evidence, lead counts, trend strength, source confidence, or outreach volume.
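
As an illustration of one method from the list above (domain plus URL normalisation), the following Python sketch merges records that point at the same business website while keeping every source link. The helper names and the record keys `website`, `business_name`, and `source_url` are assumptions; real identity matching would usually combine several methods.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def domain_key(url: str) -> str:
    """Lower-case the host and drop a leading 'www.' so variants compare equal."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate  # urlsplit needs a scheme to find the host
    host = (urlsplit(candidate).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def dedupe_by_domain(records: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Keep one entity per domain, preserving every source URL that found it."""
    merged: Dict[str, Dict[str, object]] = {}
    for record in records:
        key = domain_key(record["website"])
        if key not in merged:
            merged[key] = {
                "domain": key,
                "business_name": record["business_name"],
                "source_urls": [],
            }
        merged[key]["source_urls"].append(record["source_url"])
    return list(merged.values())
```

Counting the merged entities rather than the input rows is what stops the same business from being reported as several leads.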

  10. Enrichment Layer

Enrichment adds useful context.

Possible enrichment includes:

email finding

domain lookup

social profile lookup

business size estimate

industry classification

review sentiment

website technology detection

traffic estimate

ad activity detection

offer classification

buyer persona classification

contact role detection

location enrichment

AI summary

pain signal extraction

opportunity note

source confidence

content classification

Enrichment should improve actionability, not merely add data volume.

Enrichment Evidence Status

Verified Fact

Derived Value

Estimate

Inference

Classification

Unverified Enrichment

Rule

Enriched inference must not be stored as confirmed source fact.
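
One way to keep inference separate from fact is to store the evidence status alongside every enriched value. This Python sketch uses the status names from the Enrichment Evidence Status list; the `EnrichedValue` shape, its `basis` field, and the `confirmed_facts` helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class EvidenceStatus(Enum):
    VERIFIED_FACT = "Verified Fact"
    DERIVED_VALUE = "Derived Value"
    ESTIMATE = "Estimate"
    INFERENCE = "Inference"
    CLASSIFICATION = "Classification"
    UNVERIFIED_ENRICHMENT = "Unverified Enrichment"


@dataclass(frozen=True)
class EnrichedValue:
    field: str
    value: object
    status: EvidenceStatus
    basis: str  # hypothetical: what the value was taken or derived from


def confirmed_facts(values: Iterable[EnrichedValue]) -> List[EnrichedValue]:
    """Return only values that may be presented as confirmed source fact."""
    return [v for v in values if v.status is EvidenceStatus.VERIFIED_FACT]
```

An AI-inferred pain signal stored with status Inference stays visible to reviewers but is excluded wherever only confirmed facts are allowed.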

  11. Change Detection And Freshness Layer

Recurring extraction should identify material changes.

Possible changes include:

price change

headline change

offer change

product availability change

policy update

new review

rating change

website redesign

new CTA

new testimonial

new job listing

new RSS item

updated article

changed product specification

new meeting commitment

new client information

Freshness Questions

When was the source published?

When was it last updated?

When was it captured?

When was it last verified?

How often should it be rechecked?

Is the record still current?

Has it been replaced?

Rule

Capture time, source publication time, source update time, and verification time must remain separate.
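
A minimal Python sketch of two checks this layer implies, under stated assumptions: a content-hash comparison to detect that a source changed, and a recheck test driven by verification time alone. A hash comparison flags any byte-level difference, so deciding whether a change is material would still need a separate rule. The function names and the recheck interval are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta


def has_changed(previous_hash: str, new_content: bytes) -> bool:
    """True when the newly captured bytes differ from the stored SHA-256 hash."""
    return hashlib.sha256(new_content).hexdigest() != previous_hash


def needs_recheck(last_verified_at: datetime, recheck_every: timedelta,
                  now: datetime) -> bool:
    """Use verification time only; capture and publication times stay separate."""
    return now - last_verified_at >= recheck_every
```

Keeping `last_verified_at` distinct from the capture timestamp means a record captured long ago but verified recently is not rechecked unnecessarily, and the reverse case is not missed.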

  12. Scoring And Classification Layer

Data should be scored when decisions require ranking.

Possible scores include:

lead fit score

trophy client score

offer opportunity score

affiliate opportunity score

review weakness score

local SEO opportunity score

competitor threat score

content opportunity score

trend strength score

pricing gap score

pain signal score

buyer sophistication score

AIOS fit score

source confidence score

change materiality score

Example AIBS Lead Score

Pain Signal: 25

Ability To Pay: 20

Reachability: 15

AIOS Fit: 15

Review Or Reputation Gap: 10

Website Or Conversion Gap: 10

Compliance Risk: -5

Rule

Scoring must be explainable enough for human review.

Scoring must not convert weak inference into objective fact.
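
The example AIBS lead score can be sketched in Python so that each component stays visible next to the total. The weights are the ones listed above; rating each component between 0.0 and 1.0, and the function name `score_lead`, are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Weights copied from the Example AIBS Lead Score.
LEAD_WEIGHTS: Dict[str, int] = {
    "Pain Signal": 25,
    "Ability To Pay": 20,
    "Reachability": 15,
    "AIOS Fit": 15,
    "Review Or Reputation Gap": 10,
    "Website Or Conversion Gap": 10,
    "Compliance Risk": -5,
}


def score_lead(ratings: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return (total, per-component points); ratings are assumed to be 0.0 to 1.0."""
    components = {
        name: weight * ratings.get(name, 0.0)
        for name, weight in LEAD_WEIGHTS.items()
    }
    return sum(components.values()), components
```

With these weights the positive components sum to 95, and a fully flagged compliance risk subtracts 5. Returning the component breakdown alongside the total gives a reviewer the reasoning behind the number rather than the number alone.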

  13. Storage And Retention Layer

Possible destinations include:

Supabase

Google Sheets

Airtable

CRM

WordPress database

MCR page

vector memory

local CSV

dashboard database

client AIOS database

research archive

raw evidence archive

Storage Questions

Is this the source of truth?

Is this temporary?

Is this raw evidence?

Is this structured metrics data?

Is this research context?

Does it need semantic retrieval?

Does it need dashboarding?

Does it contain personal data?

Does it need client isolation?

How long should it be retained?

Who can access it?

Can it be deleted?

Retention Status

Active

Temporary

Historical

Stale

Replaced

Archived

Delete Requested

Deleted

Rule

The raw source and processed records must remain connected.

  14. Dashboard And Report Layer

Extracted data should become visible only when visibility helps a decision.

Possible outputs include:

opportunity dashboard

competitor dashboard

lead review queue

change-monitoring report

market intelligence report

client intelligence report

source health report

actor performance report

Dashboard Questions

What decision should the dashboard support?

Who reviews it?

What needs action?

What is stale?

What failed?

What changed?

What was accepted?

What was rejected?

What is permitted for downstream use?

Rule

Dashboards must support review and action.

  15. Brain Routing Layer

Extracted intelligence must be routed to the correct Brain.

Possible routes include:

Research Brain

Data Brain

AIBS Brain

Affiliate Brain

PPL Brain

Content Brain

Sales Brain

Experimentation Brain

HeadOffice Brain

Compliance Brain

Risk Brain

Routing should define:

destination

reason

record type

evidence

confidence

requested action

owner

Rule

Extraction does not create downstream authority.

The receiving Brain must apply its own governance.

  16. Governance And Compliance Layer

Governance should cover:

source authority

privacy

personal data

terms of service

client isolation

retention

deletion

sensitive data

regulated data

outreach use

recording consent

copyright

access control

Governance Review Outcomes

Approved

Approved With Conditions

Human Review Required

Restricted

Rejected

Rule

Data that is lawful or appropriate for research is not automatically appropriate for outreach, personalisation, publication, or client delivery.

  17. Candidate Discovery And Permitted Downstream Use Layer

This layer governs the transition from extracted record to accepted operational candidate.

It is especially important for:

lead discovery

prospect research

local business lists

contact extraction

sales personalisation

client intelligence

personalised asset production

outbound campaign preparation

Candidate Discovery Path

Business Question

→ Search Term Or Source Definition

→ Initial Extraction

→ Candidate Record Creation

→ Relevance Review

→ Accept, Reject, Merge, Or Hold

→ Approved Enrichment

→ Identity Verification

→ Contact Confidence

→ Permitted-Use Review

→ Campaign Or Brain Readiness

→ Downstream Handoff

Candidate Status

New

Review Required

Accepted For Enrichment

Rejected

Duplicate

Merged

Hold

Enriched

Identity Verified

Permitted For Research

Permitted For Internal Analysis

Permitted For Personalisation Preparation

Permitted For Outreach Review

Restricted

Expired

Rule

An extracted search result is a candidate, not an approved prospect.

Scrape Or Reject Gate

Before deeper extraction or enrichment, the system may apply a scrape-or-reject decision.

Review criteria may include:

business relevance

target-market fit

location

industry

company type

obvious competitor status

existing client relationship

existing suppression

duplicate status

source quality

compliance risk

commercial value

Scrape Decision

Scrape

Reject

Hold

Merge

Escalate

Rule

A scrape decision authorises only the approved next extraction step.

It does not authorise outreach.
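
A hedged Python sketch of a scrape-or-reject gate. The decision values come from the Scrape Decision list above; the order in which criteria are checked, the `compliance_risk` labels, and the mapping of each criterion to a decision are assumptions that a real review policy would need to define.

```python
from enum import Enum


class ScrapeDecision(Enum):
    SCRAPE = "Scrape"
    REJECT = "Reject"
    HOLD = "Hold"
    MERGE = "Merge"
    ESCALATE = "Escalate"


def scrape_or_reject(*, is_duplicate: bool, is_suppressed: bool,
                     is_competitor: bool, is_existing_client: bool,
                     fits_target_market: bool,
                     compliance_risk: str) -> ScrapeDecision:
    """Hypothetical rule order; the result covers the next extraction step only."""
    if is_duplicate:
        return ScrapeDecision.MERGE
    if is_suppressed or is_competitor:
        return ScrapeDecision.REJECT
    if compliance_risk == "high":
        return ScrapeDecision.ESCALATE
    if is_existing_client:
        return ScrapeDecision.HOLD
    return ScrapeDecision.SCRAPE if fits_target_market else ScrapeDecision.REJECT
```

The function returns a scrape decision and nothing else, so a Scrape result cannot be read as an outreach permission.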

Contact Extraction Standard

Contact extraction may collect:

public business email

named business contact

phone

social profile

contact page

role

department

The system should record:

contact source

contact type

identity confidence

business versus personal status

verification status

capture date

permitted-use status

Contact Confidence

Verified

Probable

Unverified

Conflicting

Invalid

Rule

A found email address is not automatically a verified decision-maker or an outreach-ready contact.

One-Contact-Per-Company Rule

Where workflows require one primary contact, the selection logic should be explicit.

Possible priority:

verified relevant decision-maker

verified role-based contact

general business contact

contact form

no suitable contact

The system must not silently discard useful alternative contacts without preserving source evidence where retention is justified.
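
The priority order above can be made explicit in a few lines of Python. This is a sketch only: the `contact_type` key and the function name are hypothetical, and the priority labels are copied from the list above.

```python
from typing import Dict, List, Optional, Tuple

# Highest priority first, as listed under "Possible priority".
CONTACT_PRIORITY: List[str] = [
    "verified relevant decision-maker",
    "verified role-based contact",
    "general business contact",
    "contact form",
]

Contact = Dict[str, str]


def select_primary_contact(
    contacts: List[Contact],
) -> Tuple[Optional[Contact], List[Contact]]:
    """Return (primary, alternatives); alternatives are kept, not discarded."""
    ranked = sorted(
        (c for c in contacts if c["contact_type"] in CONTACT_PRIORITY),
        key=lambda c: CONTACT_PRIORITY.index(c["contact_type"]),
    )
    primary = ranked[0] if ranked else None  # None means "no suitable contact"
    alternatives = [c for c in contacts if c is not primary]
    return primary, alternatives
```

Returning the alternatives alongside the primary contact keeps the selection auditable and avoids the "first found email" shortcut.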

Permitted Downstream Use

Possible permitted uses include:

Research Only

Internal Analysis

Market Mapping

Dashboarding

Client Intelligence

Personalisation Preparation

Human Outreach Review

Campaign Eligible

Publication

Restricted

Prohibited

Each permitted-use decision should record:

record ID

source authority

identity confidence

purpose

channel

reviewer

decision

conditions

expiry

Rule

Permission for one use does not create permission for every use.
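
A small Python sketch of a per-use permission check. It models only a subset of the decision fields listed above (record ID, reviewer, expiry, plus a set of permitted uses); the class shape and the `is_permitted` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class PermittedUseDecision:
    record_id: str
    permitted_uses: FrozenSet[str]  # e.g. {"Research Only", "Internal Analysis"}
    reviewer: str
    expiry: Optional[date] = None


def is_permitted(decision: PermittedUseDecision, use: str, today: date) -> bool:
    """Each use must be named explicitly; an expired decision permits nothing."""
    if decision.expiry is not None and today > decision.expiry:
        return False
    return use in decision.permitted_uses
```

A record permitted for Research Only returns False when checked against Campaign Eligible, so broader use requires a new, recorded decision.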

Downstream Handoff Record

Candidate ID:

Entity:

Business Question:

Source:

Raw Evidence:

Candidate Status:

Identity Status:

Enrichment Status:

Score:

Risk:

Permitted Use:

Destination Brain:

Requested Action:

Human Review:

Owner:

Expiry:

Lead Discovery And Outreach Boundary

Research Brain and Data Brain may:

discover

extract

clean

normalise

deduplicate

enrich

score

classify

prepare evidence

recommend a downstream route

They must not independently:

authorise cold outreach

send messages

generate deceptive personalisation

override suppression

decide legal compliance

publish personal data

The Outbound Lead Enrichment And Cold Outreach Governance Framework controls outreach readiness and delivery.

The Personalised Visual Sales Asset Production And Governance Framework controls personalised visual, likeness, logo, voice, and synthetic-media assets.

The AIOS Lead Capture And Conversion Infrastructure Framework controls lead and conversion progression after a legitimate response or lead event.

Rule

Extraction authority stops before communication authority.

Actor Registry Standard

MWMS should maintain an Actor Registry.

Actor Name:

Brain Owner:

Purpose:

Source:

Capture Method:

Input Fields:

Output Fields:

Raw Evidence Location:

Run Frequency:

Destination Table:

Dashboard Or Report:

Compliance Notes:

Cost:

Status:

Version:

Last Tested:

Failure Notes:

Rule

Actors should be registered before becoming operational dependencies.
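
A minimal in-memory sketch of this rule in Python. The entry carries a subset of the registry fields listed above; the `ActorRegistry` class, the `require` method, and the default status value are hypothetical, and a real registry would live in a governed store rather than in process memory.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ActorEntry:
    actor_name: str
    brain_owner: str
    purpose: str
    source: str
    version: str
    status: str = "Registered"  # hypothetical status value


class ActorRegistry:
    def __init__(self) -> None:
        self._entries: Dict[str, ActorEntry] = {}

    def register(self, entry: ActorEntry) -> None:
        """Add or replace the entry for this actor name."""
        self._entries[entry.actor_name] = entry

    def require(self, actor_name: str) -> ActorEntry:
        """Raise if a workflow tries to depend on an unregistered actor."""
        if actor_name not in self._entries:
            raise LookupError(f"actor not registered: {actor_name}")
        return self._entries[actor_name]
```

Calling `require` at the start of a scheduled run turns an unregistered dependency into an immediate, visible failure.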

Source Pipeline Registry Standard

Pipeline Name:

Source Type:

Source Owner:

Permission Status:

Capture Method:

Actor Or Workflow:

Run Frequency:

Raw Evidence Location:

Processed Destination:

Deduplication Method:

Change Detection Rule:

Candidate Review Gate:

Permitted Use Rule:

Retention Rule:

Brain Destination:

Review Owner:

Status:

Rule

Recurring source intake must have an identifiable owner and maintenance status.

Data Extraction Request Template

Request Name:

Requesting Brain:

Business Question:

Decision Supported:

Source Or Sources:

Source Authority:

Data Needed:

Raw Evidence Needed: Yes / No

One-Time Or Recurring:

Watch Or Historical Retrieval:

Preferred Method:

Output Fields:

Destination:

Dashboard Needed: Yes / No

Scoring Needed: Yes / No

Candidate Review Needed: Yes / No

Change Detection Needed: Yes / No

Deduplication Method:

Permitted Downstream Use:

Retention Rule:

Compliance Risk:

Human Review Needed:

Action After Extraction:

Owner:

Due Date:

Extraction Output Template

Extraction Name:

Date:

Source Or Sources:

Source Authority:

Method Used:

Actor Or Workflow Used:

Raw Evidence Preserved:

Records Extracted:

Candidates Created:

Records Accepted:

Records Rejected:

Records Held:

Duplicates Detected:

Changes Detected:

Identity Verified:

Enrichment Completed:

Permitted Use Decisions:

Data Quality Notes:

Key Findings:

Top Opportunities:

Risks Or Compliance Notes:

Recommended Brain Routing:

Recommended Action:

Next Run Needed:

Retention Status:

Data Extraction Scorecard

Decision Value: 20

Data Availability: 15

Source Reliability: 10

Extraction Feasibility: 10

Repeat Use Potential: 10

Dashboard Or Reporting Value: 10

Commercial Value: 10

Compliance Risk: -10

Maintenance Burden: -5

MWMS Strategic Fit: 10

Interpretation

80 Or Higher

Strong extraction candidate.

65 To 79

Useful. Test carefully.

50 To 64

Use one-time or manual research first.

Below 50

Park or reject.

Rule

Do not build extraction infrastructure for low-value data.
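
The scorecard and its interpretation bands can be sketched in Python as follows. The weights and band labels are copied from above; rating each criterion between 0.0 and 1.0 and the function names are assumptions.

```python
from typing import Dict

# Weights copied from the Data Extraction Scorecard.
SCORECARD_WEIGHTS: Dict[str, int] = {
    "Decision Value": 20,
    "Data Availability": 15,
    "Source Reliability": 10,
    "Extraction Feasibility": 10,
    "Repeat Use Potential": 10,
    "Dashboard Or Reporting Value": 10,
    "Commercial Value": 10,
    "Compliance Risk": -10,
    "Maintenance Burden": -5,
    "MWMS Strategic Fit": 10,
}


def scorecard_total(ratings: Dict[str, float]) -> float:
    """Weighted sum; each rating is assumed to be between 0.0 and 1.0."""
    return sum(w * ratings.get(name, 0.0) for name, w in SCORECARD_WEIGHTS.items())


def interpret(score: float) -> str:
    """Map a total onto the interpretation bands."""
    if score >= 80:
        return "Strong extraction candidate."
    if score >= 65:
        return "Useful. Test carefully."
    if score >= 50:
        return "Use one-time or manual research first."
    return "Park or reject."
```

With these weights the positive criteria sum to 95, so a request only reaches the top band when compliance risk and maintenance burden are low relative to its value.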

Application To Research Brain

Research Brain is the primary owner of this framework.

Research Brain should:

define research questions

select sources

request extraction

interpret extracted data

create market intelligence

support avatar definition

validate niches

monitor competitors

detect trends

manage recurring source monitoring

review candidates

route intelligence

Research Brain Rule

Research Brain must convert extracted data into market understanding, not merely datasets.

Application To Data Brain

Data Brain owns:

schemas

tables

source records

data quality

field definitions

source tracking

identity matching

deduplication

change records

candidate states

permitted-use states

dashboard feeds

retention

deletion

actor registry

source pipeline registry

pipeline monitoring

Data Brain Rule

Data Brain must prevent raw extraction from becoming intelligence debt.

Application To AIBS Brain

AIBS may use this framework for:

prospect lists

trophy client scoring

local business opportunity research

review and reputation gaps

lead-capture AIOS candidates

AI audit targets

vertical AIOS research

competitor offers

client intelligence reports

AIBS Rule

AIBS should use extracted data to find better clients and build stronger AIOS packages.

Application To Sales Brain

Sales Brain may use accepted and permitted records for:

outreach relevance

recent observations

prospect prioritisation

human review

personalisation preparation

Sales Brain must apply:

contact verification

outreach governance

suppression checks

channel authority

message approval

Sales Rule

Extraction does not authorise contact.

Application To Affiliate Brain

Affiliate Brain may use data extraction for:

competitor affiliate pages

product angles

testimonials

sales claims

pricing

bonus stacks

content gaps

niche demand

ad angles

offer changes

Rule

Extracted intelligence should improve selection and testing, not support copying.

Application To PPL Brain

PPL Brain may use extraction for:

lead verticals

form flows

buyer categories

local demand

competitor lead-generation pages

compliance signals

offer economics

conversion friction

Rule

PPL extraction must respect lead-handling and compliance rules.

Application To Content Brain

Content Brain may mine:

reviews

comments

competitor blogs

YouTube titles

YouTube transcripts

questions

forums

search results

RSS feeds

social themes

customer pain language

Rule

Extracted content signals should become original MWMS content strategy.

Application To Automation Brain

Automation Brain may operate approved workflows.

It owns:

actor execution

schedules

retries

failure handling

status polling

logs

alerts

cost visibility

Automation Brain does not decide:

source authority

business purpose

permitted outreach

publication authority

Rule

Automation authority must remain narrower than governance authority.

Failure Modes

Failure Mode 1: Tool First Extraction

A scraper is selected before a business question exists.

Correction

Define the intelligence need first.

Failure Mode 2: Public Means Permitted

Accessible data is treated as unrestricted.

Correction

Apply source-authority and use-purpose review.

Failure Mode 3: Search Result Becomes Prospect

Every result is treated as outreach-ready.

Correction

Create a candidate record and review gate.

Failure Mode 4: Scrape Approval Becomes Outreach Approval

A record approved for enrichment is automatically contacted.

Correction

Separate extraction, permitted use, and communication authority.

Failure Mode 5: Found Email Becomes Decision-Maker

A generic or unverified email is treated as a named buyer.

Correction

Record contact type and confidence.

Failure Mode 6: Duplicate Companies Inflate Opportunity

The same business appears through multiple results.

Correction

Use domain and identity deduplication.

Failure Mode 7: AI Enrichment Becomes Fact

Inferred pain or company size is stored as verified.

Correction

Label enrichment status.

Failure Mode 8: No Raw Evidence

Cleaned records cannot be checked.

Correction

Preserve source-linked evidence.

Failure Mode 9: Candidate Rejection Is Lost

Rejected records reappear in later campaigns.

Correction

Persist rejection, duplicate, and suppression states.

Failure Mode 10: Downstream Use Is Undefined

Data moves into sales or personalisation without review.

Correction

Record permitted-use status.

Failure Mode 11: Stale Contacts Are Reused

Old contact data is treated as current.

Correction

Apply freshness and verification rules.

Failure Mode 12: One Contact Per Company Selection Is Arbitrary

The first found email is used.

Correction

Define contact-priority logic.

Failure Mode 13: Permanent Over-Collection

Every field is retained indefinitely.

Correction

Apply minimisation and retention.

Failure Mode 14: Actor Failure Is Silent

The pipeline produces incomplete records without warning.

Correction

Log status, coverage, and failure.

Failure Mode 15: Score Hides Weak Evidence

A numerical score appears objective even when the evidence behind it is weak.

Correction

Preserve explainable components and confidence.
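A score stays explainable if its components travel with it. The following sketch assumes a weighted average with a separate confidence figure equal to the share of weight backed by verified evidence. The component names, weights, and `evidence_status` labels are illustrative, not framework-defined.

```python
from dataclasses import dataclass


@dataclass
class ScoreComponent:
    name: str
    value: float          # 0.0 to 1.0
    weight: float
    evidence_status: str  # assumed labels: "verified", "inferred", "missing"


def score_candidate(components: list[ScoreComponent]) -> dict:
    """Return the score together with its components and an evidence-based confidence."""
    total_weight = sum(c.weight for c in components)
    if total_weight == 0:
        return {"score": 0.0, "confidence": 0.0, "components": []}
    score = sum(c.value * c.weight for c in components) / total_weight
    verified_weight = sum(c.weight for c in components if c.evidence_status == "verified")
    return {
        "score": round(score, 2),
        "confidence": round(verified_weight / total_weight, 2),
        "components": [(c.name, c.value, c.weight, c.evidence_status) for c in components],
    }
```

A reviewer can then see that a high score resting mostly on inferred components carries low confidence.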

Failure Mode 16: Data Volume Is Mistaken For Value

The team celebrates thousands of records regardless of how many are accepted or acted on.

Correction

Measure accepted candidates, useful decisions, and commercial outcomes.

Drift Protection

This framework protects MWMS from:

scraping without purpose

unrestricted data collection

tool-as-architecture thinking

weak source authority

lost raw evidence

duplicate records

stale data

unverified enrichment

lead-count inflation

scrape-to-send automation

wrong-contact use

unclear downstream permission

unowned recurring pipelines

data hoarding

Drift Signals

Watch for:

“Scrape everything.”

“It is public.”

“We found an email.”

“Just send to all of them.”

“The actor returned a thousand records.”

“AI says they are a good fit.”

“We can clean it later.”

“Duplicates do not matter.”

“The source link is not needed.”

“We approved the scrape, so outreach is fine.”

“Use the first contact.”

“Keep all fields forever.”

Rule

When these signals appear, return to purpose, source authority, evidence, identity, deduplication, candidate review, permitted use, and Brain routing.

Strategic Summary

The durable value of actor infrastructure is not volume.

It is the ability to turn authorised sources into structured, reviewable, decision-ready intelligence.

The later lead-generation material strengthens this framework by making the transition from search result to operational candidate explicit.

The controlled pathway is:

Keyword Or Source

→ Extraction

→ Candidate

→ Review

→ Accept, Reject, Merge, Or Hold

→ Enrichment

→ Identity Verification

→ Permitted-Use Decision

→ Brain Handoff

This prevents extraction systems from silently becoming uncontrolled outreach systems.
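The pathway above can be sketched as an explicit transition table, so that a record cannot skip a gate. The state names are assumptions derived from the pathway steps, including a return from hold to review that the pathway does not spell out. The framework itself does not prescribe these identifiers.

```python
# Assumed state names mirroring the controlled pathway; terminal states map to empty sets.
ALLOWED_TRANSITIONS = {
    "extracted": {"candidate"},
    "candidate": {"in_review"},
    "in_review": {"accepted", "rejected", "merged", "held"},
    "held": {"in_review"},
    "accepted": {"enriched"},
    "enriched": {"identity_verified"},
    "identity_verified": {"permitted_use_decided"},
    "permitted_use_decided": {"handed_off"},
    "rejected": set(),
    "merged": set(),
    "handed_off": set(),
}


def advance(current: str, target: str) -> str:
    """Move a record to the target state only if the pathway allows that step."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"transition {current!r} -> {target!r} is not permitted")
    return target
```

In this sketch there is no transition into any outreach state at all, which reflects the point that communication authority sits outside the extraction pathway.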

Final Standard

Every data extraction and source-intake workflow must begin with a business question and a source-authority check, and end with structured, source-visible, deduplicated, freshness-aware, reviewed, permitted-use-controlled, routed intelligence.

A valid extraction system must define:

intelligence need

source authority

source

capture method

actor or workflow

raw evidence rule

schema

cleaning rules

identity and deduplication rules

enrichment rules

change detection rules

freshness rules

scoring logic

candidate review

scrape or reject decision where applicable

contact confidence

permitted downstream use

storage destination

retention and deletion rules

dashboard or report

Brain routing

compliance review

action after extraction

That is the MWMS Data Extraction And Actor Infrastructure Standard.

MWMS System Change Log

Version: v1.2

Date: 2026-06-21

Author: HeadOffice

Change

Updated the MWMS Data Extraction And Actor Infrastructure Framework from v1.1 to v1.2 using the later AI Automations by Jack material covering:

keyword-driven business discovery

search-result extraction

human scrape-or-reject decisions

contact-detail extraction

website enrichment

email and social-profile discovery

candidate selection

company deduplication

contact-confidence review

downstream outreach preparation

Expanded the existing sixteen-layer model into a seventeen-layer model.

Added:

  • Candidate Discovery And Permitted Downstream Use Layer

Added standards covering:

  • candidate records
  • candidate status
  • scrape-or-reject gates
  • accepted-for-enrichment status
  • contact extraction
  • contact-confidence classification
  • one-contact-per-company selection
  • permitted downstream use
  • research versus outreach authority
  • candidate handoff records
  • campaign-readiness boundaries
  • persistent rejection and suppression states

Expanded the schema to include:

  • candidate_status
  • permitted_use_status

Expanded the Source Pipeline Registry with:

  • Candidate Review Gate
  • Permitted Use Rule

Expanded the Data Extraction Request Template with:

  • Candidate Review Needed
  • Permitted Downstream Use

Expanded the Extraction Output Template with:

  • Candidates Created
  • Records Held
  • Identity Verified
  • Enrichment Completed
  • Permitted Use Decisions

Added explicit doctrine that:

  • an extracted search result is a candidate, not an approved prospect
  • a scrape decision authorises only the next extraction step
  • a found email is not automatically a verified decision-maker
  • permission for research does not create permission for outreach
  • extraction authority stops before communication authority

Change Impact Declaration

This update materially strengthens the lead-discovery and downstream-use boundary without changing the framework’s primary ownership.

Research Brain remains responsible for:

  • intelligence need
  • source selection
  • interpretation
  • candidate relevance
  • Brain routing

Data Brain remains responsible for:

  • schemas
  • source records
  • identity
  • deduplication
  • candidate states
  • permitted-use states
  • storage
  • retention
  • deletion

Automation Brain may execute approved extraction and enrichment workflows but does not determine outreach authority.

Sales Brain and AIBS Brain may receive accepted candidate records but must apply their own communication, compliance, asset, and campaign governance.

The update does not authorise:

  • unrestricted scraping
  • automatic outreach
  • automatic personalised asset delivery
  • use of unverified contacts
  • suppression overrides
  • publication of personal data

Pages Created

  • None

Pages Updated

  • MWMS Data Extraction And Actor Infrastructure Framework updated from v1.1 to v1.2

Pages Deprecated

  • None

Standalone Pages Not Created

The following standalone pages were not created because their durable intelligence is governed within this updated framework:

  • MWMS Lead Scraping Framework
  • MWMS Candidate Discovery Framework
  • MWMS Scrape Or Reject Gate Standard
  • MWMS Contact Extraction Framework
  • MWMS Contact Confidence Standard
  • MWMS Permitted Data Use Framework
  • MWMS Lead List Preparation Framework

Registries Requiring Update

  • MCR Page Registry
  • Research Brain Page Registry
  • Data Brain Page Registry where this framework is operationally referenced
  • MCR Copy Map where the framework version is recorded
  • MWMS Course Absorption Decision Registry

Canon Version Update Required

No immediate Research Brain Canon or Data Brain Canon version change is required unless either Canon directly records framework versions or contains candidate-use rules that conflict with v1.2.

The candidate review, scrape-or-reject, contact-confidence, and permitted-use controls should be included during the next scheduled Research Brain and Data Brain Canon alignment review.

Change Log Entry Required

Yes.

The v1.2 update must be recorded in:

  • MWMS System Change Log
  • MCR Page Registry change history where applicable
  • Research Brain Page Registry change history where applicable
  • Data Brain Page Registry change history where applicable
  • MWMS Course Absorption Decision Registry

Strategic Absorption Result

The later AI Automations by Jack lead-generation material has been absorbed into the existing MWMS Data Extraction And Actor Infrastructure Framework.

The absorption preserves:

  • keyword-driven discovery
  • search-result extraction
  • contact extraction
  • enrichment
  • human review
  • candidate selection
  • structured records
  • Brain routing

The absorption rejects:

  • every search result being treated as a prospect
  • scrape approval being treated as outreach approval
  • every found email being treated as a verified decision-maker
  • data volume being treated as commercial value
  • rejected records being forgotten and rediscovered
  • downstream use occurring without an explicit permitted-use decision

The resulting v1.2 framework establishes that MWMS extraction pipelines must separate:

  • discovery
  • capture
  • candidate review
  • enrichment
  • identity verification
  • permitted use
  • Brain handoff
  • communication authority
