5. Architecture & Technology


Syncora is built around one guiding principle: raw sensitive data should never leave the source, but its statistical patterns should. To make this real, we’ve designed a pipeline that combines autonomous agents, privacy-preserving synthesis, rigorous quality checks, and on-chain provenance.

This section explains how that pipeline works in practice.


5.1 The High-Level Flow

At its core, Syncora turns raw private data → synthetic datasets → licensed training assets.

The steps are:

  1. Upload: Contributor provides a dataset.

  2. Validate: Agents check schema, ownership, and compliance.

  3. Synthesize: Generative models create synthetic copies.

  4. Score: Datasets are tested for fidelity, privacy, and novelty.

  5. License: Synthetic datasets are published to the Data Hub.

  6. Payout: Contributors earn royalties via $SYNKO.

This flow is fully automated. No Syncora employee ever sees raw data; all handling is agent-driven.


5.2 Agentic Structuring

The first layer of Syncora is our agentic pipeline: autonomous modules that handle tasks usually left to humans in Web2 systems.

Schema Inference Agent

  • Reads an uploaded dataset (tables, JSON, logs).

  • Auto-detects schema: column types, constraints, null distributions.

  • Flags inconsistencies (e.g., a “date” column with free text), as in the sketch below.
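
A minimal sketch of the kind of check this agent runs, using pandas; the column-name heuristic and output format are illustrative assumptions, not Syncora’s production logic:

```python
import pandas as pd

def infer_schema(df: pd.DataFrame) -> dict:
    """Infer a lightweight schema and flag inconsistencies per column."""
    schema = {}
    for col in df.columns:
        series = df[col]
        info = {
            "dtype": str(series.dtype),
            "null_fraction": round(float(series.isna().mean()), 4),
            "unique_values": int(series.nunique()),
        }
        # Flag "date" columns whose values fail to parse as dates (free text).
        if "date" in col.lower():
            parsed = pd.to_datetime(series, errors="coerce")
            bad = parsed.isna() & series.notna()
            if bad.any():
                info["flag"] = f"{int(bad.sum())} non-date values in a date column"
        schema[col] = info
    return schema
```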

Privacy Gate Agent

  • Scans for personally identifiable information (PII).

  • Uses regex + ML models for names, SSNs, addresses, etc.

  • Removes or masks flagged values before synthesis (see the sketch below).
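
A sketch of the regex half of this gate; the patterns below are deliberately simple stand-ins, and in practice they would be paired with ML-based named-entity recognition to catch names and addresses:

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace anything matching a PII pattern with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(mask_pii("Reach me at jane@example.com or 555-867-5309."))
# -> "Reach me at [EMAIL] or [PHONE]."
```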

Ownership Verification Agent

  • Detects signs of AI-generated or scraped/copyrighted content.

  • If data is fraudulent, the contributor’s stake is slashed.

  • Protects against “garbage in, garbage out.”

Structuring Agent

  • Converts unstructured input (PDFs, logs, transcripts) into structured, model-ready formats.

  • Outputs hierarchical JSONL, tabular embeddings, or image tensors (a toy conversion follows).
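
As a toy example of this conversion, a sketch that turns free-form log lines into JSONL; the log format is a hypothetical one chosen for illustration:

```python
import json
import re

# Hypothetical log format: "2025-01-15 12:00:03 ERROR payment timeout"
LOG_LINE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+) (?P<message>.*)$")

def logs_to_jsonl(raw: str) -> str:
    """Convert raw log text into one JSON record per line (JSONL)."""
    records = [
        json.dumps(m.groupdict())
        for line in raw.strip().splitlines()
        if (m := LOG_LINE.match(line))
    ]
    return "\n".join(records)
```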

This agentic approach allows Syncora to handle messy, real-world enterprise data without human intervention.


5.3 Synthetic Generation

Once data is validated and structured, it moves into the synthetic generation stage. We use multiple model classes, depending on the modality:

  • Tabular (EHRs, financial records): GANs and CTGAN variants preserve joint distributions across mixed data types.

  • Time-Series (IoT sensors, trading logs): RNN-based and transformer-based sequence models simulate rare events and anomalies.

  • Text/JSON (logs, chat transcripts): Large language models fine-tuned for schema consistency.

  • Images (X-rays, scans): Diffusion models generate medically accurate but synthetic images.

The goal isn’t just to “reproduce” the data, but to learn its statistical fingerprint and then generate new samples that look and behave like the real thing without copying any record.
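
For the tabular path, here is a sketch of what generation looks like with the open-source sdv library’s CTGAN implementation; the file name is hypothetical, and Syncora’s production models are its own variants rather than stock CTGAN:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real_df = pd.read_csv("ehr_extract.csv")  # hypothetical, already PII-masked

# Detect column types, then learn the joint distribution of the real table.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_df)

# Sample brand-new rows that follow the learned statistical fingerprint.
synthetic_df = synthesizer.sample(num_rows=len(real_df))
```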


5.4 Verification & Quality Layer

Synthetic data is only useful if it’s both accurate and safe. To guarantee this, every dataset passes through a battery of tests:

Fidelity Metrics

  • F1 score: Measures overlap of distributions (higher is better). Syncora benchmarks at 0.57 vs Gretel’s 0.33.

  • Kolmogorov–Smirnov (KS) statistic and Kullback–Leibler (KL) divergence: Quantify similarity of real vs synthetic distributions; a per-column KS sketch follows.
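
A per-column KS check is easy to sketch with SciPy (a statistic of 0 means the distributions are indistinguishable; how the per-column scores are aggregated into one number is left out here):

```python
from scipy.stats import ks_2samp

def column_ks(real, synthetic) -> dict:
    """Two-sample KS statistic for every numeric column (0 = identical)."""
    return {
        col: ks_2samp(real[col].dropna(), synthetic[col].dropna()).statistic
        for col in real.select_dtypes("number").columns
    }
```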

Coverage & Novelty

  • Detects whether rare edge cases are preserved.

  • Ensures the model hasn’t simply memorized and reproduced original rows.

Privacy Attacks

  • Membership inference: Can an attacker tell if a record was in the original set?

  • Nearest Neighbor Distance Ratio (NNDR): Measures how close synthetic points are to real ones (sketched below). Syncora achieves 0.0003 (lower is better, safer) vs Gretel’s 0.0100.
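
A minimal sketch of one common NNDR formulation, using scikit-learn; the exact scoring used in the benchmark above is Syncora’s own:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nndr(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """For each synthetic point, the distance to its nearest real neighbor
    divided by the distance to the second nearest (one common formulation)."""
    nn = NearestNeighbors(n_neighbors=2).fit(real)
    distances, _ = nn.kneighbors(synthetic)
    return distances[:, 0] / distances[:, 1]
```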

Utility Testing

  • Train on Synthetic, Test on Real (TSTR): Train a model on synthetic, test on withheld real.

  • Train on Real, Test on Synthetic (TRTS): The reverse.

  • Ensures synthetic data supports downstream AI tasks; a TSTR sketch follows.
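
A TSTR sketch with scikit-learn; the model and metric are illustrative choices, and the score is meant to be compared against a real-on-real baseline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def tstr_score(synth_X, synth_y, real_X_test, real_y_test) -> float:
    """Train on Synthetic, Test on Real: fit on synthetic rows, then score
    on a held-out real split. A score near the real-on-real baseline means
    the synthetic data supports the downstream task."""
    model = RandomForestClassifier(random_state=0)
    model.fit(synth_X, synth_y)
    return f1_score(real_y_test, model.predict(real_X_test))
```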

Only if a dataset passes these checks does it move to the licensing stage.


5.5 Blockchain & Provenance Layer

The final layer is on-chain anchoring and payouts. This is what makes Syncora not just a SaaS tool, but a trustworthy marketplace.

Smart contracts on Solana record three events:

  • VALIDATE: ownership and schema verified

  • SYNTHESIZE: synthetic dataset created

  • SCORE: fidelity and privacy metrics anchored

Storage

  • Synthetic datasets are uploaded to IPFS.

  • On-chain references point to the datasets and their metrics (a sketch of such a record follows).
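
A sketch of the kind of record each event could anchor; sha256 stands in for the IPFS content identifier (real CIDs use IPFS’s multihash format), and the field names are illustrative:

```python
import hashlib
import json
import time

def provenance_record(dataset_bytes: bytes, metrics: dict, event: str) -> str:
    """Build the JSON payload referenced on-chain for one lifecycle event."""
    return json.dumps({
        "event": event,  # "VALIDATE", "SYNTHESIZE", or "SCORE"
        "content_hash": hashlib.sha256(dataset_bytes).hexdigest(),
        "metrics": metrics,  # e.g. fidelity and privacy scores
        "timestamp": int(time.time()),
    })
```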

Royalties

  • When a dataset is licensed, smart contracts automatically split revenue:

    • 80% to contributor

    • 20% to Syncora (platform fee)

  • Paid in $SYNKO (fiat payments are auto-converted on the backend); a sketch of the split math follows.
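
The split itself is simple arithmetic; a sketch in integer basis points, mirroring how an on-chain program would avoid floating-point math (the function name and units are illustrative):

```python
def split_revenue(amount: int, contributor_bps: int = 8_000) -> tuple[int, int]:
    """Split a license payment (in $SYNKO base units) using basis points.
    8,000 bps encodes the 80/20 split described above; integer division
    plus a remainder keeps the two shares summing exactly to the input."""
    contributor = amount * contributor_bps // 10_000
    platform = amount - contributor
    return contributor, platform

assert split_revenue(1_000_000) == (800_000, 200_000)
```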

This layer provides verifiable provenance: buyers can see exactly when a dataset was uploaded, validated, synthesized, and scored.


5.6 Compliance & Security

Because Syncora deals with sensitive domains, we designed compliance from day one.

GDPR

  • Data minimization (only synthetic data is retained).

  • Right to erasure (contributors can request account deletion).

  • Lawful basis: contributors consent to data submission.

HIPAA

  • Syncora acts as a Business Associate when handling PHI.

  • PHI deleted post-synthesis; synthetic datasets contain no identifiers.

SOC2

  • Roadmap to Type II by Q1 2026.

  • Covers encryption at rest/in transit, monitoring, and audit trails.

Security Measures

  • Encrypted uploads.

  • Secure enclaves for processing.

  • No human access to raw data.

  • Automated deletion of raw data.


5.7 Why This Architecture Matters

Each piece of the system addresses a fundamental blocker:

  • Agents handle messy real-world data without human leaks.

  • Synthesis models unlock regulated datasets without exposing originals.

  • Verification ensures datasets are useful and non-leaky.

  • Blockchain provenance builds trust between contributors and buyers.

  • Compliance alignment gives enterprises confidence to participate.

The result is a pipeline that scales safely, automates trust, and turns private data into usable assets for AI.


5.8 What Makes Syncora Different

Some companies generate synthetic data. Others focus on data sovereignty. Syncora is the only one that combines:

  • Agentic automation

  • High-fidelity synthesis

  • On-chain provenance

  • Contributor royalties

That combination turns private, locked data into an active economy.
