5. Architecture & Technology
Syncora is built around one guiding principle: raw sensitive data should never leave the source, but its statistical patterns should. To make this real, we’ve designed a pipeline that combines autonomous agents, privacy-preserving synthesis, rigorous quality checks, and on-chain provenance.
This section explains how that pipeline works in practice.
5.1 The High-Level Flow
At its core, Syncora turns raw private data → synthetic datasets → licensed training assets.
The steps are:
Upload: Contributor provides a dataset.
Validate: Agents scan for schema, ownership, and compliance.
Synthesize: Generative models create synthetic copies.
Score: Datasets are tested for fidelity, privacy, and novelty.
License: Synthetic datasets are published to the Data Hub.
Payout: Contributors earn royalties via $SYNKO.
This flow is fully automated. No Syncora employee ever sees raw data; all handling is agent-driven.
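To make the flow concrete, here is a minimal orchestration sketch. It is illustrative only: the stage names mirror the steps above, and every body is a placeholder rather than Syncora's actual implementation.

```python
from typing import Callable

# Placeholder stages; each mirrors one step of the flow above.
def upload(raw): return raw                 # contributor provides a dataset
def validate(data): return data             # schema, ownership, compliance checks
def synthesize(data): return data           # generative model creates a synthetic copy
def score(data): return data                # fidelity, privacy, novelty metrics
def license_dataset(data): return data      # publish to the Data Hub
def payout(data): return data               # royalty split in $SYNKO

PIPELINE: list[Callable] = [upload, validate, synthesize, score, license_dataset, payout]

def run_pipeline(raw_dataset):
    artifact = raw_dataset
    for stage in PIPELINE:
        artifact = stage(artifact)          # agents hand off; no human touches the artifact
    return artifact
```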
5.2 Agentic Structuring
The first layer of Syncora is our agentic pipeline: a set of autonomous modules that handle tasks usually left to humans in Web2 systems.
Schema Inference Agent
Reads an uploaded dataset (tables, JSON, logs).
Auto-detects schema: column types, constraints, null distributions.
Flags inconsistencies (e.g., a “date” column with free text).
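A minimal sketch of what such an agent might do, using pandas; the heuristics here (dtype inspection, a date-parsing check) are illustrative assumptions, not Syncora's internals.

```python
import pandas as pd

def infer_schema(df: pd.DataFrame) -> dict:
    """Infer per-column types and null rates; flag unparseable 'date' columns."""
    schema = {}
    for col in df.columns:
        schema[col] = {
            "dtype": str(df[col].dtype),
            "null_fraction": float(df[col].isna().mean()),
        }
        if "date" in col.lower():
            parsed = pd.to_datetime(df[col], errors="coerce")
            # Non-null values that failed to parse suggest free text in a date column.
            schema[col]["invalid_dates"] = int((parsed.isna() & df[col].notna()).sum())
    return schema

df = pd.DataFrame({"signup_date": ["2024-01-01", "not a date"], "age": [30, 41]})
print(infer_schema(df))  # signup_date reports invalid_dates: 1
```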
Privacy Gate Agent
Scans for personally identifiable information (PII).
Uses regex + ML models for names, SSNs, addresses, etc.
Removes or masks flagged values before synthesis.
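A minimal sketch of the regex half of this gate; the patterns and replacement labels are illustrative, and the ML side (names, addresses) is omitted.

```python
import re

# Illustrative patterns only; a production gate would cover far more PII types.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact jane@example.com, SSN 123-45-6789"))
# -> Contact [EMAIL], SSN [SSN]
```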
Ownership Verification Agent
Detects signs of AI-generated or scraped/copyrighted content.
If data is fraudulent, the contributor's stake is slashed.
Protects against “garbage in, garbage out.”
Structuring Agent
Converts unstructured input (PDFs, logs, transcripts) into structured, model-ready formats.
Outputs hierarchical JSONL, tabular embeddings, or image tensors.
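As a toy illustration of this conversion, the sketch below turns raw log lines into JSONL records; the log format and field names are hypothetical.

```python
import json
import re

# Hypothetical 'timestamp level message' log format.
LOG_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")

def logs_to_jsonl(lines):
    """Yield one JSON record per parseable log line."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            yield json.dumps(match.groupdict())

for record in logs_to_jsonl(["2025-01-01T00:00:00Z INFO service started"]):
    print(record)  # {"ts": "2025-01-01T00:00:00Z", "level": "INFO", "msg": "service started"}
```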
This agentic approach allows Syncora to handle messy, real-world enterprise data without human intervention.
5.3 Synthetic Generation
Once data is validated and structured, it moves into the synthetic generation stage. We use multiple model classes, depending on the modality:
Tabular (EHRs, financial records): GANs and CTGAN variants preserve joint distributions across mixed data types.
Time-Series (IoT sensors, trading logs): RNN-based and transformer-based sequence models simulate rare events and anomalies.
Text/JSON (logs, chat transcripts): Large language models fine-tuned for schema consistency.
Images (X-rays, scans): Diffusion models generate medically accurate but synthetic images.
The goal isn’t just to “reproduce” the data, but to learn its statistical fingerprint and then generate new samples that look and behave like the real thing without copying any record.
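As a concrete illustration of the tabular path, here is a minimal sketch using the open-source SDV library's CTGAN synthesizer (SDV 1.x single-table API); Syncora's production models, parameters, and tooling may differ.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real_df = pd.read_csv("records.csv")  # hypothetical input table

# Detect column types, then fit CTGAN on the real table.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_df)

# Sample brand-new rows that follow the learned joint distribution.
synthetic_df = synthesizer.sample(num_rows=len(real_df))
```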
5.4 Verification & Quality Layer
Synthetic data is only useful if it’s both accurate and safe. To guarantee this, every dataset passes through a battery of tests:
Fidelity Metrics
F1 score: Measures overlap of distributions. Syncora benchmarks at 0.57 vs Gretel’s 0.33.
KS/KL divergence: Quantifies similarity of real vs synthetic distributions.
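For example, a two-sample KS test can be run per numeric column with SciPy; this is a generic sketch, not Syncora's scoring harness.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_report(real: np.ndarray, synthetic: np.ndarray) -> tuple[float, float]:
    """Return (statistic, p-value); a small statistic means similar marginals."""
    result = ks_2samp(real, synthetic)
    return result.statistic, result.pvalue

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 5_000)
synthetic = rng.normal(0.02, 1.01, 5_000)  # stand-in for a synthetic column
print(ks_report(real, synthetic))
```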
Coverage & Novelty
Detects whether rare edge cases are preserved.
Ensures the model hasn’t simply memorized and reproduced original rows.
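The bluntest form of that memorization check is counting exact row copies, as in the pandas sketch below; near-duplicate checks would use distance metrics instead.

```python
import pandas as pd

def exact_copies(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> int:
    """Count synthetic rows that exactly duplicate some real row."""
    return len(synthetic_df.merge(real_df.drop_duplicates(), how="inner"))

real_df = pd.DataFrame({"age": [30, 41], "zip": ["10001", "94103"]})
synthetic_df = pd.DataFrame({"age": [30, 52], "zip": ["10001", "60601"]})
print(exact_copies(real_df, synthetic_df))  # 1 (the 30/10001 row was copied)
```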
Privacy Attacks
Membership inference: Can an attacker tell if a record was in the original set?
Nearest Neighbor Distance Ratio (NNDR): Measures how close synthetic points are to real ones. Syncora achieves 0.0003 vs Gretel's 0.0100 (lower is safer).
Utility Testing
Train on Synthetic, Test on Real (TSTR): Train a model on synthetic data, then test it on withheld real data.
Train on Real, Test on Synthetic (TRTS): The reverse.
Ensures synthetic data supports downstream AI tasks.
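A minimal TSTR sketch with scikit-learn; the model choice and the prepared arrays (X_syn, y_syn, X_real, y_real) are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def tstr_score(X_syn, y_syn, X_real, y_real) -> float:
    """Train on synthetic data, evaluate on withheld real data."""
    model = RandomForestClassifier(random_state=0)
    model.fit(X_syn, y_syn)
    return accuracy_score(y_real, model.predict(X_real))

# TRTS is the mirror image: tstr_score(X_real, y_real, X_syn, y_syn)
```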
Only if a dataset passes these checks does it move to the licensing stage.
5.5 Blockchain & Provenance Layer
The final layer is on-chain anchoring and payouts. This is what makes Syncora not just a SaaS tool, but a trustworthy marketplace.
Smart Contracts on Solana record four events:
VALIDATE: ownership and schema verified
SYNTHESIZE: synthetic dataset created
SCORE: fidelity and privacy metrics anchored
LICENSE: dataset licensed and the royalty split executed
Storage
Synthetic datasets are uploaded to IPFS.
On-chain references point to datasets + their metrics.
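For illustration, a provenance record pairing a dataset's content hash with its IPFS address might look like the sketch below; the field names are hypothetical, not Syncora's actual on-chain schema.

```python
import hashlib
import json
import time

def provenance_record(dataset_bytes: bytes, ipfs_cid: str, event: str) -> dict:
    """Build a hypothetical provenance entry for on-chain anchoring."""
    return {
        "event": event,  # "VALIDATE", "SYNTHESIZE", "SCORE", or "LICENSE"
        "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "ipfs_cid": ipfs_cid,          # content address of the stored dataset
        "timestamp": int(time.time()),
    }

record = provenance_record(b"...synthetic dataset bytes...", "QmExampleCid", "SCORE")
print(json.dumps(record, indent=2))
```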
Royalties
When a dataset is licensed, smart contracts automatically split revenue:
80% to contributor
20% to Syncora (platform fee)
Paid in $SYNKO (fiat payments are auto-converted on the backend).
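The split itself is simple arithmetic; it is shown in Python below purely for illustration, since in production it would execute inside the Solana smart contract.

```python
# The 80/20 revenue split described above.
CONTRIBUTOR_SHARE = 0.80
PLATFORM_SHARE = 0.20

def split_revenue(amount_synko: float) -> dict:
    return {
        "contributor": amount_synko * CONTRIBUTOR_SHARE,
        "platform": amount_synko * PLATFORM_SHARE,
    }

print(split_revenue(1_000.0))  # {'contributor': 800.0, 'platform': 200.0}
```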
This layer provides verifiable provenance: buyers can see exactly when a dataset was uploaded, validated, synthesized, and scored.
5.6 Compliance & Security
Because Syncora deals with sensitive domains, we designed compliance from day one.
GDPR
Data minimization (only synthetic retained).
Right to erasure (contributors can request account deletion).
Lawful basis: contributors consent to data submission.
HIPAA
Syncora acts as a Business Associate when handling PHI.
PHI deleted post-synthesis; synthetic datasets contain no identifiers.
SOC2
Roadmap to Type II by Q1 2026.
Covers encryption at rest/in transit, monitoring, and audit trails.
Security Measures
Encrypted uploads.
Secure enclaves for processing.
No human access to raw data.
Automated deletion of raw data.
5.7 Why This Architecture Matters
Each piece of the system addresses a fundamental blocker:
Agents handle messy real-world data without human leaks.
Synthesis models unlock regulated datasets without exposing originals.
Verification ensures datasets are useful and non-leaky.
Blockchain provenance builds trust between contributors and buyers.
Compliance alignment gives enterprises confidence to participate.
The result is a pipeline that scales safely, automates trust, and turns private data into usable assets for AI.
5.8 Diagram

[Architecture diagram]
Some companies generate synthetic data. Others focus on data sovereignty. Syncora is the only one that combines:
Agentic automation
High-fidelity synthesis
On-chain provenance
Contributor royalties
That combination turns private, locked data into an active economy.