DNS Science is built on a distributed, scalable architecture designed for real-time domain intelligence. Our platform processes millions of DNS records daily through a network of specialized daemons and verification systems.
Flask-based web application serving dynamic content with real-time updates. Built with responsive design principles and progressive enhancement.
RESTful API endpoints handling domain lookups, dark web monitoring, RDAP queries, and comprehensive DNS analysis.
22 specialized daemons running continuously to discover, enrich, and monitor domain data from multiple sources.
PostgreSQL database with Redis caching for high-performance queries. Optimized indexes and materialized views for analytics.
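As an illustration of how the Redis layer fronts PostgreSQL, here is a minimal read-through caching sketch. The client setup, key naming, TTL, and column list are assumptions for illustration; only the discovered_domains table and its created_at column come from the monitoring examples later in this document.

# Read-through cache sketch (illustrative, not our exact implementation)
import json
import redis
import psycopg2

cache = redis.Redis(host="localhost", port=6379, db=0)   # assumed connection details
db = psycopg2.connect("dbname=dnsscience")               # assumed DSN
CACHE_TTL = 300  # seconds to keep hot lookups in Redis

def get_domain_record(domain):
    """Serve from Redis if cached, otherwise hit PostgreSQL and populate the cache."""
    key = f"domain:{domain}"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)
    with db.cursor() as cur:
        cur.execute(
            "SELECT name, created_at FROM discovered_domains WHERE name = %s",
            (domain,),
        )
        row = cur.fetchone()
    if row is None:
        return None
    record = {"name": row[0], "created_at": str(row[1])}
    cache.setex(key, CACHE_TTL, json.dumps(record))
    return record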
Our platform runs 22 specialized daemons, each focused on a specific aspect of domain intelligence:
Our advanced DNS monitoring lets us debug clients' internal network DNS problems, track traffic trends, analyze attack data, and more:
Every domain in our database goes through a multi-stage verification process:
Our dark web monitoring system provides passive intelligence on Tor hidden services, I2P networks, and blockchain DNS:
Tracking 1,160 active Tor exit nodes, updated hourly from the Tor Project APIs (a sketch of this refresh job appears below). Database of known .onion addresses mapped to clearnet domains.
Monitoring ENS (Ethereum Name Service), Handshake (.hns), and Namecoin (.bit) domains for alternative DNS registrations.
Analyzing SSL certificates for hidden services, identifying anomalies and self-signed certs indicative of dark web infrastructure.
100% passive monitoring - no active crawling. All data from public sources, CT logs, and community-verified mappings.
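The hourly exit-node refresh is a small recurring job. Here is a rough sketch of what it can look like; the Tor Project's public bulk exit list endpoint is real, but the tor_exit_nodes column layout and upsert shown here are assumptions.

# Tor exit-node refresh sketch (schema and upsert are illustrative)
import requests
import psycopg2

EXIT_LIST_URL = "https://check.torproject.org/torbulkexitlist"  # public Tor Project list

def refresh_tor_exit_nodes(conn):
    """Fetch the current exit-node IPs and upsert them into tor_exit_nodes."""
    resp = requests.get(EXIT_LIST_URL, timeout=30)
    resp.raise_for_status()
    ips = [line.strip() for line in resp.text.splitlines() if line.strip()]
    with conn.cursor() as cur:
        for ip in ips:
            # Hypothetical columns (ip_address, last_seen); the real table may differ
            cur.execute(
                """INSERT INTO tor_exit_nodes (ip_address, last_seen)
                   VALUES (%s, NOW())
                   ON CONFLICT (ip_address) DO UPDATE SET last_seen = NOW()""",
                (ip,),
            )
    conn.commit()
    return len(ips)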
Dark web monitoring utilizes 10 specialized tables:
onion_addresses - Known .onion hidden services
i2p_addresses - I2P eepsite addresses
tor_exit_nodes - Active Tor exit node database
alternative_dns - Blockchain DNS registrations
darkweb_certificates - SSL cert anomaly tracking
darkweb_lookups - Audit trail of all lookups
onion_clearnet_mappings - Verified .onion ↔ clearnet associations
darkweb_rate_limits - Per-user rate limiting
darkweb_stats - Cached statistics
darkweb_audit_log - Compliance and security logging

Processing millions of domains required a fundamental shift from sequential to parallel processing. Here's how we scaled our infrastructure.
Our original domain valuation daemon processed domains sequentially:
# Original approach - single-threaded daemon loop
import time

while True:
    domains = get_domains_needing_valuation(batch_size=100)
    for domain in domains:
        # 3-4 DB queries per domain
        age_data = fetch_rdap_data(domain)
        ssl_data = fetch_ssl_cert(domain)
        email_data = fetch_email_records(domain)
        # Calculate and save
        valuation = calculate_value(domain, age_data, ssl_data, email_data)
        save_to_database(valuation)
    time.sleep(60)  # Wait before the next batch
Issues encountered at scale:
We migrated to a distributed task queue architecture using Celery and Redis:
Celery Beat schedules batches of 1,000 domains every 2 minutes. Each domain becomes an independent task that can be processed by any available worker.
16 concurrent Celery workers process valuations simultaneously. Each worker handles one domain at a time with automatic retry on failure.
Redis serves as the message broker, queuing tasks and distributing them to workers. Provides persistence and visibility into queue depth.
Failed tasks automatically retry up to 3 times with exponential backoff. No more silently dropped valuations.
# New approach - Celery distributed tasks
from celery import group
from celery_config import app  # shared Celery instance

@app.task(bind=True, max_retries=3)
def value_domain(self, domain_id, domain_name):
    try:
        # Same valuation logic, but runs in parallel
        valuation = calculate_and_save_valuation(domain_id, domain_name)
        return {'domain': domain_name, 'value': valuation}
    except Exception as e:
        # Automatic retry with exponential backoff
        raise self.retry(exc=e, countdown=2 ** self.request.retries)

@app.task
def queue_valuation_batch(batch_size=1000):
    domains = get_domains_needing_valuation(batch_size)
    # Queue all domains as parallel tasks
    tasks = group(value_domain.s(d.id, d.name) for d in domains)
    tasks.apply_async()  # Fan out to all available workers
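The batch producer above is fired on a timer by Celery Beat. A minimal sketch of the corresponding schedule entry, assuming the celery_config module referenced by the systemd unit below, might look like:

# celery_config.py (sketch) - Beat schedule for the valuation pipeline
from celery import Celery

app = Celery("dnsscience", broker="redis://localhost:6379/0")  # assumed broker URL

app.conf.beat_schedule = {
    "queue-valuation-batch": {
        "task": "tasks.queue_valuation_batch",  # hypothetical module path
        "schedule": 120.0,                      # every 2 minutes
        "kwargs": {"batch_size": 1000},
    },
}
app.conf.task_routes = {
    "tasks.value_domain": {"queue": "valuation"},  # matches the worker's -Q valuation
}

Beat runs as its own process (celery -A celery_config beat) alongside the workers.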
# /etc/systemd/system/dnsscience-celery-valuation.service
[Service]
# -Q valuation: consume only the valuation queue; -c 16: 16 concurrent workers
ExecStart=/usr/local/bin/celery -A celery_config worker \
    -Q valuation \
    -c 16 \
    -n valuation@%h \
    --loglevel=INFO
Restart=always
RestartSec=10
Operating 17+ background services requires automated monitoring and recovery. Manual intervention doesn't scale.
During development, we encountered several recurring issues:
Checks if all enabled services are running. Auto-restarts crashed daemons. Logs restart events to system journal.
Monitors table timestamps. If no new valuations in 60 min, restarts valuation daemon. If no new domains in 60 min, restarts discovery.
Tracks records per hour. Alerts and restarts if below thresholds (e.g., <50 domains/hr or <100 valuations/hr); a sketch of this check follows the freshness script below.
Systemd service ensures all daemons start on instance reboot. No manual intervention required after AWS maintenance.
#!/bin/bash
# /usr/local/bin/dnsscience-health-monitor.sh (runs via cron every 15 min)

# Check data freshness - restart the owning service if the table has gone stale
check_data_freshness() {
    local table=$1
    local service=$2
    local max_minutes=$3
    # -tA: tuples only, unaligned, so psql returns a bare integer
    local last_update
    last_update=$(psql -tA -c "SELECT FLOOR(EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) / 60) FROM $table;")
    if [ "$last_update" -gt "$max_minutes" ]; then
        logger "STALE DATA: $table - restarting $service"
        systemctl restart "$service"
    fi
}

# Domain discovery - should have new domains every hour
check_data_freshness "discovered_domains" "domain-discovery.service" 60
# Valuations - should value domains every hour
check_data_freshness "domain_valuations" "dnsscience-domain-valuation.service" 60
Single command deploys all services, daemons, and configuration to production:
./deploy_all_services.sh
# Syncs to S3, deploys to instance, enables services, restarts everything
# No more "forgot to deploy" issues
Our architecture is designed to scale horizontally across multiple dimensions:
Auto Scaling Group with load balancer. Currently running t3.medium instances with capacity to scale to t3.xlarge.
RDS PostgreSQL with read replicas. Multi-AZ deployment for high availability.
Redis ElastiCache (cache.t3.small) with 1.5GB memory for hot data.
Daemons run independently and can be distributed across multiple worker instances.
Our architecture is designed with compliance in mind:
Our API follows REST principles with predictable endpoints:
GET /api/stats/live # Real-time platform statistics
GET /api/darkweb/stats # Dark web monitoring stats
GET /api/darkweb/onion/:domain # Check for .onion alternatives
POST /api/lookup # Domain intelligence lookup
GET /api/rdap/:domain # RDAP registration data
GET /api/whois/:domain # WHOIS information
POST /api/bulk-lookup # Batch domain analysis
POST /api/scan # Domain scan (Simple/Advanced/Expert modes)
GET /api/ip/:ip/scan # IP scan (Simple/Advanced/Expert modes)
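As a usage example, any HTTP client can drive these endpoints. Here is a short Python sketch using requests; the base URL, request body, and response handling are illustrative rather than a documented contract:

# Illustrative API client (hypothetical base URL and request shape)
import requests

BASE_URL = "https://dnsscience.example/api"

# Single-domain intelligence lookup
resp = requests.post(f"{BASE_URL}/lookup", json={"domain": "example.com"}, timeout=30)
resp.raise_for_status()
print(resp.json())

# RDAP registration data for the same domain
rdap = requests.get(f"{BASE_URL}/rdap/example.com", timeout=30)
rdap.raise_for_status()
print(rdap.json())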
Our platform offers three scanning modes with progressively granular control:
Quick scans with automatic checks for DNS, SSL, and basic security indicators. Perfect for rapid assessments.
Comprehensive scans including DNSSEC validation, threat intelligence feeds, and enhanced security checks.
Fully customizable scans with granular control over intelligence sources, DNS resolvers, and data collection methods. Choose exactly which checks to run.
Domain Scans: Customize DNS analysis (records, DNSSEC, propagation), security checks (SSL, certificate transparency), email security (SPF, DKIM, DMARC), and threat intelligence sources.
IP Scans: Configure geolocation providers (IPInfo, MaxMind, BGP, RIPEstat), security sources (AbuseIPDB, RBL, threat feeds), and advanced analysis (Cloudflare detection, reverse DNS, WHOIS lookups).
# Example: Expert Mode Domain Scan
POST /api/scan
{
"domain": "example.com",
"expert": true,
"options": {
"dns": ["records", "dnssec", "propagation"],
"security": ["ssl", "ssl-chain", "cert-transparency"],
"email": ["spf", "dkim", "dmarc", "mx-health"],
"intel": ["whois", "reputation", "threat"]
}
}
# Example: Expert Mode IP Scan
GET /api/ip/8.8.8.8/scan?expert=true&options={"geo":["ipinfo","maxmind"]}
Timeline: 3-4 weeks
Cost: $0-10/month
Flexible querying interface for complex use cases. Query exactly the data you need with a single request. Perfect for advanced integrations and custom dashboards.
Tech: Graphene-Python, Apollo Server, GraphQL subscriptions
Timeline: 2-3 weeks
Cost: $50-150/month
Live domain monitoring feeds with instant notifications. Stream CT log discoveries, SSL certificate changes, and DNS updates in real-time.
Tech: Socket.IO, Redis Pub/Sub, AWS API Gateway WebSocket
Timeline: 6-8 weeks
Cost: $50-200/month
Predictive analytics for domain reputation scoring. Anomaly detection, phishing prediction, and automated threat classification using TensorFlow and scikit-learn.
Tech: TensorFlow, scikit-learn, AWS SageMaker
Timeline: 4-6 weeks
Cost: $200-800/month
Natural language queries, automated report generation, and intelligent domain recommendations. Powered by large language models and vector embeddings.
Tech: OpenAI GPT-4, Claude API, Pinecone vector DB
Note: Detailed implementation plans, technical architecture diagrams, and cost breakdowns are available in our internal documentation. These features are being prioritized based on user feedback and enterprise requirements.
This architecture documentation is continuously updated as we enhance our platform.
Last updated: November 2025