Scalable ETL Pipelines for Research Data: Lessons from HCI Studies
Published on January 15, 2026
Research data processing presents unique challenges: heterogeneous formats, quality validation requirements, and the need for reproducibility. In this post, I'll share insights from building an ETL pipeline that processed 50,000+ finger-movement events for multi-modal touchscreen studies at Saarland University's Computational Interaction Group.
The Challenge: Multi-Modal Research Data
Human-Computer Interaction (HCI) studies generate diverse data streams:
- Video frames: RGB streams from multiple camera angles
- Touch events: Timestamped finger position logs
- Sensor data: Pressure, velocity, and acceleration measurements
- Annotations: Ground truth labels from human reviewers
Our challenge was to align these streams, extract meaningful features, and validate data quality—all while maintaining reproducibility for PhD research projects.
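For concreteness, the code below works with event records shaped roughly like this; the field names and values are illustrative rather than the study's actual schema:

touch_event = {
    'timestamp': 1523.4,       # ms since session start, from the device log
    'position': (512, 384),    # (x, y) in screen pixels
    'pressure': 0.63,          # normalized sensor reading
}

video_event = {
    'timestamp': 1520.1,       # ms since session start, derived from the frame index
    'position': (509, 381),    # fingertip position mapped to screen pixels
}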
Pipeline Architecture
The solution combined MediaPipe for hand tracking, OpenCV for video processing, and custom validation logic:
1. Extraction Layer
import mediapipe as mp
import cv2

class FingerTracker:
    def __init__(self):
        self.hands = mp.solutions.hands.Hands(
            static_image_mode=False,       # video mode: track hands across frames
            max_num_hands=2,
            min_detection_confidence=0.7
        )

    def process_frame(self, frame):
        """Extract hand landmarks from a video frame."""
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input
        results = self.hands.process(rgb_frame)
        if results.multi_hand_landmarks:
            return self._extract_finger_positions(results)
        return None
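The _extract_finger_positions helper is called above but not shown; a minimal sketch, assuming we track only the index fingertip of each detected hand and return its normalized image coordinates, might continue the class body like this:

    def _extract_finger_positions(self, results):
        """Return one (x, y) index-fingertip position per detected hand."""
        tip_id = mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP
        positions = []
        for hand_landmarks in results.multi_hand_landmarks:
            tip = hand_landmarks.landmark[tip_id]
            positions.append((tip.x, tip.y))  # normalized [0, 1]; mapping to screen pixels happens downstream
        return positions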
2. Transformation Layer
Raw coordinates needed alignment with the touch event logs. We implemented temporal synchronization by matching each touch event to the nearest video frame within a configurable tolerance window:
def align_streams(video_events, touch_events, tolerance_ms=50):
    """Align video-extracted positions with touch event logs."""
    aligned = []
    for touch in touch_events:
        # Find closest video frame within tolerance
        video_match = find_nearest(
            video_events,
            touch['timestamp'],
            tolerance_ms
        )
        if video_match:
            aligned.append({
                'touch': touch,
                'video': video_match,
                'offset_ms': abs(touch['timestamp'] - video_match['timestamp'])
            })
    return aligned
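find_nearest does the actual temporal matching; a minimal sketch, assuming video_events is sorted by timestamp, can use binary search:

from bisect import bisect_left

def find_nearest(video_events, timestamp, tolerance_ms):
    """Return the video event closest to `timestamp`, or None if it is outside the tolerance."""
    if not video_events:
        return None
    timestamps = [e['timestamp'] for e in video_events]
    i = bisect_left(timestamps, timestamp)
    # The nearest neighbour is either the insertion point or the event just before it
    candidates = [video_events[j] for j in (i - 1, i) if 0 <= j < len(video_events)]
    best = min(candidates, key=lambda e: abs(e['timestamp'] - timestamp))
    if abs(best['timestamp'] - timestamp) <= tolerance_ms:
        return best
    return None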
3. Validation Layer
Data quality was crucial for research validity. We implemented Euclidean thresholding to identify discrepancies between video-extracted and logged positions:
import numpy as np

def validate_alignment(aligned_events, threshold_px=15):
    """Validate position alignment using Euclidean distance."""
    results = {
        'valid': [],
        'violations': [],
        'accuracy': 0.0
    }
    if not aligned_events:
        return results  # avoid division by zero on empty sessions
    for event in aligned_events:
        video_pos = np.array(event['video']['position'])
        touch_pos = np.array(event['touch']['position'])
        distance = np.linalg.norm(video_pos - touch_pos)
        if distance <= threshold_px:
            results['valid'].append(event)
        else:
            results['violations'].append({
                **event,
                'distance': distance
            })
    results['accuracy'] = len(results['valid']) / len(aligned_events)
    return results
This approach achieved >95% alignment accuracy, with automated violation reports flagging the remainder for manual review.
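For context, here is a hypothetical per-session driver tying the three layers together; the frame-to-timestamp conversion assumes a constant frame rate, and the camera-to-screen calibration step is omitted:

def process_session(video_path, touch_events):
    """Hypothetical driver: extract, align, and validate one recording session."""
    tracker = FingerTracker()
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    video_events = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        positions = tracker.process_frame(frame)
        if positions:
            video_events.append({
                'timestamp': frame_idx * 1000.0 / fps,   # ms, assuming a constant frame rate
                'position': positions[0],                # calibration to screen pixels omitted here
            })
        frame_idx += 1
    cap.release()
    aligned = align_streams(video_events, touch_events)
    return validate_alignment(aligned)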
Automated Reporting & Visualization
Real-time monitoring was essential for catching data quality issues early. We built dashboards showing:
- Alignment accuracy per session and participant
- Violation heatmaps identifying problematic screen regions
- Temporal drift analysis for synchronization issues (sketched after the report code below)
import matplotlib.pyplot as plt
import seaborn as sns

def generate_validation_report(results, output_path):
    """Generate visual validation report."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    # Accuracy over time
    sns.lineplot(data=results['temporal_accuracy'], ax=axes[0, 0])
    axes[0, 0].set_title('Alignment Accuracy Over Time')
    # Violation heatmap
    sns.heatmap(results['violation_density'], ax=axes[0, 1])
    axes[0, 1].set_title('Violation Density by Screen Region')
    # Distance distribution
    sns.histplot(results['distances'], ax=axes[1, 0])
    axes[1, 0].set_title('Position Error Distribution')
    # Per-participant summary
    sns.barplot(data=results['participant_accuracy'], ax=axes[1, 1])
    axes[1, 1].set_title('Accuracy by Participant')
    plt.tight_layout()
    plt.savefig(output_path)
    plt.close(fig)  # free the figure when generating many reports in a batch
    return output_path
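The temporal drift analysis from the dashboard list is not covered by the report function above; a minimal sketch, assuming each aligned event keeps the offset_ms field produced by align_streams, is a rolling mean of those offsets over time (an increasing trend indicates growing desynchronization between the streams):

import pandas as pd

def temporal_drift(aligned_events, window=100):
    """Rolling mean of absolute video/touch timestamp offsets per touch timestamp."""
    df = pd.DataFrame({
        'timestamp': [e['touch']['timestamp'] for e in aligned_events],
        'offset_ms': [e['offset_ms'] for e in aligned_events],
    }).sort_values('timestamp')
    df['drift_ms'] = df['offset_ms'].rolling(window, min_periods=1).mean()
    return df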
Impact & Results
The pipeline delivered significant improvements to the research workflow:
- 30% reduction in manual processing time
- Reproducible datasets with full provenance tracking
- Queryable outputs enabling dynamic analysis for PhD projects
- Automated quality gates preventing bad data from entering analysis
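The last point deserves a concrete illustration: a quality gate can be as simple as refusing to hand a session to analysis when its validation accuracy misses a cutoff. The names and the 0.95 threshold below are illustrative, not the pipeline's actual values:

class QualityGateError(Exception):
    """Raised when a session fails validation and must not enter analysis."""

def quality_gate(session_id, validation_results, min_accuracy=0.95):
    """Pass a session through only if its alignment accuracy clears the threshold."""
    accuracy = validation_results['accuracy']
    if accuracy < min_accuracy:
        raise QualityGateError(
            f"Session {session_id}: accuracy {accuracy:.2%} below {min_accuracy:.0%}; "
            f"{len(validation_results['violations'])} violations flagged for review"
        )
    return validation_results['valid']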
Lessons Learned
1. Validate Early, Validate Often
Catching data quality issues at ingestion is far cheaper than discovering them during analysis. Build validation into every pipeline stage.
2. Design for Reproducibility
Research pipelines must be reproducible. We versioned not just code, but also model weights (MediaPipe versions), configuration parameters, and random seeds.
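In practice, this meant writing a small provenance record next to every processed dataset. A sketch of what such a record might capture is below; the field names (e.g. git_commit) are illustrative:

import json
import platform
from importlib.metadata import version

import cv2

def write_provenance(output_path, config, seed):
    """Persist what is needed to re-run a processing job deterministically."""
    record = {
        'pipeline_code': config.get('git_commit'),    # e.g. commit hash of the pipeline repo
        'mediapipe_version': version('mediapipe'),
        'opencv_version': cv2.__version__,
        'python_version': platform.python_version(),
        'parameters': config,                          # thresholds, tolerances, etc.
        'random_seed': seed,
    }
    with open(output_path, 'w') as f:
        json.dump(record, f, indent=2)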
3. Optimize for Iteration Speed
Researchers need to experiment quickly. We built caching layers so that expensive operations (video processing) weren't repeated unnecessarily.
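One way to get that, sketched here with an illustrative on-disk cache, is to key cached results on the input file's content hash plus the processing parameters, so changing either invalidates the entry:

import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path('.pipeline_cache')  # illustrative location

def cached_video_features(video_path, params, compute_fn):
    """Reuse extracted features when neither the video nor the parameters changed."""
    CACHE_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256()
    digest.update(Path(video_path).read_bytes())
    digest.update(json.dumps(params, sort_keys=True).encode())
    cache_file = CACHE_DIR / f"{digest.hexdigest()}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    result = compute_fn(video_path, params)
    cache_file.write_bytes(pickle.dumps(result))
    return result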
4. Document Everything
Every transformation, threshold, and design decision was documented. This proved invaluable when writing papers and onboarding new team members.
Future Directions
This experience has informed my approach to data engineering more broadly:
- MLOps integration: Connecting pipelines to model training workflows
- Cloud deployment: Scaling to handle larger studies on AWS/Azure
- Real-time processing: Moving from batch to streaming architectures
Conclusion
Building ETL pipelines for research data requires balancing engineering rigor with scientific flexibility. The techniques described here—automated validation, comprehensive reporting, and reproducible workflows—are applicable across domains from HCI to bioinformatics to climate science.
As data volumes grow and research becomes more interdisciplinary, robust data engineering practices will become increasingly critical for scientific progress.