Scalable ETL Pipelines for Research Data: Lessons from HCI Studies

Published on January 15, 2026

Research data processing presents unique challenges: heterogeneous formats, quality validation requirements, and the need for reproducibility. In this post, I'll share insights from building an ETL pipeline that processed 50,000+ finger-movement events for multi-modal touchscreen studies at Saarland University's Computational Interaction Group.

The Challenge: Multi-Modal Research Data

Human-Computer Interaction (HCI) studies generate diverse data streams: video recordings of participants' hand movements alongside touchscreen event logs, each with its own format and timing.

Our challenge was to align these streams, extract meaningful features, and validate data quality, all while maintaining reproducibility for PhD research projects.

Pipeline Architecture

The solution combined MediaPipe for hand tracking, OpenCV for video processing, and custom validation logic:

1. Extraction Layer

import mediapipe as mp
import cv2

class FingerTracker:
    def __init__(self):
        self.hands = mp.solutions.hands.Hands(
            static_image_mode=False,       # video mode: track hands across frames
            max_num_hands=2,
            min_detection_confidence=0.7
        )

    def process_frame(self, frame):
        """Extract hand landmarks from a video frame."""
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        results = self.hands.process(rgb_frame)

        if results.multi_hand_landmarks:
            return self._extract_finger_positions(results)
        return None

    def _extract_finger_positions(self, results):
        """Return normalized (x, y) index-fingertip coordinates per detected hand."""
        tip = mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP
        return [
            (hand.landmark[tip].x, hand.landmark[tip].y)
            for hand in results.multi_hand_landmarks
        ]

2. Transformation Layer

Raw coordinates needed alignment with touch event logs. We implemented temporal synchronization using timestamp interpolation:

def find_nearest(events, timestamp, tolerance_ms):
    """Return the event closest in time to timestamp, or None if outside tolerance."""
    best = min(events, key=lambda e: abs(e['timestamp'] - timestamp), default=None)
    if best and abs(best['timestamp'] - timestamp) <= tolerance_ms:
        return best
    return None

def align_streams(video_events, touch_events, tolerance_ms=50):
    """Align video-extracted positions with touch event logs."""
    aligned = []
    for touch in touch_events:
        # Find the closest video frame within tolerance
        video_match = find_nearest(
            video_events,
            touch['timestamp'],
            tolerance_ms
        )
        if video_match:
            aligned.append({
                'touch': touch,
                'video': video_match,
                'offset_ms': abs(touch['timestamp'] - video_match['timestamp'])
            })
    return aligned
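Nearest-frame matching is the simplest form of temporal synchronization; the timestamp interpolation it builds on can also estimate the finger position at the exact touch timestamp, between two bracketing frames. Here is a hypothetical sketch of that idea (`interpolate_position` is illustrative, not part of the original pipeline; events use the same `timestamp`/`position` fields as above):

```python
def interpolate_position(video_events, t):
    """Linearly interpolate a 2D position at time t from time-sorted video events."""
    for prev, nxt in zip(video_events, video_events[1:]):
        if prev['timestamp'] <= t <= nxt['timestamp']:
            span = nxt['timestamp'] - prev['timestamp']
            # Degenerate span: two frames share a timestamp
            w = 0.0 if span == 0 else (t - prev['timestamp']) / span
            return tuple(
                p + w * (n - p)
                for p, n in zip(prev['position'], nxt['position'])
            )
    return None  # t lies outside the recorded range
```

This trades a little compute for sub-frame temporal resolution, which matters when the touch log runs at a much higher rate than the camera.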

3. Validation Layer

Data quality was crucial for research validity. We implemented Euclidean thresholding to identify discrepancies between video-extracted and logged positions:

import numpy as np

def validate_alignment(aligned_events, threshold_px=15):
    """Validate position alignment using Euclidean distance."""
    results = {
        'valid': [],
        'violations': [],
        'accuracy': 0.0
    }
    
    for event in aligned_events:
        video_pos = np.array(event['video']['position'])
        touch_pos = np.array(event['touch']['position'])
        
        distance = np.linalg.norm(video_pos - touch_pos)
        
        if distance <= threshold_px:
            results['valid'].append(event)
        else:
            results['violations'].append({
                **event,
                'distance': distance
            })
    
    results['accuracy'] = len(results['valid']) / len(aligned_events) if aligned_events else 0.0
    return results

This approach achieved >95% accuracy for event classification, with automated violation reports for manual review.

Automated Reporting & Visualization

Real-time monitoring was essential for catching data quality issues early. We built dashboards showing alignment accuracy over time, violation density by screen region, the distribution of position errors, and per-participant accuracy:

import matplotlib.pyplot as plt
import seaborn as sns

def generate_validation_report(results, output_path):
    """Generate visual validation report."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # Accuracy over time
    sns.lineplot(data=results['temporal_accuracy'], ax=axes[0, 0])
    axes[0, 0].set_title('Alignment Accuracy Over Time')
    
    # Violation heatmap
    sns.heatmap(results['violation_density'], ax=axes[0, 1])
    axes[0, 1].set_title('Violation Density by Screen Region')
    
    # Distance distribution
    sns.histplot(results['distances'], ax=axes[1, 0])
    axes[1, 0].set_title('Position Error Distribution')
    
    # Per-participant summary
    sns.barplot(data=results['participant_accuracy'], ax=axes[1, 1])
    axes[1, 1].set_title('Accuracy by Participant')
    
    plt.tight_layout()
    fig.savefig(output_path)
    plt.close(fig)  # release the figure when batch-generating reports
    return output_path

Impact & Results

The pipeline delivered significant improvements to the research workflow: automated alignment and validation of 50,000+ finger-movement events, >95% accuracy for event classification, and violation reports that focused manual review on the cases that needed it.

Lessons Learned

1. Validate Early, Validate Often

Catching data quality issues at ingestion is far cheaper than discovering them during analysis. Build validation into every pipeline stage.
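As a minimal sketch of what ingestion-time validation can look like, assuming touch events arrive as dicts with the `timestamp` and `position` fields used throughout this post (the field names and screen bounds here are illustrative):

```python
def check_touch_event(event, screen_w=1920, screen_h=1080):
    """Return a list of human-readable problems; an empty list means the event passes."""
    problems = []
    # Reject events whose timestamp is missing or non-numeric
    if not isinstance(event.get('timestamp'), (int, float)):
        problems.append('missing or non-numeric timestamp')
    pos = event.get('position')
    # Positions must be (x, y) pairs inside the screen
    if not isinstance(pos, (tuple, list)) or len(pos) != 2:
        problems.append('position must be an (x, y) pair')
    elif not (0 <= pos[0] <= screen_w and 0 <= pos[1] <= screen_h):
        problems.append('position outside screen bounds')
    return problems
```

Running a check like this on every record at ingestion means a corrupted log file fails loudly on day one, not during statistical analysis months later.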

2. Design for Reproducibility

Research pipelines must be reproducible. We versioned not just code, but also model weights (MediaPipe versions), configuration parameters, and random seeds.
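One way to make that versioning concrete is a per-run manifest: hash the configuration, record the seed and environment, and write it next to the outputs. This is a sketch under those assumptions, not our exact implementation; the keys and default path are illustrative:

```python
import hashlib
import json
import random
import sys

def write_run_manifest(config, seed, path='run_manifest.json'):
    """Persist the config hash, random seed, and environment for later reproduction."""
    random.seed(seed)  # fix the seed before any stochastic step runs
    # Canonical JSON (sorted keys) so identical configs always hash the same
    blob = json.dumps(config, sort_keys=True).encode()
    manifest = {
        'config': config,
        'config_sha256': hashlib.sha256(blob).hexdigest(),
        'seed': seed,
        'python': sys.version.split()[0],
    }
    with open(path, 'w') as f:
        json.dump(manifest, f, indent=2)
    return manifest
```

Two runs with the same manifest hash are then directly comparable, and a reviewer can re-run any figure from its recorded config.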

3. Optimize for Iteration Speed

Researchers need to experiment quickly. We built caching layers so that expensive operations (video processing) weren't repeated unnecessarily.
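The caching idea can be sketched as keying expensive results by a hash of the input file, so reruns over unchanged footage hit the cache instead of re-processing. `cached_process` and the JSON-on-disk layout are illustrative assumptions, not our production code:

```python
import hashlib
import json
import os

def cached_process(video_path, process, cache_dir='cache'):
    """Run process(video_path) once per unique file content, then reuse the result."""
    os.makedirs(cache_dir, exist_ok=True)
    # Key on file content, not filename: renamed or re-copied files still hit
    with open(video_path, 'rb') as f:
        key = hashlib.sha256(f.read()).hexdigest()
    cache_file = os.path.join(cache_dir, key + '.json')
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return json.load(f)
    result = process(video_path)  # the expensive step, e.g. frame-by-frame tracking
    with open(cache_file, 'w') as f:
        json.dump(result, f)
    return result
```

Content hashing of large videos has its own cost, so chunked hashing or mtime-based keys are reasonable refinements once the basic cache pays off.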

4. Document Everything

Every transformation, threshold, and design decision was documented. This proved invaluable when writing papers and onboarding new team members.

Future Directions

This experience has informed my approach to data engineering more broadly.

Conclusion

Building ETL pipelines for research data requires balancing engineering rigor with scientific flexibility. The techniques described here—automated validation, comprehensive reporting, and reproducible workflows—are applicable across domains from HCI to bioinformatics to climate science.

As data volumes grow and research becomes more interdisciplinary, robust data engineering practices will become increasingly critical for scientific progress.