Scalable ETL Pipelines for Research Data: Lessons from HCI Studies
Published on January 15, 2026
Research data processing presents unique challenges: heterogeneous formats, quality validation requirements, and the need for reproducibility. In this post, I'll share insights from building an ETL pipeline that processed 50,000+ finger-movement events for multi-modal touchscreen studies at Saarland University's Computational Interaction Group.
The Challenge: Multi-Modal Research Data
Human-Computer Interaction (HCI) studies generate diverse data streams:
- Video frames: RGB streams from multiple camera angles
- Touch events: Timestamped finger position logs
- Sensor data: Pressure, velocity, and acceleration measurements
- Annotations: Ground truth labels from human reviewers
Our challenge was to align these streams, extract meaningful features, and validate data quality—all while maintaining reproducibility for PhD research projects.
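For concreteness, the code below works with event records shaped roughly like this; the field names and values are illustrative rather than the study's actual schema:

touch_event = {
    'timestamp': 1523.4,       # ms since session start, from the device log
    'position': (512, 384),    # (x, y) in screen pixels
    'pressure': 0.63,          # normalized sensor reading
}

video_event = {
    'timestamp': 1520.1,       # ms since session start, derived from the frame index
    'position': (509, 381),    # fingertip position mapped to screen pixels
}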
Pipeline Architecture
The solution combined MediaPipe for hand tracking, OpenCV for video processing, and custom validation logic:
1. Extraction Layer
import mediapipe as mp
import cv2

class FingerTracker:
    def __init__(self):
        self.hands = mp.solutions.hands.Hands(
            static_image_mode=False,       # video mode: track hands across frames
            max_num_hands=2,
            min_detection_confidence=0.7
        )

    def process_frame(self, frame):
        """Extract hand landmarks from a video frame."""
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # MediaPipe expects RGB input
        results = self.hands.process(rgb_frame)
        if results.multi_hand_landmarks:
            return self._extract_finger_positions(results)
        return None
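The _extract_finger_positions helper is called above but not shown; a minimal sketch, assuming we track only the index fingertip of each detected hand and return its normalized image coordinates, might continue the class body like this:

    def _extract_finger_positions(self, results):
        """Return one (x, y) index-fingertip position per detected hand."""
        tip_id = mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP
        positions = []
        for hand_landmarks in results.multi_hand_landmarks:
            tip = hand_landmarks.landmark[tip_id]
            positions.append((tip.x, tip.y))  # normalized [0, 1]; mapping to screen pixels happens downstream
        return positions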
2. Transformation Layer
Raw coordinates needed alignment with the touch event logs. We implemented temporal synchronization by matching each touch event to the nearest video frame within a configurable tolerance window:
def align_streams(video_events, touch_events, tolerance_ms=50):
    """Align video-extracted positions with touch event logs."""
    aligned = []
    for touch in touch_events:
        # Find closest video frame within tolerance
        video_match = find_nearest(
            video_events,
            touch['timestamp'],
            tolerance_ms
        )
        if video_match:
            aligned.append({
                'touch': touch,
                'video': video_match,
                'offset_ms': abs(touch['timestamp'] - video_match['timestamp'])
            })
    return aligned
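find_nearest does the actual temporal matching; a minimal sketch, assuming video_events is sorted by timestamp, can use binary search:

from bisect import bisect_left

def find_nearest(video_events, timestamp, tolerance_ms):
    """Return the video event closest to `timestamp`, or None if it is outside the tolerance."""
    if not video_events:
        return None
    timestamps = [e['timestamp'] for e in video_events]
    i = bisect_left(timestamps, timestamp)
    # The nearest neighbour is either the insertion point or the event just before it
    candidates = [video_events[j] for j in (i - 1, i) if 0 <= j < len(video_events)]
    best = min(candidates, key=lambda e: abs(e['timestamp'] - timestamp))
    if abs(best['timestamp'] - timestamp) <= tolerance_ms:
        return best
    return None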
3. Validation Layer
Data quality was crucial for research validity. We implemented Euclidean thresholding to identify discrepancies between video-extracted and logged positions:
import numpy as np

def validate_alignment(aligned_events, threshold_px=15):
    """Validate position alignment using Euclidean distance."""
    results = {
        'valid': [],
        'violations': [],
        'accuracy': 0.0
    }
    if not aligned_events:
        return results  # avoid division by zero on empty sessions
    for event in aligned_events:
        video_pos = np.array(event['video']['position'])
        touch_pos = np.array(event['touch']['position'])
        distance = np.linalg.norm(video_pos - touch_pos)
        if distance <= threshold_px:
            results['valid'].append(event)
        else:
            results['violations'].append({
                **event,
                'distance': distance
            })
    results['accuracy'] = len(results['valid']) / len(aligned_events)
    return results
This approach achieved >95% alignment accuracy, with automated violation reports flagging the remainder for manual review.
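For context, here is a hypothetical per-session driver tying the three layers together; the frame-to-timestamp conversion assumes a constant frame rate, and the camera-to-screen calibration step is omitted:

def process_session(video_path, touch_events):
    """Hypothetical driver: extract, align, and validate one recording session."""
    tracker = FingerTracker()
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    video_events = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        positions = tracker.process_frame(frame)
        if positions:
            video_events.append({
                'timestamp': frame_idx * 1000.0 / fps,   # ms, assuming a constant frame rate
                'position': positions[0],                # calibration to screen pixels omitted here
            })
        frame_idx += 1
    cap.release()
    aligned = align_streams(video_events, touch_events)
    return validate_alignment(aligned)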
Automated Reporting & Visualization
Real-time monitoring was essential for catching data quality issues early. We built dashboards showing:
- Alignment accuracy per session and participant
- Violation heatmaps identifying problematic screen regions
- Temporal drift analysis for synchronization issues (sketched after the report code below)
import matplotlib.pyplot as plt
import seaborn as sns

def generate_validation_report(results, output_path):
    """Generate visual validation report."""
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    # Accuracy over time
    sns.lineplot(data=results['temporal_accuracy'], ax=axes[0, 0])
    axes[0, 0].set_title('Alignment Accuracy Over Time')
    # Violation heatmap
    sns.heatmap(results['violation_density'], ax=axes[0, 1])
    axes[0, 1].set_title('Violation Density by Screen Region')
    # Distance distribution
    sns.histplot(results['distances'], ax=axes[1, 0])
    axes[1, 0].set_title('Position Error Distribution')
    # Per-participant summary
    sns.barplot(data=results['participant_accuracy'], ax=axes[1, 1])
    axes[1, 1].set_title('Accuracy by Participant')
    plt.tight_layout()
    plt.savefig(output_path)
    plt.close(fig)  # free the figure when generating many reports in a batch
    return output_path
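The temporal drift analysis from the dashboard list is not covered by the report function above; a minimal sketch, assuming each aligned event keeps the offset_ms field produced by align_streams, is a rolling mean of those offsets over time (an increasing trend indicates growing desynchronization between the streams):

import pandas as pd

def temporal_drift(aligned_events, window=100):
    """Rolling mean of absolute video/touch timestamp offsets per touch timestamp."""
    df = pd.DataFrame({
        'timestamp': [e['touch']['timestamp'] for e in aligned_events],
        'offset_ms': [e['offset_ms'] for e in aligned_events],
    }).sort_values('timestamp')
    df['drift_ms'] = df['offset_ms'].rolling(window, min_periods=1).mean()
    return df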
Impact & Results
The pipeline delivered significant improvements to the research workflow:
- 30% reduction in manual processing time
- Reproducible datasets with full provenance tracking
- Queryable outputs enabling dynamic analysis for PhD projects
- Automated quality gates preventing bad data from entering analysis
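The last point deserves a concrete illustration: a quality gate can be as simple as refusing to hand a session to analysis when its validation accuracy misses a cutoff. The names and the 0.95 threshold below are illustrative, not the pipeline's actual values:

class QualityGateError(Exception):
    """Raised when a session fails validation and must not enter analysis."""

def quality_gate(session_id, validation_results, min_accuracy=0.95):
    """Pass a session through only if its alignment accuracy clears the threshold."""
    accuracy = validation_results['accuracy']
    if accuracy < min_accuracy:
        raise QualityGateError(
            f"Session {session_id}: accuracy {accuracy:.2%} below {min_accuracy:.0%}; "
            f"{len(validation_results['violations'])} violations flagged for review"
        )
    return validation_results['valid']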
Lessons Learned
1. Validate Early, Validate Often
Catching data quality issues at ingestion is far cheaper than discovering them during analysis. Build validation into every pipeline stage.
2. Design for Reproducibility
Research pipelines must be reproducible. We versioned not just code, but also model weights (MediaPipe versions), configuration parameters, and random seeds.
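In practice, this meant writing a small provenance record next to every processed dataset. A sketch of what such a record might capture is below; the field names (e.g. git_commit) are illustrative:

import json
import platform
from importlib.metadata import version

import cv2

def write_provenance(output_path, config, seed):
    """Persist what is needed to re-run a processing job deterministically."""
    record = {
        'pipeline_code': config.get('git_commit'),    # e.g. commit hash of the pipeline repo
        'mediapipe_version': version('mediapipe'),
        'opencv_version': cv2.__version__,
        'python_version': platform.python_version(),
        'parameters': config,                          # thresholds, tolerances, etc.
        'random_seed': seed,
    }
    with open(output_path, 'w') as f:
        json.dump(record, f, indent=2)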
3. Optimize for Iteration Speed
Researchers need to experiment quickly. We built caching layers so that expensive operations (video processing) weren't repeated unnecessarily.
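One way to get that, sketched here with an illustrative on-disk cache, is to key cached results on the input file's content hash plus the processing parameters, so changing either invalidates the entry:

import hashlib
import json
import pickle
from pathlib import Path

CACHE_DIR = Path('.pipeline_cache')  # illustrative location

def cached_video_features(video_path, params, compute_fn):
    """Reuse extracted features when neither the video nor the parameters changed."""
    CACHE_DIR.mkdir(exist_ok=True)
    digest = hashlib.sha256()
    digest.update(Path(video_path).read_bytes())
    digest.update(json.dumps(params, sort_keys=True).encode())
    cache_file = CACHE_DIR / f"{digest.hexdigest()}.pkl"
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())
    result = compute_fn(video_path, params)
    cache_file.write_bytes(pickle.dumps(result))
    return result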
4. Document Everything
Every transformation, threshold, and design decision was documented. This proved invaluable when writing papers and onboarding new team members.
Future Directions
This experience has informed my approach to data engineering more broadly:
- MLOps integration: Connecting pipelines to model training workflows
- Cloud deployment: Scaling to handle larger studies on AWS/Azure
- Real-time processing: Moving from batch to streaming architectures
Conclusion
Building ETL pipelines for research data requires balancing engineering rigor with scientific flexibility. The techniques described here—automated validation, comprehensive reporting, and reproducible workflows—are applicable across domains from HCI to bioinformatics to climate science.
As data volumes grow and research becomes more interdisciplinary, robust data engineering practices will become increasingly critical for scientific progress.