Machine Learning Research

Distributed ML Performance Study

Research project comparing how machine learning models perform under federated (distributed) training versus traditional centralized training, implemented in both PyTorch and TensorFlow.

PyTorch · TensorFlow · Python · Federated Learning

Timeline

Aug 2022 - Apr 2023

My Role

Co-Researcher & Developer

Team Size

2 researchers

Research Question

The Problem

Traditional machine learning requires centralizing all training data, which creates privacy concerns and bandwidth limitations. Federated learning trains models across distributed devices without sharing raw data.
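The core idea is federated averaging: each client trains on its own data, and only model weights travel to the server. A minimal sketch of one FedAvg round, using NumPy and a toy least-squares update (the function names and the single-step local update are illustrative, not the project's actual code):

```python
import numpy as np

def local_update(weights, data, lr=0.1):
    """One hypothetical local SGD step on a client's private data.
    Here: a least-squares gradient on (X, y); real clients run many steps."""
    X, y = data
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def fedavg_round(global_weights, client_data):
    """One FedAvg round: clients train locally, then the server averages
    the returned models weighted by each client's dataset size."""
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    local_models = [local_update(global_weights.copy(), d) for d in client_data]
    fracs = sizes / sizes.sum()
    return sum(f * m for f, m in zip(fracs, local_models))

# Toy run: two clients whose data follows the same target y = 2x.
# Raw (X, y) never leaves a client; only weights reach the server.
rng = np.random.default_rng(0)
clients = []
for n in (100, 50):
    X = rng.normal(size=(n, 1))
    clients.append((X, X @ np.array([2.0])))

w = np.zeros(1)
for _ in range(200):
    w = fedavg_round(w, clients)
# w converges toward the true coefficient [2.0]
```

The size-weighted average is what keeps the global model unbiased when clients hold different amounts of data.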

What We Tested

Built experiments to compare model accuracy, training time, and convergence rates between federated learning and traditional centralized training across different dataset types and sizes.

Research code implementation

Multiple Datasets

Tested on image classification, text processing, and numerical data to understand performance across different domains.

Framework Comparison

Implemented experiments in both PyTorch and TensorFlow to validate results across different ML frameworks.

Performance Metrics

Measured accuracy, training time, memory usage, and communication overhead for comprehensive analysis.
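Metrics like these can be captured with the standard library alone. A hedged sketch of a per-round measurement wrapper (helper names are ours, not the project's; serialized model size stands in for communication overhead):

```python
import pickle
import time
import tracemalloc

def measure_round(train_fn, model_state):
    """Run one training round and record wall-clock time, peak memory,
    and the bytes that would cross the network if model_state were sent."""
    tracemalloc.start()
    t0 = time.perf_counter()
    new_state = train_fn(model_state)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    comm_bytes = len(pickle.dumps(new_state))  # proxy for upload size
    return new_state, {"time_s": elapsed, "peak_mem_b": peak, "comm_b": comm_bytes}

# Example: a dummy "round" that just shrinks every weight a little.
state, stats = measure_round(lambda s: [x * 0.9 for x in s], [1.0] * 1000)
```

Logging the same dictionary for both the federated and centralized runs makes the two pipelines directly comparable.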

Key Findings

Accuracy Trade-offs

Federated learning achieved 92-96% of centralized model accuracy while maintaining data privacy. Performance gap varied by dataset complexity.

Communication Costs

Network bandwidth became the bottleneck in federated training. Gradient compression techniques helped but didn't eliminate the overhead.
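A common compression scheme of this kind is top-k sparsification: transmit only the k largest-magnitude gradient entries together with their indices. A generic sketch (not necessarily the exact technique used in the study):

```python
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries; transmit (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx, vals, size):
    """Rebuild a dense gradient with zeros everywhere else."""
    out = np.zeros(size)
    out[idx] = vals
    return out

rng = np.random.default_rng(1)
g = rng.normal(size=10_000)
idx, vals = topk_compress(g, k=100)          # send ~1% of the entries
approx = topk_decompress(idx, vals, g.size)
# Bandwidth drops ~100x, but the small entries are lost; error feedback
# (accumulating the residual g - approx locally) is the usual remedy.
```

This illustrates why compression reduces but does not eliminate the overhead: each round still ships the indices and values, and the discarded residual slows convergence.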

Convergence Patterns

Federated models required 2-3x more training rounds to converge, but individual rounds were faster due to parallel processing.

What I Learned

Research Design

Learned to design controlled experiments, manage multiple variables, and draw meaningful conclusions from data.

Framework Differences

Gained hands-on experience with both PyTorch and TensorFlow and learned where each framework's strengths lie for different use cases.

Distributed Systems

Learned the practical challenges of coordinating training across multiple nodes and handling network failures.