About Me

I serve as the tech lead for the PyTorch Distributed team and work on all major features in the distributed package, including FullyShardedDataParallel, DistributedDataParallel, collective communication, and RPC-based distributed training. I am broadly interested in ML systems and large-scale ML applications.

Experience

  • 2018/11 - Present Senior Staff Research Scientist, Meta AI
  • 2016/06 - 2018/11 Research Staff Member, IBM Research T. J. Watson Lab
  • 2010/10 - 2016/05 Research Assistant, University of Illinois Urbana-Champaign
  • 2015/05 - 2015/08 Research Intern, IBM Research T. J. Watson Lab
  • 2014/05 - 2014/08 Research Intern, IBM Research T. J. Watson Lab
  • 2013/06 - 2013/08 Software Engineer Intern, Facebook Inc.
  • 2012/05 - 2012/08 Software Engineer Intern, Yahoo! Inc.

Projects

  • Large Language Model Training

    This project optimized training efficiency with PyTorch native features for GPT models ranging from 162M to 1T parameters. We summarized guidelines for using cloud resources and PyTorch features, including DistributedDataParallel, Pipeline Parallel, FullyShardedDataParallel, and Activation Checkpointing/Offloading.

    • PyTorch Data Parallel Best Practices on Google Cloud [Joint Post and Talk with GCP]
    • Training and Profiling Large Scale Models with PyTorch [GTC’22 Talk]
    • Training a 1 Trillion Parameter Model With PyTorch Fully Sharded Data Parallel on AWS [Joint Post and Talk with AWS]
  • PyTorch FullyShardedDataParallel

    This is the first native feature in PyTorch that can support models with up to trillions of parameters. FullyShardedDataParallel (FSDP) decomposes the model into smaller units. For every unit, FSDP shards Tensor storage across data-parallel processes to reduce the memory footprint, all-gathers the full parameters before computation, and discards the gathered parameter shards afterward. FSDP offers multiple performance optimizations out of the box, including computation and communication overlap, parameter prefetching, parameter CPU offloading, and mixed-precision support in both computation and communication. [VLDB’23][Post][Doc][Tutorial]
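
    The shard, all-gather, compute, discard cycle described above can be sketched as a toy single-process simulation. This is not the PyTorch FSDP API: the function names (`shard`, `all_gather`, `unit_forward`) are hypothetical, plain Python lists stand in for tensors, and a dot product stands in for a unit's computation.

```python
from typing import List


def shard(flat_param: List[float], world_size: int) -> List[List[float]]:
    """Split a flat parameter into equal-sized shards, zero-padding the tail."""
    shard_size = -(-len(flat_param) // world_size)  # ceiling division
    pad = shard_size * world_size - len(flat_param)
    padded = flat_param + [0.0] * pad
    return [padded[r * shard_size:(r + 1) * shard_size] for r in range(world_size)]


def all_gather(shards: List[List[float]], numel: int) -> List[float]:
    """Rebuild the full parameter from every rank's shard and drop the padding."""
    full = [value for rank_shard in shards for value in rank_shard]
    return full[:numel]


def unit_forward(shards: List[List[float]], numel: int, inputs: List[float]) -> float:
    """Gather the full parameter, use it once, then release the gathered copy."""
    full = all_gather(shards, numel)                  # materialize before computation
    out = sum(w * x for w, x in zip(full, inputs))    # stand-in for the unit's compute
    del full                                          # only the local shards stay resident
    return out


# Hypothetical example: a 5-element parameter sharded across 2 ranks.
shards = shard([1.0, 2.0, 3.0, 4.0, 5.0], world_size=2)
result = unit_forward(shards, numel=5, inputs=[1.0] * 5)
```

    Between units, each rank holds only its own shard, which is where the memory saving in this sketch comes from; the full parameter exists only for the duration of `unit_forward`.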

  • Pipelined Data Parallel

    We developed a pipelined data parallel training paradigm called PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. PipeTransformer can adaptively freeze model layers, shrink pipelines based on the number of active layers to free resources, and dynamically allocate those resources to increase data parallel width. [ICML’21][Post]

  • PyTorch RPC

    PyTorch RPC provides a flexible and high-performance set of low-level APIs for distributed deep learning. PyTorch RPC natively provides essential features for implementing training applications in a distributed environment, including optimized tensor communications, remote memory management, and distributed autograd. It allows users to easily implement different training paradigms (e.g., parameter servers and pipeline parallelism) on top. [GTC’21][DevDay’20][Doc][Tutorial]

  • PyTorch DistributedDataParallel

    DistributedDataParallel (DDP) helps scale model training to large datasets and large clusters. It can be enabled on top of local training scripts with a one-line code change and automatically overlaps backward computation with gradient communication. This feature is widely adopted both internally at Meta and externally across the industry. [VLDB’18][Talk1][Talk2][Doc][Tutorial]

  • IBM Streams Beam Runner

    IBM Streams is a high-throughput, low-latency analytics platform for streaming data applications. I worked on multiple research issues in IBM Streams, including transform graph translation and optimization, out-of-order event arrival processing, and large window aggregation. I led a small team of four people to adopt the Apache Beam model in IBM Streams, which involved filling in the model gaps between Beam and Streams, indexing and garbage-collecting Beam transform states, and managing operator parallelism. Our work produced both research papers and a product (the IBM Streams Beam runner toolkit) deployed in IBM Cloud, which won the 2018 IBM Outstanding Technical Achievement Award. [VLDB’18]

  • Pyro: A Spatial-Temporal Big-Data Storage System

    In this project, we designed a spatial-temporal big-data storage system tailored for high-resolution geometry queries and dynamic workload hotspots. With the rapid growth of mobile devices and applications, geo-tagged data has become a significant workload for big data storage systems. In dealing with spatial-temporal big-data workloads, existing systems either fall short in scalability or fail to deliver high efficiency. This project attacks this problem by optimizing the HBase/HDFS stack for spatial-temporal data. [ATC’15]