Introduction to Distributed Model Training

by admin · July 17, 2022

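This week we continue with the fifth lecture in the ML Platform series. Many thanks to 锅锅 for taking time out of a busy schedule to give such an excellent talk. A summary of the related material follows: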

Slides

- The slides 锅锅 presented.
- The reference links 锅锅 shared.
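Summaries of the ML Platform lecture series:
- Lecture 1: Overview of the ML Infra framework
- Lecture 2: A deep dive into ML OPS
- Lecture 3: Introduction to Model Serving
- Lecture 4: Introduction to resource scheduling and Federated Learning
- Lecture 5: Introduction to Distributed Model Training
- Lecture 6: Introduction to Feature Store and Parameter Server
- Lecture 7: Introduction to KServe and Triton (real-time inference)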

Q&A

Thanks to Nancy for sharing her notes for reference.

How models are merged in asynchronous training

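[Further reading] Parallel, asynchronous, and distributed training in reinforcement learning
[Further reading] Implementing asynchronous distributed training for reinforcement learning
[Further reading] An introduction to PyTorch distributed modes
[Further reading] Data parallelism: the parameter server (PS)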

Further reading on distributed training

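[Further reading] Everything you need to know about TensorFlow distributed training in one article
[Further reading] [Source code analysis] The Horovod distributed deep learning training framework (1): Fundamentals
[Further reading] Distributed machine learning
[Further reading] Distributed GPU training
[Further reading] Distributed training architecture: Horovod
[Further reading] A concise tutorial on PyTorch distributed training
[Further reading] PyTorch distributed training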

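Thanks again to everyone for taking part. If you have good resources to share, please contact me so I can update this post, or leave a comment below.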

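For next week's topic and a recap of past topics, see "System Design Pioneer Group: Introduction to the Topic Discussions" (《系统设计开荒小分队话题讨论简介》).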

