[P] ML training cluster for university students

Hi! I’m an exec at a university AI research club. We are trying to build a GPU cluster for our student body so they can have reliable access to compute, but we aren’t sure where to start.

Our goal is a cluster that can be upgraded later, i.e. expanded with more GPUs. We also want something that is cost-effective and easy to set up. The cluster will be used for training ML models. For example, an M4 Ultra Mac Studio cluster with an RDMA interconnect is interesting to us because each node is already a complete computer, so we wouldn’t have to build everything ourselves. However, it is quite expensive, and we are not sure whether PyTorch supports that RDMA interconnect. Even if it does, it would still be slower than NVLink.

There are also a lot of older GPUs being sold in our area, but we are not sure whether they will be fast enough or PyTorch-compatible. Would you recommend going with older cards? We think we can get sponsorship of roughly 15-30k CAD if we have a decent plan. In that case, what sort of setup would you recommend? Also, why are 5070s cheaper than 3090s on Marketplace? And would you recommend a 4x Mac Studio (Ultra/Max) setup like the one in this video, https://www.youtube.com/watch?v=A0onppIyHEg&t=260s, or a single H100 setup?
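
For the compatibility question, one thing we could do before buying a used card is run a quick check on a machine that has it installed. This is only a rough sketch: it assumes PyTorch is installed on the test machine, and `describe_devices` is a name made up for illustration. It reports what PyTorch can see and does not tell us whether the card is fast enough.

```python
def describe_devices():
    """Return one human-readable line per accelerator PyTorch can see."""
    try:
        import torch
    except ImportError:
        return ["PyTorch is not installed on this machine"]

    lines = []

    # NVIDIA cards: report name, compute capability, and VRAM.
    if torch.cuda.is_available():
        # Architectures this PyTorch build was compiled for, e.g. "sm_86".
        built_for = ", ".join(torch.cuda.get_arch_list())
        for i in range(torch.cuda.device_count()):
            name = torch.cuda.get_device_name(i)
            major, minor = torch.cuda.get_device_capability(i)
            vram_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
            lines.append(
                f"cuda:{i} {name}, compute capability {major}.{minor}, "
                f"{vram_gb:.1f} GB (build targets: {built_for})"
            )

    # Apple Silicon: PyTorch exposes the GPU through the MPS backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        lines.append("mps: Apple GPU available through the MPS backend")

    if not lines:
        lines.append("No CUDA or MPS device visible to PyTorch")
    return lines


if __name__ == "__main__":
    for line in describe_devices():
        print(line)
```

If the card's compute capability is far from anything in the build targets list, that would be a hint to check the PyTorch release notes before buying.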

Also, ideally, instead of running their jobs in the cloud, students would bring their projects and run them locally on the cluster.

submitted by /u/guywiththemonocle