Do data scientists really like Git?

Feb 28, 2024

I have a theory: data scientists do not like Git. I do not think they adopted Git because they needed to version-control their notebooks. They adopted it because, when they approached software developers and DevOps engineers to collaborate, that is what they were forced to use.

There are no bad intentions here. Software developers really have no platform other than Git on which to collaborate, so it was only natural that they recommended Git to data scientists too. Git is a great tool, optimized to handle a large number of small, text-based files.

AI/ML projects, on the other hand, are not just about coding; they involve training an AI model with data. The ML code facilitates the training, and the training requires large datasets. The datasets used for training AI models can be huge and unstructured (images, videos, audio).

There have been attempts to remedy the deficiencies of Git, with varying degrees of success. However, I do not think the answer is to mend a versioning system that is optimized for codebases. I also worry about AI/ML development moving toward specialized proprietary platforms instead of well-known open platforms.

The Software Maker
