I've been thinking about how our research team can maximise the effectiveness of their machine learning (ML) experiments. I'm thinking of it as maximising the "throughput", but not in the usual computer-science sense (parallel processing, SIMD, GPUs etc.) -- instead, I'm paying attention to human factors and the full loop of research and ideas, trying to maximise the number of useful ML experiment results we have in our hands per month.
Note that the word "useful" in that sentence is not trivial! I don't simply want to run the most ML jobs possible. We're not really looking for results -- we're looking for insights!
We had a discussion of these issues in my group this week. Here are some topics raised by people in the group:
- Hardware availability (e.g. GPU machines) -- Even though we have some good modern GPU servers, cloud compute, and HPC farms, multiple team members still raised that a big limitation for them was the availability of hardware to run many experiments quickly. I was honestly surprised by this. So, why might this still be the case?
- Option 1: Researchers' appetite is unbounded?? Possibly!
- Option 2: Our use of GPU machines is not very efficient, so we're not squeezing many experiments into our available resource? Possibly! But, importantly, we're in a research environment here, not a deployment environment, so there always has to be room for exploratory, badly-coded experiments; we don't want to spend our time on premature optimisation. It's good to avoid being wasteful, but not to waste all your time on optimisation.
- Option 3: There is existing resource, but it's hard to access? -- Yes, I think this is one issue. In my previous position and in my current one, the university provided an HPC system for everyone to log in and run many parallel jobs. In both places, students didn't use it, even though we tried to get them to. Why not? Probably it's the barrier to entry: dispatching jobs into an HPC system is hard to understand at first, and you also lose some benefits such as direct interaction with your running process (e.g. to visualise results, or to debug it interactively). --- How can we improve on this in future? We will have diverse resources - local, HPC, and cloud - and we want to lower the barriers to using all of them (one small step is sketched below).
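As a concrete example of lowering that barrier, here is a minimal sketch of a dispatch helper, assuming a SLURM-based scheduler (as on many university HPC systems). The function name and default options are hypothetical and would need adapting to our actual clusters:

```python
# Minimal sketch: wrap a shell command in a SLURM batch script and submit it.
# Assumes a SLURM scheduler ("sbatch" on the PATH); names and defaults are hypothetical.
import subprocess
import tempfile

def submit_job(command, job_name="ml-exp", gpus=1, hours=4):
    """Write a throwaway batch script for `command` and hand it to sbatch."""
    script = "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={hours}:00:00",
        f"#SBATCH --output={job_name}-%j.log",  # %j is filled in with the job ID
        command,
    ])
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        script_path = f.name
    subprocess.run(["sbatch", script_path], check=True)

# e.g. submit_job("python train.py --config configs/baseline.yaml")
```

Even a tiny wrapper like this, kept somewhere shared with sensible defaults for our clusters, removes the "which flags do I need?" hurdle that puts new users off.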
- Dataset formats -- Every dataset typically has its own idiosyncratic formatting, column names, etc. It's possible to advocate for standards (recently I've been working with Audiovisual Core, which standardises column names, and I like their philosophy of low-key harmonisation) -- but realistically, I can't expect the majority of 3rd-party datasets EVER to follow my formatting wishes. As one person commented, "Writing the DataLoader can sometimes be the part that takes the most work!"
- One strategy to work around dataset formats is to structure your code with some "glue" files that do the following two things (a minimal sketch follows this list):
(1) Express basic information in a fixed format such as a simple JSON (e.g. the list of files);
(2) Apply custom data loading operations for that dataset in a python script (e.g. to load a particular file and its annotations).
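To make the "glue" idea concrete, here is a minimal sketch; the file names and JSON fields (dataset_manifest.json, the audio/labels paths, sample_rate) are hypothetical placeholders rather than a format we've agreed on:

```python
# Glue file (2): dataset-specific loading, e.g. "load_mydataset.py" (hypothetical name).
# It reads glue file (1): a fixed-format manifest such as "dataset_manifest.json", e.g.
#   {"name": "mydataset", "sample_rate": 22050,
#    "files": [{"audio": "clips/0001.wav", "labels": "annot/0001.csv"}]}
import json
from pathlib import Path

def load_manifest(manifest_path):
    """Fixed-format part: basic info plus the list of files, same keys for every dataset."""
    with open(manifest_path) as f:
        return json.load(f)

def load_item(dataset_root, item):
    """Custom part: load one file and its annotations, however this dataset needs it."""
    audio_path = Path(dataset_root) / item["audio"]
    label_path = Path(dataset_root) / item["labels"]
    # e.g. audio = soundfile.read(audio_path); labels = pandas.read_csv(label_path)
    return audio_path, label_path  # placeholder until the real loaders are filled in

if __name__ == "__main__":
    manifest = load_manifest("dataset_manifest.json")
    for item in manifest["files"]:
        print(load_item(".", item))
```

The point is that downstream code (the DataLoader, the experiment scripts) only ever sees the fixed format, while each dataset's quirks stay quarantined in one small script.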
- Set your own intentions e.g. for the coming month, and - crucially for ML work - don't allow yourself to fiddle with every part of your workflow all at once. You could leave your dataset and feature-extraction constant even if you think it needs improving, and work only on the network for now. This helps you avoid falling into the underworld of far-too-many options you could tweak, and makes sure you can compare some experiments against each other (see the config sketch below).
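One lightweight way to hold yourself to this -- purely a sketch, with made-up field names -- is to keep the parts you've decided to freeze in one place, and only pass in what you're actually varying:

```python
# Sketch of "change one thing at a time": pin the data/feature settings for now,
# and only vary the model settings between runs. Field names are made up.
FROZEN = {                       # leave these alone this month, even if imperfect
    "dataset": "mydataset-v1",
    "features": {"type": "mel", "n_mels": 64},
}

def make_config(**model_overrides):
    """Combine the frozen settings with the model settings you're sweeping."""
    model = {"layers": 4, "hidden": 256, "learning_rate": 1e-3}
    model.update(model_overrides)
    return {**FROZEN, "model": model}

# e.g. compare make_config(learning_rate=1e-3) vs make_config(learning_rate=3e-4):
# only the network changes, so the two runs stay comparable.
```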
- Experiment management frameworks such as MLflow: we don't really use these, but it's worth dabbling with them to see if they can help us, without us getting trapped inside their way of doing things. Weights & Biases is a related experiment-tracking framework, popular with some of our team (a minimal MLflow example is sketched after the action item below).
- ACTION: we're going to dabble more with MLflow in our experimental work.
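For anyone who hasn't seen it, here is roughly what MLflow's tracking API looks like inside a training script -- a minimal sketch assuming mlflow is installed, with made-up experiment and parameter names:

```python
# Minimal MLflow tracking sketch (assumes `pip install mlflow`; names are made up).
import mlflow

mlflow.set_experiment("my-classifier")          # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-3)     # record the settings you chose
    mlflow.log_param("n_mels", 64)
    for epoch in range(10):
        val_loss = 1.0 / (epoch + 1)            # stand-in for a real validation loop
        mlflow.log_metric("val_loss", val_loss, step=epoch)
    # mlflow.log_artifact("model.pt")           # attach files such as checkpoints

# Running `mlflow ui` in the same directory then shows all runs side by side.
```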
- Getting started as a new student/researcher is hard: finding out how to code basic things, how to access datasets, how to store ML models and results. In many university research groups we don't provide very structured induction for this kind of thing. We're never going to be as structured as a commercial AI company, partly because we need the freedom to explore completely different workflows and ideas, but some sort of ML developer starter pack would be good!
- ACTION: We'll draft something, e.g. in a wiki. We'll start with something as simple as a list of useful Python packages (an illustrative sketch follows).
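Just to illustrate how small the first version could be (this is not our actual list, which we haven't drafted yet), the starter list could literally begin life as a requirements.txt:

```
# Hypothetical first draft of a starter requirements.txt -- to be replaced by the real list
numpy
scipy
pandas
matplotlib
scikit-learn
torch
mlflow
```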
Note that the topics raised by my research team correspond to items 3, 4 and 5 in this nice list of "Tips for setting up code management in a research lab" -- I can't find the original link, but I think it may have come from Patrick Mineault.