Ban AI in the classroom? My comment
There's an open letter circulating this week entitled "Stop the Uncritical Adoption of AI Technologies in Academia", initiated by academics at Radboud. I agree with quite a lot of it, but its demands are sweeping, and as a result I can't sign the letter. My reasons, and my response, are in this post. I'd be interested in your reaction too.
Here's my comment:
One of my big concerns is that they don't specify what it is they want to ban. We all know that "AI" is a widely-used term which is sometimes taken broadly and sometimes narrowly. I believe that the aim of the letter is to ban big-industry generative AI from the classroom (judging by the motivations they express). I sympathise with that. However, the authors have chosen to simplify this to the term "AI" without explanation, and that turns their demands into quite extreme ones.
The closest we get to a definition in this letter is AI "...such as chatbots, large language models, and related products." So is it only text generation they want to ban, ignoring image generation? Maybe, but that's probably too narrow. Do they want to ban all use of machine learning, even the teaching of machine learning? I very much doubt it, but it's easy to read the demands that way, since "AI" is understood by many people to include all of that.
(The title of the letter seems to exhibit nuance: "Stop the Uncritical Adoption of AI" is much better than "Stop the Adoption of AI". But the letter's demands go further.)
For myself, I'd like to
- (a) keep the Big GenAI industry out of the classroom (I'd like it to be possible to complete my course without sending data outside the country, without helping to train some VC company's algorithm, without supporting unknown amounts of energy waste);
- (b) take seriously the threat to learning from "cognitive shortcuts" and bullshit text;
- (c) never use any AI whose energy/climate footprint cannot be measured;
while I also want to
- (d) make my course robust and fair even when some students might over-use LLMs outside my control;
- (e) take account of LLMs' valuable use cases - most notably in programming, for code debugging and writing code to adapt data formats.
For me, the open letter's demand to "ban AI use in the classroom for student assignments" accounts for (a,b,c) but fails at (d) and (e).
I've avoided LLMs so far, but I don't believe I can achieve (d) without making some nuanced, tactical alterations to the course that I teach. I might use EduGenAI or possibly an offline local LLM, since that helps with points (a) and (c) (though not completely).
So, from my own personal perspective: I don't agree with the open letter because it "throws the baby out with the bathwater": the "baby" being ML tools in the classroom, the "bathwater" being Big GenAI and LLM-induced de-skilling. I would prefer our strategy to be one that deliberately guards against both of those without banning all "AI".
I also have in mind the fatalistic voices who will comment: "Students will use ChatGPT anyway" and "But ChatGPT is better than ____". I work at Tilburg University, whose motto is "Understanding Society". Surely, this now includes understanding the societal context and implications of using LLMs, including the societal position of one LLM versus another LLM. For me, tools like GPT-NL or EduGenAI should help to make this case. (Or offline LLMs?) We can disentangle LLMs as a tool from Big GenAI as an industry, in the messaging we give to students.
I'm grateful to the letter authors for taking a stand, and for providing good food for thought.
Using Zarr for storing large audio data files in cloud-optimised format
Imagine you have recorded lots of very long audio files. Hours long, at least. Gigabytes long, at least. You can store these in standard formats such as WAV or FLAC. You can share those via cloud storage too. But now imagine that you want to do data analysis involving random access, for example "I only want to analyse a particular 10-second region from midday on the 7th of October." What can you do?
In most cases you'll need to download the entire WAV or FLAC file that contains your chunk. Depending on your software, you may need to load the whole file into memory to find the excerpt that you need. Bah! That's two very clunky and wasteful steps: downloading too much, and loading too much into memory.
Enter cloud-optimised data formats such as Zarr. With Zarr you can store and share multi-dimensional arrays, and when you want to access a chunk you can specify the range you want, and Zarr will only need to transfer the file chunks that are needed to cover that.
So: can we use this for audio? Does it even make sense? I think so: audio data usually only has one "long" axis (the time dimension), but we record lots of long soundscape audio, and it's hard to share these datasets conveniently. I suspect that Zarr can help us with cloud-hosting audio data... and one more thing: Zarr's choice of compression codecs even allows us to use FLAC compression, which is designed for audio, meaning that we should be able to get great data compression AT THE SAME TIME as making it really convenient for analysis.
I'm new to Zarr. But it seems easy to use. Here's my method and Python code to try Zarr with FLAC:
- Find a long audio file. I'm using a soundscape recording about 90 minutes long, 2.0G on disk. The file format of my WAV file is: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, stereo 96000 Hz.
- Try compressing it using the standard FLAC utility, with default parameters. This isn't cloud-optimised or analysis-ready, but it's 466M -- great! About 24% of the original.
- Now let's try Zarr in Python:
First here's how I set up a virtual environment for it:
python -m venv venv_zarr
source ./venv_zarr/bin/activate
pip install -U pip
pip install ipython zarr librosa flac-numcodecs
Then I'll simply use librosa to read the file, and Zarr to write it out to disk, once in Zarr's default compression format (Blosc) and once in FLAC.
Note that you need to decide carefully what audio sample format you want to use. Since my sound file is stored on disk as 16-bit, I'll use the "int16" data type in Python, so that the audio data points have the same precision in every case.
import librosa
import zarr
import os
from math import ceil
from flac_numcodecs import Flac

filename = os.path.expanduser("~/ZOOM0001_LR_Leiden_20230308_0800.WAV")
sr = librosa.get_samplerate(filename)
dur = librosa.get_duration(path=filename)

audio_dtype = "int16"
# NB I don't have any strong preferences about chunk size.
chunk_size = 4096

######################################
# First using Zarr default compression
stream = librosa.stream(filename,
                        block_length=chunk_size,
                        hop_length=chunk_size,
                        frame_length=chunk_size,
                        mono=False,
                        dtype=audio_dtype)

durspls = int(ceil(dur * sr))

# Create the zarr array.
z = zarr.open("/tmp/bigaudio_default.zarr", mode='w', shape=(2, durspls),
              chunks=(2, chunk_size), dtype=audio_dtype)

# Stream the audio through, block by block, writing into the zarr array.
totspls = 0
for y_block in stream:
    z[:, totspls:totspls + y_block.shape[1]] = y_block
    totspls += y_block.shape[1]
stream.close()

z.info  # BLOSC: No. bytes stored: 1.1G, Storage ratio: 1.9
#######################################
# Second using FLAC compression in Zarr
stream = librosa.stream(filename,
                        block_length=chunk_size,
                        hop_length=chunk_size,
                        frame_length=chunk_size,
                        mono=False,
                        dtype=audio_dtype)

durspls = int(ceil(dur * sr))

# Create the zarr array, this time with the FLAC codec as compressor.
flac_compressor = Flac(level=8)
z = zarr.open("/tmp/bigaudio_flac.zarr", mode='w', shape=(2, durspls),
              chunks=(2, chunk_size), dtype=audio_dtype,
              compressor=flac_compressor)

totspls = 0
for y_block in stream:
    z[:, totspls:totspls + y_block.shape[1]] = y_block
    totspls += y_block.shape[1]
stream.close()

z.info  # FLAC: No. bytes stored: 481.1M, Storage ratio: 4.3
And for completeness here's a simple test of loading a 10-second chunk from the middle of the file:
z = zarr.open("/tmp/bigaudio_flac.zarr", mode='r')
readoffset = 200_000_000
readdur = int(10 * sr)
subset = z[:, readoffset:readoffset+readdur]
print(subset)
It's that line subset = z[...] which loads some data into memory, giving me a numpy array which I can use as normal. And for me that was lightning-fast. It definitely didn't load the whole audio; and I didn't have to tell the system anything about how to achieve that :)
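And to answer the original question ("a particular 10-second region starting at a given clock time"), you just need to convert a wall-clock time into a sample offset. Here's a minimal sketch, continuing from the snippet above, and assuming the recording's start time is known (I'm taking 08:00 from the filename; adjust for your own data):

from datetime import datetime

# Assumed recording start time, read off the filename.
rec_start = datetime(2023, 3, 8, 8, 0, 0)
wanted = datetime(2023, 3, 8, 8, 45, 0)    # the moment we want to inspect

readoffset = int((wanted - rec_start).total_seconds() * sr)
readdur = int(10 * sr)
subset = z[:, readoffset:readoffset + readdur]   # 10 seconds starting at 08:45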
According to my usual filesize tool ("du"), here's a summary of the filesizes:
- Original WAV: 2.0 GB
- FLAC file: 466 MB
- Zarr using Blosc: 1.6 GB
- Zarr using FLAC: 540 MB
Great! The FLAC settings might not be identical in the two setups, but that doesn't matter here. We get the best of both worlds: highly compressed audio data, and lightning-fast access to arbitrary little chunks of a massive file. It's also really notable that the default compression (Blosc) is about 3 times worse than FLAC at compressing soundscape audio.
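One thing I haven't tested yet is accessing such a store remotely. In principle the same slicing should work against a cloud-hosted copy; here's a hedged sketch, assuming the store has been uploaded to an S3-style bucket (the URL below is made up) and that an fsspec backend such as s3fs is installed:

import zarr
from flac_numcodecs import Flac  # probably needed so the FLAC codec is available for decoding

# Hypothetical remote copy of the FLAC-compressed store created above.
z = zarr.open("s3://my-bucket/bigaudio_flac.zarr", mode="r")

# Only the chunks covering this slice should be transferred over the network.
excerpt = z[:, 1_000_000:1_960_000]   # 10 seconds at 96 kHz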
Measuring the CO2 footprint of an AI model
Training AI models, or running AI experiments, can consume a lot of power. But not always! Some are large and some are small. This week I've been using CodeCarbon, a tool for measuring the CO2 emissions of your code.
CodeCarbon tracks the amount of power that your computer's CPU/RAM/GPU/etc use during an experiment, to calculate the total power usage. It then performs an online lookup to find out how carbon-intensive your local electricity supplier is (since the CO2 impact of electricity generation varies throughout the day, throughout the year, and from place to place). From that, it calculates a total CO2 impact.
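The underlying arithmetic is simple enough to sanity-check by hand. A minimal sketch, with made-up placeholder numbers (in reality CodeCarbon measures the energy and looks up the grid intensity for you):

# Rough illustration of the calculation, with placeholder numbers.
energy_kwh = 0.05                 # measured energy used by the experiment (kWh)
grid_intensity_kg_per_kwh = 0.35  # assumed carbon intensity of the local grid (kgCO2eq/kWh)

emissions_kg = energy_kwh * grid_intensity_kg_per_kwh
print(f"Estimated emissions: {emissions_kg * 1000:.1f} gCO2eq")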
Using CodeCarbon in your Python script is easy:
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name='waffleogram')
# ...

tracker.start()
try:
    ...  # The training loop goes here
finally:
    tracker.stop()
I've used this code to compare running the same experiment (simply training a small CNN) on 3 different machines I have available: my work laptop, my home server, and a dedicated GPU server at Naturalis. All these are evaluated on insect sound classification, running 2 epochs of EfficientNet training (dataset: InsectSet66). I'll quote the total electricity used for the training run, as calculated by CodeCarbon:
- My laptop (no GPU, Intel i7 CPU): about 3h30 to run 2 epochs; 0.048196 kWh of electricity used
- My home server (no GPU, Intel Pentium CPU): about 18h to run 2 epochs (!); 0.247716 kWh of electricity used
- Our Naturalis GPU server (A40 GPU [2 present, but only 1 used here], Intel Xeon CPU): about 7 minutes to run 2 epochs; 0.045225 kWh of electricity used
My home server is the least efficient method, primarily because its CPU is old and power-hungry.
The laptop and the GPU server apparently use a similar amount of energy for this task, despite many differences! The GPU server is much more power-hungry (e.g. the RAM takes 188W of power whereas my laptop RAM takes 6W) but it completes the task quickly.
The analysis that CodeCarbon gives you is incomplete. It's still useful! But there are a few extra factors worth thinking about, which a tool like this cannot know about. Firstly, it doesn't know that my home computers were powered by our solar panels -- I ran these tests on a bright summer's day, when our home was generating excess energy, meaning that the true carbon footprint is effectively zero, and certainly much lower than electricity from the general Dutch grid. Secondly, it doesn't know whether you're using a machine that is already running for other reasons, or whether you bought or powered up the machine specially for your experiment; in the latter case you should also count the carbon cost of running the base system.
I also tried running the same experiment on the GPU server, but swapping the simple CNN-based architecture for a slightly more complicated one running an adaptive feature extractor. Without changing anything else at all, this makes a big difference: the adaptive feature extractor makes the training task more complicated, and slower, to calculate -- it took 50 minutes (rather than 7) and its power usage was approximately 10 times higher.
So: it makes a big difference what machine-learning model you're training; it makes a big difference what machine you're running it on. Factors of 5 or 10 are really significant, especially when multiplied up to the scale of a whole research project. The important thing is to measure it. Hence tools like CodeCarbon.
See also this other recent blog from me: Is using LLMs bad for the environment?
Is using LLMs bad for the environment?
We've been asked the question: "Is using LLMs bad for the environment?" It's an important question, since this is a new technology that many are trying to find uses for -- yet running LLMs clearly uses a nontrivial amount of energy, which translates into impacts such as CO2 emissions. As machine learning researchers, I think we have a duty to be able to give a decent answer.
Here's a beginning, from my Tilburg colleague Nikos: "One way to answer the question whether using LLMs is bad for the environment is to take a comparative approach. There are tasks that LLMs can do that other technologies can do (e.g., search), and there we can compare the resource intensity of the technologies. There are other capabilities that are unique to the LLM technology, and for those cases, the best available reference is how much resources humans would use to perform the same task." -- Good start. What to add?
LLMs are a general-purpose technology which, when used for a specific task such as web search, will always be much less efficient than a dedicated system, simply because "classic" web search can be heavily optimised for that single task.
Some early estimates of the carbon footprint of LLMs (Strubell et al 2019) were too pessimistic and created some very bad headlines. More recent estimates have improved on this. Here's a recent research paper that tries to make accurate and precise estimates: Faiz et al (2024).
The biggest and most impressive LLMs are undoubtedly highly carbon-intensive, as part of their drive to outperform their competitors. The difference between a big LLM and a small one can certainly be a factor of 100. However, there is also a push to create efficient "small" LLMs, even ones that could run on your own computer.
So, the exact footprint of an LLM depends on which LLM it is, but also on other factors such as how clean the energy used for the data centre is. These factors can easily change the footprint by a factor of 10 or 100 - they are not to be ignored. (In just the same way, when deciding whether to take a flight or a train, the difference in carbon footprint can be ×10 to ×100. These "multipliers" are important to pay attention to.)
It is important to note that the developers of the most well-known LLMs (e.g. GPT-4) refuse to publish the information that would give a clear answer to exactly how bad they are for the environment. -- Thus, I would recommend only using LLMs whose providers make clear numerical statements about their carbon footprint. We must not incentivise bad practice.
How to maximise our AI experiment throughput?
I've been thinking about how our research team can maximise the effectiveness of their machine learning (ML) experiments. I'm thinking of it as maximising the "throughput" but not in the usual computer-science approach (parallel processing, SIMD, GPUs etc) -- instead, I'm trying to pay attention to human factors, the full loop of research and ideas, and trying to maximise the number of useful ML experiment results we have in our hands per month.
Note that the word "useful" in that sentence is not trivial! I don't simply want to run the most ML jobs possible. We're not looking for results, really, we're looking for insights!
We had a discussion of these issues in my group this week. Here are some topics raised by people in the group:
- Hardware availability (e.g. GPU machines) -- Even though we have some good modern GPU servers, cloud compute, and HPC farms, multiple team members still said that a big limitation for them was having the hardware available to run many experiments quickly. I was honestly surprised by this. So, why might this still be the case?
- Option 1: Researchers' appetite is unbounded?? Possibly!
- Option 2: Our use of GPU machines is not very efficient, so we're not squeezing many experiments into our available resource? Possibly! But, importantly, we're in a research environment here, not a deployment environment, so there always has to be room for exploratory, badly-coded experiments; we don't want to spend our time on premature optimisation. It's good to avoid being wasteful, but not to waste all your time on optimisation.
- Option 3: There is existing resource, but it's hard to access? -- Yes, I think this is one issue. In my previous position and in my current position, the university provided an HPC system for everyone to log in and run many parallel jobs. In both places, students didn't use it, even though we tried to get them to. Why not? Probably it's the barrier to entry: dispatching jobs into an HPC system is hard to understand at first, and you also lose some benefits such as direct interaction with your running process (e.g. to visualise results, or to debug it interactively). --- How can we improve on this in future? We will have diverse resources - local, HPC, and cloud - and we want to lower the barriers to using all of them.
- Dataset formats -- Every dataset typically has its own idiosyncratic formatting, column names, etc. It's possible to advocate for standards (recently, I'm working with Audiovisual Core, which standardises column names, and I like their philosophy of low-key harmonisation) -- but realistically, I can't expect the majority of 3rd-party datasets EVER to follow my formatting wishes. As one person commented, "Writing the DataLoader can sometimes be the part that takes the most work!"
- One strategy to work around dataset formats is to structure your code with some "glue" files that do the following two things (see the first sketch after this list):
(1) Express basic information in a fixed format such as a simple JSON (e.g. the list of files);
(2) Apply custom data loading operations for that dataset in a Python script (e.g. to load a particular file and its annotations).
- Set your own intentions e.g. for the coming month, and - crucially for ML work - don't allow yourself to fiddle with every part of your workflow all at once. You could leave your dataset and feature-extraction constant even if you think it needs improving, and work only on the network, for now. This helps you to avoid falling off into the underworld of far-too-many options you could tweak, and make sure you can compare some experiments against each other.
- Experiment management frameworks such as MLflow: we don't really use these, but it's worth dabbling with them to see if they can help us, without getting trapped inside their way of doing things. Weights And Biases is a related framework for tracking experiments, popular with some of our team.
- ACTION: we're going to dabble more with MLflow in our experimental work (see the second sketch after this list).
- Getting started as a new student/researcher is hard: finding out how to code basic things, how to access datasets, how to store ML models and results. In many university research groups we don't provide very structured induction for this kind of thing. We're never going to be as structured as a commercial AI company, partly because we need the freedom to explore completely different workflows and ideas, but some sort of ML developer starter's pack would be good!
- ACTION: We'll draft something, e.g. in a wiki. We'll start with something as simple as a list of useful python packages.
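Here's a minimal sketch of the "glue" idea from the dataset-formats item above. Everything in it is hypothetical (file names, JSON fields, dataset layout); it just illustrates separating a fixed-format index from the dataset-specific loading code:

# dataset_index.json (hypothetical) holds the basics in a fixed format, e.g.:
#   {"name": "mydataset", "files": [{"audio": "rec01.wav", "labels": "rec01.csv"}]}
import json
import csv

def list_items(index_path="dataset_index.json"):
    """Read the fixed-format JSON index: the part every dataset shares."""
    with open(index_path) as f:
        return json.load(f)["files"]

def load_item(item):
    """Dataset-specific loading: this is the part you rewrite per dataset."""
    with open(item["labels"]) as f:
        annotations = list(csv.DictReader(f))
    # Audio loading omitted here; in practice e.g. librosa.load(item["audio"]).
    return item["audio"], annotations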
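And since MLflow came up, here's roughly what minimal experiment tracking with it looks like. The experiment name, parameters and metric values are placeholders:

import mlflow

mlflow.set_experiment("insectset66-efficientnet")   # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("architecture", "EfficientNet")
    mlflow.log_param("epochs", 2)
    for epoch in range(2):
        val_accuracy = 0.5 + 0.1 * epoch    # placeholder; log your real metric here
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)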
Note that the topics raised by my research team correspond to items 3, 4 and 5 in this nice list of "Tips for setting up code management in a research lab" -- I can't find the original link, but I think it may have come from Patrick Mineault.
Find me on Mastodon
Find me on Mastodon! This is me: https://mastodon.social/@danstowell
I also have a couple of separate "topical" accounts, where I post:
- food and drink here: https://hostux.social/@nomnomdan
- music here: https://ravenation.club/@mcldnowplaying
I'm really enjoying it so far, it feels lively. I don't know how many people run separate accounts like I am doing, but I find it a nice way to connect in to some topical communities too.
Computers should get BETTER over time, not worse
I was given this pretty surprising insight at a permacomputing meet-up in Utrecht:
Computers should get BETTER over time, not WORSE
Why? It's our everyday experience that computers, software, smartphones all tend to get worse over time. Slower, more unreliable, strange behaviour. We have a habit of ascribing that to …
The carbon footprint of AI will NOT shrink
The heavy computing involved in AI - training and running deep learning algorithms, with big datasets - consumes power and resources. How much should we be concerned about its environmental impact?
In 2019, the paper Energy and Policy Considerations for Deep Learning in NLP made quite a big splash by explicitly calculating …
How to schedule regular no-wifi time on a Linux laptop
I'm having problems focusing on my work. It's very difficult to avoid distractions.
One big distraction is something called the World Wide Web... well, in my job the most difficult distraction is actually work email - there are lots of different things arriving by email that pull at my attention. Email is great but …
Favourite audio/science/misc software to install on Linux
I was setting up a new laptop recently. If you're not familiar with Linux, you probably don't know what an amazing ecosystem of software you can have for free, almost instantly. Yes, sure, the software is free, but what's actually impressive is how well it all stitches together through …