Title: Speech to text, she APTly whispered, how hard can it be?
Tags: english, debian, multimedia, video
<p>While visiting a convention during Easter, it occurred to me that
it would be great if I could have a digital Dictaphone with
transcribing capabilities, providing me with texts to cut-n-paste into
stuff I need to write. The background is that long drives often bring
up the urge to work on texts I am writing, which of course is out
of the question while driving. With the release of
<a href="https://github.com/openai/whisper/">OpenAI Whisper</a>, this
seems to be within reach with Free Software, so I decided to give it a
go. OpenAI Whisper is a Linux based neural network system that reads in
audio files and provides a text representation of the speech in that
audio recording. It handles multiple languages and, according to its
creators, can even translate into a different language than the spoken
one. I have not tested the latter feature. It can use either the CPU
or a GPU with CUDA support. As far as I can tell, CUDA in practice
limits that feature to NVidia graphics cards. I have few of those, as
they do not work great with free software drivers, and have not tested
the GPU option. While looking into the matter, I did discover some
work to provide CUDA support on non-NVidia GPUs, and some work on
porting the library used by Whisper to other GPUs, but have not
spent much time looking into GPU support yet. I have so far used an old
X220 laptop as my test machine, and only transcribed using its CPU.</p>
<p>As it is unthinkable from a privacy standpoint to use computers
under the control of someone else (aka a "cloud" service) to transcribe
one's thoughts and personal notes, I want to run the transcribing
system locally on my own computers. The only sensible approach to me
is to make the effort I put into this available for any Linux user and
to upload the needed packages into Debian. Looking at Debian Bookworm, I
discovered that only three packages were missing:
<a href="https://bugs.debian.org/1034307">tiktoken</a>,
<a href="https://bugs.debian.org/1034144">triton</a>, and
<a href="https://bugs.debian.org/1034091">openai-whisper</a>. For a while
<a href="https://bugs.debian.org/1034286">ffmpeg-python</a> was
also needed, but as its
<a href="https://github.com/kkroening/ffmpeg-python/issues/760">upstream
seems to have vanished</a>, I found it safer
<a href="https://github.com/openai/whisper/pull/1242">to rewrite
whisper</a> to stop depending on it than to introduce ffmpeg-python
into Debian. I decided to place these packages under the umbrella of
<a href="https://salsa.debian.org/deeplearning-team">the Debian Deep
Learning Team</a>, which seems like the best team to look after such
packages. Discussing the topic within the group also made me aware
that the triton package was already planned as a dependency of newer
versions of the torch package, and would be needed after
Bookworm is released.</p>
<p>All required code packages have now been waiting in
<a href="https://ftp-master.debian.org/new.html">the Debian NEW
queue</a> since Wednesday, heading for Debian Experimental until
Bookworm is released. An unsolved issue is how to handle the neural
network models used by Whisper. The default behaviour of Whisper is
to require Internet connectivity and download the requested model to
<tt>~/.cache/whisper/</tt> on first invocation. This obviously would
fail <a href="https://people.debian.org/~bap/dfsg-faq.html">the
deserted island test of free software</a>, as the Debian packages would
be unusable for someone stranded with only the Debian archive and a
solar powered computer on a deserted island.</p>
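<p>The default behaviour is easy to observe on a machine with network
access and no cached model; in this sketch, <tt>sample.ogg</tt> is a
placeholder for any audio file you have at hand:</p>

```shell
# First invocation downloads the requested model over the network
# into ~/.cache/whisper/ before transcribing (sample.ogg is just a
# placeholder file name).
whisper --model small sample.ogg

# The model file stays cached for later, offline runs.
ls -lh ~/.cache/whisper/
```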
<p>Because of this, I would love to include the models in the Debian
mirror system. This is problematic, as the models are very large
files, which would put a heavy strain on the Debian mirror
infrastructure around the globe. The strain would be even higher if
the models changed often, which luckily, as far as I can tell, they do
not. The small model, which according to its creator is most useful
for English, and in my experience does not do a great job even there,
is 462 MiB (deb is 414 MiB). The medium model, which to me seems to
handle English speech fairly well, is 1.5 GiB (deb is 1.3 GiB),
and the large model is 2.9 GiB (deb is 2.6 GiB). I would assume
everyone with enough resources would prefer to use the large model for
the highest quality. I believe the models themselves would have to go
into the non-free part of the Debian archive, as they do not really
include any useful source code for updating the models. The
"source", aka the model training set, according to the creators
consists of "680,000 hours of multilingual and multitask supervised
data collected from the web", which to me reads as material with
unknown copyright terms, unavailable to the general public. In other
words, the source is not available according to the Debian Free
Software Guidelines and the models should be considered non-free.</p>
<p>I asked the Debian FTP masters for advice regarding uploading a
model package on their IRC channel, and based on the feedback there it
is still unclear to me if such a package would be accepted into the
archive. In any case, I wrote build rules for an
<a href="https://salsa.debian.org/deeplearning-team/openai-whisper-model">OpenAI
Whisper model package</a> and
<a href="https://github.com/openai/whisper/pull/1257">modified the
Whisper code base</a> to prefer shared files under <tt>/usr/</tt> and
<tt>/var/</tt> over user specific files in <tt>~/.cache/whisper/</tt>,
to be able to use these model packages and prepare for such a
possibility. One solution might be to include only one of the models
(small or medium, I guess) in the Debian archive, and ask people to
download the others from the Internet. I am not quite sure what to do
here, and advice is most welcome (use the debian-ai mailing list).</p>
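<p>The idea behind that change can be sketched in shell as a simple
lookup order preferring shared system locations over the per-user
cache; note that the directory names below are my illustrative
assumptions, not necessarily the ones used by the actual patch:</p>

```shell
#!/bin/sh
# Sketch of a model lookup preferring shared system files over the
# per-user cache.  The first two directory names are assumptions
# made for illustration.
model=small.pt
for dir in /usr/share/openai-whisper /var/cache/openai-whisper \
        "$HOME/.cache/whisper"; do
    if [ -e "$dir/$model" ]; then
        echo "Found model at $dir/$model"
        break
    fi
done
```

<p>With this order, a model installed from a Debian package would be
picked up first, and the network download would only be needed when no
packaged model is present.</p>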
<p>To make it easier to test the new packages while I wait for them to
clear the NEW queue, I created an APT source targeting Bookworm. I
selected Bookworm instead of Bullseye, even though I know the latter
would reach more users, because some of the required dependencies are
missing from Bullseye and during this phase of testing I did not want
to backport a lot of packages just to get up and running.</p>
108 <p>Here is a recipe to run as user root if you want to test OpenAI
109 Whisper using Debian packages on your Debian Bookworm installation,
110 first adding the APT repository GPG key to the list of trusted keys,
111 then setting up the APT repository and finally installing the packages
112 and one of the models:</p>
<pre>
curl https://geekbay.nuug.no/~pere/openai-whisper/D78F5C4796F353D211B119E28200D9B589641240.asc \
  -o /etc/apt/trusted.gpg.d/pere-whisper.asc
mkdir -p /etc/apt/sources.list.d
cat > /etc/apt/sources.list.d/pere-whisper.list <<EOF
deb https://geekbay.nuug.no/~pere/openai-whisper/ bookworm main
deb-src https://geekbay.nuug.no/~pere/openai-whisper/ bookworm main
EOF
apt update
apt install openai-whisper
</pre>
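<p>Once installed, transcription can be tested directly from the
command line. A minimal sketch, where the file name is a placeholder
for your own recording:</p>

```shell
# Transcribe a Norwegian recording with the small model; text output
# files (txt, srt, vtt and more) are written to the current directory.
whisper --model small --language Norwegian recording.ogg
```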
<p>The packages work for me, but have not yet been tested on any
computer other than my own. With them, I have been able to (badly)
transcribe a 2 minute 40 second Norwegian audio clip to test, using
the small model. This took 11 minutes and around 2.2 GiB of RAM.
Transcribing the same file with the medium model gave an accurate text
in 77 minutes using around 5.2 GiB of RAM. My test machine had too
little memory to test the large model, which I believe requires 11 GiB
of RAM. In short, this now works for me using Debian packages, and I
hope it will for you and everyone else once the packages enter
Debian.</p>
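<p>If you want to collect similar run time and memory numbers on your
own machine, one way is GNU time; again, the file name is a
placeholder:</p>

```shell
# GNU time's -v flag reports elapsed wall clock time and "Maximum
# resident set size", giving a rough idea of the RAM needed by a
# transcription run.
/usr/bin/time -v whisper --model medium --language Norwegian recording.ogg
```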
<p>Now I can start on the audio recording part of this project.</p>
<p>As usual, if you use Bitcoin and want to show your support of my
activities, please send Bitcoin donations to my address
<b><a href="bitcoin:15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b">15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b</a></b>.</p>