From: Petter Reinholdtsen
Date: Sun, 23 Apr 2023 07:16:53 +0000 (+0200)
Subject: New post on OpenAI Whisper.
X-Git-Url: http://pere.pagekite.me/gitweb/homepage.git/commitdiff_plain/56b07ae0314a05f01cb791d819f2a06feb28f71d

New post on OpenAI Whisper.
---

diff --git a/blog/data/2023-04-23-whisper-apt-debian.txt b/blog/data/2023-04-23-whisper-apt-debian.txt
new file mode 100644
index 0000000000..44163f4f81
--- /dev/null
+++ b/blog/data/2023-04-23-whisper-apt-debian.txt
@@ -0,0 +1,140 @@
+Title: Speech to text, she APTly whispered, how hard can it be?
+Tags: english, debian, multimedia, video
+Date: 2023-04-23 09:40
+
+

While visiting a convention during Easter, it occurred to me that +it would be great if I could have a digital Dictaphone with +transcribing capabilities, providing me with texts to cut-n-paste into +stuff I need to write. The background is that long drives often bring +up the urge to write on texts I am working on, which of course is out +of the question while driving. With the release of +OpenAI Whisper, this +seems to be within reach with Free Software, so I decided to give it a +go. OpenAI Whisper is a Linux based neural network system to read in +audio files and provide a text representation of the speech in that +audio recording. It handles multiple languages and according to its +creators can even translate into a different language than the spoken +one. I have not tested the latter feature. It can either use the CPU +or a GPU with CUDA support. As far as I can tell, CUDA in practice +limits that feature to NVidia graphics cards. I have few of those, as +they do not work great with free software drivers, and have not tested +the GPU option. While looking into the matter, I did discover some +work to provide CUDA support on non-NVidia GPUs, and some work with +the library used by Whisper to port it to other GPUs, but have not +spent much time looking into GPU support yet. I've so far used an old +X220 laptop as my test machine, and only transcribed using its +CPU.

+ +

As it is unthinkable from a privacy standpoint to use computers +under control of someone else (aka a "cloud" service) to transcribe +one's thoughts and personal notes, I want to run the transcribing +system locally on my own computers. The only sensible approach to me +is to make the effort I put into this available for any Linux user and +to upload the needed packages into Debian. Looking at Debian Bookworm, I +discovered that only three packages were missing, +tiktoken, +triton, and +openai-whisper. For a while +I also believed +ffmpeg-python was +needed, but as its +upstream +seems to have vanished I found it safer +to rewrite +whisper to stop depending on it than to introduce ffmpeg-python +into Debian. I decided to place these packages under the umbrella of +the Debian Deep +Learning Team, which seems like the best team to look after such +packages. Discussing the topic within the group also made me aware +that the triton package was already a future dependency of newer +versions of the torch package being planned, and would be needed after +Bookworm is released.

+ +

All required code packages have now been waiting in +the Debian NEW +queue since Wednesday, heading for Debian Experimental until +Bookworm is released. An unsolved issue is how to handle the neural +network models used by Whisper. The default behaviour of Whisper is +to require Internet connectivity and download the model requested to +~/.cache/whisper/ on first invocation. This obviously would +fail the +deserted island test of free software as the Debian packages would +be unusable for someone stranded with only the Debian archive and a solar +powered computer on a deserted island.

+ +

Because of this, I would love to include the models in the Debian +mirror system. This is problematic, as the models are very large +files, which would put a heavy strain on the Debian mirror +infrastructure around the globe. The strain would be even higher if +the models change often, which luckily as far as I can tell they do +not. The small model, which according to its creator is most useful +for English and in my experience is not doing a great job there +either, is 462 MiB (deb is 414 MiB). The medium model, which to me +seems to handle English speech fairly well, is 1.5 GiB (deb is 1.3 GiB) +and the large model is 2.9 GiB (deb is 2.6 GiB). I would assume +everyone with enough resources would prefer to use the large model for +highest quality. I believe the models themselves would have to go +into the non-free part of the Debian archive, as they are not really +including any useful source code for updating the models. The +"source", aka the model training set, according to the creators +consists of "680,000 hours of multilingual and multitask supervised +data collected from the web", which to me reads as material with +unknown copyright terms that is unavailable to the general public. In other +words, the source is not available according to the Debian Free +Software Guidelines and the model should be considered non-free.

+ +

I asked the Debian FTP masters for advice regarding uploading a +model package on their IRC channel, and based on the feedback there it +is still unclear to me if such a package would be accepted into the +archive. In any case I wrote build rules for an +OpenAI +Whisper model package and +modified the +Whisper code base to prefer shared files under /usr/ and +/var/ over user specific files in ~/.cache/whisper/ +to be able to use these model packages, to prepare for such a +possibility. One solution might be to include only one of the models +(small or medium, I guess) in the Debian archive, and ask people to +download the others from the Internet. Not quite sure what to do +here, and advice is most welcome (use the debian-ai mailing list).
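
The lookup preference described above (shared locations first, the per-user cache last) can be sketched in a few lines of shell. This is only an illustration, not the actual patch to Whisper: the function name find_model, the directories /usr/share/whisper and /var/lib/whisper, and the "<model>.pt" file naming are all assumptions made for the example.

```shell
#!/bin/sh
# Print the path of the first existing "<model>.pt" among the given
# directories, searched in the order listed; return non-zero if none
# of the directories contains the model.
find_model() {
    model=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$model.pt" ]; then
            printf '%s\n' "$dir/$model.pt"
            return 0
        fi
    done
    return 1
}

# Hypothetical search order: system-wide directories before the user
# cache. The two system paths are placeholders, not the packaged ones.
find_model small /usr/share/whisper /var/lib/whisper "$HOME/.cache/whisper" \
    || echo "model not found locally" >&2
```

With an ordering like this, a model installed from a package would win over a copy downloaded earlier into the user cache, and a download would only be needed when no location has the file.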

+ +

To make it easier to test the new packages while I wait for them to +clear the NEW queue, I created an APT source targeting Bookworm. The +reason I selected Bookworm instead of Bullseye, even though I know the latter +would reach more users, is that some of the required dependencies are +missing from Bullseye and during this phase of testing I did not want +to backport a lot of packages just to get up and running.

+ +

Here is a recipe to run as user root if you want to test OpenAI +Whisper using Debian packages on your Debian Bookworm installation, +first adding the APT repository GPG key to the list of trusted keys, +then setting up the APT repository and finally installing the packages +and one of the models:

+ +

+curl https://geekbay.nuug.no/~pere/openai-whisper/D78F5C4796F353D211B119E28200D9B589641240.asc \
+  -o /etc/apt/trusted.gpg.d/pere-whisper.asc
+mkdir -p /etc/apt/sources.list.d
+cat > /etc/apt/sources.list.d/pere-whisper.list <<EOF
+deb https://geekbay.nuug.no/~pere/openai-whisper/ bookworm main
+deb-src https://geekbay.nuug.no/~pere/openai-whisper/ bookworm main
+EOF
+apt update
+apt install openai-whisper
+

+ +

The packages work for me, but have not yet been tested on any other +computer than my own. With them, I have been able to (badly) transcribe +a 2 minute 40 second Norwegian audio clip to test using the small +model. This took 11 minutes and around 2.2 GiB of RAM. Transcribing +the same file with the medium model gave an accurate text in 77 minutes +using around 5.2 GiB of RAM. My test machine had too little memory to +test the large model, which I believe requires 11 GiB of RAM. In +short, this now works for me using Debian packages, and I hope it will +for you and everyone else once the packages enter Debian.
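
To put those timings in perspective, the slowdown relative to real time can be worked out from the numbers above: a 160 second clip, 11 minutes with the small model and 77 minutes with the medium model. The helper name slowdown is made up for this example, and the figures only apply to CPU transcription on the old X220 test machine.

```shell
#!/bin/sh
# Slowdown factor: processing time divided by clip length, both given
# in seconds, printed with one decimal.
slowdown() {
    LC_ALL=C awk -v proc="$1" -v clip="$2" 'BEGIN { printf "%.1f\n", proc / clip }'
}

clip=$((2 * 60 + 40))          # 2 minute 40 second clip = 160 s
slowdown $((11 * 60)) "$clip"  # small model, 11 minutes  -> 4.1
slowdown $((77 * 60)) "$clip"  # medium model, 77 minutes -> 28.9
```

In other words, on this hardware the small model needs roughly four times the length of the recording, and the medium model close to thirty times.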

+ +

Now I need to start on the audio recording part of this project.

+ +

As usual, if you use Bitcoin and want to show your support of my +activities, please send Bitcoin donations to my address +15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.