the 'powerplayer' ui is coming along nicely. but i made a simple error on a feature and i wanted to talk about it.
as i was making the graphical representations of call activity, i decided it would be cool to add color based on the radio id that transmitted a particular call. i googled for js functions that could convert a string into a color value and found one pretty fast. i added it into my code, wired it up to the stuff that does the dom work, and boom! colors, and a radio id always gets the same color.
over the course of a few days, i started to notice that an awful lot of the colors being chosen were within a fairly narrow range and it was hard to tell them apart visually. that defeated the intent, which was being able to read the flow of a conversation at a glance. so i went back and looked at the hashing part of the code i copied:
hash = string.charCodeAt(i) + ((hash << 5) - hash);
this is inside a loop with 'i' iterating up to the length of the string. so it walks the string, gets the ascii value of the character at the current position, and adds that to the current hash shifted left by 5 bits minus the current hash. in other words, each step multiplies the running hash by 31 and adds the character code.
the problem is i'm feeding this thing radio id's, which are generally 3 to 8 chars and all digits. the ascii values of digit chars are all in the 48-57 range. oftentimes, the radio id's that are talking to each other may share prefixes. for example the vast majority of radios carried by berkeley county, wv, employees have a '110' prefix. i can expect a conversation on a berkeley co talkgroup will have multiple radio id's of the form 110xxx. the hash value after the i=2 iteration is identical for all of these, and there's only so much variation that can happen over the remaining cycles.
the solution was easy. i thought about changing the hash out for something less likely to be affected by string similarity, but realized that i could just make the strings longer and possibly work around the problem. so i added to the function, before the loop:
string = string + string + string + string + string;
theoretically this still has a lot of similarity in what's being hashed, but the extra iterations give the minor differences more opportunities to affect the hash value. adjacent radio id's are no longer getting similar colors assigned, at least not as a consistent behavior. there's still times when close colors get picked, but it feels more random now, and the id's involved aren't necessarily close to each other anymore. this makes sense: the function is just picking the 'h' portion of an h/s/l color, so there's only 360 possibilities, and collisions and near-misses are absolutely expected.
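to make the failure mode concrete, here's a quick python re-creation of that hash (plain integer math, which matches the js exactly for ids this short), showing how ids that differ only in the last digit land on neighboring hues:

def js_style_hash(s):
    # same recurrence as the copied js line: hash = charCode + ((hash << 5) - hash),
    # which works out to hash = hash * 31 + charCode
    h = 0
    for ch in s:
        h = h * 31 + ord(ch)
    return h

for rid in ["110001", "110002", "110003"]:
    print(rid, js_style_hash(rid) % 360)   # hue = hash mod 360
# prints 201, 202, 203 - the shared '110' prefix fixes most of the hash, and a
# one-digit difference only nudges the hue by a degree or two. repeating the
# string before hashing, as in the fix above, gives those small differences
# more rounds to spread across the 0-359 range.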
there is now something to try out if you want to.
right now it mostly lets you have a very customized live listening experience. you can create 'buckets', which gather or reject new clips depending on the tags, talkgroups, and other options you specify. to make things somewhat easier for myself and others, there's some presets to automatically create common patterns.
to create a new bucket, click the 'create bucket' button in the top left. this will open the create bucket modal. you'll want to give the bucket a unique name and configure some behavioral options. autoplay means new, unplayed clips in the bucket will be automatically eligible for playback. paused is a way to temporarily override autoplay, which can be useful if you're choosing to focus on one bucket while you have many active. if autopurge is enabled, clips that have been played will be removed from the bucket.
next, choose which tags you want to match on, and whether all tags must match or any match will do. you can also specify which tags to exclude, which can be useful for things like not having fire-ems and ambulance traffic tied at the hip. you can also specify a comma separated list of talkgroups to explicitly include regardless of tags. to use a preset, choose it from the presets dropdown. the form will be filled according to the preset configuration. when the configuration is as desired, click the 'create bucket' button at the bottom of the form. you should now see your bucket added to the ui.
i have also configured a small number of 'bucket groups' for creating a large number of buckets based on jurisdictions and such. the dropdown picker for these is on the main page. pick one, then click 'create bucket group'. note that the bucket names come from the preset configs, so if you've already created buckets from those presets you may get name collisions. you will be notified if this occurs and the duplicate-name bucket will not be created.
also on the main page, you can toggle a few other behavioral options at the ui level.
- autoplay: controls autoplay globally. if this is disabled, buckets with autoplay enabled will not autoplay.
- autoplay unbucketed: controls whether clips that have not been assigned to a bucket will be played when there are no bucketed clips available to play.
- prefer last played: the last played bucket will keep playing if it has more clips, versus switching to the first bucket with unplayed calls.
- autobucket: determines whether unbucketed calls will be placed into buckets based on bucket configs.
- match 1st only: controls how autobucketing works. if enabled, a clip will only be added to the first bucket it matches. if disabled, a clip will be added to all buckets whose criteria match the clip, which can mean you'll hear the same clip played more than once. this is probably more useful later on when some other features come along.
each bucket gets a section of the ui with some buttons at the top to control the bucket's behavior. right now you can't edit the tag/talkgroup configs of a bucket...that's tbd. you'll also see a list of the clips currently in that bucket. you can play a clip immediately by clicking on its entry in the list, overriding the behavior of autoplay.
for a quick start, choose the '🚑 / 🔥 / 🏥 by juris' option on the bucket group dropdown and click 'create bucket group'. default options plus that group will give you fire/ems activity split out by jurisdiction, plus a bucket for hospital comms.
to-do / feature list, in no particular order or priority...
- config persistence/saving. intent is to allow a full set of bucket and global configs to be stored. preferably, this will allow multiple configs to be stored and easily switched between.
- manual assignment of clips to buckets. you should be able to move a clip to a bucket with a direct action. not sure if this will be drag and drop or a context menu or something else yet. may also come with a search-to-bucket feature that allows clips to be accumulated on ad-hoc criteria beyond what bucket configs can do.
- clip list saving/sharing. you should be able to capture the state of a bucket's clip list and restore that state for your own use or share that state with others.
- recently played clips interface. ability to review the last n played clips, replay them directly, return to their originating buckets (if any), etc. useful for 'wait what was that?' moments.
- change from a text based clip list to a graphical representation influenced by the event listener interface.
the ui will probably be broken at times as i mess with things. feel free to contact me if you have specific feedback...but know i don't test this outside of chrome on windows, chrome on linux, chrome on android so ymmv on any other browser/platform combo and i have little interest in fixing browser compat issues that don't affect my own usage.
i want to make a new "power user" interface that allows clips to be accessed as soon as they are ingested on the back end. this would be somewhat closer to real-time than the conversation accumulation process, and much closer than event accumulation. the idea is that you get a list of recent clips and you get to bucket and play them at your own control, with some options to do things like automatically bucket and/or play clips based on tags or other metadata.
to support that, i need to make an api endpoint for recently ingested clips. the easy way to do this would be to do the same thing i've done for all the other "api endpoints" in this system. i put that in quotes because what's really happening is the back end is updating several json files on disk every few seconds, and nginx serves those up as files. i've gotten away with that by largely avoiding endpoints that support any kind of query parameter or option...you just get what's in the file right now. i've kept the use cases of the interfaces narrow to stay within this capability.
i absolutely could do the same thing for the recent clips endpoint, but i want this interface to potentially request new data every second or two. if i'm also making a few minutes' worth of ingested clips available, that becomes a fair bit of data i'm re-transmitting on every update. i've lived with this on interfaces like this one but i know the problem is potentially much worse with this new endpoint...especially as i'd need to pack some metadata with each clip that was previously only included once per conversation. so i decided this endpoint should support the ability to ask for just the clips newer than the last update.
i thought about several potential ways to approach this. part of why i do the other endpoints the way i do is paranoia over exposing my code directly to web clients. "here's some json that nginx will pass along to the clients" creates a wall...there's no way for client-generated data to hit the back end system. the back end started life as a flask app, back at the beginning of the project before any of it was exposed outside of my home network, and it still has the ability to serve up web things directly. i could simply add a function that takes a last-update timestamp as a parameter and returns the appropriate subset of a list of recently ingested calls; the back end has all the info it needs to do that. but the flask side never went out of development mode, and the app has grown a bunch of warts that would make it really hard to cram it back into the wsgi line of thinking. it's really intended to be a persistently running service, doing a ton of things that aren't serving web stuff. i have big concerns about having that app serve any traffic directly and no desire to figure out how to make it work within a wsgi context.
as i thought on the problem more and read a few things, i learned about redis streams. the data structure is perfect for what i'm looking for...an ordered list of items, native indexing based on timestamps, and it's great for situations where different clients need to see all the events. this seemed to be a potential way forward. i could let the back end operate much like it does for everything else, just dumping to redis instead of dumping to a file on disk.
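for the curious, "dumping to redis" on the back end is about this much code - the stream name, field layout, and maxlen cap here are illustrative, not the real config:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def publish_clip(clip):
    # XADD assigns a millisecond-timestamp id, which is exactly the
    # "indexed by time" property that makes streams a good fit here
    r.xadd(
        "recent_clips",
        {"clip": json.dumps(clip)},
        maxlen=5000,        # keep only the last few thousand entries
        approximate=True,   # let redis trim lazily
    )

publish_clip({"talkgroup": 9002, "duration": 7.4, "url": "/audio/example.mp3"})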
redis does not speak http, so i still had a gap to fill. i still needed something to take the request from the client, turn it into a redis query, and return the result in a consumable way. so i made a new tiny flask app just to do that. and since nginx can't speak wsgi on its own, i also set up uwsgi to sit between them. the final flow of a request is client -> cloudflare -> nginx -> uwsgi -> flask app -> redis.
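the flask app really is tiny. a sketch of the whole thing - the route, stream name, and parameter names are illustrative, not the actual api:

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

@app.route("/recent_clips")
def recent_clips():
    # the client passes back the last stream id it saw; "0-0" means "from the start"
    since = request.args.get("since", "0-0")
    result = r.xread({"recent_clips": since}, count=500)
    entries = result[0][1] if result else []
    return jsonify({
        "clips": [{"id": entry_id, **fields} for entry_id, fields in entries],
        # the client holds on to this and sends it with the next poll
        "last_id": entries[-1][0] if entries else since,
    })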
this is a lot of parts! but it keeps "my code vs the public internet" to a minimum - just the little flask app that only knows how to talk to redis and otherwise has no knowledge of or access to the back end. redis itself isn't exposed outside the local net, and it only contains transient non-critical data.
all of this is to ensure only new clips are transferred on each update, which should reduce the per-update bandwidth usage substantially. this is almost certainly a premature optimization, but i want this to be usable even on relatively low bandwidth or unstable connections. and now that i have this end of the system up and running, i have the opportunity to use it for other things. maybe the conversation listener will get a similar improvement to reduce its bandwidth usage!
stay tuned for additional news on the new ui.
it is not coincidental that the game 'marauders' opened for early access on steam a week after the previous post. it's a really good game and i've spent a lot of time in it. highly recommended if you're into the whole extraction shooter thing.
anyway. radio stuff. my oldest listening post is in hagerstown, md to receive washington county, md traffic better than i can from my house. last week, this device began dropping offline a lot. now, these listening posts are raspberry pi 4's with 3 usb sdr's hanging off of them, doing p25 decoding and capture 24/7. some degree of instability is expected. but this was new. i would ask the person at the location to power cycle it, it would be up for 1-3 hours, and drop again. deeper diagnostics kept getting delayed due to poor intersections of its uptime and my free time (see above re: marauders).
realizing this one had been in service for a bit over 2 years, i went ahead and ordered a replacement rpi (hooray improving supply situation) and some sdr's. i know i'll use them eventually regardless, so whatever. those haven't arrived yet, so over the weekend i dropped by for some in-person diagnostics.
the cpu fan was full of cat hair, but the logs showed no evidence of thermal throttling. i cleaned it out anyway. i reset it and spent some time checking the logs but didn't find much of interest...mostly because early on i saw one annoying pattern, dismissed it as irrelevant, and grepped it out of sight. spoiler alert: it was relevant.
the damn thing wouldn't fail while i was onsite, so i left a small monitor attached. if it locked up again, i could at least get a picture taken of the console, see if there's a kernel panic or something. of course, it failed as i was driving home. the console was full of kernel errors...not panics though. and still going, repeating every few seconds. the system isn't fully dead...i've been assuming a full halt/lockup up to this point.
and i remember the pattern i grepped out...and it sure does look like the lines i'm seeing on the console. it's not the overly chatty network driver bs i assumed it was; it's errors trying to use the network. ok, so possibly still failing hardware, and the replacement stuff is still potentially useful. i created a script to monitor the logs for the error pattern and initiate a reboot if it's seen. if a hard reset isn't required to clear the errors, this may be a good-enough mitigation until the replacement can be performed.
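the script itself isn't worth showing verbatim, but the idea fits in a few lines - scan only the syslog lines added since the last run, and reboot if the pattern shows up. a python sketch of that shape (the pattern, paths, and reboot command are placeholders):

#!/usr/bin/env python3
# run from cron every few minutes. PATTERN, LOG, and STATE are placeholders.
import os
import subprocess
import time

PATTERN = "some kernel network error text"
LOG = "/var/log/syslog"
STATE = "/var/tmp/netwatch.offset"

def read_new_lines():
    last = int(open(STATE).read()) if os.path.exists(STATE) else 0
    if os.path.getsize(LOG) < last:   # log rotated, start from the top
        last = 0
    with open(LOG, "rb") as f:
        f.seek(last)
        data = f.read()
        with open(STATE, "w") as s:
            s.write(str(f.tell()))
    return data.decode(errors="replace").splitlines()

if any(PATTERN in line for line in read_new_lines()):
    print(time.strftime("%c"), "- matches in syslog, rebooting!")
    subprocess.run(["/sbin/reboot"])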
that was 14 hours ago. this damn thing that hasn't managed more than 3 hours of contiguous runtime in a week is now up for 14 hours. because now a script is watching and waiting. because i'm watching and waiting.
i hate technology sometimes.
...and literally as i write this that bastard did the thing!
log evidence: Mon Jun 12 07:20:01 AM EDT 2023 - matches in syslog, rebooting!
ok, maybe hate is a strong word. anyway, the script saw the log pattern, the reboot happened, and the listening post came back online afterwards. guess a soft reboot is good enough, yay! i'm ok with the few-minute gaps in coverage i'll have from the reboots; better than the hours-long gaps i've had since this problem began. replacement hardware should arrive over the course of the week and i can swap things out next weekend.
i have a bunch of user interface work i want to do. i have a 4 day weekend coming up at july 4...might do a coding binge somewhere in there.
so those clever bastards at openai released a new speech recognition system called whisper. and it's surprisingly good at "zero shot" speech recognition (and translation!), meaning it can produce good results without any domain-specific tuning of the model or pre/post processing stages. naturally, i am giving it a try with the radio system.
openai has provided several sizes of the model: tiny, base, small, medium, and large. all but large also have english-only variants available. in my testing thus far, the english-only versions of the models do not produce better results and seem to be more prone to two kinds of problems:
repetition: a single word or series of words is repeated several times over. this can also manifest as long repeating or cycling digit patterns. my favorite so far was a transcription that started normally but degraded into repeating "i'm a satan!" a couple dozen times.
training set vomit: an intelligible phrase (or several phrases!) is in the transcription, but bears little or no phonetic or contextual similarity to the speech content. given a lot of the input data for the models was video pulled from the internet, i am not surprised that "thank you for watching this video" has shown up in my transcripts.
i'm now doing a bunch of comparison work to evaluate whether this can perform better than my customized vosk models and estimate what the resource cost might be. vosk recognition runs pretty well on the cpu despite not using more than one cpu core. i get about 2x speed out of it, meaning a 10 sec clip takes 5 sec to recognize. given most clips are less than 20 seconds and nearly all are less than 60, this keeps things pretty reasonable. i can run several recognizers in parallel on a multi-core system for a fair bit of capacity.
whisper needs a gpu. it can run on a cpu, but it is extremely slow. transcribing the same 14.8 sec test clip with each model size on a gtx 1080ti, perf looks like this:
small: 1.16 sec / 12.27x
medium: 3.31 sec / 4.46x
large: 6.32 sec / 2.34x
on cpu (a few years old 6c/12t i7), recognition with the tiny model takes 15 sec, or slightly worse than 1x. the small model takes 80 sec, a shockingly bad 0.18x speed.
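for anyone wanting to reproduce this kind of measurement, it's nothing fancier than loading a model and timing a transcribe call with the openai whisper package - something like this (clip path and numbers are just examples, and model load time is deliberately excluded):

import time
import whisper

CLIP = "sample_clip.wav"                 # a ~15 second radio clip
model = whisper.load_model("medium")     # "tiny", "base", "small", "medium", "large"

start = time.perf_counter()
result = model.transcribe(CLIP)
elapsed = time.perf_counter() - start

audio_seconds = 14.8                     # known duration of the test clip
print(result["text"])
print(f"{elapsed:.2f} sec, {audio_seconds / elapsed:.2f}x realtime")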
the gpu perf is fine...and even a good bit better than the speed i get with cpu-driven vosk, in terms of a single transcribing process. the problem is parallelism. if i'm happy with the perf of something on 1-2 cpu cores, it's really easy to make one server with a big cpu do a whole lot of that thing. gpus get their power from their inherently parallel nature. the gpu is fast at what it's doing because it is already making use of its extreme number of processing units. i can't run a second transcribing process on the same gpu. well, i can, provided two instances of the model fit in gpu memory, but each will run at slightly less than half the speed it would get with exclusive gpu access. exactly as you'd expect if they were competing for resources.
the questions i am trying to answer: "can whisper provide a better transcription product than vosk, and is it better enough that it's worth throwing gpus at the problem?"
i think it only really works out if i can get a result that's a solid "yes" to the first half via the medium model. right now, i run 5 instances of vosk in parallel to match the number of sweeper threads. to maintain roughly the same throughput i have right now with the large model, i'd need a total of 5 gpus in the 1080ti +/- 20% perf range...and the motherboards and cpus and psus to host them. with the medium model i could possibly get away with 2 or 3 gpus at the cost of some extra transcription latency when traffic levels are high.
it'd be really nice if the small model is good enough. i might even be able to get away with one gpu. i've done enough analysis now to say that 78% of the time, the medium model is able to produce a result that meets one of these criteria:
- is the best transcription (or tied for best) compared to small, large, and vosk
- is not the best transcription among the compared set, but has similar comprehensibility to the best transcription
i'm working on a similar evaluation for the small model. the small model is surprisingly good, being the best transcription (or tied for best) about 33% of the time.
vosk does maintain some edges over whisper. short clips whose language structures are highly domain specific are handled much better by vosk - dispatch notifications and status updates in particular. these don't have a lot of built-in context for whisper to latch on to, but i've told vosk to expect patterns like that. vosk also tends to get names of locations and roads correct more often than whisper, as the data i've tuned it with includes many of these. whisper does frequently outperform vosk on longer clips that have more integrated context and/or less specialized language structures. overall, vosk wins vs whisper about 14% of the time in this test set.
the whisper results have some nice features that would make me willing to accept a mild reduction in accuracy vs vosk. numbers are generally turned into digits. phonetic alphabet words are often turned into letters in correct-ish ways. whisper makes an attempt to capitalize and punctuate. these are all things vosk leaves to the side in favor of a straightforward space-separated list of lowercase words.
i am not ruling out a hybrid approach if i can come up with a good set of criteria for what to send to vosk vs whisper. clip duration would be a factor. dispatch and tactical talk groups are more likely to contain structured language. if i'm able to do that, it'd be great to have a second set of criteria to determine which clips would be better handled by the small model vs the medium model.
much more staring at transcription outputs is ahead before a decision can be made.
i've been working on a new ui that tries to be a middle ground between the near real time conversation-based listener i broadcast on twitch and the "can only listen after it happens" approach of the event browser. but that's not the point of the story, though it is cool as hell.
probably my biggest fault as a hobbyist programmer is half implemented things. this combines poorly with a tenuous relationship between hobby projects and source control...the ones i really care about might get a cron job that makes one commit with any files changed that day. anyway. occasionally there are half done things and then i stumble over them later. this is one of those times.
this particular effort had a few pieces, and it started with rewriting the playback bars of the "classic" ui to use canvas objects instead of a bunch of spans full of blank spaces with different background colors, because that was a lot of extra dom beat-up for not much reason except my comfort in treating a web page like a fixed font terminal display. as can be expected, i needed a timer loop to refresh the bars with current data.
now, i already knew about both setTimeout and requestAnimationFrame, but for whatever reason when i sat down to build that part a few weeks ago i stuck a "temporary" setInterval(updateStuff, 100) call in the init. which i then forgot about a week or two later when i added a requestAnimationFrame call to the end of updateStuff.
can you see the problem yet?
i didn't see the problem for a bit because modern browsers and computers are pretty damn fast. and the ui needed to be running for a bit for the problem to get really bad. when i'm constantly updating code/refreshing the page, it can be easy to miss things that don't immediately show themselves.
what i did notice when leaving this on in the background to play events as they happen is that data updates were taking a lot longer than they should...several seconds late according to the last updated indicator at the top of the ui. other things seemed to have odd delays too, but i latched on to the data update as that has a lot of moving parts...xhr, json processing, reconciling existing data with what the api has, etc. i added a lot of logging and time tracking, and found that everything inside the update process was taking exactly as much time as it should, but the runs of the update process were not happening when they should.
i did what any wannabe web dev would do next. i let a tab run til the update lag got bad, then i popped open dev tools and profiled for a few seconds. i saw a good 25% or more of my time was spent in display updates. this was not completely unexpected, there is a lot happening there, but it did seem higher than i expected. i also noted a lot of time being allocated to timer stuff, but didn't think much of it as i figured it was being counted against the timers because the timers were launching the display updates.
i knew my code was doing a lot of unnecessary refreshing of html components every frame that really only needed to update on certain events, so i reworked some stuff to drastically reduce the number of dom updates. this did make the ui a little more responsive, but the data update lag issue remained.
i went back to the profiler and found the story hadn't changed too much. but this time, i did zoom in on the timeline...i'd focused on the summary views previously. and i saw the critical thing. updateStuff calls were happening really, really often. way more than should be happening from requestAnimationFrame triggers. at 60fps frames are 16.67ms apart. these calls were happening sometimes multiple times per millisecond!
well that's clearly not correct. though i'd toggled updateStuff from requestAnimationFrame to setTimeout and back again a dozen times in the process, i decided to check the chain of events from the start...and there was that setInterval(updateStuff, 100) staring me in the face.
what was happening: every 100ms, setInterval was kicking off a run of updateStuff. which would then persist itself by enqueuing another run of itself via requestAnimationFrame. after just one minute, there's 600 updateStuff loops running. after an hour, 36,000. yeah, that might cause everything to slow down in js land, especially timer executions. i never noticed because updateStuff just brings the bars in line with current data...doing that more often doesn't change the output noticeably.
fixed the init, and now things are great. the other optimizations i did have their own benefits and needed to be done, so there's that. this one still made me feel pretty 🤦♂️.
i was going to lose the azure free ride in august anyway, but something happened that forced my hand to move towards the vosk solution i described in a prior post.
a few weeks ago, the machine that runs most of the software related to this project had a weird hang. some aspects of the operating system and my software were working, but a lot of things would just hang - including attempts to trace processes to understand why they were hanging. attempts to perform a clean shutdown also hung, so i had to hard boot the machine.
after that moment, i had a new problem. and i still don't know if there is any relationship between the original issue, the hard boot, and what happened after. the new problem didn't show up until 3-4 hours after the hard boot.
the new problem: lots of things, but particularly transcription, seemed to be taking really crazy amounts of time. but only sometimes, not consistently. tailing the log would normally show a constant stream of activity, but i was seeing pauses of anywhere from a few seconds to as much as 30-45 seconds.
i spent some time digging in from the python and os perspectives. the general story seemed to be that a tremendous amount of time was being spent on futex contention. nothing had changed in my software. there may have been some ubuntu updates applied on reboot. the python traces showed a bunch of waiting happening inside the azure sdk. watching the logs closely, it was clear that azure interactions were now blocking the rest of the back end app. each of the four sweepers should be able to make a call to azure while the back end app continues to do other things, and that's the way it has operated since i began using azure last year.
but now everything just stops when it's waiting on an azure response. a very multithreaded app suddenly operates very single threaded. ingestion - which was only turned into a threaded thing a few months ago - was badly lagging under even moderate traffic. the sweepers just couldn't keep up when they're constantly blocking each other.
i wasn't convinced the problem was the azure service or the azure sdk. but, i had already developed everything for the new solution. i had a server ready to run and a client module already added to the back end. it would be simple for me to switch to vosk and back to azure just to see how things behaved.
with the local transcription solution, everything was flowing great.
i looked at my options. i could continue to investigate the azure issue, to fix a thing that in a month i won't be using anyway. or i could push forward with the new solution.
i'd already burned a few evenings chasing the azure issues. i decided to cut my losses and focus on making the local transcription as good as i can.
i've done three iterations now of re-tuning the models after adding a batch of corrected transcriptions to the mix. sampling ~500 clips per iteration, i've been able to measure a distinct improvement in transcription accuracy as correct patterns get more common in the tuning corpus.
using a 1-5 rating scale, transcriptions with a high (4 or 5) rating increased from 38% to 54%. transcriptions with a low (1 or 2) rating decreased from 31% to 21%. the mid-point 3 rating volume changed the least, 29% to 23%.
the system appears to work. i'll continue to fix transcriptions directly and with scripted methods for common mis-transcriptions in the existing data. it'll never be perfect, the audio is often far too dirty for fully reliable transcription. but the transcriptions are getting pretty darn readable a fair bit of the time.
the audio processing stage of the sweeper attempts to solve a few problems:
- silent / effectively silent clips due to broken radios, low signal, accidental transmission, etc
- alert tones, intended to notify of specific classes of incidents and/or which units should respond
- volume levels can be wildly inconsistent on the original clips
all three of these can negatively affect the listening experience. the tones at least add a bit of "i'm sitting in a 911 center" vibe but the novelty wears off quickly.
i also decided to put the wav to mp3 conversion in this stage. i figure it's all audio processing. if you don't like it go make your own. now, on to how these are addressed.
silence detection: an early version of this focused on clips that likely represent an extreme low signal or decode error condition, as i found many clips whose audio was just a long run of 0 values - pure unadulterated silence. the initial filter looked for non-zero values in the audio data, rejecting the clip if none was found. there was plenty of "effectively silent" traffic missed by that filter, particularly accidental transmissions with a small amount of noise present but no speech. this was replaced with a filter looking at the average amplitude across the clip and rejecting if it is below a threshold. as a quality check, the rejected clips are copied to a separate location. i periodically inspect a sampling of rejected clips. the occasional legitimate but quiet transmission does get missed, but the false positive rate is low enough to be acceptable.
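the current filter boils down to something like this pydub check - the threshold here is an illustrative number, not the exact value in use:

from pydub import AudioSegment

SILENCE_THRESHOLD_DBFS = -45.0   # illustrative cutoff

def is_effectively_silent(wav_path):
    clip = AudioSegment.from_wav(wav_path)
    # dBFS is average loudness relative to full scale; pure digital silence
    # comes back as -inf, so the all-zeros case falls out of the same comparison
    return clip.dBFS < SILENCE_THRESHOLD_DBFS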
tone detection: this required me to get more friendly with Fourier transforms than I had in the past. i initially tried to approach the problem from a different direction, thinking i could take the amplitude values and do math on them as if it was a waveform. specifically, i figured i'd subtract a synthetic waveform representing a sine tone of whatever freq from the audio data and look for a big drop in average amplitude. i spent a lot of time here on how to generate the synthetic tones and synchronizing the audio frames and a bunch of other crap that in retrospect was misguided. the funny thing is it kinda worked, it did frequently find the tones...but the false positive rate was unacceptably high. this was replaced with a bit of numpy magic to fft up a frequency spectrum. i then look for spikes in the values at / within a few hz of the input frequencies. tones are primarily only an issue on dispatch talkgroups, so only clips from dispatch channels are checked. each jurisdiction has a different set of tones. the frequency list is adjusted to align with the clip's origin. i keep and periodically inspect these rejects too, but the false positive rate is almost zero.
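the fft check amounts to something like this - the tone frequencies, bandwidth, and spike ratio are illustrative values, not any jurisdiction's real config:

import numpy as np
from pydub import AudioSegment

def has_alert_tone(wav_path, tone_freqs, bandwidth_hz=5.0, ratio=20.0):
    clip = AudioSegment.from_wav(wav_path).set_channels(1)
    samples = np.array(clip.get_array_of_samples(), dtype=np.float64)

    # frequency spectrum of the whole clip
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / clip.frame_rate)
    baseline = np.median(spectrum) + 1e-9

    # flag the clip if any expected tone frequency towers over the baseline
    for tone in tone_freqs:
        window = spectrum[np.abs(freqs - tone) <= bandwidth_hz]
        if window.size and window.max() / baseline > ratio:
            return True
    return False

# e.g. a jurisdiction whose dispatch tones sit near 850 and 1050 hz (made-up values)
print(has_alert_tone("clip.wav", [850.0, 1050.0]))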
volume normalization: volume levels are corrected through straightforward normalization to a target average volume. nothing fancy here, just leveraging some aspects of pydub since i was already using it to read the raw audio data.
mp3 conversion: i use pydub's export method to do this, which is really just using subprocess to call ffmpeg under the hood. this allows passing additional params to ffmpeg. i use this to customize the mp3 options for a better size:quality balance than the defaults and apply bandpass filters to remove a bit of noise.
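both the normalization and the export fit in a few lines of pydub. a sketch of the pair - the target level, bitrate, and filter corners are example values, not the real settings:

from pydub import AudioSegment

TARGET_DBFS = -18.0   # illustrative target level

def normalize_and_export(wav_path, mp3_path):
    clip = AudioSegment.from_wav(wav_path)
    # flat gain adjustment to hit the target average loudness
    clip = clip.apply_gain(TARGET_DBFS - clip.dBFS)
    # pydub's export shells out to ffmpeg; `parameters` is passed straight
    # through, which is where the bandpass filtering happens
    clip.export(
        mp3_path,
        format="mp3",
        bitrate="32k",
        parameters=["-af", "highpass=f=250,lowpass=f=3500"],
    )
    return mp3_path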
the audio processing stage returns either the rejection reason or the path of the exported mp3, depending on how the clip is adjudicated. the sweeper proceeds to insert a record into the database if the clip is accepted.
since last august, i've been using azure's free tier speech transcription to transcribe all clips >=2s in length. the free tier lasts 12 months. once that's up, i'd have to start paying a few hundred dollars a month for the privilege. and hey, it's a good transcription service. it's just more than i'm willing to spend monthly on this project.
before i went down the azure path, i tried a few different transcription models that can run locally. i had mixed results, and ended up down a deep rabbit hole of trying to train an 8khz radio quality model from scratch and none of it ever worked anywhere near as well as i wanted it to. the azure solution just worked, and worked great, so i put local transcription aside for a while.
and now here we are, it's may. june, july, then the deadline. it's time to pick up the old threads.
the good news is i probably had one hell of a "you're holding it wrong" moment a year ago. i have since noticed that the resampling algorithm is really a make or break factor as to whether 8khz audio can be recognized by a model trained on 16khz audio.
i think i was using something pretty shit before. oops.
anyway, i noticed some pretty good results when i installed the latest version of vosk for python, plugged in the high accuracy english model, and used their recommended ffmpeg settings for the resample.
it was good, but still pretty far from azure's accuracy. and it was sloooow, with transcription often taking 2x the audio duration.
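for reference, the recognition setup is close to the stock vosk ffmpeg example - decode and resample to 16 khz mono pcm with ffmpeg and feed the recognizer from a pipe. a sketch (model and clip paths are placeholders):

import json
import subprocess
from vosk import KaldiRecognizer, Model

SAMPLE_RATE = 16000
model = Model("vosk-model-en-us-0.22")   # the big "high accuracy" english model
rec = KaldiRecognizer(model, SAMPLE_RATE)

# resample to 16 khz mono signed 16-bit pcm and stream it to the recognizer
proc = subprocess.Popen(
    ["ffmpeg", "-loglevel", "quiet", "-i", "clip.wav",
     "-ar", str(SAMPLE_RATE), "-ac", "1", "-f", "s16le", "-"],
    stdout=subprocess.PIPE,
)

while True:
    data = proc.stdout.read(4000)
    if not data:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])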
then i got a crazy idea. vosk models (at least those of the lgraph type) can have their grammar tuned with an input corpus. while azure's transcriptions are not perfect, could i use the giant pile of azure transcription data i have in the db to train vosk?
i started the experiment with about 6 months of transcriptions, all piled together with punctuation removed, numbers converted to words, everything lowercased, and words that the model doesn't know pulled out. i updated the grammar and tried a few tests...and it was better! sometimes it even got something that azure missed, though azure was still more accurate overall.
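the normalization itself is nothing special - roughly this, using num2words for the digit conversion (which may not be exactly how the real pipeline does it) and a known-words set standing in for the model's word list:

import re
from num2words import num2words

def normalize_line(text, known_words):
    text = text.lower()
    # spell out any run of digits, e.g. "110" -> "one hundred and ten"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # drop punctuation, keep letters and spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # keep only words the model actually knows
    return " ".join(w for w in text.split() if w in known_words)

print(normalize_line("Engine 110 responding to Route 9.",
                     {"engine", "one", "hundred", "and", "ten",
                      "responding", "to", "route", "nine"}))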
as a bonus, this seemed to give about a 4x improvement in transcription speed. i presume this is the revised grammar constraining the search paths or something like that. i dunno, i'm making this up as i go along.
with azure i can send along a list of words to bump the probability of, and i customize this for the type of talkgroup to get a little edge on accuracy. i realized i could do something similar with vosk, models with grammars tuned for different types of traffic.
to make things easier, i made a set of scripts that extracts a time range of calls from the db, splits by type, normalizes the transcriptions, and builds custom models for law, fire/ems, medical, and bus groups. initial results are looking pretty good.
it'll get even better with more accurate transcriptions as input, so a free time activity will now be fixing up transcriptions in the db to correct azure's errors. periodically i can re-run the grammar generator and theoretically get slightly better transcriptions over time.
thanks microsoft, the free stuff has been awesome to use. i may use the paid stuff in a more limited fashion, but i believe i have a way forward to reasonable local transcription.
as to the filesystem restructure from the previous post - it's running now, and i found a perf ceiling at 6 threads. 7 threads gives a net reduction in throughput. it'll probably take 30-35 days total to do everything.
in a previous post i mentioned that the number of files was becoming a scaling problem. i brought up the issue to a few people who tend to have opinions about storing things at scale and it generally sounds like what i'm doing - a plain ol' filesystem plus a db to index things / hold metadata - is reasonable. there are solutions to go much fancier but many of them would still be depositing individual file objects on a filesystem somewhere, just perhaps managed by something that isn't me or software i wrote.
the real problem is the number of files per directory. this is something i should have foreseen...at work i've long interacted with a file storage service that limits the number of files in any given directory through a directory splitting strategy. anyway, now there's at least one talkgroup dir with >1M files and several others in the 100K-1M range. these huge directories don't really matter for most uses of the system, as there's never a reason to list or stat all the files in a talkgroup dir. the wav->mp3 transcode step drops the result in the specified output dir. reads are all direct, getting path info from a DB call or the conversation API. unfortunately, backups and system management tasks can trip over the huge directories.
the plan: migrate from /[tg_id]/ to a date-based directory layout: /[tg_id]/[year]/[month]/[day]/ . for example, /9002/2022/05/09/ for the files associated with talkgroup 9002 on may 9, 2022. this is expected to keep the count to a max of 2000-3000 files per directory, even on a busy talkgroup.
there's 3 major areas that need to be touched to make this work. i've now started work on two of them.
- historical rewrite: there are ~8M files with a corresponding number of db entries (well, now it corresponds, there were some dupe issues. that was recently cleaned up, though there's at least one source of a small trickle of dupes i need to track down). each of these files needs to be moved to its new location and the db needs to be updated with the new storage path. this is done in the sense that there is a series of functions i can hand a storage path to; they determine whether it's one that needs to move, do the move, and update the db (see the sketch after this list). doing this one at a time is slow but is likely safest, and i really want to avoid being in the position of writing a lot of complicated fixup sql scripts. i am going to throw threads at it and see how far i can get with that. honestly i'm ok if this takes weeks, it's not urgent and everything gets slowly better along the way.
- sweeper changes: the sweeper needs to start depositing files in the date-based structure. as the sweeper generates the URLs seen in the live data APIs, several bits needed to be updated to ensure the front ends get correct paths. this was several small changes and had one 'oh shit' moment where nothing worked when i restarted the back end, but it was an easy fix. this is effectively done, though i expect to find some random problems over the next several days as i find features broken in subtle ways.
- event rewrite: bit of a different case than the individual file rewrites. the events table in the db holds records with data about several associated files, including the paths. this was done to avoid having to request this information from the db every time, which could be hundreds or thousands of requests if done per file. 'stored report output' is not the worst analogy for an event record. events with files from before the sweeper change went live will need to have their paths updated. as the correct path can be inferred from the original path and filename, it shouldn't be hard to write something that iterates over events / the files referenced within and updates the paths. i'm pretty sure i'm the only one who ever uses the event stuff in any kind of historical sense, and everything going forward is fine because the sweeper is fine, so i'll fix this when i feel like it.
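for what it's worth, the historical rewrite sketch referenced above looks roughly like this. the storage root, table name, and column names are placeholders, and the real code does more sanity checking than this:

import os
import shutil
from datetime import datetime

BASE = "/storage/clips"   # placeholder storage root

def migrate_one(conn, old_path, clip_timestamp):
    # conn is an open sqlite3 connection; clip_timestamp comes from the db row
    rel = os.path.relpath(old_path, BASE)
    parts = rel.split(os.sep)
    if len(parts) != 2:
        return old_path   # already /tg/year/month/day/file, nothing to do

    tg_id, filename = parts
    day = datetime.fromtimestamp(clip_timestamp)
    new_dir = os.path.join(BASE, tg_id, day.strftime("%Y"),
                           day.strftime("%m"), day.strftime("%d"))
    os.makedirs(new_dir, exist_ok=True)

    new_path = os.path.join(new_dir, filename)
    shutil.move(old_path, new_path)
    conn.execute("UPDATE clips SET path = ? WHERE path = ?", (new_path, old_path))
    conn.commit()
    return new_path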
there's some other stuff to do too. the backups will need some manual cleanup and some of that might run a long time. i need to poke at some of the lesser-used parts of the UI and see what's broken when i look at stuff since the sweeper change or historical stuff that's been moved. and closely watch the metrics for any signs of weirdness.
guess we'll call this spring cleaning, or something.
as of this writing, i'm accumulating data from four p25 digital trunking radio systems. on each system, there can be several simultaneous transmissions. it's a lot of audio, and it's frequently concurrent audio. mixing it all together to overlap would be the most real-time experience, but also awful. playing every clip purely on a first come, first played basis is easier to listen to, but requires a lot of mental effort to track who's talking to who about what on which talk groups.
in a previous post i described how the sweeper uses the info it already has at the point of ingesting a clip to create 'conversations' out of clips that belong to the same talk group and are within some time boundary of each other. rather than clips, conversations are the primary unit of playback in the live interface. for a while, that was as far as things went. conversations were played back in order of start time.
in the interests of cohesive listening for larger events, i added a feature i called 'priority queueing' for lack of a better term. if there is a conversation in the queue that is from the same talkgroup as the currently playing conversation, and that upcoming conversation is within a time threshold, that conversation is brought to the front of the queue. the number of queue jumps is tracked, and if the chain gets too long the priority treatment ends.
implemented based solely on talkgroup, this wasn't bad...but there were two problems.
- fire tac talk groups: in a fire/ems event involving more than a minimal number of responders, there will usually be one or more 'tac' talk groups assigned. this allows the responders on that incident to talk among themselves without interrupting the dispatch tg. following only the tac tg is an incomplete picture. the initial dispatch happens on the dispatch tg (duh), as well as units indicating their response or inability to respond. the incident commander will often hop back to the dispatch tg to pass a priority request to the dispatchers. and larger incidents will have multiple tac channels for different needs - command, different crews, traffic control, helicopter coordination, etc. jumping way forward on the dispatch timeline then back several minutes on the tac timeline and walking forward again as each tg was priority chained wasn't great.
- law enforcement talk groups: these were getting just a bit too much priority. traffic stops mean there's a lot more background noise, lots of opportunity to chain. this was causing delays in playing more interesting traffic, and sometimes causing that other traffic to expire out of the queue before it could be played.
to fix the fire/ems issue, i added jurisdiction tracking. if the tg has the fire/ems tag, upcoming conversations can match by exact tg or by fire/ems tagged tg in the same jurisdiction. segments of larger events get chained together and the sequencing of playback is easier to follow.
to fix the police issue, i added a check to look at the next item in the queue. if that conversation is also for a law enforcement tg, regardless of jurisdiction, priority queueing is allowed. if it's anything else, the next item is played. this can also apply if a law enforcement chain is active and queue expiration moves something not law enforcement to the top. the chain breaks and the next conversation is played.
since adding frederick county, md's system to the mix, it's been far more common for the queue to build to levels where a lot of conversations are expiring out of the queue. continuing with the theme of de-prioritizing the often mundane law enforcement traffic, i've added an overflow mode. when the expiration rate gets too high, law enforcement traffic gets ignored entirely. it's still all being tracked in the background, but it's kept separate from the actively playing queue. when the expiration rate gets better, law enforcement traffic is allowed to return to the main queue.
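condensed way down, the core selection logic looks something like this. it's a sketch, not the real code - the tag names, thresholds, and data shapes are illustrative, and overflow mode is left out:

from dataclasses import dataclass, field

CHAIN_WINDOW = 120   # seconds an upcoming conversation may trail the current one
MAX_CHAIN = 6        # how many queue jumps before priority treatment ends

@dataclass
class Conversation:
    tg: str
    start: float
    tags: set = field(default_factory=set)
    jurisdiction: str = ""

def chains_with(current, candidate):
    # does the candidate qualify for priority treatment after the current one?
    if candidate.start - current.start > CHAIN_WINDOW:
        return False
    if candidate.tg == current.tg:
        return True
    # fire/ems: also allow a fire/ems tagged tg in the same jurisdiction
    return ("fire-ems" in current.tags and "fire-ems" in candidate.tags
            and candidate.jurisdiction == current.jurisdiction)

def pick_next(queue, current, chain_len):
    # queue is ordered by start time; returns (conversation, new_chain_len)
    if not queue:
        return None, 0
    if current is not None and chain_len < MAX_CHAIN:
        for i, cand in enumerate(queue):
            if not chains_with(current, cand):
                continue
            # law enforcement only jumps the queue if the head of the queue
            # is also law enforcement; otherwise just play the head
            if "law" in current.tags and "law" not in queue[0].tags:
                break
            return queue.pop(i), chain_len + 1
    return queue.pop(0), 0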
this probably isn't the end of my journey to create my optimal live listening experience. that said, the combination of priority queueing and overflow mode is working nicely to keep playback reasonably close to when the audio was transmitted and focused on traffic that's more likely to be interesting to me.
the sweeper now uses one thread per radio system for a total of 4. that's enough granularity for the moment. i could theoretically go nuts with this approach and do one thread per talkgroup and probably still not have to modify any of the bits that mandate time-ordered clips.
i did it in a pretty simplistic way, just adding a pattern param to the sweep function. this pattern is matched against the filenames and only the matching ones are ingested by that sweeper run. then it was just making a list of patterns and spinning up a thread for each.
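the shape of it is roughly this - the drop dir, the patterns, and ingest() are stand-ins for the real pieces:

import fnmatch
import os
import threading
import time

DROP_DIR = "/srv/radio/drop"   # placeholder
SWEEP_INTERVAL = 2             # seconds between sweeps
SYSTEM_PATTERNS = ["sysA_*.wav", "sysB_*.wav", "sysC_*.wav", "sysD_*.wav"]

def ingest(path):
    ...   # audio processing, transcription, db insert, conversation building

def sweep_loop(pattern):
    while True:
        for name in sorted(os.listdir(DROP_DIR)):
            if fnmatch.fnmatch(name, pattern):
                ingest(os.path.join(DROP_DIR, name))
        time.sleep(SWEEP_INTERVAL)

for pattern in SYSTEM_PATTERNS:
    threading.Thread(target=sweep_loop, args=(pattern,), daemon=True).start()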
this mostly worked, but one thing broke fairly badly. i half anticipated it and should have just dealt with it from the beginning. there's a close_conversations function that gets called at the end of the sweeper. as the name suggests, it looks for conversations that have been inactive too long and closes them. it also cleans up old entries and empty talkgroups in the conversations object.
close_conversations takes a parameter, 'active_tgs'. the sweeper tracks which talkgroups were seen during the run and passes this on to close_conversations. the idea of this is that talkgroups with activity in the current run should be excluded from closure rules. this prevents early closure due to a long clip being processed among other smaller clips.
with multiple threads running, each thread has its own view of which talkgroups are active - and these sets will never intersect given the way the work is split. i began to notice conversations were being ended early compared to past observations. which makes sense, as every thread was calling close_conversations independently and only passing its own talkgroups along.
the other problem was collisions when cleaning up the conversations object. two threads would identify something to delete, one would win the race, the other would throw an exception. in a sign that this overall effort was a good idea, at least only one sweeper thread would die! the others would keep going, til more collisions happen.
to fix this, i changed the sweeper threads to post recently seen talkgroups to a shared object. close_conversations uses this instead of a list passed in. also, close_conversations now runs on a fixed interval on its own thread.
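roughly, the shared object and the closer thread look like this - interval and grace values are illustrative, and close_conversations() is a stub standing in for the real logic:

import threading
import time

_active_tgs = {}    # talkgroup -> last time any sweeper thread saw it
_active_lock = threading.Lock()

def mark_tg_active(tg):
    # called by every sweeper thread as it ingests a clip
    with _active_lock:
        _active_tgs[tg] = time.time()

def close_conversations(exclude):
    ...   # the real closing / cleanup logic lives here

def closer_loop(interval=15, grace=120):
    while True:
        cutoff = time.time() - grace
        with _active_lock:
            recently_active = {tg for tg, seen in _active_tgs.items() if seen >= cutoff}
        close_conversations(exclude=recently_active)
        time.sleep(interval)

threading.Thread(target=closer_loop, daemon=True).start()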
that fixed the exceptions, and it made a dent in the early conversation end problem...but didn't fully resolve it. between the original sweeper changes and the later addition of the close_conversations thread, close_conversations is being invoked far more often than it was previously. i've bumped several of the time duration values used in the closing logic; the previous values were effectively being inflated by how the sweeper and close_conversations interacted prior to the threading work.
since adding the threading, i've noticed a number of periods where the ingestion rate is above what was possible before. overall, this is a success, and was far less painful than i expected. this is less a testament to me being good at this and more that python makes simple threading easy and the gil saves me from most problems.
originally, the sweeper was pretty simple. it watched the drop dir, it parsed the filenames, it moved the files to the appropriate talk group dirs, it recorded their locations and metadata in the db.
the sweeper's dependencies were light. it imported the config and database modules from the rest of the project but otherwise didn't need to know about much.
the simplicity was ruined, as often happens in software, by a feature. i wanted to assemble individual clips into conversations, representing a burst of activity on a talk group (and later groups of talk groups).
i first went after the conversation problem with db queries. i mean, that's what the db is for, right? and it's not so bad for historical stuff, especially if you run the queries once and store the results in some other table. but real time queries? too slow for that.
then it hit me - the sweeper sees all the same data that goes in the db. it holds it in memory right alongside the rest of the back end stuff. well, til it writes it to the db, then it discards it.
but what if it didn't discard it? what if the sweeper kept track of recent clips and assembled them into conversations? it can write those to the db too. or i could hold a buffer of recent conversations and use that as a somewhat more comfortable live listening experience.
and thus, the 'liveplayer' interface, the one on the twitch stream, was born. the front end watches the buffer via periodic api calls, looks for new conversations, and adds them to the playlist.
to make all that work, the sweeper got a bunch of densely nested logic to determine when to start, continue, and end conversations. the rules are different for different kinds of talk groups. if i used the same rules for law enforcement as for fire/ems, i'd have a string of 20+ segment conversations consisting of nothing but license/registration checks. too chunky for the live experience.
then i made things worse by adding an audio processing step to ingestion. this wasn't too bad. i put all that in a separate module. but some of the stuff i was doing there would determine whether or not the file would continue - excluding silence and tones, mostly. the sweeper has to inspect the result to know whether it should move the file to the destination and update the db, or just delete the file.
then transcription came into the picture. this drove a change to the input data - i had been letting the radio software give me mp3 files, but it could also give me wav, and tests showed much better recognition with the wav data.
so now the sweeper needs to call the transcription module, with the wav data available. and i still want to store a small mp3 rather than a somewhat larger wav. the audio processor gets updated to write an mp3 if it passes the silence/tone test. the sweeper still handles cleaning up the wav either way.
then i wanted a way to use the liveplayer, but for past time periods. so i stuffed a bunch of crap into the sweeper that allowed me to use the same conversation building stuff on any list of clips.
most recently, i added another layer of accumulation to group semi-adjacent conversations into events. this has some similar twisty logic to the conversation builder for similar reasons.
and now sweeper.py is 1000 lines long, it has a ton of internal and external dependencies, and the indent depth gets absolutely obscene. but it's just so convenient. every clip passes through, all the info is right there. so much juicy context already available at basically no resource cost.
i should honestly use this as an opportunity to rewrite significant chunks of the sweeper, but lol not gonna. we'll see how well i can chunk it up for threading in its current state first.
i recently added another remote listening post in frederick, maryland. the frederick county radio system is pretty active. alongside everything else, i'm now frequently exceeding the system's capacity for audio ingestion.
there's variance particularly in how long transcription takes, but on average the system can ingest about 1.7 seconds of audio in 1 real second. This is actually improved a bit compared to the very recent past - the changes described in the 2022-03-16 post brought this up from around 1.3 seconds per second. not doing a bunch of useless stat() calls is a good thing.
nothing gets lost, but the ingestion lag grows. the live listening interface discards conversations that are more than 15 minutes in the past (ambulance talk groups get 45 minutes because i love ambulance reports). if ingestion lag is over 15 minutes, every conversation is aged out as soon as it's made available to the listening interface. ironically, too much audio results in nothing to listen to.
still, it's a problem i'd like to address. i am considering more remote listening posts. i am also thinking about some other factors regarding capacity and stability.
the hardware running the ingestion and web stuff is a 2010 era core i3. the motherboard/processor were originally purchased to be my home fileserver, a job they did for 5 years. i then decided to get a nas and move the fileserver duties there, and repurposed the i3 as a dev project box. as it still had the disk array from its previous life, it also serves as a backup target for the nas. by the way, that nas has since died and been replaced by another nas. this hardware has outlived the thing it was replaced by! talk about value out of a budget processor. that said, it is 12 year old hardware. that shit's going to fail eventually, and i don't know when that will happen.
the data is getting sizable. the sqlite db backing all this is now just shy of 2GB. the audio clips are stored as individual files, a decision originally intended to ensure i could always re-create the db by re-crawling the files. it'd be slow but doable, and it was something i used frequently in the beginning while stabilizing on a few schema decisions.
the number of clips is very large, approaching 7.5 million as of this writing. the clips are stored in dirs named by the talk group the clip belongs to. particularly active talk groups are likely to have several hundred thousand clips. 'du' and other operations take forever to return anything. at least 'df -i' shows i still have plenty of inode runway. still, i know i can't reasonably manage these files as structured on a regular filesystem.
that's ok right now because the db has all the metadata. the files never have to be enumerated because the db already knows their exact path. but i definitely couldn't just rebuild the db...well, i could, but it'd take weeks, and i'd probably have to get creative about some of it. even restoring from backup would take forever just because of the sheer number of file ops.
what to do about all this?
- sweeper parallelization: mentioned in the previous post, modify the sweeper to use threads, and split the work by radio system. clip time ordering is only important within a talkgroup or associated talkgroups, and that stays within the system boundary. with 4 systems, this could represent up to 4x more throughput. unlikely to be that large, there may be some other constraint. also, this is written in python, so the threads are "threads". that said, 70-80% of the ingestion time is spent waiting on the transcriptions to come back from azure. that's a pretty good target for python threading.
- hardware upgrade: the project is about 1.5 years old now and the data sizes are getting large; maybe it's time to give it its own home. plenty of options here - refurb server, build something, etc. i'd want some fast, reliable ssd for the db itself...the db gets far more reads than writes and the writes are mostly additive, so i'm not too worried about ssd write lifespan. the files themselves can go on spinning disks; as long as it's reasonably fast, the latency is fine. given the age of the hardware and contention with other stuff on it, this is likely to give some substantial performance improvements, though it's hard to say how much. it also gets the project off such old hardware and i can worry a bit less.
- some kind of file vault solution: there's a number of things out there that attempt to make the "giant pile of files makes my fs suck" problem go away. i need to review that space and consider what might be applicable for this situation. the simple layout has been extremely convenient so i'll be looking for things that replicate this experience as much as possible.
in terms of ordering these, the sweeper parallelization work will have to happen at some point, so i might as well go ahead and do it. plus that gives me time to consider file vault stuff. that could have implications for the hardware upgrade. i could end up punting the file vault decision and buying hardware first.
the sweeper change is substantial work and will require a fair bit of testing...i'll probably need to stand up a minimal replica of the environment. i figure it's probably a couple weekends' worth of effort. even with testing it'll be a nervous moment when the change goes live. the sweeper is a lot of dense code, some of it dating back to the beginning of the project. there's a lot of ways to break. hell, there's a couple rarely hit bugs in it i haven't gotten around to tracking down. one of those could blow up in all this.
so yeah. think i'll start with creating that test env.
the original problem: every once in a while, an audio file was getting truncated. more audio clearly existed at some point, because the transcription kept going past where the audio stopped, and the transcription data was sensible. real, understandable audio existed, and then it went away.
implementation note: the "sweeper" picks up .wav files from a directory, sends it off to azure for transcription, transcodes to a smaller .mp3 for long term storage, then deletes the .wav
first hypothesis: something is causing the wav to get deleted after transcription but before transcoding fully completes. this was quickly disproven through log analysis of truncation events.
second hypothesis: something screwy in some of the wav data is making the mp3 transcoding step bomb before the end of the file. could not find any log evidence of weird errors thrown by ffmpeg, but i wanted to capture the original wav for such an event and try to replicate the issue.
efforts moved towards finding ways to detect the problem in the sweeper code and move suspect .wav's to a location for analysis rather than deleting them. the first attempt compared the audio duration as inferred from the .wav file's size at the start of sweeping against the value produced by pydub just before the transcode. this failed to detect the problem. later, i added an additional os.stat() call after all the processing, just before deletion.
this did give me a few candidate .wav files pretty quickly (and also foreshadowed the actual problem). those files were...just fine. i could transcribe and transcode them without issue. there's nothing wrong with the wav files. but the size reported by os.stat() was different at the end of the process...
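the check that finally produced candidates was roughly this shape (a sketch with made-up names; transcribe_and_transcode stands in for the normal processing):

import os
import shutil

ANALYSIS_DIR = "/path/to/suspect-wavs"  # placeholder location

def transcribe_and_transcode(wav_path):
    # stand-in for the usual transcription + transcode steps
    ...

def process_with_size_check(wav_path):
    size_before = os.stat(wav_path).st_size
    transcribe_and_transcode(wav_path)
    size_after = os.stat(wav_path).st_size  # the extra stat() just before deletion
    if size_after != size_before:
        shutil.move(wav_path, ANALYSIS_DIR)  # keep the suspect file for analysis
    else:
        os.remove(wav_path)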
implementation note: the directory watched by the sweeper is on a SMB/CIFS share, served up by a NAS. audio files are being dropped here by a few systems. knowing this was a potential source of metadata confusion, i'd added a guard a long time ago to the sweeper's file enumeration step. get the list of files and sizes, then get it again and exclude any files that changed size. let them go in the next run. a log search showed this guard hadn't triggered in the last 30 days.
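the guard looked roughly like this (again a sketch, not the actual code):

import os

def stable_wavs(directory):
    def sizes():
        return {name: os.stat(os.path.join(directory, name)).st_size
                for name in os.listdir(directory) if name.endswith(".wav")}
    first = sizes()
    second = sizes()
    # anything that changed (or appeared) between the two snapshots waits for the next run
    return [name for name, size in second.items() if first.get(name) == size]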
hm. i began to wonder about filesystem metadata caching and realized how little i knew about how that works, how to monitor it, or how to control it.
third hypothesis: metadata caching is keeping the guard from working, and the sweeper is moving forward before the cache is updated with the final size of some files.
i'd need to do some reading to understand my options on the caching front. in the meantime, i added a quick hack to the size change guard: wait one second between the checks. log review showed that the guard was now catching things, and the os.stat() check at the end was no longer finding differences. the problem was fixed. but the story isn't over...
new problem: when lots of audio is coming in, the system backlogs much earlier and much worse than before the changes
implementation note: the ingestion process is serial and tries to ingest files in the time order they were created. downstream stuff that processes clips into conversations and conversations into events expects the data to be time ordered.
in hindsight, this problem was inevitable. a step of the sweeper run that used to take a few milliseconds was now taking 1 second plus a few milliseconds. if only there were some way to control the attribute caching...
from the mount.cifs man page:
actimeo=arg
The time (in seconds) that the CIFS client caches attributes of a file or directory before it requests attribute information from a server. During this period the changes that occur on the server remain undetected until the client checks the server again.
the default is 1 and a quick check of the active mount options confirms it is, indeed, 1. that explains why the 1 second wait was fixing the original issue.
hypothesis: my use case would be better suited with actimeo=0. then i could remove the 1 second wait! alas, actimeo is an integer number of seconds, so there's no way to ask it to cache for less than one second.
i remounted with actimeo=0 and removed the 1sec wait. the guard was still working and i wasn't getting truncated files! and the time spent in file enumeration was still in the tens of milliseconds! progress!
-- one day later --
new problem: the ingestion backlog is growing, starting from around 7:30am. noticed it around noon when i went to listen a little between work meetings. backlog is almost 2 hours with a queue depth around 2000. file enum times are enormous.
hypothesis: whether or not the actimeo=0 setting led to the backlog, it's probably making it worse. i can't get back to low latency ingestion without reducing enum time, deleting pending work, or a lucky few hours of quiet on the airwaves.
stopgap plan: remount with actimeo=1. don't put the 1sec wait back, it'll be a while before the queue gets to recently created files anyway.
the queue backlog drained out and latency got back to normal a couple hours later...this happened to be a rather heavy volume day due to some drill activities being run in one of the monitored jurisdictions. though i was happy the mitigation plan worked, i couldn't ignore that the system was now exposed to the original audio truncation issue again. back to square one.
implementation note: the sweeper only processes up to 5 files on each run. the sweeper updates a lot of UI-facing data at the end of its runs, so long sweeper runs cause stale data on the UI. the sweeper would get a list of files, stat() each of them, then stat() them again and look for size changes, then sort the list by filename (the filenames start with a sortable timestamp) and take the first 5.
head-slap moment: i'm generating a list of files and i'm stat()'ing ALL of them. and i'm going to only ever process up to 5. so if there's 2000 files in the dir why am i stat()'ing 1995 more than i need to? and without attribute caching every single one of those calls is guaranteed to be expensive! the problem even gets worse as the file count increases! this is starting to make sense!
new plan: change the file enumeration to get the list of files, sort by filename, take the first 5, then do the back to back size check and move forward with the unchanged ones. remount with actimeo=0 and observe, then determine if a wait needs to be added between the size checks.
things looked great at first, but within a couple hours there was a truncated audio event. the size check was picking up differences even without a wait, but some were slipping through. i added a 10ms wait, and checked the logs a few hours later. no truncations, and i can live with 10ms bloat to sweeper run times.
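for reference, the enumeration flow ended up roughly like this (sketch only; the 10ms settle is the wait mentioned above):

import os
import time

BATCH_SIZE = 5
SETTLE_SECONDS = 0.01  # the 10ms wait between the two size checks

def pick_batch(directory):
    # sort and trim to the batch first, then stat only those files
    names = sorted(n for n in os.listdir(directory) if n.endswith(".wav"))[:BATCH_SIZE]
    paths = [os.path.join(directory, n) for n in names]
    before = {p: os.stat(p).st_size for p in paths}
    time.sleep(SETTLE_SECONDS)
    after = {p: os.stat(p).st_size for p in paths}
    return [p for p in paths if before[p] == after[p]]  # changed ones wait for the next run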
an unrelated issue caused the ingestion to stop at midnight the next evening. when i restarted it the following morning, the large queue backlog drained pretty quickly in comparison to the previous event - about 30-40 minutes vs 2 hours. the current state of affairs is acceptable.
it's also giving me ideas about splitting up some of the sweeper's work. i don't want to re-work the downstream bits that expect time-ordered data, but the ordering only really matters within a particular set of talkgroups. so i can probably split this up a bit and allow a degree of parallel sweeping without affecting the rest of the system.
as to the original issue and what was happening with truncation - pydub appears to trust the file size it sees at first and computes the duration right then. even if more data becomes available, it's not going to use it. since the original size change guard was nonfunctional due to the attribute cache, files could still have an incorrect size reported at that stage.