DjsBlog: Photo Timing Video - Video Stitching Detecting Gun Sound

The video stitching app AthStitcher has been further updated to improve detection of the race start from the gun sound. AthStitcher can rip the audio from the video and measure the loudness of each sound frame. This update fixes an issue that occurred in the GUNSOUND race start mode when the video was silent except for an added gunshot sound.

Code

See AthStitcher app code repository here: djaus2/PhotoTimingDjaus on GitHub.

About

The sound frames don’t align with the video frames, but each has a time stamp. If the video is started before the start of a race and the gun fire is recorded in the video's audio, the race start can be determined from the positive edge of the gun sound, that is, the sudden spike in loudness. The app uses FFMpegCore (available on NuGet); the loudness extraction shown here runs the ffmpeg executable as a process:

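    using System.Diagnostics;

    // Run ffmpeg and print the RMS level of each audio frame (astats + ametadata)
    var process = new Process
    {
        StartInfo = new ProcessStartInfo
        {
            FileName = "ffmpeg",
            // "-f null nul" discards the encoded output; "nul" is the Windows null device
            Arguments = $"-i \"{inputPath}\" -filter_complex \"[0:a]astats=metadata=1:reset=1,ametadata=print:key=lavfi.astats.Overall.RMS_level\" -f null nul",
            RedirectStandardError = true, // ffmpeg writes the metadata lines to stderr
            UseShellExecute = false,
            CreateNoWindow = true
        }
    };

    process.Start();
    // Read all of stderr before waiting so a full pipe cannot block ffmpeg
    string ffmpegOutput = process.StandardError.ReadToEnd();
    process.WaitForExit();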

This generates data (which is saved to a file to be read back later) such as:

[Parsed_ametadata_1 @ 000002665dd22b80] lavfi.astats.Overall.RMS_level=-inf
[Parsed_ametadata_1 @ 000002665dd22b80] frame:3    pts:3072    pts_time:0.0696599
[Parsed_ametadata_1 @ 000002665dd22b80] lavfi.astats.Overall.RMS_level=-inf
[Parsed_ametadata_1 @ 000002665dd22b80] frame:4    pts:4096    pts_time:0.0928798
[Parsed_ametadata_1 @ 000002665dd22b80] lavfi.astats.Overall.RMS_level=-inf
[Parsed_ametadata_1 @ 000002665dd22b80] frame:5    pts:5120    pts_time:0.1161
[Parsed_ametadata_1 @ 000002665dd22b80] lavfi.astats.Overall.RMS_level=-130.743093
[Parsed_ametadata_1 @ 000002665dd22b80] frame:6    pts:6144    pts_time:0.13932

The data is parsed to extract the frame number, the time from the start of the video, and the loudness in dB. Loudness is on a logarithmic scale, so -inf means silence and -2 is very loud. Note that the information for each sound frame takes two lines. The data is saved as a CSV file (one line per sound frame) such as:

Frame,PTS,PTS_Time,Volume
0,0,0,"-inf"
1,1024,0.02322,"-inf"
2,2048,0.0464399,"-inf"
3,3072,0.0696599,"-inf"
4,4096,0.0928798,"-inf"
5,5120,0.1161,"-inf"
....
12,12288,0.278639,-40.729261
13,13312,0.301859,-40.824955
14,14336,0.325079,-43.781575
15,15360,0.348299,-44.402752
....
97,99328,2.252336,-37.267658
98,100352,2.275556,-37.303841
99,101376,2.298776,-6.616751
100,102400,2.321995,-2.16881
101,103424,2.345215,-2.94986
102,104448,2.368435,-2.640969
103,105472,2.391655,-2.719186
104,106496,2.414875,-3.441489
105,107520,2.438095,-4.572362
106,108544,2.461315,-7.539401
107,109568,2.484535,-9.858317
108,110592,2.507755,-12.040525
109,111616,2.530975,-15.757882
110,112640,2.554195,-17.298159
111,113664,2.577415,-19.362972
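The two-lines-per-frame parsing described above could be sketched as follows. This is a minimal illustration, not the AthStitcher code: the `SoundFrame` record and `LoudnessLogParser` class are hypothetical names, and the sketch assumes each `frame:` header line comes before the `RMS_level` line that belongs to it. Silence (`-inf`) is mapped explicitly to negative infinity rather than relying on the numeric parser.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical type: one parsed sound frame (level in dB).
public record SoundFrame(int Frame, long Pts, double PtsTime, double VolumeDb);

// Hypothetical helper, not taken from the AthStitcher source.
public static class LoudnessLogParser
{
    static readonly Regex FrameLine =
        new Regex(@"frame:(\d+)\s+pts:(\d+)\s+pts_time:([0-9.]+)");
    static readonly Regex LevelLine =
        new Regex(@"lavfi\.astats\.Overall\.RMS_level=(\S+)");

    public static List<SoundFrame> Parse(IEnumerable<string> lines)
    {
        var frames = new List<SoundFrame>();
        var inv = CultureInfo.InvariantCulture;
        bool haveHeader = false;
        int frame = 0;
        long pts = 0;
        double time = 0;

        foreach (var line in lines)
        {
            var header = FrameLine.Match(line);
            if (header.Success)
            {
                // Assumption: the header line precedes its level line
                frame = int.Parse(header.Groups[1].Value, inv);
                pts = long.Parse(header.Groups[2].Value, inv);
                time = double.Parse(header.Groups[3].Value, inv);
                haveHeader = true;
                continue;
            }

            var level = LevelLine.Match(line);
            if (level.Success && haveHeader)
            {
                string raw = level.Groups[1].Value;
                // "-inf" means silence; handle it before numeric parsing
                double db = raw == "-inf"
                    ? double.NegativeInfinity
                    : double.Parse(raw, inv);
                frames.Add(new SoundFrame(frame, pts, time, db));
                haveHeader = false;
            }
        }
        return frames;
    }
}
```

Calling `LoudnessLogParser.Parse(File.ReadLines(path))` on the saved ffmpeg output would yield one `SoundFrame` per pair of lines, ready to be written out as a CSV row.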

Inspecting the CSV data shows that the gun fired at frames 99-100, where the level jumps from about -37 dB to about -2 dB. Note that these are loudness values in dB, so they are logarithmic.

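A level of \(L\) dB corresponds to a linear amplitude of \(10^{L/20}\), so \(10^{-37.303841/20} \approx 0.014\) is very small compared to \(10^{-2.16881/20} \approx 0.78\).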

It turns out that when there is no sound, the sound processing software returns “-inf” for the loudness. The app called double.TryParse() on the loudness of each sound frame, which failed for these silent frames. The app's processing of the loudness data then skipped every sound frame that returned “-inf”.

The data is normalised so that the minimum loudness value maps to zero and the maximum maps to 10. Every value is shifted by the minimum and then scaled by the range (max minus min), so all values fall in the range 0 to 10.

var min = loudnessData.Min();
var max = loudnessData.Max();
var range = max - min;
int Amplitude = 10;

// Normalize the data to a range of 0-Amp
var normalizedLoudnessData = loudnessData.Select(x => (double)(Amplitude * (x - min) / range)).ToArray();

The values are then exponentiated to amplify the gun sound peak. This exponentiation gives a clearer indication of the starting edge of the gun sound.

var exponentiatedData = normalizedLoudnessData.Select(x => Math.Round(Math.Pow(x, 10), 0)).ToArray();
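To see why the exponentiation helps, the two steps above can be wrapped in a small helper and run on a few values. This is a sketch only: the class and method names are hypothetical, and the handling of a flat signal (zero range) is an assumption added to avoid dividing by zero.

```csharp
using System;
using System.Linq;

// Hypothetical demo class; the two steps mirror the snippets above.
public static class LoudnessSharpening
{
    // Scale values linearly so that min -> 0 and max -> amplitude.
    public static double[] Normalize(double[] data, double amplitude = 10)
    {
        double min = data.Min();
        double range = data.Max() - min;
        // Assumption: a flat signal (range == 0) maps to all zeros
        if (range == 0) return new double[data.Length];
        return data.Select(x => amplitude * (x - min) / range).ToArray();
    }

    // Raise each normalised value to the 10th power, rounded to a whole number.
    public static double[] Exponentiate(double[] normalized) =>
        normalized.Select(x => Math.Round(Math.Pow(x, 10), 0)).ToArray();
}
```

Taking just four levels from the CSV (frames 15, 98, 99 and 100: -44.402752, -37.303841, -6.616751, -2.16881), the normalised values are roughly 0, 1.68, 8.95 and 10. After raising to the 10th power, the pre-gun frame 98 comes out at about 180 while the gun frames 99 and 100 come out in the billions, so the rising edge stands out far more clearly than in the dB values.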

The loudness data includes a time from the video start for each sound frame, so the app was still able to correctly determine the time of the gun fire from this subset of the sound data.

As stated previously, there was an issue where the video was silenced except for an added gunshot sound for the GUNSOUND mode of race start.

There is a plot of the sound loudness at the bottom of the stitched image. The app was stretching the subset of processed sound data points across the whole stitched image, which is incorrect because the skipped silent frames are missing from that subset. This was resolved as follows.

If “-inf” was detected, the loudness was set to NaN.
In normalizing the data (min to max range processing), NaN values are passed through.
The next step is to exponentiate the loudness data, but for NaN the value is set to zero.
There is now a value to plot for each video frame time at the bottom of the stitched image … And all is good!

var normalizedLoudnessData = loudnessData
    .Select(x => !double.IsNaN(x) ? (double)(Amplitude * (x - min) / range) : double.NaN)
    .ToArray();

var exponentiatedData = normalizedLoudnessData
    .Select(x => !double.IsNaN(x) ? Math.Round(Math.Pow(x, 10), 0) : 0)
    .ToArray();
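The steps above can be combined into one NaN-aware pass. The sketch below is illustrative rather than the app's code: the class and method names are hypothetical, and it assumes the minimum and maximum are taken over the non-NaN frames only, so that NaN cannot leak into the range.

```csharp
using System;
using System.Linq;

// Hypothetical helper showing the NaN-aware steps end to end.
public static class NaNAwareLoudness
{
    // Silent frames (-inf) become NaN so they are kept rather than skipped.
    public static double[] MarkSilence(double[] db) =>
        db.Select(x => double.IsNegativeInfinity(x) ? double.NaN : x).ToArray();

    // Normalise to 0..amplitude, raise to the 10th power, and plot NaN as zero.
    public static double[] Process(double[] loudnessData, int amplitude = 10)
    {
        // Assumption: min and max come from the non-NaN frames only
        var valid = loudnessData.Where(x => !double.IsNaN(x)).ToArray();
        if (valid.Length == 0) return new double[loudnessData.Length]; // all silent

        double min = valid.Min();
        double range = valid.Max() - min;

        return loudnessData
            .Select(x => double.IsNaN(x) || range == 0
                ? 0
                : Math.Round(Math.Pow(amplitude * (x - min) / range, 10), 0))
            .ToArray();
    }
}
```

The output array has the same length as the input, so each sound frame, silent or not, keeps its own position and time stamp when plotted.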

Gun audio graph … at bottom

Note the detected gun time (the green line) and the determined value, circled in red in the table.

Gun time is determined by finding the first sound frame whose exponentiated loudness is at or above a threshold. The threshold sits a specific fraction of the way from the minimum to the maximum exponentiated loudness value; the fraction is set in the app settings as 1/3. The index of that sound frame is then used to look up its time from the video start:

 //thresholdRangeDivisor is set in the app settings as 3
 double thresholdLevel = audioData.Min() + (audioData.Max() - audioData.Min())/ (thresholdRangeDivisor);

 // Find the index of the first value greater than or equal to the threshold
 index = Array.FindIndex(audioData, x => x >= thresholdLevel);

 GunTime = Math.Round( loudnessTimeData[index],3);  //Using time from audio data
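`Array.FindIndex` returns -1 when nothing matches, which would happen here for an empty array or one containing only NaN values. A defensive variant of the lookup might look like the sketch below; the class name, method name and nullable return value are assumptions for illustration, not the app's actual API.

```csharp
using System;
using System.Linq;

// Hypothetical wrapper around the threshold lookup shown above.
public static class GunTimeFinder
{
    // Returns the time (seconds, 3 decimals) of the first frame at or above
    // the threshold, or null when no frame qualifies.
    public static double? FindGunTime(
        double[] audioData,
        double[] loudnessTimeData,
        double thresholdRangeDivisor = 3)
    {
        // Assumption: both arrays have one entry per sound frame
        if (audioData.Length == 0 || audioData.Length != loudnessTimeData.Length)
            return null;

        double min = audioData.Min();
        double max = audioData.Max();
        double thresholdLevel = min + (max - min) / thresholdRangeDivisor;

        // Index of the first value greater than or equal to the threshold
        int index = Array.FindIndex(audioData, x => x >= thresholdLevel);
        if (index < 0) return null; // nothing reached the threshold

        return Math.Round(loudnessTimeData[index], 3);
    }
}
```

Returning `null` rather than indexing with -1 lets the caller decide what to do when no gun sound is found, for example falling back to a manually chosen start time.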

Footnote

Found I could use LaTeX syntax in Markdown to display mathematical expressions (the exponentiations as above). See GitHub - LaTeX in Markdown, and Math Expressions in Markdown: Complete LaTeX and MathJax Guide. A complex example:

$$
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2
$$

displays as: \(J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2\)

Whilst this rendered OK in the Markdown editor in VS Code, it did not render when the Jekyll site ran. I needed to add the following to the page’s layout HTML file:

<script type="text/javascript" async
     src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>


Blog on Microsoft Technologies esp. IoT