Method and apparatus for automatically identifying animal species from their vocalizations
Posted on July 25th, 2016
Note: This text would have been used in a patent application, but the Patent and Trademark Office has let us down. Lobbyists and Congress weakened patent protections with the America Invents Act, and patent trolls from big business simply litigate with individual inventors until they get what they want[1. Eden, Scott. “The Greatest American Invention”. Popular Mechanics, July/August 2016 p. 93-99]. At this writing, the PTO averages 16.1 months before it takes the first action on a patent application, and 25.7 months to process the average application. Over 500,000 patent applications are now awaiting examination[2. Patents Data at a Glance. USPTO.].
One. Method and apparatus for automatically identifying animal species from their vocalizations, comprising a portable device for recording and transmitting an animal vocalization, a web server for receiving an uploaded recording and for displaying a web page, a database, and server-side software for analyzing the recording.
Two. The portable device of claim 1 comprising a handheld computing device, such as a portable computer, tablet, or smart phone which has a microphone, geopositioning sensors, an Internet connection, and web browser capability.
Three. The web page of claim 1 further including:
- ability to accept and save a file the user has uploaded, transmitted as a POST;
- ability to compute a hash of the uploaded file, such as an MD5, to be used in tracking the file. In one embodiment, such information might be placed in hidden form fields;
- ability to parse a query string that was passed when the page was invoked, and which includes geopositioning information such as latitude and longitude, time and date as received from the device, and user identifying information, such as email address and/or Device ID. In one embodiment, such information might be placed in hidden form fields;
- ability to display a form with one or more questions about the source that made the recorded sound, such as whether the user thought the animal was a bird, mammal, amphibian or insect, or what taxonomic Order the bird might belong to, or what colors were seen on the animal, if any. In a preferred embodiment, such questions should be optional. If answered accurately, they contribute to the effective filtering of the “Twenty Questions” approach.
- a button on the form that launches the analysis process. In one embodiment, a click of the button would update the database with all of the information now available: the user’s latitude, longitude, date and time, email address, device ID, the MD5 of the uploaded file, and the answers the user provided to any questions asked.
- In one embodiment, once the information has been uploaded, the Target page might begin to refresh at intervals. At each refresh, the database could be checked to display interim results to the user, and finally, when processing is finished, the results might be displayed on this web page.
- In one embodiment, the results displayed include links to other pre-existing web pages for more information, sample sounds and photos, and include a likelihood score for each listed result.
- In one embodiment, the user is also sent an email with the results, including a questionnaire they can use to grade developers on the speed, accuracy, and utility of the results.
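The hashing and query-string steps in claim Three can be sketched in a few lines. This is only an illustration under stated assumptions: the parameter names (`lat`, `lon`, `when`, `email`, `device`) and the helper names are hypothetical and are not part of the claims, and a production page would also validate and store what it receives.

```python
import hashlib
from urllib.parse import parse_qs


def md5_of_upload(data: bytes) -> str:
    """Hex MD5 digest of the uploaded bytes, used only as a tracking key."""
    return hashlib.md5(data).hexdigest()


def parse_submission(query_string: str) -> dict:
    """Extract the device-supplied fields from the page's query string.

    All parameter names here are hypothetical.
    """
    fields = parse_qs(query_string)

    def first(name):
        values = fields.get(name)
        return values[0] if values else None

    lat, lon = first("lat"), first("lon")
    return {
        "latitude": float(lat) if lat is not None else None,
        "longitude": float(lon) if lon is not None else None,
        "recorded_at": first("when"),
        "email": first("email"),
        "device_id": first("device"),
    }


# Example: the values that might be carried in hidden form fields.
record = parse_submission(
    "lat=37.70&lon=-78.48&when=2016-07-25T06%3A30"
    "&email=birder%40example.org&device=abc123"
)
record["md5"] = md5_of_upload(b"...uploaded audio bytes...")
```

The MD5 here serves only to track a file through processing; it is not meant as a security measure.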
Four. The database of claim 1 comprising statistical information needed for automatically identifying animal species from their vocalizations, including:
- tables which store the frequency with which each species has been observed at each location on the planet in each month, the locations being rounded to representations of points on land about 50 miles apart.
- tables which store essential statistical parameters of reference audio recordings, each recording having been segmented into meaningful units prior to summarization;
- additional statistical information of higher complexity, such as the results of one-dimensional and two dimensional Fast Fourier Transforms on each segment;
- tables which store additional information, such as taxonomic classification of species, if users will be able to specify this, or the colors found in species, if users will be able to specify this.
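One way to round locations to points about 50 miles apart, as the first table in claim Four requires, is sketched below. The grid scheme is an assumption made for illustration (roughly 69 miles per degree of latitude, with the longitude step widened away from the equator), and the table layout and function names are hypothetical.

```python
import math

MILES_PER_DEGREE_LAT = 69.0   # approximate; assumed constant for this sketch
GRID_MILES = 50.0             # spacing named in the claim


def grid_cell(lat: float, lon: float) -> tuple:
    """Round a position to the index of the nearest point on a ~50-mile grid."""
    lat_step = GRID_MILES / MILES_PER_DEGREE_LAT
    lat_idx = round(lat / lat_step)
    cell_lat = lat_idx * lat_step
    # A degree of longitude shrinks with the cosine of the latitude.
    miles_per_degree_lon = MILES_PER_DEGREE_LAT * max(
        math.cos(math.radians(cell_lat)), 1e-6
    )
    lon_step = GRID_MILES / miles_per_degree_lon
    return (lat_idx, round(lon / lon_step))


def species_seen_nearby(observations: dict, lat: float, lon: float, month: int) -> dict:
    """Look up observation counts for the grid cell and month.

    `observations` maps (cell, month) -> {species name: count}.
    """
    return observations.get((grid_cell(lat, lon), month), {})


# Two recordings made a few miles apart fall in the same cell.
observations = {(grid_cell(37.70, -78.48), 7): {"Northern Cardinal": 42}}
nearby = species_seen_nearby(observations, 37.75, -78.50, 7)
```

This sketch ignores whether a grid point falls on land or water; the claim's restriction to points on land would need a separate lookup.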
Five. The server-side software of claim 1 further incorporating audio file segmenting capability including:
- algorithms to convert a submitted audio file to Pulse Code Modulation (PCM) format, regularly sampling the amplitude of the analog signal at uniform intervals, and quantizing each sample to the nearest value within a range of digital steps.
- algorithms which examine the mean and variance of the entire signal, developing a theory on the maximum amplitude of noise and minimum amplitude of signal, then applying that theory to break a recording into segments of signal, discarding background noise and saving each segment to disk;
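The quantizing and segmenting steps of claim Five might look something like the following. This is a minimal sketch, not the claimed algorithm: it assumes 16-bit PCM, measures loudness as the peak amplitude of fixed-length frames, and places the noise/signal boundary at the mean frame level plus half a standard deviation. The frame length and the factor `k` are illustrative guesses.

```python
from statistics import mean, pstdev


def quantize(x: float) -> int:
    """Quantize an analog sample in [-1.0, 1.0] to a signed 16-bit PCM step."""
    clipped = max(-1.0, min(1.0, x))
    return int(round(clipped * 32767))


def segment(samples, frame_len=100, k=0.5):
    """Return (start, end) sample ranges judged to be signal rather than noise.

    The threshold is derived from the mean and spread of the whole recording.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    levels = [max(abs(s) for s in frame) for frame in frames]
    threshold = mean(levels) + k * pstdev(levels)

    segments, start = [], None
    for idx, level in enumerate(levels):
        if level > threshold and start is None:
            start = idx * frame_len                       # a segment begins
        elif level <= threshold and start is not None:
            segments.append((start, idx * frame_len))     # a segment ends
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments


# Synthetic recording: quiet noise with two loud bursts.
noise = [10, -10] * 500     # 1000 samples
burst = [1000, -1000] * 250 # 500 samples
recording = noise + burst + noise + burst + noise
found = segment(recording)
```

Each returned range could then be written to disk as its own segment, with everything outside the ranges discarded as background noise.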
Six. The server-side software of claim 1 further incorporating audio file fingerprinting capability including:
- algorithms to resample each segment at the same frequency, allowing for direct comparison between samples from different sources.
- algorithms to apply Fast Fourier Transformations (FFT) to each segment, creating both one-dimensional and two-dimensional arrays.
- algorithms to compute basic properties of such segments, such as segment duration, frequency statistics (such as minimum, maximum, mean, and standard deviation derived from the one-dimensional array values) and amplitude statistics (such as mean and standard deviation derived from peak frequency values) and write such information to a server-side database.
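The fingerprinting steps of claim Six can be sketched as follows. Several things here are assumptions: the one-dimensional array is read as the magnitude spectrum of the whole segment, the two-dimensional array as a spectrogram (one spectrum per fixed-length frame), resampling is done by linear interpolation, and a naive discrete Fourier transform stands in for a real FFT library so that the example needs only the standard library. The segment must be at least one frame long.

```python
import cmath
import math
from statistics import mean, pstdev


def resample(samples, src_rate, dst_rate):
    """Resample to a common rate by linear interpolation."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def dft_magnitudes(frame):
    """Magnitudes of the lower half of a naive DFT (stand-in for an FFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2)
    ]


def fingerprint(samples, rate, frame_len=64):
    """Summarize one segment: duration, frequency and amplitude statistics."""
    spectrum = dft_magnitudes(samples)                        # 1-D array
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    spectrogram = [dft_magnitudes(f) for f in frames]         # 2-D array
    bin_hz = rate / frame_len
    peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spectrogram]
    peak_freqs = [b * bin_hz for b in peak_bins]
    peak_amps = [row[b] for row, b in zip(spectrogram, peak_bins)]
    return {
        "duration_s": len(samples) / rate,
        "freq_min_hz": min(peak_freqs),
        "freq_max_hz": max(peak_freqs),
        "freq_mean_hz": mean(peak_freqs),
        "freq_std_hz": pstdev(peak_freqs),
        "amp_mean": mean(peak_amps),
        "amp_std": pstdev(peak_amps),
        "spectrum": spectrum,
        "spectrogram": spectrogram,
    }


# A pure 1000 Hz tone sampled at 8000 Hz.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
fp = fingerprint(tone, 8000)
```

The scalar statistics are the kind of values that could be written to a database row per segment, with the two arrays stored alongside for the later matching steps.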
Seven. The server-side software of claim 1 further incorporating reasoning algorithms similar to “Twenty Questions”, including:
- algorithms which examine the database and determine what species are in a taxonomic category optionally provided by the user, and which are commonly found near the user at the time of year the recording was made;
- algorithms which examine the database and reduce the pool of candidate species by eliminating any which do not have coloration or other qualities optionally provided by the user;
- algorithms which score the remaining pool of candidate species on how closely each segment from the submitted recording matches segments in the reference segments table of the database in the computed one-dimensional arrays;
- algorithms which determine how well those reference segments with high one-dimensional array matches match on their two-dimensional arrays;
- algorithms which combine the two-dimensional array results to produce a score for each species, that score being the average two-dimensional match;
- algorithms which prepare a prose narrative of these results, including links to web pages with more information on the candidate species, and post this narrative to a database table being monitored by the web page of claim 3.
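The narrowing-then-scoring flow of claim Seven can be illustrated with a small sketch. The data layout, the function names, the use of cosine similarity as the match measure, and the 0.8 shortlist cutoff are all assumptions made for the example. The claims do not specify a similarity measure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flatten(grid):
    return [value for row in grid for value in row]


def candidate_pool(species_table, local_species, order=None, colors=()):
    """Cheap filters first: location and season, then optional user answers."""
    pool = set()
    for name, info in species_table.items():
        if name not in local_species:
            continue
        if order and info["order"] != order:
            continue
        if not set(colors) <= set(info["colors"]):
            continue
        pool.add(name)
    return pool


def rank_species(query_segments, references, pool, shortlist=0.8):
    """Score candidates: 1-D match as a gate, then average the 2-D matches."""
    matches = {}
    for query in query_segments:
        for ref in references:
            if ref["species"] not in pool:
                continue
            if cosine(query["spectrum"], ref["spectrum"]) < shortlist:
                continue  # poor 1-D match, so skip the costlier 2-D comparison
            score = cosine(flatten(query["spectrogram"]), flatten(ref["spectrogram"]))
            matches.setdefault(ref["species"], []).append(score)
    ranked = [(name, sum(s) / len(s)) for name, s in matches.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


species_table = {
    "cardinal": {"order": "Passeriformes", "colors": ["red"]},
    "blue jay": {"order": "Passeriformes", "colors": ["blue", "white"]},
    "mallard": {"order": "Anseriformes", "colors": ["green", "brown"]},
}
references = [
    {"species": "cardinal", "spectrum": [1, 0, 0], "spectrogram": [[1, 0], [0, 1]]},
    {"species": "blue jay", "spectrum": [0, 1, 0], "spectrogram": [[0, 1], [1, 0]]},
]
query = [{"spectrum": [1, 0.1, 0], "spectrogram": [[1, 0], [0, 0.9]]}]
pool = candidate_pool(species_table, set(species_table), order="Passeriformes")
ranking = rank_species(query, references, pool)
```

The ordering matters more than the particular measures: the set-membership filters cost almost nothing, the one-dimensional comparison is cheap, and the two-dimensional comparison runs only on what survives.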
Field of the Invention
The present invention relates generally to audio signal processing systems and associated computer software methods. More specifically, the present invention relates to a system and method for automatically identifying wildlife from their recorded sounds.
Background of the Invention
Naturalists, bird watchers, and others often hear a bird before they see it, if they see it at all. A bird’s song or call is distinctive, and can usually provide an expert birder with certainty about an identification. But most of us are not expert birders, and the array of bird songs and calls can be bewildering.
For amateurs, there is a large market in books, CDs, and cell phone Apps that offer a chance to learn the sounds that each bird makes. It is then up to the birder to match what they’ve learned with what they’ve just heard, and reach a conclusion. This is a slow, difficult task, and developing expertise as a competent birder takes many years.
A tool that could accurately identify a bird from its sound would be welcomed by birders. But it would also have great utility in automated animal identification in ecological censusing, environmental monitoring, biodiversity assessment, and other roles.
Automated species identification seems feasible because animal vocalizations appear to vary much more between species than within a species.
If such a species identification tool existed, it might be improved to identify individual birds, or to begin to understand the meaning of specific calls and words that birds use. And it would naturally extend to recognizing mammals, amphibians, and insects who also have vocabularies and songs.
The Elements of Song
Bird song has been conceived as being composed of a hierarchy of elements. Lee et al (2006)[1. Chang-Hsing Lee, Yeuan-Kuen Lee, and Ren-Zhuang Huang “Automatic Recognition of Bird Songs using Cepstral Coefficients” Journal of Information Technology and Applications. Vol 1 No. 1 May 2006 pp 17-23.] suggest “The simplest individual sounds that birds produce are referred to as song elements or notes. A set of one or more elements that occur successively in a regular pattern is referred to as a song syllable. A sequence of one or more syllables that occurs repeatedly is regarded as a song motif or phrase. A particular combination of motifs that occur repeatedly constitutes a song type. Finally, a sequence of one or more motifs separated from other motif sequences by silent intervals of different duration is a song bout.”
Such a structuring of song sounds is not needed with “words”, the unit of expression used by birds when they are not singing. Words are sometimes referred to as “calls”, but because they may be uttered at any volume, including a whisper when in the nest box, the term “word” seems more appropriate.
Words are neglected by bird watchers, partly because the most common – alarm calls – all seem to sound the same. And they are neglected by researchers because algorithms don’t seem to do much better in sorting them than the human ear does. Because words are rarely discussed in the scientific literature, I need to discuss them from my personal experience of living for 18 years with a flock of pet cockatiels and as a feeder of backyard birds.
Words for a bird are expressions of self-identification, intent, emotional state, danger, instruction. They are not combined in sentences, not used in song, and are high in meaning. They are likely involuntary.
Self-identification: When geese fly, the male typically trails his mate, not directly behind her because she is then harder to see, and with his extra momentum, tailing could lead to a wreck if she suddenly slowed. In flight, each uses a word which, among other things, identifies themselves, so they can track each other and keep together. His word is distinctly lower than hers, and an amateur bird watcher can hear this difference from the ground. When a flock passes, all honking, that observer can tell that each sounds a bit different. Those differences are meaningful: each goose has their own pronunciation of the self-identification word, and so a goose in flight not only knows that other birds are around them, but knows which are around them.
Intent: social birds have a word for “take off”, used when the bird wishes to take off. In the case of an emergency, the bird may issue the word and simply go, without waiting for any response from others. But if there is no emergency, just a desire to move on to another location – perhaps because night is falling – then each bird wishing to take off is likely to use the take off word and listen to what others say. If a quorum expresses the word, then some take off, followed by others.
Emotional state: in our flock of cockatiels, I can distinguish words that mean ecstasy (used in anticipation of grooming and during grooming; very variable between individuals, and often done with eyes closed when ears are being rubbed), curiosity/amazement (can be expressed when riding on my shoulder, as I enter a new room; sounds a bit like “wow”), love (expressed when entering a room where male bird’s spouse or one of us human friends may be found; sounds like a falling chord), terror (can be expressed when a hawk flies at the window; sounds like “eek” or “shriek”), delight (used when an individual trips and recovers) and sometimes pain (expressed when a feather is pulled during mutual grooming, or if I run over a tail feather with my desk chair!)
Danger: Chickadees adjust their calls to reflect the degree of danger (Templeton et al, 2005)[Templeton, C. N.; Greene, E.; Davis, K. (2005). “Allometry of alarm calls: black-capped chickadees encode information about predator size”. Science 308 (5730): 1934–7. doi:10.1126/science.1108841. PMID 15976305.]. If you fill the backyard bird feeder, you may hear “Chick-a-dee-dee-dee-dee”. A large hawk, clumsy in acrobatic maneuvers, might increase the number of dees, and a small hawk or owl, which is even more dangerous, might increase the dees even further, up to 23 in one case. Other birds use differentiable alarm calls to designate predator from above (such as hawk) and predator from below (such as cat).
Instruction: When blue jays and crows find food, they call others to share in the feast (and danger of landing on the ground) before eating. When I step outside, I usually trigger a crow or jay to call others, since I’m often about to fill bird feeders. In the case of crows, one call may trigger distant calls, so that a large group, originally dispersed, may arrive. Crows not only know the self-identification sound of their mates, but can imitate it, so that when a pair becomes separated, one may call his spouse using her self-identification sound!
Every word spoken by my cockatiels reveals who they are by how that word is pronounced. If I can tell this, surely they can, too.
Some words are rarely used, others used frequently. Commonly used words will generally be better represented in the database of an automated recognition system, and should be expected to be more accurately identified.
Words are the verbal part of animal communication, but there is also much that is non-verbal, such as raising both wings in a stretch as greeting (done by both cockatiels and mourning doves, it helps reveal the underside of the wing, improving identification opportunities). But our focus here is only vocalization.
Previous Automation Efforts
There have been many attempts to automate the process of identifying an animal from its sound. A thorough literature review shows many promising directions, but all implementations suffer from false positives, particularly when conditions are challenging, such as background noise, syllables of plastic songs, and some calls. And when an algorithm shows reasonable success in distinguishing between two birds with very different songs, the accuracy of identification drops as additional birds are added.
Two cell phone applications which purport to identify North American birds by sound can do everything but make correct identifications:
- Bird Song ID: USA is a cell phone application offered by IsoPerla.[2. http://us.isoperlaapps.com/BirdSongIdUSA.html email@example.com Office 173 3 Edgar Buildings George Street Bath BA1 2FJ] This product claims to identify 30 birds. Its identification consists of a list of 26 guesses, ordered by probability. In one of our tests, the correct bird appeared on the list in the 12th position; in all other tests, the correct bird did not appear in the list of results at all.
- Twigle Birds is a cell phone application offered by Avelgood Apps[3. http://www.twigle.co/features/] which claims to identify 50 North American birds. Its identification consists of a list of 5 guesses, ordered by probability. In none of our tests did the correct bird appear in the list of 5 guesses.
Suggested Methods for Automated Recognition
Researchers have explored methods of identifying birds from their sounds, without turning their methods into products. This brief review looks only at studies which report their identification success rate.
- Breaking bird sounds into syllables and then matching the amplitude and frequency of syllables with a reference set of syllables has been explored by Härmä (2003)[4. Aki Härmä “Automatic Identification of Bird Species Based on Sinusoidal Modeling of Syllables” Helsinki University of Technology, Laboratory of Acoustics and Audio Signal Processing P.O. Box 3000, FIN-02015, Espoo, FINLAND email: Aki.Harma@hut.fi https://www.researchgate.net/profile/Aki_Haermae2/publication/4015246_Automatic_identification_of_bird_species_based_on_sinusoidal_modeling_of_syllables/links/0046352a756329a778000000.pdf Published in: Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03). 2003 IEEE International Conference on (Volume:5 ) Date of Conference: 6-10 April 2003 Page(s): V – 545-8 vol.5 ISSN: 1520-6149 Print ISBN: 0-7803-7663-3]. As a component of more complex analyses, the method shows promise, for in a collection of the songs of 14 birds, most syllables match their source bird more often than they match any of the syllables of the other 13 birds. But Härmä reports the method could not classify 3 of the 14 birds, and the confidence in other conclusions was low. Härmä notes that “… almost all species feature non-tonal sounds like clicks and rattles which cannot be modeled with a simple sinusoidal model.” And as Agranat (2009)[5. Ian Agranat. “Automatically identifying animal species from their vocalizations.” http://www.wildlifeacoustics.com/images/documentation/Automatically-Identifying-Animal-Species-from-their-Vocalizations.pdf Wildlife Acoustics Inc. Concord, MA. March 2009.] notes, “we found that syllable matching does not scale to a large number of species, especially when several highly variable narrowband vocalizations are included in the mix…”
- Anderson et al. (1995)[6. Anderson et al., Automatic Recognition and Analysis of Birdsong Syllables from Continuous Recordings, Mar. 8, 1995, Department of Organismal Biology and Anatomy, University of Chicago.] tried sampling the digitized sound at a fixed rate, performing a series of 256-point Fast Fourier Transforms (FFTs) to convert to the frequency domain, detecting notes and silence intervals between notes, and comparing the FFT series corresponding to each note with templates of known note classes using a Dynamic Time Warping (DTW) technique. Their high rate of false positives was the result of high variability between individual singers in many species, the difficulties their method had in addressing noise, and possibly a focus on the wrong parts of the frequency spectrum. In a subsequent study (Anderson et al., 1996) (S.E. Anderson, A.S. Dave, and D. Margoliash “Template-based automatic recognition of birdsong syllables from continuous recordings” J. Acoust Soc Am, Aug 1996, 100 (2 Pt 1): 1209-19.), using low-clutter, low-noise recordings of just two species, accuracy ranged between 84% and 97%. Such accuracy would fall with more birds and lower quality recordings. The use of DTW appears to require expert knowledge in the selection of templates.
- McIlraith and Card (1997)[7. A. L. McIlraith and H. C. Card “Birdsong Recognition Using Backpropagation and Multivariate Statistics”, IEEE Transactions on Signal Processing Vol. 45 No. 11, November 1997] used two different methods for classifying birds from their vocalizations. One method sliced a song into frames of identical duration, computed 10 parameters per frame, then combined these parameters over the length of the song. The second method parsed the song into notes and determined the mean and standard deviation of both the duration of notes and duration of silent periods between the notes. On the songs of their six test birds, the researchers claimed 82-93% accurate identification. But neither method works well for short sounds that we might call “chirp” or “quack”. Neither method works well when comparing “Chicka-dee-dee-dee” with “Chicka-dee-dee-dee-dee-dee”. And because many bird species have overlapping spectral properties, particularly with alarm calls, these methods cannot achieve high recognition rates across a large number of individual species.
- The use of hidden Markov models (HMMs) requires more training examples than DTW (Kogan and Margoliash, 1998[8. Joseph A. Kogan and Daniel Margoliash “Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A comparative study” J. Acoust. Soc. Am. 103, 2185 (1998); http://dx.doi.org/10.1121/1.421364 http://pondside.uchicago.edu/oba/Faculty/Margoliash/lab/pdfs/1997%20Kogan%20JASA.pdf]), and regularly misclassifies short-duration vocalizations and song units with variable structure. Their HMM technique included digitizing the bird song, extracting a time series of Mel Frequency Cepstral Coefficients (MFCC), and computing the probabilities that HMMs representing a known birdsong produced the observation sequence represented by the sequence of coefficients. The HMM method works reasonably well in distinguishing two species if individual notes are manually classified, a difficult, labor-intensive and time-consuming process, and if all background sounds are also added to the database. Furthermore, HMMs of a fixed number of states do not discriminate well among notes of variable durations. The simple bi-grammar of Kogan et al. correlates too coarsely to the structure of birdsongs to be able to distinguish among a large number of diverse species.
- Agranat (2009) manually segmented Macaulay Library recordings for 52 species represented by 550 individual recordings into 12,563 segments, to be used as a reference standard. For field recordings, a Wiener filter reduces stationary background noise, and a band-pass filter eliminates some frequencies, such as those of wind and traffic noise. The remaining frequencies are transformed from a linear scale to a log frequency scale, to reduce what may be redundant higher frequency harmonics. The remaining noise is then substantially eliminated by normalizing to a fixed dynamic range: the frequency bin with the highest energy level is set to equal the dynamic range, and any bin whose normalized power falls below the estimated background noise level for that bin is set to zero. Signals are then automatically segmented using amplitude levels. Agranat then uses classification algorithms based on Hidden Markov Models (HMMs) using spectral feature vectors similar to Mel Frequency Cepstral Coefficients (MFCCs). Unfortunately, detection rates only ranged from 37% to 63%.
- Lee et al (2006)[9. Chang-Hsing Lee, Yeuan-Kuen Lee, and Ren-Zhuang Huang. “Automatic Recognition of Bird Songs Using Cepstral Coefficients” Journal of Information Technology and Applications. Vol 1 No. 1 May 2006. pp 17-23. http://jita.csi.chu.edu.tw/Jita_web/publish/vol1_num1/05-20050044-text-sec.pdf] first segment each syllable corresponding to a piece of vocalization. For each syllable, the averaged LPCCs (ALPCC) and averaged MFCCs (AMFCC) over all frames in a syllable are calculated as the vocalization features. Linear discriminant analysis (LDA) is exploited to increase the classification accuracy at a lower dimensional feature vector space. In their experiments, AMFCC usually outperforms ALPCC.
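For readers unfamiliar with Dynamic Time Warping, which Anderson et al. used to compare notes against templates, the core idea fits in a few lines. This is a textbook sketch on sequences of single numbers, not a reconstruction of their method, which compared series of FFT frames. The function name is illustrative.

```python
def dtw_distance(a, b):
    """Cost of the cheapest alignment between two numeric sequences.

    Either sequence may be stretched in time: one element may align with
    several consecutive elements of the other.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same contour sung more slowly still aligns at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different = dtw_distance([0, 0], [1, 1])
```

The tolerance for stretching is what makes DTW attractive for song, and the need to choose good templates to warp against is what makes it labor-intensive.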
Many researchers acknowledge abandoning the search for recognition algorithms after encountering trouble.
- The variability of song within a species is part of the difficulty of developing recognition algorithms that produce accurate results. An automated recognition product called “WeBIRD” – an acronym for “Wisconsin Electronic Bird Identification Resource Database” – received enthusiastic attention on the Internet when it was announced. Time passed, and finally the developer reported this: “Last spring and early summer we tested WeBIRD in the field. Identification of resident and local species (i.e. those in Madison WI) presented no difficulties as expected. But when the first migratory birds started returning, WeBIRD did not perform well… One aspect of avian vocalization is that most species exhibit substantial variation in their songs (and calls). Moreover, this variation is structured geographically… if no “reasonably similar” songs exist in the database, accurate identification of the species is impossible (or at least statistically unlikely). Given that substantial geographic variation in bird vocalizations exists makes this a formidable problem.”[10. Chris Barncard “Smart Birding. A new birdsong app identifies feathered friends by their tweets”. http://grow.cals.wisc.edu/environment/smart-birding] The algorithms used by WeBIRD are unknown, but 5 or 6 years after its announcement, the product is still not shipping.
- Another published method is described by Kunkel (2005)[11. Kunkel, G., The Bird Song Project, 1996-2005, http://www.bio.umass.edu/biology/kunkel/gjk/project/Project.htm] who extracts parameters for each note including the frequency of the note at its highest amplitude, the frequency modulation of the note as a series of up to three discrete upward or downward rates of change representing up to two inflection points, the duration of the note, and the duration of the silence period following the note. The parameters corresponding to notes of known bird songs are compiled into a matrix filter, and the matrix filter is applied to recordings of unknown bird songs to determine if the known bird song may be present in the sample. Kunkel’s approach is undermined by the fact that the songs of many birds contain very similar individual notes, and his analytic methods appear unable to address some of the complexities of some bird song. His hardware is mounted to a corner of his house, and no part of his system is suitable for use in the field or for identifying more than a handful of birds. Kunkel’s heroic continuous monitoring of local birds seems to have come to an end in 2005, 9 years after it was begun.
Problems with Prior Approaches
All prior approaches appear to suffer from a number of common problems:
- Most have begun by trying to distinguish between a few mono-syllabic vocalizations, only to find that their performance degrades as the number of species or the complexity of the vocalizations increases. Bird vocalizations range from narrowband whistles with few distinctive spectral properties to broadband sounds with complex spectral properties. Vocalizations may range from 100 Hz to 10,000 Hz, and may last several seconds or just a fraction of a second. A truly useful automated identifier must be able to handle any wildlife sounds, no matter how complex.
- Many researchers have been lured by the purity of the Macaulay Library at the Cornell Ornithology Lab. The recordings in this library are correctly identified, and most contain no background noise and no sounds of other species. Some researchers have chosen other libraries, only to discover that those libraries had once been Macaulay samples. When their algorithms encounter real-world recordings of birds – faint and cluttered with natural sounds such as wind and the sounds of other birds, passing airplanes, and the sounds of people talking – the algorithms break down. Human speech recognition assumes that the speech is into a microphone; a bird recording attempts to focus on a subject at some distance. What begins as a perfect signal at the bird’s syrinx is muted and muffled and masked by intervening trees and these other sounds. Effective algorithms must devote considerable attention to the task of distinguishing signal from noise.
- Algorithms must use the correct unit of analysis, the smallest unit that is meaningful to a bird. If a bird can combine the words or syllables “A”, “B”, and “C” into phrases like “AABC”, “ABBCCC”, and “ACABC”, then the unit of analysis must be those words, not those phrases. If a bird has freedom in a song to combine known words into new combinations, and to extend the song to any length, the song cannot be the unit of analysis as Chu and Blumstein (2011)[ 12. Wei Chu and Daniel T. Blumstein Noise robust bird song detection using syllable pattern-based hidden Markov models. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, 2011 pp 345-348. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5946411] have demonstrated. This is most true for “plastic songs”, but careful examination suggests that most instances of “stereotyped songs” show some plasticity.
- Most algorithms do not seem able to handle variations within a species. Anyone who lives with a small flock of birds as pets can distinguish individuals from the way they pronounce various words, which Da Silva et al (2006)[13. Maria Luisa Da Silva & Jacques Vielliard Entropy Calculations For Measuring Bird Song Diversity: The Case Of The White-Vented Violet-Ear (COLIBRI Serrirostris) (AVES, Trochilidae) Razprave Iv. Razreda Sazu, Xlvii-3 (2006)] believe is “evidence of vocal learning and creative capacity.” Birds use sensory feedback to adjust their song in different contexts, for instance singing louder in noisy urban settings. And all birds appear to have subtle regional dialects, and during migration, the geese feeding in a Virginia pasture in winter might include some that summer in Canada, some non-migratory locals from Virginia, and some that spend their winters anywhere in between. So if there are two ways to pronounce the word “A”, then the database should contain samples of each, and the algorithms must not insist that a test word perfectly match them both. And since there will likely be hundreds of slightly different ways to pronounce “A”, the algorithms must be able to combine partial matches to produce a confident conclusion.
- The recordings used for the reference database should be of the highest possible quality, made with a high quality shotgun or parabolic microphone, rather than an omnidirectional microphone.
- The reference database must be generated by fully automated processes if it is to correctly identify all sounds made by all species under all circumstances. Many of the approaches in the literature depend on a manually generated database, something that is not imaginable if we are to identify the many different sounds made by each of the 1.5 million species of birds, mammals, amphibians, and insects that still exist.
- Implementations of recognition software for a cell phone must not expect that the cell phone has adequate processing power to evaluate a recording, or adequate capacity to store the reference database. If field recordings are to be identified in real time, the user will need either a reasonably capable self-contained laptop or a cell phone or other recording device coupled to a remote server.
- Despite great efforts, great algorithms are not enough to completely substitute for the expert birder or wildlife biologist in identifying species by sound. All cases of machine identification should be available for public review and comment, and the feedback should be used to focus on the weaknesses of the algorithms and sometimes to use a submitted sample as an additional reference sound when machine and citizen scientists agree on an identification.
We consider that any workable invention for the automatic identification of animal vocalizations must possess these qualities:
- A client-server architecture, in which databases are stored on the server, and processing is done at the server. Identification requires too much storage space and too much processing power to expect it to be done on a remote hand-held device.
- An Internet connection (or equivalent) between client and server, to enable portability of the client. Recordings will be made in the field, not the office, and analyzed in an office, not the field.
- Use of geoposition and time of year to narrow the number of possible animal sources of a recording. Species are not uniformly distributed over the planet’s surface, and if geoposition and date are not used, impossible results will be normal.
- Meaningful segmentation of audio files, and comparison of unknown segments with reference segments. Approaches that segment arbitrarily, at fixed intervals, or that do not segment at all, will never be able to match plastic songs and complex vocalizations with a reference database.
- Repeated filtering of candidate species, beginning with the filters that are fastest to apply and, as the selection narrows, applying the most time-consuming filters only to the candidates that remain. This prioritized comparison order reduces the candidates much like the game of “Twenty Questions”.
- Complete automation of the process. Inventions which require experts to train the system in identification will simply be too labor-intensive, and will not be able to identify the many different sounds made by each of the 1.5 million species of birds, mammals, amphibians, and insects that vocalize.
Against these six criteria, we take a brief look at prior patents:
|Patent #||URL||Date||Title||Criteria Not Met|
|US9177559 B2||https://www.google.com/patents/US9177559||Nov 3, 2015||Method and apparatus for analyzing animal vocalizations, extracting identification characteristics, and using databases of these characteristics for identifying the species of vocalizing animals||1,2,5 There will be difficulty extending the invention to multiple species. The authors write “If a new species that also sings 1-Section Songs with 3-Part Elements is added to the database, then some other identification parameter, such as pitch contour of the Elements, will be added to the Master Identification Parameter list for each species and the new list will serve to separate these two species song types.”|
|US9093120 B2||https://www.google.com/patents/US9093120||Jul 28, 2015||Audio fingerprint extraction by scaling in time and resampling||1,2,3,4,5,6 Discusses only fingerprinting. Does not address species identification.|
|US9058384 B2||https://www.google.com/patents/US9058384||Jun 16, 2015||System and method for identification of highly-variable vocalizations||3,4,5. Approach segments a recording into frequency bands, but not into temporal units of “words”; segmentation not used to distinguish signal and noise.|
|EP1661123 B1||https://www.google.com/patents/EP1661123B1||Jun 5, 2013||Method and apparatus for automatically identifying animal species from their vocalizations||1,2,3,4,5,6 Training requires a skilled operator, and new rules must be downloaded to the field.|
|US8140331 B2||https://www.google.com/patents/US8140331||Mar 20, 2012||Feature extraction for identification and classification of audio signals||3,4,5,6 Approach segments a recording into frequency bands, but not into temporal units of “words”; segmentation not used to distinguish signal and noise.|
|US7963254 B2||https://www.google.com/patents/US7963254||Jun 21, 2011||Method and apparatus for the automatic identification of birds by their vocalizations||1,2,3,4,5 Assumes a hand-held computational device has adequate CPU and storage for all processing. Assumes that birds sing at bird feeders. Assumes that the sounds of taxonomic families can be identified, and that there is a similarity of sounds of all species of birds within a given family.|
|US7454334 B2||https://www.google.com/patents/US7454334||Nov 18, 2008||Method and apparatus for automatically identifying animal species from their vocalizations||1,2,3,4,5,6 Training requires a skilled operator, and new rules must be downloaded to the field.|
|US7377233 B2||https://www.google.com/patents/US7377233||May 27, 2008||Method and apparatus for the automatic identification of birds by their vocalizations||1,2,3,4,5 Assumes a hand-held computational device has adequate CPU and storage for all processing. Assumes that birds sing at bird feeders. Assumes that the sounds of taxonomic families can be identified, and that there is a similarity of sounds of all species of birds within a given family. Patent appears identical to US7963254.|
|EP1031228 A4||https://www.google.com/patents/EP1031228A4||Mar 30, 2005||Device and method for automatic identification of sound patterns made by animals||1,2,3,4,5 Does not attempt to identify vocalizations by species. Patent not awarded.|
|WO2005024782 A1||https://www.google.com/patents/WO2005024782A1||Mar 17, 2005||Method and apparatus for automatically identifying animal species from their vocalizations||1,2,3,4,5 Patent not awarded.|
|US20050049876 A1||https://www.google.com/patents/US20050049876||Mar 3, 2005||Method and apparatus for automatically identifying animal species from their vocalizations||1,2,3,4,5,6 Training requires a skilled operator, and new rules must be downloaded to the field. Patent not awarded.|
|US20040107104 A1||https://www.google.com/patents/US20040107104||Jun 3, 2004||Method and apparatus for automated identification of animal sounds||1,2,3,5 Patent not awarded.|
|US20030125946 A1||https://www.google.com/patents/US20030125946||Jul 3, 2003||Method and apparatus for recognizing animal species from an animal voice||1,2,3,4,5 Patent not awarded.|
|US6535131 B1||https://www.google.com/patents/US6535131||Mar 18, 2003||Device and method for automatic identification of sound patterns made by animals||1,2,3,4,5. Requires that a device be attached to the animal to be monitored. Attempts to determine if the animal is in distress, not the animal’s species.|
|WO2000013393 A1||https://www.google.com/patents/WO2000013393A1||Mar 9, 2000||Device and method for automatic identification of sound patterns made by animals||1,2,3,4,5 Patent not awarded.|
|US5956463 A||https://www.google.com/patents/US5956463||Sep 21, 1999||Audio monitoring system for assessing wildlife biodiversity||1,2,3,4,5 Assumes a “call” that need not be segmented, and uses segmentation to distinguish a single call from the remainder of the recording.|
|US5452364 A||https://www.google.com/patents/US5452364||Sep 19, 1995||System and method for monitoring wildlife||1,2,3,4,5 Author of patent appears to have had no idea how difficult it is to match a recording with reference entries in a database, suggesting it could be done by matching on frequency: “The central processing unit calculates the frequency of the vocalization based digital signal representative of the period of the signal and based on a set of these signals determines the species that emitted the vocalizations.”|
|CA2089597 A1||https://www.google.com/patents/CA2089597A1||Aug 17, 1994||Apparatus for audio identification of a bird||1,2,3,4,5 Patent not awarded.|
Brief Description of the Figures
FIG. 1 is a representation of a smart phone connected to a shotgun microphone, for use in capturing a bird or other wildlife sound;
FIG. 2 will be a block diagram providing an overview of the invention, showing cellphone sound capture, web site, and background processing; [done here: https://www.draw.io/]
FIG. 3 will be a block diagram of an embodiment of aspects of the invention.
This invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. Also, the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
Embodiments of aspects of the invention include methods and apparatus which obtain a suitable signal for analysis; optionally clean up or preprocess the signal, for example by use of noise reduction processes; extract interesting features of the signal for further analysis; and, compare the extracted features to a database or the like of features corresponding to various known animal species. Using the structure and methods of the exemplary embodiment, real-time performance in the field can be achieved. That is, a species can be identified by the device in the field, while the observer using the device is still observing the animal that is vocalizing, in the field.
The methods and apparatus described herein are suitable for identifying many different species, including but not limited to birds, mammals, insects, etc. Examples are given without limitation, with respect to birds. For this reason, a term such as “note” or “phrase” should be read to include analogs pertaining to other animal families, such as “tone”, “phoneme,” etc.
In a preferred embodiment, a portable computer, tablet, or smart phone is used to record an animal sound, and to transmit that recording, along with time and location information, to a server. A page on that server then continues the interaction with the user. At the server, the recording is screened, broken into segments, and a series of comparisons with a database of reference segments is performed.
The general approach taken is that of “Twenty Questions”, in which the first questions asked are the most efficiently answered, and rule out the majority of remaining alternatives. For example, the time and location information is first used to produce a short list of species that are likely to be found at that time and place. If the user has seen the bird or animal making the sound, information provided (on color, taxonomic category, etc.) may be used to further shorten the list of candidate species. Then simple parameters of each segment are used to find reference segments for the listed species which most closely match these parameters. From that shorter list, a one-dimensional array derived from a Fast Fourier Transform is compared, producing a still shorter list of candidates that might have made the sound. That list is then used to determine which two-dimensional arrays will be compared, producing a short list of species most likely to have produced the recorded sound, as well as confidence scores derived from the degree of match with reference samples. These results are presented to the user through the web page displayed on their device. The results include links to pages of previously recorded sounds and photos of each of the candidate species. The user may review the results, and provide feedback to the system. Any identifications that the user believes to be incorrect may be reviewed by a human, to allow improvements to the database or algorithms. Ideally another web page will also permit review by volunteer citizen scientists of recordings, results, and user responses, and allow for their own feedback.
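The narrowing described above can be sketched as an ordered list of filters, cheapest first. This is only an illustration of the idea, not the code of the invention; the function name `narrow` and the two toy filters are hypothetical stand-ins for the real stages (location and date, user answers, audio matching).

```python
from typing import Callable, Iterable, List

# Each filter takes the current candidate list and returns a (usually shorter) list.
Filter = Callable[[List[str]], List[str]]


def narrow(candidates: Iterable[str], filters: List[Filter]) -> List[str]:
    """Apply filters in order, cheapest first, stopping early if nothing remains."""
    remaining = list(candidates)
    for apply_filter in filters:
        remaining = apply_filter(remaining)
        if not remaining:
            break
    return remaining


# Toy filters (assumptions for illustration only).
def found_nearby(species: List[str]) -> List[str]:
    local = {"Northern Cardinal", "Blue Jay", "Downy Woodpecker"}
    return [s for s in species if s in local]


def seen_red(species: List[str]) -> List[str]:
    return [s for s in species if s == "Northern Cardinal"]


shortlist = narrow(
    ["Northern Cardinal", "Blue Jay", "Emperor Penguin"],
    [found_nearby, seen_red],
)
```

Ordering the filters by cost means the expensive audio comparisons only ever see the few candidates that survive the cheap questions.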
Creating the Reference Database
A reference database of wildlife or bird sounds is needed, because the approach of this invention is to find the best match of a recording from an unknown bird or animal to recordings in the database, and from there to infer the species that made the sound. The general technique for creating the reference database follows these steps.
- Reference Recordings: Acquire a large collection of audio recordings of good quality, with only one predominant species per recording, and trustworthy identification of that species.
- Reference Segments: For each recording in this collection, create segments and statistical analyses of such segments.
In a preferred embodiment, we used a variety of commercial and non-commercial sources and assembled a collection of 102,825 recordings totaling 369,795,921,711 bytes. For each recording, we created meaningful segments according to the technique described in “Segmentation” below. We then did a number of statistical analyses of each segment, described in “Fingerprinting” below. These statistical values were stored both in a database table and as files on a Solid State Drive. The database table was provided with duration, frequency statistics (minimum, maximum, mean, and standard deviation) and amplitude statistics (mean and standard deviation).
Audio Capture and Upload
Embodiments of aspects of the present invention are described with reference to FIG. 1. Particular embodiments can be implemented in a laptop, tablet, smart phone, or other portable computer carried by the bird watcher or observer. The device should include a microphone (built-in or external), awareness of the time and date, and awareness of the user’s latitude and longitude.
Most such devices include a port for an external microphone, even if they are equipped with a built-in microphone. If connected, an external microphone will normally automatically bypass the built-in microphone. FIG. 1 contains such an external microphone, connected to the smart phone with a 3.5mm cable. In this embodiment, the cable from the microphone is split by a Y-connector, to allow the use of headphones to monitor the recording. This external microphone does not make the audio source much louder, but is desirable because it isolates the source, gives it a sharper focus, reduces background noise, and enhances the recording. The microphone, cell phone mount, and cables are available commercially from MXL Mics in the product “MM-VE001 Microphone Kit for Smartphones and Tablets”. Other products will also work with the preferred embodiment. In general, external microphones that are directional (either shotgun or parabolic) and that are equipped with a windscreen will maximize the desired signal and minimize noise.
In a preferred embodiment, the device includes software (“the App”) that is designed to allow the user to start and stop the device’s audio recorder. When stopped, the App displays a sonogram of the recording, and allows the user to play it back. If the user is satisfied with what they have recorded, they may press an “Analyze” button in the App. Clicking that button uploads the recording to a directory on a web server, as well as invoking a Target web page in a browser. In one embodiment, at the time of this invocation, the App passes the user’s latitude, longitude, email address, date and time, and device ID as a query string to the Target web page.
There may be users who wish to analyze recordings that have been previously made. In an embodiment, an additional web page permits the user to provide the latitude and longitude where a recording was made, perhaps the date and time when it was made, and perhaps some identifying information such as their email address. Such a page would contain a button that invoked a File Selection dialog, and when a valid file was selected, would upload it.
The Target web page handles the interactions between the server software and the user. It is displayed in a web browser window on the client device. In a preferred embodiment, it performs these tasks:
- accepts the file the user has uploaded, transmitted as a POST;
- computes the MD5 of the uploaded file, to be used in tracking the file. In one embodiment, such information might be placed in hidden form fields;
- parses a query string that was passed when the page was invoked, and which includes latitude, longitude, email address and Device ID. In one embodiment, such information might be placed in hidden form fields;
- displays a form with one or more questions about the critter that might have made the recorded sound, such as whether the user thought the animal was a bird, mammal, amphibian or insect, or what taxonomic Order the bird might belong to, or what colors were seen on the animal, if any. In a preferred embodiment, such questions should be optional. If answered accurately, they contribute to the effective filtering of the “Twenty Questions” approach.
- a button on the form that launches the analysis process. In one embodiment, a click of the button would update the database with all of the information now available: the user’s latitude, longitude, date and time, email address, device ID, the MD5 of the uploaded file, and the answers the user provided to any questions asked.
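Two of the tasks listed above, hashing the upload and parsing the query string, can be sketched with the Python standard library. The text does not say what language the Target page uses or what the query-string fields are called, so the field names `lat`, `lng`, `email`, and `device` below are assumptions.

```python
import hashlib
from urllib.parse import parse_qs


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest used to track an uploaded file."""
    return hashlib.md5(data).hexdigest()


def parse_upload_query(query: str) -> dict:
    """Parse the query string passed by the App; keep the first value of each field."""
    fields = parse_qs(query)
    return {name: values[0] for name, values in fields.items()}


# Hypothetical query string; the real field names are not given in the text.
info = parse_upload_query("lat=40.71&lng=-74.01&email=user%40example.com&device=abc123")
```

The parsed values could then be written into hidden form fields alongside the MD5, as the embodiment suggests.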
In one embodiment, once the information has been uploaded, the Target page might begin to refresh at intervals. At each refresh, the database could be checked to display interim results to the user. While the user waits for the analysis to begin, something like FIG. Xx [Clues.asp after upload.jpg], Guidelines for Better Recordings are displayed. When results are complete, they could be presented on this page. [need example and screen shot.!]
Code at Server: Initial Steps
Code runs on a machine with a network connection to the above web server.
It runs a loop, searching for any *.WAV file in the upload directory. If none are found, it sleeps 5 seconds and then looks again. If one is found, control transfers to the next step.
If a file is found, code first reads all *.VCF files found in the upload directory. Information found in a VCF is added to a table named AudiOh_Users, and the VCF is deleted.
Code finds the first remaining file in the uploads directory. If it is not a .WAV file or has a size of 0, it is deleted. If it is a .WAV file, its MD5 is computed. The file is renamed so that its basename is its MD5, and its extension remains .WAV, and it is moved to a directory where it will be processed. This process loops until there are no uploaded .WAV files remaining in the uploads directory.
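The intake step above might look something like the following sketch. It assumes that any .VCF files have already been read and removed, and the directory layout and the function name `intake_uploads` are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def intake_uploads(upload_dir: Path, work_dir: Path) -> list:
    """Move each non-empty .WAV upload to work_dir as <md5>.WAV; delete everything else."""
    work_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(upload_dir.iterdir()):
        if not path.is_file():
            continue
        # Assumption: .VCF files were handled earlier, so non-WAV files are discarded.
        if path.suffix.upper() != ".WAV" or path.stat().st_size == 0:
            path.unlink()
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        target = work_dir / f"{digest}.WAV"
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    uploads = Path(tmp) / "uploads"
    uploads.mkdir()
    (uploads / "song.wav").write_bytes(b"RIFF demo")
    (uploads / "empty.wav").write_bytes(b"")
    (uploads / "notes.txt").write_text("not audio")
    moved_names = [p.name for p in intake_uploads(uploads, Path(tmp) / "work")]
    leftover = list(uploads.iterdir())
```

A polling loop would call such a function whenever a .WAV file appears, and sleep 5 seconds between checks otherwise.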
Code then selects the record in the AudiOhUploads table which has the lowest record ID and no processed date, so that the code can work on oldest submissions first. A file whose name matches the filename stored in that record is selected for processing. Segmentation of this file then begins.
Any recording made in the field will contain a mix of signal and noise, the noise often containing a mix of people talking, wind against the microphone and other objects, passing planes and vehicles, and the sounds of other animals. In a preferred embodiment, digital filters are applied to the signal to reduce background noise and enhance the signal strength of candidate vocalizations.
We are told “Bird song consists of syllables, spectrally discrete sound elements within a song, lasting 10-100 ms, and separated by a min of 5 ms of silence.” (Okanoya and Yamaguchi, 1997) (Okanoya, K. & Yamaguchi, A. Adult Bengalese finches (Lonchura striata var. domestica) require real-time auditory feedback to produce normal song syntax. J. Neurobiol. 33, 343-356) However, a Downy Woodpecker call may contain squeaks of just 5 ms, so setting a minimum segment length of 10 ms would discard them all. So we have defined a segment as a sound having a minimum duration of 5 ms, surrounded by a silence of 5 ms or more. Of course, in the real world, while a bird’s syllable may be surrounded by the bird’s own silence, the ambient sounds of the environment will fill the gaps.
The code breaks the recording into segments. Segments are the basis of analysis, and amount to a phrase of a song, or a specific short call or word. Such words may be combined by a bird in most any way, and may be repeated any number of times, so our focus is simplification: identify the source of the words, and the source of the phrases will become evident. Segments are not only valuable because they correspond to sound elements that must be meaningful to the singer, but segmenting proves useful in identifying the various singers in the springtime morning chorus.
Segmentation begins by decoding the audio file if it is in a compressed format, converting it to Pulse Code Modulation (PCM) format, regularly sampling the amplitude of the analog signal at uniform intervals, and quantizing each sample to the nearest value within a range of digital steps. PCM may be thought of as a horizontal array, with time advancing from left to right, and values representing the loudness of the signal at each of these times.
The process of segmenting involves complex algorithms that determine what is noise, and what is signal, within a recording. Segmentation locates the brief quieter periods in an audio recording, and separates the recording into shorter recordings at each of these locations. Because background noise may be higher in one recording than another, and because background noise can vary from one part of a recording to another part, performing segmentation requires several passes through the audio file to determine what parameters reasonably separate signal from noise.
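The segment definition given earlier (a sound of at least 5 ms bounded by at least 5 ms of quiet) can be illustrated with a deliberately simplified sketch. It uses a single fixed amplitude threshold, which is an assumption; the process described above makes several passes to choose parameters that suit each recording. The function name and arguments are hypothetical.

```python
def segment(samples, sample_rate, threshold, min_sound_ms=5.0, min_gap_ms=5.0):
    """Return (start, end) sample indices of sounds separated by quiet gaps."""
    min_sound = int(sample_rate * min_sound_ms / 1000)
    min_gap = int(sample_rate * min_gap_ms / 1000)

    # Pass 1: collect runs of samples at or above the threshold.
    runs, start = [], None
    for i, value in enumerate(samples):
        loud = abs(value) >= threshold
        if loud and start is None:
            start = i
        elif not loud and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(samples)])

    # Pass 2: merge runs whose separating gap is shorter than the minimum silence.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # Pass 3: discard sounds shorter than the minimum segment duration.
    return [(s, e) for s, e in merged if e - s >= min_sound]
```

For example, at 1000 samples per second, `segment([0]*10 + [1]*8 + [0]*2 + [1]*6 + [0]*10 + [1]*3 + [0]*10, 1000, 0.5)` merges the two bursts separated by only 2 ms and drops the 3 ms burst.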
When building the reference database, information from each segment of each reference sound is added to a referencesegments table in the database. During the fingerprinting process, statistical summaries are added to this table.
The segmentation process is threaded, so that if more than one file is to be segmented, as when adding recordings to the database, many files are processed at once.
An audio fingerprint is a statistical summary of a sound sample. Audio fingerprints typically have a much smaller size than the original audio content and thus may be used as a convenient tool to identify, compare, and search for audio content.
To identify an unknown sound, analysis compares the fingerprints of segments of the unknown sound with fingerprints stored in the reference segments table. Because a given bird typically makes many different sounds, some multivariate techniques are inappropriate. For example, we can’t use canonical correlation to compare all reference segments for one species with all unknown segments provided by a user, because a user’s sample might contain the sounds of several birds, and the reference database will contain many different sounds from each bird. Variety in the reference database weakens the results until they are not useful.
The audio channel is streamed through a mixer, which resamples the input audio to 44100 Hz. This allows for direct comparison between samples from different sources.
A series of Fast Fourier Transforms (FFTs) is performed on this audio stream. Each FFT may be thought of as a vertical array with a width of one frame and a height of 128 values, each value representing the loudness of a frequency. A two-dimensional array is created, with 128 FFT values at time frame 0, 128 FFT values at time frame 1, etc.
In the preferred embodiment, the two-dimensional array is processed and a one-dimensional array is created. In creating the one-dimensional array, the code scans for the loudest frequency for each frame and adds this to the array, discarding all other qualities of the frame. So this array cannot answer the question “how loud are these frames?”, but rather “what is the loudest frequency of each frame?”
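As a minimal sketch of this reduction, assume the two-dimensional array is held as a list of frames, each a list of loudness values indexed by frequency bin (the storage layout is an assumption). The one-dimensional array is then the index of the loudest bin in each frame.

```python
def peak_bins(frames):
    """Return, for each frame, the index of its loudest frequency bin."""
    # max() over the bin indices, keyed by loudness; ties go to the lowest bin.
    return [max(range(len(frame)), key=frame.__getitem__) for frame in frames]


# Three frames of three bins each (a real frame would hold 128 values).
peaks = peak_bins([[0.1, 0.9, 0.2], [0.5, 0.1, 0.7], [1.0, 0.3, 0.3]])
```

The loudness values themselves are discarded, which is why this array answers only “which frequency was loudest?” for each frame.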
At this time, additional statistical properties of the segment are also determined, including calculated duration, frequency statistics (minimum, maximum, mean, and standard deviation derived from the one-dimensional array values) and amplitude statistics (mean and standard deviation derived from peak frequency values). When processing a reference sample, these statistical properties are stored in the ReferenceSegments database table for future use; when processing a test sample, the ReferenceSegments table is consulted. The one- and two-dimensional arrays are stored on disk, on a fast Solid State Drive.
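The seven simple parameters could be computed along these lines. The text does not say whether the standard deviation is the population or the sample form, so the use of the population form here is an assumption, as are the function and key names.

```python
import statistics


def segment_stats(peak_freqs, peak_amps, frame_seconds):
    """Summarize a segment: duration, four frequency statistics, two amplitude statistics."""
    return {
        "duration": len(peak_freqs) * frame_seconds,
        "freq_min": min(peak_freqs),
        "freq_max": max(peak_freqs),
        "freq_mean": statistics.fmean(peak_freqs),
        "freq_sd": statistics.pstdev(peak_freqs),  # population SD (assumption)
        "amp_mean": statistics.fmean(peak_amps),
        "amp_sd": statistics.pstdev(peak_amps),
    }


stats = segment_stats([2000, 4000], [0.5, 1.0], frame_seconds=0.01)
```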
Twenty Questions: The Reasoning Algorithms
The remainder of the code of our invention works something like a game of 20 questions, each question being designed to find a subset of the previous answers. The easiest procedures are done first, to conserve resources.
In a game of 20 questions, our questions are:
- What species are in the taxonomic category (Class, Order, or other category) optionally provided by the user?
- Is it in the ProximateCandidateList? Species most commonly found at this location at this time of year (and perhaps belonging to the specified taxonomic category) form the ProximateCandidateList. There is one such list per test recording.
- Any additional questions asked of the user, such as what colors were seen. Such questions can considerably narrow the number of candidates. For example, any songbird that is all red and is found on the East Coast of the U.S. is a cardinal. Answers to such questions are applied to the ProximateCandidateList, producing a ReducedProximateCandidateList.
- Is it in ParameterMatches? Species from ReducedProximateCandidateList that have a reference sound segment most closely matching a given segment of the submitted audio sample in duration, 4 frequency parameters, and 2 amplitude parameters form the ParameterMatches list. There is one such list per test segment.
- Is it in BestOneDMatches? Species from ParameterMatches that have the closest matches between the one-dimensional arrays of the reference segments and the test segment form the BestOneDMatches list. There is one such list per test segment.
- Is it in BestTwoDMatches? Species from the BestOneDMatches list that have the closest matches between the two-dimensional arrays of the reference segments and the test segment form the BestTwoDMatches list. There is one such list per test segment.
- Is it in OverallBestTwoDMatches? Each of the BestTwoDMatches lists may be arranged in a matrix, with species as row headers, each segment number as column headers, and the score from the BestTwoDMatches list for that species and segment as the value of each cell. Some cells may be empty because a species might reasonably match on some of its reference segments but not all. Such values may be combined to produce a total score column, and that column sorted by total score, creating an OverallBestTwoDMatches list – the essence of our identification.
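The final step in the list above, combining the per-segment scores into one ranking, might be sketched as follows. The text says only that the values “may be combined”; summation, with empty cells contributing nothing and higher scores meaning better matches, is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def overall_best(per_segment_scores):
    """Sum each species' scores across segments and sort from best to worst.

    per_segment_scores holds one {species: score} dict per test segment;
    a species missing from a segment's dict is an empty cell in the matrix.
    """
    totals = defaultdict(float)
    for scores in per_segment_scores:
        for species, score in scores.items():
            totals[species] += score
    # Highest total first; ties broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


ranking = overall_best([
    {"A": 0.75, "B": 0.5},
    {"A": 0.75},
    {"B": 0.5, "C": 0.25},
])
```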
Building ProximateCandidateList: a List of Proximate Candidates
To provide rapid searching for birds observed at or near an arbitrary location, we generate a specialized database table. We begin with data provided by the Global Biodiversity Information Facility. GBIF aggregates and “cleans” observations gathered by hundreds of organizations. Before creating our own tables, we further cleaned the data.
Because lines of longitude converge at the poles, the geographic distance between any two longitude numbers is less at higher latitudes (specifically, it varies as the cosine of the latitude). We use trigonometric expressions to transform coordinates to an integer that roughly corresponds to equal geographic areas at any latitude. We call these values latoid and lngoid.
Global coordinates are commonly expressed as degrees, minutes and seconds; in our database (as in GBIF and many others) these are stored as decimal fractions of degrees. One degree at the equator represents about 69 statute miles, so for our desired 50-mile resolution we employ a constant, 1.38, which is the ratio 69:50. And of course the trigonometry assumes that the earth is a perfect sphere. This process emphasizes speed and simplicity over unneeded precision.
Latoid and lngoid are thus derived as follows (the built-in cos() function takes its argument in radians; the cosine itself is of course independent of the units used to calculate it):
Latoid = Round(1.38 * latitude)
Lngoid = Round(1.38 * longitude * Cos(Radians(latitude)))
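A direct Python transcription of these two formulas follows. The function name `to_grid` is hypothetical, and Python’s `round()` rounds exact halves to the nearest even integer, which may differ from the `Round` function of the original code in rare edge cases.

```python
import math

MILES_PER_DEGREE = 69.0  # approximate length of one degree at the equator
CELL_MILES = 50.0        # desired resolution
SCALE = MILES_PER_DEGREE / CELL_MILES  # 1.38


def to_grid(latitude, longitude):
    """Convert decimal degrees to (latoid, lngoid) cells of roughly 50 miles."""
    latoid = round(SCALE * latitude)
    # Longitude is shrunk by cos(latitude) because meridians converge toward the poles.
    lngoid = round(SCALE * longitude * math.cos(math.radians(latitude)))
    return latoid, lngoid


cell = to_grid(40.0, -75.0)  # 1.38 * 40 = 55.2; 1.38 * -75 * cos(40°) ≈ -79.29
```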
From approximately 250 million recorded observations (some representing whole flocks of dozens or even thousands of individuals), we generate a table of about five million rows identifying distinct species found in roughly equal geographic regions (about 50 miles square).
A list of species known to exist near a specified latitude and longitude can be derived by a query such as the following, where critterid is our species identifier, latoid1 and lngoid1 are the desired coordinates transformed as described above, and locus is a derived value representing the wanted proximity in units of our 50-mile resolution:
select distinct critterid from LivesNear
where latoid between (latoid1-locus) and (latoid1+locus)
and lngoid between (lngoid1-locus) and (lngoid1+locus);
(The actual queries used incorporate joins to associate binomial names, common names and other information.)
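To try the proximity query, the sketch below runs it against an in-memory SQLite database. The database engine, the column types, and the sample rows are all assumptions; only the table and column names come from the query above. An `order by` is added so the result order is predictable.

```python
import sqlite3


def species_near(conn, latoid1, lngoid1, locus):
    """Return the distinct critterids found within `locus` cells of the given cell."""
    rows = conn.execute(
        "select distinct critterid from LivesNear "
        "where latoid between ? and ? "
        "and lngoid between ? and ? "
        "order by critterid",
        (latoid1 - locus, latoid1 + locus, lngoid1 - locus, lngoid1 + locus),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("create table LivesNear (critterid integer, latoid integer, lngoid integer)")
conn.executemany(
    "insert into LivesNear values (?, ?, ?)",
    [(1, 55, -79), (2, 56, -80), (3, 10, 10), (1, 56, -79)],  # made-up rows
)
nearby = species_near(conn, 55, -79, 1)
```

Because both coordinates are small integers, the two `between` ranges can be served by an ordinary index, which is what makes the lookup fast.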
For each of the resulting rows of the query, a row is created in a spreadsheet table in the code. If any species in this table have no referencesegments in the database, they are removed from the table. (Since the table is of the area’s most common birds, and the reference recordings favor the most common birds, this rarely happens.) The resulting rows of this query constitute what we will call the Proximate Candidate List – effectively a list of the species that most likely made this recording.
Building ParameterMatches: a Sublist of Best Matches on Simple Parameters
During fingerprinting (described above), the code has calculated duration, frequency statistics (minimum, maximum, mean, and standard deviation) and amplitude statistics (mean and standard deviation). For each segment, a SQL query then finds those reference segments that most closely match the segment to be identified. The query is complex, so a simplified version is shown here. Assume that the current segment has a duration D and a minimum frequency of M, and that DU and MF are the Duration and Minimum Frequency fields of the database table holding reference segments. The query starts like this:
select md5, abs(D-duration) as DUSimilarity, abs(M-MF) as MFSimilarity …
The query wishes to only look at sounds from critters in the Proximate Candidate List, so it includes this:
… from referencesegments where CritterID in ProximateCandidateList…
In order to find the segments which are most similar in each of our statistics (duration, the frequency statistics, and the amplitude statistics), we sum the individual differences, giving each parameter an equal weight, so that the list is ordered with the most similar segments first:
… order by DUSimilarity+MFSimilarity+…
If we only want the top 50 best matches, we add:
… limit 50
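The ranking that this query performs can be mirrored outside the database in a few lines. This is a sketch of the same equal-weight idea, not the actual query; the dictionary keys are hypothetical.

```python
def parameter_matches(test, references, limit=50):
    """Rank reference segments by the summed absolute difference on each parameter."""
    def distance(ref):
        # Equal weight: simply add the absolute differences of every test parameter.
        return sum(abs(test[name] - ref[name]) for name in test)

    return sorted(references, key=distance)[:limit]


test_segment = {"duration": 0.1, "min_freq": 2000}
reference_rows = [
    {"md5": "a", "duration": 0.5, "min_freq": 2500},
    {"md5": "b", "duration": 0.1, "min_freq": 2010},
    {"md5": "c", "duration": 0.2, "min_freq": 1900},
]
top_two = [row["md5"] for row in parameter_matches(test_segment, reference_rows, limit=2)]
```

With equal weights on raw values, a parameter with a large numeric range (a frequency in Hz, say) contributes far more to the sum than one with a small range (a duration in seconds). Whether the actual query rescales the parameters is not stated here.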
Building BestOneDMatches: A Sublist of Best Matches on One-Dimensional Searches
The query that produced the ParameterMatches list started with the ProximateCandidateList, and narrowed the field of possible species that have made the test sound. It contains those species that are most commonly found at this location and this time of year and that also have the best basic audio parameter matches. The process of producing this subset was very fast. The next query asks “Which of these reference segments match our candidate segment when we look more carefully, using more detail?”
To do this efficiently, we use the one-dimensional array discussed previously. We compute it for each test segment, and compare each of these arrays with those one dimensional arrays which we previously produced for the reference segments in our ParameterMatches list.
We now determine the best matches for more complex qualities of our reference segments, by comparing the one-dimensional array of this segment with the one-dimensional arrays previously developed from reference segments which are on our list.
The array matching rolls the test segment’s one-dimensional array along the length of a reference segment’s array until a match is found. For instance, a test array of 345 would be found in a reference array of 012345 after rolling forward 3 elements.
The match need not be exact. Several parameters control how different two arrays may be and still be considered “the same”. Such parameters include:
- Frequency Variation: This defines how different the numbers may be. With a value of 0, 124 will not be found in 012345. With a value of 2, 124 will be found, since 4 in the test array is within 2 units of the 3 in the second array.
- Frame Variation: This allows a test array to be said to match a reference segment array if any value matches the value in the corresponding frame, the previous frame, or the following frame.
- Difference Tolerance: This defines how many of our comparisons of the one-dimensional array of a test segment can fail to match the one-dimensional array of a given reference segment. It is defined as a percentage, to allow more non-matches in longer audio clips. A high difference tolerance may be set, to increase the size of the list of matches if needed.
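The rolling comparison and its three parameters can be sketched as below. Several details are assumptions: the tolerance is expressed as a fraction rather than a percentage, every offset is scanned and the one with the fewest mismatches is kept (rather than stopping at the first match), and the function name is hypothetical.

```python
def best_offset(test, reference, freq_variation=0, frame_variation=0,
                difference_tolerance=0.0):
    """Slide `test` along `reference`; return (offset, mismatch_fraction) or None."""
    if not test:
        return None
    best = None
    for offset in range(len(reference) - len(test) + 1):
        misses = 0
        for i, value in enumerate(test):
            center = offset + i
            # Frame Variation: also accept a match in neighbouring frames.
            lo = max(0, center - frame_variation)
            hi = min(len(reference), center + frame_variation + 1)
            # Frequency Variation: values may differ by up to this many units.
            if not any(abs(value - reference[j]) <= freq_variation
                       for j in range(lo, hi)):
                misses += 1
        fraction = misses / len(test)
        # Difference Tolerance: the share of frames allowed to mismatch.
        if fraction <= difference_tolerance and (best is None or fraction < best[1]):
            best = (offset, fraction)
    return best
```

With the arrays from the text, `best_offset([3, 4, 5], [0, 1, 2, 3, 4, 5])` finds the match after rolling forward 3 elements, while `[1, 2, 4]` is found only once `freq_variation` is raised above 0.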
Once a comparison has been made, the difference between the two arrays has been computed, and can be used to describe the degree of match. Those with the closest matches on the comparison of one-dimensional arrays are included in BestOneDMatches, a sublist of best matches on one-dimensional searches for this test segment.
One such list is created for each segment. As each search completes, the thread sends its matching results back to the main thread, where they are added to the BestOneDMatches list. When all such searches have completed, this list is sorted by percentage match.
By adjusting the Difference Tolerance, these lists may be of almost any length, but in our preferred embodiment, only the top matches are retained for the next step.
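The threaded collection of results might look like the sketch below. It assumes a thread pool from Python's standard library; `percent_match` is a deliberately simple placeholder scorer, and the names `best_one_d_matches`, `keep`, and `workers` are hypothetical rather than taken from the prototype.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List, Sequence, Tuple


def percent_match(test: Sequence[int], reference: Sequence[int]) -> float:
    """Placeholder scorer: best percentage of equal elements over all offsets."""
    if not test or len(reference) < len(test):
        return 0.0
    best_hits = 0
    for offset in range(len(reference) - len(test) + 1):
        hits = sum(1 for k, v in enumerate(test) if reference[offset + k] == v)
        best_hits = max(best_hits, hits)
    return 100.0 * best_hits / len(test)


def best_one_d_matches(test: Sequence[int],
                       references: Dict[str, Sequence[int]],
                       keep: int = 10,
                       workers: int = 4) -> List[Tuple[str, float]]:
    """Score every reference segment in a worker thread, gather the results
    in the calling thread, sort by percentage match, and keep the top few."""
    results: List[Tuple[str, float]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(percent_match, test, ref): name
                   for name, ref in references.items()}
        for future in as_completed(futures):
            # Each finished search hands its result back to the main thread.
            results.append((futures[future], future.result()))
    # Highest percentage first; the name breaks ties deterministically.
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:keep]
```

Sorting only after every search has finished mirrors the order described above, and `keep` stands in for retaining only the top matches.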
Building BestTwoDMatches: A Sublist of Best Matches on Two-Dimensional Searches
The two-dimensional search is more precise than the one-dimensional peak search, but is also much more sensitive to both volume level and noise. Like one-dimensional searches, it is threaded in the preferred implementation, to run as fast as the CPU permits. The 2D array is first normalized (the loudest value becomes 1 and all other values are scaled by the same ratio) and filtered (low-volume background noise removed).
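A minimal sketch of the normalize-and-filter step follows. It assumes the 2D array is a list of columns of non-negative magnitudes, and that filtering means zeroing values below a fixed fraction of the peak; the `noise_floor` parameter and its default are assumptions for illustration.

```python
from typing import List


def normalize_and_filter(spectrogram: List[List[float]],
                         noise_floor: float = 0.1) -> List[List[float]]:
    """Scale so the loudest value becomes 1, then zero out quiet values."""
    peak = max((v for column in spectrogram for v in column), default=0.0)
    if peak <= 0.0:
        # Silent or empty input: nothing to scale.
        return [[0.0 for _ in column] for column in spectrogram]
    cleaned = []
    for column in spectrogram:
        scaled = [v / peak for v in column]
        # Treat anything below the noise floor as background noise.
        cleaned.append([v if v >= noise_floor else 0.0 for v in scaled])
    return cleaned
```

Normalizing before filtering means the threshold is relative to the loudest sound in the clip, so a quiet recording and a loud one are filtered alike.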
The code to perform this effort is almost identical to the 1D search. Where the 1D search compares a single frame value, the 2D search compares the frame’s entire column of values, and it uses frequency variation rather than time variation when checking each element of the column against the surrounding elements in the reference array.
For every frame (column), the code calls a sub-procedure that compares every element of the column instead of only one element (as with a one-dimensional search).
The code begins by getting the pointer to the first column’s first element. Then, as with a 1D search, it tries to match the values column-by-column. Within the loop, the code also gets a pointer to the ‘search for’ array and passes both pointers to the sub-procedure that compares the two columns. After the sub-procedure returns, both pointers are incremented to point to the next row of their respective arrays.
When all the ‘search for’ columns have been compared, the ‘search in’ pointer is incremented to point to the next column, and the ‘search for’ pointer is reset to its first column. This is repeated until we reach the end of the ‘search in’ array.
Several parameters control the search:
- AmplitudeVariation: How much can a frequency value vary in amplitude or power and still be considered a match?
- AllowedDifferences: How many mismatches for the column will be permitted?
- CheckUnits: This is similar to the one-dimensional search’s ‘FrameVariation’, but in this case it is how many frequency deviations to allow.
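The column-by-column search and its three parameters can be sketched as below. The sketch uses list indexing in place of pointers, assumes both arrays are lists of columns with the same number of frequency bins, and treats AllowedDifferences as a per-column count; the function names are hypothetical.

```python
from typing import List, Optional


def columns_match(test_col: List[float], ref_col: List[float],
                  amplitude_variation: float,
                  allowed_differences: int,
                  check_units: int) -> bool:
    """Compare one test column with one reference column, element by element."""
    misses = 0
    for row, value in enumerate(test_col):
        # CheckUnits: also accept a match in nearby frequency bins.
        low = max(0, row - check_units)
        high = min(len(ref_col) - 1, row + check_units)
        if not any(abs(value - ref_col[r]) <= amplitude_variation
                   for r in range(low, high + 1)):
            misses += 1
            if misses > allowed_differences:
                return False
    return True


def find_2d(search_for: List[List[float]], search_in: List[List[float]],
            amplitude_variation: float = 0.0,
            allowed_differences: int = 0,
            check_units: int = 0) -> Optional[int]:
    """Return the first column offset in search_in where every column of
    search_for matches, or None if there is no such offset."""
    for offset in range(len(search_in) - len(search_for) + 1):
        if all(columns_match(search_for[k], search_in[offset + k],
                             amplitude_variation, allowed_differences,
                             check_units)
               for k in range(len(search_for))):
            return offset
    return None
```

The outer loop plays the role of advancing the ‘search in’ pointer, and restarting `k` at 0 for each offset corresponds to resetting the ‘search for’ pointer to its first column.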
To compare a test segment with a reference segment which has a different duration, there are two techniques that might be used: stretching and step-by-step matching. With stretching, the 1D and 2D arrays are shrunk or stretched in duration to compensate for duration variance using a simple nearest-neighbour procedure. Stretching is CPU intensive. The alternative might be a different 1D and 2D search algorithm that operates step-by-step, rather than scanning the array linearly. In this approach, where there is a match either backward or forward in time, the procedure would continue the comparison from this match point.
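Nearest-neighbour stretching can be sketched in a few lines. The function name and the centre-of-frame sampling rule are assumptions; the same routine works on a 1D array of values or on a 2D array held as a list of columns, since it only picks which frames to keep or repeat.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stretch_nearest(frames: Sequence[T], new_length: int) -> List[T]:
    """Resample frames to new_length by copying the nearest source frame."""
    n = len(frames)
    if n == 0 or new_length <= 0:
        return []
    # Sample at the centre of each output frame, mapped back to the source.
    return [frames[min(n - 1, int((i + 0.5) * n / new_length))]
            for i in range(new_length)]
```

For example, `stretch_nearest([1, 2, 3], 6)` repeats each frame twice, while asking for a shorter length drops frames instead.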
The present invention offers several significant benefits.
- No strain on the CPU of the user’s portable computer, tablet, or smart phone. Because processing is done at the server, a fast CPU in the handheld device is not needed.
- No wasted storage. Because databases are stored at the server, there is nothing to choke the storage of the local handheld device. One app, which claims to identify only 50 birds, uses 233 MB in a smartphone. Our own prototype, in contrast, uses under 13 MB.
- Minimal product updates. The smart phone application need only record and transmit a captured sound, then display the results. Product revision occurs at the server, in the lab. Such revisions in the lab include changes to the database, to algorithms used in analysis, and to the interface that the app’s user has after submitting a sample.
- Independence of processes. To prevent backlogs in the processing of either new reference samples or new test samples, most processes have been designed to be independent. Multiple staff are able to update the reference database while multiple processes (many threads, many instances of the application, running on many machines) examine test samples and multiple processes make comparisons.
- Protection of intellectual property. All important IP remains in the lab.
- Additional precision possible with identification. In some cases, the analysis should be able to determine age, sex, or geographic dialect of the voice. (Yes, birds from New Jersey sound different than those from California.) And in most cases, the identification will be able to determine if the recording is song or call and other qualities that a birder might know (e.g., “warning call in flight” or “Twitter calls of chicks”).
- Easy re-examination of reference database in the event that problems are found in any analytic stage.
Our prototype has been named AudiOh!™, and operates in Android systems. Its database tables were built from over 100,000 recordings of more than 8,000 species. Segmentation has created well over 8 million reference segments from those 100,000 recordings. Proximity information draws on tables with 250 million observations. It is able to identify the species of a typical recording faster than that recording can be played. In our early tests, in the list of candidates that had some match to the submitted recording, the first bird suggested was the correct one in 87% of the tests.
The present invention has been described relative to an illustrative embodiment. Since certain changes may be made in the above constructions without departing from the scope of the invention, it is intended that all matter contained in the above description or shown in the accompanying drawings be interpreted as illustrative and not in a limiting sense.
It is also to be understood that the following claims are to cover all generic and specific features of the invention described herein, and all statements of the scope of the invention which, as a matter of language, might be said to fall therebetween.
A cell phone is used to record a bird song or other animal vocalization. To identify the species that made the sound, the user then uploads it to a web site. GPS coordinates where the sound was recorded, the time and date the sound was recorded, and a token identifying the user are also automatically uploaded.
At the web site, software running in the server then breaks the recording into short segments of “words” or “syllables”, isolable from the recording because they are preceded and followed by very brief pauses. For each segment, various statistical qualities are computed, such as the duration, minimum frequency, maximum frequency, mean frequency, frequency standard deviation, mean amplitude, and amplitude standard deviation. In addition, more complex statistics such as a Fast Fourier Transform are created for each segment.
Server software then applies a series of filters, from coarse to fine, to narrow the results. Using such properties as taxonomic category optionally provided by the user, location and date of observation, a short list is created of birds most likely to have been available to produce the sound.
The qualities of a segment are then compared with qualities of the segments in the reference segments database, examining only entries for birds on the short list and finding the closest matches across all qualities. This is repeated for each segment, the results being stored in a matrix with species as rows, segments as columns, and cells representing the degree of match across qualities.
Once all segments have been so compared, the average degree of match in qualities may be found. For those with the best matches, more complex statistical comparisons may now be made, and the overall best matches reported to the user.
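The averaging step over the species-by-segment matrix can be sketched as follows. The matrix is assumed to be a mapping from species name to one degree-of-match score per segment; the function name, the `keep` parameter, and the species and scores in the usage note are illustrative only.

```python
from typing import Dict, List, Tuple


def rank_species(match_matrix: Dict[str, List[float]],
                 keep: int = 3) -> List[Tuple[str, float]]:
    """Average each species' row of per-segment scores and return the
    species with the highest averages, best first."""
    averages = [(species, sum(scores) / len(scores))
                for species, scores in match_matrix.items() if scores]
    # Highest average first; the name breaks ties deterministically.
    averages.sort(key=lambda item: (-item[1], item[0]))
    return averages[:keep]
```

Given rows such as `{"robin": [90.0, 80.0], "wren": [60.0, 70.0]}`, the robin row averages higher and is ranked first; only the retained species would go on to the more complex comparisons.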
The user is provided with information on the cell phone on each of the reported matches, including pictures and audio recordings, and is asked to conclude which bird made the recorded sound. The feedback is used by developers to improve the algorithms and database. The results and recording are also provided to the user in an email.
|Cited Patent||Filing date||Publication date||Applicant||Title|
|US5452364||Dec 7, 1993||Sep 19, 1995||Bonham; Douglas M.||System and method for monitoring wildlife|
|US5956463||Oct 7, 1996||Sep 21, 1999||Ontario Hydro||Audio monitoring system for assessing wildlife biodiversity|
|US6546368||Jul 19, 2000||Apr 8, 2003||Identity Concepts, Llc||Subject identification aid using location|
|US7082394||Jun 25, 2002||Jul 25, 2006||Microsoft Corporation||Noise-robust feature extraction using multi-layer principal component analysis|
|US20010044719||May 21, 2001||Nov 22, 2001||Mitsubishi Electric Research Laboratories, Inc.||Method and system for recognizing, indexing, and searching acoustic signals|
|US20030125946||Feb 22, 2002||Jul 3, 2003||Wen-Hao Hsu||Method and apparatus for recognizing animal species from an animal voice|
|US20040107104||Dec 3, 2002||Jun 3, 2004||Schaphorst Richard A.||Method and apparatus for automated identification of animal sounds|
|CA2089597A1||Feb 16, 1993||Aug 17, 1994||Douglas G. Bain||Apparatus for audio identification of a bird|
|EP0629996A2||Jun 3, 1994||Dec 21, 1994||Ontario Hydro||Automated intelligent monitoring system|
|US8223980||Mar 26, 2010||Jul 17, 2012||Dooling Robert J||Method for modeling effects of anthropogenic noise on an animal’s perception of other sounds|
|US8457962||Aug 4, 2006||Jun 4, 2013||Lawrence P. Jones||Remote audio surveillance for detection and analysis of wildlife sounds|
|US8510104||Sep 14, 2012||Aug 13, 2013||Research In Motion Limited||System and method for low overhead frequency domain voice authentication|
|US8571259||Jun 17, 2009||Oct 29, 2013||Robert Allan Margolis||System and method for automatic identification of wildlife|
|US8599647||May 10, 2011||Dec 3, 2013||Wildlife Acoustics, Inc.||Method for listening to ultrasonic animal sounds|
|US8915215||Jun 17, 2013||Dec 23, 2014||Scott A. Helgeson||Method and apparatus for monitoring poultry in barns|
|US9093120||Feb 10, 2011||Jul 28, 2015||Yahoo! Inc.||Audio fingerprint extraction by scaling in time and resampling|
|US20100322483||Jun 17, 2009||Dec 23, 2010||Robert Allan Margolis||System and method for automatic identification of wildlife|
|US20110273964||Nov 10, 2011||Wildlife Acoustics, Inc.||Method for listening to ultrasonic animal sounds|
|US20120209612||Aug 16, 2012||Intonow||Extraction and Matching of Characteristic Fingerprints from Audio Signals|
|US20130013309||Jan 10, 2013||Research In Motion Limited||System and Method for Low Overhead Voice Authentication|
|US20130332165||Jun 6, 2012||Dec 12, 2013||Qualcomm Incorporated||Method and systems having improved speech recognition|
|WO2015017799A1||Aug 1, 2014||Feb 5, 2015||Philp Steven||Signal processing system for comparing a human-generated signal to a wildlife call signal|
|1||*||Anderson, S. Dave, A. Margoliash, D. “Template-based automatic recognition of birdsong syllables from continuous recordings.” J. Acoust. Soc. Am. 100, pt. 1, Aug. 1996.|
|2||Anderson, S.E., et al., Department of Organismal Biology and Anatomy, University of Chicago, Automatic Recognition and Analysis of Birdsong Syllables from Continuous Recordings, Mar. 8, 1995.|
|3||Anderson, S.E., et al., Department of Organismal Biology and Anatomy, University of Chicago, Speech Recognition Meets Bird Song: A Comparison of Statistics-Based and Template-Based Techniques, JASA, vol. 106, No. 4, Pt. 2, Oct. 1999.|
|4||Anonymous, The Basics of Microphones, Apr. 26, 2003, pp. 1-4, http://www.nrgresearch.com/microphonestutorial.htm.|
|5||*||Clemins, P. Johnson, M. “Application of speech recognition to African elephant vocalizations” Acoustics, Speech and Signal Processing vol. 1, Apr. 2003, pp. 484-487.|
|6||El Gayar, N. et al., Fuzzy Neural Network Models for High-Dimensional Data Clustering, ISFL ’97, Second International ICSC Symposium on Fuzzy Logic and Applications ICSC Academic Press, Zurich, Switzerland, pp. 203-209, Feb. 12, 1997.|
|7||*||Franzen, A. Gu, I. “Classification of bird species by using key song searching: a comparative study” Systems, Man and Cybernetics, vol. 1, Oct. 2003, pp. 880-887.|
|8||*||Harma, A. “Automatic identification of bird species based on sinusoidal modelling of syllables.” IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 5, pp. 545-548, Apr. 2003.|
|9||Harma, Aki, “Automatic Identification of Bird Species Based on Sinusoidal Modeling of Syllables”, IEEE Int. Conf. Acoustics, Speech, Signal Processing (ICASSP 2003), Hong Kong, Apr. 2003.|
|10||http://ourworld.compuserve.com/homepages/G-Kunkel/project/Project.htm Jun. 22, 2004.|
|11||*||Kogan, J. Margoliash, D. “Automated recognition of bird song elements from continuous recordings using dynamic time warping and HMM: A comparative study” J. Acoustic Soc. Am. 103, Apr. 1998.|
|12||Kogan, Joseph A. and Margoliash, Automated Recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: A Comparative Study, J. Acoust. Soc. Am. (4), Apr. 1998.|
|13||Lleida, L. et al., Robust Continuous Speech Recognition System Based on a Microphone Array, IEEE International Conference on Seattle, WA, pp. 241-244, May 12, 1998.|
|14||Mcilraith, Alex L. and Card, Howard C., Birdsong Recognition Using Backpropagation and Multivariate Statistics, IEEE Transactions on Signal Processing, vol. 45, No. 11, Nov. 1997.|
|15||Suksmono, A.B., et al., Adaptive Image Coding Based on Vector Quantization Using SOFM-NN Algorithm, IEEE APCCAS (Asia-Pacific Conference on Chiangmai, Thailand, pp. 443-446, Nov. 1998.|
* Cited by examiner
|U.S. Classification||704/231, 704/270, 704/243, 119/713, 704/E17.002, 119/718, 704/246|
|International Classification||A01K15/00, G10L15/06, A01K45/00, G10L17/00, G10L15/00, A01K29/00|
|Cooperative Classification||G10L17/26, A01K29/005, A01K11/008, A01K45/00, A01K29/00|
|European Classification||A01K11/00C2, G10L17/26, A01K45/00, A01K29/00, A01K29/00B|