In real life, audio is mainly used in two scenarios: voice and music. Voice is mainly used for communication, such as phone calls; with the progress of speech recognition, human-machine voice interaction has also become an important application, and it is currently a hot area, with many large vendors launching smart speakers. Music is mainly used for listening, such as music playback. Here is a brief introduction to the basics of audio.

Sampling and sampling frequency: We are now in the digital age, so when processing audio we must first convert the analog audio signal into a digital signal; this is called A/D conversion. Turning the analog signal into a digital signal requires sampling. When the audio is played back, the digital signal is converted back into an analog signal, which is called D/A conversion. The number of samples taken in one second is the sampling frequency. According to the Nyquist sampling theorem, to reconstruct the original signal the sampling frequency must be greater than twice the highest frequency in the signal. The higher the sampling frequency, the closer the result is to the original signal, but the more computation is required. The range of human hearing is 20 Hz to 20 kHz, so the common sampling frequency for music is 44.1 kHz (by the Nyquist theorem, more than twice the highest audible frequency); higher rates such as 48 kHz and 96 kHz are also used, but most people cannot hear the difference. Voice is mainly used for communication and does not need to be as faithful as music; it is divided into narrowband and wideband. Narrowband covers 300 Hz to 3400 Hz with a corresponding sampling frequency of 8000 Hz; wideband covers 50 Hz to 7000 Hz with a corresponding sampling frequency of 16000 Hz. Speech sampled at 16 kHz is called high-definition (HD) voice, and 16 kHz is now the mainstream voice sampling frequency.

Number of sampling bits: Digital signals are represented with 0s and 1s. The number of sampling bits, also called the sampling precision, is how many bits are used to represent each sampled value; the more bits used, the closer to the real sound. With 8 bits the sample range is -128 to 127; with 16 bits it is -32768 to 32767. 16-bit samples are now the common choice.

Channels: Voice usually uses only one channel. Music can be mono, two-channel (left and right, i.e. stereo), or multi-channel (surround sound), the last of which is mostly used in theaters.

Audio capture and playback: Generally a dedicated chip (usually called a codec chip) captures the audio and performs the A/D conversion, then sends the digital signal to the CPU over the I2S bus (I2S is the mainstream bus; others such as the PCM bus are also used); in some designs the codec and the CPU are integrated into a single chip. For playback, the CPU sends the audio digital signal to the codec chip over the I2S bus, which performs the D/A conversion and plays the resulting analog signal. This part is common to voice and music, although the sampling rates may differ, with music using the higher rate.

Codec: If the sampled values were stored or sent directly, they would take up a lot of storage space or bandwidth. Taking 16 kHz, 16-bit, mono audio as an example, one second amounts to 32000 bytes (16000 samples x 2 bytes).
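To make these numbers concrete, here is a minimal sketch in Python (standard library only) that generates one second of a 440 Hz test tone as 16 kHz, 16-bit, mono PCM, prints the raw data size, and writes it to a WAV file. The tone frequency, amplitude, and file name are arbitrary choices for illustration.

    import math
    import struct
    import wave

    SAMPLE_RATE = 16000      # 16 kHz, the common wideband speech rate
    BITS_PER_SAMPLE = 16     # 16-bit samples, range -32768..32767
    CHANNELS = 1             # mono
    DURATION_S = 1
    TONE_HZ = 440            # arbitrary test tone, well below Nyquist (8000 Hz)

    # Generate one second of PCM samples for a sine tone at half amplitude.
    samples = []
    for n in range(SAMPLE_RATE * DURATION_S):
        t = n / SAMPLE_RATE
        samples.append(int(32767 * 0.5 * math.sin(2 * math.pi * TONE_HZ * t)))

    # Pack the samples as little-endian signed 16-bit integers (raw PCM).
    pcm_bytes = struct.pack("<%dh" % len(samples), *samples)
    print(len(pcm_bytes), "bytes per second")   # 16000 samples * 2 bytes = 32000

    # Write a standard WAV file (raw PCM plus a small header) for playback.
    with wave.open("tone_16k_mono.wav", "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(BITS_PER_SAMPLE // 8)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm_bytes)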
It is usually necessary to compress the sampled digital signal before saving or sending it. Compressing the sampled values into a bitstream is called encoding, and restoring the bitstream to sampled values is called decoding; together they are called a codec.

Audio codec: The audio sampling process is usually called pulse code modulation, i.e. PCM (Pulse Code Modulation) coding, and the sampled values are also called PCM values. To save storage space or network traffic, the PCM values are compressed. There are currently three major technical standards organizations that develop compression standards (listed at the end of this article); some large companies and organizations also define compression standards of their own, such as iLBC and Opus.

Lossless and lossy compression: Compression that preserves the PCM data exactly is called lossless compression, but its compression ratio is limited. Compression that discards some of the PCM information is called lossy compression; it can shrink the data to a small fraction of its original size, at the cost of some loss in audio quality.

Audio pre-processing: Audio processing refers to operating on the PCM data (also called linear data) to achieve a desired effect, such as echo cancellation. Processing the PCM data before encoding is called audio pre-processing; it is mainly used for speech, to remove various kinds of interference and make the sound clearer, and mainly includes echo cancellation, noise suppression, and gain control. Processing the PCM data after decoding is called audio post-processing; it is mainly used for music, to produce various sound effects and make the music more pleasant, and mainly includes the equalizer and reverb.

Audio transmission: This mainly refers to network transmission, i.e. sending the audio data to the other party over the network, and here voice and music differ significantly. Voice has very strict real-time requirements and is mainly carried over RTP/UDP. Because UDP is unreliable, packets can be lost or arrive out of order, which degrades voice quality, so countermeasures such as packet loss compensation (PLC), forward error correction (FEC), retransmission, and jitter buffers are used. Music used to be played from local files; in recent years, as network bandwidth has grown, music files in the cloud can be played as well, with the file downloaded while it plays. Music playback does not have strict real-time requirements and is generally carried over HTTP/TCP, so packet loss and reordering are not an issue. In software, transmitting voice well is not easy, especially over wireless networks; personally I think it is the hardest part apart from the audio algorithms themselves (the algorithms have a high barrier to entry and require solid knowledge of digital signal processing).
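As a toy illustration of the receiving side of voice transmission, the sketch below shows a minimal jitter buffer: RTP-style packets are reordered by sequence number, and a packet that never arrives is replaced by a frame of silence, standing in for real packet loss concealment. The class name, frame size, and test packets are invented for the example; production jitter buffers are adaptive and also use timestamps.

    import heapq

    class JitterBuffer:
        """Minimal illustrative jitter buffer: reorders packets by sequence
        number and fills gaps (lost packets) with silence."""

        def __init__(self, frame_bytes):
            self.frame_bytes = frame_bytes     # size of one audio frame payload
            self.heap = []                     # (sequence_number, payload)
            self.next_seq = None               # next sequence number to play

        def push(self, seq, payload):
            # Packets may arrive out of order over UDP; store them keyed by seq.
            heapq.heappush(self.heap, (seq, payload))

        def pop_frame(self):
            # Return the payload for the next expected sequence number, or a
            # frame of silence if that packet never arrived (trivial "PLC").
            if self.next_seq is None:
                if not self.heap:
                    return b"\x00" * self.frame_bytes   # nothing received yet
                self.next_seq = self.heap[0][0]
            while self.heap and self.heap[0][0] < self.next_seq:
                heapq.heappop(self.heap)                # drop late duplicates
            if self.heap and self.heap[0][0] == self.next_seq:
                _, payload = heapq.heappop(self.heap)
            else:
                payload = b"\x00" * self.frame_bytes    # lost packet -> silence
            self.next_seq += 1
            return payload

    # Packets arrive out of order and packet 2 is lost entirely.
    jb = JitterBuffer(frame_bytes=320)          # 10 ms of 16 kHz 16-bit mono
    for seq, data in [(1, b"A" * 320), (3, b"C" * 320), (4, b"D" * 320)]:
        jb.push(seq, data)
    for _ in range(4):
        print(jb.pop_frame()[:1])               # b'A', b'\x00', b'C', b'D'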
The three major standards organizations mentioned above are:
a) The ITU, which mainly defines compression standards for wireline voice (the G series), such as G.711 / G.722 / G.726 / G.729 (G.711 is sketched in code after this list).
b) 3GPP, which mainly defines compression standards for wireless voice (the AMR series, etc.), including AMR-NB and AMR-WB. The ITU later adopted AMR-WB as G.722.2.
c) MPEG, which mainly defines music compression standards, including 11172-3, 13818-3/7, 14496-3, etc.
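Of the standards above, G.711 is simple enough to sketch: it maps each 16-bit PCM sample to a single byte (a fixed 2:1 compression) using mu-law or A-law companding. Below is a rough Python sketch of the common mu-law encoding formulation, for illustration only; a real system would use a tested implementation of the standard.

    def linear_to_ulaw(sample):
        """Encode one signed 16-bit PCM sample to one G.711 mu-law byte."""
        BIAS, CLIP = 0x84, 32635
        sign = 0x80 if sample < 0 else 0x00
        if sample < 0:
            sample = -sample
        if sample > CLIP:
            sample = CLIP                       # clip to the codec's range
        sample += BIAS
        # Exponent = position of the highest set bit among bits 7..14.
        exponent, mask = 7, 0x4000
        while exponent > 0 and not (sample & mask):
            exponent -= 1
            mask >>= 1
        mantissa = (sample >> (exponent + 3)) & 0x0F
        # Pack sign, exponent, mantissa and invert, as mu-law specifies.
        return ~(sign | (exponent << 4) | mantissa) & 0xFF

    pcm = [0, 1000, -1000, 32767]               # four 16-bit samples (8 bytes)
    g711 = bytes(linear_to_ulaw(s) for s in pcm)
    print(len(pcm) * 2, "PCM bytes ->", len(g711), "G.711 bytes")   # 8 -> 4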