Time and Phase Coherence




A one second waveform of handclaps from "Casa Quetzal," Home, by Keller Williams.


Time- and phase-coherent speaker design

By Roy Johnson, loudspeaker designer, Green Mountain Audio, Inc.


Let us be straightforward about it, as no magazine nor any other speaker manufacturer is telling you the following:

Sound is a variation of the air's pressure on your eardrum. When we take that one statement apart word by word, it becomes easy to see where speakers can get things wrong -- leaving you dissatisfied, irritated, and bored.

This article is concerned with the timing of those variations, for everything we gain from hearing occurs as time passes -- all inflections and nuances, emotions, grace, thunder... take their own times to occur.

Each and every sound is actually made up of many separate pure tones -- high and low on the scale. Each pure tone arrives and departs with its own timing and loudness.

A simple hand-clap has low tones generated by the palms meeting. It has higher tones from the fingers slapping. The tones' pitches, high and low, their relative loudness, and their relative timing are what say 'hand clap' to our minds. A speaker needs to make all tones come out with the right loudness and the right timing. When it does not, you hear bacon frying instead of handclaps, as one example.

Many speakers get the loudness mostly right from bass to treble.
Almost all speakers get the timing wrong between bass and treble.
Typically, they send out their treble tones much sooner than bass tones, with each speaker design varying in how much time-delay it imposes. Time-delay is often called 'phase shift' and said 'not to matter'.

These time delays create many sonic problems often blamed on other gear in the system or on the recording. These frequency-dependent time delays occur nowhere else in the recording and playback chain, as a point of fact.

With each speaker injecting its own unique time delays from bass to treble, speakers all 'sound different', while still sounding 'like speakers' instead of the real thing. What follows are the basic causes of these time delays and how those are avoided or at least minimized. The result is called a time-coherent speaker design, such as we make.


First principles

To repeat that sentence above: Sound is a variation of the air pressure on the eardrum. A 'sound' is what we hear. What we perceive from 'that sound' depends on its unique variations of air pressure as time passes. Each sound is a set of many pure tones arriving and departing at their own times and loudnesses, which together make up the pattern unique to that 'sound.'

We memorize patterns, to learn 'that sound,' as well as learn about the context in which we are most likely to experience it. We then subconsciously decide at every moment which sounds we are most likely to hear and then which of those are most important to listen for.

Imagine the air molecules as countless billiard balls shooting all the time in every possible direction, at very high speeds, colliding very quickly with their many nearby neighbors, and each of those then banging off in a new direction. We feel this as heat and as the weather's barometric air pressure upon us. Sound is then rapid fluctuations of pressure overlaid onto the 'static' air pressure, to which the eardrum is delicately constructed to respond.

A speaker's cone shoves the nearest already-randomly-moving air molecules forwards. In turn, those immediately shove their neighbors, then return to their normal random motions. Those neighbors immediately shove their neighbors...and so on. The resulting 'wave' of higher air pressure thus moves away from the cone, exactly like a ripple through a crowd, at 340 meters per second (700+ mph).

When the cone pulls back, the air molecules next to it are pushed back in by other molecules randomly banging into them from behind. That retracting cone gave the closest molecules more space in which to move. This is a 'negative-pressure flow', and it also ripples over to your eardrum. Its arrival allows your eardrum to be pushed back out by the higher air pressure still trapped behind it. We do not hear negative pressures very well.

This image illustrates how the bunching up of molecules makes high pressure zones -- with molecules that came out of the now-depleted low-pressure zones. It also shows how the oscillation of molecules moving back and forth between high and low pressure zones is graphed as a 'wave' of higher-then-lower air pressure.


What you hear is the variation of the air pressure, swinging between positive and negative. Those variations in this image are steady -- regularly 'higher then lower' in pressure. Such a steady wave is produced by the pure 'ping' of a wine glass or any steady, pure tone. When the variation is rapid, we hear a high tone, and when it is slow, we hear a low tone. When the variation is large between high and low pressures, we hear a loud tone, and when the pressure variations are incredibly small, we hear a faint tone. When the variation becomes too large, our ears hurt and then our eardrums burst.

The variations in pressure from sounds which we care about are always quite irregular, as for the hand claps at the top of this page. The complexity of the waveform carries not only the overall tone of the sound (its 'character'), but also its message, the rhythm, the inflections, and any emotion.




Variation of the variations

When that variation is indeed perfectly regular, you will see a sine wave (the 'perfect wave') moving up and down on an oscilloscope, much like a water wave, and representing the microphone's diaphragm moving back and forth. There is no 'message' in hearing this one pure tone, unless it is perhaps a warning. Several sine wave tones of different 'wave lengths' mixed together make the beep of a pager and a few harsher-sounding 'square waves' added together make an alarm clock do its job.


No sound we truly care about is regular in its pressure variation. Speech, music, the cry of our baby, the rattles of our car, and the laugh of a friend are all complex sounds and change moment by moment. Each of those is composed of many single pure tones (sine waves) from low to high -- each starting and stopping at different times, arriving and departing with their own loudnesses. On the oscilloscope, complex sounds have highly irregular waveforms because that is how irregularly the microphone's diaphragm is being pushed around.

The microphone only 'knows' that irregular waveform, but we automatically extract the many small sounds inside it. We cannot see these directly on the oscilloscope, but they are 'in there.' When the loudness and/or the timing of any one of those constituent tones is changed, the complex shape of the waveform is changed and we hear a 'different sound.'


When the sound comes from your child, your car, or your favorite singer, each is presenting a unique but highly familiar mixture and pattern of tones. This is why your Volkswagen Beetle does not sound like a Ferrari and your child's voice cannot be mistaken for another child's voice.

Even in a noisy environment, you can immediately identify those sounds because you have come to know them well. That familiarity is what allowed your mind to instantly recognize them -- and filter them out of the background noise so that you can hear 'the message' in 'that familiar sound.'

You have been trained 'that sound' likely contains an important message. After all, it came from your child...or your car.





Two characteristics make familiar sounds identifiable: 1) The loudness of each of the many pure tones that lie inside that sound, and 2) The timing of when each tone arrived and departed. It is their individual loudness and timing which, when put together, make up that 'familiar sound.'

When a speaker achieves the right loudness of each tone, it has a 'flat' frequency response graph, or rightfully an 'amplitude response vs. frequency' graph. This graph would be a straight, horizontal line, where bass to treble runs from left to right along the bottom like the keys on a piano.

If that line were curved down to the right, you would hear too little treble, for example. If it went up and down at the left end, you would hear an irregular bass output, note by note.


When the speaker maintains the correct timing, it would produce the same type of 'flat-line' graph because any time delay between the speaker's bass and treble would always be the same, or constant. If that line went up and down, that would mark different time delays at different frequencies. Convention has it that any upward rise would indicate time delay, and any downward tilt would be from a time 'advance.' If the timing of various frequencies was changed, again the shape of the waveform on the oscilloscope would be changed.

In the graph below, note how tones A and B (blue and green) begin at the same instant. Combined, they produce the more-complex waveform 'A+B.' When we delay B at the beginning (red wave), we get a new shape of 'A+B,' which sounds completely different.
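The effect described above can be sketched numerically. The frequencies, the one-millisecond delay, and the sample rate below are hypothetical values chosen only for illustration, not taken from the graph. The sketch adds two pure tones twice, once starting together and once with B delayed, and reports how far apart the two combined waveforms end up.

```python
import math

def tone(freq_hz, t, delay_s=0.0):
    """Pure sine tone that is silent until delay_s, then starts at zero phase."""
    if t < delay_s:
        return 0.0
    return math.sin(2 * math.pi * freq_hz * (t - delay_s))

# Hypothetical values, for illustration only.
FREQ_A = 100.0      # tone A, in Hz
FREQ_B = 300.0      # tone B, in Hz
DELAY_B = 0.001     # delay applied to tone B, in seconds
SAMPLE_RATE = 48000

times = [n / SAMPLE_RATE for n in range(480)]  # the first 10 milliseconds

aligned = [tone(FREQ_A, t) + tone(FREQ_B, t) for t in times]
delayed = [tone(FREQ_A, t) + tone(FREQ_B, t, DELAY_B) for t in times]

# Largest gap between the two versions of 'A+B' at any single instant.
largest_difference = max(abs(a - d) for a, d in zip(aligned, delayed))

print(f"peak of A+B, tones aligned: {max(abs(v) for v in aligned):.3f}")
print(f"peak of A+B, B delayed:     {max(abs(v) for v in delayed):.3f}")
print(f"largest difference between the two shapes: {largest_difference:.3f}")
```

Each tone keeps its own loudness in both versions; only the timing of B has changed, yet the combined shape is no longer the same.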


A bit of bad news: All frequency-response and time-delay (phase) graphs -- no matter how 'flat' -- are misleading for several reasons:

  • To avoid room echoes, the measurement microphone is placed much closer to the speaker than you would ever listen. This close position distorts the loudness of each tone and each one's time of arrival.

  • When the microphone is at a realistic listener position, it makes the room echoes part of the main sound, yet our minds integrate just some of those echoes while automatically filtering off many more.

  • If the speaker is moved outdoors, up in the air to avoid all reflections, our timing measurements would be more accurate, but the bass loudness is enormously reduced, so the frequency response graph would not be accurate.

I have developed proprietary methods of measurement that avoid these issues, and discussed them with the Publisher of >




Time-coherent speakers

Most speakers delay different tones by different amounts, which is a major reason they sound like 'speakers' instead of the real thing. When the speaker puts no delays onto any tone, it is a perfectly 'time coherent' design. We must now mention that any delay can also be called a 'phase shift,' because it is easy to compare 'the phase' of two different things in motion.

Take the example of two children -- each on a playground swing. Are they swinging precisely side-by-side, or is one leading the other? Their phase difference would either be zero (actually 'zero degrees') or some portion of a full (360 degree) cycle. We bring this up now because in either situation, we have no information about which one started first. So, the term 'phase' for their motional relationship is relative to them, and not to an absolute time on the stopwatch.

Two waves out of phase.

Phase, if not otherwise identified, should always be taken to be a relative term, so here we use it advisedly to avoid confusion as an absolute. We are concerned about when things exactly start and stop, such as a woofer and a tweeter -- and when we can get them to do that at the same moment, they are both then moving in a time-coherent fashion, and thus also in a phase-coherent one. The converse is not true, and that has caused many problems for aspiring speaker designers.

When a speaker is not time-coherent, it often allows the highs to come out too soon. This usually makes the aggressive sounds of rock far too aggressive. However, the common remark heard is, "These speakers are very revealing of what must be a bad recording."


This is what those 'extra-sharp edges' sound like at first listen, and it is all too easy to blame a recording, of course. The music most enjoyed on these speakers would likely be quite mellow and perhaps not have many tones in the treble range where the timing is going wrong.

In the image, left, compare the shape of the admittedly generic 'complex' waveform before and after its high-tone range is 'slid' to the right by 80 degrees. "80 degrees of what," you wonder. Thank you for asking. Eighty degrees of the full 360 degree cycle of the crossover-frequency tone. See what we mean by relative?

The most simple test for time-coherence involves sending the speaker a single positive pulse of electricity, just a little jolt if you will. That could come from briefly touching a small battery to the speaker's terminals, which would cause the woofer, midrange and tweeter to all jump forward and then return to rest.

When each one's beginning portion of the pulse arrives at the microphone at the same instant, the total pulse should appear quickly and die away smoothly. In this drawing, compare the effect of time delays on the final shape of the pulse.


When the timing between woofer, midrange, and tweeter is the same, you hear everything. There is no way to predict what that may be, because we do not listen to your records and soundtracks. But perhaps it is how the drummer flicks out suddenly to express excitement or enthusiasm, or the singer tosses off a note just so to make her point about that bum of a boyfriend. Now, the worst speakers produce time delays that are no more than a few thousandths of a second, which seems as though it should be no big deal. However, our research consistently shows those time-domain distortions effectively hide many, many aspects of the performance necessary to our enjoyment. Ears are remarkably 'fast' devices!




Timing is pace and rhythm

Think about what it means for a musician to be 'on the beat.' If he or she is not, then the song certainly works in a different way, good or bad. Two researchers at Southern Oregon University, Kenneth A. Lindsay and Peter R. Norquist, decided to study Ray Charles' timings in a few of his famous songs. They measured the timing of his rhythm. In the 'musical fingerprint' of his rendition of the song, "Fever," below, his finger snaps set the beat with a precision of two and a half milliseconds -- less time than a honeybee takes to flap its wings once.


Each snap is a vertical line with time running from left to right. The distance between the peak of each snap is the time between snaps, and those are the peaks occurring 'on the money', within that +/- 2.5 milliseconds. The colors from dark red to yellow indicate the individual tones that make up each snap, from low to high. Yellow shows a particularly 'sharp' snap. The intensity of each color is the loudness of each frequency in that snap.

Now, Charles' timing precision of 2.5 milliseconds is very quick indeed, even for the best musicians, which is what the researchers noted -- an incredibly short period of time. However, does that number make sense without using the computer?


More: Their research paper > Discover magazine article >


Take a song having a fast, danceable beat. It would be written with each note as a 'quarter note' occurring perhaps 100 times per minute, each beat then happening every 0.6 seconds. But that quarter note is only the basic beat to which you bob your head. When a guitar solo is played, when a pianist runs the scale, or when a drummer taps the cymbals during that song, those notes can be written as 8th, 16th, 32nd, or even 64th notes, which occur 2, 4, 8, and 16 times more frequently than the quarter note.

Thus, in that song, a 32nd note would occur every 1/8th of every 0.6 second, or once about every 80 milliseconds. If a drummer, pianist, or guitarist is a little sloppy, perhaps on purpose, you would think we could easily hear a 10 percent difference in his timing, which would then be +/- 8 milliseconds.

If the speaker is not to distort that timing by its own 10 percent, that would mean it cannot inject a delay of more than +/- 0.8 milliseconds (+/- 800 millionths of a second).
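For anyone who wants to check the arithmetic, here is a minimal sketch of the same chain of reasoning. The function name is made up for this example. The unrounded results come out slightly below the rounded figures of about 80, 8, and 0.8 milliseconds used in the text.

```python
def note_interval_ms(beats_per_minute, notes_per_beat):
    """Time between successive notes, in milliseconds."""
    return 60000.0 / beats_per_minute / notes_per_beat

TEMPO_BPM = 100  # quarter notes per minute, as in the example above

quarter_ms = note_interval_ms(TEMPO_BPM, 1)        # one note per beat
thirty_second_ms = note_interval_ms(TEMPO_BPM, 8)  # eight 32nd notes per quarter note

# Ten percent of the note spacing, then ten percent of that for the speaker.
musician_tolerance_ms = 0.10 * thirty_second_ms
speaker_tolerance_ms = 0.10 * musician_tolerance_ms

print(f"quarter note every {quarter_ms:.0f} ms")
print(f"32nd note every {thirty_second_ms:.0f} ms")
print(f"10 percent of that: +/- {musician_tolerance_ms:.1f} ms")
print(f"speaker's share:    +/- {speaker_tolerance_ms:.2f} ms")
```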

This is a useful analysis, but not completely accurate when it comes to the speaker problems. We should not say that "The speaker should not distort the timing" but that "The speaker should not distort the timing of the individual tones that make up each finger snap." If a speaker delays the upper-tone range, this acts to spread out the 'sharpness' of each of Charles' finger snaps, making it much harder for us to hear the very moment the snap occurred, and thus how that impulse challenged and spurred on the other musicians. If the speaker delays the lowest bass, we often hear a sluggishness to the pace of the music. Reggae music becomes no fun -- it has no 'pulse.' If the speaker delays the upper bass, we often hear rhythm that is ill-defined and we just don't get what people feel from R&B music or gospel songs.

Perhaps there is no lower limit to how little the speaker can disturb the timing of any tone! We cannot really say, because we are not musicians. Even if some of us reading this were, it's likely none would be a world-class artist with the very best timing.




Timing is timbre

When two tones combine, we hear a third tone. When many tones combine, we hear a 'texture' to the sound. When those tones combine with a certain loudness and timing, we are hearing a distinctive texture, 'character', or timbre (tam-bur) to that sound, such as that of a piano. When we know that timbre well, we can identify the make of that piano.


The more precisely a speaker preserves the timing of every tone, the more accurate is the timbre of every instrument, voice, and sound effect. It also becomes easier for us to identify an instrument or voice and then follow its contribution to the overall sound. Again, perhaps there is no lower limit on how little the speaker can disturb the timing of any one tone before we hear it change the timbre. Timbre is actually a very complicated subject when it comes to its effect on how we perceive music.


Timing is space

All the small echoes that follow each sound are shadows of the main sound. Those can be obscured by shifts in timing that allow the main notes to 'hang on too long' and cover the smaller echoes over. The echoes of the first main sounds can be overcome by other main sounds that arrive too soon. In either situation, we hear less ambiance and spaciousness, because we hear less echo. There is less defined 'depth' between those artists 'closer to us' and ones farther back on stage. The overall sound image is flat and one-dimensional. "Wall of sound" is not an unusual comment. We are not certain there is any lower limit to the amount of timing distortion that obscures ambiance.


Timing is dynamics

When any transient occurs, from the whack of a drumstick to the yelp of a voice, each occurred as a compact set of tones over a brief span of time. If those tones' arrivals and departures are smeared out in time, then the loudness of any one peak is greatly reduced. The effect on music is to make it more bland. The speaker with the least smearing in the time domain would be the most dynamically lively, most dramatic, most sinister, most bold, and conversely, most graceful, because those sounds have their own small dynamic contrasts.
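A rough numerical sketch of that smearing follows. It builds a 'transient' from eight equal tones that all peak at the same instant, then delays each tone by an amount that grows with its frequency. The tone frequencies and the delay rule are assumptions made up for this example, not measurements of any speaker.

```python
import math

SAMPLE_RATE = 48000
FREQS_HZ = [500.0 * k for k in range(1, 9)]  # eight hypothetical tones, 500 Hz to 4 kHz

def no_delay(freq_hz):
    """Time-coherent case: every tone arrives on time."""
    return 0.0

def rising_delay(freq_hz):
    """Assumed smearing: delay grows with frequency, reaching 0.4 ms at 4 kHz."""
    return freq_hz * 1e-7

def peak_level(delay_for_freq):
    """Largest value of the summed tones within +/- 2 ms of the transient."""
    peak = 0.0
    for n in range(-96, 97):  # 96 samples is 2 ms at 48 kHz
        t = n / SAMPLE_RATE
        total = sum(
            math.cos(2 * math.pi * f * (t - delay_for_freq(f))) for f in FREQS_HZ
        )
        peak = max(peak, abs(total))
    return peak

aligned_peak = peak_level(no_delay)
smeared_peak = peak_level(rising_delay)

print(f"peak with all tones aligned: {aligned_peak:.2f}")
print(f"peak with tones smeared:     {smeared_peak:.2f}")
```

Every tone is just as loud in both cases. Only the peak of their sum drops, because the tones no longer reach their maximum at the same moment.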




Where a speaker goes wrong

If it moves, time must pass. A woofer, a midrange driver, a tweeter -- each delays some of the sounds it can produce. This comes from two different factors: Each has a mass to move and has a suspension to let it move. Those two alone create natural time delays in getting any cone or dome to move, to change direction, to stop. They form the natural characteristics of any vibrating system. The time delays in speakers can be thought of as the woofer, midrange, and tweeter apparently lying farther away from you, as shown in the image here.


Mass as a culprit. When anything has mass, then time must pass while it gets moving, changes direction, and comes to a stop.

A Cadillac can never handle like a Corvette. So, the heavier that cone or dome, the less it responds to the high notes' rapid oscillations.

A heavy woofer cone does not put out highs very loudly simply because they are telling that cone to change direction before it even got fully going. This also means that as we go up the tone scale in that woofer, the higher tones come out later and later, because they got started later. Thus, the presence of mass rolls off the very highest tone range of a woofer or a tweeter, while simultaneously delaying that range more and more.


This is the same as saying the woofer's upper tone range moves away from us as we go up the scale. How far back? A foot, 30 centimeters, in the upper voice range if you could somehow 'see' where the woofer cone was 'acoustically' located.

How much in time? About one millisecond. The same situation occurs for the tweeter. Since it weighs only 0.5 grams versus the 50 grams of a large woofer, that makes its ultra-highs come out late, up in the 'sparkle' range of the treble, by part of a millisecond, by an inch or so (a few centimeters) back.
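The link between 'how far back' and 'how much in time' is simply the speed of sound. A minimal sketch, using the 340 meters per second quoted earlier and helper names invented for this example:

```python
SPEED_OF_SOUND_M_PER_S = 340.0  # the figure used earlier in this article

def delay_to_distance_cm(delay_ms):
    """Apparent extra distance, in centimeters, for a given time delay."""
    return SPEED_OF_SOUND_M_PER_S * (delay_ms / 1000.0) * 100.0

def distance_to_delay_ms(distance_cm):
    """Time delay, in milliseconds, equivalent to a given extra distance."""
    return (distance_cm / 100.0) / SPEED_OF_SOUND_M_PER_S * 1000.0

woofer_delay_ms = distance_to_delay_ms(30.0)   # one foot back
tweeter_delay_ms = distance_to_delay_ms(2.54)  # one inch back

print(f"30 cm back is about {woofer_delay_ms:.2f} ms late")
print(f"2.54 cm back is about {tweeter_delay_ms:.3f} ms late")
print(f"1 ms late is about {delay_to_distance_cm(1.0):.0f} cm back")
```

So a foot is a little under one millisecond, and an inch is well under a tenth of a millisecond.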

What is the lesson? When mass is reduced, the onset of time delays is pushed higher and higher up the scale. Zero mass would be one factor in achieving truly perfect reproduction.




Suspension as a culprit. A suspension allows the cone or dome to move, while keeping it centered as it strokes. The 'springiness' of that suspension wants to pull back on the cone or dome, trying to restore the cone/dome to its initial resting position.


Thus, we have a mass bouncing on a spring, and that interaction is an exchange of energy, between kinetic and potential, and such an exchange takes time. That suspension delays the motion more as we go down the scale, whether this is a woofer or a tweeter, because the suspension resists more and more being stretched.

How much time delay? Many milliseconds for the woofer. Part of a millisecond for a tweeter. How far back does that sound? Acoustically, several feet (a meter or more) back from 'where the woofer was' in its lower-voice range. Several inches (centimeters) for the 'bass range' of the tweeter (high-voice) compared to its middle-treble range.

Perfection would be zero time delay in their respective low-frequency ranges, and that can only come from the woofer or tweeter, no matter their mass, having completely limp suspensions, perhaps made from tissue paper, with each also mounted in extremely large enclosures. But then, there would be nothing to keep the woofer cone (or the tweeter's dome) centered as it strokes and eventually it will jam. No tunes.

Mass is not always 'Mass'. When any cone or dome moves more and more rapidly (goes higher and higher up the scale), eventually both will go into some form of 'break up' at a particular frequency. At first the cone/dome rings, which means different masses (plural) are moving in different directions. Eventually, at just the right frequency, the outer part of the woofer's cone begins to stand still and so does the center portion of the tweeter's dome. Since they are standing still, there is less mass to move and the sound gets louder, possibly much louder if there is little damping in the cone or dome.

Besides distortion, ringing, and loudness changes, there are time delays that creep in as that cone-breakup tone range is approached, caused by the cone/dome starting to flex (as a suspension would flex).


Even if the breakup frequency occurs outside the operating range of the particular driver, such as in the treble for a midrange driver, it does affect sounds in the operating range, because it is triggered by them. So, when a breakup occurs, the actual cone/dome material must be internally well-damped so the cone or dome does not ring on and on.

Some designers electrically filter out the peak in loudness that a breakup creates. Such a filter in the speaker's crossover circuit is effective only on test tones, because the transients of music and voices in the lower ranges still excite that higher-frequency mechanical resonance. If you hit your car with a baseball bat, you hear the whole car respond with a 'thunk.' But you would then hear perhaps the dashboard rattle on, even if it was rubber mounted (a 'mechanical crossover'). After the dash was excited all at once by the whole car shuddering, it would ring on at its own resonant frequency.




Damping. When the suspension flexes, it has an internal friction. This damps the tendency of the cone or dome to keep on moving at its low-frequency resonance, such as in the low bass for a woofer. Any damping of that normally free-and-easy motion brings the exchange of energy between woofer-cone and suspension to a stop. Damping is created by friction, which turns the energy of motion into heat. Shock absorbers for your car add damping. Without them, you hit your head on the roof. Without proper damping for a woofer, the bass is 'boomy' and hangs on 'too long.' Guess what is wrong with all the car stereos that bother us?


When you combine a mass with a spring and damping, you have a 'Damped Harmonic Oscillator.' The simple math for this moving system is found in any first-year university Physics textbook. Automotive engineers use it to calculate how stiff the springs for the car should be, the size of the shocks, and how heavy the tires and wheels should be.
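As a minimal sketch of that textbook math, the three numbers below summarize a mass on a spring with damping. The 50-gram mass echoes the woofer cone mentioned earlier. The stiffness and damping values are hypothetical, picked only to give round results, and a real driver in its enclosure involves more than this simple model.

```python
import math

def oscillator_summary(mass_kg, stiffness_n_per_m, damping_n_s_per_m):
    """Resonant frequency (Hz), quality factor Q, and damping ratio for
    m*x'' + c*x' + k*x = F(t)."""
    natural_rad_per_s = math.sqrt(stiffness_n_per_m / mass_kg)
    resonance_hz = natural_rad_per_s / (2 * math.pi)
    q_factor = math.sqrt(stiffness_n_per_m * mass_kg) / damping_n_s_per_m
    damping_ratio = 1 / (2 * q_factor)
    return resonance_hz, q_factor, damping_ratio

# Hypothetical woofer: 50 g moving mass, assumed stiffness and damping.
resonance_hz, q_factor, damping_ratio = oscillator_summary(
    mass_kg=0.050, stiffness_n_per_m=2000.0, damping_n_s_per_m=5.0
)

print(f"resonance: {resonance_hz:.1f} Hz")
print(f"Q: {q_factor:.2f}, damping ratio: {damping_ratio:.2f}")
```

A Q above 0.5 means the system is underdamped: it overshoots and oscillates before settling. Adding friction lowers Q, and a heavier cone or softer suspension lowers the resonance.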

It allows aerospace engineers to calculate how much a wing will flex, the stress it feels, and how much of either is acceptable. One could adapt it to predict the response of a population rebounding from disease, disaster, or inflation.


Send in the signal. On the electrical side, a speaker's crossover circuit can introduce time delays, with the tweeter's signal getting through before the woofer's signal, and by a different amount at each frequency. The circuit does not have to do that, but most do because of the quantity and the interaction of the capacitors and inductors used in it.

There is only one crossover circuit that injects no timing differences between the signals it sends to the woofer and tweeter: A 'first-order' crossover. The simplest version consists only of one inductor on the way to the woofer and a capacitor on the way to the tweeter. The inductor keeps the highs out of the woofer while the capacitor prevents bass tones from entering the tweeter. This is the circuit we use.
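One way to check this numerically is to add the two branches of an ideal first-order crossover back together and look at what comes out. The sketch below assumes ideal parts, purely resistive drivers, and a hypothetical 2 kHz crossover point. It looks only at the electrical sum and ignores the drivers' own mechanical behavior.

```python
import cmath
import math

def first_order_crossover(freq_hz, crossover_hz):
    """Ideal first-order low-pass and high-pass transfer functions."""
    s = 1j * freq_hz / crossover_hz  # frequency normalized to the crossover point
    low_pass = 1 / (1 + s)           # series inductor feeding the woofer
    high_pass = s / (1 + s)          # series capacitor feeding the tweeter
    return low_pass, high_pass

CROSSOVER_HZ = 2000.0  # hypothetical crossover frequency

results = []
for freq_hz in (40.0, 200.0, 2000.0, 8000.0, 20000.0):
    low_pass, high_pass = first_order_crossover(freq_hz, CROSSOVER_HZ)
    total = low_pass + high_pass
    magnitude = abs(total)
    phase_deg = math.degrees(cmath.phase(total))
    results.append((freq_hz, magnitude, phase_deg))
    print(f"{freq_hz:7.0f} Hz: sum magnitude {magnitude:.6f}, sum phase {phase_deg:.6f} deg")
```

Under these assumptions the sum has a magnitude of one and a phase of zero at every frequency, which is another way of saying the recombined signal has not been reshaped in time.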


The math for calculating the time delays produced by crossover circuits is complex (when one has not done it before). Anyone who would learn more about how the delays are produced, by how much, at what frequencies, and what delays sound like, should read my article published in Audio Ideas Guide in 1997: "Loudspeaker Phase Accuracy and Musical Timing."

What about speakers that are only phase-coherent? Phase coherence means simply that the twin peaks and valleys of the same pure test tone coming from two drivers (such as from a woofer and tweeter) line up at your ear. Make those two waves also start and stop at the same time and you have a speaker that is both phase and time coherent. While the term 'phase coherence' is used to market speakers, a phase-coherent speaker is seldom a time-coherent speaker. Time-coherent speakers are automatically phase-coherent.




Phase shift. As mentioned earlier, time delay is often called phase shift. If so, then it is measured in degrees, such as marked on a compass or a circle, because the 'phase' was shifted by X degrees of some full 360-degree cycle of a wave. If that wave was at 40Hz, a very low-bass tone, and delayed by 4 milliseconds (0.004 seconds), that is 16 percent of that 1/40th-second cycle, or 16 percent of the complete, 360-degree in-and-out cycle, or 57.6 degrees of phase shift. Since here it is a time delay, then this would be called a phase 'lag.' If, instead, it arrived too soon, that would be a phase 'lead.'
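The conversion in that example is just the delay expressed as a fraction of one cycle, times 360. A minimal sketch, with function names invented for this example:

```python
def delay_to_phase_degrees(freq_hz, delay_s):
    """Phase lag, in degrees, produced by a time delay at one frequency."""
    return 360.0 * freq_hz * delay_s

def phase_degrees_to_delay_ms(freq_hz, phase_deg):
    """Time delay, in milliseconds, that corresponds to a phase lag at one frequency."""
    return phase_deg / 360.0 / freq_hz * 1000.0

bass_phase_deg = delay_to_phase_degrees(40.0, 0.004)      # the 40 Hz example above
treble_phase_deg = delay_to_phase_degrees(1000.0, 0.004)  # same delay at 1 kHz

print(f"4 ms at 40 Hz is {bass_phase_deg:.1f} degrees")
print(f"4 ms at 1 kHz is {treble_phase_deg:.0f} degrees")
```

The same delay is a different number of degrees at every frequency, which is why a phase figure means little unless the frequency is stated with it.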

A perfect speaker would have zero degrees of phase shift at every frequency. Most speakers vary by at least 720 degrees across the 40Hz to 20kHz range. Ours produce no more than +/- 2 degrees of phase shift across the all-important middle-range from 200Hz to 8kHz. Down at 40Hz and up past 20kHz, our phase shift creeps up to about 90 degrees. Because we keep the main tone range free of time delays (phase shifts), ours are called 'time-coherent' speakers, and another term used is 'a minimum-phase speaker design.'


Polarity. If a microphone diaphragm moves in when the drum head pushes towards it, the microphone has correct polarity. If your speaker's woofer moves towards you when the drum head pushes outwards, then that woofer and your system, and the recording, and the microphone, and the mixer all have 'correct polarity.'

If the woofer sucks in instead, the polarity has been 'inverted' somewhere in the chain, perhaps at the microphone input on the mixer. You would then wire your speakers in reverse polarity, with the '+' wire going to the '-' terminal, and vice versa. Inverted polarity is often mistakenly called '180 degrees of phase shift.'

Many loudspeakers invert the polarity of just the tweeter, to 'fix' what is really 180 degrees of phase delay, relative to the woofer, which was likely caused by the crossover circuit inside that speaker. The resulting sound is different with the tweeter's wiring flip-flop, not necessarily better. Our speakers maintain correct polarity at all frequencies. Many speakers do not.


Accuracy in time is as important as accuracy in loudness

When the loudness is wrong for different tones, we can turn up the bass or treble. When the time-domain performance of the speaker is wrong, this cannot be fixed.

When the time for different tones to arrive is wrong, we hear smeared transients, less nuance, less artistry. Many existing distortions, in the recording or the speaker, are also greatly amplified (you do not want to see that math).

When the time for different tones to go away is wrong, we hear less sustain on a note, less of the concert hall, less emotion, and confusion in complex soundtracks.

When the time for different tones combining is wrong, we hear the wrong timbres -- a string bass sounds like an electric one, a Martin guitar sounds like a Yamaha, a Steinway like a Baldwin, and trumpets hurt our ears.

The goal should be to enjoy every record and soundtrack, and that is only possible when a speaker is accurate in the time domain. Music and sound change in time, and it is those changes we seek to understand and enjoy.

Just listen, and you will know when the speaker gets the timing right.