I have read many of the discussions and seen the ample measured evidence of how comb filtering alters the magnitude response of speakers. Comb filtering is presumed to be bad. However, in my experience I have rarely had anyone prefer the sound of a mono speaker over a spaced pair of speakers, even when playing mono material. I admit that this is purely anecdotal and that I have not performed a controlled experiment.
As always, it depends on what you're trying to achieve. I totally agree with you within the confines of my living room, where the Haas or precedence effect doesn't prevent me from experiencing stereo. But in large-scale sound reinforcement with subdivided speaker configurations, comb filtering compromises the ability to deliver the sound heard at the mix position to the audience.
Whether comb filtering is audible and in what fashion depends on relative level and timing respectively. Let's explore the latter first.
For the sake of clarity I will leave out the superimposing effects of the Head Related Transfer Function, caused by the pinnae or outer ear, and the inter-aural level and time differences as was rightly pointed out by John Roberts.
Let's look at a 3 millisecond comb filter (approx. 3 feet or 1 meter of difference in path length), which falls well within the vocal range, caused by two sources at equal level. The first 6 peaks have a width ranging from 1 to 1/6 octave. The first 6 dips have a width ranging from 1 to 1/9 octave. Both are caused by the second source arriving late by 1 to 6 wavelengths, depending on frequency. These are generally considered the most audible, or tonal, effects.
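For anyone who wants to see where those peaks and dips land, here's a minimal Python sketch. It assumes the textbook case of two identical signals at equal level with a fixed delay, so peaks fall where the late arrival is a whole number of cycles behind and dips where it is an odd number of half cycles behind. The function name `comb_frequencies` is just my own label for illustration.

```python
def comb_frequencies(delay_s, count=6):
    """Return (peaks_hz, nulls_hz) for two equal-level sources offset by delay_s."""
    # Peaks: the late signal is a whole number of cycles behind (n = 1, 2, ...).
    peaks = [n / delay_s for n in range(1, count + 1)]
    # Nulls: the late signal is an odd number of half cycles behind.
    nulls = [(n - 0.5) / delay_s for n in range(1, count + 1)]
    return peaks, nulls


# The 3 ms example from the text.
peaks, nulls = comb_frequencies(0.003)
for n, (null_hz, peak_hz) in enumerate(zip(nulls, peaks), start=1):
    print(f"null {n}: {null_hz:7.1f} Hz   peak {n}: {peak_hz:7.1f} Hz")
```

For 3 ms the first dip sits around 167 Hz and the first peak around 333 Hz, with the sixth peak at about 2 kHz, which is why this delay lands squarely in the vocal range.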
An octave-wide hole sitting next to an octave-wide peak, followed by a 1/3-octave-wide hole and a 1/2-octave-wide peak, makes for very strong tonal contrasts, affecting roughly 30% of the audible spectrum. One big tree standing out on the horizon affects the contour of the landscape far more than dozens of little trees do.
Beyond 6 wavelengths of delay the threshold of critical bandwidth (1/6 octave) is crossed, where most people are no longer able to reliably detect tonal changes caused by the narrow peaks and dips. This is the area of confusion where the sound becomes spacious and terms like "width" and "size" come to mind.
Past 24 wavelengths of delay there's the potential to experience echoes, depending on the state of the signals. In a steady-state situation this won't be audible because there's no interruption or pause in the signal in which to hear the echo. But with percussive signals, for example, this is likely to become very audible.
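In other words, because these thresholds are counted in wavelengths and a given delay spans more wavelengths as frequency rises, the high frequencies will "snap" apart and reveal themselves, in terms of source (either speaker or reflection) localization, much sooner than the low frequencies, which are inherently more "elastic".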
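To put numbers on that, here's a rough Python sketch that converts the 6 and 24 wavelength thresholds mentioned above into frequencies for a given delay. The zone labels and function names are my own shorthand, and the 30 ms delay is just a hypothetical slap-back value for comparison.

```python
TONAL_LIMIT = 6   # wavelengths late; threshold taken from the text above
ECHO_LIMIT = 24   # wavelengths late; threshold taken from the text above


def frequency_at(wavelengths_late, delay_s):
    """Frequency (Hz) at which a fixed delay equals the given number of wavelengths."""
    return wavelengths_late / delay_s


def perceptual_zone(freq_hz, delay_s):
    """Classify a frequency by how many wavelengths late the second arrival is."""
    wavelengths_late = freq_hz * delay_s
    if wavelengths_late <= TONAL_LIMIT:
        return "tonal"
    if wavelengths_late <= ECHO_LIMIT:
        return "spacious"
    return "potential echo"


for delay_ms in (3, 30):  # 30 ms is a hypothetical slap-back delay
    delay_s = delay_ms / 1000
    tonal_hz = frequency_at(TONAL_LIMIT, delay_s)
    echo_hz = frequency_at(ECHO_LIMIT, delay_s)
    print(f"{delay_ms} ms: tonal below ~{tonal_hz:.0f} Hz, "
          f"echo potential above ~{echo_hz:.0f} Hz")
```

With 3 ms the tonal range runs up to about 2 kHz and echo potential starts around 8 kHz. Stretch the delay to 30 ms and those same thresholds drop to about 200 Hz and 800 Hz, so most of the spectrum is already past the tonal range while the lows are still "elastic".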
For example, slap-back echo on stage from a balcony face (provided it was not prevented by speaker orientation or splay angles) reveals itself first in the high-frequency part of the spectrum, even though a sizable balcony face might very well be able to reflect mid and arguably even low frequencies.
Whether all these temporal phenomena are audible depends on relative level. When the level difference exceeds 10 dB the ripple (difference between the peaks and dips of the comb filter) is reduced to 6 dB or less and you're in the isolation zone where timing becomes less important.
From approx. 4 to 10 dB of level difference you're in the combination zone, with ripple going from roughly 12 dB down to 6 dB. This is the transition area from the isolation zone into the combing zone.
With relative level differences of less than 4 dB there's ripple in excess of 12 dB. This is referred to as the combing zone, where the effects of misalignment are most audible and timing is vital. After all, summation is at its maximum when aligned, but so is cancellation when misaligned.
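The relationship between level difference and ripple is easy to check yourself. Here's a minimal Python sketch, assuming two otherwise identical signals that sum coherently, so the peak is 20·log10(1 + a) and the dip is 20·log10(1 − a), with a being the linear amplitude of the weaker source. The zone boundaries (4 and 10 dB) come from the text above; the function names are my own.

```python
import math


def ripple_db(level_difference_db):
    """Peak-to-dip ripple (dB) when two copies of a signal sum with a level offset."""
    a = 10 ** (-abs(level_difference_db) / 20)  # linear amplitude of weaker source
    if a == 1.0:
        return math.inf  # equal level: complete cancellation, in theory only
    peak = 20 * math.log10(1 + a)
    dip = 20 * math.log10(1 - a)
    return peak - dip


def zone(level_difference_db):
    """Zone names and the 4 dB / 10 dB boundaries as described in the text."""
    d = abs(level_difference_db)
    if d < 4:
        return "combing"
    if d <= 10:
        return "combination"
    return "isolation"


for d in (0, 1, 4, 6, 10, 12, 20):
    print(f"{d:2d} dB difference -> ripple {ripple_db(d):5.1f} dB ({zone(d)} zone)")
```

This gives a little under 13 dB of ripple at 4 dB of level difference and a little under 6 dB at 10 dB, which matches the rounded 12 and 6 dB figures above.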
These progressions are the driving force behind, for example, a properly designed coupled point source configuration, where directivity and splay (level) versus physical displacement (time) are carefully balanced in an attempt to reduce the area of confusion, the combination zone, to as little as 1 octave, roughly 1/10th of the audible spectrum.
So no, adjacent speakers don't necessarily imply comb filtering per se, when properly designed.
Comb filtering is, however, a degradation of the signal and will reduce signal-to-noise ratio, ergo speech intelligibility or clarity, as was rightly pointed out by some. At each frequency of the spectrum where the direct sound has been canceled out (complete cancellation is virtually impossible in real life) by a reflection or signal from another speaker, all that remains is noise (HVAC, audience, reverb, etc.).
Again, the number of reflections and/or signals from other speakers and their timing will determine the frequency extent of the comb filtering, while relative level determines its severity and thus the signal-to-noise ratio.
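As a rough illustration of the SNR argument, here's a Python sketch of the summed level of the direct sound plus one late, attenuated copy at a single frequency. It assumes an idealized two-path model (no HRTF, no additional reflections), and the 25 dB direct-sound SNR and 1 dB level difference are purely hypothetical numbers for the example.

```python
import cmath
import math


def summed_level_db(freq_hz, delay_s, level_difference_db):
    """Level (dB relative to the direct sound alone) of direct plus one late copy."""
    a = 10 ** (-abs(level_difference_db) / 20)  # linear amplitude of the late copy
    total = 1 + a * cmath.exp(-2j * math.pi * freq_hz * delay_s)
    magnitude = abs(total)
    return 20 * math.log10(magnitude) if magnitude > 0 else -math.inf


delay_s = 0.003                    # the 3 ms example from earlier
null_hz = 0.5 / delay_s            # first dip
peak_hz = 1.0 / delay_s            # first peak
direct_snr_db = 25.0               # hypothetical SNR of the direct sound alone

for label, f in (("first dip", null_hz), ("first peak", peak_hz)):
    change = summed_level_db(f, delay_s, level_difference_db=1.0)
    print(f"{label} at {f:.0f} Hz: level change {change:+.1f} dB, "
          f"SNR about {direct_snr_db + change:.1f} dB")
```

The noise floor doesn't comb along with the signal, so whatever the dip takes away from the signal comes straight out of the SNR at that frequency.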
To summarize, if you want to deliver the sound at the mixing position to the audience with the fewest artifacts and enable the majority of the audience to share the same experience, controlled isolation by design, not overlap, will minimize comb filtering, tonal coloration and echoes, and maximize SNR, clarity and speech intelligibility. And this is just scratching the surface...
Pick your poison...
If you're interested in exploring all possible combinations of comb filtering between two sources (level, time and ripple), you can download my phase calculator for free at my website
https://www.merlijnvanveen.nl.
EDIT: In order to avoid misunderstandings, I feel it's fair to point out that pretty much everything above is AFAIK credited to Bob "6o6" McCarthy and his book "Sound Systems: Design and Optimization". Please correct me if I'm wrong.