Attention facilitates communication by enabling selective listening to sound sources of interest. However, little is known about why attentional selection succeeds in some conditions but fails in others. While neurophysiology implicates multiplicative feature gains in selective attention, it is unclear whether such gains can explain real-world attention-driven behavior. To investigate these issues, we optimized an artificial neural network with stimulus-computable, feature-based gains to recognize a cued talker's speech from binaural audio in "cocktail party" scenarios. Though not trained to mimic humans, the model matched human performance across diverse real-world conditions, exhibiting selection based on both voice qualities and spatial location. It also predicted novel attentional effects that we confirmed in human experiments, and exhibited signatures of "late selection" like those seen in human auditory cortex. The results suggest that human-like attentional strategies naturally arise from optimization of feature gains for selective listening, offering a normative account of the mechanisms - and limitations - of auditory attention.
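The core mechanism named above, multiplicative feature gains, can be sketched minimally. The snippet below is an illustrative toy, not the paper's model: the channel layout, gain values, and function name are assumptions made for exposition.

```python
import numpy as np

def apply_feature_gains(features, gains):
    """Scale each feature channel by an attentional gain.

    features: array of shape (channels, time), e.g. a cochleagram of a mixture.
    gains: array of shape (channels,), cue-dependent multiplicative gains.
    """
    return gains[:, None] * features

# Toy scenario: two "talkers" occupy disjoint frequency channels.
rng = np.random.default_rng(0)
talker_a = np.zeros((4, 10)); talker_a[:2] = rng.random((2, 10))
talker_b = np.zeros((4, 10)); talker_b[2:] = rng.random((2, 10))
mixture = talker_a + talker_b

# Gains that favor talker A's channels pass them through unchanged
# while attenuating talker B's channels tenfold.
gains = np.array([1.0, 1.0, 0.1, 0.1])
attended = apply_feature_gains(mixture, gains)
```

In the actual model described in the abstract, such gains would be computed from the stimulus and the cue rather than hand-set, and applied to learned features rather than raw frequency channels.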