Attention facilitates communication by enabling selective listening to sound sources of interest. However, little is known about why attentional selection succeeds in some conditions but fails in others. While neurophysiology implicates multiplicative feature gains in selective attention, it is unclear whether such gains can explain real-world attention-driven behaviour. Here we optimized an artificial neural network with stimulus-computable feature gains to recognize a cued talker's speech from binaural audio in 'cocktail party' scenarios. Though not trained to mimic humans, the model produced human-like performance across diverse real-world conditions, exhibiting selection based both on voice qualities and on spatial location, as well as selection failures in conditions where humans tended to fail. It also predicted novel attentional effects that we confirmed in human experiments, and exhibited signatures of 'late selection' like those seen in human auditory cortex. The results suggest that human-like attentional strategies naturally arise from the optimization of feature gains for selective listening.
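
To make the notion of a multiplicative, stimulus-computable feature gain concrete, the following is a minimal illustrative sketch and not the model described here. It assumes that each feature channel's gain is a sigmoid of that channel's time-averaged response to the cue, and that the gain then scales the same channel's response to the mixture. The function names, the sigmoid form, and the `slope` and `threshold` parameters are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def feature_gains(cue_features, slope=1.0, threshold=0.0):
    """Compute one gain per feature channel from the cue alone.

    cue_features: list of channels, each a list of activations over time.
    Assumption (illustrative): gain = sigmoid(slope * (mean - threshold)).
    """
    gains = []
    for channel in cue_features:
        mean_activation = sum(channel) / len(channel)
        gains.append(sigmoid(slope * (mean_activation - threshold)))
    return gains


def apply_gains(mixture_features, gains):
    """Multiply every time step of each mixture channel by its gain."""
    return [
        [gain * activation for activation in channel]
        for channel, gain in zip(mixture_features, gains)
    ]


# Toy example: channel 0 responds strongly to the cued talker, channel 1 does not.
cue = [[2.0, 2.0], [-2.0, -2.0]]
mixture = [[1.0, 1.0], [1.0, 1.0]]

gains = feature_gains(cue, slope=2.0)
attended = apply_gains(mixture, gains)
print(gains)
print(attended)
```

In this toy setting the channel driven by the cue is passed through almost unchanged while the other channel is attenuated, which is the sense in which a gain computed from the cue can select features of the target talker within the mixture.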