In contemporary machine learning, realistic models exhibit increasing nonconvexity and overwhelming overparameterization. This nonconvexity often leads to numerous undesirable or “spurious” local solutions, while overparameterization exacerbates the risk of overfitting. Yet, simple “short-sighted” algorithms, such as gradient descent (GD) and its variants, often find the needle in the haystack: they converge to the correct,...