Large language models like ChatGPT and other forms of artificial intelligence are quickly reshaping the world. But in many cases, these systems impose a viewpoint based on the biases and assumptions found in the material—often reams and reams of internet data—used to train them. There are also significant concerns about the privacy implications of training these models on vast amounts of personal data, as well as questions about their robustness and reliability when applied to high-stakes real-world scenarios.
At the University of Wisconsin-Madison, engineering undergraduates are learning how to remedy these issues in ECE/ISYE 570: Ethics for Data Engineers, the culminating course for the undergraduate certificate in engineering data analytics. The course is designed and led by Kangwook Lee, an assistant professor of electrical and computer engineering with a background in large language models and trustworthy machine learning.
For the type of ethics considered in this course, students don’t contemplate Plato or debate philosophy. Instead, they delve into equations and code to learn how engineers can ensure the ethical use of data for machine learning and artificial intelligence. “In most of the courses on machine learning and data science, students learn how to make the programs more effective and efficient,” says Lee. “However, those traditional courses do not cover the other important aspects deeply enough.”
His course confronts those aspects of machine learning. “Besides accuracy, engineers should pay attention to whether the models are biased, whether they compromise data privacy, and whether they are going to behave reliably as designed at test time,” he says.
While the class is heavily theoretical in the first two-thirds of the semester, the final third shifts to hands-on, project-based work, allowing students to apply the ethical principles they have learned to practical data engineering challenges.
Individually, students are expected to create their own large language model (LLM) and implement algorithms to improve fairness, data privacy, and reliability. They may develop algorithms that remedy these issues at the level of data, training algorithms, or models and see how their choices affect the way the LLMs work.
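The article doesn’t specify which fairness criteria students implement; as one illustrative sketch, a common audit checks demographic parity—whether a model’s positive-prediction rate differs across groups. All names and data below are hypothetical, not drawn from the course.

```python
# Hypothetical sketch: demographic parity, one fairness criterion an
# engineer might check before deploying a model. The data is made up.

def demographic_parity_gap(predictions, groups):
    """Return the largest difference in positive-prediction rate
    between any two groups (0.0 means perfectly balanced)."""
    rates = {}
    for pred, group in zip(predictions, groups):
        n, pos = rates.get(group, (0, 0))
        rates[group] = (n + 1, pos + (1 if pred == 1 else 0))
    positive_rates = [pos / n for n, pos in rates.values()]
    return max(positive_rates) - min(positive_rates)

# Toy example: group "a" receives positive predictions 75% of the
# time, group "b" only 25% -- a 0.5 gap an engineer would flag.
preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)
```

A data-level remedy such as rebalancing the training set, or a training-level one such as adding a fairness penalty to the loss, would then aim to shrink this gap without destroying accuracy—the kind of trade-off the projects explore.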
The AI and machine learning spaces are developing rapidly, and Lee says the point of the class is valuable, even if the details change over time. “Probably the technical details of what they learn here might be less relevant in 10 years,” he says. “But the fact that they have to keep paying attention to different dimensions of machine learning to ensure ethical use of data—that’s going to hold true indefinitely.”
Lee says his main message to students is to face the challenge head-on. Currently, he sees many companies taking stopgap measures to fix issues with fairness. Instead, he thinks they should take the time and energy to build systems that are not broken from the outset. “Solve these problems responsibly. Don’t just put wrappers on them or put ‘lipstick on a pig’ to make the problems look fixed,” he says. “If the fundamental problem still exists, it will materialize in different scenarios.”
Featured image caption: Assistant Professor Kangwook Lee (center) and students discuss the day’s lecture in the new Ethics for Data Engineers class. Credit: Jason Daley.