You might expect us to begin this article with a definition of data ethics. Instead, however, let’s start by examining the two words on their own.
Data has been around since humans first learned how to record information — though its storage medium has evolved considerably in its shift from stone engravings to solid-state drives. The “big” in big data is a product of humanity’s step into the digital realm, where the power of bulk silicon lets engineers and data scientists record and analyze every bit of data a person sheds in their digital travels. With so much data at our fingertips, we rely on a thin set of cultural norms to help us decide what is right or wrong — which, of course, brings us to the word ethics.
There are many definitions of ethics. Most contain words like moral, right, wrong, principles, rules, or — controversially — rights and obligations. Perhaps the shortest definition of ethics describes it in terms of actions affecting people. Data ethics is no different.
Data ethics refers to the values, conventions, and challenges inherent to collecting, storing, and analyzing data, as well as how these actions impact both the people whose data is collected and those around them.
In this blog, we’ll take a look at the general context around data ethics. We’ll then unpack some prevalent misconceptions about data and examine the sometimes-subtle ways their effects propagate through data systems and society. Finally, we’ll incorporate some insight from Diana Pfeil’s appearance on the Data Brew podcast and look at a few ways data scientists can uphold strong data ethics.
Why is Data Ethics Important?
Let’s consider the impact data has on an individual. It can take less than ten seconds to take a photo of a stranger and send it to a friend. We may debate whether the photo alone is an invasion of this person’s privacy. Views on the ethical dimension of public spaces are subject to a wider cultural discussion, so we could decide that our photo alone is acceptable.
But this debate is decisively settled when we neglect to remove the metadata from our photo. Among other identifiers, location data is a particularly personal component of someone’s digital footprint. Its impact, however serious, is still limited to that individual and their direct network, and constrained by slow, serial, human communication.
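For illustration, removing that metadata can be as simple as dropping the EXIF segment from the photo file before sharing it. The sketch below is a minimal, standard-library-only pass over a JPEG’s marker segments; real tools such as exiftool handle many more cases and formats:

```python
def strip_exif(jpeg: bytes) -> bytes:
    """Drop APP1 (EXIF) segments -- GPS coordinates, timestamps,
    camera IDs -- from a JPEG byte stream. A simplified sketch: it
    assumes a well-formed file and ignores rarities like padding."""
    assert jpeg[:2] == b"\xff\xd8", "not a JPEG"
    out = bytearray(b"\xff\xd8")
    i = 2
    while i + 1 < len(jpeg):
        marker = jpeg[i + 1]
        if marker == 0xDA:          # Start of Scan: image data follows,
            out += jpeg[i:]         # so copy the remainder verbatim.
            break
        if marker == 0xD9:          # End of Image (no length field)
            out += jpeg[i:i + 2]
            break
        length = int.from_bytes(jpeg[i + 2:i + 4], "big")
        segment = jpeg[i:i + 2 + length]
        # Keep everything except APP1 segments carrying EXIF data.
        if not (marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"):
            out += segment
        i += 2 + length
    return bytes(out)
```

The image itself is untouched; only the sidecar metadata disappears, which is exactly the part a casual sharer never sees.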
When we consider the impact of data on a much larger scale, like social media, instant messaging applications, and networking platforms, the risk of having personal, confidential data shared without consent grows significantly. And, with that growth comes a massive increase in the number of individuals that the unethical distribution of data would impact.
According to Domo’s Data Never Sleeps 8.0 Report, any 15-second window in 2020 saw a staggering amount of data created, much of it on platforms whose user populations number in the billions. In that window:
- 36,750 photos were uploaded to Facebook.
- 10,416,667 WhatsApp messages were shared.
- 125 hours of video were uploaded to YouTube by users.
- 86,806 Instagram stories were posted.
- 17,361 LinkedIn jobs were applied for.
- 101,111 hours of video were streamed from Netflix.
- 119,863 engagements were made with Reddit content.
The pace of data generation is increasing at an exponential rate, especially as the Internet of Things (IoT) exits its infancy. Alexa is live and listening all the time. Smart energy knows every time we flick a switch and what time we cook our supper. The potential impact of unethical data use is staggering when we consider the number of people that can be affected in 15 seconds.
It raises a few questions. Who is the beneficiary of that mountain of data? Who owns it? Can we make corrections to our own data? And for that matter, would we ever be able to find our own data?
Data ethics asks how we can reduce the impact of our actions — manifested here as data — on people. It asks us if we can collect, store, and process data better, to minimize and mitigate that impact.
Misconceptions About Data
In her EmTech MIT presentation, Big Data Gets Personal, Kate Crawford highlights three misconceptions about data that raise ethical questions.
The first misconception is that data is objective. She gives an example of an analysis of tweets during and after Hurricane Sandy in Manhattan. Her team analyzed 25 million tweets in four days. It should have been easy to draw accurate conclusions about which were the worst-hit areas — yet the worst-hit areas were without power or had fewer cell phones, and so were missed in the analyses.
The second misconception is that data doesn’t discriminate. A study of Facebook users’ “likes” found that we can predict their ethnic background, religious beliefs, sexuality, and drug or alcohol use — in some categories, with 95 percent accuracy. Correlations like this can be made to infer protected attributes and discriminate based on them in an indirect way.
Crawford warns that much of this information is available to everyone without consent and that it can be used to discriminate in things like employer background checks. Although a smaller employer or landlord might find it more trouble than it’s worth to make these sorts of predictions about an individual, larger institutions certainly have the resources to analyze new hires or tenants and create a “solution” that discriminates without the knowledge of those being analyzed.
The third misconception is that anonymity exists. Our daily movements are quite unique. Collating just four points on our travels provides enough data to identify 95 percent of people.
Everything is personally identifiable information (PII). Privacy economist Alessandro Acquisti, for example, was able to infer private information from a facial image by combining public and nonpublic datasets.
Another fallacy is that big data always provides the best route to accurate predictions. Technology ethnographer Tricia Wang says the human factor in data causes data sets to become complex, unpredictable, and difficult to model. She observes that the committed belief that we can measure the unmeasurable gives rise to a quantification bias.
Mathematician and data scientist Cathy O’Neil poses the question, “What if algorithms are wrong?” This question is rooted in the fact that our algorithms need two things: prior data, and a measure of success. Past data already contains past bias, and our measures of success are subjective at best. This means that our algorithms merely automate the status quo — a humbling thought for any data scientist.
It’s a fact that the law is unable to keep pace with technology. So, if we’re unable to police our data practices, we need to rely on data ethics to guide our data principles.
Combating Bias with Data Ethics
A few years ago, people believed that machines would make unbiased and nondiscriminatory decisions. We now know that this isn’t the case. The concept of secure-by-design software answered the need for software security. In the same way, ethical-by-design data and algorithms will follow.
The challenge, of course, is that data is money and power — from which stems our mandate, as data scientists, to give the most accurate results we can. Diluting that result isn’t in our nature. So, we absolve ourselves of the responsibility to be ethical because it’s “not our data.”
Of course, this is a generalization, but disassociation from the ethical dimension of data handling is common in most organizations. Cameron Kerry shows that the concept of consent in the online arena is out of date. Acquisti suggests that data transparency can’t provide protection.
Does data ethics have an answer?
Diana Pfeil, a global leader in machine learning and optimization, offers several answers in an episode of the Data Brew podcast. As the head of research and development at digital rights technology company Pex, she favors a technical solution to the issue of data bias, which makes sense — we’re appealing to data scientists to drop bias from their models. The law will appeal to businesses once it catches up.
Pfeil offers several ethical imperatives and insights for eliminating bias in automated systems:
- Data scientists must not include protected attributes — or any proxies for protected attributes. If protected attributes are absolutely necessary, data scientists must ensure that they have a strong business justification.
- Model designers should add a fairness metric alongside the usual accuracy metric, which can be implemented robustly through adversarial debiasing. Rather than requiring a single model to serve two potentially competing objectives, adversarial debiasing uses two models: one optimizing for our requirements and a second that tests the first for bias. If bias is found, the first model’s parameter weighting is adjusted until it converges on the fairest solution.
- Transparency is a basic tenet of AI ethics. People should know the reason for a decision.
- We should build a carefully tuned positive feedback loop into our models, one that adjusts in response to wrong predictions and doubles as a reliable channel of communication and redress for the individuals whose data we use.
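The fairness-metric idea in the list above can be sketched in a few lines. The demographic parity gap below is one of several possible fairness metrics, and what counts as an acceptable gap is a judgment call; the function names and data here are purely illustrative:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between two
    groups, where `group` holds 0/1 membership in a (hypothetical)
    protected class. A gap near 0 means both groups receive positive
    outcomes at similar rates."""
    def rate(g):
        members = [p for p, m in zip(y_pred, group) if m == g]
        return sum(members) / max(1, len(members))
    return abs(rate(1) - rate(0))
```

Reporting this gap next to accuracy at evaluation time makes the fairness-versus-accuracy trade-off visible instead of implicit.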
We can add to Pfeil’s argument by noting that AI is iterative, with a complex feedback mechanism. AI practitioners must be cautious that their feedback loops, positive or negative, don’t make bias problems worse.
Elvan Aydemir offers an all-too-familiar example of a misbehaving positive feedback loop in AI: You buy an item that’s typically a one-off purchase, then get inundated with advertising for the same item. Algorithms meant to showcase products to those most interested in them can easily instead bombard individuals with irrelevant advertising.
But the consequences of misbehaving positive feedback loops can be much more severe than a flurry of unwanted ads. Aydemir demonstrates how the use of crime AI can contribute to much larger problems, such as over-policing. Less affluent areas are more policed, and so more arrests are made in these areas. Arrests are used as a statistical proxy for crime, so more arrests mean more police, and so on.
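A toy simulation makes this dynamic concrete. The districts, allocation rule, and amplification factor below are all hypothetical; the point is only that when arrests (which track patrol presence) are fed back into patrol allocation, an initial skew persists, and any superlinearity widens it even though true crime rates are identical:

```python
def police_feedback(alloc0=70.0, alloc1=30.0, years=10, amplification=1.1):
    """Two districts with identical true crime rates. Arrests scale
    with patrol presence (superlinearly when amplification > 1), and
    next year's patrols are allocated in proportion to arrests -- the
    proxy feeds back on itself. All numbers are hypothetical."""
    a0, a1 = alloc0, alloc1
    for _ in range(years):
        arrests0 = a0 ** amplification   # arrests track patrols,
        arrests1 = a1 ** amplification   # not underlying crime
        total = a0 + a1
        a0 = total * arrests0 / (arrests0 + arrests1)
        a1 = total - a0
    return a0, a1
```

With amplification at exactly 1.0 the initial 70/30 skew simply persists forever; with even slight superlinearity it compounds year over year, despite both districts committing crime at the same rate.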
Pfeil suggests that similar questions exist around privacy and fairness. We can expect a shift in the legislative processes for algorithms. It’s already beginning in regulated sectors, and it’ll likely spread to the rest of the economy with the rising tide of digitization.
The law represents a practical moral minimum; ethics approximates something closer to the moral maximum. On issues of fairness, the law is far behind seeing, let alone addressing, advancements in machine learning, and it may never reach parity. But at some point, the law will reach where we stand now. We already see the impact of privacy legislation on organizations around the globe. According to Pfeil, Europe has already started looking at regulating algorithms and outcomes.
Challenges to Fair AI
Pfeil also expands on how data ethics addresses some of the challenges facing the development of fair AI today. One of these challenges is encoding. She details an example from the credit sector, where income is a valid tool for determining the viability of credit — yet income encodes so much of a society’s bias. On an individual credit application, income level is imperative. At scale, on the level of populations, it’s not a fair attribute.
AI’s ability to scale is a challenge facing data scientists. Asking ethical questions at all stages of the machine learning pipeline will help to ensure fair algorithms.
Pfeil also aims to solve the even deeper problem of explainability: modern models have ballooned beyond the scale of human comprehension. Somewhat worryingly, the field of AI often answers inquiries about how a decision was reached with what amounts to a shrug and a reference to a black box. Pfeil says that explainability and transparency are basic tenets of AI ethics. She attributes the problem partly to AI’s youth as a field: it will improve over time, but for the moment, explaining certain decisions AI makes remains a challenge.
Building Ethical Data Solutions
Because it remains difficult to quantify the impact of our actions, even a team not consciously striving for an ethical solution can settle, unknowingly, for a biased outcome. We should therefore examine our data inputs carefully for bias. What can we do before we’re forced to do something?
As in everything else in business, we need a plan, and our diagnostic triage should begin with privacy. At the most basic level, we need to steer clear of the legal consequences of only loosely adhering to privacy regulations.
We can start by making sure that we analyze the attributes we use as inputs for our models, examining them carefully for bias. They should have a justifiable role in a business decision and not exist as proxies for protected attributes. We can minimize secondary effects of all kinds by simply ensuring that we only use attributes we deem critical to our business.
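One simple way to screen inputs is to measure each candidate feature’s correlation with the protected attribute. The feature names and the 0.6 threshold below are illustrative assumptions, and plain correlation only catches linear proxies, so treat this as a first filter rather than a guarantee:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def flag_proxies(features, protected, threshold=0.6):
    """Return feature names whose correlation with the protected
    attribute exceeds a (hypothetical) threshold -- candidates for
    removal or, at minimum, an explicit business justification."""
    return [name for name, vals in features.items()
            if abs(pearson(vals, protected)) >= threshold]
```

Features that survive the filter still need the business-justification test described above; statistical innocence is necessary, not sufficient.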
We can also employ adversarial methods to debias our models. This variation on the use of generative adversarial networks lets us maximize our chosen metric — in most cases, accuracy — while also maximizing the adversary model’s confusion and stifling its attempts to guess the protected attribute. This approach creates solutions that are not only fair, but also resilient against efforts to reintroduce bias by malicious parties.
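A minimal sketch of that adversarial idea follows, assuming a logistic predictor and a tiny two-parameter logistic adversary that sees only the predictor’s scores. A production implementation (for example, the AdversarialDebiasing class in IBM’s AIF360 toolkit) is considerably more involved; every name and hyperparameter here is an illustrative assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, z, lam=0.0, lr=0.05, steps=3000, seed=0):
    """Train a logistic predictor for y while an adversary tries to
    recover the protected attribute z from the predictor's scores.
    With lam > 0 the predictor also ascends the adversary's loss,
    squeezing z-related signal out of its scores."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # predictor weights
    a = b = 0.0                                  # adversary parameters
    n = len(y)
    for _ in range(steps):
        s = sigmoid(X @ w)                       # predictor scores
        zhat = sigmoid(a * s + b)                # adversary's guess at z
        g_pred = X.T @ (s - y) / n               # predictor BCE gradient
        g_adv_w = X.T @ ((zhat - z) * a * s * (1 - s)) / n
        w -= lr * (g_pred - lam * g_adv_w)       # descend own, ascend adversary
        a -= lr * float(np.mean((zhat - z) * s))  # adversary descends its BCE
        b -= lr * float(np.mean(zhat - z))
    return w
```

Because the adversary adapts its own parameters, the predictor cannot “win” by merely flipping signs; its only stable strategy is to make its scores genuinely uninformative about the protected attribute.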
Our diagnostic triage should then test our transparency and explainability: If we can’t explain an outcome, the model doesn’t stand up to ethical scrutiny.
Transparency could take one of a few forms, but a relatively straightforward and balanced solution is to help users understand the top five contributors to a decision. This gives us good insight into the biases that may affect us, without the burden of turning the general public into experts on linear regression and decision tree ensembles. And, if we’ve constructed our systems correctly, those biases should be correctable through a simple notification from the user stating that the predictions served to them were incorrect.
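For a linear model, this kind of explanation is cheap to produce: each input’s contribution to the score is just its weight times its value. The credit-scoring feature names and numbers below are hypothetical:

```python
def top_contributors(weights, features, k=5):
    """For a linear model, each input's contribution to the score is
    weight * value. Return the k largest by magnitude."""
    contribs = {name: weights[name] * value
                for name, value in features.items()}
    return sorted(contribs.items(), key=lambda kv: -abs(kv[1]))[:k]

# Hypothetical credit-scoring weights and one applicant's inputs.
weights = {"income": 0.8, "tenure": 0.1, "age": -0.3,
           "debt": -0.6, "region": 0.05, "dependents": 0.2}
applicant = {"income": 1.0, "tenure": 2.0, "age": 1.0,
             "debt": 1.5, "region": 4.0, "dependents": 1.0}
explanation = top_contributors(weights, applicant)
```

For nonlinear models the same user-facing report can be produced with attribution techniques such as SHAP values, at higher computational cost.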
Above all, let’s always keep in mind that we should strive for the moral maximum. The law is, almost universally, a lagging indicator of the ethical course of action.
At this time, that course of action is to build awareness of ethical issues and ensure we can measure our outcomes against ethical expectations. Let’s establish a culture where our data scientists can explain their outcomes and the reasons our models make the decisions they make.
Watch the Data Brew Podcast to learn more about data ethics for your organization.
If you’re interested in developing expert technical content that performs, let’s have a conversation today.