Differential Privacy

Differential privacy is one of the most innovative privacy-preserving techniques: it works with aggregated user data, extracting useful information while keeping the data of individual users private by introducing calibrated randomness into the process of data retrieval.

“Differential privacy” describes a promise, made by a data holder, or curator, to a data subject: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”
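
Formally (a standard definition, included here for reference rather than taken from this repository), a randomized mechanism M satisfies ε-differential privacy if, for every pair of datasets D and D′ that differ in a single individual's record, and for every set of possible outputs S:

    Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

Smaller ε means the two output distributions are harder to distinguish, i.e. stronger privacy; the budget ε = 1 used in the second case study below bounds the ratio of output probabilities by e ≈ 2.72.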

Need for DP

Let's consider a simple use case where we're curating (or managing) a sensitive database and would like to release some statistics from this data to the public. However, we have to ensure that an adversary cannot reverse-engineer the sensitive data from what we've released, even given enough computational power and time.

The problem of statistical disclosure control, revealing accurate statistics about a population while preserving the privacy of individuals, has a venerable history. In 2006, Netflix announced a challenge for improving its recommendation algorithm by releasing 100 million anonymized movie ratings. Although the data sets were constructed to preserve customer privacy, two researchers from the University of Texas at Austin were still able to identify individual users by matching the data sets against film ratings on IMDb.

Case Studies

1. An analysis of the Mental Health in Tech Survey with and without preserving data privacy

The purpose of this demo is to showcase the utility of the OpenDP differential privacy framework by making statistical queries against the data with and without privacy-preserving mechanisms. As we compare query results side by side, we show that conclusions about the data are similar in both settings: without a privacy-preserving mechanism and with a differential privacy mechanism.

Data Set & Overview of Each Attribute

The Mental Health in Tech Survey data set is released by OSMI and consists of 27 questions answered by 1,259 volunteers. The data used in the analysis were preprocessed, i.e. the original age, gender, and country variables were mapped into categories (a rough sketch of this mapping follows the attribute list below).

Age: 21-30yo (0), 31-40yo (1), 41-50yo (2), 51-60yo (3), 60yo+ (4).

Gender: Male/Man (1), Female/Woman (2), all other inputs (0).

Country: United States (1), United Kingdom (2), Canada (3), other countries (0).

remote_work: Binary value that indicates if the participant works remotely more than 50% of the time.

family_history: Binary value that indicates if the participant has a family history of mental illness.

treatment: Binary value that indicates if the participant has sought treatment for mental illness.
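
The exact preprocessing lives in the notebook; as a rough illustration of the mapping described above, a pandas sketch could look like the following (the raw column names Age, Gender, and Country are assumptions based on the original survey export):

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        """Map raw survey columns to the categorical codes used in the analysis."""
        out = df.copy()

        # Age buckets: 21-30 -> 0, 31-40 -> 1, 41-50 -> 2, 51-60 -> 3, 60+ -> 4
        # (assumes implausible ages were already cleaned out of the data).
        out["age"] = pd.cut(out["Age"], bins=[20, 30, 40, 50, 60, 200],
                            labels=[0, 1, 2, 3, 4]).astype(int)

        # Gender: Male/Man -> 1, Female/Woman -> 2, all other free-text inputs -> 0
        gender = out["Gender"].astype(str).str.strip().str.lower()
        out["gender"] = gender.map({"male": 1, "man": 1,
                                    "female": 2, "woman": 2}).fillna(0).astype(int)

        # Country: United States -> 1, United Kingdom -> 2, Canada -> 3, other -> 0
        out["country"] = out["Country"].map({"United States": 1,
                                             "United Kingdom": 2,
                                             "Canada": 3}).fillna(0).astype(int)
        return out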

Now, we will make statistical queries on different variables to generate a comparative analysis of results obtained from data with and without differential privacy mechanisms.
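
Under the hood, a differentially private histogram can be released by adding calibrated Laplace noise to each true bin count. The snippet below is a minimal numpy sketch of that mechanism, not the OpenDP API actually used in the notebook:

    import numpy as np

    def dp_histogram(counts, epsilon=1.0, rng=None):
        """Release counts of disjoint bins under epsilon-DP via the Laplace mechanism."""
        rng = np.random.default_rng() if rng is None else rng
        counts = np.asarray(counts, dtype=float)
        # Adding or removing one individual changes exactly one bin by 1,
        # so the L1 sensitivity of the whole histogram is 1.
        noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
        # Rounding and clipping are post-processing and do not affect the guarantee.
        return np.clip(np.round(noisy), 0, None).astype(int)

    # Example with the true age distribution reported below.
    print(dp_histogram([478, 554, 149, 26, 6], epsilon=1.0))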

  1. Age

     The true age distribution is: [478, 554, 149, 26, 6]
     The age histogram obtained using DP is: [458, 555, 148, 48, 25]

  2. Country

     The true country distribution is: [240, 732, 175, 66]
     The country histogram obtained using DP is: [257, 810, 186, 63]

  3. Gender

     The true gender distribution is: [16, 955, 242]
     The gender histogram obtained using DP is: [52, 990, 238]

  4. Remote Work

     The true remote work distribution is: [360, 853]
     The remote work histogram obtained using DP is: [355, 898]

  5. Family History

     The true family history distribution is: [480, 733]
     The family history histogram obtained using DP is: [517, 709]

  6. Treatment

     The true treatment distribution is: [619, 598]
     The treatment histogram obtained using DP is: [604, 612]


2. Usefulness of differential privacy in mitigating an attack on an individual's data

In this demo, we examine perhaps the simplest possible attack on an individual's private data and what differential privacy can do to mitigate it. We consider a dataset of 10,000 people with attributes (name, sex, age, education, income, married, race).

Consider an attacker who knows everything about the data except for the person of interest's (POI) income, which is considered private. Since they already know the other known_obs = n_obs - 1 incomes (with mean known_mean), they can back out the individual's income very easily, just by asking for the overall mean income and differencing:

POI_income = overall_mean * n_obs - known_mean * known_obs

But suppose the attacker has to interact with the data through differential privacy and is given a privacy budget of ε = 1. To get a less noisy estimate, they should use tighter data bounds than they know are actually in the data, and they need to update their known_mean accordingly (clamping the known records to the same bounds).
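
As a rough illustration (not the exact code in the notebook), the numpy sketch below runs the differencing attack twice: once against the exact mean and once against a Laplace-noised mean over clamped incomes with ε = 1. The synthetic incomes and clamping bounds are assumptions made for the example:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for the 10,000-person dataset.
    incomes = rng.exponential(scale=27_000, size=10_000)
    n_obs = incomes.size
    poi_index = 0

    # The attacker knows every income except the person of interest's.
    known = np.delete(incomes, poi_index)
    known_mean, known_obs = known.mean(), known.size

    # 1) Exact differencing attack on the non-private mean.
    overall_mean = incomes.mean()
    poi_estimate = overall_mean * n_obs - known_mean * known_obs
    print("exact attack error:", abs(poi_estimate - incomes[poi_index]))

    # 2) The same attack when the mean is released via the Laplace mechanism.
    lower, upper = 0.0, 100_000.0           # assumed clamping bounds
    epsilon = 1.0
    sensitivity = (upper - lower) / n_obs   # sensitivity of the clamped mean
    dp_mean = np.clip(incomes, lower, upper).mean() + rng.laplace(scale=sensitivity / epsilon)

    # The attacker must clamp their known records too, so the difference stays consistent.
    known_clamped_mean = np.clip(known, lower, upper).mean()
    dp_poi_estimate = dp_mean * n_obs - known_clamped_mean * known_obs
    print("attack error under DP:", abs(dp_poi_estimate - incomes[poi_index]))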

Upon executing the code in the attached Jupyter notebook, the results I got are as follows:

Known Mean Income: 26886.001600160016
Observed Mean Income: 26883.930944271226
Estimated POI Income: 6179.4427122677835
True POI Income: 6000.0

Installation Guide

  • Fork & clone this repository (and please star it too).
  • Enter the respective folder, then create and activate a virtual environment: python3 -m venv venv and . venv/bin/activate
  • Install all the dependencies: pip install -r requirements.txt
  • If you're running locally, install JupyterLab: pip install jupyterlab
  • Launch it locally: jupyter-lab

Note: I worked on this project as part of my junior year Information System Security coursework at the Indian Institute of Information Technology (IIIT), Gwalior, under the supervision of Dr. Debanjan Sadhya.

[Presentation Slides] [Project Report]