PowerPoint presentation

Statistical Confidentiality: Is Synthetic Data the Answer? George Duncan, 2006 February 13

Transcript of PowerPoint presentation

Page 1: PowerPoint presentation

Statistical Confidentiality: Is Synthetic Data the Answer?

George Duncan

2006 February 13

Page 2: PowerPoint presentation

Acknowledging Colleagues

Diane Lambert, Google
Stephen Fienberg, Carnegie Mellon
Stephen Roehrig, Carnegie Mellon
Lynne Stokes, Southern Methodist
Sallie Keller-McNulty, Rice
Mark Elliot, Manchester, UK
JJ Salazar, Universidad de La Laguna, Spain

Page 3: PowerPoint presentation

Acknowledging Current Funding

NSF, NISS Digital Government II, Data Confidentiality, Data Quality and Data Integration for Federal Databases: Foundations to Software Prototypes

Agency Partners:
Bureau of Labor Statistics
Bureau of Transportation Statistics
Census Bureau
National Agricultural Statistics Service
National Center for Education Statistics

Page 4: PowerPoint presentation

Questions Addressed

What’s the R-U confidentiality map?
What are synthetic data?
Can the research community benefit from synthetic data?
Source data: the Gold Standard?
How should we evaluate a synthesizer?

Page 5: PowerPoint presentation

Brokering Role of the Information Organization

[Diagram: Respondents supply data to the information organization (DATA CAPTURE); the organization releases data (DISSEMINATION) to the Policy Analyst, Decision Maker, Media, Researcher, and Data Snooper.]

Page 6: PowerPoint presentation

Why Confidentiality Matters

Ethical: Keeping promises; basic value tied to privacy concerns of solitude, autonomy and individuality
Pragmatic: Without confidentiality, respondent may not provide data; worse, may provide inaccurate data
Legal: Required under law

Page 7: PowerPoint presentation

Confidentiality Audit

Sensitive objects
  Attribute values
  Relationships
Susceptible data
  Geographical detail
  Longitudinal or panel structure
  Outliers
  Many attribute variables
  Detailed attribute variables
  Census versus survey/sample
  Existence of linkable external databases

Page 8: PowerPoint presentation

Making It Safe

Restricted Data
Restricted Access

Page 9: PowerPoint presentation

RESTRICTED ACCESS

Special Sworn Employee
  Census Bureau
Licensed Researchers
  National Center for Education Statistics
External Sites
  California Census Research Data Center

Page 10: PowerPoint presentation

On Line Access

Page 11: PowerPoint presentation

On Line Access

Restricted Access

Restricted Data

Restricted Access

Page 12: PowerPoint presentation
Page 13: PowerPoint presentation

Matrix Masking

Transforming the source data ($X$) to the disseminated data ($Y$)

Suppressions
Perturbations
Samplings
Aggregations

$Y = AXB + C$

Page 14: PowerPoint presentation

Matrix Masking

Transforming the original data ($X$) to the disseminated data ($Y$)

Suppressions
Perturbations
Samplings
Aggregations

$Y = AXB + C$

$X$: $n \times p$ source data matrix with $n$ records and $p$ attributes

Page 15: PowerPoint presentation

Matrix Masking

Transforming the original data ($X$) to the disseminated data ($Y$)

Suppressions
Perturbations
Samplings
Aggregations

$Y = AXB + C$

$A$: row operator, so record transformation
$B$: column operator, so attribute transformation
$C$: additive perturbation
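A minimal sketch of the masking equation Y = AXB + C, assuming NumPy. The data values, the particular choices of A, B and C, and the noise level are all hypothetical illustrations, not taken from the slides. Here A samples two records and averages two others, B drops one attribute, and C adds noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source data X: n = 4 records, p = 3 attributes.
X = np.array([
    [25.0,  40.0, 1.0],
    [31.0,  55.0, 0.0],
    [47.0,  90.0, 1.0],
    [52.0, 120.0, 0.0],
])

# A is the row (record) operator: keep records 0 and 2 (sampling)
# and replace records 1 and 3 by their average (aggregation).
A = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 0.5],
])

# B is the column (attribute) operator: keep the first two
# attributes and drop the third (suppression).
B = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
])

# C is the additive perturbation: independent normal noise.
C = rng.normal(0.0, 1.0, size=(3, 2))

# Disseminated data.
Y = A @ X @ B + C
print(Y)
```

With C set to zero the released file would hold records 0 and 2 unchanged plus the average of records 1 and 3, which shows how a single equation covers sampling, aggregation, suppression and perturbation.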

Page 16: PowerPoint presentation
Page 17: PowerPoint presentation

Use $X$ to estimate $F_X$

Generate samples from $\hat{F}_X$

Page 18: PowerPoint presentation

Origins of the Synthetic Data Idea

Computer Science:
  Liew, C. K., Choi, U. J., and Liew, C. J. (1985) A data distortion by probability distribution, ACM Transactions on Database Systems 10 395-411
Statistics:
  Rubin, D. B. (1993), Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata, Journal of Official Statistics 9 461-468

Page 19: PowerPoint presentation

Further Developments

Fienberg, S. E., Makov, U. E. and Steele, R. J. (1998) Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14 347-360
Kennickell, Arthur B. (1999) Multiple imputation and disclosure protection. Statistical Data Protection ’98 Lisbon 381-400
Now attention of other authors, particularly Little, Raghunathan, Reiter, Rubin, Abowd, Woodcock
My latest bibliography on SD has 31 entries

Page 20: PowerPoint presentation

What was the original purpose?

Public-use microdata file to allow user to make valid inferences about population parameters using straightforward statistical tools while protecting confidentiality (Rubin 1993)

Page 21: PowerPoint presentation

One Person’s Assessment

“… synthetic data sets which have all of the statistical properties of the original data set, but have entirely false data - made-up data, so that you cannot break confidentiality because, in fact, any data set, any data record you have is a synthetic data record. …

… possibly the way of the future for lots of very, very confidential data, and maybe because the … the ability to protect confidentiality … is being eroded by the internet … this is probably where we are going to be driven to, although, I hope not.”

--- Norman Bradburn (2003)

Page 22: PowerPoint presentation

Use $X$ to estimate $F_X$

Generate samples from $\hat{F}_X$

How should we get the synthesizer?

Page 23: PowerPoint presentation

Less-Ambitious Data-Use Purposes

“Gain familiarity with the dataset structure, develop code, and estimate analytical models - compare against ‘gold standard file’” (Abowd and Lane 2003, Abowd 2005)

“… people can send in their sort of model. They can make up the synthetic data. You can go back, you can run things, sharpen up your hypotheses and so forth, and then after you’ve got everything and get your codes all right and get your SAS Codes right, and then send it in and they will run the data - the real data, and they’ll send you back the results.” (Bradburn 2003)

Page 24: PowerPoint presentation

R-U Confidentiality Map

[Figure: R-U confidentiality map with axes Disclosure Risk R and Data Utility U. Marked on the map: No Data, Released Data, Original Data, and the Maximum Tolerable Risk.]

Page 25: PowerPoint presentation

Disclosure Limitation Parameters

Specify extent of disclosure limitation
Disclosure risk and data utility vary with these parameter values
  Top-coding limit
  Standard deviation of additive noise
Interpretation for synthetic data
  Extent released data are synthetic: partial synthetic data (Little, 1993)
  Extent synthetic data matches source data (e.g., outliers)
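The two example parameters can be sketched in a few lines, assuming NumPy. The income values, the limit of 100 and the noise standard deviation of 5 are hypothetical; the point is only that each method has a single tuning value that trades disclosure risk against data utility.

```python
import numpy as np


def top_code(values, limit):
    """Replace every value above `limit` by `limit` (top-coding)."""
    return np.minimum(values, limit)


def add_noise(values, sd, rng):
    """Add independent normal noise with standard deviation `sd`."""
    return values + rng.normal(0.0, sd, size=values.shape)


# Hypothetical incomes (in thousands) with two extreme values.
incomes = np.array([18.0, 42.0, 55.0, 73.0, 250.0, 900.0])
rng = np.random.default_rng(1)

top_coded = top_code(incomes, limit=100.0)
noisy = add_noise(incomes, sd=5.0, rng=rng)

print(top_coded)  # extreme values are capped at the limit
print(noisy)      # every value is perturbed
```

A lower top-coding limit or a larger noise standard deviation hides more, and also distorts more. Moving the released data along the R-U map amounts to varying these values.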

Page 26: PowerPoint presentation

Does Synthetic Data Guarantee Confidentiality?

Synthetic data record not respondent’s actual data record, so identity disclosure is impossible
Attribute disclosure can happen
Particularly with extreme values, it may be possible to re-identify a source record

Page 27: PowerPoint presentation

Does Synthetic Data Guarantee Confidentiality?

If simulated individuals have data values virtually identical to source individuals, possibility of both identity and attribute disclosure (Fienberg 1997, 2003)

If quasi-identifier attributes are synthesized, re-identification can happen if data snooper can link an external identified data source using the quasi-identifier attributes (Domingo-Ferrer et al 2005)
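The linkage attack can be sketched as an exact match on quasi-identifiers. Everything here is hypothetical: the records, the names, the choice of zip, birth year and sex as quasi-identifiers, and the rule that the snooper claims a match only when exactly one external person shares the key. Whether such a claim is correct depends on how closely the released quasi-identifier values follow the source values.

```python
# Hypothetical released (possibly synthetic) microdata.
released = [
    {"zip": "15213", "birth_year": 1950, "sex": "F", "income": 310},
    {"zip": "15213", "birth_year": 1972, "sex": "M", "income": 45},
    {"zip": "15217", "birth_year": 1972, "sex": "M", "income": 52},
]

# Hypothetical external file that carries names.
external = [
    {"name": "A. Smith", "zip": "15213", "birth_year": 1950, "sex": "F"},
    {"name": "B. Jones", "zip": "15217", "birth_year": 1972, "sex": "M"},
    {"name": "C. Brown", "zip": "15217", "birth_year": 1972, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Tuple of quasi-identifier values used for linking."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(released_records, external_records):
    """Return (name, income) claims where the key matches one person only."""
    names_by_key = {}
    for person in external_records:
        names_by_key.setdefault(key(person), []).append(person["name"])
    claims = []
    for record in released_records:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:
            claims.append((names[0], record["income"]))
    return claims


claims = link(released, external)
print(claims)
```

Only the first released record yields a claim. The second has no external match, and the third matches two people, so the snooper cannot single one out.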

Page 28: PowerPoint presentation

Does Synthetic Data Guarantee Confidentiality?

Because a synthetic data record is not any respondent’s actual data record, identity disclosure is directly impossible
Attribute disclosure is still possible
But, particularly with extreme values, it may still be possible to re-identify a source record
Some simulated individuals may have data values virtually identical to original sample individuals, so the possibility of both identity and attribute disclosure remains (Fienberg 1997, 2003)

Not fully, but it can appreciably lower disclosure risk

Page 29: PowerPoint presentation

Are Synthetic Data Valid?

Not unless we are careful in how it is synthesized
Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)

Page 30: PowerPoint presentation

Are Synthetic Data Valid?

Not unless we are careful in how it is synthesized
Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)

If we do it right

Page 31: PowerPoint presentation

Synthesizer Build

Synthesizer build involves constructing a statistical model
But… model purpose not the usual
  Not prediction, control or scientific understanding
Usual model construction exploits Occam’s Razor and seeks parsimony

Page 32: PowerPoint presentation

Careful with Occam’s Razor

"Everything should be made as simple as possible, but not one bit simpler." -- Albert Einstein
"Seek simplicity, and distrust it." -- Alfred North Whitehead

Page 33: PowerPoint presentation

Source Data not 24 Karat Gold Standard?

Steve Fienberg has noted
  Sampled population often not target population
  Coding errors, imputed missing data
Do we really want to duplicate the statistical results obtainable from the source data?
  Match source data
Or, do we want to obtain statistical inferences equally valid as those from the source data?
  Match source data goal

Page 34: PowerPoint presentation

What posterior predictive distribution for synthetic data?

“In actual implementations, the correct posterior predictive distribution is not known, and an imputer-constructed approximation is used.” Jerry Reiter (2002)

What sampling distributions?
What priors work best?
What if the data analyst uses a prior very different from the synthesizer?

Page 35: PowerPoint presentation

[Figure: Scatterplot of Y vs X. X axis runs from 60 to 130; Y axis runs from 200 to 400.]

Page 36: PowerPoint presentation

Regression Analysis: Y versus X, X-squared

The regression equation is
Y = 6.61 + 3.05 X + 0.00062 X-squared

Predictor   Coef      SE Coef   T      P
Constant    6.605     9.829     0.67   0.507
X           3.0516    0.2044    14.93  0.000
X-squared   0.000621  0.001046  0.59   0.558

S = 1.62190   R-Sq = 99.9%

Page 37: PowerPoint presentation

The regression equation is
Y = 0.88 + 3.17 X

Predictor   Coef     SE Coef  T       P
Constant    0.881    1.890    0.47    0.645
X           3.17236  0.01892  167.64  0.000

S = 1.60303   R-Sq = 99.9%

Page 38: PowerPoint presentation

What should we use to generate the synthetic data?

Descriptive Statistics: X, Y

Variable  N   Mean    StDev
X         30  98.65   15.73
Y         30  313.85  9.12

Page 39: PowerPoint presentation

[Figure: Normal probability plot of X (Percent vs X). Mean 98.65, StDev 15.73, N 30, AD 0.154, P-Value 0.952.]

Page 40: PowerPoint presentation

Usual Modeling Approach (non-informative Bayes)

Take

$X \sim N\left(98.65,\ \tfrac{31}{30}\, 15.73^2\right)$

$Y \mid X = x \sim N\left(0.88 + 3.17x,\ \tfrac{31}{30}\, 1.60^2\right)$
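A minimal sketch of a synthesizer built from this model, assuming NumPy and reading the slide as X ~ N(98.65, (31/30)·15.73²) and Y | X = x ~ N(0.88 + 3.17x, (31/30)·1.60²). The function name, the seed and the synthetic sample size of 30 are assumptions made for illustration.

```python
import numpy as np


def synthesize(n, rng):
    """Draw n synthetic (X, Y) pairs from the fitted normal model."""
    inflate = 31.0 / 30.0  # variance inflation factor read from the slide
    sim_x = rng.normal(98.65, np.sqrt(inflate) * 15.73, size=n)
    # The mean of Y depends on each simulated X; the scale is common.
    sim_y = rng.normal(0.88 + 3.17 * sim_x, np.sqrt(inflate) * 1.60)
    return sim_x, sim_y


rng = np.random.default_rng(2006)
sim_x, sim_y = synthesize(30, rng)

# Refit a straight line to the synthetic data, as an analyst would.
slope, intercept = np.polyfit(sim_x, sim_y, 1)
print(slope, intercept)
```

The refitted slope should land near 3.17, since the synthetic data can only reproduce what the synthesizer model contains.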

Page 41: PowerPoint presentation

[Figure: Scatterplot of Sim Y vs Sim X. Sim X axis runs from 60 to 130; Sim Y axis runs from 200 to 400.]

Page 42: PowerPoint presentation

The regression equation is
Sim Y = 3.39 + 3.14 Sim X

Predictor   Coef     SE Coef  T       P
Constant    3.393    1.810    1.87    0.071
Sim X       3.14138  0.01921  163.56  0.000

S = 1.55825   R-Sq = 99.9%

Page 43: PowerPoint presentation

Compare with the “Gold Standard” Analysis

Based on Source Data

The regression equation is
Y = 0.88 + 3.17 X

Predictor   Coef     SE Coef  T       P
Constant    0.881    1.890    0.47    0.645
X           3.17236  0.01892  167.64  0.000

S = 1.60303   R-Sq = 99.9%

Based on Simulated Data

The regression equation is
Sim Y = 3.39 + 3.14 Sim X

Predictor   Coef     SE Coef  T       P
Constant    3.393    1.810    1.87    0.071
Sim X       3.14138  0.01921  163.56  0.000

S = 1.55825   R-Sq = 99.9%

Page 44: PowerPoint presentation

Reality

$Y = 8 + 3X + .001X^2 + \varepsilon, \quad \varepsilon \sim N(0,1)$

three outliers

Page 45: PowerPoint presentation

So What’s So Bad?

Lost quadratic effect
  Think of analyst with positive prior on this
Lost outliers
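The loss of the quadratic effect can be sketched as follows, assuming NumPy and taking the true relationship to be Y = 8 + 3X + .001X² + ε with ε ~ N(0,1). The source X distribution, the seed and the synthetic sample size are hypothetical. The three outliers are left out because the slides do not give their values.

```python
import numpy as np

rng = np.random.default_rng(13)

# Hypothetical source sample of 30 records from the quadratic reality.
n = 30
x = rng.normal(100.0, 15.0, size=n)
y = 8.0 + 3.0 * x + 0.001 * x**2 + rng.normal(0.0, 1.0, size=n)

# Parsimonious synthesizer: a straight line with normal errors.
slope, intercept = np.polyfit(x, y, 1)
resid_sd = np.std(y - (intercept + slope * x), ddof=2)

# Generate a large synthetic file from the straight-line model.
m = 100_000
sim_x = rng.normal(x.mean(), x.std(ddof=1), size=m)
sim_y = rng.normal(intercept + slope * sim_x, resid_sd)

# An analyst who fits a quadratic to the synthetic file finds no curvature.
quad_coef_synthetic = np.polyfit(sim_x, sim_y, 2)[0]
print(quad_coef_synthetic)
```

However large the synthetic file, its quadratic coefficient stays near zero rather than near .001, because the straight-line synthesizer never carried that term. An analyst with a prior favoring curvature cannot recover it from the released data.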

Page 46: PowerPoint presentation

Data Utility: Inference-Valid?

What does inference valid mean?
  Same results as with original data
  Equal inference capability as original data? (Think like post-19th century statistician)

Page 47: PowerPoint presentation

Is Inference-Valid Synthetic Data Possible?

“How robust are inferences to mis-specifications in the model used to draw synthetic data?” Jerry Reiter

Method used in imputation must foresee complete-data analyses http://www.multiple-imputation.com/

Page 48: PowerPoint presentation

Implementation is Hard

Model development time-consuming and human-resource demanding, typically needing domain knowledge and statistical skills
Model is a simplification of reality: an incomplete image
Model selection/parameterization subjective
Data users’ models and methods more and more sophisticated (Bucher & Vckovski, 1995)

Page 49: PowerPoint presentation

Multivariate Difficulties

Capturing multivariate statistical characteristics is time consuming
  Dandekar (2004)
Difficult to model joint distribution for several variables, especially in the presence of categorical variables
  Singh, Yu, and Dunteman (2003)

Page 50: PowerPoint presentation

Sample Survey Data

Generate synthetic data for sampled units
  More disclosure risk
  Data utility?
Generate synthetic data for population units
  Less disclosure risk
  Data utility?
Preserve structure of sampling design?
  Singh, Yu, and Dunteman (2003)

Page 51: PowerPoint presentation

Usual Hard Problems Remain Hard!

Geographical detail
  Synthetic data for sampled units?
Longitudinal data
  Preserve complex relationships
  Approximate à la Abowd and Woodcock (2001)
Target known to be in sample
  Synthetic data for sampled units?

Page 52: PowerPoint presentation

Final Messages

Follow the R-U confidentiality map
Don’t accept the source data as the Gold Standard
In sculpting a synthesizer, Occam’s Razor cuts too deeply
Implementing synthetic data is hard, so no panacea for microdata release