Statistical Confidentiality: Is Synthetic Data the Answer?
George Duncan
2006 February 13
Acknowledging Colleagues
  Diane Lambert, Google
  Stephen Fienberg, Carnegie Mellon
  Stephen Roehrig, Carnegie Mellon
  Lynne Stokes, Southern Methodist
  Sallie Keller-McNulty, Rice
  Mark Elliot, Manchester, UK
  JJ Salazar, Universidad de La Laguna, Spain
Acknowledging Current Funding
NSF, NISS Digital Government II, Data Confidentiality, Data Quality and Data Integration for Federal Databases: Foundations to Software Prototypes
Agency Partners:
  Bureau of Labor Statistics
  Bureau of Transportation Statistics
  Census Bureau
  National Agricultural Statistics Service
  National Center for Education Statistics
Questions Addressed
  What’s the R-U confidentiality map?
  What are synthetic data?
  Can the research community benefit from synthetic data?
  Source data: the Gold Standard?
  How should we evaluate a synthesizer?
Brokering Role of the Information Organization
[Diagram: respondents supply data through DATA CAPTURE; DISSEMINATION serves the policy analyst, decision maker, media, researcher, and data snooper]
Why Confidentiality Matters
  Ethical: Keeping promises; basic value tied to privacy concerns of solitude, autonomy and individuality
  Pragmatic: Without confidentiality, respondent may not provide data; worse, may provide inaccurate data
  Legal: Required under law
Confidentiality Audit
Sensitive objects
  Attribute values
  Relationships
Susceptible data
  Geographical detail
  Longitudinal or panel structure
  Outliers
  Many attribute variables
  Detailed attribute variables
  Census versus survey/sample
  Existence of linkable external databases
Making It Safe
  Restricted Data
  Restricted Access
RESTRICTED ACCESS
Special Sworn Employee
  Census Bureau
Licensed Researchers
  National Center for Education Statistics
External Sites
  California Census Research Data Center
On Line Access
  Restricted Access
  Restricted Data
Matrix Masking
Transforming the source data ($X$) to the disseminated data ($Y$)
  Suppressions
  Perturbations
  Samplings
  Aggregations
$Y = AXB + C$
  $X$: source data matrix, $n \times p$, with $n$ records and $p$ attributes
  $A$: row operator, so record transformation
  $B$: column operator, so attribute transformation
  $C$: additive perturbation
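A minimal NumPy sketch of how the four masking operations can be written in the form Y = AXB + C. The toy data, the helper name `matrix_mask`, and the particular choices of A, B and C are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def matrix_mask(X, A=None, B=None, C=None):
    """Return Y = A X B + C; an omitted operator acts as identity (A, B) or zero (C)."""
    n, p = X.shape
    A = np.eye(n) if A is None else A
    B = np.eye(p) if B is None else B
    Y = A @ X @ B
    return Y if C is None else Y + C

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # toy source data: n = 6 records, p = 3 attributes

# Sampling: A keeps records 0, 2 and 5 (selected rows of the identity matrix).
A_sample = np.eye(6)[[0, 2, 5]]
Y_sample = matrix_mask(X, A=A_sample)

# Aggregation: A averages consecutive pairs of records.
A_aggregate = np.kron(np.eye(3), np.full((1, 2), 0.5))
Y_aggregate = matrix_mask(X, A=A_aggregate)

# Suppression: B drops the last attribute.
B_suppress = np.eye(3)[:, :2]
Y_suppress = matrix_mask(X, B=B_suppress)

# Perturbation: C adds independent noise to every cell.
C_noise = rng.normal(scale=0.1, size=X.shape)
Y_perturbed = matrix_mask(X, C=C_noise)
```

Writing each operation as a matrix makes it easy to compose them, for example sampling records and adding noise in a single call.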
Use $X$ to estimate $F_X$
Generate samples from $\hat{F}_X$
Origins of the Synthetic Data Idea
Computer Science:
  Liew, C. K., Choi, U. J., and Liew, C. J. (1985) A data distortion by probability distribution, ACM Transactions on Database Systems 10 395-411
Statistics:
  Rubin, D. B. (1993), Satisfying confidentiality constraints through the use of synthetic multiply-imputed microdata, Journal of Official Statistics 9 461-468
Further Developments
  Fienberg, S. E., Makov, U. E. and Steele, R. J. (1998) Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics 14 347-360
  Kennickell, Arthur B. (1999) Multiple imputation and disclosure protection. Statistical Data Protection ’98 Lisbon 381-400
  Now attention of other authors, particularly Little, Raghunathan, Reiter, Rubin, Abowd, Woodcock
  My latest bibliography on SD has 31 entries
What was the original purpose?
Public-use microdata file to allow user to make valid inferences about population parameters using straightforward statistical tools while protecting confidentiality (Rubin 1993)
One Person’s Assessment
“… synthetic data sets which have all of the statistical properties of the original data set, but have entirely false data - made-up data, so that you cannot break confidentiality because, in fact, any data set, any data record you have is a synthetic data record. …
… possibly the way of the future for lots of very, very confidential data, and maybe because the … the ability to protect confidentiality … is being eroded by the internet … this is probably where we are going to be driven to, although, I hope not.”
--- Norman Bradburn (2003)
How should we get the synthesizer?
Use $X$ to estimate $F_X$
Generate samples from $\hat{F}_X$
Less-Ambitious Data-Use Purposes
“Gain familiarity with the dataset structure, develop code, and estimate analytical models; compare against ‘gold standard file’” (Abowd and Lane 2003, Abowd 2005)
“… people can send in their sort of model. They can make up the synthetic data. You can go back, you can run things, sharpen up your hypotheses and so forth, and then after you’ve got everything and get your codes all right and get your SAS Codes right, and then send it in and they will run the data - the real data, and they’ll send you back the results.” (Bradburn 2003)
R-U Confidentiality Map
[Figure: Disclosure Risk R (vertical axis) against Data Utility U (horizontal axis), marking No Data, Original Data, Released Data, and the Maximum Tolerable Risk level]
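A rough sketch of how points on an R-U map might be traced for one masking method, additive noise. The risk proxy (share of source records whose nearest masked record is their own) and the utility proxy (a decreasing function of mean squared error) are assumptions chosen for illustration; the slides do not specify how R and U are measured.

```python
import numpy as np

def ru_point(source, noise_sd, rng):
    """Return (risk, utility) proxies for additive noise with SD `noise_sd`."""
    masked = source + rng.normal(scale=noise_sd, size=source.shape)
    # Risk proxy: share of source records whose nearest masked record is their own.
    d = np.linalg.norm(source[:, None, :] - masked[None, :, :], axis=2)
    risk = np.mean(d.argmin(axis=1) == np.arange(len(source)))
    # Utility proxy: 1 / (1 + mean squared error of the masked values).
    utility = 1.0 / (1.0 + np.mean((masked - source) ** 2))
    return risk, utility

rng = np.random.default_rng(5)
source = rng.normal(size=(200, 3))  # hypothetical source data
curve = [(sd, *ru_point(source, sd, rng)) for sd in (0.0, 0.1, 0.5, 1.0, 2.0)]
for sd, risk, utility in curve:
    print(f"noise sd {sd:.1f}: risk {risk:.2f}, utility {utility:.2f}")
```

With no noise the point sits at the original data (both proxies equal 1); increasing the noise moves the released data toward the "no data" corner, and the release decision is where that path crosses the maximum tolerable risk.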
Disclosure Limitation Parameters
Specify extent of disclosure limitation
Disclosure risk and data utility vary with these parameter values
  Top-coding limit
  Standard deviation of additive noise
Interpretation for synthetic data
  Extent released data are synthetic: partial synthetic data (Little, 1993)
  Extent synthetic data matches source data (e.g., outliers)
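The two example parameters above can be illustrated with a short sketch. The income-like attribute, the specific limits, and the noise levels are hypothetical; the point is only that each parameter value changes how far the released values drift from the source.

```python
import numpy as np

def top_code(values, limit):
    """Replace every value above `limit` with `limit`."""
    return np.minimum(values, limit)

def add_noise(values, sd, rng):
    """Add independent zero-mean normal noise with standard deviation `sd`."""
    return values + rng.normal(scale=sd, size=values.shape)

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10.5, sigma=0.8, size=1000)  # hypothetical attribute

# Lower top-coding limits protect more extreme values but bias the mean more.
for limit in (200_000, 100_000, 50_000):
    masked = top_code(incomes, limit)
    print(f"top-code {limit}: mean shifts by {masked.mean() - incomes.mean():.0f}")

# Larger noise SDs weaken the link between released and source values.
for sd in (1_000, 10_000, 50_000):
    masked = add_noise(incomes, sd, rng)
    corr = np.corrcoef(incomes, masked)[0, 1]
    print(f"noise sd {sd}: correlation with source = {corr:.3f}")
```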
Does Synthetic Data Guarantee Confidentiality?
  Synthetic data record not respondent’s actual data record, so identity disclosure is impossible
  Attribute disclosure can happen
  Particularly with extreme values, it may be possible to re-identify a source record
Does Synthetic Data Guarantee Confidentiality?
  If simulated individuals have data values virtually identical to source individuals, possibility of both identity and attribute disclosure (Fienberg 1997, 2003)
  If quasi-identifier attributes are synthesized, re-identification can happen if data snooper can link an external identified data source using the quasi-identifier attributes (Domingo-Ferrer et al 2005)
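One simple way to look for simulated individuals that are "virtually identical" to source individuals is a nearest-record check. The sketch below is an assumption about how such a check could be done, not a method from the cited papers; the function name `near_matches`, the standardized Euclidean distance, and the tolerance are all illustrative.

```python
import numpy as np

def near_matches(source, synthetic, tol):
    """Indices of synthetic records within `tol` of some source record.

    Distance is Euclidean on attributes standardized by the source mean and SD.
    """
    mu = source.mean(axis=0)
    sd = source.std(axis=0, ddof=1)
    s = (source - mu) / sd
    t = (synthetic - mu) / sd
    dists = np.linalg.norm(t[:, None, :] - s[None, :, :], axis=2)
    return np.flatnonzero(dists.min(axis=1) <= tol)

rng = np.random.default_rng(3)
source = rng.normal(size=(50, 4))      # hypothetical source records
synthetic = rng.normal(size=(50, 4))   # hypothetical synthetic records
synthetic[7] = source[12] + 1e-3       # plant a near-copy of a source record

flagged = near_matches(source, synthetic, tol=0.05)
print("synthetic records too close to a source record:", flagged)
```

Flagged records are candidates for re-drawing or further review before release, since a data snooper holding the matching source values could recognize them.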
Does Synthetic Data Guarantee Confidentiality?
  Because a synthetic data record is not any respondent’s actual data record, identity disclosure is directly impossible
  Attribute disclosure is still possible
  But, particularly with extreme values, it may still be possible to re-identify a source record
  Some simulated individuals may have data values virtually identical to original sample individuals, so the possibility of both identity and attribute disclosure remains (Fienberg 1997, 2003)
Not fully, but it can appreciably lower disclosure risk
Are Synthetic Data Valid?
  Not unless we are careful in how they are synthesized
  Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity (Abowd)
If we do it right
Synthesizer Build
Synthesizer build involves constructing a statistical model
But… model purpose not the usual
  Not prediction, control or scientific understanding
Usual model construction exploits Occam’s Razor and seeks parsimony
Careful with Occam’s Razor
"Everything should be made as simple as possible, but not one bit simpler." -- Albert Einstein
"Seek simplicity, and distrust it." -- Alfred North Whitehead
Source Data not 24 Karat Gold Standard?
Steve Fienberg has noted
  Sampled population often not target population
  Coding errors, imputed missing data
Do we really want to duplicate the statistical results obtainable from the source data? (Match source data)
Or, do we want to obtain statistical inferences as valid as those from the source data? (Match source data goal)
What posterior predictive distribution for synthetic data?
“In actual implementations, the correct posterior predictive distribution is not known, and an imputer-constructed approximation is used.” Jerry Reiter (2002)
  What sampling distributions?
  What priors work best?
  What if the data analyst uses a prior very different from the synthesizer?
[Figure: Scatterplot of Y vs X]
Regression Analysis: Y versus X, X-squared

The regression equation is
Y = 6.61 + 3.05 X + 0.00062 X-squared

Predictor   Coef      SE Coef   T       P
Constant    6.605     9.829     0.67    0.507
X           3.0516    0.2044    14.93   0.000
X-squared   0.000621  0.001046  0.59    0.558

S = 1.62190   R-Sq = 99.9%
The regression equation is
Y = 0.88 + 3.17 X

Predictor   Coef      SE Coef   T       P
Constant    0.881     1.890     0.47    0.645
X           3.17236   0.01892   167.64  0.000

S = 1.60303   R-Sq = 99.9%
What should we use to generate the synthetic data?

Descriptive Statistics: X, Y

Variable  N   Mean    StDev
X         30  98.65   15.73
Y         30  313.85  9.12
[Figure: Normal probability plot of X (Percent vs X). Mean 98.65, StDev 15.73, N 30, AD 0.154, P-Value 0.952]
Usual Modeling Approach (non-informative Bayes)
Take
$X \sim N\left(98.65,\ \tfrac{31}{30}\,15.73^2\right)$
$Y \mid X = x \sim N\left(0.88 + 3.17x,\ \tfrac{31}{30}\,1.60^2\right)$
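A short sketch of a synthesizer built from these two distributions. The numbers 98.65, 15.73, 0.88, 3.17 and 1.60 are the fitted values reported on the slides; reading the slide formula as normal distributions whose variances are inflated by 31/30 is an assumption, as are the function name and the random seed.

```python
import numpy as np

N = 30                 # number of source records reported on the slides
INFLATE = (N + 1) / N  # assumed reading of the 31/30 variance factor

def synthesize(n, rng):
    """Draw n synthetic (X, Y) pairs from the fitted linear-normal model."""
    sim_x = rng.normal(98.65, np.sqrt(INFLATE) * 15.73, size=n)
    sim_y = rng.normal(0.88 + 3.17 * sim_x, np.sqrt(INFLATE) * 1.60)
    return sim_x, sim_y

rng = np.random.default_rng(2006)
sim_x, sim_y = synthesize(N, rng)

# Refit the straight line on the synthetic sample, as an analyst would.
slope, intercept = np.polyfit(sim_x, sim_y, 1)
print(f"Sim Y = {intercept:.2f} + {slope:.2f} Sim X")
```

Because the synthetic sample is drawn from the fitted line, a regression on it recovers roughly the same slope and intercept as the source data, up to sampling noise.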
[Figure: Scatterplot of Sim Y vs Sim X]
The regression equation is
Sim Y = 3.39 + 3.14 Sim X

Predictor   Coef      SE Coef   T       P
Constant    3.393     1.810     1.87    0.071
Sim X       3.14138   0.01921   163.56  0.000

S = 1.55825   R-Sq = 99.9%
Compare with the “Gold Standard” Analysis

Based on Source Data

The regression equation is
Y = 0.88 + 3.17 X

Predictor   Coef      SE Coef   T       P
Constant    0.881     1.890     0.47    0.645
X           3.17236   0.01892   167.64  0.000

S = 1.60303   R-Sq = 99.9%

Based on Simulated Data

The regression equation is
Sim Y = 3.39 + 3.14 Sim X

Predictor   Coef      SE Coef   T       P
Constant    3.393     1.810     1.87    0.071
Sim X       3.14138   0.01921   163.56  0.000

S = 1.55825   R-Sq = 99.9%
Reality
$Y = 8 + 3X + .001X^2 + \varepsilon, \quad \varepsilon \sim N(0,1)$
three outliers
So What’s So Bad?
Lost quadratic effect
  Think of analyst with positive prior on this
Lost outliers
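The loss of the quadratic effect can be reproduced in a small simulation. This sketch assumes the "Reality" slide reads Y = 8 + 3X + 0.001 X^2 + e with e ~ N(0, 1), and assumes X is normal with the reported mean and SD; the three outliers are left out because the slides do not say what they were.

```python
import numpy as np

rng = np.random.default_rng(13)

# Assumed "reality": Y = 8 + 3X + 0.001 X^2 + e, e ~ N(0, 1).
# The X distribution is an assumption matched to the reported mean and SD.
n_source = 30
x = rng.normal(98.65, 15.73, size=n_source)
y = 8 + 3 * x + 0.001 * x**2 + rng.normal(size=n_source)

# Synthesizer: fit the parsimonious straight-line model to the source data.
slope, intercept = np.polyfit(x, y, 1)
resid_sd = np.std(y - (intercept + slope * x), ddof=2)

# Draw a large synthetic sample so the synthesizer's own structure is visible.
n_synth = 100_000
sim_x = rng.normal(x.mean(), x.std(ddof=1), size=n_synth)
sim_y = rng.normal(intercept + slope * sim_x, resid_sd)

# An analyst who fits a quadratic to the synthetic data finds no curvature.
quad_synth = np.polyfit(sim_x, sim_y, 2)[0]
print("quadratic coefficient in the assumed reality: 0.001")
print(f"quadratic coefficient recovered from synthetic data: {quad_synth:.6f}")
```

However large the synthetic sample, the curvature cannot come back, because the linear synthesizer never encoded it. An analyst with a positive prior on the quadratic term would be misled.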
Data Utility: Inference-Valid?
What does inference valid mean?
  Same results as with original data
  Equal inference capability as original data? (Think like post-19th century statistician)
Is Inference-Valid Synthetic Data Possible?
“How robust are inferences to mis-specifications in the model used to draw synthetic data?” Jerry Reiter
Method used in imputation must foresee complete-data analyses
http://www.multiple-imputation.com/
Implementation is Hard
  Model development time-consuming and human-resource demanding, typically needing domain knowledge and statistical skills
  Model is a simplification of reality: an incomplete image
  Model selection/parameterization subjective
  Data users’ models and methods more and more sophisticated (Bucher & Vckovski, 1995)
Multivariate Difficulties
  Capturing multivariate statistical characteristics is time consuming (Dandekar 2004)
  Difficult to model joint distribution for several variables, especially in the presence of categorical variables (Singh, Yu, and Dunteman 2003)
Sample Survey Data
Generate synthetic data for sampled units
  More disclosure risk
  Data utility?
Generate synthetic data for population units
  Less disclosure risk
  Data utility?
Preserve structure of sampling design? (Singh, Yu, and Dunteman 2003)
Usual Hard Problems Remain Hard!
Geographical detail
  Synthetic data for sampled units?
Longitudinal data
  Preserve complex relationships
  Approximate à la Abowd and Woodcock (2001)
Target known to be in sample
  Synthetic data for sampled units?
Final Messages
  Follow the R-U confidentiality map
  Don’t accept the source data as the Gold Standard
  In sculpting a synthesizer, Occam’s Razor cuts too deeply
  Implementing synthetic data is hard, so no panacea for microdata release