
Theory of Sample Surveys with R

2015
978-3-8385-4328-4
978-3-8252-4328-9
UTB 
Andreas Behr

This English-language textbook presents the most important methods of modern design-based sampling theory. All methods are illustrated with numerical examples, and their implementation with the statistical software R is demonstrated. Numerous empirical examples and simulations help the reader to assess the properties of estimators. To deepen understanding, each chapter contains exercises.

<?page no="1"?> Eine Arbeitsgemeinschaft der Verlage Böhlau Verlag · Wien · Köln · Weimar Verlag Barbara Budrich · Opladen · Toronto facultas · Wien Wilhelm Fink · Paderborn A. Francke Verlag · Tübingen Haupt Verlag · Bern Verlag Julius Klinkhardt · Bad Heilbrunn Mohr Siebeck · Tübingen Nomos Verlagsgesellschaft · Baden-Baden Ernst Reinhardt Verlag · München · Basel Ferdinand Schöningh · Paderborn Eugen Ulmer Verlag · Stuttgart UVK Verlagsgesellschaft · Konstanz, mit UVK / Lucius · München Vandenhoeck & Ruprecht · Göttingen · Bristol Waxmann · Münster · New York utb 4328 <?page no="2"?> Andreas Behr Theory of Sample Surveys with R UVK Verlagsgesellschaft mbH · Konstanz mit UVK/ Lucius · München <?page no="3"?> Prof. Dr. Andreas Behr lehrt Statistik an der Universität Duisburg-Essen. Lob und Kritik Wir freuen uns darüber, dass Sie sich für ein UTB-Lehrbuch entschieden haben und hoffen, dass Sie dieses Buch in Ihrem Studium sinnvoll unterstützt. Für Lob und Kritik haben wir stets ein offenes Ohr: Schreiben Sie uns einfach eine E-Mail an das Lektorat (wirtschaft@uvk.de). Online-Angebote oder elektronische Ausgaben sind erhältlich unter www.utb-shop.de. Bibliografische Information der Deutschen Bibliothek Die Deutsche Bibliothek verzeichnet diese Publikation in der Deutschen Nationalbibliografie; detaillierte bibliografische Daten sind im Internet über <http: / / dnb.ddb.de> abrufbar. Das Werk einschließlich aller seiner Teile ist urheberrechtlich geschützt. Jede Verwertung außerhalb der engen Grenzen des Urheberrechtsgesetzes ist ohne Zustimmung des Verlages unzulässig und strafbar. Das gilt insbesondere für Vervielfältigungen, Übersetzungen, Mikroverfilmungen und die Einspeicherung und Verarbeitung in elektronischen Systemen. © UVK Verlagsgesellschaft mbH, Konstanz und München 2015 Lektorat: Rainer Berger Einbandgestaltung: Atelier Reichert, Stuttgart Umschlagmotiv: © bluedesign - Fotolia.com Druck und Bindung: fgb freiburger graphische betriebe, Freiburg UVK Verlagsgesellschaft mbH Schützenstr. 24 · 78462 Konstanz Tel. 07531/ 9053-0 · Fax 07531/ 9053-98 www.uvk.de UTB-Nr. 4 8 ISBN 978-3-8252-4328-9 <?page no="4"?> Preface The book is based upon my lecture notes developed for the course in sampling theory at Münster University and University Essen- Duisburg. Some basic knowledge of statistical methods, regression analysis and the R programming environment would be helpful for understanding the text. I am grateful to Ullrich Rendtel, Götz Rohwer and Ulrich Pötter helping me to understand the basics of sampling theory and the feedback from many students of my courses. I would like to thank Christoph Schiwy for providing the layout of the manuscript using Latex and knitr and Katja Theune and Jurij Weinblat for reading the manuscript. Münster, December 2014 Andreas Behr Some hints for solutions of the exercises which accompany all chapters and the data files used in the text can be downloaded at www.uvk-lucius.de/ behr. <?page no="6"?> Contents List of Figures 13 1 Introduction 15 1.1 Sources of randomness . . . . . . . . . . . . . . . . 16 1.1.1 Stochastic model . . . . . . . . . . . . . . . 16 1.1.2 Design approach . . . . . . . . . . . . . . . 17 1.2 Surveys . . . . . . . . . . . . . . . . . . . . . . . . 17 1.2.1 Characteristics of surveys . . . . . . . . . . 17 1.2.2 Sampling frame . . . . . . . . . . . . . . . . 19 1.2.3 Probability sampling . . . . . . . . . . . . . 19 1.2.4 Sampling and inference . . . . . . . . . . . 19 1.3 Specific designs . . . . . . . . . . . . . . . . . . . . 
20 1.3.1 Simple random sampling . . . . . . . . . . . 20 1.3.2 Stratified sampling . . . . . . . . . . . . . . 20 1.3.3 Cluster sampling . . . . . . . . . . . . . . . 21 1.4 Outline of the book . . . . . . . . . . . . . . . . . . 21 1.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . 23 2 Introduction to R 25 2.1 Some R basics . . . . . . . . . . . . . . . . . . . . . 26 2.1.1 Object orientation . . . . . . . . . . . . . . 26 2.1.2 Dataframes . . . . . . . . . . . . . . . . . . 26 2.1.3 Sequences, replications, conditions and loops . . . . . . . . . . . . . . . . . . . 27 2.1.4 Matrices . . . . . . . . . . . . . . . . . . . . 30 2.1.5 Storing and reading data files . . . . . . . . 31 2.1.6 Probability distributions . . . . . . . . . . . 32 2.1.7 Graphics . . . . . . . . . . . . . . . . . . . 33 2.1.8 Linear regression . . . . . . . . . . . . . . . 34 2.2 Sampling from a population . . . . . . . . . . . . . 36 2.2.1 Enumeration of samples . . . . . . . . . . . 36 2.2.2 The sample() function . . . . . . . . . . . . 37 2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . 38 7 <?page no="7"?> 8 Contents 3 Inclusion probabilities 41 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Some notation . . . . . . . . . . . . . . . . . . . . 42 3.3 Inclusion indicator I . . . . . . . . . . . . . . . . . 43 3.4 A small example . . . . . . . . . . . . . . . . . . . 44 3.5 Inclusion probabilities π . . . . . . . . . . . . . . . 45 3.6 Obtaining inclusion probabilities with R . . . . . . 46 3.7 Simple random sampling (SI) . . . . . . . . . . . . 47 3.8 Properties of the inclusion indicator . . . . . . . . 50 3.8.1 The expected value of the inclusion indicator . . . . . . . . . . . . . . . . . . . 51 3.8.2 The variance of the inclusion indicator . . . 51 3.8.3 The covariance of the inclusion indicator . . 51 3.8.4 Properties of the covariance . . . . . . . . . 52 3.8.5 Covariance matrix and sums of sums . . . . 53 3.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . 54 4 Estimation 57 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Estimating functions and estimators . . . . . . . . 58 4.3 Properties of estimation functions . . . . . . . . . . 58 4.4 The π -estimator . . . . . . . . . . . . . . . . . . . 59 4.4.1 Properties of the π -estimator . . . . . . . . 59 4.4.2 Expected value and variance of the π -estimator . . . . . . . . . . . . . . . . . . 59 4.4.3 An alternative expression of the variance . 62 4.4.4 The Yates-Grundy variance of the total . . 62 4.5 Estimation using R . . . . . . . . . . . . . . . . . . 64 4.5.1 A small numerical example . . . . . . . . . 64 4.5.2 An empirical example: PSID . . . . . . . . 68 4.6 Generating samples with unequal inclusion probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.6.1 Probabilities proportional to size (PPS) . . 70 4.6.2 The Sampford-algorithm . . . . . . . . . . . 72 4.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . 73 5 Simple sampling 75 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 76 5.2 Some general estimation functions . . . . . . . . . 76 5.2.1 The π -estimator for the total . . . . . . . . 76 5.2.2 The π -estimator for the mean . . . . . . . . 76 5.2.3 The π -estimator for a proportion . . . . . . 77 <?page no="8"?> Contents 9 5.3 Simple random sampling . . . . . . . . . . . . . . . 77 5.3.1 The π -estimator for the total (SI) . . . . . 78 5.3.2 The π -estimator for the mean (SI) . . . . . 79 5.3.3 The π -estimator for a proportion (SI) . . . 
80 5.4 Some examples using R . . . . . . . . . . . . . . . 81 5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . 86 6 Confidence intervals 89 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 90 6.2 Chebyshev-inequality . . . . . . . . . . . . . . . . . 92 6.2.1 Derivation of the Chebyshev-inequality . . . 92 6.2.2 Application of the Chebyshevinequality . . . . . . . . . . . . . . . . . . . 94 6.3 Confidence intervals based on a specific sample . . 94 6.4 Some general remarks . . . . . . . . . . . . . . . . 97 6.4.1 No approximate normality . . . . . . . . . . 97 6.4.2 Simplified variance estimators . . . . . . . . 97 6.4.3 Effect of simplification in the simple random sampling case . . . . . . . . . . . . . . . . . 98 6.4.4 Effect of simplification for the general π estimator . . . . . . . . . . . . . . . . . . . 99 6.4.5 Effect of simplification in stratified and clustered samples . . . . . . . . . . . . . . . . . 99 6.4.6 Bootstrap . . . . . . . . . . . . . . . . . . . 100 6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . 101 7 Stratified sampling 103 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 104 7.2 Some notation and an example . . . . . . . . . . . 104 7.2.1 Notation . . . . . . . . . . . . . . . . . . . . 104 7.2.2 Example: Sectors of employment as strata . 105 7.3 Estimation of the total . . . . . . . . . . . . . . . . 108 7.3.1 Simple random sampling within strata . . . 108 7.3.2 Example: simple random sampling within sectors . . . . . . . . . . . . . . . . . . . . . 110 7.4 Choosing the sample size for individual strata . . . 110 7.4.1 The minimization problem . . . . . . . . . 111 7.4.2 The Cauchy-Schwarz inequality . . . . . . . 112 7.4.3 Solving the minimization problem . . . . . 113 7.4.4 Example: optimal sampling size within sectors . . . . . . . . . . . . . . . . . . . . . 115 <?page no="9"?> 10 Contents 7.5 Sample allocation and efficiency . . . . . . . . . . . 116 7.5.1 Variance comparisons based on the variance decomposition . . . . . . . . . . . . . . . . 117 7.5.2 No stratification versus proportional allocation . . . . . . . . . . . . . . . . . . . 118 7.5.3 Proportional allocation versus optimal allocation . . . . . . . . . . . . . . . . . . . . . 118 7.5.4 No stratification versus optimal allocation . 119 7.5.5 An efficiency comparison with R . . . . . . 120 7.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . 123 8 Cluster sampling 125 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 126 8.2 Notation . . . . . . . . . . . . . . . . . . . . . . . . 126 8.2.1 Clustering the population . . . . . . . . . . 126 8.2.2 Artificially clustering the PSID sample . . . 127 8.2.3 Sampling clusters . . . . . . . . . . . . . . . 127 8.2.4 Inclusion probabilities . . . . . . . . . . . . 128 8.3 Estimating the population total . . . . . . . . . . . 129 8.3.1 The π -estimator of the population total . . 129 8.3.2 Variance of the π -estimator of the population total . . . . . . . . . . . . . . . . . . . . . . 130 8.4 Simple random sampling of clusters (SIC) . . . . . 131 8.4.1 The π -estimator of the population total . . 131 8.4.2 The π -estimator in the PSID example . . . 131 8.4.3 Variance of the π -estimator of the population total . . . . . . . . . . . . . . . . . . . . . . 132 8.4.4 Variance of the mean estimator in the PSID example . . . . . . . . . . . . . . . . . . . . 133 8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . 135 9 Auxiliary variables 137 9.1 Introduction . . . . . . . . . . . . . . . . 
. . . . . . 138 9.2 The ratio estimator . . . . . . . . . . . . . . . . . . 139 9.2.1 Example of the ratio estimator using PSID data . . . . . . . . . . . . . . . . . . . . . . 140 9.2.2 Taylor series expansion . . . . . . . . . . . 142 9.2.3 The approximate variance of the ratio estimator . . . . . . . . . . . . . . . . . . . . . 145 9.2.4 Estimating the approximate variance of the ratio estimator using PSID data . . . . . . 146 <?page no="10"?> Contents 11 9.2.5 Comparison of the ratio estimator with the simple π -estimator . . . . . . . . . . . . . . 148 9.2.6 The ratio estimator in the regression context . . . . . . . . . . . . . . . . . . . . 149 9.2.7 The linear regression model under specific heteroskedasticity assumption . . . . . . . . 151 9.3 The difference estimator . . . . . . . . . . . . . . . 152 9.3.1 The difference estimator using regression notation . . . . . . . . . . . . . . . . . . . . 152 9.3.2 Properties of the difference estimator . . . . 153 9.3.3 The difference estimator of average wage using the PSID data . . . . . . . . . . . . . 154 9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . 158 10 Regression 161 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . 162 10.1.1 Regression without intercept . . . . . . . . 162 10.1.2 Regression with intercept . . . . . . . . . . 165 10.1.3 Multivariate linear regression with intercept 167 10.2 Variance of the parameter estimators . . . . . . . . 169 10.2.1 Linear approximation of the π -estimator . . . . . . . . . . . . . . . . . . 169 10.2.2 The variance of the linear approximation of the π -estimator . . . . . . . . . . . . . . . . 171 10.2.3 Simple regression through the origin . . . . 175 10.2.4 Simple regression with intercept . . . . . . 176 10.2.5 Simple regression with intercept and simple random sampling . . . . . . . . . . . . . . . 177 10.2.6 Wage regression with PSID data and simple random sampling . . . . . . . . . . . . . . . 177 10.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . 180 Indices 183 Functions Index . . . . . . . . . . . . . . . . . . . . . . . 183 Subject Index . . . . . . . . . . . . . . . . . . . . . . . . 185 <?page no="12"?> List of Figures 2.1 Standard graphics with R . . . . . . . . . . . . . . 34 3.1 Increasing number of combinations . . . . . . . . . 49 4.1 Approx. dist. of the estimation function . . . . . . 69 5.1 Approx. dist. of the estimation functions . . . . . 83 5.2 Approx. dist. of the estimating functions . . . . . 85 6.1 Approx. dist. of the sample mean . . . . . . . . . . 91 6.2 95%-intervals (normal assum. and Chebyshev) . . 96 7.1 Sectoral mean wages and standard deviations . . . 107 7.2 Approx. dist. of estimating functions . . . . . . . . 122 8.1 Distribution of the cluster totals . . . . . . . . . . 128 8.2 Approx. dist. of the mean estimator . . . . . . . . 132 8.3 Approx. dist. of the variance of the mean estimator 135 9.1 Years of education and wage . . . . . . . . . . . . . 139 9.2 Approx. dist. of the ratio estimator . . . . . . . . 141 9.3 Approx. dist. of variance estimator . . . . . . . . . 148 9.4 Homogeneous regression of wage on years of education150 9.5 Distribution of differences . . . . . . . . . . . . . . 155 9.6 Comparison . . . . . . . . . . . . . . . . . . . . . . 156 10.1 Approx. dist. . . . . . . . . . . . . . . . . . . . . . 180 13 <?page no="14"?> 1 Introduction Survey samples provide the most important sources of information in the social sciences. 
Basic to all statistical considerations is the random selection of elements of the population into the sample. The sampling design specifies the specific procedure of sampling. While in a strict sense our knowledge will be confined to the elements in the sample, knowledge of the properties of specific sampling designs is indispensable to plan, carry out and analyse survey samples. 1.1 Sources of randomness 16 1.1.1 Stochastic model 16 1.1.2 Design approach 17 1.2 Surveys 17 1.2.1 Characteristics of surveys 17 1.2.2 Sampling frame 19 1.2.3 Probability sampling 19 1.2.4 Sampling and inference 19 1.3 Specific designs 20 1.3.1 Simple random sampling 20 1.3.2 Stratified sampling 20 1.3.3 Cluster sampling 21 1.4 Outline of the book 21 1.5 Exercises 23 <?page no="15"?> 16 1 Introduction 1.1 Sources of randomness Introductory courses in probability calculus discuss properties of random variates. Basic to the idea of randomness is the random generator G . In most applications of probability calculus in the social sciences, characteristics of individuals, e.g. the working status or the income, are regarded as realizations of random variates. A stochastic model can be regarded as a detailed description of a complicated random generator. In the design approach applied in survey sampling the random process is strictly confined to the random sampling of elements from the population. The characteristics of the elements (e.g. their income) are treated as fixed. The difference of the modelling and the design approach can be illustrated by means of simple random experiments, casting dices and dimes and drawing balls from urns. 1.1.1 Stochastic model Stochastic models are specific random generators that can repeatedly be used to generate realizations of random variates. A very specific understanding of social reality, most common in economics, regards this reality as being the product of the application of random generators. Assume we have a fair dice and will provide each person an amount of money equal to the number obtained from casting the dice (in 1000 euro) and call that amount income. Therefore, the income of a person is a random variate having a specific probability distribution. E.g., the expected income is 3500 euro and so is the average of two generated incomes. If we have the impression that this model is not a realistic description of the social process in which incomes are determined, we can improve (complicate) the random process. E.g., we can additionally cast a dime if the person is male. If head comes up, the income is increased by 1000 euro, if tail comes up by 2000 euro. Being still not satisfied with the model, we can throw additionally a dime if the person has an academic degree and increase the income by 1000 euro if head comes and by 2000 euro if tail. Obviously, we can proceed in complicating the model but the main point is that income is generated by means of a (perhaps rather complicated) random number generator and therefore a random variate. Moreover, we can generate as many incomes as we want making use of the random generator repeatedly. Alternatively, as is most common in econometrics, one can imagine the income of a person being a linear <?page no="16"?> 1.2 Surveys 17 combination of her characteristics, e.g. age, years of education and so on, to which a realization of a normally distributed random variate is added. 1.1.2 Design approach In the design approach, the characteristics of the individuals, e.g. their working status or their income, are treated as fixed. 
Assume a population of six individuals having incomes of 1000, 2000, . . ., 6000 euro. We do not speculate why person number six earns 6000 whereas person number one only earns 1000 euro. Now assume that we do not know the income of the six persons but that we can sample randomly two persons and get information about their income. To obtain a sample, we number six otherwise identical balls, put them in an urn, and draw blindly two balls. The two numbers obtained refer to two of the six persons and we will be informed about their incomes. Obviously, the incomes we observe depend on the sample we happen to draw. The expected average income of the two persons sampled is 3500 euro just as in the example of stochastic modelling. However, the difference is, in the sampling example the incomes are treated as fixed. Therefore, the income is not regarded as a random variate but as a fixed property of the persons. It is only random, which specific persons will be included in the sample. 1.2 Surveys Surveys based on random selection are important sources of information about conditions and changes in society. In Germany, e.g. the micro census (Mikrozensus) carried out every year by the German Federal Statistical Office (Statistisches Bundesamt) includes about 1 million individuals. The German Socio Economic Panel (GSOP) carried out by the German Institute for Economic Research (DIW Berlin) is an important source of information about social and economic conditions in Germany and is used in a large number of scientific analyses. 1.2.1 Characteristics of surveys A survey sample denotes a statistical inquiry and analysis that meets several important requirements. <?page no="17"?> 18 1 Introduction 1. The interest is confined to a well-defined population denoted by U . A census would include all elements of U but is very rarely carried out because of cost and time considerations. 2. Instead of questioning all elements of U , a sample s ⊂ U is sampled and information for the elements in s is obtained. 3. To apply random calculus, the individuals of the sample have to be selected by making use of (pseudo) random numbers, usually generated by means of implemented random generators. We denote a random generator by G . 4. A most simple random generator can be seen in an urn filled with different balls, which are drawn blindly, that is independently of what is written on the balls or their colour. Therefore, the ideal vision of obtaining a random sample is an urn with a ball for each element of the population marked with a non-ambiguous identifier and the blind draw of a specified number of balls. 5. The sample obtained making use of a random generator is called a random sample. We will restrict the discussion in this text towards random samples. 6. The sampling frame contains information to identify all elements in the population. Ideally, we can think of the sampling frame being a file, which uniquely assigns a nonambiguous identifier to each element of the population. We will abstract from the fact that, in practice, it is often difficult to obtain a complete list for a defined population. 7. The variables of interest are often quite numerous but we denote a representative variable of interest by Y . 8. The distribution of variable Y in the population can be characterized by different parameters, e.g. total (sum), mean, standard deviation, and so on. As we do not observe Y for all members of the population but only for the members of the sample, we cannot calculate the true parameters for the population. 
Instead, we try to estimate the population parameters based on the information provided by the sample. 9. We try to grasp the extent of the expected estimation error by providing estimates of the variance of the estimation functions. <?page no="18"?> 1.2 Surveys 19 1.2.2 Sampling frame The sampling frame ideally is a list containing a non-ambiguous identifier for all elements of the population. Furthermore, information on how the elements can be contacted must be included in the frame. As an example one can think of a complete and up to date register maintained by the city’s registration office including e.g. name, date of birth and address of all inhabitants. Note that in practice a complete and up to date register of the target population is seldom available due to non-registered inhabitants, incomplete registration of persons moving in or out of the relevant area and so on. Furthermore, even if a potential useful register exists, data privacy or prohibitive costs may prevent its use. 1.2.3 Probability sampling Throughout this text, we focus on probability sampling. The sampling space S = {s 1 , s 2 , . . . , s M } consists of the M different samples that can possibly be drawn from the population U . The sampling design associates a probability P( S = s ) = p ( s ) to each of the M possible samples. 1.2.4 Sampling and inference People often associate some miraculous capabilities with survey sampling. However, from the N elements in the population we will know the specific values y only for the n elements contained in the sample, and therefore our knowledge will be restricted to these sampled elements. About all the N − n elements of the population, which have not been sampled, we cannot say anything specific. Therefore, the relevant question is not what can be said about the elements which have not been sampled but rather in what specific way we should carry out the sampling, which estimating functions with their specific characteristics we should apply, and what we can expect to happen in doing so. Hence, we focus on the procedure of sampling and on the general characteristics of estimating functions. A simple trick will be helpful to learn about the properties of sampling designs and estimating functions: We counterfactually consider a population as completely known and observe the outcomes when drawing samples and applying estimating functions to these samples. <?page no="19"?> 20 1 Introduction 1.3 Specific designs The sampling design, which associates a probability p ( s ) to all possible samples contained in S , determines for all population elements the probability for inclusion in a sample. 1.3.1 Simple random sampling The most straightforward example of random sampling is simple random sampling. If we denote the number of elements in the population with N and the number of elements to be drawn as sample size n , there are |S| = M = ( N n ) possible samples. Each sample has the same probability to be drawn, which therefore is p ( s ) = 1 ( N n ) . As each of the N elements in the population has the same probability to be drawn and n elements are drawn, the probability for each element of the population to be drawn is n/ N . We will have a closer look at simple random sampling in chapter 5. 1.3.2 Stratified sampling In practice, simple random sampling is seldom used, as usually there is ex ante information available which may be used to increase the efficiency. E.g., consider sampling from the population of a city. 
Most often there is information available for different areas of the city, e.g. better and less well-offareas. This information can be used to divide the population into strata, i.e. areas in the example, and then to draw random samples from all strata (areas). Stratification guarantees that elements from all strata will be sampled and therefore prevents extreme samples which would provide a strongly biased impression of the population (e.g. a sample containing only persons from better-offareas). Furthermore, stratification increases the precision of estimation functions. We will have a closer look at stratified sampling in chapter 7. <?page no="20"?> 1.4 Outline of the book 21 1.3.3 Cluster sampling In practice, limiting the costs of a survey is considered most important. If e.g. interviewers can conduct several interviews in narrow areas, the costs will be less compared to interviewing the same number of persons scattered in a wide area since cost of carriage and the required amount of time can be reduced. This cost reducing principle is the basic intention of cluster sampling. Several individual elements of the population are pooled to a cluster, e.g. all persons living in the same apartment house. Now, some of the clusters are sampled randomly and all of the elements within the drawn clusters will be interviewed. Typically, population elements pooled in a cluster will often be alike in several respects. As sampling one cluster implies sampling several similar elements, the results obtained from different samples (with different clusters being sampled) will vary more compared e.g. to a sample drawn with simple random sampling. Hence, there is usually a trade-offin cluster sampling, reduced costs on the one hand and higher variance in sampling results on the other hand. We will have a closer look at cluster sampling in chapter 8. 1.4 Outline of the book In chapter 2, we provide a very brief introduction to the statistical software R. R is open source and available at no cost. It contains an immense variety of implemented statistical methods. All methods discussed in this text will be exemplified using small artificial numerical examples but also using a real data set to see the methods working in practice with R. In chapter 3, we introduce the notation for survey sampling used throughout the text and introduce the inclusion indicator as well as inclusion probabilities. In chapter 4, we discuss estimation functions for population parameters of interest, mainly for the population total and for the population mean. Chapter 5 discusses the simplifications that result when simple random sampling is applied. Chapter 6 is about the calculation of confidence intervals. While it is impossible to make probabilistic statements based on a realized sample, we nevertheless try to characterize different procedures of calculating intervals. <?page no="21"?> 22 1 Introduction Stratified sampling is discussed in chapter 7 and cluster sampling in chapter 8. Sometimes information on an auxiliary variable, say X , is available for all elements of the population. If the variable of interest Y correlates with auxiliary variable X , the estimation of population parameters of Y may benefit from using information on X . Ways to use auxiliary information, e.g. in the form of ratio estimators, are discussed in section 9. Regression analysis is the workhorse in empirical research. 
The treatment of regression analysis in the design approach differs strongly from the treatment in the model approach and most often, both concepts are regretfully confused. In chapter 10, we discuss the regression in the design approach and point out similarities and differences towards (stochastic) regression in the usual model approach. <?page no="22"?> 1.5 Exercises 23 1.5 Exercises Assume that a sample survey shall be carried out to find out how satisfied students are with their faculty. 1. How would you define the population? 2. Would you consider a census of all students or rather a sample survey? (Why? ) 3. How would you operationalize “being satisfied with their faculty”? 4. What is a sampling frame and how could one be obtained in the example? 5. How could a random sample be obtained? 6. Would you consider stratified sampling? How would you define sensible strata? Would you expect the results of potential samples to vary more or less compared to simple random sampling? 7. Would you consider clustered sampling? How would you define sensible clusters? Would you expect clustered sampling to be more or less efficient compared to simple random sampling? 8. How do you consider the idea of obtaining a sample from alumni? <?page no="24"?> 2 Introduction to R In this chapter, we provide a very concise introduction to R. Furthermore, we briefly introduce important commands to generate random samples from a population. 2.1 Some R basics 26 2.1.1 Object orientation 26 2.1.2 Dataframes 26 2.1.3 Sequences, replications, conditions and loops 27 2.1.4 Matrices 30 2.1.5 Storing and reading data files 31 2.1.6 Probability distributions 32 2.1.7 Graphics 33 2.1.8 Linear regression 34 2.2 Sampling from a population 36 2.2.1 Enumeration of samples 36 2.2.2 The sample() function 37 2.3 Exercises 38 <?page no="25"?> 26 2 Introduction to R 2.1 Some R basics 2.1.1 Object orientation R is an object oriented programming environment for data manipulation, calculation, and graphical display. Objects (e.g. vectors, matrices, tables, and so on) can be given some content using the assignment symbol <- . Functions can be called with arguments given in brackets. E.g., c () combines some elements which we name for instance a. a <c (1,2) We can print the object simply by calling its name, apply functions to it, and do some arithmetics: a ## [1] 1 2 mean (a) ## [1] 1.5 a+1 ## [1] 2 3 The inbuilt help-function can be called using a question mark. ? mean 2.1.2 Dataframes Data are stored in a dataframe, which contains the values and some attributes. As a simple example, we create a small dataframe d using the command data . frame : d <data.frame (x= c ("A","B","C"), y= c (22,35,41), z= c (0,1,0)) d ## x y z ## 1 A 22 0 ## 2 B 35 1 ## 3 C 41 0 Variables of the dataframe d can be selected either by using $ or by using indices: <?page no="26"?> 2.1 Some R basics 27 d$x ## [1] A B C ## Levels: A B C d[1] ## x ## 1 A ## 2 B ## 3 C Individual elements (or parts of the dataframe) can be selected using square brackets [ ]: d[1, 2] ## [1] 22 d[2: 3,1: 2] ## x y ## 2 B 35 ## 3 C 41 d[,2] ## [1] 22 35 41 d[2,] ## x y z ## 2 B 35 1 2.1.3 Sequences, replications, conditions and loops Sequences can be generated using the command seq (from,to, by ). 
seq (0.8,0.9,0.02) ## [1] 0.80 0.82 0.84 0.86 0.88 0.90 A short form for integers and step size 1 is 1: 5 ## [1] 1 2 3 4 5 Replications can be generated by rep (1: 2,3) ## [1] 1 2 1 2 1 2 rep (1: 2,each=3) ## [1] 1 1 1 2 2 2 <?page no="27"?> 28 2 Introduction to R rep (1: 2,2: 1) ## [1] 1 1 2 For testing a condition before carrying out a command, if or if ... else can be used i <- 1 if(i==1) print ("i=1") ## [1] "i=1" if(i==2) print ("i=2") else print ("i! =2") ## [1] "i! =2" For conditional element selection the vectorized version ifelse can be used i <- -1: 1 ifelse (i<0,NA,i^0.5) ## [1] NA 0 1 Loops are often used for control structures. An example is the following for -loop which uses the paste -command for combining objects: for (x in 3: 5){ print ( paste ("The value of x is",x)) } ## [1] "The value of x is 3" ## [1] "The value of x is 4" ## [1] "The value of x is 5" In many cases we can avoid loops by using the functions apply , lapply or tapply . Using apply (X, MARGIN, FUN) we apply the function specified with the argument FUN towards rows (MAR- GIN=1) or columns (MARGIN=2) of X . To illustrate the working of the command apply we construct a matrix and calculate the sums of the elements in each column: x <matrix (1: 9,ncol=3,byrow=F) x ## [,1] [,2] [,3] ## [1,] 1 4 7 ## [2,] 2 5 8 ## [3,] 3 6 9 apply (X=x,MARGIN=2,FUN=sum) ## [1] 6 15 24 <?page no="28"?> 2.1 Some R basics 29 An abbreviation making use of the ordering of the arguments within the command is apply (x, 2, sum) ## [1] 6 15 24 Note that we can also define a function, which we apply to columns or rows of X . E.g., we may want to select the a-th smallest element of a vector: mysort <function(y,a){ temp <sort (y) temp[a] } # or equivalently mysort <function(y,a) sort (y)[a] # applying the function to exemplary vector mysort ( c (3,7,9,6),2) ## [1] 6 Choosing a = 2, we apply the function towards the columns of X (here: X = y ). The elements of the matrix are realisations of a normal random variate generated using the command rnorm : y <matrix ( rnorm (9),ncol=3,byrow=F) y ## [,1] [,2] [,3] ## [1,] 0.99115 0.3703 -2.0021 ## [2,] -1.38132 1.2762 1.3222 ## [3,] 0.02235 -3.2169 0.3284 apply (y, 2, mysort, 2) ## [1] 0.02235 0.37027 0.32841 The function tapply (X, INDEX, FUN) applies a function specified with FUN towards the elements in groups of X , which are specified using INDEX. x <- 1: 20 i <rep (1: 5,4) tapply (x,i,mean) ## 1 2 3 4 5 ## 8.5 9.5 10.5 11.5 12.5 The function lapply (X, FUN) applies a function specified with FUN towards the elements of list X . Lists can be used to store objects of different type or dimension in a single object. <?page no="29"?> 30 2 Introduction to R x <list (1: 3, c (T,F)) lapply (x,mean) ## [[1]] ## [1] 2 ## ## [[2]] ## [1] 0.5 2.1.4 Matrices Contrary to numeric vectors, matrices have a dimension, which is a number of rows and columns: y <- 1: 3 dim (y) ## NULL x <as.matrix (y) x ## [,1] ## [1,] 1 ## [2,] 2 ## [3,] 3 dim (x) ## [1] 3 1 x <matrix (1: 3) dim (x) ## [1] 3 1 Multiplication applied to vectors is element wise a <- 1: 2 b <- 3: 4 a*b ## [1] 3 8 For matrix multiplication we have to use the symbol %*%: a <matrix (1: 2, ncol=1) b <matrix (3: 4, ncol=2) a%*%b ## [,1] [,2] ## [1,] 3 4 ## [2,] 6 8 <?page no="30"?> 2.1 Some R basics 31 The transpose of a matrix we obtain using t () . 
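As a small added illustration (not part of the original example set), transposing a matrix with t() works as follows:

A <- matrix(1:6, nrow=2)   # a 2 x 3 matrix, filled column-wise
A
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
t(A)                       # the 3 x 2 transpose
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6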
The inverse of a matrix solves the equation A −1 · A = I : A <matrix (1: 4,nrow=2) A ## [,1] [,2] ## [1,] 1 3 ## [2,] 2 4 solve (A) ## [,1] [,2] ## [1,] -2 1.5 ## [2,] 1 -0.5 To generate an identity matrix of dimension n × n we use diag (n): diag (2) ## [,1] [,2] ## [1,] 1 0 ## [2,] 0 1 Using diag () we can also extract diagonal elements of a matrix: diag ( matrix (1: 4,nrow=2)) ## [1] 1 4 and generate a diagonal matrix with specified elements on its diagonal: diag (1: 2) ## [,1] [,2] ## [1,] 1 0 ## [2,] 0 2 2.1.5 Storing and reading data files We save the dataframe d as file test.csv using the function write . table and read it into the workspace using read . table : d <data.frame (x= c ("A","B","C"), y= c (22,35,41), z= c (0,1,0)) write.table (d,"test.csv") d <read.table ("test.csv") d <?page no="31"?> 32 2 Introduction to R ## x y z ## 1 A 22 0 ## 2 B 35 1 ## 3 C 41 0 The default folder the file is saved in can be obtained using getwd () and changed using setwd () . One can also use the specification that is compatible with german defaults (semicolon as delimiter and comma for decimal points) in Excel: write.csv2 (d,"test.csv",row.names=F) read.csv2 ("test.csv") ## x y z ## 1 A 22 0 ## 2 B 35 1 ## 3 C 41 0 2.1.6 Probability distributions Many probability distributions are implemented in R, either in the base library or in special packages. Note that probability distributions describe specific random experiments or processes. Empirical data resulting from observed economic situations will never follow exactly theoretical distributions. Nevertheless, sometimes distributions (e.g. histograms or density estimates) of empirical data look rather similar to theoretical distributions. Using probability distributions with R necessitates the specification of the distribution (e.g. normal) and the function (e.g. quantile function): 1. Distributions - norm normal distribution - unif uniform distribution - binom binomial distribution - pois poisson distribution - . . . 2. Functions - d density function - p probability function - q quantile function - r random number <?page no="32"?> 2.1 Some R basics 33 For the standard normal distribution, we obtain e.g.: dnorm (0) ## [1] 0.3989 qnorm (0.975) ## [1] 1.96 pnorm (1.96) ## [1] 0.975 rnorm (3) ## [1] -0.7434 0.5960 -0.2284 Note that for reproducibility of random numbers, one has to initialize the random generator using set .seed(): rnorm (3) ## [1] -1.1468 0.3116 2.4554 rnorm (3) ## [1] -1.1254 -2.1804 0.9017 set.seed (1) rnorm (3) ## [1] -0.6265 0.1836 -0.8356 set.seed (1) rnorm (3) ## [1] -0.6265 0.1836 -0.8356 2.1.7 Graphics R allows very flexible configuration of graphics. Due to the recognition of object classes and default settings, standard graphics can be produced very easily. We give two examples, a histogram and a scatter plot with the regression line according to the method of least squares added (see figure 2.1). n <- 1000 # generate realisations of chi-squared rand. var. x <rchisq (n,5) # figure a histogram hist (x, main='') # generate realis. of norm. rand. var. u <rnorm (n,sd=3) y <- 10+0.5*x+u <?page no="33"?> 34 2 Introduction to R x 0 5 10 15 20 25 Frequency 0 50 100 150 200 250 300 (a) Basic histogram 0 5 10 15 20 x y 5 10 15 20 (b) Scatter plot with regression line Figure 2.1: Example for standard graphics with R. 
# figure b scatterplot plot (x,y) abline ( lm (y~x),lwd=2) # add regression line One can export the plots as pdf-files for inclusion in documents using the following command: dev.copy (device=pdf, file="graphic.pdf", width=7, height=4, pointsize=10, onefile=FALSE, paper="special") dev.off () 2.1.8 Linear regression We already made use of the lm () -function for estimating linear models in the last graphic. The function lm () returns a lm-object with detailed information on the estimated linear model. reg <lm (y~x) names (reg) ## [1] "coefficients" "residuals" "effects" ## [4] "rank" "fitted.values" "assign" ## [7] "qr" "df.residual" "xlevels" ## [10] "call" "terms" "model" E.g., the regression coefficients we obtain by referring to the sub-object or by calling the coef -function: reg$coef ## (Intercept) x ## 10.1208 0.4614 <?page no="34"?> 2.1 Some R basics 35 coef (reg) ## (Intercept) x ## 10.1208 0.4614 The summary -command provides detailed information on the regression: sreg <summary (reg) sreg ## ## Call: ## lm(formula = y ~ x) ## ## Residuals: ## Min 1Q Median 3Q Max ## -10.281 -2.011 -0.007 2.257 9.570 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.1208 0.1834 55.2 <2e-16 *** ## x 0.4614 0.0313 14.7 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.16 on 998 degrees of freedom ## Multiple R-squared: 0.178,Adjusted R-squared: 0.178 ## F-statistic: 217 on 1 and 998 DF, p-value: <2e-16 The summary-object again contains much more information on the regression: names (sreg) ## [1] "call" "terms" "residuals" ## [4] "coefficients" "aliased" "sigma" ## [7] "df" "r.squared" "adj.r.squared" ## [10] "fstatistic" "cov.unscaled" E.g., the coefficient of determination: sreg$r.squared ## [1] 0.1784 The estimated variance-covariance matrix of the regression parameters [ ˆ σ 2 · ( X ′ X ) −1 ] can be obtained by: vcov (reg) ## (Intercept) x ## (Intercept) 0.03363 -0.0048203 ## x -0.00482 0.0009821 <?page no="35"?> 36 2 Introduction to R 2.2 Sampling from a population We introduce some specific commands which are useful in the context of survey sampling. 2.2.1 Enumeration of samples Consider a population U of size N = 4 with variable of interest Y . A sample of size n = 2 shall be drawn. We use some artificial data. N <- 4 n <- 2 Y <c (10,13,25,52) U <- 1: 4 The number of all possible samples |S| = M is obtained using the binomial coefficient M <choose (N,n); M ## [1] 6 The enumeration of the M samples with indexes of population elements are obtained using the function combn(). si <combn (N,n); si ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 1 1 1 2 2 3 ## [2,] 2 3 4 3 4 4 The M = 6 columns contain the indexes of the population elements which are drawn into the samples. E.g., sample 4 (i.e. column 4) contains the two elements 2 and 3. If we are interested in the values of Y that are contained in the samples, we can apply the indexes towards the population values of Y . 
myf <function(z) Y[z] ys <apply (si,2,myf); ys ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 10 10 10 13 13 25 ## [2,] 13 25 52 25 52 52 # or equivalently ys <apply (si,2,function(z) Y[z]); ys ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 10 10 10 13 13 25 ## [2,] 13 25 52 25 52 52 <?page no="36"?> 2.2 Sampling from a population 37 If we are interested in the population mean, we can obtain the estimators based on each of the M samples as colMeans (ys) ## [1] 11.5 17.5 31.0 19.0 32.5 38.5 2.2.2 The sample() function We can generate a random sample of size n = 2 using the function sample (). We can either draw the indexes of population elements and obtain the values of Y for these elements (alternative 1) or we can directly sample values of Y (alternative 2). # alternative 1 set.seed (1) i <sample (U,n); i ## [1] 2 4 y <- Y[i]; y ## [1] 13 52 # alternative 2 set.seed (1) y <sample (Y,n); y ## [1] 13 52 <?page no="37"?> 38 2 Introduction to R 2.3 Exercises 1. Define a vector x containing the values { 1 , 5 , 12 , 3 } . a) Sort x using sort () and order () commands. b) Calculate ∑ i x i . c) Let object n contain the length of x. d) Calculate the variance of x and compare your result with the result obtained from var (x). e) Calculate the median of x and compare your result with the results obtained from the following commands: quantile (x,type=1,0.5) quantile (x,type=7,0.5) median (x). 2. Probability distributions a) Consider the exponential distribution with rate λ = 0 . 2. i. Calculate the value of the probability distribution for x = 2 . 5. ii. Calculate the value of the density distribution for x = 0 . 8. iii. Calculate the median. iv. Generate a realisation with n = 10 from the exponential distribution (rate λ = 0 . 2). b) Consider the normally distributed random variate X ∼ N ( µ = 2 , σ 2 = 5) and calculate P( X < 3) , P( X > 0 . 5) , and P( − 2 < X < 3) . c) Consider the Poisson distributed random variate X ∼ Pois( λ = 4) and calculate P( X = 3) , P( X ≤ 2) , and P(1 < X ≤ 5) . d) Generate a realisation with n = 100 from the normal distribution ( X ∼ N ( µ = 2 , σ 2 = 5)) and display the distribution using hist (). <?page no="38"?> 2.3 Exercises 39 3. Draw a sample of size n = 20 from X and calculate the mean. Repeat this for B = 1000 samples, store the calculated means and display the distribution. 4. Define the following matrix using the command x <matrix (1: 16,nrow=4,ncol=4,byrow=F) x =     1 5 9 13 2 6 10 14 3 7 11 15 4 8 12 16     . a) Select the vector containing the second column of x . b) Select the vector containing the third row of x . c) Select the 2 × 2 − matrix containing the elements x 2,3 , x 2,4 , x 3,3 , x 34 . 5. Consider the following simple numerical example y =       3 1 8 3 5       , X =       1 3 5 1 1 4 1 5 6 1 2 4 1 4 6       . and obtain the parameter vector ˆ β of the linear model y i = β 0 + β 1 x 1i + β 2 x 2i + u i . <?page no="40"?> 3 Inclusion probabilities In this chapter, we introduce the inclusion indicator and inclusion probabilities. The distinguished feature of modern survey sampling theory is the distinction between fixed values of statistical variables and the inclusion indicator, which is random. Hence, the stochastic elements in sampling focus on the inclusion indicator. 
3.1 Introduction 42 3.2 Some notation 42 3.3 Inclusion indicator I 43 3.4 A small example 44 3.5 Inclusion probabilities π 45 3.6 Obtaining inclusion probabilities with R 46 3.7 Simple random sampling (SI) 47 3.8 Properties of the inclusion indicator 50 3.8.1 The expected value of the inclusion indicator 51 3.8.2 The variance of the inclusion indicator 51 3.8.3 The covariance of the inclusion indicator 51 3.8.4 Properties of the covariance 52 3.8.5 Covariance matrix and sums of sums 53 3.9 Exercises 54 <?page no="41"?> 42 3 Inclusion probabilities 3.1 Introduction In this chapter we introduce two important concepts, the inclusion indicator and the inclusion probability. 3.2 Some notation U (Universe) denotes the set of all units belonging to a well-defined population. u k with k = 1 , 2 , . . . , N refers to individual units of U . For short notation, we use k for referring towards u k . s denotes a realised sample which is a subset of U : s ⊂ U . Throughout the text, we assume that the number N of units in U is known. The statistical variable of interest is denoted by Y . For each unit in U , we assume that k is known (e.g. a register or a directory), but y k with k ∈ U , the specific value the statistical variable Y takes for element u k , is unknown. The distribution of the statistical variable Y in the population U can be characterized by parameters. The total is denoted by t y and defined as t y = ∑ U y k = N ∑ k=1 y k . (3.1) The mean is denoted by ¯ y U and defined as ¯ y U = 1 N ∑ U y k = 1 N N ∑ k=1 y k . (3.2) If we introduce a second statistical variable Z , the ratio of Y and Z will be denoted by r r = t y t z . (3.3) The variance of Y in U is denoted by s 2 y,U and defined as s 2 y,U = 1 N − 1 ∑ U ( y k − ¯ y U ) 2 . (3.4) Note that in the definition of the variance the denominator ( N − 1) is used. This matches the implementation of the variance function in R. If Y is dichotomous, it indicates whether a unit has a specific attribute or property ( Y = 1) or not ( Y = 0) y k = { 1 , unit k has attribute 0 , else. (3.5) <?page no="42"?> 3.3 Inclusion indicator I 43 In this case, the fraction of units having the attribute is simply the mean (¯ y U ). S denotes the set of all possible samples that can be drawn from the population U S = {s 1 , s 2 , . . . , s M } , (3.6) where M is the number of possible samples. The sampling design defines the probability distribution p ( · ) for S . Before drawing a sample, it is random which sample of S will be realised. S is random and denotes a sample to be drawn. S = s indicates that S , which is random, has taken a specific realisation s (sample). p ( s ) is the shorthand notation for the probability that S equals s , that is that the specific sample s will be drawn P( S = s ) = p ( s ) for all s ∈ S. (3.7) It holds that p ( s ) ≥ 0 for all s ∈ S . That is, the probability for each possible sample is non-negative. If we draw a sample, it is certain that one of the M possible samples will be drawn, hence ∑ s ∈ S p ( s ) = 1 . (3.8) 3.3 Inclusion indicator I The indicator variable I is a dichotomous random variate I k = { 1 , k ∈ S 0 , k / ∈ S. (3.9) Whether unit k will be in the sample depends on which sample will be drawn. Hence, the random indicator variable I k is a function of the random variate S I k = I k ( S ) . (3.10) Let n denote the identical sample size of all M possible samples. Note that each sample s contains n elements of U and N − n elements of U are not included in the sample. 
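As a small added sketch (with an assumed toy population of size N = 4 and an assumed realised sample s = {2, 4}), the vector of inclusion indicators for a realised sample can be computed in R:

N <- 4
s <- c(2, 4)                  # an assumed realised sample of size n = 2
I <- as.numeric(1:N %in% s)   # inclusion indicator I_k for k = 1,...,N
I
## [1] 0 1 0 1
sum(I)                        # each sample contains exactly n elements
## [1] 2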
Whether element k is element of the sample depends on the sample and is indicated by the inclusion indicator I k . Therefore, whether I k takes value 1 or 0 depends on S . <?page no="43"?> 44 3 Inclusion probabilities 3.4 A small example We assume a population, which consists of four units: U = {u 1 , u 2 , u 3 , u 4 }. Y k is e.g. the income of unit k . Now we draw a random sample of size n = 2. The average income in the sample ¯ y S will be taken as an estimator of the average income in the population ¯ y U . Because it is random which sample will be drawn, the average income in the sample is random. We want to calculate how many different samples are possible. We will choose n elements from the N elements in the population. The ordering in the sample is not relevant, because we consider the sum of the incomes of all the units in the sample. We will not replace a drawn unit, which means that a unit can at most be included once in the sample. Hence, we are interested in the number of combinations without replacement M = ( N n ) = ( 4 2 ) = 4! 2! 2! = 6 . (3.11) We now have a closer look at the M possible samples. Using numbers from 1 to 4 to indicate the population units, the following samples are possible: s 1 = { 1 , 2 }, s 2 = { 1 , 3 }, s 3 = { 1 , 4 }, s 4 = { 2 , 3 }, s 5 = { 2 , 4 }, s 6 = { 3 , 4 }. We can also characterize each sample using the inclusion indicators of the 4 units of the population s : {I 1 , I 2 , I 3 , I 4 }. The samples are then expressed as s 1 : {I 1 = 1 , I 2 = 1 , I 3 = 0 , I 4 = 0 }, s 2 : {I 1 = 1 , I 2 = 0 , I 3 = 1 , I 4 = 0 }, s 3 : {I 1 = 1 , I 2 = 0 , I 3 = 0 , I 4 = 1 }, s 4 : {I 1 = 0 , I 2 = 1 , I 3 = 1 , I 4 = 0 }, s 5 : {I 1 = 0 , I 2 = 1 , I 3 = 0 , I 4 = 1 }, s 6 : {I 1 = 0 , I 2 = 0 , I 3 = 1 , I 4 = 1 }. If applying simple random sampling, each of the M = 6 possible samples has the identical probability P( S = s ) = p ( s ) = 1 / 6 . <?page no="44"?> 3.5 Inclusion probabilities π 45 3.5 Inclusion probabilities π The inclusion probability π k is the probability that a random sample S will be drawn, which contains element k π k = P( k ∈ S ) = P( I k = 1) = ∑ s 3 k p ( s ) . (3.12) Here, s 3 k means that we consider all samples s , which include element k . Note that there are many different possible samples, of which some will include k and some will not. If k is included in sample s , this is indicated by I k = 1. The inclusion probability π k can therefore be calculated through summing up the probabilities p ( s ) of the samples that contain element k . In our small example with n = 2 and N = 4, we find for k = 1 π 1 = P(1 ∈ S ) = P( I 1 = 1) = ∑ s 3 1 p ( s ) = p ( s 1 ) + p ( s 2 ) + p ( s 3 ) . (3.13) Summing up all individual inclusion probabilities π k for all k ∈ U , we find ∑ k ∈ U π k = ∑ k ∈ U ∑ s 3 k p ( s ) = n ∑ s ∈ S p ( s ) = n · 1 = n. (3.14) The following table shows the M = |S n | samples in the columns and N rows for elements of the population. As each sample has n elements, in each of the columns we find n indicators where I k = 1. Hence, each of the M = |S n | possible samples is included n times in the double sum. s 1 s 2 · · · s M I 1 ( s 1 ) I 1 ( s 2 ) · · · I 1 ( s M ) I 2 ( s 1 ) I 2 ( s 2 ) · · · I 2 ( s M ) . . . . . . . . . . . . I N ( s 1 ) I N ( s 2 ) · · · I N ( s M ) The probability that elements k and l will be jointly sampled equals the probability that a sample s will be drawn, which contains both elements π kl = P( k & l ∈ S ) = P( I k I l = 1) = ∑ s 3 k&l p ( s ) . 
(3.15) <?page no="45"?> 46 3 Inclusion probabilities It holds that π kl = π lk for all k, l . Note: If k = l then π kl = π k,k = P( k & k ∈ S ) = P( I k I k = 1) = P( I 2 k = 1) = P( I k = 1) = π k . (3.16) In the small example, we find for elements k = 1 and l = 2 π 1,2 = P(1 & 2 ∈ S ) = P( I 1 I 2 = 1) = ∑ s 3 1&2 p ( s ) = p ( s 1 ) . (3.17) The expected value of the inclusion indicator I k is π k as there are only two possible values 1 and 0 with probabilities π k and 1 − π k , respectively: E( I k ) = 1 · P( I k = 1) + 0 · P( I k = 0) = P( I k = 1) = π k . (3.18) Since the product of I k and I l is 1 only in the case that both elements k and l are included in the sample, we find E( I k I l ) = 1 · P( I k I l = 1) + 0 · P( I k I l = 0) = P( I k I l = 1) = π kl . (3.19) Note: If π k = 0 there is no chance to obtain any information about k through sampling, hence, most often it is warranted that π k > 0 for all k ∈ U by the sampling design. 3.6 Obtaining inclusion probabilities with R We assume a small population U : {u 1 , u 2 , u 3 , u 4 } and a sample size n = 2. The sampling design we consider as an example defines that there are only three possible samples S : {s 1 , s 2 , s 3 } s 1 = {u 1 , u 2 }, s 2 = {u 2 , u 4 }, s 3 = {u 3 , u 4 }. The probabilities for the three samples are p ( s 1 ) = 0 . 5, p ( s 2 ) = 0 . 3, p ( s 3 ) = 0 . 2. We first build a matrix containing the three possible samples with element indexes in columns. m <cbind ( c (1,2), c (2,4), c (3,4)); m ## [,1] [,2] [,3] ## [1,] 1 2 3 ## [2,] 2 4 4 <?page no="46"?> 3.7 Simple random sampling (SI) 47 Next, we obtain the matrix with inclusion indicators for all N = 4 elements in columns. Columnwise (samples) we determine which of the four elements of the population are contained in the sample using the function is .element which returns logical values. The logical values we transform to numerical values. i <apply (m,2,function(z) as.numeric ( is.element (1: 4,z))) i ## [,1] [,2] [,3] ## [1,] 1 0 0 ## [2,] 1 1 0 ## [3,] 0 0 1 ## [4,] 0 1 1 Finally, we multiply each column by the probability to obtain the respective sample s , which is p ( s ), and sum over all N elements of U . p <c (0.5,0.3,0.2) pi_k <colSums ( t (i)*p); pi_k ## [1] 0.5 0.8 0.2 0.5 sum (pi_k) ## [1] 2 We find that the calculated inclusion probabilities sum to n = 2. 3.7 Simple random sampling (SI) Simple random sampling (SI) is characterized by the property that each sample s with sample size n and s ∈ S has the same probability to be drawn. This results in identical inclusion probabilities π k for all k ∈ U π k = n N . (3.20) Note that we are not interested in the ordering of the appearance of the elements in the sample. Hence, we are interested in combinations (not variations). Different n -tuples belonging to the identical equivalence class, e.g. (1 , 5 , 9 , 9), (5 , 1 , 9 , 9), (9 , 1 , 5 , 9) , . . . are regarded as one single sample. The number of combinations without replacement is given by the binomial ( N n ) . (3.21) <?page no="47"?> 48 3 Inclusion probabilities Because all samples have the identical probability, we find p ( s ) = ( N n ) −1 . (3.22) The number of possible samples is rapidly increasing in the sample size ( 100 2 ) = 4950 , ( 100 5 ) = 75 , 287 , 520 and ( 100 10 ) = 17 , 310 , 309 , 456 , 440 , as well as in the population size ( 15 10 ) = 3003 , ( 20 10 ) = 184 , 756 and ( 50 10 ) = 10 , 272 , 278 , 170 . We illustrate this fact graphically using R (see figure 3.1). 
# figure a calculations N <- 100 nmax <- 10 nv <- 1: nmax M <rep (NA,nmax) for (i in 1: nmax) M[i] <choose (N,nv[i]) # figure a plot plot (nv,M,ylab="",xlab="n",pch=19,cex=0.6) # figure b calculations n <- 10 Nmax <- 100 Nv <- 10: Nmax lNv <length (Nv) M <rep (NA,lNv) for (i in 1: lNv) M[i] <choose (Nv[i],n) # figure b plot plot (Nv,M,ylab="",xlab="N",pch=19,cex=0.6) The definition of the inclusion probability is π k = P( k ∈ S ) = P( I k = 1) = ∑ s 3 k p ( s ) . (3.23) <?page no="48"?> 3.7 Simple random sampling (SI) 49 2 4 6 8 10 n M ·10 13 0.0 0.5 1.0 1.5 (a) N = 100, increasing n 20 40 60 80 100 N M ·10 13 0.0 0.5 1.0 1.5 (b) n = 10, increasing N Figure 3.1: Increasing number of combinations in n and in N . How can we obtain a general expression for π k given N and n ? The idea is to make use of the Laplace definition of probability: “The theory of chance consists in reducing all the events of the same kind to a certain number of cases equally possible, that is to say, to such as we may be equally undecided about in regard to their existence, and in determining the number of cases favorable to the event whose probability is sought. The ratio of this number to that of all the cases possible is the measure of this probability, which is thus simply a fraction whose numerator is the number of favorable cases and whose denominator is the number of all the cases possible.” 1 Therefore, we have to divide the number of samples that contain element k by the number of all possible samples. The number of samples with element k is ( N − 1 n − 1 ) . This is the number of different samples of size n − 1 possibly drawn from a population of size N − 1. If we add the additional element k to each of these samples, they have size n and each sample contains element k . We also have to add the additional element k to the population ( N − 1), which then has N elements. Hence, this is the number of samples of size n from population of size N that contain k . Applying the Laplace definition of probability 1 Pierre Simon Laplace (1902), A philosophical essay on probabilities, Wiley, New York, 6-7. <?page no="49"?> 50 3 Inclusion probabilities “. . .the number of events in the disjunction divided by the total number of elementary events” results in π k = ( N−1 n−1 ) ( N n ) = (N−1)! (n−1)! (N−1−n+1)! N! n! (N−n)! = (N−1)! (n−1)! N! n! = n! (n−1)! N! (N−1)! = n N . (3.24) The inclusion probability π kl is obtained using an analogue reasoning. We obtain the number of possible samples of size n − 2 drawn from a population of size N − 2. The number of combinations is ( N − 2 n − 2 ) . To each of these samples we add both elements k and l , therefore, all samples now contain the elements k and l . Hence, this is the number of samples including k and l of size n drawn from a population of size N . Using again Laplace, we obtain π kl = ( N−2 n−2 ) ( N n ) = (N−2)! (n−2)! (N−2−n+2)! N! n! (N−n)! = (N−2)! (n−2)! N! n! = n! (n−2)! N! (N−2)! = n ( n − 1) N ( N − 1) . (3.25) As a simple numerical example, we obtain for simple random sampling of size n = 2 from a population of size N = 4 π k = n N = 2 4 = 0 . 5 and π kl = n ( n − 1) N ( N − 1) = 2(2 − 1) 4(4 − 1) = 1 6 . 3.8 Properties of the inclusion indicator The inclusion indicator I k = I k ( S ) is a function of the random variate S , as its value depends on which of the M possible samples will be drawn. 
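Before deriving these properties formally, the following added sketch checks the SI results numerically for the small example with N = 4 and n = 2 by enumerating all M samples with combn(); since every sample has probability 1/M under SI, expected values reduce to simple averages over the samples:

N <- 4; n <- 2
S <- combn(N, n)           # all M = choose(N, n) = 6 samples in columns
# indicator matrix: rows = elements k, columns = samples s
Ind <- apply(S, 2, function(s) as.numeric(1:N %in% s))
Ind
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    0    0    0
## [2,]    1    0    0    1    1    0
## [3,]    0    1    0    1    0    1
## [4,]    0    0    1    0    1    1
rowMeans(Ind)              # E(I_k) = pi_k = n/N = 0.5
## [1] 0.5 0.5 0.5 0.5
mean(Ind[1,] * Ind[2,])    # pi_kl = n(n-1)/(N(N-1)) = 1/6
## [1] 0.1667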
<?page no="50"?> 3.8 Properties of the inclusion indicator 51 3.8.1 The expected value of the inclusion indicator The expected value of the inclusion indicator is the inclusion probability E( I k ) = π k . (3.26) because I k = I k ( S ) is a Bernoulli random variate that can either take value 0 or 1. The respective probabilities are 1 − π k and π k , hence E( I k ) = 0 · (1 − π k ) + 1 · π k = π k = P( I k = 1) . (3.27) 3.8.2 The variance of the inclusion indicator The variance of a random variate Z is defined as V( Z ) = E [( Z ) − E( Z )] 2 = E( Z 2 ) − E( Z ) 2 . (3.28) Applying the definition of the variance on the inclusion indicator, we find V( I k ) = E[( I k − E( I k )] 2 = E( I 2 k ) − E( I k ) 2 . (3.29) Given I k = I 2 k , we obtain V( I k ) = E( I k ) − E( I k ) 2 = π k − π 2 k = π k (1 − π k ) . (3.30) 3.8.3 The covariance of the inclusion indicator The covariance of two random variates Z and X is defined as Cov( Z, X ) = E[ Z − E( Z )] E[ X − E( X )] = E( ZX ) − E( Z ) E( X ) . (3.31) Applying the definition of the covariance on the inclusion indicators, we find Cov( I k , I l ) = E[ I k − E( I k )] E[ I l − E( I l )] = E( I k I l ) − E( I k ) E( I l ) . (3.32) Hence, the covariance of inclusion indicators I k and I l is Cov( I k , I l ) = π kl − π k π l . (3.33) Note that the product of two inclusion indicators I k I l will take the value 1 only in the case that both elements k and l are included in the sample S : I k I l = { 1, if I k = 1 & I l = 1 0, else. (3.34) <?page no="51"?> 52 3 Inclusion probabilities Therefore, E( I k I l ) = 0 · (1 − π kl ) + 1 · π kl = P( I k I l = 1) = π kl (3.35) follows from the definition of the inclusion probability π kl = P( k & l ∈ S ) = P( I k I l = 1) = ∑ s 3 k&l p ( s ) . (3.36) 3.8.4 Properties of the covariance We use the following abbreviation Cov( I k , I l ) = π kl − π k π l = ∆ kl . (3.37) If k = l then π kk = π k and Cov( I k , I l ) = V( I k ) = π k (1 − π k ) = ∆ kk . (3.38) In the case of simple random sampling (SI), we find for the inclusion probabilities π k = π l = n N and π kl = n ( n − 1) N ( N − 1) (3.39) and for the covariance Cov( I k , I l ) = ∆ kl = π kl − π k π l = n ( n − 1) N ( N − 1) − n N n N . (3.40) Using the sampling fraction f = n/ N , the covariances can be shown to be strictly negative in the SI case Cov( I k , I l ) = n ( n − 1) N ( N − 1) − n N n N = f ( n − 1) ( N − 1) − ff = −f [ (1 − n ) ( N − 1) + f ] = −f [ (1 − n ) ( N − 1) + ( N − 1) f ( N − 1) ] = −f (1 − n + ( N − 1) f ) ( N − 1) = −f (1 − n + Nf − f ) ( N − 1) = −f (1 − n + N n N − f ) ( N − 1) = −f (1 − f ) N − 1 < 0 , (3.41) because f = n/ N > 0, 1 − f = 1 − n/ N > 0 and N − 1 > 0. <?page no="52"?> 3.8 Properties of the inclusion indicator 53 3.8.5 Covariance matrix and sums of sums Covariances necessitate the use of double sums (sums of sums). Let A be a set of elements. a kl denotes the value for elements k and l . In the following we use ∑∑ A a kl as an abbreviation for ∑ k ∈ A ∑ l ∈ A a kl = ∑ k ∈ A a kk + ∑ k ∈ A ∑ l ∈ A k6=l a kl = ∑ k ∈ A a kk + ∑∑ A k6=l a kl . We consider a most simple example in which set A contains only two elements: A = { 1 , 2 } ∑∑ A a kl = ∑ k ∈ A ∑ l ∈ A a kl = a 11 + a 12 + a 21 + a 22 = ∑ k ∈ A a kk + ∑ k ∈ A ∑ l ∈ A k6=l a kl = a 11 + a 22 + a 12 + a 21 . <?page no="53"?> 54 3 Inclusion probabilities 3.9 Exercises 1. A sample of size n = 3 shall be drawn from a population of size N = 6 by simple random sampling (without replacement). The number of combinations without replacement is given by ( N n ) . 
Derive this expression. 2. The inclusion indicator I is a function of the random variate S . Please explain. 3. What is the meaning of π k and how can it be obtained? 4. What is the meaning of π kl and how can it be obtained? 5. Consider a population with four elements ( N = 4) U : = {u 1 , u 2 , u 3 , u 4 } A sample of size n = 2 is to be drawn from the population. a) Consider for the moment the case of simple random sampling (SI). i. Obtain M = |S| . ii. Obtain π k . iii. Obtain the sum of all inclusion probabilities of elements k ∈ U . b) Consider the following sampling design: S n = {s 1 , s 2 , s 3 } where s 1 = {u 1 , u 3 }, s 2 = {u 1 , u 4 }, s 3 = {u 2 , u 4 }. Assume the following probabilities for the samples: p ( s 1 ) = 0 . 1 , p ( s 2 ) = 0 . 6 , p ( s 3 ) = 0 . 3 . i. Obtain all inclusion probabilities π k . ii. Obtain numerically the sum of inclusion probabilities. iii. Obtain all inclusion probabilities π kl . <?page no="54"?> 3.9 Exercises 55 6. Solve exercise 5 using R. 7. Show that the covariance of the inclusion indicators I k and I l is negative in the case of simple random sampling (SI). 8. Obtain the covariance matrix of inclusion indicators for the example of exercise 5b using R. <?page no="56"?> 4 Estimation The distribution of a statistical variable in the population can be characterized by a set of parameters. Having obtained a sample from the population one can estimate the population parameters based on the elements in the sample. In this chapter we discuss some estimation functions for population parameters. 4.1 Introduction 58 4.2 Estimating functions and estimators 58 4.3 Properties of estimation functions 58 4.4 The π-estimator 59 4.4.1 Properties of the π-estimator 59 4.4.2 Expected value and variance of the π-estimator 59 4.4.3 An alternative expression of the variance 62 4.4.4 The Yates-Grundy variance of the total 62 4.5 Estimation using R 64 4.5.1 A small numerical example 64 4.5.2 An empirical example: PSID 68 4.6 Generating samples with unequal inclusion probabilities 70 4.6.1 Probabilities proportional to size (PPS) 70 4.6.2 The Sampford-algorithm 72 4.7 Exercises 73 <?page no="57"?> 58 4 Estimation 4.1 Introduction In this chapter, we introduce the π -estimator. We discuss estimation functions for the total and its variance and demonstrate the working of estimating functions using empirical wage data. 4.2 Estimating functions and estimators We use the following notation: • θ population parameter, • ˆ θ = ˆ θ ( S ) estimation function for θ based on a sample S to be drawn, • ˆ θ = ˆ θ ( s ) estimator (numerical value) for θ based on a realized sample s . Unfortunately, in the literature most often the estimator ˆ θ refers to the numerical estimator as well as to the estimation function. In sampling theory, we are interested in the properties of the estimation functions. The distribution of an estimation function can be obtained through the inspection of all M = |S| possible samples s . We focus on the properties of ˆ θ = ˆ θ ( S ) and less on the outcome of a specific realized sample s . 4.3 Properties of estimation functions We consider the following properties to characterize an estimation function: • unbiasedness: E( ˆ θ ) = ∑ s ∈ S ˆ θ ( s ) p ( s ) = θ, • variance: V( ˆ θ ) = ∑ s ∈ S ( ˆ θ ( s ) − E( ˆ θ ) ) 2 p ( s ) , • bias: B( ˆ θ ) = E( ˆ θ ) − θ, • mean square error: mse = E [ ˆ θ − θ ] 2 = ∑ s ∈ S ( ˆ θ ( s ) − θ ) 2 p ( s ) . 
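To make these definitions concrete, we can evaluate them by running through all M possible samples of a small design. The sketch below reuses the three-sample design of section 3.6 and attaches made-up y-values to the four population elements (the values are purely illustrative); as estimation function we take the plain sample mean, which turns out to be biased under this unequal-probability design.

# the three possible samples and their probabilities (design of section 3.6)
m <- cbind(c(1, 2), c(2, 4), c(3, 4))   # element indexes in columns
p <- c(0.5, 0.3, 0.2)                   # p(s)
# made-up y-values for the N = 4 population elements (illustration only)
y <- c(1, 2, 5, 12)
theta <- mean(y)                        # population parameter: 5
# estimation function: the sample mean, evaluated for every possible sample
theta.hat <- apply(m, 2, function(z) mean(y[z]))   # 1.5 7.0 8.5
E.hat <- sum(theta.hat * p)             # expected value: 4.55
B.hat <- E.hat - theta                  # bias: -0.45
V.hat <- sum((theta.hat - E.hat)^2 * p) # variance: 9.5725
mse <- sum((theta.hat - theta)^2 * p)   # mse: 9.775 = 9.5725 + 0.45^2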
<?page no="58"?> 4.4 The π-estimator 59 Note the following relation between mean square error, variance and bias mse = V( ˆ θ ) + [ B( ˆ θ ) ] 2 . (4.1) 4.4 The π-estimator We are interested in obtaining the total t = t y of the population U . Based on a sample s of size n drawn with inclusion probabilities π k , we can obtain the π -estimator as: ˆ t π = ∑ U I k y k π k = ∑ U I k ˇ y k = ∑ s y k π k = ∑ s ˇ y k , (4.2) where ˆ t π is the “ π -estimator” of the total t = t y . 4.4.1 Properties of the π-estimator The π -estimator is defined for each sampling design. The sampling design is characterized by the inclusion probabilities π k for k = 1 , . . . , N which have to be known for each k in sample s for estimation. ˇ y k = y k / π k is the “expanded value” of the value y k obtained for element k in the sample s . ˇ y k can be regarded as the value of y k expanded to represent 1 / π k elements of the population. The intuition is, that elements with large inclusion probabilities will show up “too often” and elements with small inclusion probabilities “too seldom” in the sample. The inverse probability weighting just “corrects” this imbalance giving higher weights to the elements with lower inclusion probabilities. Note that π k > 0 for all k ∈ U must hold, as otherwise a sample cannot provide any information on element k . The π -estimator is also called Horvitz-Thompson estimator or inverse probability estimator. 4.4.2 Expected value and variance of the π-estimator As the values of the statistical variable Y are treated as fix for all elements in the population, only the inclusion indicator is a random variate. Its value (0 or 1) depends on S . Since we consider sampling without replacement, there exist dependencies between the inclusion indicators which are captured by the covariance Cov( I k , I l ) = ∆ kl =: { π k (1 − π k ) , k = l π kl − π k π l , k 6 = l. (4.3) <?page no="59"?> 60 4 Estimation The expanded value of the covariance is: ˇ ∆ kl = ∆ kl π kl = { 1 − π k , k = l 1 − π k π l π kl , k 6 = l. (4.4) Since the inclusion probabilities are determined by the survey design and are fix as well as the values of the statistical variable Y for all k ∈ U , the expected value of ˆ t π can be obtained as: E ( ∑ U I k π k y k ) = ∑ U y k π k E ( I k ) = ∑ U y k π k π k = ∑ U y k . (4.5) As the expected value of the π -estimator of the total is just the total, the π -estimator is an unbiased estimator of the total. The variance of ˆ t π , we obtain as: V( ˆ t π ) = V ( ∑ U I k π k y k ) = V ( ∑ U I k ˇ y k ) . (4.6) The variance of the sum of two random variates X and Y , multiplied by constant factors a or b is V( aX + bY ) = a 2 V( X ) + b 2 V( Y ) + 2 ab Cov( X, Y ) . (4.7) We use this property and generalise it towards the case of N terms of the sum. In the case of the variance of the total estimator, the expanded values ˇ y k replace the constants ( a, b, . . . ) and the inclusion indicators replace the random variates ( X, Y, . . . ). Note that the variance is a quadratic expression. If we focus on a specific k , we find for this k N − 1 covariances Cov( I k , I l ) , k 6 = l and one variance Cov ( I k , I l ) = V ( I k ) for k = l . This holds for all k ∈ U . We obtain the Horvitz and Thompson estimator (1952) of the variance as: V( ˆ t π ) = ∑ U V( I k )ˇ y 2 k + ∑∑ U k6=l Cov( I k , I l )ˇ y k ˇ y l = ∑ U ∆ kk ˇ y 2 k + ∑∑ U k6=l ∆ kl ˇ y k ˇ y l = ∑∑ U ∆ kl ˇ y k ˇ y l . 
(4.8) Based on a sample s , we obtain the estimator of the variance of the π -estimator of the total ̂ V( ˆ t π ) = ∑∑ s ˇ ∆ kl ˇ y k ˇ y l with ˇ ∆ kl = ∆ kl π kl . (4.9) <?page no="60"?> 4.4 The π-estimator 61 Since the inclusion probabilities π k and π kl are determined by the survey design and are fix as well as the values of the statistical variable Y for all k ∈ U , the expected value of the variance estimator can be obtained as: E ( ̂ V( ˆ t π ) ) = E ( ∑∑ s ˇ ∆ kl ˇ y k ˇ y l ) = E ( ∑∑ U I k I l ˇ ∆ kl ˇ y k ˇ y l ) = ∑∑ U E( I k I l ) ˇ ∆ kl ˇ y k ˇ y l = ∑∑ U π kl ∆ kl π kl ˇ y k ˇ y l = ∑∑ U ∆ kl ˇ y k ˇ y l = V( ˆ t π ) . (4.10) Expressing the covariances and expanded values using inclusion probabilities, we find: V( ˆ t π ) = ∑∑ U ∆ kl ˇ y k ˇ y l = ∑∑ U ∆ kl y k π k y l π l = ∑∑ U ( π kl − π k π l ) y k π k y l π l = ∑∑ U ( π kl π k π l − 1 ) y k y l = ∑∑ U π kl π k π l y k y l − ∑∑ U y k y l = ∑∑ U π kl π k π l y k y l − ( ∑ U y k ) 2 . (4.11) The estimator of the variance expressed using inclusion probabilities π k and π kl is: ̂ V( ˆ t π ) = ∑∑ s π −1 kl ( π kl π k π l − 1 ) y k y l . (4.12) <?page no="61"?> 62 4 Estimation 4.4.3 An alternative expression of the variance For the variance of a variable X , it holds that V( x ) = 1 n n ∑ i=1 ( x i − ¯ x ) 2 = 1 n n ∑ i=1 x 2 i − ¯ x 2 = 1 2 n 2 n ∑ i=1 n ∑ j=1 ( x i − x j ) 2 . (4.13) This can be shown using some manipulations V( x ) = 1 2 n 2 n ∑ i=1 n ∑ j=1 ( x i − x j ) 2 = 1 2 n 2 n ∑ i=1 n ∑ j=1 ( x 2 i + x 2 j − 2 x i x j ) = 1 2 n 2 [ n ∑ i=1 n ∑ j=1 x 2 i + n ∑ i=1 n ∑ j=1 x 2 j − n ∑ i=1 n ∑ j=1 2 x i x j ] = 1 2 n 2 [ n n ∑ i=1 x 2 i + n n ∑ j=1 x 2 j − 2 n ∑ i=1 n ∑ j=1 x i x j ] (4.14) = 1 2 n 2 [ 2 n n ∑ i=1 x 2 i − 2 n ∑ i=1 n ∑ j=1 x i x j ] = 1 n n ∑ i=1 x 2 i − 1 n n ∑ i=1 x i 1 n n ∑ j=1 x j = 1 n n ∑ i=1 x 2 i − ¯ x 2 . 4.4.4 The Yates-Grundy variance of the total Using analogue manipulations, we can show that the variance of the π -estimator for sampling designs with fixed n can be expressed as (Yates and Grundy [1953] and Sen [1953]): 1 V( ˆ t π ) = ∑∑ U ∆ kl ˇ y k ˇ y l = − 1 2 ∑∑ U ∆ kl (ˇ y k − ˇ y l ) 2 . (4.15) 1 Yates, F., Grundy P. M. (1953), Selection without replacement from within strata with probability proportional to size, Journal of the Royal Statistical Society, B, Vol. 15, 253-261. Sen, A. R. (1953), On the estimate of the variance in sampling with varying probability, Journal of the Indian Society of Agricultural Statistics, Vol. 5, 119-129. <?page no="62"?> 4.4 The π-estimator 63 The estimator based on a sample s is defined as (Yates and Grundy [1953] and Sen [1953]) ̂ V( ˆ t π ) = − 1 2 ∑∑ s ˇ ∆ kl (ˇ y k − ˇ y l ) 2 . (4.16) We want to show that V( ˆ t π ) = ∑∑ U ∆ kl ˇ y k ˇ y l = − 1 2 ∑∑ U ∆ kl (ˇ y k − ˇ y l ) 2 . (4.17) We start with some manipulations V( ˆ t π ) = − 1 2 ∑∑ U ∆ kl (ˇ y k − ˇ y l ) 2 = − 1 2 ∑∑ U ∆ kl [ ˇ y 2 k + ˇ y 2 l − 2ˇ y k ˇ y l ] = − 1 2 [ ∑∑ U ∆ kl ˇ y 2 k + ∑∑ U ∆ kl ˇ y 2 l − 2 ∑∑ U ∆ kl ˇ y k ˇ y l ] = − 1 2 [ ∑ k ∈ U ˇ y 2 k ∑ l ∈ U ∆ kl + ∑ l ∈ U ˇ y 2 l ∑ k ∈ U ∆ kl (4.18) − 2 ∑∑ U ∆ kl ˇ y k ˇ y l ] = ∑∑ U ∆ kl ˇ y k ˇ y l − 1 2 [ ∑ k ∈ U ˇ y 2 k ∑ l ∈ U ∆ kl + ∑ l ∈ U ˇ y 2 l ∑ k ∈ U ∆ kl ] . The second sum of the right hand side must be 0 for the equality to hold. This would be true, if ∑ l ∈ U ∆ kl = 0 . 
(4.19) The sum of covariances can be expressed as ∑ l ∈ U ∆ kl = ∑ l ∈ U ( π kl − π k π l ) = ∑ l ∈ U π kl − π k ∑ l ∈ U π l (4.20) <?page no="63"?> 64 4 Estimation and for the sum of inclusion probabilities, we find ∑ l ∈ U π l = ∑ l ∈ U ∑ s 3 l p ( s ) = n M ∑ s=1 p ( s ) = n. (4.21) Note that each sample contains n elements, therefore, each sample will be considered n− times. Furthermore, it holds that ∑ l ∈ U π kl = ∑ l ∈ U E( I l I k ) = E ( ∑ l ∈ U I l I k ) = E ( I k ∑ l ∈ U I l ) = nE ( I k ) = nπ k . (4.22) Substituting into the variance expression, we find ∑ l ∈ U ∆ kl = ∑ l ∈ U π kl − π k ∑ l ∈ U π l = nπ k − nπ k = 0 . (4.23) Hence, it holds that V( ˆ t π ) = ∑∑ U ∆ kl ˇ y k ˇ y l − 1 2 [ ∑ k ∈ U ˇ y 2 k ∑ l ∈ U ∆ kl + ∑ l ∈ U ˇ y 2 l ∑ k ∈ U ∆ kl ] (4.24) = ∑∑ U ∆ kl ˇ y k ˇ y l . 4.5 Estimation using R 4.5.1 A small numerical example We consider a population U of size N = 5. The values of variable Y in the population are as follows: Y <c (1,2,5,12,30) N <length (Y) n <- 3 We consider samples of fixed size n = 3. The number of possible samples is M <choose (N,n); M ## [1] 10 The matrix containing in its M columns the indexes of the population elements, which are included in the sample, is <?page no="64"?> 4.5 Estimation using R 65 i <combn (N,n); i ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 1 1 1 1 1 2 2 2 3 ## [2,] 2 2 2 3 3 4 3 3 4 4 ## [3,] 3 4 5 4 5 5 4 5 5 5 We assume that the probabilities of the M = 10 possible samples are given as p <- 1: M/ sum (1: M); round (p,3) ## [1] 0.018 0.036 0.055 0.073 0.091 0.109 0.127 0.145 ## [9] 0.164 0.182 The vector of the inclusion probabilities for N = 5 population elements is obtained using the definition π k = P( k ∈ S ) = P( I k = 1) = ∑ s 3 k p ( s ) . (4.25) We first obtain the matrix with inclusion indicators. We use a slightly modified version of the function is .element() to mimic the inclusion indicator function: # inclusion indicator Iks <function(x,y) as.numeric ( is.element (x,y)) ind <apply (i,2,function(z) Iks (1: N,z)) ind ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 1 1 1 1 1 0 0 0 0 ## [2,] 1 1 1 0 0 0 1 1 1 0 ## [3,] 1 0 0 1 1 0 1 1 0 1 ## [4,] 0 1 0 1 0 1 1 0 1 1 ## [5,] 0 0 1 0 1 1 0 1 1 1 Now we add up the probabilities of relevant samples s for all elements k pi <colSums ( t (ind)*p); round (pi,3) ## [1] 0.382 0.545 0.636 0.691 0.745 Assume that we want to estimate the mean of the population mean (Y) ## [1] 10 using the π -estimator ˆ ¯ y π = 1 N ∑ s y k π k . (4.26) <?page no="65"?> 66 4 Estimation Using the indexes of the sample elements ( k ∈ s ), we find the corresponding values y k and inclusion probabilities ( π k ). Adding up the expanded values y k / π k results in the estimates ybar.hat <- 1/ N* apply (i,2,function(z) sum (Y[z]/ pi[z])) round (ybar.hat,2) ## [1] 2.83 4.73 9.31 5.57 10.14 12.05 5.78 10.35 ## [9] 12.26 13.09 We find, that the expected value of the estimator over all M possible samples, taking into account their probabilities p ( s ), is just the population mean sum (ybar.hat*p) ## [1] 10 The variance of the estimator is sum ((ybar.hatmean (Y))^2*p) ## [1] 8.717 For the M = 10 samples, we obtain the following variance estimators according to ̂ V( ˆ ¯ y π ) = 1 N 2 ∑∑ s π −1 kl ( π kl π k π l − 1 ) y k y l . 
(4.27) The matrix of the second order inclusion probabilities is obtained by pikl <matrix (NA,N,N) for (k in 1: N){ for (l in 1: N) pikl[k,l] <sum ( apply (i,2,function(z) Iks (k,z)* Iks (l,z))*p) } round (pikl,3) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.382 0.109 0.182 0.218 0.255 ## [2,] 0.109 0.545 0.291 0.327 0.364 ## [3,] 0.182 0.291 0.636 0.382 0.418 ## [4,] 0.218 0.327 0.382 0.691 0.455 ## [5,] 0.255 0.364 0.418 0.455 0.745 The elementwise multiplication of the results of the two Iks() commands ensures that we obtain the value 1 for the specific sample only, if both elements k and l are contained in the sample. An alternativ way for calculating the second order inclusion <?page no="66"?> 4.5 Estimation using R 67 probabilities is to obtain first the matrix of second order inclusion indicators for all samples and then summing the up the matrices using the sample probabilities as weights. We make use of the array () and the %o% commands: ind2 <array (NA,dim= c (N,N,M)) for (s in 1: M) ind2"s] <ind[,s]%o%ind[,s] pikl <apply (ind2* rep (p,each=N*N), c (1,2),sum) The M variance estimates are # function for estimating var for sample vhat <function(s) { n <length (s) sl <rep (NA,n) sk <sl for (j1 in 1: n){ k <s[j1] for (j2 in 1: n){ l <s[j2] sl[j2] <- 1/ pikl[k,l]*(pikl[k,l]/ (pi[k]*pi[l])-1)*Y[k]*Y[l] } sk[j1] <sum (sl) } sum (sk)/ N^2 } # apply function to all samples round ( apply (i,2,vhat),2) ## [1] -0.39 1.91 13.81 1.83 12.61 11.19 2.00 12.39 ## [9] 10.86 8.62 An alternativ way for calculating the variance for all M samples is again making use of the function %o%: vhat <function(s) {1/ N^2* sum (1/ pikl[s,s]* (pikl[s,s]/ (pi[s]%o%pi[s])-1)*(Y[s]%o%Y[s]))} v <apply (i,2,vhat) The expected value of the variance estimator is the variance of the π -estimator: sum (v*p) ## [1] 8.717 Unfortunately, the variance estimator for the first sample turns out <?page no="67"?> 68 4 Estimation negative. As the variance is defined as a (positively weighted) sum of squares, this is an unwelcome feature of the variance estimation function. 4.5.2 An empirical example: PSID We now consider a somewhat more realistic example. We treat the sample of 1000 observations as the population. Hence, we have N = 1000 . We consider samples of size n = 20. The inclusion probabilities ( π ) we assume to be proportional to the person’s age. d <read.csv2 ("psid_2007_sp.csv") N <nrow (d) n <- 20 Y <d$wage pi <d$age/ sum (d$age)*n Sampling with unequal inclusion probabilities is somewhat complicated. We will turn to that issue later on. For the moment, we use the sample function implemented in R which allows to approximate draws with unequal inclusion probabilities without replacement. We draw one sample and obtain the π -estimator for this particular sample. set.seed (123) i <sample (1: N,n,prob=pi) 1/ N* sum (Y[i]/ pi[i]) ## [1] 50939 Ybar <mean (Y); Ybar ## [1] 44487 Obviously, based on this specific sample, we overestimate the true average wage in the population considerably. The distribution of the π -estimation function with n = 20 can be approximated through sampling a large number of samples, say B = 10 , 000 samples, and calculating the π -estimator or each sample. B <- 10000 ybar <rep (NA,B) set.seed (123) for (b in 1: B) { <?page no="68"?> 4.5 Estimation using R 69 20000 40000 60000 80000 100000 120000 Estimate Density ·10 −5 0 1 2 3 4 ¯ y U Figure 4.1: Approximated distribution of the estimation function for n = 20. 
i <sample (1: N,n,prob=pi) ybar[b] <- 1/ N* sum (Y[i]/ pi[i]) } We draw the density distribution of the estimation function using a density estimation (see figure 4.1). Note that the distribution is highly skewed and obviously deviates strongly from the normal distribution. plot ( density (ybar),xlab="Estimate",main="") arrows (Ybar,5e-06,Ybar,0,length=0.1,angle=25) text (Ybar,7.5e-06, expression ( bar (y)[U])) Based on the B = 10 , 000 samples, we can obtain an approximate confidence interval. lower <round ( quantile (ybar,0.05)) upper <round ( quantile (ybar,0.95)) E.g. we can expect with a probability of 90% to find an estimate of the average wage (which is 44 , 487 ) to be located within the interval [29 , 971; 67 , 066]. Note that in the population, there is a small positive correlation between age and wage: cor (d$age,d$wage) ## [1] 0.1935 <?page no="69"?> 70 4 Estimation Therefore, inclusion probabilities being proportional to age will result in slightly higher probabilities for samples being “too old” and because of the small positive correlation, the samples will contain on average slightly “too rich” persons. Due to the inverse probability weighting, this is taken into account in the estimation. Note that the R-function sample () does in fact not result in probabilities p ( s ) which conform strictly (but approximatively) to the individual inclusion probabilities. But this effect is rather negligible in this example. 4.6 Generating samples with unequal inclusion probabilities The π -estimator is defined for every sampling design. In practice, most often inclusion probabilities differ between individuals or groups (strata, cluster) of the population. In this section, we discuss a specific design, in which the inclusion probabilities are proportional to an auxiliary variable Z or even proportional to the value of the statistical variable of interest Y . We also discuss how to generate samples with unequal inclusion probabilities. 4.6.1 Probabilities proportional to size (PPS) In this sampling design, the inclusion probabilities π are proportional to an auxiliary variable Z π k = n z k ∑ N j=1 z j . (4.28) The Horvitz-Thompson-estimator ( π -estimator) is defined as ˆ t = ∑ s y k π k . (4.29) Often a specific design is predetermined and results in individual inclusion probabilities. However, we can also specify individual inclusion probabilities which then determine the probabilities of the different possible samples and thereby the sampling design. Note that if we could define inclusion probabilities proportional to Y π k = n y k ∑ N j=1 y j (4.30) <?page no="70"?> 4.6 Generating samples with unequal inclusion probabilities 71 we would obtain ˆ t = ∑ s y k n y k ∑ N j=1 y j = 1 n ∑ s N ∑ j=1 y j = 1 n ∑ s t = t, (4.31) I.e. every sample of the sample space S would result in the identical numerical estimator which would be exactly the population total. Based on this finding, it is rather intuitive to search for an auxiliary variable Z that is highly correlated with Y . The proportional to size sampling design brings along two problems. It is practically difficult to generate samples according to the given inclusion probabilities. Most simple sampling algorithms do not guarantee that the inclusion probability equals the one draw-probability: π i = np i . Furthermore, it is quite difficult to obtain the second order inclusion probabilities π kl . E.g. the R-function sample () draws sequentially with one-draw inclusion probabilities p i of the remaining elements in the population. 
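A minimal sketch (N = 3, n = 2, arbitrarily chosen one-draw probabilities) shows the consequence of this sequential scheme: the exact inclusion probability of element k is the probability of being drawn first plus the probability of being drawn second after some other element j, whose removal rescales the remaining probabilities by 1/(1 - p_j).

# exact inclusion probabilities under sequential drawing without replacement
p <- c(0.20, 0.35, 0.45)   # arbitrary one-draw probabilities, N = 3, n = 2
N <- length(p)
pi_seq <- sapply(1:N, function(k) {
  # element k is drawn first, or drawn second after some element j != k
  p[k] + sum(p[-k] * p[k] / (1 - p[-k]))
})
round(2 * p, 3)      # postulated n*p_i:  0.4 0.7 0.9
round(pi_seq, 3)     # sequential scheme: 0.471 0.724 0.805

In this small example the sequential scheme inflates the small inclusion probabilities and deflates the large ones.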
This algorithm violates the postulation π i = np i . We demonstrate the failure of the sample () function with the following simple example. We want to generate a sample of size n = 3 from a population of size N = 5 with unequal inclusion probabilities. B <- 10000 N <- 5 n <- 3 e <matrix (NA,B,n) p <- 4: 8/ sum (4: 8); p ## [1] 0.1333 0.1667 0.2000 0.2333 0.2667 set.seed (123) for (i in 1: B) e[i,] <sample (1: N,n,prob=p) pi_emp <rep (NA,N) for (i in 1: N) { pi_emp[i] <sum ( apply (e,1,function(z) i%in%z))/ B } rbind (p*n, round (pi_emp,3)) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.400 0.500 0.600 0.700 0.800 ## [2,] 0.448 0.544 0.612 0.675 0.721 We find that the relative frequencies differ considerably from the inclusion probabilities. Additional R-packages (e.g. pps and <?page no="71"?> 72 4 Estimation sampling) allow generating samples complying with the postulation π i = np i and allow calculating second order inclusion probabilities. 4.6.2 The Sampford-algorithm The algorithm is computational somewhat greedy but easy to implement. For the first element the one-draw probabilities are given as p i = π i / n . For the remaining N − 1 elements in the population new one-draw probabilities are calculated according to ˜ p i = π i 1−π i ∑ N j=1 π j 1−π j . (4.32) The n − 1 elements of the sample are drawn with probabilities ˜ p i with replacement . If the generated sample contains the same elements more than once, the sample is discarded and a new sample is generated until a sample is obtained which contains only unique elements. The Sampford-algorithm guarantees that np i = π i and the second order inclusion probabilities π kl can be calculated with justifiable effort. library (pps) B <- 10000 e <matrix (NA,B,n) N <- 5 n <- 3 p <- 4: 8/ sum (4: 8); p ## [1] 0.1333 0.1667 0.2000 0.2333 0.2667 set.seed (123) for (i in 1: B) e[i,] <sampford (p,n) pi_emp <rep (NA,N) #i <- 1 for (i in 1: N) { pi_emp[i] <sum ( apply (e,1,function(z) i%in%z))/ B } rbind (p*n, round (pi_emp,3)) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.400 0.500 0.600 0.700 0.8 ## [2,] 0.406 0.504 0.592 0.698 0.8 Using the Sampford-algorithm in the simulation, we find that the relative frequencies are almost identical to the given inclusion probabilities. <?page no="72"?> 4.7 Exercises 73 4.7 Exercises 1. How is the π -estimator ( ˆ t π ) of the population total defined? 2. Proof that ˆ t π is an unbiased estimation function of the population total t . 3. Derive the variance of the π -estimator ˆ t π . 4. Proof that ̂ V( ˆ t π ) = ∑∑ s ˇ ∆ kl ˇ y k ˇ y l is an unbiased estimation function of V( ˆ t π ). 5. We consider a population with N = 3 elements y 1 = 1 , y 2 = 2 , y 3 = 5 and want to estimate the total of the population based on a sample of size n = 2. The sampling design is characterized by the following probabilities p ( s ): p ( s 1 ) = 0 . 5 , p ( s 2 ) = 0 . 3 , p ( s 3 ) = 0 . 2 . a) Obtain the population total. b) Obtain the number M = |S| of possible samples. c) Obtain all elements of S n . d) Obtain the first order inclusion probabilities π k and the second order inclusion probabilities π kl . e) Obtain the covariances of the inclusion indicators I k and I l . f) Obtain the numerical estimates ˆ t π = ∑ s y k π k for the total for each of the M samples. g) Show numerically that ˆ t π = ∑ s y k π k is an unbiased estimation function for the total t . h) Derive the variance of ˆ t π . i) Obtain the numerical estimates of the variance estimators (for the π -estimator of the total) for each of the M samples. 
<?page no="73"?> 74 4 Estimation j) Show numerically that the estimation function ̂ V( ˆ t π ) is an unbiased estimator of V( ˆ t π ). 6. Solve exercise 5 using R. <?page no="74"?> 5 Simple sampling In this chapter, we discuss the simple random sampling design (SI). Whereas the π-estimator can be applied in all sampling designs, the simple random sampling design assumes identical inclusion probabilities for all elements of the population. Estimating functions simplify considerably in the case of simple random sampling. 5.1 Introduction 76 5.2 Some general estimation functions 76 5.2.1 The π-estimator for the total 76 5.2.2 The π-estimator for the mean 76 5.2.3 The π-estimator for a proportion 77 5.3 Simple random sampling 77 5.3.1 The π-estimator for the total (SI) 78 5.3.2 The π-estimator for the mean (SI) 79 5.3.3 The π-estimator for a proportion (SI) 80 5.4 Some examples using R 81 5.5 Exercises 86 <?page no="75"?> 76 5 Simple sampling 5.1 Introduction We first reconsider the general π -estimator for the total, the mean and a proportion and for their variances. Because of identical inclusion probabilities, the expressions simplify considerably in the case of simple random sampling. 5.2 Some general estimation functions We briefly reconsider the general π -estimators for the total, the mean, and a proportion defined for any design. These estimation functions have already been discussed in more detail in the previous chapters and only serve as the starting point for simplifications in simple random sampling. 5.2.1 The π-estimator for the total ˆ t π is the π -estimator of the total t in population U ˆ t π = ∑ U I k y k π k = ∑ U I k ˇ y k = ∑ s y k π k = ∑ s ˇ y k . (5.1) The variance of ˆ t π is V( ˆ t π ) = ∑ U V( I k )ˇ y 2 k + ∑∑ U k6=l Cov( I k , I l )ˇ y k ˇ y l = ∑ U ∆ kk ˇ y 2 k + ∑∑ U k6=l ∆ kl ˇ y k ˇ y l = ∑∑ U ∆ kl ˇ y k ˇ y l . (5.2) An unbiased estimator of the variance V( ˆ t π ) based on a sample is ̂ V( ˆ t π ) = ∑∑ s ˇ ∆ kl ˇ y k ˇ y l with ˇ ∆ kl = ∆ kl π kl . (5.3) 5.2.2 The π-estimator for the mean The treatment of the mean is strictly analogue to the treatment of the total, because of ¯ y = t/ N . Hence, the variance of the mean is the variance of the total multiplied by the factor 1 / N 2 . ˆ ¯ y π is the π -estimator of the mean ¯ y of population U ˆ ¯ y π = 1 N ∑ U I k y k π k = 1 N ∑ U I k ˇ y k = 1 N ∑ s y k π k = 1 N ∑ s ˇ y k . (5.4) <?page no="76"?> 5.3 Simple random sampling 77 The variance V( ˆ ¯ y π ) of ˆ ¯ y π is V( ˆ ¯ y π ) = 1 N 2 ∑ U V( I k )ˇ y 2 k + 1 N 2 ∑∑ U k6=l Cov( I k , I l )ˇ y k ˇ y l = 1 N 2 ∑ U ∆ kk ˇ y 2 k + 1 N 2 ∑∑ U k6=l ∆ kl ˇ y k ˇ y l = 1 N 2 ∑∑ U ∆ kl ˇ y k ˇ y l . (5.5) An unbiased estimator of the variance V( ˆ ¯ y π ) based on a sample is ̂ V( ˆ ¯ y π ) = 1 N 2 ∑∑ s ˇ ∆ kl ˇ y k ˇ y l with ˇ ∆ kl = ∆ kl π kl . (5.6) 5.2.3 The π-estimator for a proportion The estimation of a proportion is identical to the estimation of the mean. ˆ ¯ y π is the π -estimator of a proportion, if the variable Y takes only two values, either 0 or 1. The value y k = 1 indicates that the element k has the property of interest, y k = 0 indicates the absence of that property y k = { 1 , unit k has property 0 , else . (5.7) 5.3 Simple random sampling Under simple random sampling the expressions of estimators and variances simplify considerably. Whereas in the general case of the π -estimator the elements of the population may have individual inclusion probabilities π k , in the SI case all N elements of U have identical inclusion probabilities. 
In simple random sampling we find π k = n N and π kl = n N ( n − 1) ( N − 1) . (5.8) <?page no="77"?> 78 5 Simple sampling 5.3.1 The π-estimator for the total (SI) The π -estimator ˆ t π of the total t in population U simplifies towards ˆ t π = ∑ U I k y k π k = ∑ U I k y k n N = N n ∑ U I k y k = N n ∑ s y k = N ¯ y s . (5.9) The expected value of ˆ t π is again found to be the population total E ( N n ∑ s y k ) = N n E ( ∑ U I k y k ) = N n ∑ U y k E( I k ) = N n ∑ U y k n N = ∑ U y k . (5.10) Obtaining a simplified expression for the variance V ( ˆ t π ) in the simple random sampling is somewhat tedious V(ˆ t π ) = ∑∑ U ( π k,l − π k π l ) y k π k y l π l = ∑ U π k (1 − π k ) ( y k π k ) 2 + ∑ ∑ U,k6=l ( π k,l − π k π l ) y k π k y l π l = ∑ U n N ( 1 − n N ) ( y k n N ) 2 + ∑ ∑ U,k6=l ( n N (n − 1) (N − 1) − n N n N ) y k n N y l n N = N 2 n N ( 1 − n N ) 1 n 2 ∑ U y 2 k + N 2 ( n N (n − 1) (N − 1) − n N n N ) 1 n 2 ∑ ∑ U,k6=l y k y l = N 2 1 n ( 1 N − n N 2 )∑ U y 2 k + N 2 1 n ( (n − 1) N (N − 1) − n N 2 )∑ ∑ U,k6=l y k y l = N 2 1 n ( N − n N 2 )∑ U y 2 k + N 2 1 n ( N 2 (n − 1) − nN (N − 1) N 2 N (N − 1) )∑ ∑ U,k6=l y k y l <?page no="78"?> 5.3 Simple random sampling 79 = N 2 1 n ( N − n N 2 )∑ U y 2 k (5.11) + N 2 1 n ( nN 2 − N 2 − nN 2 + nN N 2 N (N − 1) )∑ ∑ U,k6=l y k y l = N 2 1 n ( N − n N − 1 N − 1 N 2 )∑ U y 2 k + N 2 1 n ( −(N − n) N 2 (N − 1) )∑ ∑ U,k6=l y k y l = N 2 1 n N − n N − 1   ( 1 N − 1 N 2 )∑ U y 2 k − 1 N 2 ∑ ∑ U,k6=l y k y l   = N 2 1 n N − n N − 1 ( 1 N ∑ U y 2 k − 1 N 2 ∑ U y 2 k − 1 N 2 ∑ ∑ U,k6=l y k y l   = N 2 1 n N − n N − 1 ( 1 N ∑ U y 2 k − 1 N 2 ∑∑ U y k y l ) = N 2 1 n N − n N − 1 V(Y ) = N 2 (1 − f) n S 2 y with S 2 y = 1 N − 1 ∑ U (y k − ¯ y) 2 . The variance estimator based on a sample uses the variance within the sample S 2 ys for the unknown variance S 2 y in the population ̂ V( ˆ t π ) = N 2 (1 − f ) n S 2 ys with S 2 ys = 1 n − 1 ∑ s ( y k − ¯ y s ) 2 . (5.12) 5.3.2 The π-estimator for the mean (SI) The expressions for the mean differ only by the factor 1 / N from the total. The variance expressions for the mean contain an additional factor 1 / N 2 . For the mean we find ˆ ¯ y π = 1 N ∑ U I k y k n N = 1 n ∑ U I k y k = 1 n ∑ s y k . (5.13) <?page no="79"?> 80 5 Simple sampling We obtain the expected value ˆ ¯ y π as E ( ˆ ¯ y π ) = E ( 1 n ∑ U I k y k ) = 1 n ∑ U y k E ( I k ) = 1 n ∑ U y k n N = 1 N ∑ U y k . (5.14) The variance of ˆ ¯ y π is V( ˆ ¯ y π ) = V ( 1 N ˆ t π ) = 1 N 2 V( ˆ t π ) = (1 − f ) n S 2 y with S 2 y = 1 N − 1 ∑ U ( y k − ¯ y ) 2 . (5.15) The estimator based on a sample is ̂ V( ˆ ¯ y π ) = (1 − f ) n S 2 ys with S 2 ys = 1 n − 1 ∑ s ( y k − ¯ y s ) 2 . (5.16) 5.3.3 The π-estimator for a proportion (SI) In this case Y is dichotomous (0,1) y k = { 1 , if k has property 0 , else (5.17) and the estimator of a proportion is again ˆ ¯ y π . ˆ ¯ y π = 1 n ∑ s y k . (5.18) <?page no="80"?> 5.4 Some examples using R 81 The estimator ̂ P of population proportion P is the proportion in the sample. The variance V( ˆ ¯ y π ) simplifies towards V( ˆ ¯ y π ) = (1 − f ) n S 2 y = (1 − f ) n N N − 1 V( Y ) = ( N − n ) N n N N − 1 ( 1 N ∑ U y 2 k − ¯ y 2 ) = 1 n ( N − n ) N − 1 ( 1 N ∑ U y k − ( 1 N ∑ U y k ) 2 ) = 1 n ( N − n ) N − 1 ( 1 N ∑ U y k )( 1 − 1 N ∑ U y k ) = 1 n ( N − n ) N − 1 P (1 − P ) (5.19) and is estimated based on the sample by ̂ V( ˆ ¯ y π ) = (1 − f ) n S 2 ys = (1 − f ) n n n − 1 ( 1 n ∑ s y k )( 1 − 1 n ∑ s y k ) = 1 − f n − 1 ̂ P (1 − ̂ P ) with ̂ P = 1 n ∑ s y k . 
(5.20) 5.4 Some examples using R Again, we have a look at our PSID data set. We are interested in estimating the average wage of the population. The estimation function is simply the average wage in the sample. The variance of the mean estimator is V( ˆ ¯ y π ) = (1 − f ) n S 2 y with S 2 y = 1 N − 1 ∑ U ( y k − ¯ y ) 2 . (5.21) For the population we obtain d <read.csv2 ("psid_2007_sp.csv") N <nrow (d) n <- 20 <?page no="81"?> 82 5 Simple sampling f <n/ N Y <d$wage Ybar <mean (Y); Ybar ## [1] 44487 V.Ybar <- (1-f)/ n* var (Y); V.Ybar ## [1] 173588669 The variance estimator based on a sample is ̂ V( ˆ ¯ y π ) = (1 − f ) n S 2 ys with S 2 ys = 1 n − 1 ∑ k ∈ S ( y k − ¯ y s ) 2 . (5.22) We write a function returning the variance estimator for a given sample: mf_var <function(x,N){ n <length (x) f <n/ N (1-f)/ n* var (x) } We draw B = 10 , 000 samples of size n = 20 and calculate the estimators of the mean and the variance of the mean for each sample. n <- 20 B <- 10000 m <rep (NA,B) v <m N <length (Y) set.seed (1) for (b in 1: B) { y <sample (Y,n) m[b] <mean (y) v[b] <mf_var (y,N) } # in mil. vm <v/ 10^6 V.Ybarm <- V.Ybar/ 10^6 We display the results graphically (see figure 5.1). The population parameters are indicated by an arrow. <?page no="82"?> 5.4 Some examples using R 83 20000 60000 100000 140000 Estimate Density ·10 −5 ¯ y U 0 1 2 3 4 (a) mean 0 1000 2000 3000 Estimate in mil. Density ·10 −3 V(¯ y U ) 0 2 4 6 8 10 (b) variance Figure 5.1: Approximated distribution of the estimation functions for n = 20. # figure a plot ( density (m),lwd=2,xlab="Estimate",main="") arrows (Ybar,5e-06,Ybar,0,length=0.1,angle=25) text (Ybar,7.5e-06, expression ( bar (y)[U])) # figure b plot ( density (vm),lwd=2,xlab="Estimate in mil.",main="") arrows (V.Ybarm+1000,0.002,V.Ybarm,0,length=0.1,angle=25) text (V.Ybarm+1000,0.0026, expression ( V ( bar (y)[U]))) Obviously, the estimation of the variance can go horribly wrong. table ( round (v/ V.Ybar))/ B ## ## 0 1 2 3 4 5 6 ## 0.6446 0.1925 0.0766 0.0226 0.0249 0.0136 0.0026 ## 7 8 14 15 16 17 18 ## 0.0003 0.0004 0.0015 0.0179 0.0013 0.0004 0.0007 ## 21 ## 0.0001 mean (v/ V.Ybar>2) ## [1] 0.0968 mean (v/ V.Ybar<0.5) ## [1] 0.6446 We find that about 10% of the variance estimates overestimate the true variance of the sample mean by more than factor 2. The probability to underestimate the true variance by more than the factor 2 is about 64%. <?page no="83"?> 84 5 Simple sampling Of course, n = 20 is a rather small sample size. We also have a look at the approximate distributions ( B = 10 , 000 ) of the mean and variance estimators for n = 100. n <- 100 B <- 10000 m <rep (NA,B) v <m f <n/ N V.Ybar <- (1-f)/ n* var (Y); V.Ybar ## [1] 31883633 V.Ybarm <- V.Ybar/ N^2 set.seed (12) for (b in 1: B) { y <sample (Y,n) m[b] <mean (y) v[b] <mf_var (y,N) } # in mil. 
vm <v/ 10^6 The figure is obtained by the following code (see figure 5.2): # figure a plot ( density (m),lwd=2,xlab="Estimate",main="") arrows (Ybar,5e-06,Ybar,0,length=0.1,angle=25) text (Ybar,10e-06, expression ( bar (y)[U])) # figure b plot ( density (vm),lwd=2,xlab="Estimate in mil.",main="") arrows (V.Ybarm,0.005,V.Ybarm,0,length=0.1,angle=25) text (V.Ybarm,0.0075, expression ( V ( bar (y)[U]))) Some results: table ( round (v/ V.Ybar,1))/ B ## ## 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 ## 0.0064 0.0736 0.1106 0.1048 0.1055 0.0920 0.0730 0.0544 ## 0.9 1 1.1 1.2 1.3 1.4 1.5 1.6 ## 0.0371 0.0411 0.0447 0.0353 0.0295 0.0230 0.0192 0.0140 ## 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 ## 0.0105 0.0068 0.0066 0.0049 0.0022 0.0012 0.0011 0.0011 ## 2.5 2.6 2.7 2.8 3 3.1 3.2 3.3 ## 0.0006 0.0004 0.0003 0.0002 0.0002 0.0120 0.0147 0.0122 ## 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 <?page no="84"?> 5.4 Some examples using R 85 30000 50000 70000 Estimate Density ·10 −5 ¯ y U 0 2 4 6 (a) mean 0 50 100 150 Estimate in mil. Density ·10 −2 V(¯ y U ) 0 1 2 3 (b) variance Figure 5.2: Approximated distribution of the estimating functions for n = 100. ## 0.0118 0.0104 0.0081 0.0059 0.0035 0.0036 0.0045 0.0029 ## 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 ## 0.0027 0.0030 0.0017 0.0009 0.0002 0.0003 0.0005 0.0002 ## 5 5.1 5.2 5.7 ## 0.0002 0.0001 0.0002 0.0001 mean (v/ V.Ybar>1.5) ## [1] 0.1592 mean (v/ V.Ybar<0.5) ## [1] 0.3505 Interestingly, most often we do not come close to the true variance. With a probability of about 35%, we will strongly underestimate the true variance by more than 50% and with a probability of about 16%, we will overestimate the variance by more than 50%. <?page no="85"?> 86 5 Simple sampling 5.5 Exercises 1. A sample of size n shall be drawn from a population of size N . a) What is the probability p ( s ) for each s ∈ S ? b) Derive the inclusion probability π k . c) Derive the second order inclusion probability π kl . 2. Consider a population of size N = 6 and variable X : x 1 = 1 , x 2 = 2 , x 3 = 5 , x 4 = 10 , x 5 = 14 , x 6 = 28 . a) Obtain the number M = |S| of possible samples for n = 3. b) What is the largest (smallest) sample mean that can occur? c) What can you say about the distribution of the sample mean? 3. Consider the general π -estimator of the total which is defined for all sampling designs. a) How is the estimation function of the total defined? b) How does the estimation function simplifies in the SI case? c) Show that the simplified (SI case) estimation function of the total is unbiased. d) How is the variance of the π -estimator for the total defined. How does the variance simplifies in the SI case? e) Obtain the estimation function of the variance of the total-estimator based on a sample. f) Show that the simplified estimation function of the variance of the π -estimator for the total is unbiased. 4. Consider the variable wage of the PSID data. Obtain the approximate distribution ( B = 1000) of the estimation function for the population mean and of the estimation function for the variance of sample mean for n = 30. Comment on your findings. <?page no="86"?> 5.5 Exercises 87 5. Consider the variable sector of the PSID data. Obtain the approximate distribution ( B = 1000 ) of the estimation function for the proportion of persons working in the service sector (identified by variable sector having value 7) and of the estimation function for the variance of the proportion estimator for n = 30. Comment on your findings. 
<?page no="88"?> 6 Confidence intervals It is quite common to provide confidence intervals for estimates, despite the fact, that after a specific sample has been obtained, nothing specific can be said beyond the n elements observed in the sample. 6.1 Introduction 90 6.2 Chebyshev-inequality 92 6.2.1 Derivation of the Chebyshev-inequality 92 6.2.2 Application of the Chebyshevinequality 94 6.3 Confidence intervals based on a specific sample 94 6.4 Some general remarks 97 6.4.1 No approximate normality 97 6.4.2 Simplified variance estimators 97 6.4.3 Effect of simplification in the simple random sampling case 98 6.4.4 Effect of simplification for the general π-estimator 99 6.4.5 Effect of simplification in stratified and clustered samples 99 6.4.6 Bootstrap 100 6.5 Exercises 101 <?page no="89"?> 90 6 Confidence intervals 6.1 Introduction We start this introduction with an example based on the PSID data. Suppose we want to estimate the average wage of the population based on a simple random sample of size n = 100. While the true distribution of the sample mean, which is the π -estimator in the case of simple random sampling, is unknown, we can approximate the distribution through drawing many (e.g. B = 100 , 000) samples. For the population, we obtain d <read.csv2 ("psid_2007_sp.csv") N <nrow (d) n <- 100 Y <d$wage Ybar <mean (Y); Ybar ## [1] 44487 V.Ybar <- (1-n/ N)/ n* var (Y); V.Ybar ## [1] 31883633 B <- 100000 m <rep (NA,B) set.seed (123) for (b in 1: B) m[b] <mean ( sample (Y,n)) mean (m)/ Ybar ## [1] 0.9997 var (m)/ V.Ybar ## [1] 0.9992 The mean of the B sample averages differs by about − 0 . 03% from the population mean. The variance of the B sample averages differs by about − 0 . 08% from the true variance of the sample mean. Before we display the result graphically, we do some calculations: q025 <quantile (m,0.025); q025 ## 2.5% ## 35349 q975 <quantile (m,0.975); q975 ## 97.5% ## 57342 u <- Ybar+ qnorm (0.025)* sqrt (V.Ybar); u <?page no="90"?> 6.1 Introduction 91 ¯ y s 20000 40000 60000 80000 0.00000 0.00002 0.00004 0.00006 0.00008 Density ¯ y U q 0.025 q 0.975 Figure 6.1: Approximated distribution of the sample mean for n = 100. ## [1] 33420 o <- Ybar+ qnorm (0.975)* sqrt (V.Ybar); o ## [1] 55554 mean (m<u | m>o) ## [1] 0.04741 Next we plot the result (see figure 6.1). # plot histogram hist (m,prob=T,nclass=75,xlab= expression ( bar (y)[s]), main="",xlim= c (20000,80000),ylim= c (0,0.00008)) # draw arrows and annotations arrows (Ybar,2e-05,Ybar,0,length=0.1,angle=25,lwd=2.5) text (Ybar,2.3e-05, expression ( bar (y)[U]),cex=1.3) arrows (q025,2e-05,q025,0,length=0.1,angle=25,lwd=2.5) text (q025,2.3e-05, expression (q[0.025]),cex=1.3) arrows (q975,2e-05,q975,0,length=0.1,angle=25,lwd=2.5) text (q975,2.3e-05, expression (q[0.975]),cex=1.3) # overlay with normal dist. xx <seq (20000,80000,10) dxx <dnorm (xx,Ybar, sqrt (V.Ybar)) lines (xx,dxx) We indicate the population parameter by an arrow. We also indicate the 0 . 025-quantile and the 0 . 975-quantile and overlay the histogram by a normal distribution based on the true parameters ¯ y U and V(¯ y U ). Note that the interval containing the central 95% of sample means is [ 35 , 349 ; 57 , 342 ]. An approximation with the <?page no="91"?> 92 6 Confidence intervals normal distribution would result in the misleading symmetric interval [ 33 , 420 ; 55 , 554 ]. About 5% of the B sample means lie outside the interval. 
6.2 Chebyshev-inequality The Chebyshev-inequality allows to calculate intervals of width 2 c √ V(¯ x ) based only on the knowledge of the mean ¯ x U and the variance S 2 y,U without any further assumptions on the distribution of the statistical variable X . The interval will contain at least the proportion 1 − 1 / c 2 of all observations. 6.2.1 Derivation of the Chebyshev-inequality The derivation of this famous inequality is possible without any reference to probability statements. We consider the variable X for n elements with values x k ( k = 1 , . . . , n ). The location of the empirical distribution of the x -values is characterised by the arithmetic mean ¯ x and the dispersion by the standard deviation σ X . Depending on the choice of the constant c , some (or 0 for an extreme choice of c ) values x k differ from the mean at least c -times the standard deviation, i.e. |x k − ¯ x| ≥ c · σ X and some (or 0 for an extreme choice of c ) differ less, i.e. |x k − ¯ x| < c · σ X . Assume it holds for x k that |x k − ¯ x| ≥ c · σ X , (6.1a) then it holds also that ( x k − ¯ x ) 2 ≥ c 2 · σ 2 X , (6.1b) or equivalently ( x k − ¯ x σ X ) 2 ≥ c 2 . (6.1c) Now suppose n increasingly ordered observations x k with k = 1 , . . . , n . The number of observations x k for which ( x k − ¯ x σ X ) 2 < c 2 , (6.1d) we denote by m . For the remaining n − m therefore, it holds that ( x k − ¯ x σ X ) 2 ≥ c 2 . (6.1e) <?page no="92"?> 6.2 Chebyshev-inequality 93 Summing up the n − m (large) observations, we find n ∑ k=m+1 ( x k − ¯ x σ X ) 2 ≥ ( n − m ) c 2 . (6.1f) Dividing by n results in 1 n · σ 2 X n ∑ k=m+1 ( x k − ¯ x ) 2 ≥ ( n − m ) n c 2 . (6.1g) From the definition of the variance 1 n n ∑ k=1 ( x k − ¯ x ) 2 = σ 2 X , (6.1h) it follows that 1 n · σ 2 X n ∑ k=1 ( x k − ¯ x ) 2 = 1 . (6.1i) Combining our results, we find ( n − m ) n c 2 ≤ 1 n · σ 2 X n ∑ k=m+1 ( x k − ¯ x ) 2 (6.1j) ≤ 1 n · σ 2 X n ∑ k=1 ( x k − ¯ x ) 2 = 1 , (6.1k) and therefore ( n − m ) n c 2 ≤ 1 ⇔ 1 − m n ≤ 1 c 2 ⇔ m n ≥ 1 − 1 c 2 . (6.1l) m/ n denotes the share of observations for which it holds that |x k − ¯ x| < c · σ X . Hence, we have shown that m n = h ( X : |x k − ¯ x| < c · σ X ) ≥ 1 − 1 c 2 . (6.2) Note that we did not refer to probabilities or random variates in our proof. It is common to express the inequality for a random variate to obtain a probability statement. For the random variate X with expected value µ and variance σ 2 it holds that P( |x − µ| < c · σ X ) ≥ 1 − 1 c 2 . (6.3) <?page no="93"?> 94 6 Confidence intervals The fraction 1 − 1 / c 2 is a lower limit and results in conservative intervals. E.g. considering a normal distribution N ( µ = 10 , σ 2 = 4), the Chebyshev-interval [6; 14] has a probability of at least 75% according to the inequality, while it in fact has a probability of 95 . 45%. 6.2.2 Application of the Chebyshev-inequality Returning to our wage example, we find with Chebyshev 1 − 1 c 2 = 0 . 95 ⇔ c = √ 20 = 4 . 472 , (6.4) the 95% interval uu <- Ybarsqrt (20)* sqrt (V.Ybar); uu ## [1] 19234 oo <- Ybar+ sqrt (20)* sqrt (V.Ybar); oo ## [1] 69739 [19 , 234; 69 , 739], which is rather uninformative. 6.3 Confidence intervals based on a specific sample In the previous sections, we obtained intervals for the sample mean based on our knowledge of the complete population. Of course, the relevant case is that we have obtained one specific sample s out of the M = |S| possible samples. Hence, the population parameter ¯ y U is unknown and we observe the estimator ¯ y s . 
Furthermore, we do not know the variance S 2 y = 1 N − 1 ∑ U ( y k − ¯ y U ) 2 , (6.5) but have to estimate the variance in the population by S 2 ys = 1 n − 1 ∑ s ( y k − ¯ y s ) 2 (6.6) to estimate the variance of the sample mean ̂ V(¯ y s ) = (1 − f ) n S 2 ys . (6.7) Therefore, the interval we will calculate based on a sample depends on <?page no="94"?> 6.3 Confidence intervals based on a specific sample 95 1. the sample mean, 2. the sample variance and 3. the distributional assumption of the sample mean. However, we stated right at the beginning of the text, that there is not much to say about a realised sample. We are more interested in some general insights into the estimation procedure. Of course, based on a specific sample, we can calculate an interval. However, this interval will either contain the population mean or it will not. Hence, there is no probability statement possible as every ingredient is fixed, the parameters of the population as well as the parameters of the sample. We will not be able to judge the goodness of the specific interval calculated based on a realized sample. However, it is of interest to know how our procedure (calculate the sample mean, calculate the variance estimator, calculate an interval based on a distributional assumption (or not: Chebyshev)) fares “in general” or “on average”. We take up the example of estimating the population average wage based on a sample of size n . We draw B = 10 , 000 samples and for each sample we calculate an 90%-interval based on the assumption of normality (which is a false assumption as we know) and based on the Chebyshev-inequality (which results in wide intervals as we know). For the B samples, we look at the distribution of the width of the intervals as well as at the fraction containing the population mean. We use a function to estimate the variance of the sample mean: mf_var <function(x,N){ n <length (x) f <n/ N (1-f)/ n* var (x) } We draw B = 10 , 000 samples of size n = 100 and calculate both intervals for each sample: n <- 100 B <- 10000 ni <matrix (NA,B,2) ci <ni cc <sqrt (10) a <qnorm (0.95) set.seed (123) for (b in 1: B) { <?page no="95"?> 96 6 Confidence intervals 20000 40000 60000 80000 0.00000 0.00002 0.00004 0.00006 0.00008 Interval-width Density Chebyshev normal Figure 6.2: Widths of 95%-intervals based on normal assumption and Chebyshev, n = 100. y <sample (Y,n) m <mean (y) v <mf_var (y,N) ni[b,] <c (m-a* sqrt (v),m+a* sqrt (v)) ci[b,] <c (m-cc* sqrt (v),m+cc* sqrt (v)) } wn <ni[,2]-ni[,1] wc <ci[,2]-ci[,1] nin <mean (Ybar>ni[,1] & Ybar<ni[,2]); nin ## [1] 0.8354 cin <mean (Ybar>ci[,1] & Ybar<ci[,2]); cin ## [1] 0.9645 The Chebyshev intervals are 1.92 times on average as wide as the intervals based on the normal assumption. Both intervals are supposed to contain the population mean with a probability of 0.95. We find that 96.5% of the Chebyshev intervals contain the population mean but only 83.5% of the intervals based on the normality assumption. We display the distributions in figure 6.2. plot ( density (wc),lwd=2,xlab="Interval-width", ylim= c (0,9e-05),main="") lines ( density (wn),lwd=2,lty=3) legend ('topright', c ("Chebyshev","normal"), lwd= c (2,2),lty= c (1,3),bty="n", cex=1.3) <?page no="96"?> 6.4 Some general remarks 97 6.4 Some general remarks 6.4.1 No approximate normality Most often approximate normality is simply assumed and confidence intervals for population parameter θ are calculated as ˆ θ ± u 1−α/ 2 [V( ˆ θ )] 1 2 . 
(6.8) We have seen in our empirical example based on the PSID wage data, that the distribution of the sample mean deviates strongly from the normal distribution. We have also seen that the distribution of the variance estimator for the sample mean was even more non-normal. A noteworthy fraction of samples result in a strong underestimation of the variance and hence, much too small confidence intervals. For such samples with strong underestimation of the variance, even the application of the conservative Chebyshevinequality may result in too optimistic (i.e. too small) intervals. 6.4.2 Simplified variance estimators In this chapter, we only considered the example of simple random sampling. In practice, simple random sampling is hardly ever used. Most often a combination of stratified and clustered sampling is applied. The correct estimation of the variance of estimating functions is then complicated and seldom to accomplish. The main problem is properly taking into account the covariances between the inclusion indicators. One possible resort is to ignore these dependencies between inclusion indicators and to calculate variances as if the sample had been drawn with replacement. When drawing with replacement, the individual draws are independent and the covariances are 0. Therefore, the counterfactual assumption of independence for a sample drawn without replacement strongly simplifies the derivation of variance estimators. As the covariances are typically negative, this will result in an overestimation of the variance on average and therefore in conservative confidence intervals. The true probability of a type one error α will then in fact be smaller than the given level of α . <?page no="97"?> 98 6 Confidence intervals 6.4.3 Effect of simplification in the simple random sampling case We illustrate this procedure for the π -estimator of the total in the case of simple random sampling. The sample size is n and the sampling fraction is f = n/ N . The population total is t = ∑ U y k (6.9) and the π -estimator is ˆ t π = ∑ s y k π k = ∑ s y k n N = N ¯ y s . (6.10) The variance of the π -estimator is V = V( ˆ t π ) = N 2 (1 − f ) 1 n S 2 y with S 2 y = 1 N − 1 ∑ U ( y k − ¯ y ) 2 (6.11) and the unbiased variance estimation function based on a sample is ̂ V = ̂ V( ˆ t π ) = N 2 (1 − f ) 1 n S 2 ys with S 2 ys = 1 n − 1 ∑ s ( y k − ¯ y ) 2 . (6.12) The correction factor for finite samples 1 − f is not considered in the case of drawing with replacement and the variance estimator is ̂ V 0 = N 2 1 n S 2 ys . (6.13) The relative bias that results from ignoring the dependencies and counterfactually assuming drawing with replacement is E( ̂ V 0 ) − V V = N 2 1 n S 2 y − N 2 (1 − f ) 1 n S 2 y N 2 (1 − f ) 1 n S 2 y = N 2 1 n S 2 y N 2 1 n S 2 y 1 − (1 − f ) (1 − f ) = f 1 − f . (6.14) E.g., if the sampling fraction is 10%, the bias is about 11%. <?page no="98"?> 6.4 Some general remarks 99 6.4.4 Effect of simplification for the general π-estimator The result for the simple random sampling can be generalised towards the π -estimator for less trivial designs. The π -estimator of the total is ˆ t π = ∑ s y k π k (6.15) and the variance is V = − 1 2 ∑∑ U ∆ kl (ˇ y k − ˇ y l ) 2 . (6.16) The Yates-Grundy estimator of the variance is defined as ̂ V = − 1 2 ∑∑ s ˇ ∆ kl (ˇ y k − ˇ y l ) 2 . (6.17) and requires the handling of the double sum. Note that even for a moderate sample size of n = 100 one has to calculate 100 inclusion probabilities π k and expanded variances ∆ kk and 0 . 
5 · 100 · 99 = 4950 expanded covariances. If counterfactually assuming drawing with replacement, the covariances vanish ̂ V 0 = 1 n ( n − 1) ∑ s ( n y k π k − ˆ t π ) 2 = 1 n ( n − 1) ∑ s ( y k p k − ˆ t π ) 2 . (6.18) p k = π k / n denotes for the n independent draws the one-draw inclusion probabilities. 6.4.5 Effect of simplification in stratified and clustered samples Stratified sampling and cluster sampling are discussed in detail in the following chapters. Here, we state for completeness the simplifications for these designs. The π -estimator of the total with H strata h = 1 , .., H is ˆ t π = H ∑ h=1 ˆ t πh ˆ t πh = ∑ s h y k π k . (6.19) <?page no="99"?> 100 6 Confidence intervals The simplified variance estimator is ̂ V 0 = H ∑ h=1 1 n h ( n h − 1) ∑ s h ( n h y k π k − ˆ t πh ) 2 (6.20) and in the case of simple random sampling, i.e. identical inclusion probabilities within strata, we find ̂ V 0 = H ∑ h=1 N 2 h 1 n h S 2 ys h . (6.21) In the case of clustered sampling with i indexing the individual clusters, the π -estimator is ˆ t = ∑ s I ˆ t i π Ii (6.22) with simplified variance estimator ̂ V 0 = 1 n I ( n I − 1) ∑ s I ( n I ˆ t i π Ii − ˆ t ) 2 . (6.23) 6.4.6 Bootstrap If the analytic derivation of variance estimators is complicated, the bootstrap approach is an alternative. The idea is to treat the obtained sample s as the population and apply the sampling design to this sample to draw (bootstrap-) samples from s with replacement. The bootstrap method may also be used to obtain confidence intervals based on the distribution of the estimation function obtained from the bootstrap samples. <?page no="100"?> 6.5 Exercises 101 6.5 Exercises 1. Derive the Chebyshev-inequality. 2. Explain how you can obtain a conservative confidence interval for your estimator without any specific knowledge about the distribution of the estimating function. 3. Despite the fact that an interval based on the Chebyshevinequality is conservative, it may be too small for a realized sample. Explain. 4. Explain why, in general, variance estimation functions for drawing with replacement are much simpler. 5. We consider sampling with replacement with individual onedraw probabilities p k , k = 1 , . . . , N . Because of drawing with replacement, the individual draws are independent and the probabilities are p k , k = 1 , . . . , N in each draw. Hansen and Hurwitz [1943] proposed the following estimation function ( p− expanded with replacement, in short ‘pwr’) ˆ t pwr = 1 n n ∑ i=1 y k i p k i = 1 n n ∑ i=1 Z i with Z i = y k p k if p k i = p k and P ( Z i = y k p k ) = p k with k = 1 , . . . , N and N ∑ k=1 p k = 1 . The variance V( ˆ t pwr ) of the estimating function ˆ t pwr is V( ˆ t pwr ) = 1 n ∑ U ( y k p k − t ) 2 p k . An estimating function for the variance V( ˆ t pwr ) is ̂ V( ˆ t pwr ) = 1 n ( n − 1) n ∑ i=1 ( y k i p k i − ˆ t pwr ) 2 . (Note: Use the random variate Z i to solve the exercises.) <?page no="101"?> 102 6 Confidence intervals a) Obtain the expected value of the estimation function ˆ t pwr . b) Derive the variance of the estimation function ˆ t pwr . c) Proof that the variance estimation function is unbiased. d) An alternative estimation function is ˆ t π = ∑ ˜ s y k π k = ∑ ˜ s ˇ y k , ˜ s denotes the set of unique elements in the sample. i. Proof that this alternative estimation function is unbiased. ii. Derive a variance estimation function for the alternative estimation function. iii. Proof that the variance estimation function of this alternative estimation function is unbiased. 6. 
Show for the small numerical example that the estimation function proposed by Hansen and Hurwitz is unbiased x 1 = 1 , x 2 = 3 , x 3 = 6 p 1 = 0 . 25 , p 2 = 0 . 35 , p 3 = 0 . 4 . 7. Consider the case of simple random sampling without replacement. Obtain the relative bias of the variance estimation when counterfactually assuming drawing with replacement. Use as an numerical example a sample of size n = 100 and a population of size N = 2500. 8. Consider the population of the PSID data and simple random sampling without replacement. a) Calculate a symmetric interval based on the Chebyshevinequality for a given probability 1 − α = 0 . 9 for the average wage of a sample of size n = 20. b) Calculate an analogue interval for the sample mean of the logarithmic wage. Which interval will be more conservative (why? )? c) Approximately obtain the true probabilities of the two intervals which are supposed to have a probability of 1 − α = 0 . 9 by drawing B = 10 , 000 samples. <?page no="102"?> 7 Stratified sampling In practice, stratified sampling is applied frequently. The population is partitioned into strata according to an attribute of the population elements. Within each strata, some elements are drawn randomly. Stratification usually increases the efficiency in two ways. It lowers the probability of extreme samples and it allows obtaining estimation functions with smaller variance compared to simple random samples. 7.1 Introduction 104 7.2 Some notation and an example 104 7.2.1 Notation 104 7.2.2 Example: Sectors of employment as strata 105 7.3 Estimation of the total 108 7.3.1 Simple random sampling within strata108 7.3.2 Example: simple random sampling within sectors 110 7.4 Choosing the sample size for individual strata110 7.4.1 The minimization problem 111 7.4.2 The Cauchy-Schwarz inequality 112 7.4.3 Solving the minimization problem 113 7.4.4 Example: optimal sampling size within sectors 115 7.5 Sample allocation and efficiency 116 7.5.1 Variance comparisons based on the variance decomposition 117 7.5.2 No stratification versus proportional allocation 118 7.5.3 Proportional allocation versus optimal allocation 118 7.5.4 No stratification versus optimal allocation 119 7.5.5 An efficiency comparison with R 120 7.6 Exercises 123 <?page no="103"?> 104 7 Stratified sampling 7.1 Introduction To obtain a stratified sample, the population U is partitioned into H strata U 1 , . . . , U h , . . . , U H . Within each stratum U h , the number of population elements is N h and a sample of size n h is drawn. 7.2 Some notation and an example We introduce the notation for stratified sampling and use the familiar PSID data set to exemplify stratification using the information on sector of employment. 7.2.1 Notation To allow the application of stratified sampling, the attribute according to which the strata are defined, has to be known for all N elements of the population U . Typically, the variance of estimation functions for stratified sampling is smaller compared to the variance in simple random sampling. The variance decreasing effect is the stronger, the higher the correlation between the attribute used for stratification and the variable of interest Y . The population consists of N elements U = {u 1 , . . . , u k , . . . , u N }, (7.1a) or in short notation U = { 1 , . . . , k, . . . , N }. (7.1b) In stratified sampling the population U is partitioned into H strata U 1 , . . . , U h , . . . , U H . We use the index k to refer to the elements within strata U h U h = {k : k ∈ U h }. 
(7.2) From each stratum U h a random sample s h is drawn. The drawing within one stratum is assumed to be independent from the drawing in other strata. The H samples s h will be combined to the stratified sample s s = s 1 ∪ s 2 ∪ . . . ∪ s H . (7.3) <?page no="104"?> 7.2 Some notation and an example 105 Because of the independence of the drawing within different strata, it holds that p ( s ) = p ( s 1 ) · p ( s 2 ) · . . . · p ( s H ) . (7.4) The number of population elements in stratum h is denoted with N h and the number of elements sampled from stratum h with n h . The size of the population is the sum of elements within all H strata. Population and sample size are given by N = H ∑ h=1 N h and n = H ∑ h=1 n h . (7.5) The population total is the sum of the totals of the H strata t = ∑ U y k = H ∑ h=1 t h = H ∑ h=1 N h ¯ y U h with t h = ∑ U h y k = N h ∑ k=1 y k and ¯ y U h = 1 N h N h ∑ k=1 y k . (7.6) Accordingly, the mean is defined as ¯ y U = H ∑ h=1 W h ¯ y U h with W h = N h N . (7.7) 7.2.2 Example: Sectors of employment as strata We regard the elements in the PSID data set as the population and use the information on sector of employment to stratify the population. The sector variable is coded as follows: 0: Missing, 1: AgricEnergyMin, 2: Manufacturing, 3: Construction, 4: Trade, 5: Transport, 6: Finance, 7: Services. We read the data file and analyse the wage Y using the sector information Z for stratification. We first obtain for each of the H strata the size, the average wage and the variance within strata. d <read.csv2 ("psid_2007_sp.csv") names (d) ## [1] "pid" "hid" "sex" "age" ## [5] "eduyears" "hours" "wage" "sector" table (d$sector) <?page no="105"?> 106 7 Stratified sampling ## ## 0 1 2 3 4 5 6 7 ## 97 24 128 78 102 45 86 440 N <nrow (d) Y <d$wage Z <d$sector Nh <tapply (Y,Z,length) Wh <- Nh/ N mY <mean (Y) mYh <tapply (Y,Z,mean) vYh <tapply (Y,Z,var) sYh <sqrt (vYh) We display the results graphically (see figure 7.1). nam <c ("0: Missing", "1: AgricEnergyMin", "2: Manufacturing", "3: Construction", "4: Trade", "5: Transport", "6: Finance", "7: Services") # set margins par (mar= c (9,4,4,2)+0.1) # figure a plot1 <barplot ( rbind (0: 7,mYh/ 1000),horiz=F,space=2, ylim= c (0,70),col=1,las=2,axes=FALSE, axisnames=FALSE,ylab="(in thousands)") text (plot1, par ('usr')[3], labels=nam, srt=45, adj= c (1.1,1.1), xpd=TRUE, cex=1.6) axis (2) # figure b plot2 <barplot ( rbind (0: 7,sYh/ 1000),horiz=F,space=2, ylim= c (0,70),col=1,las=2,axes=FALSE, axisnames=FALSE,ylab="(in thousands)") text (plot1, par ('usr')[3], labels=nam, srt=45, adj= c (1.1,1.1), xpd=TRUE, cex=1.6) axis (2) The variance decomposition informs about the fraction of the variance that can be attributed to the sectors. anova ( lm (Y~ as.factor (Z))) <?page no="106"?> 7.2 Some notation and an example 107 (in thousands) 0: Missing 1: AgricEnergyMin 2: Manufacturing 3: Construction 4: Trade 5: Transport 6: Finance 7: Services 0 10 20 30 40 50 60 70 (a) mean (in thousands) 0: Missing 1: AgricEnergyMin 2: Manufacturing 3: Construction 4: Trade 5: Transport 6: Finance 7: Services 0 10 20 30 40 50 60 70 (b) standard deviations Figure 7.1: Sectoral mean wages and standard deviations. ## Analysis of Variance Table ## ## Response: Y ## Df Sum Sq Mean Sq F value ## as.factor(Z) 7 74728719986 10675531427 3.06 ## Residuals 992 3464354551095 3492292894 ## Pr(>F) ## as.factor(Z) 0.0035 ** ## Residuals ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 We also calculate the variances between and within the strata and the fraction of explained variance using simple R commands. V <var (Y); V ## [1] 3542625897 # Between strata Vext <sum ((mYh-mY)^2*Nh)/ (N-1); Vext ## [1] 74803524 # Within strata Vint <sum (vYh*(Nh-1))/ (N-1); Vint ## [1] 3467822373 # Fraction explained (%) round (Vext/ V*100,1) ## [1] 2.1 Despite the considerable differences in mean wages, the sector information only explains about two percent of the variance of wages. <?page no="107"?> 108 7 Stratified sampling 7.3 Estimation of the total The π -estimator of the total is the sum of the H π -estimators of the strata totals ˆ t π = H ∑ h=1 ˆ t hπ with ˆ t hπ = ∑ U h I k ˇ y k = N h ∑ k=1 I k y k π k = ∑ s h ˇ y k = n h ∑ k=1 y k π k , (7.8) where ˆ t hπ is the π -estimator of the total of stratum h . Because of the independence between strata, the variance of the π -estimator is simply the sum of the variances of the H total estimators for the strata V( ˆ t π ) = H ∑ h=1 V( ˆ t hπ ) , (7.9a) where the stratum specific variance is obtained through applying the variance estimator to stratum h V( ˆ t hπ ) = ∑ U h V( I k )ˇ y 2 k + ∑∑ U h k6=l Cov( I k , I l )ˇ y k ˇ y l (7.9b) = ∑ U h ∆ kk ˇ y 2 k + ∑∑ U h k6=l ∆ k,l ˇ y k ˇ y l (7.9c) = ∑∑ U h ∆ k,l ˇ y k ˇ y l . (7.9d) The estimating function of the variance for stratum h is obtained based on the n h elements in the sample s h drawn from U h ̂ V( ˆ t hπ ) = ∑∑ s h ˇ ∆ k,l ˇ y k ˇ y l with ˇ ∆ k,l = ∆ kl π kl . (7.10) 7.3.1 Simple random sampling within strata So far, we considered general estimating functions defined for all designs within the strata. A simplification is the case of simple random sampling within each stratum. The sampling fraction for stratum h is f h = n h N h . (7.11) <?page no="108"?> 7.3 Estimation of the total 109 The estimator of the total simplifies towards ˆ t π = H ∑ h=1 ˆ t hπ = H ∑ h=1 N h ¯ y s h with ¯ y s h = 1 n h n h ∑ k=1 y k . (7.12) The variance simplifies towards V( ˆ t π ) = H ∑ h=1 V( ˆ t hπ ) = H ∑ h=1 N 2 h 1 − f h n h S 2 yU h with S 2 yU h = 1 N h − 1 N h ∑ k=1 ( y k − ¯ y U h ) 2 and ¯ y U h = 1 N h N h ∑ k=1 y k , (7.13) which can be estimated using the unbiased estimating functions based on the samples s h ̂ V( ˆ t π ) = H ∑ h=1 ̂ V( ˆ t hπ ) = H ∑ h=1 N 2 h 1 − f h n h S 2 ys h with S 2 ys h = 1 n h − 1 n h ∑ k=1 ( y k − ¯ y s h ) 2 . (7.14) Note that the effect of stratification is twofold: the effect on sampling and the effect on estimation. The stratification results in lower probabilities for extreme samples or even prevents extreme samples with certainty. For the intuition: assume that we are interested in the income Y and partition the population into two strata, the rich and the poor. If we now sample from both strata, samples containing almost only rich or almost only poor individuals are impossible to occur. This effects works the stronger, the more the attribute Z used for stratification is correlated with the variable of interest Y . In addition, it will result in a lower variance of the estimating function, even if we ignore the stratification in estimation. Regarding the second effect, the estimating function for stratified samples eliminates the variance between the strata, which also results in a lower variance compared to estimators that ignore the stratification. 
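The first effect can be illustrated with a small simulation. The following is a minimal sketch based on a purely hypothetical toy population of six "poor" and four "rich" individuals and a sample of size four: under simple random sampling a sample consisting only of poor individuals occurs with a probability of about seven percent, whereas stratified sampling with two elements drawn from each stratum rules such samples out entirely.

set.seed(1)
U <- c(rep("poor", 6), rep("rich", 4))          # hypothetical toy population
B <- 100000
# simple random sampling: share of samples containing only poor individuals
onlyPoor <- replicate(B, all(sample(U, 4) == "poor"))
mean(onlyPoor)                                  # simulated probability
choose(6, 4) / choose(10, 4)                    # exact probability, about 0.071
# stratified sampling with n_h = 2 per stratum can never produce such a sample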
<?page no="109"?> 110 7 Stratified sampling 7.3.2 Example: simple random sampling within sectors We consider the case of equal absolute sample sizes in all strata and choose n h = 20 for all h ∈ 1 , . . . , H . nh <- 20 H <- 8 # Drawing the sample s <list () set.seed (123) for(h in 1: H){ Yh <- Y[Z==(h-1)] s[[h]] <sample (Yh,nh) } # estimator for the population mean mY.d <sum (Wh* unlist ( lapply (s,mean))); mY.d ## [1] 44326 # variance of the mean estimator fh <nh/ Nh vmY.d <- 1/ N^2* sum (Nh^2*(1-fh)/ nh*vYh) sqrt (vmY.d) ## [1] 6905 # estimated variance of the mean estimator vdmY.d <- 1/ N^2* sum (Nh^2*(1-fh)/ nh* unlist ( lapply (s,var))) sqrt (vdmY.d) ## [1] 5359 Based on the specific obtained sample, we underestimate the mean as well as the variance. 7.4 Choosing the sample size for individual strata The sample s consists of the H subsamples s h and the size of s is n = H ∑ h=1 n h . (7.15) In the case of simple random sampling within strata, the variance of the π -estimator of the total is V( ˆ t π,ST ) = H ∑ h=1 V( ˆ t hπ ) = H ∑ h=1 N 2 h 1 − f h n h S 2 yU h . (7.16) <?page no="110"?> 7.4 Choosing the sample size for individual strata 111 We now turn to the question of how to partition the sample size n onto the H strata. The problem will be solved in a generalized setting. Either we want to minimize the variance of the estimator given some specific amount of costs (e.g. for interviewers, travelling, equipment and so on) or we want to minimize the costs given a fixed variance of the estimator. 7.4.1 The minimization problem The sampling costs we assume to be a linear function of the number of sampled elements C = c 0 + H ∑ h=1 c h n h , (7.17) where c 0 denotes the overall fixed costs and c h denotes the stratumspecific variable costs for each sampled element. Because of 1 − f h n h = 1 − n h N h n h = 1 n h − 1 N h (7.18) the variance V( ˆ t π,ST ) can be expressed as V( ˆ t π,ST ) = H ∑ h=1 N 2 h ( 1 n h − 1 N h ) S 2 yU h = H ∑ h=1 1 n h N 2 h S 2 yU h − H ∑ h=1 N h S 2 yU h . (7.19) Only the first sum depends on n h whereas the second sum is constant. Cost minimization given a fixed variance and variance minimization given fixed costs can generally be solved by minimizing the product V C of variance V and costs C V C = ( H ∑ h=1 1 n h N 2 h S 2 yU h − H ∑ h=1 N h S 2 yU h ) × ( c 0 + H ∑ h=1 c h n h ) . (7.20) Because c 0 is fixed as well as the second sum in the variance expression, we only regard the product of the variable costs and <?page no="111"?> 112 7 Stratified sampling the first sum of the variance. Both terms are functions of the strata-specific sample size n h V ′ C ′ = H ∑ h=1 N 2 h S 2 yU h n h H ∑ h=1 c h n h . (7.21) Minimizing V ′ C ′ with respect to n h given n , results in the optimal partition of n onto the H strata. 7.4.2 The Cauchy-Schwarz inequality It is useful to consider the following manipulations n ∑ i=1 n ∑ j=1 ( x i y j − x j y i ) 2 = n ∑ i=1 n ∑ j=1 ( x i y j ) 2 + n ∑ i=1 n ∑ j=1 ( x j y i ) 2 − 2 n ∑ i=1 n ∑ j=1 x i y j x j y i = n ∑ i=1 n ∑ j=1 x 2 i y 2 j + n ∑ i=1 n ∑ j=1 x 2 j y 2 i − 2 n ∑ i=1 n ∑ j=1 x i y j x j y i = n ∑ i=1 x 2 i n ∑ j=1 y 2 j + n ∑ j=1 x 2 j n ∑ i=1 y 2 i − 2 n ∑ i=1 x i y i n ∑ j=1 x j y j = 2 n ∑ i=1 x 2 i n ∑ j=1 y 2 j − 2 ( n ∑ i=1 x i y i ) 2 , (7.22) which result in 1 2 n ∑ i=1 n ∑ j=1 ( x i y j − x j y i ) 2 = n ∑ i=1 x 2 i n ∑ j=1 y 2 j − ( n ∑ i=1 x i y i ) 2 . (7.23) As the left hand side of the equation is the sum of n squares, hence, it is non-negative ( ≥ 0). 
It follows that n ∑ i=1 x 2 i n ∑ j=1 y 2 j − ( n ∑ i=1 x i y i ) 2 ≥ 0 (7.24) and we obtain the Cauchy-Schwarz inequality n ∑ i=1 x 2 i n ∑ j=1 y 2 j ≥ ( n ∑ i=1 x i y i ) 2 . (7.25) <?page no="112"?> 7.4 Choosing the sample size for individual strata 113 Equality holds, if n ∑ i=1 x 2 i n ∑ i=1 y 2 i = n ∑ i=1 x i y i n ∑ i=1 x i y i . (7.26) If we assume that X is proportional to Y , i.e. x i = ay i , we find n ∑ i=1 ( ay i ) 2 n ∑ i=1 y 2 i = n ∑ i=1 ay i y i n ∑ i=1 ay i y i a 2 n ∑ i=1 y 2 i n ∑ i=1 y 2 i = a 2 n ∑ i=1 y 2 i n ∑ i=1 y 2 i . (7.27) Hence, if proportionality is given, i.e. x i / y i = a , the inequality becomes an equality. 7.4.3 Solving the minimization problem The minimization problem of finding minimal variance given costs or minimal costs given variance has been stated previously. We obtained the following expression for the product of variance and costs, regarding only the parts of the variance and cost expressions which depend on the strata-specific sample sizes n h : V ′ C ′ = H ∑ h=1 N 2 h S 2 yU h n h H ∑ h=1 c h n h . (7.28) We observe that the minimization problem has the structure of the Cauchy-Schwarz inequality. The analogy allows to circumvent the usual procedure of finding the minimum through obtaining the derivatives, which here would be rather cumbersome. We replace the variables in the Cauchy-Schwarz inequality with the corresponding expressions of the minimisation problem: n ∑ i=1 x 2 i n ∑ i=1 y 2 i ≥ ( n ∑ i=1 x i y i ) 2 with x i = N h S yU h √n h , y i = √c h n h and x i y i = √c h N h S yU h . (7.29) <?page no="113"?> 114 7 Stratified sampling Therefore, the partition of the sample size n onto the H strata is optimal, if H ∑ h=1 N 2 h S 2 yU h n h H ∑ h=1 c h n h ≥ ( H ∑ h=1 √c h n h N h S yU h √n h ) 2 = ( H ∑ h=1 √c h N h S yU h ) 2 . (7.30) The minimum is obtained if proportionality y i / x i = a ∗ holds √c h n h N h S yUh √n h = √c h n h N h S yU h = a ∗ . (7.31) Given proportionality, we find n h = a ∗ 1 √c h N h S yU h . (7.32) The sample size s is the sum of H strata sample sizes n h n = H ∑ h=1 n h = a ∗ H ∑ h=1 1 √c h N h S yU h . (7.33) The fraction of the sample size n that is allocated towards stratum h is n h n = 1 √c h N h S yU h ∑ H h=1 1 √c h N h S yU h . (7.34) We find that the optimal relative sample size for stratum h is the larger • the smaller the variable cost c h per element in stratum h , • the larger the size N h of stratum U h in the population, • the larger the standard deviation of Y in U h . If the variable costs are identical in all strata, i.e. c h = c , we find n h n = N h S yU h ∑ H h=1 N h S yU h . (7.35) In this case the relative sample size is proportional to the product of size and standard deviation of the strata. This optimal allocation is also called ‘Neyman allocation’ after Neyman [1934]. This allocation has already been derived independently by Tschuprow [1923]. 1 1 Neyman, J. (1934), On two different aspects of the representative method, the method of stratified sampling and the method of purposive selection, <?page no="114"?> 7.4 Choosing the sample size for individual strata 115 7.4.4 Example: optimal sampling size within sectors We choose the sample size n = 160 and obtain the optimal sample sizes n h for all h ∈ 1 , . . . , H strata according to the formula we just derived under the assumption of identical costs in all strata. 
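The formula generally yields non-integer stratum sample sizes, which is why the code below starts with a helper function that corrects for rounding. As a minimal sketch (assuming the stratum sizes Nh and standard deviations sYh computed in section 7.2.2), the raw allocation is:

n <- 160
nh.raw <- n * Nh * sYh / sum(Nh * sYh)   # optimal allocation before rounding
round(nh.raw, 1)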
# function correcting for rounding problems cn <function(n,nhe){ nh <round (nhe) snh <sum (nh); snh while(snh<n){nhe <nhe*1.001 nh <round (nhe); snh <sum (nh)} while(snh>n){nhe <nhe*0.999 nh <round (nhe); snh <sum (nh)} return (nh)} # optimal allocation n <- 160 nh <cn (n,Nh*sYh/ sum (Nh*sYh)*n); nh ## 0 1 2 3 4 5 6 7 ## 8 4 23 10 15 3 14 83 # Drawing the sample s <list () set.seed (12) for(h in 1: H){ Yh <- Y[Z==(h-1)] s[[h]] <sample (Yh,nh[h]) } # estimator for the population mean mY.d <sum (Wh* unlist ( lapply (s,mean))); mY.d ## [1] 41958 # variance of the mean estimator fh <nh/ Nh vmY.d <- 1/ N^2* sum (Nh^2*(1-fh)/ nh*vYh) sqrt (vmY.d) ## [1] 4148 Journal of the Royal Statistical Society, 97, 558-606. Tschuprow, A.A. (1923a), On the Mathematical Expectation of the Moments of Frequency Distributions in the Case of Correlated Observations, Metron, 2, 461-493 and 646-680. <?page no="115"?> 116 7 Stratified sampling # estimated variance of the mean estimator vdmY.d <- 1/ N^2* sum (Nh^2*(1-fh)/ nh* unlist ( lapply (s,var))) sqrt (vdmY.d) ## [1] 3919 Based on the specific sample which has been drawn, we slightly underestimate the mean as well as the variance. 7.5 Sample allocation and efficiency We consider three different sampling designs: • ran: Simple random sampling (no stratification), • prop: Stratified sampling with simple random sampling within strata and sample size n h proportional to N h , • opt: Stratified sampling with simple random sampling within strata and optimal sample size n h proportional to S yU h N h . The variance of the total estimator for simple random sampling without stratification is V( ˆ t π ) = N 2 1 − f n S 2 y . (7.36) For the case of sample sizes being proportional to population strata size the variance of the total estimator is V( ˆ t π,ST ) = H ∑ h=1 V( ˆ t hπ ) = H ∑ h=1 N 2 h 1 − f n N h N S 2 yU h = N n (1 − f ) H ∑ h=1 N h S 2 yU h with n h = n N h N and f h = n h N h = n N = f. (7.37) Given n h = n N h S yU h ∑ H h=1 N h S yU h , (7.38) the variance of the total estimator in the case of optimal allocation of the sample size onto the strata is V( ˆ t π,ST ) = H ∑ h=1 N 2 h ( 1 n h − 1 N h ) S 2 yU h <?page no="116"?> 7.5 Sample allocation and efficiency 117 = H ∑ h=1 N 2 h S 2 yU h 1 n h − H ∑ h=1 N 2 h S 2 yU h 1 N h = H ∑ h=1 N 2 h S 2 yU h 1 ( n N h S yUh ∑ H h=1 N h S yUh ) − H ∑ h=1 N h S 2 yU h (7.39) = 1 n H ∑ h=1 N 2 h S 2 yU h ∑ H h=1 N h S yU h N h S yU h − H ∑ h=1 N h S 2 yU h = 1 n ( H ∑ h=1 N h S yU h ) 2 − H ∑ h=1 N h S 2 yU h . 7.5.1 Variance comparisons based on the variance decomposition The variance decomposition splits the total variance into within strata (first term on the right hand side) and between strata variance (second term on the right hand side) ( N − 1) S 2 = ∑ h ∑ i ( y hi − ¯ y ) 2 = ∑ h ∑ i ( y hi − ¯ y h ) 2 + ∑ h N h (¯ y h − ¯ y ) 2 = ∑ h ( N h − 1) S 2 h + ∑ h N h (¯ y h − ¯ y ) 2 . (7.40) Using N h − 1 N − 1 ≈ N h N − 1 ≈ N h N , (7.41) we find S 2 = ∑ h ( N h − 1) N − 1 S 2 h + ∑ h N h N − 1 (¯ y h − ¯ y ) 2 = ∑ h W h S 2 h + ∑ h W h (¯ y h − ¯ y ) 2 with W h = N h − 1 N ≈ N h N − 1 ≈ N h N . (7.42) <?page no="117"?> 118 7 Stratified sampling Using this decomposition, the variance for simple random sampling can be expressed as V ran = N 2 1 − f n S 2 y = N 2 1 − f n ∑ h W h S 2 yU h + N 2 1 − f n ∑ h W h (¯ y h − ¯ y ) 2 . 
(7.43) 7.5.2 No stratification versus proportional allocation We observe that the variance when stratifying with proportional sample size can be expressed as V prop = N 2 (1 − f ) n H ∑ h=1 W h S 2 yU h (7.44) and equals just the within variance term of the variance with simple random sampling. Hence, it holds that V ran = V prop + N 2 1 − f n ∑ h W h (¯ y h − ¯ y ) 2 . (7.45) 7.5.3 Proportional allocation versus optimal allocation Turning to the comparison of V opt V opt = N 2 n ( H ∑ h=1 W h S yU h ) 2 − N H ∑ h=1 W h S 2 yU h (7.46) with V prop V prop = ( N 2 n − N ) H ∑ h=1 W h S 2 yU h = N 2 n H ∑ h=1 W h S 2 yU h − N H ∑ h=1 W h S 2 yU h , (7.47) we find V prop − V opt = N 2 n H ∑ h=1 W h S 2 yU h − N 2 n ( H ∑ h=1 W h S yU h ) 2 <?page no="118"?> 7.5 Sample allocation and efficiency 119 = N 2 n ( H ∑ h=1 W h S 2 yU h − ¯ S 2 ) (7.48) = N 2 n ( H ∑ h=1 W h S 2 yU h − ¯ S 2 H ∑ h=1 W h ) = N 2 n H ∑ h=1 W h ( S 2 yU h − ¯ S 2 ) = N 2 n H ∑ h=1 W h ( S yU h − ¯ S ) 2 . The last manipulation uses the following identity ∑ W ( S − ¯ S ) 2 = ∑ W ( S 2 + ¯ S 2 − 2 ¯ SS ) = ∑ W S 2 + ∑ W ¯ S 2 − 2 ∑ W ¯ SS = ∑ W S 2 + ¯ S 2 ∑ W − 2 ¯ S ∑ W S (7.49) = ∑ W S 2 + ¯ S 2 − 2 ¯ S 2 = ∑ W S 2 − ¯ S 2 = ∑ W S 2 − ∑ W ¯ S 2 with ∑ h W h = 1 . 7.5.4 No stratification versus optimal allocation Based on the two comparisons V ran − V prop and V prop − V opt , we can obtain V ran − V opt : V ran − V opt = (V ran − V prop ) + (V prop − V opt ) = N 2 1 − f n ∑ h W h (¯ y h − ¯ y ) 2 + N 2 n H ∑ h=1 W h ( S yU h − ¯ S ) 2 (7.50) The first term of the difference is the larger, the more the means of the strata vary. I.e. the larger the fraction of variance which is explained by the strata. The second term is the larger, the more the standard deviations of the strata vary. <?page no="119"?> 120 7 Stratified sampling Summing up, the following order of variance estimating functions holds V ran ≥ V prop ≥ V opt . (7.51) 7.5.5 An efficiency comparison with R We choose the sample size n = 160 and obtain the variance of the mean estimator for the three different sampling designs (ran, prop, opt). n <- 160 # ran f <n/ N; f ## [1] 0.16 vY <var (Y) v.ran <- (1-f)/ n*vY # prop nh <cn (n,Nh/ N*n); nh ## 0 1 2 3 4 5 6 7 ## 16 4 21 12 16 7 14 70 fh <nh/ Nh v.prop <- 1/ N^2* sum (Nh^2*(1-fh)/ nh*vYh) # opt nh <cn (n,Nh*sYh/ sum (Nh*sYh)*n); nh ## 0 1 2 3 4 5 6 7 ## 8 4 23 10 15 3 14 83 fh <nh/ Nh v.opt <- 1/ N^2* sum (Nh^2*(1-fh)/ nh*vYh) v.ran; v.prop; v.opt ## [1] 18598786 ## [1] 18313099 ## [1] 17209682 We find that the gain in efficiency using proportional sample sizes for stratification by sector compared to simple random sampling is rather small. Whereas optimal allocation of the sample size onto the H strata decreases the variance considerably. To obtain the approximate distributions of the three discussed estimating functions for the mean, we carry out a simulation. We draw B = 50 , 000 simple random samples, stratified samples with proportional allocation and stratified samples with optimal allocation. 
<?page no="120"?> 7.5 Sample allocation and efficiency 121 set.seed (123) B <- 50000 n <- 160 nhp <cn (n, round (Nh/ N*n)); sum (nhp) ## [1] 160 nho <cn (n, round (Nh*sYh/ sum (Nh*sYh)*n)); sum (nho) ## [1] 160 # matrix for results e <matrix (NA,B,3) # population strata as list Uh <list () for(h in 1: H)Uh[[h]] <- Y[Z==(h-1)] # lists for samples sp <list () so <list () # Drawing samples set.seed (123) for (b in 1: B){ # generating the three samples for(h in 1: H){ Yh <- Uh[[h]] sp[[h]] <sample (Yh,nhp[h]) so[[h]] <sample (Yh,nho[h]) } sr <sample (Y,n) # mean estimators e[b,1] <mean (sr) e[b,2] <sum ( unlist ( lapply (sp,mean))*Wh) e[b,3] <sum ( unlist ( lapply (so,mean))*Wh) } colMeans (e) ## [1] 44510 44494 44485 apply (e,2,var) ## [1] 18633364 18225271 17432468 For each of the three different designs we find the means of the B estimators to be close to the population mean and the variances to be close to the calculated variances based on the analytic variance expressions. We display the situation graphically in figure 7.2. <?page no="121"?> 122 7 Stratified sampling 30000 40000 50000 60000 0.00000 0.00002 0.00004 0.00006 0.00008 0.00010 ˆ¯ y Density ¯ y opt prop ran Figure 7.2: Approximate distributions of the three discussed estimating functions for the mean and sampling designs. plot ( density (e[,3]),lwd=2,lty=1,main="", xlab= expression ( hat ( bar (y)))) lines ( density (e[,2]),lwd=2,lty=2) lines ( density (e[,1]),lwd=2.5,lty=3) arrows (mY,0.00001,mY,0,length=0.1,angle=15) text (mY,0.000015, expression ( bar (y)),cex=1.3) legend ("topright", c ("opt","prop","ran"),lwd=2, lty=1: 3,bty="n",cex=1.3) Note that the expected allocation for simple random sampling is the proportional allocation. Therefore, the efficiency gain from proportional allocation through sampling is rather small. The gain in efficiency for optimal allocation can mainly be attributed towards the stratification effect of sampling. In sectors with high variances more elements are drawn, resulting in considerably smaller variances of the mean estimation in these strata. <?page no="122"?> 7.6 Exercises 123 7.6 Exercises 1. Consider a population with variable Y and attribute Z to be used for stratification. Y : 1 2 4 3 5 7 6 8 9 Z : 1 1 1 2 2 2 3 3 3 a) Obtain the mean and the variance of Y . b) Obtain the means and variances for the H = 3 strata. c) Calculate the fraction of V ( Y ) that is explained by stratification variable Z . From population U with size N = 9 a simple random sample of size n = 6 is drawn. d) What is the estimating function for the population mean? e) Proof that this estimating function is unbiased. f) Provide an unbiased estimating function for the variance of the mean estimator. g) Calculate the numerical estimator of the variance of the mean estimator. h) What is the smallest numerical estimate possible? What is the largest? 2. Now consider stratified sampling with simple random sampling in the strata and identical sampling fractions f h = 2 / 3 in all strata. a) Obtain the number M = |S| of all possible samples. b) How is the estimating function for the mean defined? c) Proof that the estimating function is unbiased. d) How is the estimating function for the variance of the mean estimator defined? e) Calculate the variance of the mean estimator for the population data given in 1. f) What is the smallest numerical estimate possible? What is the largest? <?page no="123"?> 124 7 Stratified sampling g) Why would you prefer to use a stratification variable being highly correlated with Y ? 
h) How do you value the quality of the stratification variable Z in the numerical example? i) Obtain the optimal Z for the numerical example. 3. Assume that this specific stratified sample s has been drawn: Y : 1 2 5 7 6 8 Z : 1 1 2 2 3 3 a) Calculate the numerical estimator for the population mean. b) Calculate the numerical estimator for the variance of the mean estimator. 4. Consider the PSID data set and the variables wage ( Y ) and eduyears. Generate a classification variable Z with three different values indicating whether the individual obtained less than 12 years of education, 12 or more but less than 16 years of education and 16 or more years of education. Use Z to stratify U . a) Obtain two vectors containing the size and the relative size of the strata. b) Obtain the means and variances for the H = 3 strata. c) Calculate the share of the variance of Y that is explained by the stratification variable Z . d) Calculate the variances for sample size n = 120 in the case of i. Simple random sampling without stratification. ii. Stratification and proportional sample size n h . iii. Stratification and optimal sample size n h . e) Generate one sample for each sampling design and calculate the estimator for the mean and the estimator for the variance of the mean. f) Draw B = 10 , 000 samples for each design and obtain the approximate distributions of the three different estimating functions for the population mean. Comment on your results. <?page no="124"?> 8 Cluster sampling Especially in surveys conducted in the form of personal interviews, regional dispersion of interviewees is expensive. To reduce costs, elements of the population are often combined towards geographical clusters and clusters are sampled. As a result, the sample elements are geographically concentrated. Since geographically clustered elements often reveal some similarities, cluster sampling tends to increase the variance of estimating functions. 8.1 Introduction 126 8.2 Notation 126 8.2.1 Clustering the population 126 8.2.2 Artificially clustering the PSID sample127 8.2.3 Sampling clusters 127 8.2.4 Inclusion probabilities 128 8.3 Estimating the population total 129 8.3.1 The π -estimator of the population total 129 8.3.2 Variance of the π -estimator of the population total 130 8.4 Simple random sampling of clusters (SIC) 131 8.4.1 The π -estimator of the population total 131 8.4.2 The π-estimator in the PSID example 131 8.4.3 Variance of the π -estimator of the population total 132 8.4.4 Variance of the mean estimator in the PSID example 133 8.5 Exercises 135 <?page no="125"?> 126 8 Cluster sampling 8.1 Introduction The main motivation for cluster sampling is to lower costs. If the data are sampled by means of personal interviews, costs increase with the geographical dispersion of sample elements. E.g., assume that 5000 interviews with pupils shall be conducted. If we sample a few schools and interview all pupils of these schools, the interview costs will be lowered considerably compared to simple random sampling of pupils dispersed over all schools. However, pupils from one school probably have some similarities and may differ systematically from pupils of other schools in several respects. Hence, the survey results may differ strongly depending on which specific schools are sampled. This is tantamount to high variances of estimating functions. Another motivation for cluster sampling is the lack of an up to date list of population units to sample from. 
In this case, some areas can be sampled randomly and interviews can be conducted within these areas without a sufficient list of the population elements. In practice, clusters are often built on several levels, e.g. states, cities, streets, apartment blocks, and so on. The main motivation is to reduce costs by geographical clustering, but inhabitants of small regional clusters are rather similar with respect to many characteristics. Therefore, in cluster sampling the variance of estimating functions is typically considerably higher compared to simple random sampling. 8.2 Notation 8.2.1 Clustering the population We consider the population U with elements u k or in short with elements k U = { 1 , . . . , k, . . . , N }. (8.1) The population is partitioned into N I clusters U 1 , . . . , U i , . . . , U N I (8.2a) or using the index for abbreviation only U I = { 1 , . . . , i, . . . , N I }. (8.2b) Hence, the population is the union of the N I clusters U = ∪ i ∈ U I U i (8.3) <?page no="126"?> 8.2 Notation 127 and the number of elements N in the population is the sum of the size of all clusters N = ∑ i ∈ U I N i = ∑ U I N i . (8.4) 8.2.2 Artificially clustering the PSID sample We use the PSID data set which we regard as the population. We use (admittedly very artificially) information on sector, years of schooling and age for building clusters and hence, we obtain clusters which contain mainly persons working in the same sector, having similar amounts of schooling and who are of similar age. The identical cluster size will be N i = 10 for all clusters and the variable cl is used to indicate the affiliation to the clusters. d <read.csv2 ("psid_2007_sp.csv") o <order (d$sector,d$eduyears,d$age) d <d[o,] Y <d$wage N <nrow (d); N ## [1] 1000 NI <- 100 Ni <- 10 cl <rep (1: NI,each=Ni) tI <tapply (Y,cl,sum) Figure 8.1 shows the N I = 100 cluster totals and the estimated density. plot ( density (tI),lwd=2,xlab="Cluster totals",main="") rug (tI) We observe that the majority of cluster totals is scattered around 400 , 000 US dollars. The distribution is right skewed and the maximum of cluster totals exceeds 2 million US dollars. 8.2.3 Sampling clusters n I clusters are sampled out of the N I clusters in the population. The sampling probabilities of clusters are determined by the sampling design p i ( · ). At simple (one stage) cluster sampling all N i elements of the sampled cluster i will be considered (interviewed). <?page no="127"?> 128 8 Cluster sampling 0 500000 1500000 2500000 Cluster totals Density ·10 −6 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Figure 8.1: Distribution of the cluster totals. A clustered sample s consists of n I clusters drawn from the N I population clusters. s I denotes the index set of clusters which have been sampled. A sample s consists of clusters U i with i ∈ s I s = ∪ i ∈ s I U i . (8.5) The sample size results from the number of elements contained in the n I sampled clusters n s = ∑ i ∈ s I N i . (8.6) Note that only in the very specific case in which all clusters are of equal size (as in our example) the sample size n s is determined by n I . If the clusters U i vary in size, the sample size n s depends on s I and therefore is a random variate. 8.2.4 Inclusion probabilities The inclusion probability π Ii of clusters i is determined by the sampling design p I ( · ) π Ii = ∑ s I 3 i p I ( s I ) ( cp. ∑ s 3 k p ( s ) ) . (8.7) Regarding two clusters i and j , we find π Iij = ∑ s I 3 i&j p I ( s I ) . 
(8.8) <?page no="128"?> 8.3 Estimating the population total 129 In one stage cluster sampling all elements of the sampled clusters are observed. Therefore, the inclusion probability of element k is identical to the inclusion probability of cluster i containing element k : π k = P( k ∈ s ) = P( i ∈ s I ) = π Ii . (8.9) For obtaining the second order inclusion probabilities π kl , we have to differentiate whether k and l belong to the same cluster or to different ones. If k and l belong to the same cluster, we obtain π kl = P( k & l ∈ s ) = P( i ∈ s I ) = π Ii . (8.10) If k and l belong to different clusters we obtain π kl = P( k & l ∈ s ) = P( i & j ∈ s I ) = π Iij . (8.11) 8.3 Estimating the population total We denote the total of variable Y in cluster i with t i t i = ∑ k ∈ U i y k . (8.12) The population total t can be expressed as t = ∑ k ∈ U y k = ∑ i ∈ U I t i = ∑ i ∈ U I ∑ k ∈ U i y k (8.13) and accordingly the population mean as ¯ y U = 1 N ∑ k ∈ U y k = 1 N ∑ i ∈ U I t i = 1 N ∑ i ∈ U I ¯ y i N i . (8.14) 8.3.1 The π-estimator of the population total The π -estimator is defined as ˆ t π = ∑ i ∈ s I t i π Ii = ∑ i ∈ s I ˇ t i , (8.15) i.e. the n I totals t i in the sample will be expanded using the cluster inclusion probabilities π Ii . Note that the cluster totals t i are known exactly as all elements of the sampled clusters are observed. <?page no="129"?> 130 8 Cluster sampling 8.3.2 Variance of the π-estimator of the population total The variance of the π -estimator of the total can be obtained as V( ˆ t π ) = ∑ i ∈ U I ∑ j ∈ U I Cov( I Ii , I Ij ) t i π Ii t j π Ij = ∑ i ∈ U I ∑ j ∈ U I ∆ Iij ˇ t i ˇ t j = ∑ i ∈ U I ∑ j ∈ U I ( π Iij − π Ii π Ij ) ˇ t i ˇ t j . (8.16) Based on a sample the variance of the estimator can be estimated through expanding the covariances using the second order inclusion probabilities: ˇ ∆ Iij = ∆ Iij π Iij . (8.17) The variance estimator results as ̂ V( ˆ t π ) = ∑ i ∈ s I ∑ j ∈ s I ˇ ∆ Iij ˇ t i ˇ t j . (8.18) If the number of clusters in the sample n I is fixed a priori, the variance can be simplified towards V( ˆ t π ) = − 1 2 ∑ i ∈ U I ∑ j ∈ U I ∆ Iij ( ˇ t i − ˇ t j ) 2 = − 1 2 ∑ i ∈ U I ∑ j ∈ U I ∆ Iij ( t i π Ii − t j π Ij ) 2 . (8.19) Note that if all expanded totals were identical, i.e. ˇ t i = ˇ t j , the variance would be 0. With a fixed number of clusters n I , the variance estimator based on a sample is ̂ V( ˆ t π ) = − 1 2 ∑ i ∈ s I ∑ j ∈ s I ˇ ∆ Iij ( ˇ t i − ˇ t j ) 2 = − 1 2 ∑ i ∈ s I ∑ j ∈ s I ˇ ∆ Iij ( t i π Ii − t j π Ij ) 2 . (8.20) We observe that for the case of rather similar cluster totals and identical inclusion probabilities the variance would be small. If the cluster totals differ, inclusion probabilities which are strongly correlated with the totals would result in similar expanded totals and in a small variance. Unfortunately, in practice we often find strong differing cluster totals and identical inclusion probabilities which results in large variances. <?page no="130"?> 8.4 Simple random sampling of clusters (SIC) 131 8.4 Simple random sampling of clusters (SIC) 8.4.1 The π-estimator of the population total In cluster sampling with simple random sampling of clusters, all clusters have identical inclusion probabilities, i.e. π Ii = n I / N I . The π -estimator can be expressed as average cluster total in the sample multiplied by the number of clusters N I in the population ˆ t π = ∑ i ∈ s I t i π Ii = ∑ i ∈ s I t i n I N I = N I 1 n I ∑ i ∈ s I t i = N I ¯ t s I with ¯ t s I = 1 n I ∑ i ∈ s I t i . 
(8.21) 8.4.2 The π-estimator in the PSID example We consider cluster samples of n I = 10 clusters which result in a sample size of n = 100. We find the number of possible samples |S| = M as nI <- 10 choose (NI,nI) ## [1] 17310309456440 As M is rather big, we will not completely enumerate all possible samples but approximate the distribution of the π -estimator by drawing B = 100 , 000 samples. For each of the samples, we calculate the numeric estimator of the population mean. We additionally draw B simple random samples of size n = 100 for comparison. mY <mean (Y) n <- 100 fI <nI/ NI; fI ## [1] 0.1 B <- 100000 e1 <rep (NA,B) e2 <e1 set.seed (123) for (i in 1: B){ e1[i] <- 1/ fI/ N* sum ( sample (tI,nI)) e2[i] <mean ( sample (Y,n)) } <?page no="131"?> 132 8 Cluster sampling 30000 40000 50000 60000 70000 80000 ˆ¯ y Density ·10 −5 0 2 4 6 ¯ y cluster sampling simple random sampling Figure 8.2: Approximate distribution of the mean estimator. The approximated distribution of the π -estimator of the population mean is shown in figure 8.2. plot ( density (e2),lwd=2.5,lty=3,main="", xlab= expression ( hat ( bar (y)))) lines ( density (e1),lwd=2,lty=1) arrows (mY,1.3e-05,mY,0,length=0.15,angle=15) text (mY,1.6e-05, expression ( bar (y)),cex=1.3) legend ('topright', c ("cluster sampling","simple random sampling"), lwd=2,lty= c (1,3),bty="n",cex=1.3) 8.4.3 Variance of the π-estimator of the population total The variance using the sampling fraction of clusters f I simplifies towards V SIC ( ˆ t π ) = N 2 I 1 − f I n I S 2 tU I with f I = n I N I , S 2 tU I = 1 N I − 1 ∑ i ∈ U I ( t i − ¯ t U I ) 2 (8.22) and ¯ t U I = 1 N I ∑ i ∈ U I t i . <?page no="132"?> 8.4 Simple random sampling of clusters (SIC) 133 The variance can be estimated unbiasedly based on the sample as ̂ V SIC ( ˆ t π ) = N 2 I 1 − f I n I S 2 ts I with f I = n I N I , S 2 ts I = 1 n I − 1 ∑ i ∈ s I ( t i − ¯ t s I ) 2 (8.23) and ¯ t s I = 1 n I ∑ i ∈ s I t i . The variance depends solely on the differences between cluster totals. Based on the variance decomposition, the variance can be expressed as the variance within clusters and between clusters. As the variance of the cluster estimator of the population total contains only the differences between cluster totals, clusters should optimally be generated to have similar cluster totals resulting in a large variance within clusters and small variance between clusters. In practice, geographically clustered elements tend to be rather homogeneous, resulting in a small within cluster variance and a large between cluster variance. The variance of the estimator accordingly tends to be rather large and is most often considerably larger compared to the case of simple random sampling with same size n . 8.4.4 Variance of the mean estimator in the PSID example We calculate the variance of the mean estimator in our example analytically as well as approximately based on the B realized samples. V.SIC <- 1/ N^2*NI^2*(1-fI)/ nI* var (tI); V.SIC ## [1] 76209050 var (e1)/ V.SIC ## [1] 0.9957 The approximation error is negligible. Comparing the variance of the mean estimator for cluster sampling with the variance of the mean estimator for simple random sampling, we find V.SI <- 1/ n*(1-n/ N)* var (Y); V.SI ## [1] 31883633 V.SIC/ V.SI ## [1] 2.39 <?page no="133"?> 134 8 Cluster sampling Therefore, the homogeneity within clusters and heterogeneity of cluster totals (means) results in about 2 . 4 times the variance of simple random sampling. 
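That this inflation is driven by the homogeneity created by sorting on sector, education and age can be checked with a small sketch. The following assumes the objects Y, N, NI, Ni, nI, fI and V.SI from above; the randomly formed clusters are purely illustrative and not part of the original example.

set.seed(123)
cl.rand <- sample(rep(1:NI, each = Ni))            # random cluster membership
tI.rand <- tapply(Y, cl.rand, sum)                 # totals of the random clusters
V.rand <- 1/N^2 * NI^2 * (1 - fI)/nI * var(tI.rand)
V.rand / V.SI                                      # typically close to 1 for random clusters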
We finally look at the estimation of the variance based on the samples. We generate B = 100 , 000 samples and calculate for each sample the numerical variance estimator. B <- 100000 e <rep (NA,B) set.seed (123) for (i in 1: B) { e[i] <- 1/ N^2*NI^2*(1-fI)/ nI* var ( sample (tI,nI)) } mean (e) ## [1] 76139671 mean (e)/ V.SIC ## [1] 0.9991 mean (e<0.5*V.SIC) ## [1] 0.4264 We find that the mean of the B numeric variance estimates deviates only by about 0.1% from the true variance. The distribution of the variance estimator of the mean estimator is displayed in figure 8.3. plot ( density (e),lwd=2,lty=1,main="", xlab= expression ( hat (V)*"("* hat ( bar (y))*")")) arrows (V.SIC,4e-09,V.SIC,0,length=0.1,angle=15) text (V.SIC,5e-09, expression ( hat (V)*"("* hat ( bar (y))*")"),cex=1.3) Note that the estimating function of the variance has itself a very high variance and is strongly non-normally distributed. In about 43% of the samples the true variance will be underestimated by more than 50%, resulting in strongly misleading confidence intervals. <?page no="134"?> 8.5 Exercises 135 0 100000000 300000000 ̂ V(ˆ¯ y) Density ·10 −8 0.0 0.5 1.0 1.5 ̂ V(ˆ¯ y) Figure 8.3: Approximate distribution of the variance of the mean estimator. 8.5 Exercises 1. Consider the following population with variable Y and cluster indicator Z : Y : 1 2 4 3 5 7 6 8 9 Z : 1 1 1 2 2 2 3 3 3 The sample will consist of n I = 2 clusters drawn randomly with identical inclusion probabilities. a) Despite fixed n I in practice the sample size n s is often a random variate depending on the specific sample ( s ) chosen. Is n s random in this small example? b) Obtain the total of the population and the three possible numeric estimators of the population total based on samples with n I = 2. c) Obtain the variance of the total estimator based on complete enumeration as well as analytically. d) Compare the variance of the three estimators for the population mean numerically. • Simple random sampling with sample size n = 6 • Stratified sampling with identical sampling fraction f h in all strata and sample size n = 6. Z indicates the partition of U into H strata (see the exercises of chapter 7). <?page no="135"?> 136 8 Cluster sampling • Cluster sampling with n I = 2 and identical inclusion probabilities for all clusters. 2. Assume that the following cluster sample has been drawn: Y : 1 2 4 3 5 7 Z : 1 1 1 2 2 2 a) Calculate the numeric estimator of the population mean. b) Calculate the variance estimator of the mean estimator. c) Explain the variance decomposition in the context of cluster sampling. What is the share of the between cluster (within cluster) variance of the total variance for the population given in exercise 1? d) Do you regard the clustering in the example as optimal? e) How would you optimally aggregate units towards clusters? 3. Consider the PSID data set. Sort the data according to sector, years of education and age. Build n I = 200 clusters of identical cluster size N i = 5. a) Obtain a vector of length N I containing the cluster totals. b) Draw a cluster sample of size n I = 20 and calculate the numeric estimator of the population mean and of the estimator of its variance. c) Obtain the approximate distributions of the estimator of the population mean and of the estimator of its variance. Describe both distributions using descriptive statistics. d) Compare the distributions with the corresponding distributions for simple random sampling with identical sample size n = 100. 
<?page no="136"?> 9 Auxiliary variables Sometimes useful information is available for all elements of the population that can be used to increase the efficiency. Both, for the sampling design and for the estimating functions, auxiliary information can lead to more efficiency. 9.1 Introduction 138 9.2 The ratio estimator 139 9.2.1 Example of the ratio estimator using PSID data 140 9.2.2 Taylor series expansion 142 9.2.3 The approximate variance of the ratio estimator 145 9.2.4 Estimating the approximate variance of the ratio estimator using PSID data146 9.2.5 Comparison of the ratio estimator with the simple π-estimator 148 9.2.6 The ratio estimator in the regression context 149 9.2.7 The linear regression model under specific heteroskedasticity assumption 151 9.3 The difference estimator 152 9.3.1 The difference estimator using regression notation 152 9.3.2 Properties of the difference estimator 153 9.3.3 The difference estimator of average wage using the PSID data 154 9.4 Exercises 158 <?page no="137"?> 138 9 Auxiliary variables 9.1 Introduction We want to gain information on variable Y by means of survey sampling. While Y is not known for the elements in population U , assume that a variable X is known for elements in U and assume also that it is known from previous analysis, that Y and X correlate rather strongly. In this case, the available information on the auxiliary variable X can be used to improve the sampling design as well as to increase the efficiency of the estimating function. As an example, we can think of a population of firms. We would like to learn about the total of sales ( t Y ) by means of a survey. There is information available on employment ( X ) for each element in the population, i.e. X k is known for all k ∈ U . Consider another example. Assume we want to estimate the average yearly wage ¯ Y U in the population U and have information on the number of educational years X k obtained for all k ∈ U . d <read.csv2 ("psid_2007_sp.csv") Y <d$wage X <d$eduyears cor (Y,X) ## [1] 0.2866 We find that in the population the correlation between years of education and wage is about Cor ( Y, X ) = 0 . 29. Figure 9.1 shows the scatter plot and the estimated linear relation of years of education and wages. Yt <- Y/ 1000 # figure a plot (X,Yt,xlim= c (5,18),xlab="Years of education", ylab="Wage") abline ( lm (Yt~X),lwd=2) # figure b plot (X,Yt,xlim= c (5,18),ylim= c (0,200), xlab="Years of education",ylab="Wage") abline ( lm (Yt~X),lwd=2) Of course, the exact relation between years of education and wages is in practice not known before the sample is drawn. However, we may very well assume that there is some correlation. This information can be used e.g. to stratify the population by years of education or to assign inclusion probabilities related to years of education. Besides using this information for sampling, we can <?page no="138"?> 9.2 The ratio estimator 139 6 8 10 12 14 16 18 Years of education 0 200 400 600 800 1000 Wage (a) all wages 6 8 10 12 14 16 18 Years of education 0 50 100 150 200 Wage (b) zoom of wage interval [0; 200] Figure 9.1: Years of education and wage (th. US Dollar). also use the information for estimation. In the following we will focus on the use of the auxiliary variable X in estimation. Based on the sample, we obtain estimators ˆ t X of the total of the auxiliary variable X and ˆ t y for the total of the variable of interest Y . 
If the true total $t_x$ is known, there are three possibilities to potentially "improve" the estimator $\hat{t}_y$:
• Ratio estimator,
• Difference estimator,
• Regression estimator.

9.2 The ratio estimator

The ratio of the two totals in the population $U$ is
$$r = \frac{t_y}{t_x}. \qquad (9.1)$$
As $t_y$ is not known, the ratio $r$ is also unknown. Based on a sample, we obtain the two estimators $\hat{t}_y$ and $\hat{t}_x$ of the totals. The ratio can be estimated by simply replacing the unknown population totals with their estimators based on the sample:
$$\hat{r} = \frac{\hat{t}_y}{\hat{t}_x}. \qquad (9.2)$$
While $\hat{t}_y$ is an estimator of the total, the ratio estimator makes use of the sample information on $X$ contained in $\hat{r}$:
$$\hat{t}_{ra} = \hat{r}\,t_x = \frac{\hat{t}_y}{\hat{t}_x}\,t_x = \frac{t_x}{\hat{t}_x}\,\hat{t}_y. \qquad (9.3)$$
The ratio estimator $\hat{t}_{ra}$ for the total $t_y$ "corrects" the free estimator $\hat{t}_y$ using information on an under- or overestimation of $t_x$ by $\hat{t}_x$. That is, if the total estimator $\hat{t}_x$ underestimates $t_x$ and $X$ and $Y$ are correlated, we may assume that the total $t_y$ is also underestimated by $\hat{t}_y$ for this sample. This idea becomes more evident if we insert the estimating function for the $\pi$-estimator $\hat{t}_y$:
$$\hat{t}_{ra} = \frac{t_x}{\hat{t}_x}\,\hat{t}_y = \sum_s \frac{t_x}{\hat{t}_x}\,\frac{y_k}{\pi_k}. \qquad (9.4)$$
Again, we observe that each individual expanded value $y_k/\pi_k$ is corrected by $t_x/\hat{t}_x$ to take into account whether the sample is "too small" or "too big". Unfortunately, the estimating function contains $\hat{r}$, which is the ratio of two random variates $\hat{t}_y$ and $\hat{t}_x$:
$$f(\hat{t}_y, \hat{t}_x) = \hat{r} = \frac{\hat{t}_y}{\hat{t}_x}. \qquad (9.5)$$
Due to this non-linearity, $\hat{t}_{ra}$ is only approximately unbiased and its variance can only be derived approximately. For the derivation of the variance, we have to make use of the Taylor series expansion. We first look at a small simulation using the PSID data before introducing the Taylor series expansion with some basic examples.

9.2.1 Example of the ratio estimator using PSID data

We consider simple random sampling and focus on estimation with the ratio estimator. We continue the example of estimating the average wage of the population, using the information on years of education as potentially useful auxiliary information. The ratio estimator of the mean wage is the ratio estimator of the total divided by $N$:
$$\hat{\bar{y}}_{ra} = \frac{1}{N}\,\frac{t_x}{\hat{t}_x}\,\hat{t}_y = \bar{y}_s\,\frac{\bar{x}_U}{\bar{x}_s}. \qquad (9.6)$$
We draw $B = 10{,}000$ samples of size $n = 100$ and estimate the density of the ratio estimator based on the $B$ realizations.

mX <- mean(X); mX
## [1] 12.96
mY <- mean(Y); mY
plot ( density (e[,1]),lwd=2,main="", xlab="Ratio estimator") arrows (mY,1.3e-05,mY,0,length=0.1,angle=15) text (mY,1.6e-05, expression ( bar (y)),cex=1.3) lines ( density (e[,2]),lwd=2,lty=2) legend ('topright', c ("ratio estimator","free estimator"), lwd=2,lty= c (1,2),bty="n",cex=1.3) <?page no="141"?> 142 9 Auxiliary variables 9.2.2 Taylor series expansion If function f ( x ) is infinitely differentiable in the neighbourhood of x 0 , it can be expressed as f ( x ) = ∞ ∑ s=0 1 s ! ∂ s f ∂x s ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) s (9.7) Function f ( x ) at point x can be approximated through the expanded series of order n at x 0 f ( x ) = f ( x 0 ) 0! + 1 1! ∂f ( x ) ∂x ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) + 1 2! ∂ 2 f ( x ) ∂x 2 ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) 2 (9.8) + . . . + 1 n ! ∂ n f ( x ) ∂x n ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) n + R n . The more derivatives we use, the smaller is the approximation error R n . The first order approximation uses only the first derivative f ( x ) = f ( x 0 ) 0! + 1 1! ∂f ( x ) ∂x ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) + R 1 , (9.9) whereas the second order approximation uses additionally the second derivative f ( x ) = f ( x 0 ) 0! + 1 1! ∂f ( x ) ∂x ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) + 1 2! ∂ 2 f ( x ) ∂x 2 ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) 2 + R 2 , (9.10) and the third order approximation uses the first, second and third derivative f ( x ) = f ( x 0 ) 0! + 1 1! ∂f ( x ) ∂x ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) + 1 2! ∂ 2 f ( x ) ∂x 2 ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) 2 + 1 3! ∂ 3 f ( x ) ∂x 3 ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) 3 + R 3 . (9.11) <?page no="142"?> 9.2 The ratio estimator 143 We exemplify the expansion using a very simple function f ( x ) = 2 log x. At point x = 6 we find f (6) = 2 log 6 = 3 . 5835 . This function value can be approximated by expanding the function f ( x ) around x 0 = 5. The derivatives of f ( x ) are ∂f ( x ) ∂x = ∂ 2 log x ∂x = 2 x , ∂ 2 f ( x ) ∂x 2 = ∂ 2 2 log x ∂x 2 = ∂ 2 x ∂x = − 2 x 2 and ∂ 3 f ( x ) ∂x 3 = ∂ 3 2 log x ∂x 3 = ∂ ( − 2 x 2 ) ∂x = 4 x 3 . Using the first order Taylor series approximation, we find f ( x ) ≈ f ( x 0 ) 0! + 1 1! ∂f ( x ) ∂x ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) = 2 log x 0 + 2 x ∣ ∣ ∣ ∣ x 0 ( x − x 0 ) f (6) = 2 log 5 + 2 5 (6 − 5) = 3 . 6189 . For the second order approximation, we find f (6) ≈ 2 log 5 + 2 5 (6 − 5) + 1 2! ( − 2 5 2 ) (6 − 5) = 3 . 5789 . And for the third order approximation, we find f (6) ≈ 2 log 5 + 2 5 (6 − 5) + 1 2! ( − 2 5 2 ) (6 − 5) + 1 3! 4 5 3 (6 − 5) = 3 . 5842 . In an analogue way, we can also approximate functions with two variables. If function f ( x, y ) is infinitely differentiable in the neighbourhood of x 0 , y 0 it can be expressed as f ( x, y ) = ∞ ∑ s,t=0 1 s ! t ! ∂ s+t f ∂x s ∂y t ∣ ∣ ∣ ∣ x 0 ,y 0 ( x − x 0 ) s ( y − y 0 ) t . (9.12) <?page no="143"?> 144 9 Auxiliary variables As an example, we look at the following function f ( x, y ) with two variables f ( x, y ) = 2 log x + xy 5 . As the expressions become increasingly complex, we restrict the example to a first and second order approximation. The first order approximation of f ( x, y ) expanded around ( x 0 , y 0 ) is f ( x, y ) = f ( x 0 , y 0 ) + ∂f ∂x ∣ ∣ ∣ ∣ x 0 ,y 0 ( x − x 0 ) + ∂f ∂y ∣ ∣ ∣ ∣ x 0 ,y 0 ( y − y 0 ) + R 1 . The second order approximation of f ( x, y ) expanded around ( x 0 , y 0 ) is f ( x, y ) = f ( x 0 , y 0 ) + ∂f ∂x ∣ ∣ ∣ ∣ x 0 ,y 0 ( x − x 0 ) + ∂f ∂y ∣ ∣ ∣ ∣ x 0 ,y 0 ( y − y 0 ) + ∂ 2 f ∂x∂y ∣ ∣ ∣ ∣ x 0 ,y 0 ( x − x 0 )( y − y 0 ) + 1 2! ∂ 2 f ∂x 2 ∣ ∣ ∣ ∣ x 0 ,y 0 ( x − x 0 ) 2 + 1 2! ∂ 2 f ∂y 2 ∣ ∣ ∣ ∣ x 0 ,y 0 ( y − y 0 ) 2 + R 2 . 
Assume that we want to find the function value f ( x = 2 . 8 , y = 1 . 1) and we expand the function around the initial point x 0 = 2 . 5 , y 0 = 1. The derivatives of the function are ∂ (2 log x + xy 5 ) ∂x = 1 x ( xy 5 + 2 ) , ∂ 2 (2 log x + xy 5 ) ∂x 2 = − 2 x 2 , ∂ (2 log x + xy 5 ) ∂y = 5 xy 4 , ∂ 2 (2 log x + xy 5 ) ∂y 2 = 20 xy 3 and ∂ (2 log x + xy 5 ) ∂x∂y = 5 y 4 . The true function value f ( x, y ) is f ( x = 2 . 8 , y = 1 . 1) = 2 log 2 . 8 + 2 . 8 · 1 . 1 5 = 6 . 5687 . <?page no="144"?> 9.2 The ratio estimator 145 The first order approximation of f ( x, y ) based on the expanded series around point ( x 0 , y 0 ) results in f ( x = 2 . 8 , y = 1 . 1) ≈ 2 log 2 . 5 + 2 . 5 · 1 5 + 1 2 . 5 ( 2 . 5 · 1 5 + 2 ) (2 . 8 − 2 . 5) + 5 · 2 . 5 · 1 4 (1 . 1 − 1) = 6 . 1226 . The second order approximation of f ( x, y ) based on the expanded series around point ( x 0 , y 0 ) results in f ( x = 2 . 8 , y = 1 . 1) ≈ 2 log 2 . 5 + 2 . 5 · 1 5 + 1 2 . 5 ( 2 . 5 · 1 5 + 2 ) (2 . 8 − 2 . 5) + 5 · 2 . 5 · 1 4 (1 . 1 − 1) + 5 · 1 4 (2 . 8 − 2 . 5)(1 . 1 − 1) + 1 2 ( − 2 2 . 5 2 ) (2 . 8 − 2 . 5) 2 + 1 2 · 20 · 2 . 5 · 1 3 (1 . 1 − 1) 2 = 6 . 5082 9.2.3 The approximate variance of the ratio estimator The ratio estimator is a function of the two total estimators ˆ t y and ˆ t x , hence we are interested in the function value f ( ˆ t y , ˆ t x ) = ˆ t y / ˆ t x . We approximate this function through the Taylor series expansion around point ( t y , t x ). Using the first order approximation, we find ˆ t y ˆ t x = f ( ˆ t y , ˆ t x ) ≈ f ( t y , t x ) + ∂f ∂ ˆ t y ∣ ∣ ∣ ∣ t x ,t y ( ˆ t y − t y ) + ∂f ∂ ˆ t x ∣ ∣ ∣ ∣ t x ,t y ( ˆ t x − t x ) = t y t x + 1 t x ( ˆ t y − t y ) − t y t 2x ( ˆ t x − t x ) = t y t x + ˆ t y t x − t y t x − t y ˆ t x t 2x + t y t x = t y t x + ˆ t y t x − t y ˆ t x t 2x = t y t x + 1 t x ( ˆ t y − r ˆ t x ) . (9.13) The approximation ˆ t y ˆ t x ≈ t y t x + 1 t x ( ˆ t y − r ˆ t x ) (9.14) <?page no="145"?> 146 9 Auxiliary variables we insert into the definition of the ratio estimator ˆ t ra = ˆ rt x = ˆ t y ˆ t x t x ≈ [ t y t x + 1 t x ( ˆ t y − r ˆ t x ) ] t x = t y + ˆ t y − r ˆ t x . (9.15) Note that the first order approximation is now a linear function of random variates. The expected value of the first order approximation is E ( t y + ˆ t y − r ˆ t x ) = E ( t y ) + E ( ˆ t y ) − E ( r ˆ t x ) = t y + t y − t y t x t x = t y . (9.16) The variance of the linear approximation is used as an approximation of the true variance of the ratio estimator AV ( ˆ t ra ) = V ( t y + ˆ t y − r ˆ t x ) = V ( ˆ t y ) + r 2 V ( ˆ t x ) − 2 r Cov ( ˆ t y , ˆ t x ) with Cov ( ˆ t y , ˆ t x ) = E [( ˆ t y − t y ) ( ˆ t x − t x )] . (9.17) The approximate variance can also be expressed using population variances AV ( ˆ t ra ) = N 2 1 n ( 1 − n N ) ( S 2 y,U + r 2 S 2 x,U − 2 rS xy,U ) = N 2 ( 1 n − 1 N ) ( S 2 y,U + r 2 S 2 x,U − 2 rS xy,U ) . (9.18) As the variances of the population are unknown, the variances have to be estimated based on the sample. The estimator of the approximate variance is ̂ AV ( ˆ t ra ) = N 2 ( 1 n − 1 N ) ( S 2 y,s + ˆ r 2 S 2 x,s − 2ˆ rS xy,s ) . (9.19) 9.2.4 Estimating the approximate variance of the ratio estimator using PSID data The approximative variance of the ratio estimator for the population mean for sample size n = 100 using formula AV ( ˆ ¯ y ra ) = ( 1 n − 1 N ) ( S 2 y,U + r 2 S 2 x,U − 2 rS xy,U ) . 
(9.20) is <?page no="146"?> 9.2 The ratio estimator 147 N <length (Y) n <- 100 r <mY/ mX av.r <- (1/ n-1/ N)*( var (Y)+r^2* var (X)-2*r* cov (X,Y)); av.r ## [1] 30069761 as.r <sqrt (av.r); as.r ## [1] 5484 v.si <var (Y)/ n s.si <sqrt (v.si); s.si ## [1] 5952 v.si/ av.r ## [1] 1.178 We draw B = 10 , 000 samples of size n = 100 to obtain an approximation of the distribution of the estimator of the approximate variance of the ratio estimator. The estimating function based on a sample is ̂ AV ( ˆ ¯ y ra ) = ( 1 n − 1 N ) ( S 2 y,s + ˆ r 2 S 2 x,s − 2ˆ rS xy,s . ) (9.21) B <- 10000 e <rep (NA,B) set.seed (123) for (i in 1: B){ s <sample (1000,100,replace=F) x <- X[s] y <- Y[s] rs <mean (y)/ mean (x) e[i] <- (1/ n-1/ N)*( var (y)+r^2* var (x)-2*rs* cov (x,y)) } mean (e) ## [1] 29580045 mean (e)/ av.r ## [1] 0.9837 We find that on average, we underestimate the approximate variance by about 2%. The distribution of the variance estimator is bimodal (see figure 9.3). <?page no="147"?> 148 9 Auxiliary variables 0 50000000 100000000 150000000 Variance estimator Density ·10 −8 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 AV(ˆ¯ y ra ) Figure 9.3: Approximate distribution of variance estimator, n = 100. plot ( density (e),lwd=2,main="",xlab="Variance estimator") arrows (av.r,0.5e-08,av.r,0,length=0.1,angle=15) text (av.r,0.7e-08,cex=1.3, expression ( paste ("AV(", hat ( bar (y))[ra],")", sep=""))) The probability of underestimating the variance by more than 50% is about 0 . 36. The probability to overestimate the variance by factor 2 is about 0 . 1. 9.2.5 Comparison of the ratio estimator with the simple π-estimator The approximate variance of the ratio estimator has been derived as AV ( ˆ t ra ) = N 2 ( 1 n − 1 N ) ( S 2 y,U + r 2 S 2 x,U − 2 rS xy,U ) . (9.22) The variance of the simple total estimator in the case of simple random sampling is V ( ˆ t π ) = N 2 ( 1 n − 1 N ) S 2 y,U . (9.23) Comparing the two variances, we see that the ratio estimator has smaller variance if it holds that r 2 S 2 x,U − 2 rS xy,U < 0 <?page no="148"?> 9.2 The ratio estimator 149 S xy,U > 1 2 rS 2 x,U S xy,U S y,U S x,U > 1 2 r S 2 x,U S y,U S x,U (9.24) % xy > 1 2 N ¯ y U N ¯ x U S x,U S y,U % xy > 1 2 CV x,U CV y,U , where % xy denotes the coefficient of correlation and CV = S ¯ x is the coefficient of variation. Summing up the result, we find % xy > 1 2 CV x,U CV y,U = ⇒ AV ( ˆ t ra ) < V ( ˆ t π ) . (9.25) For the special case that both variables Y and X have similar standard errors relative to their averages (i.e. similar coefficients of variation), we find as a simple rule of thumb that the ratio estimator has a smaller variance if the correlation of Y and X exceeds 0.5. For the empirical example we find cor (X,Y) ## [1] 0.2866 0.5*( sd (X)/ mX)/ ( sd (Y)/ mY) ## [1] 0.06384 Therefore, we can expect that the ratio estimator may have a slightly smaller variance compared to the sample mean. 9.2.6 The ratio estimator in the regression context Through Taylor series expansion, we obtained a linear approximate expression for the ratio estimator ˆ t ra ≈ t y + ˆ t y − r ˆ t x . (9.26) Inserting the expressions for the π -estimators of the total, we can express the linear approximation as ˆ t ra ≈ t y + ˆ t y − r ˆ t x = t y + ∑ s y k π k − r ∑ s x k π k = t y + ∑ s 1 π k ( y k − rx k ) = t y + ∑ s e k π k . 
(9.27) <?page no="149"?> 150 9 Auxiliary variables 0 5 10 15 Years of education 0 200 400 600 800 1000 Wage (a) all wages 0 5 10 15 Years of education 0 50 100 150 200 Wage (¯ x, ¯ y) (b) zoom of wage interval [0; 200] Figure 9.4: Homogeneous regression of wage (th. US- Dollar) on years of education, n = 100. The relation between years of education and wages is displayed in figure 9.4. rt <sum (Yt)/ sum (X); rt ## [1] 3.434 # figure a plot (X,Yt,pch=1,cex=0.8,main="",xlim= c (0,18), xlab="Years of education", ylab="Wage (th. US-Dollar)") lines (0: 18,0: 18*rt,lwd=2) # figure b plot (X,Yt,pch=1,cex=0.8,main="",xlim= c (0,18), ylim= c (0,200),xlab="Years of education", ylab="Wage (th. US-Dollar)") lines (0: 18,0: 18*rt,lwd=2) mYt <mean (Yt) arrows (5,100,mX,mYt,length=0.1,angle=20) text (4,107,cex=2, expression ( paste ( "(" , bar (x),",", bar (y),")", sep=""))) Note that r is the population ratio of totals r = t y / t x which does not equal the regression parameter of the homogeneous population regression function. e k is the residual of the regression using ratio r as regression coefficient. The sum of expanded residuals ∑ s e k / π k is an estimator of the sum of the population residuals (which contrary to the linear regression with intercept is not 0). <?page no="150"?> 9.2 The ratio estimator 151 As the ratio of population totals r is unknown, it is estimated using the sample totals ˆ r = ˆ t y ˆ t x . (9.28) The estimates of the residuals are obtained using the estimated ratio ˆ r ˆ e k = y k − ˆ rx k . (9.29) The approximate variance AV ( ˆ t ra ) = N 2 ( 1 n − 1 N ) ( S 2 y,U + r 2 S 2 x,U − 2 rS xy,U ) (9.30) can be expressed using the population residuals obtained with ratio r as regression coefficient AV ( ˆ t ra ) = N 2 ( 1 n − 1 N ) S 2 e,U . (9.31) For estimating based on a sample, we make use of estimated residuals ˆ e ̂ AV ( ˆ t ra ) = N 2 ( 1 n − 1 N ) S 2 ˆ e,s . (9.32) 9.2.7 The linear regression model under specific heteroskedasticity assumption We noticed that without taking into account any form of heterogeneity, the ratio of totals r does not equal the population regression parameter of the homogeneous regression. As the stochastic regression model does not refer to any existing population but assumes a random mechanism (also called superpopulation model) to generate social facts, we introduce the parameter β for this model. The parameter for the population U is denoted by ˜ β to hint for the idea of the “stochastic model” that the facts observed for the population U are generated by the assumed random mechanism and that other populations could very well have been generated. We also use B for the population parameter which is seen as constant in the survey sampling context. The constant parameter B can be calculated exactly based on the population data and has only to be estimated if a sample is drawn. The estimator for B based on a sample is denoted by ̂ B . We now assume that the variance of the stochastic regression error term is proportional to X y k = βx k + ε k (9.33a) <?page no="151"?> 152 9 Auxiliary variables V( ε k ) = x k σ 2 (9.33b) To account for this specific heterogeneity in the regression, weights 1 / x k have to be used in calculations. For the population regression we then obtain ˜ β = B = ∑ U x k y k x k ∑ U x k x k x k = t y t x = r. (9.34) Hence, in this special case the regression parameter and the ratio r coincide. 
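The coincidence of the weighted regression parameter and the ratio of totals is easy to check numerically. The following minimal sketch uses a small artificial population; the vectors xa and ya are invented for this illustration and have nothing to do with the PSID data.

xa <- c(2, 4, 5, 8, 11)   # artificial auxiliary variable
ya <- c(3, 9, 8, 15, 25)  # artificial variable of interest
sum(ya)/sum(xa)           # ratio of totals r
## [1] 2
# homogeneous regression weighted with 1/x_k reproduces r
round(as.numeric(coef(lm(ya ~ 0 + xa, weights = 1/xa))), 4)
## [1] 2
# the unweighted homogeneous regression gives a different slope
round(as.numeric(coef(lm(ya ~ 0 + xa))), 4)
## [1] 2.0739

Only with the weights 1/x_k does the least squares slope equal r = t_y / t_x; without the weights the slope is the ratio of the sums of x_k y_k and x_k^2, as in the homogeneous population regression discussed in section 9.2.6.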
For the population residuals, it therefore holds that the residuals of the regression approach equal the residuals using r as regression coefficient ˆ ε k = y k − ˜ βx k = y k − rx k = e k . (9.35) Note that the assumption of this very specific form of heterogeneity may be unjustified for the data under analysis. 9.3 The difference estimator The idea of the difference estimator is quite similar to the idea of the ratio estimator. But whereas the ratio estimator assumed a proportional relation between the variable of interest Y and auxiliary variable X , the difference estimator assumes a constant shift, independent of the level of X . 9.3.1 The difference estimator using regression notation Using the notation of the stochastic regression model, the difference estimator assumes the following model to hold y k = β + 1 · x k + ε k y k − x k = β + ε k . (9.36) Under the assumption of homoskedasticity, the variance of the residual is constant V( ε k ) = σ. (9.37) Based on the data for population U , we can calculate B (or ˜ β in the notation of the stochastic regression model), which may be seen as an estimate of the parameter β of the stochastic model ˜ β = B = 1 N ∑ U ( y k − x k ) = 1 N ∑ U y k − 1 N ∑ U x k = ¯ y U − ¯ x U . (9.38) <?page no="152"?> 9.3 The difference estimator 153 This implies that the sum of residuals in the population U is 0 ∑ U e k = ∑ U ( y k − x k − ˜ β ) = 0 . (9.39) The population parameter B can be estimated based on a sample ̂ B = 1 n ∑ s ( y k − x k ) = 1 n ∑ s y k − 1 n ∑ s x k = ¯ y s − ¯ x s . (9.40) The difference estimator ˆ t y,diffof total t y is ˆ t y,diff = t x + N ̂ B = N ¯ x U + N (¯ y s − ¯ x s ) = N ¯ y s + N (¯ x U − ¯ x s ) . (9.41) Note that the free estimator N ¯ y s is corrected for the sample being “too big” (¯ x U − ¯ x s < 0) or “too small” (¯ x U − ¯ x s > 0). 9.3.2 Properties of the difference estimator The difference estimator is unbiased E( ˆ t y,diff ) = E( N [¯ x U + ¯ y s − ¯ x s ]) = N [E(¯ x U ) + E (¯ y s ) − E(¯ x s )] = N [¯ x U + ¯ y U − ¯ x U ] = N ¯ y U = t y,U . (9.42) The variance of the difference estimator is V( ˆ t y,diff ) = V( N [¯ x U + ¯ y s − ¯ x s ]) = N 2 ( 1 n − 1 N ) ( S 2 y,U + S 2 x,U − 2 S xy,U ) = N 2 ( 1 n − 1 N ) S 2 d,U , (9.43) where d k = (¯ y s − ¯ x s ) − (¯ y U − ¯ x U ) . (9.44) Based on a sample, the variance of the difference estimator is estimated using the estimating function ̂ V( ˆ t y,diff ) = N 2 ( 1 n − 1 N ) ( S 2 y,s + S 2 x,s − 2 S xy,s ) = N 2 ( 1 n − 1 N ) S 2 d,s . (9.45) <?page no="153"?> 154 9 Auxiliary variables The comparison of the variance of the difference estimator V ( ˆ t diff ) = N 2 ( 1 n − 1 N ) ( S 2 y,U + S 2 x,U − 2 S xy,U ) (9.46) and the free estimator V ( ˆ t π ) = N 2 ( 1 n − 1 N ) S 2 y,U (9.47) reveals the following relation S 2 x,U − 2 S xy,U < 0 S xy,U > 1 2 S 2 x,U S xy,U S y,U S x,U > 1 2 S 2 x,U S y,U S x,U % xy > 1 2 S x,U S y,U , (9.48) where % xy denotes the coefficient of correlation. Therefore, we find the following rule % xy > 1 2 S x,U S y,U = ⇒ V ( ˆ t diff ) < V ( ˆ t π ) . (9.49) 9.3.3 The difference estimator of average wage using the PSID data We apply the difference estimator towards the PSID data. Obviously, the difference estimator is not invariant to the unit of measurement. Therefore, the auxiliary variable ( X ) has to be of the same magnitude as the variable of interest ( Y ). 
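Before turning to the PSID data, the unbiasedness shown in (9.42) can be illustrated by complete enumeration for a small artificial population; the values of xU and yU below are invented for this illustration.

yU <- c(12, 15, 19, 22)   # artificial variable of interest
xU <- c(10, 14, 20, 20)   # artificial auxiliary variable
ty <- sum(yU); tx <- sum(xU)
S <- combn(4, 2)          # all 6 simple random samples of size n = 2
t.diff <- apply(S, 2, function(s) tx + 4*(mean(yU[s]) - mean(xU[s])))
t.diff
## [1] 70 66 72 64 70 66
mean(t.diff); ty
## [1] 68
## [1] 68

Under simple random sampling all six samples have the same probability, so the mean over all samples is the expected value of the difference estimator, and it equals the population total t_y.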
Assume that we want to estimate the average of the yearly wage ( Y ) using information of the full-time equivalent income ( X ) assumed to be known for all elements in the population. The full-time equivalent income we calculate as hourly wage time a standard yearly working amount of 2000 hours. Figure 9.5 shows the scatter plot of X and Y and the distribution of the differences Y − X . d <read.csv2 ("psid_2007_sp.csv") Y <d$wage X <- Y/ d$hours*2000 N <length (Y) cor (X,Y) <?page no="154"?> 9.3 The difference estimator 155 x (in thousands) y ·10 5 0 2 4 6 8 10 0 100 300 500 (a) scatterplot y − x (in thousands) Density ·10 −5 01234567 -20 -10 0 10 20 (b) density Figure 9.5: X and Y and distribution of differences Y −X. ## [1] 0.9137 mY <mean (Y); mY ## [1] 44487 mX <mean (X); mX ## [1] 42988 md <mean (Y-X); md ## [1] 1499 # figure a plot (X,Y,pch=1,cex=0.8,main="", xlab="x",ylab="y") # figure b plot ( density (Y-X,n=512*4),pch=1,cex=0.8, xlim= c (-25000,25000),main="",xlab="y-x") points (md,0,pch=19,col=2,cex=1.3) We draw B = 10 , 000 samples and calculate the difference estimator for the average wage. For comparison, we also calculate the free estimator which in the case of simple random sampling is just the sample mean. B <- 10000 e <matrix (NA,B,2) set.seed (123) for (i in 1: B){ s <sample (1000,100,replace=F) x <- X[s] y <- Y[s] b <mean (y)mean (x) <?page no="155"?> 156 9 Auxiliary variables Diff. estimator (in thousands) diff. estimator sample mean Density ·10 −4 0.0 0.5 1.0 1.5 35 40 45 50 55 (a) distribution of difference estimator and sample mean Error in SI (in thousands) Correction ·10 2 -150 -100 -50 0 50 100 -20 -10 0 10 (b) error and correction Figure 9.6: Comparison. e[i,1] <mean (y)+mXmean (x) e[i,2] <mean (y) } colMeans (e) ## [1] 44456 44422 a <apply (e,2,var); a ## [1] 6050281 31274515 a[1]/ a[2] ## [1] 0.1935 cor (mY-e[,2],e[,1]-e[,2]) ## [1] 0.9133 The distributions of the difference estimator and the sample mean are displayed in figure 9.6a. # figure a plot ( density (e[,1]),lwd=2,xlab="Diff. estimator", main="") lines ( density (e[,2]),lwd=2,lty=2) legend ('topright', c ("diff. estimator","sample mean"), lwd=2,lty= c (1,2),bty="n",cex=1.3) # figure b plot (mY-e[,2],e[,1]-e[,2],cex=0.5, xlab="Error in SI",ylab="Correction") According to the simulation, the difference estimator has a smaller variance by about 81%. The right panel displays the B <?page no="156"?> 9.3 The difference estimator 157 differences between the sample mean and the population mean ¯ y U (sampling error) on the x -axis and the difference between the difference estimator and the sample mean on the y -axis (implicit correction through the difference estimator). We observe that the “error” and the “correction” correlate rather strongly (Cor = 0 . 91) in this example . <?page no="157"?> 158 9 Auxiliary variables 9.4 Exercises 1. Consider the following small population with variable Y and auxiliary variable X : Y : 2 4 6 8 X : 1 2 4 3 a) Calculate t y , t x , r = t y / t x , Cov( Y, X ) and Cor( Y, X ). The population total t y shall be estimated based on a simple random sample of size n = 2. b) Obtain the number of possible samples |S| = M . c) Obtain for all M possible samples the free estimator ˆ t and the ratio estimator ˆ t ra = ˆ t y t x ˆ t x . d) Calculate the expected value of ˆ t and ˆ t r by complete enumeration. e) Calculate the variance of ˆ t and ˆ t r by complete enumeration. 
f) Obtain for all M possible samples the estimator of the variance of ˆ t and the estimator of the approximate variance of (obtained by first order Taylor series approximation) ˆ t r . g) Obtain the mean square error of both estimating functions for the population total. Which estimating function would you prefer? h) Obtain the variance of the free total estimator ˆ t according to the following expression V ( ˆ t π ) = N 2 ( 1 n − 1 N ) S 2 y,U and the approximate variance of the ratio estimator according to AV ( ˆ t ra ) = N 2 ( 1 n − 1 N ) ( S 2 y,U + r 2 S 2 x,U − 2 rS xy,U ) 2. Consider the PSID data set. Use wage as variable Y and age as auxiliary variable X . <?page no="158"?> 9.4 Exercises 159 a) Do you expect the ratio estimator of the population total t y to be more efficient than the free estimator of the total? b) Draw B = 10 , 000 simple random samples of size n = 50 from the population and calculate for all samples both estimators ˆ t r and ˆ t . c) Compare the approximate distributions of both estimating functions graphically. d) Which estimating function has smaller variance? <?page no="160"?> 10 Regression Most researchers and students will be familiar with the stochastic regression model. In this model, the variable of interest Y is generated according to a random mechanism and is therefore a random variate. In the design approach of survey sampling, the values of Y k for all elements k of the population are regarded as fixed. The random aspect is confined to the random sampling of elements from the population. 10.1 Introduction 162 10.1.1 Regression without intercept 162 10.1.2 Regression with intercept 165 10.1.3 Multivariate linear regression with intercept 167 10.2 Variance of the parameter estimators 169 10.2.1 Linear approximation of the π-estimator 169 10.2.2 The variance of the linear approximation of the π-estimator 171 10.2.3 Simple regression through the origin 175 10.2.4 Simple regression with intercept 176 10.2.5 Simple regression with intercept and simple random sampling 177 10.2.6 Wage regression with PSID data and simple random sampling 177 10.3 Exercises 180 <?page no="161"?> 162 10 Regression 10.1 Introduction For the moment we will restrict the discussion to the case of simple linear regression. We denote the explanatory variable with X and the variable of interest with Y . For all N elements of population U the values y k , x k are regarded as fixed. The regression line is fitted by ordinary least squares and is meant to provide descriptive information. There is no speculation about the generation of the specific values of y k , x k , which in the social and economic context are the result of highly complex interactions. Therefore, given that the values y k , x k are non-random, the population least squares regression function will allow to obtain the regression parameters without any random error. The homogeneous regression function without an intercept passes through the origin and is given as y k = Bx k + E k . (10.1) Including an intercept, the regression function is y k = B 1 + B 2 x k + E k . (10.2) In the design approach, the regression without an intercept is less complicated. 10.1.1 Regression without intercept Population regression parameter and π-estimator In the case of one explanatory variable, the N observed values of the population are given as column vectors x =      x 1 x 2 . . . x N      , y =      y 1 y 2 . . . y N      . 
(10.3) The least squares criterion ∑ U ( y k − Bx k ) 2 (10.4) results in the analytic solution for the slope parameter B B = ( ∑ U x 2 k ) −1 ∑ U x k y k . (10.5) <?page no="162"?> 10.1 Introduction 163 The π -estimator based on a sample s is defined as ̂ B = ( ∑ s x 2 k π k ) −1 ∑ s x k y k π k . (10.6) A small numerical example We use a small numerical example to illustrate the π -estimator. We define a population of size N = 5 and consider the case of sample size n = 3. Hence, we have |S| = M = ( N n ) = ( 5 3 ) = 5! 3! 2! = 10 (10.7) possible samples. The values of x k , y k , p ( s ) are given as N <- 5 n <- 3 M <choose (5,3) x <- 1: N; x ## [1] 1 2 3 4 5 y <c (1,3,2,5,4); y ## [1] 1 3 2 5 4 p <- 1: M/ sum (1: M); p ## [1] 0.01818 0.03636 0.05455 0.07273 0.09091 0.10909 ## [7] 0.12727 0.14545 0.16364 0.18182 The definition of the inclusion probability is π k = P( k ∈ S ) = P( I k = 1) = ∑ s 3 k p ( s ) . (10.8) We denote the matrix of inclusion indicators with dimension N ×M by D , the column vector of sample probabilities by p and the column vector of inclusion probabilities by π . Given the definition of the inclusion probabilities, it holds that D × p = π. (10.9) We obtain the matrix of inclusion indicators D using a function indicating whether element k is in sample s U <- 1: N is <combn (N,n) Iks <function(k,s) as.numeric (k %in% s) D <apply (is,2,function(s) Iks (U,s)); D <?page no="163"?> 164 10 Regression ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 1 1 1 1 1 1 0 0 0 0 ## [2,] 1 1 1 0 0 0 1 1 1 0 ## [3,] 1 0 0 1 1 0 1 1 0 1 ## [4,] 0 1 0 1 0 1 1 0 1 1 ## [5,] 0 0 1 0 1 1 0 1 1 1 and the vector of inclusion probabilities as pik <as.vector (D%*%p); pik ## [1] 0.3818 0.5455 0.6364 0.6909 0.7455 For the M = 10 possible samples, we obtain the following numerical estimators ̂ B hB <rep (NA,M) for(i in 1: M){ ind <as.logical (D[,i]); ind xi <x[ind] yi <y[ind] pii <pik[ind] hB[i] <sum (xi*yi/ pii)/ sum (xi^2/ pii) } round (hB,3) ## [1] 0.957 1.286 0.930 1.027 0.773 0.985 1.106 0.859 ## [9] 1.043 0.920 The expected value can be calculated using the sample probabilities: # expected value of estimator sum (hB*p) ## [1] 0.971 # population parameter B <sum (x*y)/ sum (x^2); B ## [1] 0.9636 We find that the π -estimator ̂ B of the population parameter B is slightly biased. <?page no="164"?> 10.1 Introduction 165 10.1.2 Regression with intercept Population regression parameter and π-estimator The first column in matrix X contains ones and B is a 2 × 1 matrix X =      1 x 1 1 x 2 . . . . . . 1 x N      , y =      y 1 y 2 . . . y N      , B = [ B 1 B 2 ] . (10.10) The least squares criterion is ∑ U ( y k − x ′ k B ) 2 . (10.11) The analytic expression for the slope parameter is B 2 = N (∑ U x k y k ) − (∑ U x k ) (∑ U y k ) N (∑ U x 2 k ) − (∑ U x k ) 2 (10.12a) = ∑ U ( x k − ¯ x U ) ( y k − ¯ y U ) ∑ U ( x k − ¯ x U ) 2 (10.12b) and for the intercept B 1 = ¯ y U − B 2 ¯ x U . (10.12c) The π -estimator based on a sample s is defined as ̂ B 2 = (∑ s 1 π k ) (∑ s x k y k π k ) − (∑ s x k π k ) (∑ s y k π k ) (∑ s 1 π k ) (∑ s x 2 k π k ) − (∑ s x k π k ) 2 = ∑ s (x k −˜ x s )(y k −˜ y s ) π k ∑ s (x k −˜ x s ) 2 π k , ̂ B 1 = ˜ y s − ̂ B 2 ˜ x s with ˜ y s = ˆ t y,π ˆ N and ˜ x s = ˆ t x,π ˆ N . (10.13) Note that in the case of known population size N we would insert the π -estimator of the mean ˆ ¯ y = ˆ t x,π N (10.14) for ˜ y and likewise ˆ ¯ x for ˜ x . 
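A practical remark: because the weighted mean with weights 1/π_k equals x̃_s (and likewise ỹ_s), the estimator in (10.13) can also be obtained from lm() by supplying the weights 1/π_k. The following minimal sketch uses one hypothetical sample; the values of xs1, ys1 and the inclusion probabilities pik1 are invented for this illustration.

xs1  <- c(1, 3, 4, 6)
ys1  <- c(2, 5, 4, 9)
pik1 <- c(0.2, 0.4, 0.5, 0.8)
# pi-estimator following (10.13)
xtil <- sum(xs1/pik1)/sum(1/pik1)
ytil <- sum(ys1/pik1)/sum(1/pik1)
B2.pi <- sum((xs1-xtil)*(ys1-ytil)/pik1)/sum((xs1-xtil)^2/pik1)
B1.pi <- ytil - B2.pi*xtil
c(B1.pi, B2.pi)
# the same two estimates via weighted least squares
as.numeric(coef(lm(ys1 ~ xs1, weights = 1/pik1)))

Both commands return the same pair of estimates, because the weighted least squares normal equations with weights 1/π_k are exactly the equations solved by the π-estimator with intercept.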
<?page no="165"?> 166 10 Regression A small numerical example We continue the example from the simple homogeneous regression. For the M = 10 samples, we obtain the following numerical estimates of ̂ B hB2 <matrix (NA,2,M) for(i in 1: M){ ind <as.logical (D[,i]); ind xi <x[ind] yi <y[ind] mxi <mean (xi) myi <mean (yi) pii <pik[ind] hB2i <sum ((xi-mxi)*(yi-myi)/ pii)/ sum ((xi-mxi)^2/ pii) hB2[1,i] <myi-hB2i*mxi hB2[2,i] <hB2i } round (hB2,3) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] ## [1,] 0.750 -0.103 0.754 -0.374 0.164 0.227 0.569 1.715 ## [2,] 0.625 1.330 0.717 1.140 0.723 0.932 0.922 0.385 ## [,9] [,10] ## [1,] 2.302 -0.544 ## [2,] 0.463 1.053 The expected values can be calculated using the sample probabilities: # expected value of estimator round (hB2%*%p,3) ## [,1] ## [1,] 0.663 ## [2,] 0.790 # population parameter mx <mean (x) my <mean (y) B2.2 <sum ((x-mx)*(y-my))/ sum ((x-mx)^2) B2.1 <my-B2.2*mx B2 <c (B2.1,B2.2); B2 ## [1] 0.6 0.8 We find that the π -estimator ̂ B of the population parameter vector B is slightly biased. <?page no="166"?> 10.1 Introduction 167 10.1.3 Multivariate linear regression with intercept Population regression parameter and π-estimator We now consider the general case with q explanatory variables, including the vector of ones for the intercept (i.e. x 11 , x 21 , . . . , x N 1 take value 1) X =      x 11 · · · x 1q x 21 x 2q . . . . . . x N 1 · · · x N q      , y =      y 1 y 2 . . . y N      , B =      B 1 B 2 . . . B q      . (10.15) Indexes j = 1 , . . . , q and j ′ = 1 , . . . , q are used for the different explanatory variables. To a scalar we refer to with two indexes k and j , i.e. x kj . If using only index k , e.g. x k , we refer to a column vector of length q with elements taken from the k -th row of matrix X . Hence, x ′ k is a row vector of length q . Sums of products of two variables x j and x ′ j over all units k = 1 , . . . , N of the population are denoted by t jj ′ t jj ′ = ∑ U x kj x kj ′ = t j ′ j . (10.16) Variable Y is referred to using index 0 t j0 = ∑ U x kj y k . (10.17) T denotes the symmetric q × q -matrix with elements t jj ′ T = ∑ U x k x ′ k . (10.18) t is a column vector with q elements t j0 t = ∑ U x k y k . (10.19) The population parameter vector is obtained as B = T −1 t. (10.20) Therefore, B is a function of q ( q − 1) / 2 totals t jj ′ and of q totals t j0 . In the π -estimator ̂ B these population totals are replaced by their individual π -estimators of the totals ˆ t jj ′ ,π = ∑ s x kj x kj ′ π k (10.21) <?page no="167"?> 168 10 Regression and ˆ t j0,π = ∑ s x kj y k π k . (10.22) The π -estimator ̂ B is defined as ̂ B = ̂ T −1 ˆ t = ( ∑ s x k x ′ k π k ) −1 ∑ s x k y k π k with ̂ T = ∑ s x k x ′ k π k and ˆ t = ∑ s x k y k π k (10.23) and using matrix notation ̂ B = ( X ′ s Π −1 s X s ) −1 ( X ′ s Π −1 s y s ) . (10.24) A small numerical example We continue the small numerical example and define an additional explanatory variable. 
x2 <c (9,5,2,7,4) X <cbind (1,x,x2) hB3 <matrix (NA,3,M) for(i in 1: M){ ind <as.logical (D[,i]) Xi <- X[ind,] yi <y[ind] Ipii <solve ( diag (pik[ind])) hB3[,i] <solve ( t (Xi)%*%Ipii%*%Xi)%*% t (Xi)%*%Ipii%*%yi } round (hB3,3) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 38 1.6 4.909 -3.176 -0.889 -10 -0.5 -0.5 -0.5 -0.5 ## [2,] -10 1.2 0.182 1.529 0.889 2 0.5 0.5 0.5 0.5 ## [3,] -3 -0.2 -0.455 0.294 0.111 1 0.5 0.5 0.5 0.5 The expected value can be calculated using the sample probabilities: # expected value of estimator round (hB3%*%p,3) ## [,1] ## [1,] -0.695 ## [2,] 0.591 ## [3,] 0.363 <?page no="168"?> 10.2 Variance of the parameter estimators 169 # population parameter B3 <solve ( t (X)%*%X)%*% t (X)%*%y round (B3,3) ## [,1] ## -0.905 ## x 0.954 ## x2 0.193 We observe again that the estimating function is biased. 10.2 Variance of the parameter estimators We have observed that the estimating function ̂ B for the population regression parameter vector B resulted in ratios of expressions, containing estimated totals in the nominator as well as in the denominator. Therefore, no analytic expressions of the variance V( ̂ B ) are available. To obtain approximate variances, the estimators ̂ B are substituted by first order Taylor series approximations ̂ B 0 in a first step. In a second step, variances V( ̂ B 0 ) for the linear approximations are obtained. These exact variances V ( ̂ B 0 ) of approximated estimating functions ̂ B 0 are used as approximate variances AV( ̂ B ) of the original estimating functions ̂ B . 10.2.1 Linear approximation of the π-estimator The estimator ̂ B is a function of two terms ̂ T −1 and ˆ t , both including total estimators. Therefore, we are interested in the function value f ( ̂ T −1 ˆ t ) = ̂ T −1 ˆ t . We approximate this function through the Taylor series expansion around point ( T , t ). Using the first order approximation, we find ̂ B = ̂ T −1 ˆ t = f ( ̂ T −1 ˆ t ) ≈ ̂ B 0 = f ( T −1 t ) + ∂f ( ̂ T −1 ˆ t ) ∂ ̂ T −1 ∣ ∣ ∣ ∣ ∣ ̂ T =T,ˆ t=t ( ̂ T −1 − T −1 ) + ∂f ( ̂ T −1 ˆ t ) ∂ ˆ t ∣ ∣ ∣ ∣ ∣ ̂ T =T,ˆ t=t ( ˆ t − t ) (10.25) = B + ∂ ̂ B ∂ ̂ T −1 ∣ ∣ ∣ ∣ ∣ ̂ T =T,ˆ t=t ( ̂ T −1 − T −1 ) <?page no="169"?> 170 10 Regression + ∂ ̂ B ∂ ˆ t ∣ ∣ ∣ ∣ ∣ ̂ T =T,ˆ t=t ( ˆ t − t ) . Since we consider q explanatory variables, the term ̂ T −1 contains q ( q − 1) / 2 total estimators ˆ t jj ′ ,π = ∑ s x kj x kj ′ π k (10.26) and ˆ t contains q total estimators ˆ t j0,π = ∑ s x kj y k π k . (10.27) We denote the evaluated derivatives of the estimator with respect to the individual total estimators contained in ̂ T −1 and in ˆ t respectively as a jj ′ = ∂ ̂ B ∂ ˆ t jj ′ ,π ∣ ∣ ∣ ∣ ∣ ̂ T =T,ˆ t=t (10.28) and a j0 = ∂ ̂ B ∂ ˆ t j0,π ∣ ∣ ∣ ∣ ∣ ̂ T =T,ˆ t=t . (10.29) Inserting these expressions into the linear approximation results in ̂ B ≈ ̂ B 0 = B + q ∑ j=1 ∑ j ′ ≤j a jj ′ ( ˆ t jj ′ ,π − t jj ′ ) + q ∑ j=1 a j0 ( ˆ t j0,π − t j0 ) . (10.30) The derivatives with respect to total estimators, we obtain as ∂ ̂ B ∂ ˆ t jj ′ ,π = ∂ ( ̂ T −1 ˆ t ) ∂ ˆ t jj ′ ,π = ( ∂ ̂ T −1 ∂ ˆ t jj ′ ,π ) ˆ t (10.31a) = ( − ̂ T −1 Λ jj ′ ̂ T −1 ) ˆ t = − ̂ T −1 Λ jj ′ ̂ B (10.31b) and as ∂ ̂ B ∂ ˆ t j0,π = ∂ ( ̂ T −1 ˆ t ) ∂ ˆ t j0,π = ̂ T −1 ( ∂ ˆ t ∂ ˆ t j0,π ) = ̂ T −1 λ jj ′ . 
(10.31c) <?page no="170"?> 10.2 Variance of the parameter estimators 171 The evaluation of the derivatives results in replacing the estimator ̂ B with true parameter vector B and the estimator ̂ T with true T in the derivatives ̂ B 0 = B + q ∑ j=1 ∑ j ′ ≤j −T −1 Λ jj ′ B ( ˆ t jj ′ ,π − t jj ′ ) + q ∑ j=1 T −1 λ jj ′ ( ˆ t j0,π − t j0 ) = B − T −1 ( ̂ T − T ) B + T −1 ( ˆ t − t ) = B − ( T −1 ̂ T + T −1 T ) B + T −1 ˆ t − T −1 t = B − T −1 ̂ T B + B + T −1 ˆ t − B = B + T −1 ( ˆ t − ̂ T B ) . (10.32) Note that the first order approximation is now a linear function of random variates. 10.2.2 The variance of the linear approximation of the π-estimator Derivation of the variance expression The linear approximation of the parameter estimation function is ̂ B ≈ ̂ B 0 = B + T −1 ( ˆ t − ̂ T B ) . (10.33) Since B and T are fixed, we have a closer look at ˆ t − ̂ T B ˆ t − ̂ T B = ∑ s x kj y k π k − ∑ s x kj x ′ kj ′ π k B = ∑ s 1 π k x k ( y k − x ′ k B ) = ∑ s 1 π k x k E k (10.34) with population residuals E k = y k − x ′ k B. (10.35) Hence, ˆ t − ̂ T B is a vector of length q containing the π -estimators ∑ s 1 π k x k E k . (10.36) <?page no="171"?> 172 10 Regression The variance of the linear approximation ̂ B 0 is used as an approximate variance of the non-linear estimator ̂ B AV( ̂ B ) = V( ̂ B 0 ) = V[ B + T −1 ( ˆ t − ̂ T B )] = V[ T −1 ( ˆ t − ̂ T B )] = T −1 V[( ˆ t − ̂ T B )] T −1 = T −1 Σ T −1 . (10.37) Σ is a symmetric matrix of dimension q × q Σ = V[ ˆ t − ̂ T B ] = V [ ∑ s 1 π k x k E k ] with elements σ jj ′ = Cov ( ∑ s 1 π k x k E k , ∑ s 1 π k x k E k ) = ∑∑ U ∆ kl ( x kj E k π k )( x j ′ l E l π l ) and Cov( I k , I l ) = π kl − π k π l = ∆ kl . The estimation of the variance based on a sample uses π -estimators for unknown population totals, residuals, and covariances ̂ V( ̂ B 0 ) = ̂ T −1 ̂ Σ ̂ T −1 = ( ∑ s x k x ′ k π k ) −1 ̂ Σ ( ∑ s x k x ′ k π k ) −1 (10.38) with elements of ̂ Σ ˆ σ jj ′ = ∑∑ s ˇ ∆ kl ( x kj e k π k )( x j ′ l e l π l ) and estimated residuals e k = y k − x ′ k ̂ B, for k ∈ s and expanded covariances ˇ ∆ kl = ∆ kl π kl = π kl − π k π l π kl . <?page no="172"?> 10.2 Variance of the parameter estimators 173 A small numerical example with R We continue our small numerical example. The population has the size N = 5, the sample size is n = 3, the matrix of inclusion indicators is denoted by D , the vector of inclusion probabilities by pik and the vector of sample probabilities by p . To obtain the second order inclusion probabilities π kl , we have to consider for all M samples the N × N matrix of pairs of inclusion probabilities, remembering that I k I l = 1 only if I k = 1 & I l = 1. To obtain the array of multiplied pairs of inclusion indicators, we use the command outer () using the shortcut \%o\%: # second order inclusion indicators D2 <array (NA,dim= c (N,N,M)) for(i in 1: M) D2"i] <- D[,i]%o%D[,i] # Applying sample prob. 
p to all samples and summing up pa <array ( rep (p,each=N*N), c (N,N,M)) pikl <apply (D2*pa, c (1,2),sum) round (pikl,2) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.38 0.11 0.18 0.22 0.25 ## [2,] 0.11 0.55 0.29 0.33 0.36 ## [3,] 0.18 0.29 0.64 0.38 0.42 ## [4,] 0.22 0.33 0.38 0.69 0.45 ## [5,] 0.25 0.36 0.42 0.45 0.75 The covariances ∆ kl are defined as π kl − π k π l : D.kl <matrix (NA,N,N) for (i in 1: N){ for (j in 1: N){ D.kl[i,j] <pikl[i,j]-pik[i]*pik[j] }} round (D.kl,2) ## [,1] [,2] [,3] [,4] [,5] ## [1,] 0.24 -0.10 -0.06 -0.05 -0.03 ## [2,] -0.10 0.25 -0.06 -0.05 -0.04 ## [3,] -0.06 -0.06 0.23 -0.06 -0.06 ## [4,] -0.05 -0.05 -0.06 0.21 -0.06 ## [5,] -0.03 -0.04 -0.06 -0.06 0.19 We have already obtained the π -estimator of B for the M samples. <?page no="173"?> 174 10 Regression round (hB3,2) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## [1,] 38 1.6 4.91 -3.18 -0.89 -10 -0.5 -0.5 -0.5 -0.5 ## [2,] -10 1.2 0.18 1.53 0.89 2 0.5 0.5 0.5 0.5 ## [3,] -3 -0.2 -0.45 0.29 0.11 1 0.5 0.5 0.5 0.5 B3 <solve ( t (X)%*%X)%*% t (X)%*%y; round ( t (B3),3) ## x x2 ## [1,] -0.905 0.954 0.193 Now we calculate the variance of the π -estimator using the vector E of population regression residuals. E <y-X%*%B3; round ( t (E),3) ## [,1] [,2] [,3] [,4] [,5] ## [1,] -0.786 1.032 -0.344 0.737 -0.639 q <ncol (X); q ## [1] 3 We first calculate matrix V which is V = V( ˆ t − ̂ T B ) V <matrix (NA,q,q) for (i in 1: q){ for (j in 1: q){ XEpi.i <as.vector (X[,i]*E/ pik) XEpi.j <as.vector (X[,j]*E/ pik) V[i,j] <sum (D.kl*(XEpi.i%o%XEpi.j)) }} round (V,2) ## [,1] [,2] [,3] ## [1,] 3.25 6.41 21.98 ## [2,] 6.41 16.64 39.68 ## [3,] 21.98 39.68 159.23 and finally the covariance matrix of the linear approximated estimation function, which is an approximate variance of the non-linear estimation function ̂ B Ti <solve ( t (X)%*%X) AV.B <- Ti%*%V%*%Ti round (AV.B,3) ## x x2 ## 1.403 -0.266 -0.073 ## x -0.266 0.067 0.002 ## x2 -0.073 0.002 0.015 <?page no="174"?> 10.2 Variance of the parameter estimators 175 We compare the variances obtained for the linear approximation of the π -estimator with the true variances obtained through complete enumeration. V.B <apply (hB3,1, function(z,a) sum ((zsum (z*a))^2*a),p) round ( rbind (AV.B= diag (AV.B),V.B),3) ## x x2 ## AV.B 1.403 0.067 0.015 ## V.B 39.048 2.356 0.316 Note that in this very small numerical example, which serves to exemplify the mechanics of the calculation of the approximate variance, AV ( ̂ B ) differs strongly from the true variance of the estimator V( ̂ B ). 10.2.3 Simple regression through the origin The population parameter B is defined as B = ∑ U x k y k ∑ U x 2 k (10.39) and estimated using the π -estimator ̂ B = ∑ s x k y k / π k ∑ s x 2 k / π k . (10.40) The approximate variance of ̂ B is AV( ̂ B ) = [ ∑ U ( x 2 k ) ] −2 ∑∑ U ∆ kl x k E k π k x l E l π l with E k = y k − Bx k . (10.41) The estimate based on a sample is ̂ V( ̂ B ) = [ ∑ s x 2 k π k ] −2 ∑∑ s ˇ ∆ kl x k e k π k x l e l π l with e k = y k − ̂ Bx k . (10.42) <?page no="175"?> 176 10 Regression 10.2.4 Simple regression with intercept The population parameters B 1 and B 2 are defined as B 2 = ∑ U ( x k − ¯ x U ) ( y k − ¯ y U ) ∑ U ( x k − ¯ x U ) 2 and B 1 = ¯ y U − B 2 ¯ x U . (10.43) The π -estimators based on a sample s are defined as ̂ B 2 = [ ∑ s ( x k − ˜ x s ) 2 π k ] −1 ∑ s ( x k − ˜ x s ) ( y k − ˜ y s ) π k , ̂ B 1 = ˜ y s − ̂ B 2 ˜ x s with ˜ y s = ˆ t y,π ˆ N and ˜ x s = ˆ t x,π ˆ N . 
(10.44) Note that in the case of known population size N , we would insert the π -estimator of the mean ˆ ¯ y = ˆ t x,π N (10.45) for ˜ y and likewise ˆ ¯ x for ˜ x . The approximate variance of ̂ B 2 is AV( ̂ B 2 ) = [ ∑ U ( x k − ¯ x U ) 2 ] −2 × ∑∑ U ∆ kl ( x k − ¯ x U ) E k π k ( x l − ¯ x U ) E l π l with E k = y k − B 1 − B 2 x k = y k − ¯ y U − B 2 ( x k − ¯ x U ) . (10.46) The estimator of the variance based on a sample is ̂ V( ̂ B ) = [ ∑ s ( x k − ˜ x s ) π k 2 ] −2 × ∑∑ s ˇ ∆ kl ( x k − ˜ x s ) e k π k ( x l − ˜ x s ) e l π l with e k = y k − ̂ B 1 − ̂ B 2 x k = y k − ˜ y s − ̂ B 2 ( x k − ˜ x s ) . (10.47) <?page no="176"?> 10.2 Variance of the parameter estimators 177 10.2.5 Simple regression with intercept and simple random sampling We have considered the general expressions for any design using individual inclusion probabilities π k . In the case of simple random sampling (SI-Design), the formulas simplify considerably. In simple random sampling all inclusion probabilities are identical and given as π k = n N for all k ∈ U . The estimators simplify towards ̂ B 2 = ∑ s ( x k − ¯ x s ) ( y k − ¯ y s ) ∑ s ( x k − ¯ x s ) 2 and ̂ B 1 = ¯ y s − ̂ B 2 ¯ x s . (10.48) The variance of the slope parameter is approximated as AV SI ( ̂ B 2 ) = N 2 1−f n 1 N−1 ∑ U ( x k − ¯ x U ) 2 E 2 k [∑ U ( x k − ¯ x U ) 2 ] 2 (10.49) and estimated based on a sample as ̂ V SI ( ̂ B 2 ) = (1 − f ) n n−1 ∑ s ( x k − ¯ x s ) 2 e 2 k [∑ s ( x k − ¯ x s ) 2 ] 2 . (10.50) Note that even in the case of simple random sampling, the variance estimator from the stochastic regression model ̂ V( ˆ β 2 ) = 1 n−2 ∑ s e 2 k ∑ s ( x k − ¯ x s ) 2 (10.51) differs considerably from the variance estimator in the design regression. 10.2.6 Wage regression with PSID data and simple random sampling We consider a simple linear regression for annual wages being a linear function of years of education. The population regression function is obtained using the function for linear models lm (). d <read.csv2 ("psid_2007_sp.csv") Y <d$wage X <d$eduyears reg <lm (Y~X) B2 <reg$coef[2] sreg <summary (reg) <?page no="177"?> 178 10 Regression sreg ## ## Call: ## lm(formula = Y ~ X) ## ## Residuals: ## Min 1Q Median 3Q Max ## -73961 -22745 -8826 8882 974344 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -55371 10719 -5.17 0.00000029 *** ## X 7707 816 9.45 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 57100 on 998 degrees of freedom ## Multiple R-squared: 0.0821,Adjusted R-squared: 0.0812 ## F-statistic: 89.3 on 1 and 998 DF, p-value: <2e-16 We now generate B = 100 , 000 samples of size n = 100 and obtain for each sample the estimator ̂ B 2 as well as the variance estimator ̂ V( ̂ B 2 ). N <nrow (d) n <- 100 f <n/ N U <- 1: N B <- 100000 Bv <rep (NA,B) Vv <- Bv set.seed (123) for(i in 1: B){ ind <sample (U,n) x <- X[ind] y <- Y[ind] mx <mean (x) my <mean (y) xs <x-mx xs2 <xs^2 sxs2 <sum (xs^2) B2i <sum (xs*y)/ sxs2 B1i <my-B2i*mx e <y-B1i-B2i*x Vv[i] <- (1-f)*n/ (n-1)* sum (xs2*e^2)/ sxs2^2 Bv[i] <- B2i } <?page no="178"?> 10.2 Variance of the parameter estimators 179 Xs <- Xmean (X) AV.B2 <- N^2*(1-f)/ n*1/ (N-1)* sum (Xs^2*reg$resid^2)/ ( sum (Xs^2))^2 # approximate variance AV.B2 ## [1] 12778495 # variance of B=100000 parameter estimates V.B2 <var (Bv); V.B2 ## [1] 13156436 # mean of variance estimator mean (Vv) ## [1] 12673324 We display the result graphically in figure 10.1. 
# figure a
plot(density(Bv), lwd=2, xlab=expression(hat(B)[2]), main="")
arrows(B2, 3e-05, B2, 0, length=0.1, angle=25)
text(B2, 3.5e-05, expression(B[2]))
# figure b
plot(density(Vv), lwd=2, main="",
  xlab=expression(paste(hat(V), "(", hat(B)[2], ")", sep="")))
arrows(V.B2, 4e-08, V.B2, 0, length=0.1, angle=25)
text(V.B2+500000, 4.5e-08,
  expression(paste("V(", hat(B)[2], ")", sep="")))

Figure 10.1: Approximated distributions for n = 100. (a) Approximated distribution of the estimation function for the slope parameter; (b) approximated distribution of the estimation function for the variance of the slope parameter.

10.3 Exercises

1. Consider a population with two variables X and Y:

   x: 6 8 9 10 17
   y: 4 8 7 10 11

   a) Calculate t_y, t_x, s^2_x, s^2_y, Cov(y, x) and Cor(y, x).
   b) Calculate the population parameters B_1 and B_2.
   c) Calculate the population residuals E_k.

2. Consider the following sampling design:

   s_1: {u_1, u_2, u_3}   p(s_1) = 0.5
   s_2: {u_1, u_2, u_5}   p(s_2) = 0.3
   s_3: {u_3, u_4, u_5}   p(s_3) = 0.2

   a) Obtain the first order inclusion probabilities π_k and the second order inclusion probabilities π_kl.
   b) Obtain the covariances ∆_kl.
   c) Obtain for sample s_1 the π-estimators B̂_1 and B̂_2.
   d) Obtain for sample s_1 the estimated residuals e_k.
   e) Obtain the variance estimator V̂(B̂_2) based on sample s_1.

3. Now consider the case of simple random sampling with sample size n = 3.

   a) Obtain the number of possible samples |S| = M.
   b) Obtain the inclusion probabilities π_k and π_kl.
   c) Obtain the covariances ∆_kl.
   d) Consider the specific sample s = {u_1, u_2, u_3} and obtain for the SI-design the π-estimators B̂_1 and B̂_2.
   e) Estimate the residuals e_k for the specific sample.
   f) Obtain the variance estimator V̂(B̂_2) for the specific sample.

4. Use the PSID data and consider the variables wage (Y) and age (X).

   a) Obtain the population regression parameters B_1 and B_2 for the linear regression of wage on age.
   b) Draw B = 1000 samples of each size n ∈ {20, 50, 100, 500} and plot the approximate distribution of the slope estimator B̂_2 for the different sample sizes.
<?page no="182"?> Indices Functions Index [], 26 % o %, 173 anova, 106 apply, 28, 47, 65, 155, 173, 175 array, 173 arrows, 69 as.factor, 106 as.matrix, 30 as.numeric, 47, 65 as.vector, 174 barplot, 106 cbind, 46 choose, 36, 48, 64, 131 colMeans, 37, 155 colSums, 47 combn, 36, 64 cor, 69 data.frame, 26 density, 127, 179 dev.copy, 34 diag, 31 dnorm, 33, 91 for, 28, 48, 66, 82, 115, 140, 147, 155, 173 hist, 33 if, 28 ifelse, 28 is.element, 47, 65 lapply, 28, 110, 115 length, 64 lm, 34 matrix, 28, 66 mean, 26 names, 34 nrow, 81 order, 127 paste, 28 plot, 33 pnorm, 33 qnorm, 33 quantile, 69, 90 rbind, 175 read.csv, 31, 81 read.table, 31 regression simple random sampling, 177 rep, 27, 48 rnorm, 33 rowSums, 65 sample, 37, 68, 82, 110, 115, 131, 140, 147, 155 seq, 27 set.seed, 33 183 <?page no="183"?> 184 10 Indices solve, 31, 174 summary, 35 t, 31 tapply, 28, 105, 127 text, 69 title, 69 unlist, 110 vcov, 35 write.csv, 31 write.table, 31 <?page no="184"?> Subject Index R 2 , 35 π -estimator, 59, 76 distribution, 90 expected value, 59 mean, 76, 79, 90 properties, 59 proportion, 77, 80 regression, 163 SI, 78 total, 76, 78 variance, 59, 77 variance estimation, 99 allocation optimal, 118 proportional, 118 analysis of variance, 106 anova, 106 approximate normality, 97 auxilary variable, 138 average wage, 81 between variance, 107 bias, 59 binomial, 47 binomial distribution, 32 bootstrap, 100 Cauchy-Schwarz inequality, 112 Chebyshev interval, 94 Chebyshev-inequality, 92 application, 94 cluster sampling, 21, 126 variance, 99 variance estimator, 130 with simple random sampling, 131 clustered sample, 127 clustered sampling homogeneity, 133 variance increasing effect, 134 coefficient of determination, 35 combinations without replacement, 44, 47 condition, 27 confidence interval, 94 covariance, 51 inclusion indicator, 51 properties, 52 simple random sampling, 52 covariance matrix, 53 dataframe, 26 decomposition of variance, 106 density estimation, 127 design, 43 design approach, 16 regression, 162 diagonal matrix, 31 dichotomous, 42 difference estimator, 139, 152 distribution, 156 properties, 153 PSID, 154 distribution quantile, 32 drawing random sampling, 37 efficiency comparison, 120 sample allocation, 116 185 <?page no="185"?> 186 10 Indices efficiency comparison, 120 estimating function, 58, 76 properties, 58 estimator, 58 expansion, 142 for-loop, 28 graphics, 33 export, 34 help-function, 26 heterogeneity, 151 histogram, 33 homogeneity, 133 homogeneous regression, 162 Horvitz-Thompson estimator, 60, 70 inclusion indicator, 43 covariance, 51 expected value, 46, 51 properties, 50 variance, 51 inclusion probabilities unequal, 68 with clusters, 128 inclusion probability, 45 second order, 45 simple random sampling, 50 inference, 19 inverse, 31 Laplace, 49 linear regression, 34, 162 linear regression model, 151 loop, 27 matrix inverse, 31 matrix multiplication, 30 mean stratified sampling, 105 mean estimator distribution with clusters, 132 mean square error, 59 mse, 59 multiplication, 30 multivariate regression, 167 numerical example, 168 non-normality, 97 normal distribution, 32 normality, 97 object orientation, 26 one-draw probabilities, 72 optimal allocation sectors, 115 optimal stratification, 111 poisson distribution, 32 population, 42 population regression, 162, 177 PPS, 70 probabilities proportional to size, 70 probability distribution, 32 probability sampling, 19 proportion, 77 proportional allocation, 118 PSID, 68, 81, 90, 105, 127, 131, 138 clustered sampling, 131, 133 difference estimator, 154 ratio 
estimator, 140 wage regression, 177 R-functions, 26 <?page no="186"?> Subject Index 187 random generator, 16 random number, 32 random number generator initialisation, 33 random sample, 18 ratio estimator, 139 and regression, 149, 150 approximate variance, 145 of the total, 139 variance estimation, 146 regression, 34 π -estimator, 163 approximate variance, 169 approximative variance, 171 homogeneous, 162, 166 linear, 162 multivariate, 167 numerical example, 163, 173 stochastic, 177 variance, 169 variance estimation, 178 with intercept, 165 without intercept, 162 regression estimator, 139 regression function, 150 regression parameters variance, 169 replication, 27 Sampford-algorithm, 72 sample clustered, 127 sample allocation, 111 efficiency, 116 sample size for strata, 110 optimal allocation, 115 proportional allocation, 118 sample space, 43 sampling cluster, 21 simple random, 20 stratified, 20, 104 sampling clusters, 127 sampling design, 43 sampling designs, 20 sampling distribution, 18 sampling fraction, 52 sampling frame, 19 sampling with R, 37 scatterplot, 33 sectors as strata, 110 Sen, 62 sequence, 27 simple random sampling, 20, 44, 47, 77, 98 inclusion probability, 50 of clusters, 131 with strata, 108 stochastic models, 16 stochastic regression, 177 strata, 104 sectors, 105, 110 stratification effect of, 109 stratified sampling, 20, 104 optimal allocation, 111 sample size, 110 simple random sampling, 108 variance, 99 subset, 42 survey characteristics, 17 Taylor series, 142, 149 <?page no="187"?> 188 10 Indices total, 42, 76 estimation with clusters, 129 estimation with strata, 108 stratified sampling, 105 total estimation with clusters, 131 transpose, 31 uniform distribution, 32 variance, 42, 59 comparison, 117 estimation, 84 for clustered sampling, 130, 132 inclusion indicator, 51 linear approximation, 146 of ratio estimator, 145 regression parameters, 169 simplified estimator, 97 Yates-Grundy, 62 variance decomposition, 106, 117 variance estimation simplified, 99 with strata, 108 variance-covariance matrix, 35 wage, 81 within variance, 107 Yates-Grundy, 62, 99 <?page no="189"?> www.uvk-lucius.de Grundlagen mit Aufgaben Sinkende Geburtenrate, steigende Lebenserwartung und eine wachsende Zuwanderung: All das hat großen Einfluss auf die Wirtschaftsstatistik. Dieses Lehrbuch behandelt daher systematisch bevölkerungsstatistische Methoden, um anschließend die Wirtschaftsstatistik zu vermitteln. Der Stoff wird durch statistisches Datenmaterial zu Deutschland und durch zahlreiche Beispiele illustriert. Am Ende jeden Kapitels finden sich Aufgaben, die das Verständnis vertiefen. Themen des Buches sind unter anderem • demographische Prozesse und Projektionen, • Lebensdauern und Sterbetafeln, • Geburtenstatistiken, • Marktpreise und Preisstatistiken, • Input-Output-Analysen, • die Volkswirtschaftliche Gesamtrechnung. Das Lehrbuch richtet sich an Studierende der Wirtschafts- und der Sozialwissenschaften im Bachelor- und Masterstudium. Andreas Behr, Götz Rohwer Wirtschafts- und Bevölkerungsstatistik 1. Auflage 2012, 348 Seiten, flexibler Einband ISBN 978-3-8252-3679-3 <?page no="190"?> Von Schmalenbach bis zur verhaltenstheoretischen BWL www.uvk-lucius.de Günther Schanz Eine kurze Geschichte der Betriebswirtschaftslehre 1. Auflage 2014, 150 Seiten ISBN 978-3-8252-4106-3 Bereits in der Antike, im Mittelalter und in der Renaissance beschäftigten sich Gelehrte mit ökonomischen Fragestellungen. Die akademische Betriebswirtschaftslehre ist dennoch eine junge Disziplin, die erst im 20. 