Methodology - Population, Housing, and Income Estimates
First, a quick overview:
Building population estimates requires several data sets. Population change in an area is the sum of births, minus deaths, plus or minus those who move in or out. The starting point is the 2010 Redistricting block level data set, which has the most detailed and comprehensive counts of where the entire population of the US lives, broken out by race. Unfortunately, the Redistricting data set does not have age breakouts, so for those we turned to our existing breakouts from the US Census Bureau's county level estimates, which were the basis for all of our previously released estimate data sets.
The 2009 American Community Survey (ACS) is not in fact true 2009 data. At the Block Group level it is a summary of surveys administered over 5 years (2005-2009) and combined into a single value. We therefore cannot treat these numbers as indicative of 2009; they are closer to midpoint (2007) findings. We can obtain only Block Group level age breakouts, so we use those to weight several variables that are not available in the Redistricting data, such as Income and Educational Attainment. This data also has to be converted to the new 2010 boundaries, since it is published in the 2000 boundaries.
To progress from the 2010 data to current year estimates, we use the US Census Bureau's (USCB) annual County and State level estimates to roll the numbers forward. Because the USCB data is available only at the County and State level, the next challenge is distributing the data down to the smaller geographies. To do this we use actuarial tables for births and deaths by age and race to model the likelihood of dying or of having a child. This model is the engine that drives the increase and decrease in population.
The third step is to look at immigration and emigration: where people are moving "to" and where they are moving "from". The US Postal Service keeps track of all moves as a "to" and a "from" location.
Now the more detailed explanation:
1. Working with the Census Bureau "estimation base" county level numbers.
This data is processed to obtain "race distribution" coefficients. However, the Census Bureau estimation base data do not include an "other race" category, and the "two or more races" category is much smaller than it is in the Redistricting Census data. By comparing the estimation base to the Redistricting county level data, we can obtain numeric ratios describing how the "other race" and "two or more races" populations were distributed among the remaining races in the USCB's estimation base. These coefficients allow us to re-map the block level data and redistribute the "other race" and part of the "two or more races" population among the 6 remaining mutually exclusive races.
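The re-mapping described above can be illustrated with a minimal Python sketch. It is not the production procedure: the function names, the dictionary keys, and the rule that each race's share equals its county-level gain between the two data sets are all assumptions made for illustration.

```python
def redistribution_shares(census_county, base_county, single_races):
    """Share of the reallocated people that each single race gained in the county.

    Assumption: the gain of a race between the Redistricting count and the
    estimation base is entirely due to reallocated "other" / "two or more" people.
    """
    gains = {r: base_county[r] - census_county[r] for r in single_races}
    total_gain = sum(gains.values())
    if total_gain <= 0:
        raise ValueError("no reallocated population found for this county")
    return {r: gains[r] / total_gain for r in single_races}


def remap_block(block, shares, two_kept_ratio):
    """Redistribute a block's "other" and part of its "two_or_more" population.

    two_kept_ratio is the county-level fraction of "two or more races" that
    stays in that category (estimation base count / Redistricting count).
    """
    pool = block["other"] + block["two_or_more"] * (1.0 - two_kept_ratio)
    remapped = {r: block[r] + shares[r] * pool for r in shares}
    remapped["two_or_more"] = block["two_or_more"] * two_kept_ratio
    return remapped


# Hypothetical county with only two single races, to keep the numbers readable.
census_county = {"white": 700, "black": 200, "other": 60, "two_or_more": 40}
base_county = {"white": 750, "black": 230, "two_or_more": 20}
shares = redistribution_shares(census_county, base_county, ["white", "black"])
kept = base_county["two_or_more"] / census_county["two_or_more"]
block = {"white": 70, "black": 20, "other": 6, "two_or_more": 4}
remapped_block = remap_block(block, shares, kept)
```

The block total is unchanged by the re-mapping; only the split between categories moves.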
2. The Redistricting block level data are processed with these new racial distribution coefficients. The resulting dataset is our estimation base. It includes 8 race/origin groups:
Native American alone
Two or more races
White, not Hispanic
We do not have the "Other Race" category in the estimates even though Census 2010 does, because the USCB dropped the "Other Race" data from its estimates. They switched to 8 races in 2001 and we had to follow. It is worth mentioning that in its estimates the USCB redistributed the "Other Race" counts completely, and the "2 or more Races" counts partially, among the rest of the races. We did the same, and therefore our racial breakdown differs from Census 2010 but fits the 2001-2009 USCB estimates. We believe that the USCB made these changes because there are no actuarial tables for "other" or "2 or more" races, so they needed to redistribute those people into one of the race categories for which they could create estimates.
3. Having dealt with Race we then turn to Age. The USCB groups the population into 18 age groups. These range from age 0 (under 1) to age 108. Each age group is a 5-year interval (0-4, 5-9, etc.), except that ages 85 and up (85-108) are treated as a single group.
4. Now that we have the entire population broken down into age and race categories we begin building the death-birth model. With the use of actuarial tables we calculate the statistical likelihood for any given age/race group to die or to give birth. We then apply these coefficients to the 2010 data to create an estimation base for 2011. The coefficients are then reapplied to create 2012, and so on until we reach the current year.
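As a small illustration of the 18 age groups, the following Python sketch maps a single year of age to its group. The function names are hypothetical; only the group boundaries come from the description above.

```python
AGE_GROUP_COUNT = 18  # seventeen 5-year groups (0-4 ... 80-84) plus 85-108


def age_group_index(age):
    """Return the 0-based index of the age group that contains `age`."""
    if not 0 <= age <= 108:
        raise ValueError("age must be between 0 and 108")
    return min(age // 5, AGE_GROUP_COUNT - 1)


def age_group_label(index):
    """Return a readable label such as "0-4" or "85-108"."""
    if index == AGE_GROUP_COUNT - 1:
        return "85-108"
    return f"{index * 5}-{index * 5 + 4}"
```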
The model includes:
- transformation of age group distribution to "exact age" distribution. The resulting data set has population groups for each single year of age from 0 to 108.
- application of death probabilities for a specific age, sex and race group.
- application of birth rates for a specific age, sex and race group. The white population is treated as a mix of white not Hispanic and Hispanic population. The mix ratio is determined from the block data.
- 1 year shift.
- collecting the annual data into 5-year buckets.
- comparison of the results with Census Bureau estimates for this year.
- the results of the comparison are used to tweak birth rates and death probabilities so that the numbers of both newborn and deceased in the model exactly equal the Census Bureau numbers for each county. The racial distribution is also tweaked to reflect that of the Census Bureau data. This puts the annual estimates in sync with USCB data as much as possible.
5. The same model is applied to the results to produce the projections. This time, however, the "tweaking coefficients" are predicted, as we do not have any data for comparison. The prediction algorithm is based on a linear regression approach; the historical coefficients fit a straight line very closely.
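The model steps listed above can be sketched for a single population group as follows. This is an illustrative simplification, not the actual model: it assumes people are spread evenly within each age group, it adds newborns to the same group rather than splitting them by sex, it keeps survivors at age 108 in the top age, and it tweaks rates with a single multiplicative factor. All function names are hypothetical.

```python
MAX_AGE = 108
TOP_START = 85  # the open-ended top group covers ages 85-108


def to_single_years(groups):
    """Spread 18 age-group counts evenly over single years of age 0..108."""
    single = []
    for i, count in enumerate(groups):
        width = 5 if i < 17 else MAX_AGE - TOP_START + 1
        single.extend([count / width] * width)
    return single


def to_groups(single):
    """Collect single-year counts back into the 18 age groups."""
    groups = [sum(single[i * 5:(i + 1) * 5]) for i in range(17)]
    groups.append(sum(single[TOP_START:]))
    return groups


def step_one_year(groups, death_prob, birth_rate):
    """Advance one group by one year; returns (new_groups, births, deaths)."""
    single = to_single_years(groups)
    births = sum(n * birth_rate[a] for a, n in enumerate(single))
    deaths = sum(n * death_prob[a] for a, n in enumerate(single))
    survivors = [n * (1.0 - death_prob[a]) for a, n in enumerate(single)]
    shifted = [births] + survivors[:-1]   # everyone is one year older
    shifted[-1] += survivors[-1]          # assumption: age 108 is a ceiling
    return to_groups(shifted), births, deaths


def calibrate(rates, model_total, census_total):
    """Scale rates so the model total would match the Census Bureau total."""
    factor = census_total / model_total
    return [r * factor for r in rates]
```

A typical use is to run `step_one_year`, compare the modeled births with the county figure, call `calibrate` on the birth rates, and run the step again. Because births are linear in the rates, the second run reproduces the target exactly.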
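To make the prediction step concrete, here is a minimal least-squares sketch in plain Python. The coefficient values in the usage lines are invented for illustration, and the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_coefficient(years, coefficients, target_year):
    """Extrapolate a tweaking coefficient to a year with no comparison data."""
    slope, intercept = fit_line(years, coefficients)
    return slope * target_year + intercept


# Hypothetical tweaking coefficients observed for three estimate years.
predicted = predict_coefficient([2010, 2011, 2012], [1.00, 1.02, 1.04], 2013)
```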
Methodology - Household Estimates
The household estimates were calculated from:
- the ACS Census data with household counts at the Block Group level
- the estimated data on the households
- the Census data on the age-race-sex
- the estimated data on the age-race-sex.
GeoLytics calculates the ratios of Census household variables to Census age-race-sex data and Census housing data, and then applies these ratios to the estimated data of the same nature to get the estimated values. The underlying assumption is that the average family size by race will not have changed dramatically in the years since the 2009 ACS data was compiled.
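The ratio approach can be sketched in a few lines of Python. This is a simplified illustration for one block group and one household variable; the function and parameter names are hypothetical, and the zero-population fallback is an assumption.

```python
def estimate_households(census_households, census_population, estimated_population):
    """Carry the Census households-per-person ratio forward to estimated data."""
    if census_population == 0:
        return 0.0  # assumption: no base population means no households
    ratio = census_households / census_population
    return ratio * estimated_population
```

For example, a block group with 400 households per 1,000 people in the Census data and an estimated population of 1,100 would be assigned about 440 households.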
Methodology - Housing Estimates
The only way that the number of housing units (HU) changes is if new buildings are built or old ones torn down. Some houses can be built on empty lots, but if a lot of houses are built usually a whole new development gets put in. So the first thing that we do is to look at the TIGER/Line files. This is the USCB file that shows each and every street in the US and has the numbers of each housing unit. By looking at this dataset we can determine if new streets have been put in and by looking at the numbering we can determine about how many units are being built. We can also see if new numbers have been added to an existing street.
1. The TIGER/Line records for the years 2009 and 2010 were analyzed. For each block, the sum of associated address ranges was calculated. As a result, each block was assigned a Change Coefficient (CC), a number representing the change in the aggregate number of addresses within this block. The number is a fraction between -1 and +1. The number 0 represents a block that did not change within this time interval. The number +1 represents a block that had no addresses in 2009 and has some in 2010, and the number -1 represents a block that had some addresses in 2009 and has none in 2010. The block changes were later summarized to the BG level.
2. The Census Bureau Housing Units Estimates (at the county level) for the years 2009 to 2010 were used to estimate the number of HU per county for the year 2011 via a linear regression algorithm.
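The exact CC formula is not spelled out here. One formula that satisfies every property listed above (0 for no change, +1 for a block that is new in 2010, -1 for a block that disappeared, always between -1 and +1) is the change divided by the larger of the two counts. The Python sketch below uses that formula as an assumption, and it also assumes that the BG level value is recomputed from summed block counts.

```python
def change_coefficient(addresses_2009, addresses_2010):
    """Assumed formula: change in address count divided by the larger count."""
    largest = max(addresses_2009, addresses_2010)
    if largest == 0:
        return 0.0  # no addresses in either year: nothing changed
    return (addresses_2010 - addresses_2009) / largest


def block_group_cc(blocks):
    """Summarize blocks to a BG; blocks is a list of (count_2009, count_2010)."""
    total_2009 = sum(b[0] for b in blocks)
    total_2010 = sum(b[1] for b in blocks)
    return change_coefficient(total_2009, total_2010)
```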
3. For each county, the Census Bureau HU growth/decline was distributed among BGs of this county so that:
- BGs with CC = 0 did not change any HU counts
- BGs with CC not equal to 0 received parts of the county growth on a proportional basis, so that BGs with CC > 0 gained some HUs and BGs with CC < 0 lost some HUs. The results vary from small changes (a few percent is typical) to, rarely, dramatic changes of 3-5 times. The latter are where large housing complexes went in and dramatically changed the number of housing units in the block group.
Once we had the change in the number of Housing Units, we could then look at the other housing variables, such as number of rooms, vacancy status, tenure (own vs. rent) status, etc. People all live in either a household or a group quarter (military barracks, college dorms, nursing homes, prisons, mental institutions, half-way homes, etc.). The group quarters were left stable, so the changes in population were accounted for in the changes in Housing Units that had now been calculated. For example, if the housing units stayed the same but the population numbers dropped, then the vacancy status would go up.
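The proportional rule is described only in outline, so the Python sketch below shows one possible reading rather than the actual algorithm. It assumes each BG's raw change is its CC times its current HU count, and that either the gains or the losses are rescaled so that the BG changes add up to the county change. The function name and the input layout are hypothetical.

```python
def distribute_hu_change(county_change, block_groups):
    """Split a county HU change among BGs.

    block_groups maps a BG id to (cc, housing_units). Returns {bg_id: change}.
    BGs with CC = 0 get 0, CC > 0 gain units, CC < 0 lose units.
    """
    raw = {bg: cc * hu for bg, (cc, hu) in block_groups.items()}
    gains = sum(v for v in raw.values() if v > 0)
    losses = -sum(v for v in raw.values() if v < 0)
    gain_scale, loss_scale = 1.0, 1.0
    if county_change >= 0:
        needed = county_change + losses   # gains must cover growth plus losses
        if needed > 0 and gains == 0:
            raise ValueError("county grew but no BG has CC > 0")
        gain_scale = needed / gains if gains else 0.0
    else:
        needed = gains - county_change    # losses must cover decline plus gains
        if losses == 0:
            raise ValueError("county declined but no BG has CC < 0")
        loss_scale = needed / losses
    return {bg: v * (gain_scale if v > 0 else loss_scale) for bg, v in raw.items()}


# Hypothetical county with three block groups and a growth of 50 units.
changes = distribute_hu_change(50, {"A": (0.0, 100), "B": (0.5, 200), "C": (-0.1, 100)})
```

With these inputs, A is unchanged, B gains about 60 units and C loses about 10, which sums to the county growth of 50.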
The sum of all changes for all BGs in a county is equal to the Census Bureau HU county growth estimates.
Methodology - Income Estimates
When calculating Income Estimates there are several components. First we needed to calculate the changes in income from 2000 to 2009 so that we would have a basis for estimating forward. We also needed to account for the age changes (everyone has aged since April 2000 so all of the age categories needed to shift up).
1. The first step was to create an Income Growth by Race number for each Block Group. Luckily, the 2009 ACS data is actually in the 2000 block group boundaries, so we were able to compare data over the exact same geographies. To roll it forward to the 2011 estimates, however, we then have to normalize the data to the new 2010 geographic boundaries. The 2010 Redistricting data does not include any income variables; otherwise we would have used that data set and avoided the normalization step.
2. The BG-level racial growth data were applied to 2009 ACS Census data to obtain 2009 racial income growth coefficients for each BG area. First, the growth data for 2000-2009 were processed using a compound interest model. Second, the calculated "interest rates" were applied to 2000 racial income data to get the 2009 growth data.
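The compound interest model can be written out as a short Python sketch. The function names and the income figures are illustrative assumptions; only the compounding idea comes from the step above.

```python
def annual_growth_rate(start_value, end_value, years):
    """Constant yearly "interest rate" that turns start_value into end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def grow(value, rate, years):
    """Apply a yearly rate for a number of years, compounding annually."""
    return value * (1.0 + rate) ** years


# Hypothetical BG median income for one race: 40,000 in 2000 and 52,000 in 2009.
rate = annual_growth_rate(40000.0, 52000.0, 9)
income_2009 = grow(40000.0, rate, 9)   # recovers the 2009 value
income_2011 = grow(40000.0, rate, 11)  # the same rate carried two more years
```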
The Income Growth data by Race were not available for many BGs for some races, because if there are very few households of a given race in a block group then the numbers were suppressed by the USCB. For these cases, we used the USCB Median Income Estimates for years 2000-2009 to get 2011 state median income growth data using a linear regression algorithm, and then used these state growth data for the affected Block Groups and races.
3. The racial aggregate income data were processed in the same manner as racial median income data.
4. The Householder age distributions were estimated by using estimated Householder totals from our dataset and an age shift model. Namely, for each age group, a calculated number of householders was moved to the next age group. The first and last age groups were processed in a special way to take into account both new and dead householders. The sum of all householder age brackets is equal to our estimated HH total.
5. The area income range data were estimated using a distribution shift model. First, we assumed that the Census 2000 income brackets represent the "best fit curve" frequency distribution, and then applied a linear stretch transformation to the income scale. Finally, we calculated the new income bracket values produced by this linear stretching of the frequency distribution. The stretch coefficient was equal to the median income growth ratio for this area. This means that the income increase moves some households from their income bracket in 2009 to the next income bracket in 2011. The number of such households can be estimated mathematically if we know the exact number of households for each income value. This exact number can be estimated using the "best fit curve" model.
6. Finally, the BG data (both medians and aggregates) were tuned so that summary state median values were exactly equal to the state median data for 2009, as estimated from the Census Bureau's ACS data set. This was done by using a two-section linear mapping scheme. The scheme:
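The distribution shift can be illustrated with a simplified Python sketch. In place of the "best fit curve" it assumes households are spread evenly within each bracket, and it assumes a stretch coefficient of at least 1 so that the open-ended top bracket keeps all of its households. The function name and bracket edges are hypothetical.

```python
def stretch_brackets(edges, counts, k):
    """Re-bin household counts after multiplying every income by k.

    edges:  ascending lower bounds of the brackets; the last bracket is open-ended.
    counts: households per bracket, same length as edges.
    k:      stretch coefficient (median income growth ratio), assumed >= 1.
    """
    n = len(edges)
    new_counts = [0.0] * n
    for i, count in enumerate(counts):
        if i == n - 1:
            new_counts[i] += count  # top bracket stays on top when k >= 1
            continue
        lo, hi = edges[i] * k, edges[i + 1] * k  # stretched bracket
        for j in range(n):
            b_lo = edges[j]
            b_hi = edges[j + 1] if j < n - 1 else float("inf")
            overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
            new_counts[j] += count * overlap / (hi - lo)
    return new_counts


# Hypothetical brackets: under 10K, 10-20K, 20-30K, 30K and over.
stretched = stretch_brackets([0, 10000, 20000, 30000], [100, 100, 100, 50], 1.1)
```

The total number of households is preserved; a 10% stretch moves roughly one eleventh of each closed bracket up into the next one.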
- moves the actual state median so it becomes equal to the target value;
- leaves state minimum and maximum median values for state BGs intact;
- is linear (a*x + b) on each of two segments: a) between the state minimum median value over all state BGs and the state median, and b) between the state median and the state maximum median value over all state BGs, with different a and b on each segment.
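The three properties above determine the mapping completely, so it can be sketched directly in Python. The function and parameter names are hypothetical, and the sketch assumes that the minimum, median and maximum are strictly increasing.

```python
def two_section_map(x, state_min, state_median, state_max, target_median):
    """Piecewise-linear map: min -> min, median -> target, max -> max."""
    if x <= state_median:
        # Lower segment: between the state minimum and the state median.
        a = (target_median - state_min) / (state_median - state_min)
        return state_min + a * (x - state_min)
    # Upper segment: between the state median and the state maximum.
    a = (state_max - target_median) / (state_max - state_median)
    return target_median + a * (x - state_median)
```

Each BG median is passed through the same function, so values near the state median move by nearly the full adjustment and values near the extremes barely move.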
Methodology - Income Disparity
For each area, we calculate 2 counts:
- FamLowInc = sum of family counts from FMUNDER10K to EFamIncEst.FM35_40K inclusive
- FamHighInc = sum of family counts from FM150_200K to FM200KPLUS inclusive.
Family disparity is 100 * log (FamLowInc / FamHighInc)
or (in equivalent form)
100 * (log (FamLowInc) - log (FamHighInc))
So it is essentially 100 times the log of the number of families who are poor (family income under $40K) divided by the number who are really rich (family income over $150K).
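A minimal Python sketch of the calculation follows. The base of the logarithm is not stated above, so the natural log is assumed here; returning None when either count is zero is also an assumption, as is the function name.

```python
import math


def family_income_disparity(low_income_counts, high_income_counts):
    """100 * log(FamLowInc / FamHighInc), using the natural log (assumed).

    low_income_counts:  family counts for the brackets under $40K.
    high_income_counts: family counts for the brackets of $150K and over.
    """
    fam_low = sum(low_income_counts)
    fam_high = sum(high_income_counts)
    if fam_low <= 0 or fam_high <= 0:
        return None  # the log is undefined; the source does not say how to handle this
    return 100.0 * math.log(fam_low / fam_high)


# Hypothetical area: 300 low-income families and 100 high-income families.
disparity = family_income_disparity([100, 200], [60, 40])
```

A value of 0 means the two groups are the same size, positive values mean more low-income than high-income families, and negative values mean the reverse.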