26 November 2008

R-values 1, gut instinct 0

I wrote the following paper in 2007 while studying econometrics with Professor Dick Startz. Statistical methods can be powerful tools, notwithstanding the old line about lies, damn lies, and statistics. The analysis below is just the tip of the iceberg in how data can be used in politics.

Improving Voter Targeting by Political Campaigns

Executive Summary

Political campaigns use a variety of methods to target the voters they want to contact. During the 2006 election cycle I worked with the Darcy Burner campaign in a support role, and saw how they used various types of data to choose which voters to pursue. It was striking to me that statistical analysis and methodical sampling were not part of the plan. The campaign relied instead on a large army of volunteers to contact as many voters as possible within a fairly wide range of potential contacts.

I have taken a look in hindsight at how well the prediction variables performed. I investigated two primary areas: the likelihood that a voter would return a ballot, and the likelihood that a voter would be a supporter. I regressed ballot returns on past vote history and found a reasonably strong relationship, with an R-squared of 0.42. The campaign correctly used past vote history as a predictor, although my model improves the predictive value by assigning weights to each past election. The three variables the campaign relied on to determine support were age, sex, and “ST score”. I found a statistically significant correlation between each of these variables and the average support level for the candidate, which is what the campaign would hope for. However, the overall amount of variability in support explained by these three variables is quite low, with an R-squared of only 0.018. This leaves enormous room for improvement in the future by campaigns that are willing and able to apply statistical methods to the data they are collecting.

Introduction

Identification of likely supporters is a critical component of running a successful political campaign. The identification effort runs for months or even years in advance of an election, and culminates in a massive effort to encourage supporters to get to the polls and vote on the day of the election. In political lingo, this is known as GOTV (get out the vote). It is often the effectiveness of the GOTV effort that determines the victor in a close race. During the 2006 election cycle, one of the closest and most contested races in the nation was here in Washington in the 8th Congressional District, pitting incumbent Dave Reichert (R) against Darcy Burner (D). The final result was a win for Reichert, who garnered 51.5% of the vote. These two candidates are slated for a rematch in the 2008 election, and the race is widely expected to be extremely close once again.

Methodology and Definitions:

Text fields in the data were assigned numeric values for easier analysis. Each voter in the database has several pieces of corresponding data. The simplest is basic demographic information such as age and sex. Each voter also has a complete vote history for the past 6 years, for both primary and general elections. All vote history data is entered as a 1 or 0, where 1 indicates the voter returned a ballot and 0 indicates no ballot was returned. The “ST score” is a score from 0 to 100 provided by an outside vendor, based upon a variety of proprietary demographic information. The “Pre 2006 Grade” is a ranking from 1 to 5 which tracks the partisan affiliation of a voter based on voter contact prior to the 2006 election: 1 indicates a partisan Republican, 5 a partisan Democrat, 3 an Independent, and 2 or 4 someone leaning toward a party. “Cantwell ID” is a ranking from 1 to 5, with 1 indicating a strong opponent and 5 a strong supporter of Senator Maria Cantwell. Sex was also converted to a numeric field, with female=0 and male=1.
All regressions were run on the same random sample of 45,097 voters from the 8th Congressional District. This is 4,903 fewer voters than the random sample of 50,000 used in my initial investigation. The removed voters had not been assigned an “ST Score” by the original vendor; a different scoring methodology was used on this subset, so these voters should not have been included in my initial analysis. Also, my initial investigation only had ST Score data grouped into broad categories, whereas this improved analysis uses raw scores from 0 to 100 for each individual voter.
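The numeric coding described above can be sketched in a few lines of Python. This is only an illustration: the raw text values ("F", "M", "voted") and the field names are hypothetical, since the paper does not show what the original voter-file fields looked like. Only the target codes (female=0, male=1, ballot returned=1) come from the text.

```python
# Hypothetical raw labels; only the numeric targets come from the paper.
SEX_CODES = {"F": 0, "M": 1}  # paper: female=0, male=1


def encode_voter(record):
    """Convert one raw voter record (text fields) into numeric fields."""
    return {
        "sex": SEX_CODES[record["sex"]],
        "age": int(record["age"]),
        # Vote history: 1 if a ballot was returned, otherwise 0.
        "general05": 1 if record["general05"] == "voted" else 0,
    }


raw = {"sex": "F", "age": "42", "general05": "voted"}  # made-up example row
encoded = encode_voter(raw)
print(encoded)

# Voters dropped for lacking a vendor-assigned ST score.
dropped = 50_000 - 45_097
print(dropped)
```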

Goals:

· Create a model to help campaigns identify groups of voters who are more likely to be supporters and to eliminate groups of voters who are likely opponents.
· Investigate the relationship between past vote history and whether a voter returned a ballot during the 2006 general election.
· Look at how accurate the “ST score” was in terms of predicting whether a voter would actually be a supporter of Darcy Burner.
· Look at age and sex as possible factors in determining support for Darcy Burner.

Results

First, I calculated the average support level for the sample. The average of 3.36 is a bit above 3, which is the undecided level.(1) This does not mean the average support level across all voters is actually greater than 3, however, since the measured support level is likely skewed upwards. This is due to the effort by the campaign to identify supporters, and their likely exclusion of groups of likely opponents from voter contact efforts. Voter contact is a time-consuming and expensive process, and only a fraction of eligible voters ever wind up with a known support level. As the included number of observations shows, only 6,980 voters in this sample of 50,000 voters were tagged with a support level.
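Regressing a variable on a constant alone, as in regression (1), simply returns its mean, and the standard error of that mean is the standard deviation divided by the square root of the number of observations. The short Python check below reproduces the reported standard error and t-statistic from the figures in the output. The comparison against 3 (the undecided level) is my own addition, and it only describes the contacted voters, not the electorate, for the reasons given above.

```python
import math

# Figures copied from regression 1 under Data Output.
n = 6980          # included observations
mean = 3.359026   # coefficient on C (the sample mean)
sd = 1.642687     # S.D. dependent var

std_error = sd / math.sqrt(n)   # standard error of the mean
t_stat = mean / std_error       # t-statistic against zero, as reported
t_vs_undecided = (mean - 3) / std_error  # distance from the undecided level

print(round(std_error, 6), round(t_stat, 2), round(t_vs_undecided, 2))
```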

Next, I looked at the predictive value of the past party affiliation data.(2) A statistically significant positive correlation exists between voter grade and support level. The voter grade from prior to 2006 explains about 16.5% of the variation (R-squared of 0.165) in the support level recorded for the U.S. House race, which is the dependent variable in this regression. This shows that while voter grade can be helpful in identifying supporters, it is only a small piece of the puzzle and cannot be relied upon exclusively.
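To make regression (2) concrete, the sketch below plugs each grade into the fitted line and re-derives the t-statistic, F-statistic, and R-squared from the reported coefficient and standard error. The identities used (F equals t squared, and R-squared equals F / (F + n - 2) in a one-regressor model) are standard results; the function name is just illustrative.

```python
# Figures copied from regression 2 under Data Output.
INTERCEPT = 0.930433
SLOPE = 0.781945
SLOPE_SE = 0.028890
N_OBS = 3696


def predicted_support(grade):
    """Fitted support level (1 to 5 scale) for a pre-2006 grade of 1 to 5."""
    return INTERCEPT + SLOPE * grade


for grade in range(1, 6):
    print(grade, round(predicted_support(grade), 2))

t_stat = SLOPE / SLOPE_SE
f_stat = t_stat ** 2                         # one regressor: F = t^2
r_squared = f_stat / (f_stat + (N_OBS - 2))  # R^2 recovered from F
print(round(t_stat, 2), round(f_stat, 1), round(r_squared, 4))
```

Even a partisan Democrat (grade 5) has a fitted support level below 5, and the residual standard error of about 1.49 points shows how much individual voters scatter around the line.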

The next set of data I was interested in exploring was the three variables that were used together during the campaign to search for potential supporters: age, sex, and the ST score.(3) While all of these variables are statistically significant, the overall R-squared of .018 is very low. This is a slight increase from the .015 in my initial investigation. I definitely expected to see a bigger improvement as I moved from the roughly categorized scores I used initially to the specific scores here, which are stated to two decimal places! Thus, the use of these predictors only marginally improves on finding supporters over a purely random sample. There does appear to be a higher level of support for Darcy Burner amongst women (sex=0, coefficient=1.87) than amongst men (sex=1, coefficient=1.64). The campaign did work under the assumption that Burner was more likely to find support amongst women than men, given she was up against a male opponent, and this assumption appears correct.
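As a rough illustration of how the coefficients in regression (3) combine, the Python sketch below computes a fitted support level for a single voter. The example voter (age 40, ST score 50) is hypothetical, and the function name is mine; the coefficients are copied from the output. Note that the sex coefficients act as separate intercepts for women and men, so their difference is the estimated gap at any given age and score.

```python
# Coefficients copied from regression 3 under Data Output.
SEX_INTERCEPT = {0: 1.870616, 1: 1.635649}  # 0 = female, 1 = male
AGE_COEF = 0.007527
ST_SCORE_COEF = 0.024752


def predicted_support(age, st_score, sex):
    """Fitted support level on the 1 to 5 scale."""
    return SEX_INTERCEPT[sex] + AGE_COEF * age + ST_SCORE_COEF * st_score


# Hypothetical voters: same age and ST score, different sex.
woman = predicted_support(age=40, st_score=50, sex=0)
man = predicted_support(age=40, st_score=50, sex=1)
gap = woman - man
print(round(woman, 3), round(man, 3), round(gap, 3))
```

The gap of roughly a quarter of a point is small next to the regression's standard error of about 1.63, which is another way of seeing why the R-squared is so low.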

Campaigns often focus their GOTV efforts on voters who are in the middle range of turnout likelihood. It is assumed that it is wasted money to try to contact someone who is very unlikely to vote, and likewise someone who is guaranteed to vote. The standard predictor used in determining likelihood to vote is the recorded vote history. It was assumed that general election turnout was a more important indicator than primary turnout, particularly since the focus was on the upcoming general election. In hindsight, we can regress the 2006 general election turnout on the complete vote history for the previous 6 years to estimate a coefficient for each election, and use these coefficients to estimate the likelihood that a voter will turn out.(4) The data confirms that general elections tend to carry more weight than primaries overall, and that more recent elections carry more weight than older elections. Thus, the general model the campaign was using was correct, although it could be more finely tuned by using this new model to assign real weights to each election. Notably, the R-squared is .42, which intuitively seems about correct. It would be unlikely to find a model that could predict with near certainty whether an individual voter would actually vote, but this model does provide some good guidance in terms of general trends.
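Here is a minimal Python sketch of how the weights from regression (4) could be turned into a turnout score for each voter. The function names and the three example voters are my own; the coefficients are copied from the output. Because this is a linear model fitted to a 0/1 outcome, fitted values can fall slightly outside the 0 to 1 range, so the sketch clips them before treating them as probabilities. That clipping is an assumption on my part, not something done in the original analysis.

```python
# Coefficients copied from regression 4 under Data Output.
INTERCEPT = 0.147413
COEFFICIENTS = {
    "GENERAL05": 0.283711, "GENERAL04": 0.206254, "GENERAL03": 0.014962,
    "GENERAL02": 0.108505, "GENERAL01": 0.023894, "GENERAL00": 0.042168,
    "PRIMARY06": 0.191973, "PRIMARY05": 0.022330, "PRIMARY04": 0.066253,
    "PRIMARY03": -0.018960, "PRIMARY02": -0.036906, "PRIMARY01": -0.032836,
    "PRIMARY00": 0.001494,
}


def turnout_score(history):
    """Raw fitted value; history maps election name to 1 or 0 (missing = 0)."""
    return INTERCEPT + sum(
        coef * history.get(name, 0) for name, coef in COEFFICIENTS.items()
    )


def as_probability(score):
    """Clip a raw fitted value into the 0 to 1 range."""
    return min(1.0, max(0.0, score))


never_voted = turnout_score({})
general_only = turnout_score(
    {name: 1 for name in COEFFICIENTS if name.startswith("GENERAL")}
)
perfect = turnout_score({name: 1 for name in COEFFICIENTS})

for label, score in [("never voted", never_voted),
                     ("general elections only", general_only),
                     ("perfect record", perfect)]:
    print(label, round(score, 3), round(as_probability(score), 3))
```

A campaign focusing on the middle range of turnout likelihood could then rank voters by this score and contact those who fall between two chosen cutoffs.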

In conclusion, the results of this analysis show that the campaign was making decent educated guesses about which factors to look at when trying to direct their efforts. The piece of information they were lacking, though, was how much weight each variable deserved. Without knowing how to weigh age versus sex or ST score, the campaign was left making important decisions based on gut instinct. Moreover, they were not aware of how well their predictor variables were performing, and thus had no incentive to search for better variables. It is very possible that a concerted use of regression analysis, along with a careful plan to sample cross sections of the voting populace, would have led to more efficient use of resources. For example, the campaign made an effort to target young women. Looking at the regression of support on age, expanded to show all ages by year, it appears the support level for younger women is actually lower than that of older women.(5) The sample size is small enough that the standard errors for these coefficients introduce some uncertainty into this result, but the data does seem to indicate a real trend here. I removed 19-year-olds and those over 93 due to high standard errors and created a scatter plot to illustrate this trend, with a fitted regression line. It becomes quite clear now that targeting younger women may not have been an optimal campaign strategy.

This scatter plot illustrates how regression analysis could have improved voter targeting and potentially changed the outcome of a Congressional race, by focusing on older women rather than younger women. Last, to return to a previous point, the ST Score, which was given great credence during the campaign cycle, explained less than 2% of the variation in a voter’s support level. The fact that the ST Score was reported to two decimal places creates an impression that it is extremely precise, and that it can be used in a very fine manner to target voters. In reality, there was almost no difference between broad categories of the ST Score and the very fine scoring used in my second analysis. This suggests that campaign directors should take modeling scores with a grain of salt at best, at least until a real-world verification of their effectiveness has been performed.
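The fitted-line idea behind the scatter plot can be sketched in plain Python. To keep it short, this example uses only every tenth age (20, 30, ..., 90) from regression (5), so it is a subset for illustration and not a reproduction of the plot, which used every age from 20 to 93. It also weights each age equally, which is an assumption of this sketch.

```python
# Average support among women at selected ages, copied from regression 5.
ages = [20, 30, 40, 50, 60, 70, 80, 90]
support = [3.000000, 3.230769, 3.465753, 3.385965,
           3.694737, 3.800000, 3.842105, 4.000000]


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(ages, support)
print(round(intercept, 3), round(slope, 4))
```

On this subset the slope is positive, at a bit more than a hundredth of a point of support per year of age, which is consistent with the upward trend described above.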

Data Output

1. ls us_house_of_represen c

Dependent Variable: US_HOUSE_OF_REPRESEN
Method: Least Squares
Sample: 1 50000 IF STRATTELEMETRY_SCORE>0
Included observations: 6980

Variable Coefficient Std. Error t-Statistic Prob.
C 3.359026 0.019662 170.8387 0.0000

R-squared 0.000000 Mean dependent var 3.359026
Adjusted R-squared 0.000000 S.D. dependent var 1.642687
S.E. of regression 1.642687 Akaike info criterion 3.830687
Sum squared resid 18832.28 Schwarz criterion 3.831669
Log likelihood -13368.10 Hannan-Quinn criter. 3.831025
Durbin-Watson stat 1.952899

2. ls us_house_of_represen c pre2006_grades

Dependent Variable: US_HOUSE_OF_REPRESEN
Method: Least Squares
Sample: 1 50000 IF STRATTELEMETRY_SCORE>0
Included observations: 3696

Variable Coefficient Std. Error t-Statistic Prob.
C 0.930433 0.100374 9.269688 0.0000
PRE2006_GRADES 0.781945 0.028890 27.06640 0.0000

R-squared 0.165498 Mean dependent var 3.565476
Adjusted R-squared 0.165272 S.D. dependent var 1.625753
S.E. of regression 1.485344 Akaike info criterion 3.629711
Sum squared resid 8149.880 Schwarz criterion 3.633074
Log likelihood -6705.707 Hannan-Quinn criter. 3.630908
F-statistic 732.5898 Durbin-Watson stat 2.136515
Prob(F-statistic) 0.000000

3. ls us_house_of_represen @expand(sex) age strattelemetry_score

Dependent Variable: US_HOUSE_OF_REPRESEN
Method: Least Squares
Sample: 1 50000 IF STRATTELEMETRY_SCORE>0
Included observations: 6974

Variable Coefficient Std. Error t-Statistic Prob.
AGE 0.007527 0.001264 5.954138 0.0000
STRATTELEMETRY_SCORE 0.024752 0.003550 6.973313 0.0000
SEX=0 1.870616 0.186871 10.01023 0.0000
SEX=1 1.635649 0.183821 8.898059 0.0000

R-squared 0.018417 Mean dependent var 3.359048
Adjusted R-squared 0.017995 S.D. dependent var 1.642550
S.E. of regression 1.627704 Akaike info criterion 3.812791
Sum squared resid 18466.46 Schwarz criterion 3.816720
Log likelihood -13291.20 Hannan-Quinn criter. 3.814145
Durbin-Watson stat 1.974734

4. LS GENERAL06 C GENERAL05 GENERAL04 GENERAL03 GENERAL02 GENERAL01 GENERAL00 PRIMARY06 PRIMARY05 PRIMARY04 PRIMARY03 PRIMARY02 PRIMARY01 PRIMARY00

Dependent Variable: GENERAL06
Method: Least Squares
Sample: 1 50000 IF STRATTELEMETRY_SCORE>0
Included observations: 45097

Variable Coefficient Std. Error t-Statistic Prob.
C 0.147413 0.004184 35.23009 0.0000
GENERAL05 0.283711 0.004473 63.43001 0.0000
GENERAL04 0.206254 0.005198 39.68120 0.0000
GENERAL03 0.014962 0.005228 2.861771 0.0042
GENERAL02 0.108505 0.005184 20.93248 0.0000
GENERAL01 0.023894 0.005332 4.481124 0.0000
GENERAL00 0.042168 0.004811 8.765151 0.0000
PRIMARY06 0.191973 0.004664 41.15622 0.0000
PRIMARY05 0.022330 0.004948 4.512808 0.0000
PRIMARY04 0.066253 0.004644 14.26506 0.0000
PRIMARY03 -0.018960 0.005473 -3.464221 0.0005
PRIMARY02 -0.036906 0.005503 -6.705928 0.0000
PRIMARY01 -0.032836 0.005603 -5.860115 0.0000
PRIMARY00 0.001494 0.005162 0.289372 0.7723

R-squared 0.422976 Mean dependent var 0.656163
Adjusted R-squared 0.422810 S.D. dependent var 0.474993
S.E. of regression 0.360866 Akaike info criterion 0.799692
Sum squared resid 5870.913 Schwarz criterion 0.802398
Log likelihood -18017.86 Hannan-Quinn criter. 0.800544
F-statistic 2542.093 Durbin-Watson stat 1.961476
Prob(F-statistic) 0.000000

5. smpl if strattelemetry_score>0 and sex=0
ls us_house_of_represen @expand(age)

Dependent Variable: US_HOUSE_OF_REPRESEN
Method: Least Squares
Sample: 1 50000 IF STRATTELEMETRY_SCORE>0 AND SEX=0
Included observations: 3865

Variable Coefficient Std. Error t-Statistic Prob.
AGE=19 3.000000 1.609292 1.864174 0.0624
AGE=20 3.000000 0.608255 4.932141 0.0000
AGE=21 3.181818 0.280142 11.35789 0.0000
AGE=22 3.666667 0.280142 13.08861 0.0000
AGE=23 3.060606 0.280142 10.92521 0.0000
AGE=24 3.000000 0.369197 8.125747 0.0000
AGE=25 3.592593 0.309708 11.59992 0.0000
AGE=26 3.800000 0.321858 11.80644 0.0000
AGE=27 3.782609 0.335561 11.27251 0.0000
AGE=28 3.272727 0.343102 9.538638 0.0000
AGE=29 3.533333 0.293815 12.02570 0.0000
AGE=30 3.230769 0.315608 10.23665 0.0000
AGE=31 2.925926 0.309708 9.447359 0.0000
AGE=32 3.500000 0.284485 12.30292 0.0000
AGE=33 3.450000 0.254451 13.55858 0.0000
AGE=34 2.875000 0.232281 12.37724 0.0000
AGE=35 3.313725 0.225346 14.70506 0.0000
AGE=36 3.254545 0.216997 14.99812 0.0000
AGE=37 3.028986 0.193736 15.63461 0.0000
AGE=38 3.250000 0.179924 18.06315 0.0000
AGE=39 3.461538 0.182216 18.99685 0.0000
AGE=40 3.465753 0.188353 18.40028 0.0000
AGE=41 3.202247 0.170585 18.77220 0.0000
AGE=42 3.393939 0.198090 17.13331 0.0000
AGE=43 3.584270 0.170585 21.01169 0.0000
AGE=44 3.224490 0.162563 19.83532 0.0000
AGE=45 3.524272 0.158568 22.22559 0.0000
AGE=46 3.377358 0.156308 21.60704 0.0000
AGE=47 3.301887 0.156308 21.12420 0.0000
AGE=48 3.291262 0.158568 20.75613 0.0000
AGE=49 3.240741 0.154854 20.92770 0.0000
AGE=50 3.385965 0.150724 22.46467 0.0000
AGE=51 3.591304 0.150067 23.93130 0.0000
AGE=52 3.261682 0.155576 20.96519 0.0000
AGE=53 3.344828 0.172534 19.38646 0.0000
AGE=54 3.471698 0.156308 22.21059 0.0000
AGE=55 3.741573 0.170585 21.93383 0.0000
AGE=56 3.543478 0.167780 21.11976 0.0000
AGE=57 3.386364 0.171551 19.73968 0.0000
AGE=58 3.705882 0.174552 21.23080 0.0000
AGE=59 3.826667 0.185825 20.59285 0.0000
AGE=60 3.694737 0.165110 22.37745 0.0000
AGE=61 3.333333 0.175588 18.98382 0.0000
AGE=62 3.371429 0.192347 17.52783 0.0000
AGE=63 3.774194 0.204380 18.46653 0.0000
AGE=64 3.794521 0.188353 20.14576 0.0000
AGE=65 3.615385 0.199608 18.11242 0.0000
AGE=66 4.050000 0.254451 15.91660 0.0000
AGE=67 3.500000 0.254451 13.75508 0.0000
AGE=68 3.765957 0.234739 16.04315 0.0000
AGE=69 3.545455 0.242610 14.61381 0.0000
AGE=70 3.800000 0.254451 14.93409 0.0000
AGE=71 3.163265 0.229899 13.75938 0.0000
AGE=72 4.000000 0.261062 15.32206 0.0000
AGE=73 3.418605 0.245415 13.92991 0.0000
AGE=74 3.400000 0.293815 11.57190 0.0000
AGE=75 3.731707 0.251329 14.84789 0.0000
AGE=76 4.100000 0.254451 16.11310 0.0000
AGE=77 3.250000 0.328495 9.893597 0.0000
AGE=78 3.250000 0.284485 11.42414 0.0000
AGE=79 3.608696 0.335561 10.75423 0.0000
AGE=80 3.842105 0.369197 10.40666 0.0000
AGE=81 3.516129 0.289037 12.16497 0.0000
AGE=82 4.230769 0.315608 13.40514 0.0000
AGE=83 4.130435 0.335561 12.30906 0.0000
AGE=84 3.750000 0.328495 11.41569 0.0000
AGE=85 3.666667 0.328495 11.16201 0.0000
AGE=86 4.066667 0.415517 9.786997 0.0000
AGE=87 3.400000 0.359849 9.448419 0.0000
AGE=88 4.750000 0.568971 8.348411 0.0000
AGE=89 3.666667 0.656991 5.581003 0.0000
AGE=90 4.000000 0.568971 7.030241 0.0000
AGE=91 4.600000 0.719697 6.391578 0.0000
AGE=92 3.666667 0.656991 5.581003 0.0000
AGE=93 3.400000 0.719697 4.724210 0.0000
AGE=94 4.000000 1.137941 3.515121 0.0004
AGE=95 3.000000 1.609292 1.864174 0.0624
AGE=96 1.000000 1.609292 0.621391 0.5344
AGE=97 5.000000 1.609292 3.106957 0.0019
AGE=98 5.000000 1.609292 3.106957 0.0019
AGE=99 5.000000 1.609292 3.106957 0.0019

R-squared 0.027811 Mean dependent var 3.477102
Adjusted R-squared 0.007258 S.D. dependent var 1.615164
S.E. of regression 1.609292 Akaike info criterion 3.810200
Sum squared resid 9799.878 Schwarz criterion 3.941387
Log likelihood -7282.211 Hannan-Quinn criter. 3.856783
Durbin-Watson stat 1.903265


"Grant Hadwin got a chainsaw and did something terrible"

So begins a fascinating story of idealism and uncompromising zeal, set on the bank of the Yakoun River in British Columbia. This is a story which leaves you wondering about the meaning of heroism, and of folly.

I'm not sure if the protagonist is Grant Hadwin or the singular golden spruce.

I recently carved a kayak paddle out of Sitka Spruce; I have a different feel for this tree now that I've felt how tough spruce fibers can be. I highly recommend this story and, thanks to the internet, it's available for preview on Google Books.


Real time home energy monitoring

Imagine a gas pump that didn't tell you the price of the gas you'd just pumped. In this country, I imagine that would cause a near-riot.

Wait, that sounds like my electricity meter! After an $800 utility bill, I became intimately familiar with that meter.

So, in 2005 (and in shock), I put together a simple database to track my own electricity usage. The variation in my daily and hourly usage was enormous. Even modest changes in weather had a big impact. My house used electric radiant heat, and we must have been very popular down at Seattle City Light.

Having access to real-time feedback enabled me to reduce my energy use noticeably. Sometimes, information can be as powerful a tool as extra weather stripping. Commercial versions of real-time energy monitors are available, though fairly expensive at the retail level. I have not seen any examples of utilities promoting or subsidizing energy information systems. I feel this could be a wise and cost-effective option for utilities and governmental agencies to pursue.

Here's a graph of my daily electricity usage, put into terms we can all understand: $$$$$$$


23 November 2008

Abbreviated resume

Technical Skills

  • SQL, MySQL, MS Access, Excel, HTML, CSS, ASP, Visual Basic, Photoshop, JavaScript

Website design
Summary: I enjoy databases. I am highly skilled in database architecture, including creation, modification and restoration. I am well versed in ETL procedures utilizing various data formats, as well as setup, optimization and maintenance of indexes. I enjoy solid, meticulous, and optimized query writing. My university studies included: statistics, econometrics, advanced calculus, and game theory.

Education
Bachelor of Science, Economics, University of Washington

Employment
Alaska Salmon Program, University of WA - Research Consultant
02/2007 – present
I currently work for a grant-funded university research department with many data contributors. My position has three primary components. First, I created a database to capture long term environmental, biological, and fisheries data. Second, I provide ongoing technical assistance to several teams of field researchers and data support for scientific papers published by the program. This has included onsite fieldwork in Alaska. Finally, I hire, train, and supervise a team of data entry personnel.

Washington State Democratic Central Committee - Data Analyst
08/2008 – present
Based on my successful work in 2006, I was contacted to work on a contract basis for the 2008 election cycle. I provide data analysis for the WSDCC in support of local, state, and national electoral contests.

Washington State Democratic Coordinated Campaign - Database Manager
07/2006 – 12/2006
I managed the Washington State Democratic Party voter-file database and provided technical support for Senatorial and Congressional campaigns. Key objectives for this position were ensuring data integrity and providing statistical analysis. I optimized the use of existing data using statistical analysis and I was able to significantly reduce statewide campaign spending by eliminating errant or redundant data and by improving data capture methods. I independently researched and directed successful micro-targeting analysis using demographic modeling. In this role, I provided management oversight of software vendors, support staff, and volunteers.

Western Wireless Inc. - Database Reporter
08/2005 – 05/2006
In this contract position, I developed sales and marketing reports querying various Oracle and SQL databases. Reports were primarily pivot-table report/chart presentation in Excel. I also reverse engineered and debugged pre-existing, undocumented work by prior programmers.

Planet Organics - Database and Website Developer
04/2001 – 08/2005
For this small but rapidly expanding business, I served as the jack-of-all-trades for all computer needs, from hardware installation to sophisticated analytics. I managed the design and development of databases in SQL, Great Plains, and MS Access. I also maintained data integrity between these databases and established back-up procedures. I oversaw the website and customer technical assistance. I coded, tested, debugged, and documented new applications. I also provided technical training to all staff. I administered security measures to restrict unauthorized use of data systems and databases, managed email lists, and supervised the quality of the data entry process. Finally, I tracked website metrics and used analysis of those metrics to improve customer acquisition and retention, as well as to promote increased revenue. I consistently achieved #1 natural rankings on all major search engines for this business.
