Errors in State Data

My map engine compares many statistics about registration and voter turnout in three different general elections: 2014, 2016 and 2018.
  1. I obtained the voter registration information from the bookclosing reports published here: https://dos.myflorida.com/elections/data-statistics/voter-registration-statistics/bookclosing/
  2. I am calculating the turnout statistics from the December Voter Extract DVD published by the DOE. You can request a copy here, and I’m happy to share mine: https://dos.myflorida.com/elections/data-statistics/voter-registration-statistics/voter-extract-disk-request/
  3. I am doing statistics calculations from the voter history records. The state publishes overall turnout by county, but that doesn’t have the detailed breakdown I need for the map. The county totals are here: https://results.elections.myflorida.com/Index.asp?ElectionDate=11/8/2016&DATAMODE=
There is one voter history record per ballot cast in each election, and those records contain the Voter ID and County Code. This makes it very easy (no table joins or calculation required) to count the number of votes cast per county. For the other statistics I use the Voter ID to join against the Voter Registration and work out the votes cast by party registration, precinct, etc. I thought I might have SQL errors in those joins, so I dropped back and just looked at the county totals.
 
And, those totals didn’t match between source #2 and #3 above.
 
So then I got really suspicious and I compared the November DVD against the December DVD. Using a very simple script, I counted all the county voter history rows for every general election back to 2006.
 
My assumption is that:
  • The numbers between the November DVD and the December DVD should match. They do not match.
  • The numbers between the two DVDs and the website report (#3 above) should match. They do not match.
The differences are significant and make my map completely wrong.
Here is a link to my scripts, the raw data and a spreadsheet showing the data problems.

Because of the high interest in the 2018 General Election maps, I am documenting my process steps here for open review. Click to expand the sections below.

Data Import to SQL Server

I received a DVD in the mail which contains two ZIP files: Voter_History_20181211.zip and Voter_Registration_20181211.zip. DVD also contains the 10-2018 Voter Extract Disk File Layout PDF. Unzipped into two folders to reveal 67 county files per folder.

Using PowerShell, concatenate the respective text files into a single tab-delimited file for import.

PS D:\DataImport> Get-Content ???_20181211.txt | Set-Content ALL_DETAIL.tab
PS D:\DataImport> Get-Content ???_H_20181211.txt | Set-Content ALL_HISTORY.tab

Verify the line counts match.

Voter Registration

PS D:\DataImport> Get-Content ???_20181211.txt | Measure-Object -Line

Lines Words Characters Property
----- ----- ---------- --------
14188600

PS D:\DataImport> Get-Content ALL_DETAIL.tab | Measure-Object -Line

Lines Words Characters Property
----- ----- ---------- --------
14188600

Voter Detail

PS D:\DataImport> Get-Content ???_H_20181211.txt | Measure-Object -Line

Lines Words Characters Property
----- ----- ---------- --------
81528453

PS D:\DataImport> Get-Content ALL_HISTORY.tab | Measure-Object -Line

Lines Words Characters Property
----- ----- ---------- --------
81528453

Create the SQL tables.

Date fields are imported as varchar at this stage, will be converted to true date fields after import.


CREATE TABLE [dbo].[tblVoterRegistrations](
[CountyCode] [varchar](3) NULL,
[VoterID] [varchar](10) NULL,
[NameLast] [varchar](30) NULL,
[NameSuffix] [varchar](5) NULL,
[NameFirst] [varchar](30) NULL,
[NameMiddle] [varchar](30) NULL,
[Exemption] [char](1) NULL,
[Address1] [varchar](50) NULL,
[Address2] [varchar](40) NULL,
[City] [varchar](40) NULL,
[State] [varchar](2) NULL,
[Zip] [varchar](10) NULL,
[MailingAddress1] [varchar](40) NULL,
[MailingAddress2] [varchar](40) NULL,
[MailingAddress3] [varchar](40) NULL,
[MailingCity] [varchar](40) NULL,
[MailingState] [varchar](2) NULL,
[MailingZip] [varchar](20) NULL,
[MailingCountry] [varchar](40) NULL,
[GenderCode] [char](1) NULL,
[RaceCode] [char](1) NULL,
[BirthDate] [varchar](max) NULL,
[RegistrationDate] [varchar](max) NULL,
[PartyCode] [varchar](3) NULL,
[Precinct] [varchar](6) NULL,
[PrecinctGroup] [varchar](3) NULL,
[PrecinctSplit] [varchar](6) NULL,
[PrecinctSuffix] [varchar](3) NULL,
[VoterStatus] [varchar](3) NULL,
[CongressionalDistrict] [varchar](3) NULL,
[HouseDistrict] [varchar](3) NULL,
[SenateDistrict] [varchar](3) NULL,
[CountyCommissionDistrict] [varchar](3) NULL,
[SchoolBoardDistrict] [varchar](2) NULL,
[AreaCode] [varchar](3) NULL,
[PhoneNumber] [varchar](7) NULL,
[PhoneExtension] [varchar](4) NULL,
[EmailAddress] [varchar](100) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

CREATE TABLE [dbo].[tblVoterHistories](
[CountyCode] [varchar](3) NULL,
[VoterID] [varchar](10) NULL,
[ElectionDate] [varchar](max) NULL,
[ElectionType] [varchar](3) NULL,
[HistoryCode] [char](1) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Import the tab-delimited files into SQL server.

Here are the Integration Services packages I used for the import.

Registration.dtsx

History.dtsx

The number of rows imported match the line counts from the text files. Results shown below.

The imported data contain leading and trailing whitespace. I used this SQL stored procedure to trim all the data prior to analysis.


USE [Elections]
GO
/****** 
  Object: StoredProcedure [dbo].[usp_trim_all_string_columns]
  Script Date: 12/19/2018 11:51:32 AM 
******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
--DROP PROCEDURE usp_trim_all_string_columns
--GO
ALTER PROCEDURE [dbo].[usp_trim_all_string_columns]
@schema_Table_name VARCHAR(MAX)
AS
DECLARE @SQL AS VARCHAR(MAX)

SET @SQL=STUFF((SELECT ', ' + QUOTENAME([name])
+ ' = LTRIM(RTRIM(' + QUOTENAME([name]) + '))' FROM
sys.columns WHERE object_id=object_id(@schema_Table_name)
AND system_type_id IN(35,99,167,175,231,29)
FOR XML PATH('')),1,1,'')
PRINT @SQL

SET @SQL = 'UPDATE ' + @schema_Table_name + ' SET' + @SQL
PRINT @SQL

EXEC(@SQL)

Return value zero indicates success.

Next, I converted the date fields (which were imported as varchar) to Date data type in the SQL database & added appropriate indexes to the two tables to improve query speed.

This concluded the data load.

Data Validation

To validate the data load into SQL is accurate, I wrote a more extensive PowerShell script to count the lines per year and per election and compared those against SQL. Here is my PowerShell code:

# Usage: getLineCount [year] [optional ElectionType]
# Get the year from command line.
$year=$args[0]

# If specified, get the election type from the second command-line parameter
$electionType=$args[1]

# Define $tab as the tab character.
$tab = [char]9

# Set the search string. In this case, searching for a slash, followed by the year,
# then a tab, then "GEN" to find the general election for that year.
$searchString = "/$($year)$($tab)$($electionType)"

# Get all history text files in the 2018-12 folder
$files = Get-ChildItem '.\2018-12\20181211_VoterHistory\???_H_20181211.txt'

# set up the report filename.
$outfile = "$($year)_$($electionType)_Election_Report.txt"

# Initialize the output file and write the header
Set-Content $outfile " $($electionType) Votes in $year by County"
Add-Content $outfile "----------------------------------------"
Add-Content $outfile "- Searching for $searchString"
Add-Content $outfile "----------------------------------------"
Add-Content $outfile " "

# For each file...
foreach ($file in $files) {
# Get the file contents, Fild all the matching lines and count them.
$data = Get-Content $file | Select-String -Pattern $searchString | Measure-Object -Line
# write a line to the report file with the county code (derived from the filename)
# and the number of matching lines.
Add-Content $outfile "$($file.Name.Substring(0,3)) : $($data.Lines)"
}

The corresponding SQL code is below. If the data import is accurate, both sets of outputs should match exactly.

-- // Get all votes by county in 2014 for all elections.
-- // To get other years, change the year in line 9.
-- // To get only the General election, uncomment line 10.
SELECT
H.CountyCode, COUNT(H.CountyCode)
FROM
dbo.tblVoterHistories AS H
WHERE
(H.ElectionDate BETWEEN CONVERT(date, '2014.01.01', 102) AND CONVERT(date, '2014.12.31', 102))
-- AND (H.ElectionType = 'GEN')
GROUP BY H.CountyCode
ORDER BY H.CountyCode

Both sets of data do match perfectly. Here are the output files. I’ve attached a Data_Audit zip file containing the PowerShell script, the raw output files, and an Excel sheet comparing the SQL results.