How to Replace Ambiguous Values
Recently I had a friend tell me they were working on a project where they had to hand scrub the data in their database. In particular that day they were manually scrubbing values like 'Not Applicable', 'NA' and 'N/A' which all meant the same thing. But in order for the data to be consistent they need to go in by hand and change all these values to be the same. Now I want to say they were not working on SQL Server, in fact they were not even getting to work in a relational database at all, it was some obscure Linux based database with a very expensive support contract.
I told them they could use one of several methods in my world to remedy their data real quick. The first is a simple SQL Update statement, which in all fairness is all they really needed, if they were working in a real database. the next option I had was to use a Regular Expression to replace the values and restore them. This option could have been viable for them because to actually work with the data they had to export to a delimited file anyway (don't even ask).
So to use a SQL Update statement is it pretty easy, update the value of the field to the desired value where it equals the other possible values. It is a good idea first to see what possible values are available by running a select distinct on the column, sorting alphabetically.
select distinct Contact_Status from Customer
distinct Contact_Status from Customer order by Contact_Status ASC
by Contact_Status ASC This way you know you will have a full list of possible values that need to be updated. Now we can create and execute the replacing Update statement.
Update Customer
Customer Set Contact_Status = 'N/A'
Contact_Status = 'N/A' Where Contact_Status = 'Not Applicable' or Contact_Status = 'NA'
Contact_Status = 'Not Applicable' or Contact_Status = 'NA' This will do the trick for a standard SQL database or Access. But what if we have a delimited file, how do you do the same thing. Well it is pretty easy to run through the rows of the file, examining a field and then updating that field.
Public Sub UpdateFields(ByVal sFile As String, ByVal sTargetValue As String, ByVal sBadValues() As String)
If sBadValues.Length = 0 Then
Return
End If
Dim i As Integer = 0
Dim sRegEx As String = String.Empty
For i = 0 To sBadValues.Length - 1
If i > 0 Then
sRegEx = sRegEx & String.Format("|\b{0}\b", sBadValues(i))
Else
sRegEx = sRegEx & String.Format("\b{0}\b", sBadValues(i))
End If
Next
Dim _Regex As New Regex(sRegEx)
Dim line As String
Dim newLine As String
If File.Exists(sFile & ".temp") Then
File.Delete(sFile & ".temp")
End If
Dim sr As StreamReader = File.OpenText(sFile)
line = sr.ReadLine
While Not line Is Nothing
newLine = _Regex.Replace(line, sTargetValue)
My.Computer.FileSystem.WriteAllText(sFile & ".temp", "Text", True)
line = sr.ReadLine
End While
sr.Close()
If File.Exists(sFile & ".temp") Then
File.Delete(sFile)
File.Move(sFile & ".temp", sFile)
End If
End Sub
This method first checks to make sure we have at least one replacement value. It would be good to also validate the remaining parameters too. The next step is to build the RegularExpression to be used to find the value to replace. In this case it will be an expression looking for the words passed in the sBadValues array. The '\b' is the RegularExpression escape for a word break.
Next declare an instance of the RegEx object using the RegularExpression we created. Then I check to see if there is a temporary file to hold our new values. Then I open the file with the questionable data and loop through it, running each line through the RegEx object. The Replace method will replace any matching values with the desired or sTargetValue.
After looping through the file it is closed. Then I simply move the temporary file over the initial file.
Public Sub ReplaceNAValues()
Dim sBadValues() As String = {"Not Applicable", "NA"}
UpdateFields("c:\BadData.tsv", "N\A", sBadValues)
End Sub
This is an example method that calls our standard replace method. In this example I show how to setup the variables used the base method to replace the undesired values.