Welcome to Professional ASP.NET - Chris Love's Official Blog Sign in | Join | Help

Chris Love's Official ASP.NET Blog

Chris Love's Helpful tips, tricks and pragmatic development knowledge for the ASP.NET world.
Add to Technorati Favorites


ASP Insider
How to Replace Ambiguous Values

Recently I had a friend tell me they were working on a project where they had to hand scrub the data in their database. In particular that day they were manually scrubbing values like 'Not Applicable', 'NA' and 'N/A' which all meant the same thing. But in order for the data to be consistent they need to go in by hand and change all these values to be the same. Now I want to say they were not working on SQL Server, in fact they were not even getting to work in a relational database at all, it was some obscure Linux based database with a very expensive support contract.

I told them they could use one of several methods in my world to remedy their data real quick. The first is a simple SQL Update statement, which in all fairness is all they really needed, if they were working in a real database. the next option I had was to use a Regular Expression to replace the values and restore them. This option could have been viable for them because to actually work with the data they had to export to a delimited file anyway (don't even ask).

So to use a SQL Update statement is it pretty easy, update the value of the field to the desired value where it equals the other possible values. It is a good idea first to see what possible values are available by running a select distinct on the column, sorting alphabetically.

select distinct Contact_Status from Customer

distinct Contact_Status from Customer

order by Contact_Status ASC

by Contact_Status ASC

This way you know you will have a full list of possible values that need to be updated. Now we can create and execute the replacing Update statement.

Update Customer

Customer

Set Contact_Status = 'N/A'

Contact_Status = 'N/A'

Where Contact_Status = 'Not Applicable' or Contact_Status = 'NA'

Contact_Status = 'Not Applicable' or Contact_Status = 'NA'

This will do the trick for a standard SQL database or Access. But what if we have a delimited file, how do you do the same thing. Well it is pretty easy to run through the rows of the file, examining a field and then updating that field.

Public Sub UpdateFields(ByVal sFile As String,    ByVal sTargetValue As String, ByVal sBadValues() As String)

 

        If sBadValues.Length = 0 Then

            Return

        End If

 

        Dim i As Integer = 0

        Dim sRegEx As String = String.Empty

 

        For i = 0 To sBadValues.Length - 1

            If i > 0 Then

                sRegEx = sRegEx & String.Format("|\b{0}\b", sBadValues(i))

            Else

                sRegEx = sRegEx & String.Format("\b{0}\b", sBadValues(i))

            End If

        Next

 

        Dim _Regex As New Regex(sRegEx)

 

        Dim line As String

        Dim newLine As String

 

        If File.Exists(sFile & ".temp") Then

            File.Delete(sFile & ".temp")

        End If

 

        Dim sr As StreamReader = File.OpenText(sFile)

        line = sr.ReadLine

        While Not line Is Nothing

            newLine = _Regex.Replace(line, sTargetValue)

            My.Computer.FileSystem.WriteAllText(sFile & ".temp", "Text", True)

            line = sr.ReadLine

        End While

        sr.Close()

 

        If File.Exists(sFile & ".temp") Then

            File.Delete(sFile)

            File.Move(sFile & ".temp", sFile)

        End If

 

    End Sub

This method first checks to make sure we have at least one replacement value. It would be good to also validate the remaining parameters too. The next step is to build the RegularExpression to be used to find the value to replace. In this case it will be an expression looking for the words passed in the sBadValues array. The '\b' is the RegularExpression escape for a word break.

Next declare an instance of the RegEx object using the RegularExpression we created. Then I check to see if there is a temporary file to hold our new values. Then I open the file with the questionable data and loop through it, running each line through the RegEx object. The Replace method will replace any matching values with the desired or sTargetValue.

After looping through the file it is closed. Then I simply move the temporary file over the initial file.

Public Sub ReplaceNAValues()

Dim sBadValues() As String = {"Not Applicable", "NA"}

UpdateFields("c:\BadData.tsv", "N\A", sBadValues)

End Sub

This is an example method that calls our standard replace method. In this example I show how to setup the variables used the base method to replace the undesired values.

Posted: Saturday, September 01, 2007 12:53 PM

by Chris Love

Comments

No Comments

Leave a Comment

(required) 

(required) 

(optional)

(required) 

Comment Notification

If you would like to receive an email when updates are made to this post, please register here

Subscribe to this post's comments using RSS