Monday 7 July 2014

Function dbo.StripUnwantedCharacters - Cleaning up unwanted ASCII characters

Having determined I had some unwanted NUL characters in my data (see ASCII Value Character Counts Procedure) I set out to remove them.

I tried permutations of LIKE and CHAR to find the character as well as REPLACE and CHARINDEX , all to no avail. Then I spotted this -

 0x0000 (char(0)) is an undefined character in Windows collations and cannot be included in CHARINDEX.

This was on the MSDN page for CHARINDEX

You can read the whole string character by character however and recreate it, removing those characters. This is what my function below does.


CREATE FUNCTION dbo.StripUnwantedCharacters (@InputString VARCHAR(MAX)) 
RETURNS VARCHAR(MAX)
AS 
BEGIN

DECLARE @OutputString varchar(MAX)
DECLARE @CurrentCharacterValue INT
DECLARE @Position INT

SET @OutputString = ''
SET @Position = 1

WHILE @Position < LEN(@InputString)+1
 BEGIN
 SET @CurrentCharacterValue = ASCII(SUBSTRING(@InputString, @Position, 1))

 IF @CurrentCharacterValue >= 9 -- Exclude Characters where ASCII value less that 9
  BEGIN
  SET @OutputString = @OutputString + CHAR(@CurrentCharacterValue)
  END

 SET @Position = @Position + 1
 END
RETURN @OutputString

END
GO

Code to use this procedure is as simple as ...

UPDATE tablename
SET cleandata = dbo.StripUnwantedCharacters(dirtydata)

Or too generate sql for all columns ...

SELECT 
'UPDATE [' + TABLE_SCHEMA + '].[' + TABLE_NAME + '] SET ' + COLUMN_NAME + ' = RTRIM(dbo.StripUnwantedCharacters(' + COLUMN_NAME + '))' 
 from INFORMATION_SCHEMA.COLUMNS
 WHERE DATA_TYPE IN ('char','varchar')
GO

No comments: