Joseph Edmonds

MySQL String Cleanup Function

by Joseph Edmonds

by Joseph Edmonds

It’s fairly common to have scenarios where you are trying to push text data into a brittle system that can not handle text data beyond very simple ASCII.

You can try all kinds of approaches, but one fairly simple one is to create a function such as the one below and then you can easily reuse that function across various queries and columns. This is a nice application of DRY to MySQL.

CREATE DEFINER=`YOUR_DB_USER`@`%` FUNCTION `YOUR_DB_NAME`.`cleanString`( str longtext ) RETURNS longtext CHARSET latin1
    DETERMINISTIC
BEGIN 
DECLARE startTag, endTag INT DEFAULT 1;
    DECLARE i, len SMALLINT DEFAULT 1;
    DECLARE ret longtext DEFAULT '';
    DECLARE c text;
    IF str IS NULL THEN
        RETURN ret;
    END IF;
    IF str = '' THEN
        RETURN ret;
    END IF;
    -- strip HTML tags
    stripTags:
    LOOP
        SET startTag = LOCATE("<", str, startTag);
        IF (!startTag) THEN LEAVE stripTags; END IF;
        SET endTag = LOCATE(">", str, startTag);
        IF (!endTag) THEN SET endTag = startTag; END IF;
        SET str = INSERT(str, startTag, endTag - startTag + 1, "");
    END LOOP;
    SET len = CHAR_LENGTH(str);
    -- now remove any non ascii characaters
    REPEAT
        BEGIN
            SET c = MID(str, i, 1);
            IF c REGEXP '[ -~]' THEN
                SET ret = CONCAT(ret, c);
            END IF;
            SET i = i + 1;
        END;
    UNTIL i > len END REPEAT;
    RETURN ret;
END

The function will take basically any text you can throw at it thanks to using the longtext type. It will use a little bit of XML magic to remove any HTML tags and then it will break the string into characters and loop over them, skipping out any that don’t match characters between space and tilde. This removes line breaks and all kinds of special characters.

Note this is a hugely destructive approach and not something you should do if you are pushing the data to somewhere valuable and important where the data must be very accurate. It is a good quick fix though where you just need to push some text in and can accept a little bit of mangling as long as it just goes in there without any more fuss.

Facebook
Twitter
LinkedIn
About Joseph Edmonds

About Joseph Edmonds

Personal musings, tech tips, trivia and other things that might be interesting.

Get in touch today

Send me a message concerning your idea, problem, or question – and I’ll personally respond as soon as possible.

Connect with me on LinkedIn.