Aug. 19, 2008 at 5:00pmFinding the Needle in your MySQL Haystack

Fun with stored procedures pt. I

(First in what will probably be a series of blogs as we move all our projects to a replicated, MySQL 5.0 environment, and I finally get to start playing with all the useful features that come with it)

Say you've got a client, Mystery Client A. Mystery Client A has hired a marketing consultant. As a part of their rebranding efforts, they have decided to refactor their company spelling convention to MysteryClientA!, for whatever reason. That's fine for replacing a few logos, but MCA has a database-driven content management system, and their name is riddled throughout the database in page content, event description, news headlines and so forth. Your job is now to sift through the entire system and apply the newly crafted spelling to the entire database. So what do you do?

Solution 1 - Crawl the Site

Browse the site (or delegate an intern to browse the site) from front to back and left to right, looking for any page with the offending text. Fine if you've got a small site or someone with a lot of extra time on their hands, but becomes very quickly impractical.

Solution 2 - Search the content

Slightly more efficient is to go to each table that serves public content and search on the appropriate fields. Eliminates the hassle of dealing with large numbers of rows, but if your site is distributed across too many tables, or if you're stuck in a system that you don't know well enough, it is still tedious work.

Solution 3 - Stored Procedure

Run a procedure that checks every string-type field in the database, regardless of table structure, for matching content.

One line of code (after development, of course) to show you exactly what needs changing? Sounds like a winner. Before I get into the details, here is the final procedure code (first draft), from which I will be snippeting later.

(Disclaimer: I am still very much in the learning stage of MySQL stored routine programming, so I am quite open to any efficiency, speed, or just general intelligence improvements anyone wants throw at me. I will post improved iterations as I write them.)

The goal of this version is to return a result set in which each row contains enough information to find one relevant field: this breaks down to the table name, column name, table primary key fields and the values for those fields (remember, we want this to be structure-independent, so we can't assume an id number, or even a single-field key).

The problem can be broken down into two steps: find the columns and search the columns. But first, the stub:

DROP PROCEDURE IF EXISTS `fieldSearch`;
DELIMITER $$
CREATE PROCEDURE `fieldSearch`(IN search VARCHAR(255))
    READS SQL DATA
BEGIN
    -- utility variable for the exiting of loops
    DECLARE done TINYINT DEFAULT 0;

    -- to hold the table and column name of each relevant row
    DECLARE tab, col VARCHAR(64);

    -- to hold the list of columns making up the primary key of the table in tab
    DECLARE keyColumnList TEXT;

    ... the processing code goes here ...
END $$


Step 1: Find the columns

As long as we're in MySQL 5 mode, we've got more in our corner than just stored procedures. The metadata (I like saying metadata) in information_schema has just what we need to grab our list of string columns, using the following query:
SELECT c.table_name, c.column_name,
     GROUP_CONCAT(kcu.column_name ORDER BY kcu.ordinal_position) AS key_columns
FROM information_schema.columns c
    INNER JOIN information_schema.key_column_usage kcu
        ON kcu.table_schema = c.table_schema AND kcu.table_name = c.table_name
WHERE c.table_schema = 'ccf'
    AND c.data_type IN('varchar', 'char', 'text')
    AND kcu.constraint_name = 'PRIMARY'
GROUP BY c.table_name, c.column_name;
Grabbing table_name and column_name from information_schema.columns gets us our core data, filtered against the three main string data types: varchar, char and text. Joining to information_schema.key_column_usage and filtering against the key name 'PRIMARY' retrieves the key information needed to grab a particular unique row out of the given table. Since we only want one row per relevant column, and since we'll have a use for a comma-separated list of the information later, we use the GROUP_CONCAT aggregate function to get the primary key info all together, grouped by table_name and column_name.

Step 2: Search the columns

Before we can iterate through the column results, we need to prepare for output. The final result will make use of the following temporary table:
CREATE TEMPORARY TABLE IF NOT EXISTS MatchedColumns (
    matchTable VARCHAR(64),
    matchColumn VARCHAR(64),
    keyColumns TEXT,
    keyValues TEXT
);
(If you are going to be running this procedure more than once on a given connection, you'll need to make sure MatchedColumns is cleared out, either by dropping the table and then recreating it or by truncating or deleting the contents)

Now we iterate through the results with a cursor, populating the tab, col and keyColumnList variables, which can then be used to generate the dynamic INSERT statement which will, in turn, populate the MatchedColumns table. Since there is not currently support for dynamic table and column population in a query, we resort to a prepared statement, built out of a concatenated string.
SET @insertQuery =  CONCAT("
    INSERT INTO MatchedColumns
    SELECT '", tab, "' AS matchTable,
        '", col, "' AS matchColumn,
        '", keyColumnList, "' AS keyFields,
        CONCAT_WS(',', ", keyColumnList, ") AS keyValues
    FROM ", tab, "
    WHERE ", col, " LIKE '%", search, "%'");

-- execute query
PREPARE insertStmt FROM @insertQuery;
EXECUTE insertStmt;

This is where that comma-delimited listing of primary key columns comes in handy. If the initial query produces the following result:
+--------------------+-----------------+-------------------------+
| table_name         | column_name     | key_columns             |
+--------------------+-----------------+-------------------------+
| item_categories    | description     | item_id,category_id     |
+--------------------+-----------------+-------------------------+

the constructed query will be as follows:
INSERT INTO MatchedColumns
SELECT 'item_categories' AS matchTable,
     'description' AS matchColumn,
    'item_id,category_id' AS keyFields,
    CONCAT_WS(',', item_id,category_id) AS keyValues
FROM item_categories
WHERE description LIKE '%search_term%'
By virtue of already being comma-separated, the primary key field listing fits nicely into the CONCAT_WS function, providing a similarly comma-separated listing of the equivalent value set. Once the result set is exhausted, all that remains is to push the contents of MatchedColumns to the caller, using:
SELECT * FROM MatchedColumns;
This will produce something like:
+--------------------+-----------------+-------------------------+-------------+
| matchTable         | matchColumn     | keyColumns              | keyValues   |
+--------------------+-----------------+-------------------------+-------------+
| item_categories    | description     | item_id,category_id     | 238,123     |
+--------------------+-----------------+-------------------------+-------------+
| item_categories    | description     | item_id,category_id     | 114,22      |
+--------------------+-----------------+-------------------------+-------------+
| item_categories    | description     | item_id,category_id     | 7,93        |
+--------------------+-----------------+-------------------------+-------------+


Step 3: Handling the results


I know, I said two steps. And there were two steps in the procedure itself. What it gives you is all the information you need to find any given matching content block. It is not particularly useful like this, however. From there, it all depends on what you want to do. For a pure find and replace, as in the rebranding example, a lot can be stripped out. You can jump straight from finding the string columns to an update using REPLACE, which with these results will look like:
UPDATE item_categories
SET description = REPLACE(description, 'Mystery Client A', 'MysteryClientA!')
Finding each row individually becomes necessary when a more hands-on, case-by-case treatment is needed. The information in the table returned can be manipulated into a select statement one at a time on the outside, or the select statements can be generated instead of the table, depending on your needs.

More information:

As linked to earlier, the official documentation on stored procedures can be found here. I have been using the excellently written O'Reilly book MySQL Stored Procedure Programming, by Guy Harrison and Stephen Feuerstein to further my own knowledge, though it may be apparent to a stored procedure expert that I'm only a few chapters in so far.

Add your comment below

Leave a Comment

Remember me

Name:

Email:

URL:

Comment: * No HTML, http:// will auto-link
* required
Comment Guidelines