OK so it’s live – and achieved using no more than some C#/Java and SQL database skills! It helps me answer the question: Can I get to a particular product faster than via our standard web offering (of departments, aisles and shelves, or text searching)? Does it for you?
Go play at http://www.techfortesco.com/tescoinstant !!
The rest of this article is about the technical design and implementation, of interest to anyone wishing to look at similar solutions.
Step 1 – Design!
I created a database with the following tables and developed code that used the API to extract the products for a particular store (Greenford Dotcom Store which serves all homes in north London) and perform ‘lexicon analysis’ to make the data available efficiently.
My theory was that, if I broke all products down into their constituent words (called the ‘Lexicon’), and indexed those words efficiently using all possible key combinations up to 5 characters, then I could make product searching really fast!
Pencil, paper and SQL Enterprise Manager brought me to this design (note that there are extra clustered indices applied to the final design that are not shown in this diagram):
Step 2 – Get The Data
Using the Tesco grocery API I parsed through all the departments, aisles and shelves and copied some essential product data – just ProductID, BaseProductID (known as TPNB for future use when I hook up the application to the Tesco API for ‘add to basket’) and the product’s description name into a table called ProductsTable – totalling 21,638 rows.
Step 3 – Extract and store individual words to create the Lexicon. I took each product description and extracted individual words thanks to C#’s String.Split(‘ ‘) method using space ‘ ‘ as a delimiter. Each word was inserted into the LexiconTable (unless it was already there, in which case I just incremented the ProductCount value at the rate of one per product – a value which would be useful later).
Since I was creating a many-to-many relationship between ProductTable and LexiconTable (that is, products have more than word and each word can be used by more than one product), I also created a lookup table – LexiconProductTable. That way I could get to list the products quickly using any particular word much because it would be indexed (the alternative would have been to search for all products containing a particular word using the ‘LIKE’ keyword in SQL, which is highly inefficient and slow).
LexiconTable ended up with 11,237 unique words that described all the Tesco products, and LexiconProductTable had inserted 124,413 lookup rows.
If I sort by descending product count, I can uncover the top words used in most Tesco products! Intrigued? Here is the top 20:
10
Tesco
6288
4
And
2337
14
Pack
1320
136
Chicken
873
238
75Cl
732
311
500G
714
275
Sauce
665
150
Finest
661
49
400G
660
29
In
638
199
200G
627
60
Chocolate
623
47
250Ml
568
21
Value
532
24
1
517
465
2
517
266
250G
508
37
4
498
120
Organic
486
25
Litre
475
The left column is the LexiconID unique identifier, and the right column is the count of the number of products in which the word is used.
Step 4 – Build Indexed Searching based in anticipated input Now to get this application fast-acting when anyone typed in a search character. I needed a table that had a unique match for up to the first 5 letters typed in. That’s ANY 1, 2, 3, 4, or 5 characters from ‘aaaaa’ to ‘zzzzz’ as long as there at is at least one lexicon word that satisfies the particular combination. Index this match and the system should react like lightning!
To do this I wrote some code to look at each word in the LexiconTable that had an ‘a’ in it and, if so, inserted an entry into a new table, LexiconSearchCharTable. I then searched ‘b’, ‘c” and so on. Indeed I had a 5-nested search loop that efficiently checked for up to 5-letter combinations from ‘aaaaa’ to ‘zzzzz’ like this:
Does the current Lexicon word I am studying have an ‘a’? If yes, does it have an ‘aa’? If yes, does it have an ‘aaa’?
If yes, does it have an ‘aaaa’?
If yes, does it have an ‘aaaaa’?
In other words, I iterated through the alphabet from a-z, and if I got a ‘hit’ I then iterated again through the alphabet for a second character ‘aa-az’. Any hit I got I iterated through the alphabet using a third character ‘aaa’-‘azz’, then a fourth ‘aaaa’-‘azzz’ and finally a fifth character ‘aaaaa-azzzz’. This worked really well without going through every combination of ‘aaaaa’-‘zzzzz’ since of course words that have no characters ‘ab’ could not possibly have characters ‘aba’ through ‘abz’ either!
LexiconSearchCharTable ended up with 211,561 rows. Here is what the first 20 rows look like:
1
E
8973
2
H
8973
3
HE
8973
4
T
8973
5
TH
8973
6
THE
8973
7
B
7136
8
BE
7136
9
BER
7136
10
BERR
7136
11
BERRY
7136
12
C
7136
13
E
7136
14
ER
7136
15
ERR
7136
16
ERRY
7136
17
R
7136
18
RR
7136
19
RRY
7136
20
RY
7136
The left column is the table’s unique identifier for each row, the middle row contains the successfully tested characters, and the third column is the LexiconTable unique identifier for a word that matched. You can probably deduce that Lexicon ID 8973 is “THE” and 7136 has “BERRY” and a mysterious “C” – the word actually stored is “C/BERRY” (cranberry).
A user typing in ‘B’ will have row 7 returned. There will be other rows with just ‘B’ in them but they will be returned instantly since that column is indexed. Only rows with just ‘B’ in this column will be returned. Type in ‘BE’ and now row 8 is returned instead, really fast along with the other ‘BE’-only rows,
Step 5 – Extract appropriate data on demand QUICKLY via a stored procedure So now I have my data in a good place. I just needed a stored procedure to act swiftly on the tables to bring back the list of products quickly. Here we go:
CREATE PROCEDURE [dbo].[TescoGroceryInstantSearch] ( @SearchCharacters varchar(5) ) as SELECT TOP 20 pt.ProductID, pt.Description FROM ProductsTable pt INNER JOIN LexiconProductTable lpt ON pt.ProductID = lpt.ProductID INNER JOIN LexiconTable lt ON lt.LexiconID = lpt.LexiconID INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID WHERE SearchCharacters = @SearchCharacters ORDER BY lt.ProductCount DESC
Nice – an easy to read and a quick way to grab the data! As you type each characters into the search box on the web page, the list of products results quickly from this stored procedure. This is confirmed by the SQL Execution Plan revealed by my SQL Server – all the heavy percentage CPU costs are borne by clustered index seeking.
I may as well make the maximum use of the data so I have a second stored procedure that will run in parallel with the first one in my web application, revealing word suggestions to help the user get to the product even quicker. Here it is:
CREATE PROCEDURE [dbo].[TescoGroceryInstantSearchSuggestions] ( @SearchCharacters varchar(5) ) as SELECT TOP 15 lt.LexiconID, lt.LexiconWord FROM LexiconTable lt INNER JOIN LexiconSearchCharTable lsct on lsct.LexiconID = lt.LexiconID WHERE lsct.SearchCharacters = @SearchCharacters ORDER BY lt.ProductCount DESC
Again its clustered index seeking all the way. Nice! No need for third party software – just some intelligent thinking around data storage, formatting and extraction. I always like to try stuff out myself before I recommend buying anything, as I prefer to use the tools I already have.
Step 6 – Build a simple web application to make it all happen!
Now back to the whole reason for doing this. Can you get to a chosen product quicker? Go play at http://www.techfortesco.com/tescoinstant and let me know.
Comentarii