C# .NET - how to search data from different databases using a clustering algorithm

Asked By mani on 06-Mar-12 12:58 AM
Hi, I have data in four different databases,
and I need to search that data using a clustering algorithm in C#.NET. Any ideas?
Somesh Yadav replied to mani on 06-Mar-12 01:10 AM

Standard clustering is done using a vector space model (http://en.wikipedia.org/wiki/Vector_space). The easiest way to do this is to create a file laid out like a spreadsheet, where each row is a document/instance and each column is a variable/feature. With your dataset, the standard starting point would be to make each feature answer "Does this set of tags contain X?", with a 1 if it does and a 0 if it doesn't. You can then apply k-means to the resulting dataset, for example through Weka (http://www.cs.waikato.ac.nz/ml/weka/).
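
As a rough illustration of the encoding and clustering steps described above, here is a minimal sketch in Python (used only for brevity; the same steps carry over to C#). The sample records, the function names, and the fixed seeding of the centroids are all assumptions made for the example. It is a plain textbook k-means with Euclidean distance, not Weka's implementation.

```python
# Hypothetical sample data: each string holds one record's pipe-separated tags.
records = [
    "c#|conversion|datetime|j#",
    "c#|datetime|database|j#",
    "python|numpy|arrays",
    "python|numpy|matrix",
]


def to_binary_matrix(records):
    """Return (vocabulary, rows); rows[i][j] is 1 if record i contains tag j."""
    tag_sets = [set(record.split("|")) for record in records]
    vocabulary = sorted(set().union(*tag_sets))
    rows = [[1 if tag in tags else 0 for tag in vocabulary] for tags in tag_sets]
    return vocabulary, rows


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, initial_centroids, max_iterations=100):
    """Basic k-means (Lloyd's algorithm); returns one cluster label per row."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(row, centroids[k]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [row for row, label in zip(rows, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vocabulary, rows = to_binary_matrix(records)
# Seeding with rows 0 and 2 keeps this example deterministic;
# real code would normally choose the initial centroids at random.
labels = k_means(rows, initial_centroids=[rows[0], rows[2]])
print(labels)
```

With records pulled from several databases, you would first load them all into one list like `records` and then cluster the combined set.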

What this does, in practice, for your dataset is group together sets of tags that are very similar, such as those that share 75% of their tags (depending, of course, on the parameters). You will probably get a result similar to your example.

Another area you can look at is graph-based clustering. This builds a graph and splits it into subgraphs based on some criterion. It works towards the same kind of grouping, but can potentially give better results.
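
One very simple form of graph-based clustering can be sketched as follows, again in Python for brevity. The splitting criterion used here (connect two records when they share at least `min_shared` tags, then treat each connected component as a cluster) is an assumption chosen for illustration, as are the sample data and function name. Real graph clustering methods use more refined criteria.

```python
def graph_clusters(tag_sets, min_shared):
    """Group records into connected components of a 'shares enough tags' graph.

    An edge joins records i and j when they have at least `min_shared`
    tags in common (an illustrative criterion, not a standard one).
    """
    n = len(tag_sets)
    neighbours = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if len(tag_sets[i] & tag_sets[j]) >= min_shared:
                neighbours[i].add(j)
                neighbours[j].add(i)

    clusters = []
    seen = set()
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        component = []
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(component))
    return clusters


# Hypothetical sample data: one set of tags per record.
tag_sets = [
    {"c#", "conversion", "datetime", "j#"},
    {"c#", "datetime", "database", "j#"},
    {"python", "numpy", "arrays"},
    {"python", "numpy", "matrix"},
    {"sql"},
]
clusters = graph_clusters(tag_sets, min_shared=2)
print(clusters)
```

Unlike k-means, this approach does not need the number of clusters up front; the threshold decides how many groups appear, and records with no close match end up in a cluster of their own.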

Finally, once you have your initial results, you may want to experiment with what the features are, or with the method of calculating the distance between them. This gets a bit more advanced, though, and you may need to re-implement k-means to do it (someone please comment if they know of a good k-means implementation that takes an arbitrary distance metric!). One distance metric you could try is based on the ratio of the size of the intersection of the tags to the size of their union. For example:

c#|conversion|datetime|j#
c#|datetime|database|j#

These two tag sets have an intersection size of 3 (they share c#, datetime and j#) and a union size of 5 (there are 5 different tags in total). The similarity is then 3/5 = 0.6. This ratio is known as the Jaccard index. It can be turned into a distance metric by subtracting it from 1, which gives 1 - 0.6 = 0.4.
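
The calculation above can be written as a small function. This is a minimal sketch in Python (for brevity; it ports directly to C# using `HashSet<string>`). The function name is made up for the example, and returning 0 for two empty tag sets is an assumption, since the ratio is undefined in that case.

```python
def jaccard_distance(tags_a, tags_b):
    """Return 1 - |A intersect B| / |A union B| for two collections of tags."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    if not union:
        # Assumption: two empty tag sets are treated as identical.
        return 0.0
    return 1 - len(a & b) / len(union)


first = "c#|conversion|datetime|j#".split("|")
second = "c#|datetime|database|j#".split("|")
distance = jaccard_distance(first, second)
print(round(distance, 2))
```

The result is 0 for identical tag sets and 1 for sets with nothing in common, so smaller values mean more similar records.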