Data mining small datasets
I am new to data mining. From what I understand, most techniques are intended to be used with large data sets, but I am curious to know whether this is a must or just a general rule. In other words, is it OK to use data mining techniques on small data sets? Most examples work on small tables, but are there any limitations? Why?
Most data mining techniques are statistical approaches.
To find significant patterns, you need enough data. Otherwise, anything you measure may just be a random deviation due to chance. The more data you have, the more reliable your patterns can be.
But most data isn't "big" in the sense of "big data": many methods would not scale to really big data sets anyway. In most cases, you only have a few thousand records (not a few exabytes), in particular after preprocessing the data into the desired format.
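To make the "random deviations due to chance" point concrete, here is a minimal sketch (all names and numbers are illustrative, not from the original answer): it generates features that are pure noise and checks how well the "best" of them appears to separate a random binary target. On a small sample, some noise feature always looks predictive; on a large sample, the same chance effect nearly vanishes.

```python
import random

def best_chance_separation(n_rows, n_features, seed=0):
    """Generate purely random features and a random binary target, then
    report the largest between-class mean difference found.  Any
    'pattern' here is pure chance, since nothing is actually related."""
    rng = random.Random(seed)
    target = [rng.choice([0, 1]) for _ in range(n_rows)]
    n_pos = sum(target) or 1
    n_neg = (n_rows - sum(target)) or 1
    best = 0.0
    for _ in range(n_features):
        feature = [rng.random() for _ in range(n_rows)]
        mean_pos = sum(f for f, t in zip(feature, target) if t) / n_pos
        mean_neg = sum(f for f, t in zip(feature, target) if not t) / n_neg
        best = max(best, abs(mean_pos - mean_neg))
    return best

# With 20 rows, the strongest of 50 noise features looks like a real signal;
# with 5000 rows, the chance separation shrinks toward zero.
print(best_chance_separation(20, 50))
print(best_chance_separation(5000, 50))
```

This is exactly why small data sets invite overfitting: a mining algorithm searching many candidate patterns will find one of these chance separations and report it as a rule.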
I understand most techniques are intended to be used with large data sets, but I am curious to know if this is a must or just a general rule.
Using data mining techniques on small datasets is not "against the rules", as there are no rules about the size of your dataset. However, the recommendation to use large datasets comes from concerns about efficiency and accuracy.
Let's assume that you are working on a prediction engine, and in order to cover all the use cases you need to come up with a certain set of rules. Now for the data: since you are building a prediction model, you split your data into two sets, where the first set is your training set and the other is your testing set.
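The split described above can be sketched in a few lines (a minimal, hand-rolled version; in practice a library routine such as scikit-learn's `train_test_split` does the same thing):

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows, then split them into a training set and a
    testing set according to test_ratio."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

data = list(range(1000))          # stand-in for 1000 labelled records
train, test = train_test_split(data)
print(len(train), len(test))      # 800 200
```

Shuffling before the cut matters: if the historical records are ordered (say, by application date), a plain head/tail split would train and test on systematically different populations.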
Say your dataset is for accepting credit card applications: you check credit history, age, income, and 10 other factors, and each record carries the historical result, approved or declined.
Suppose you have 1000 rows for this problem, train your system on 800 of them, and test on the remaining 200. What will the AUC of your model be? Whatever it is, it will not be trustworthy: there is no way on earth that 800 rows covered all the use cases, and they never will. Therefore, the bigger the data, the better the mining model.
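The instability of an AUC measured on a 200-row test set can be demonstrated directly. The sketch below (all distributions and sample sizes are illustrative assumptions, not from the original answer) computes AUC by its rank definition and compares the spread of the estimate across repeated small test sets versus repeated larger ones:

```python
import random

def auc(pos_scores, neg_scores):
    """Rank definition of AUC: the probability that a randomly chosen
    positive outscores a randomly chosen negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(1)

def measured_auc(n_test):
    """Simulate a fixed model's scores (positives shifted up by 1) and
    measure its AUC on a test set of n_test examples."""
    pos = [rng.gauss(1, 1) for _ in range(n_test // 2)]
    neg = [rng.gauss(0, 1) for _ in range(n_test // 2)]
    return auc(pos, neg)

small_runs = [measured_auc(200) for _ in range(20)]
large_runs = [measured_auc(2000) for _ in range(5)]
print(max(small_runs) - min(small_runs))   # wide spread across reruns
print(max(large_runs) - min(large_runs))   # much tighter
```

The underlying model never changes here; only the test set does. On 200 rows, the measured AUC swings noticeably from one draw to the next, which is exactly why a single number reported from a small test set should not be taken at face value.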
It depends on the problem you want to solve. The data mining domain is very large, but in the context of machine learning techniques, having a "good" dataset is extremely important. In machine learning, a cold start can produce a model (= the implicit rules that the algorithm learns through training) that is less robust, since the amount of training data is not sufficient to generalize to other, new observations.
More than the quantity of data, there is the quality issue. If your data is unbalanced, erroneous, or unrelated to the problem you are solving (in terms of feature relevance), then the dataset size does not matter (or it would require a large amount of data cleansing and normalization anyway).
Therefore, data quantity is an issue especially when combined with data quality issues. Usually there is a trade-off between them, since producing high-quality data comes with a cost. You can read more here.