What is association rule mining?
A:
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Definition
• Support count (\sigma): the number of times an itemset appears in the transactions
∘ Frequency of occurrence of an itemset
∘ E.g. \sigma({Milk, Bread,Diaper}) = 2
• Support: the support count of an itemset divided by the total number of transactions
∘ Fraction of transactions that contain an itemset
∘ E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
∘ An itemset whose support is greater than or equal to a minsup threshold
• Rule Evaluation Metrics
∘ Support (s): Fraction of transactions that contain both X and Y, i.e. s = \sigma(X \cup Y)/|T|
∘ Confidence (c): Measures how often items in Y appear in transactions that contain X, i.e. c = \sigma(X \cup Y)/\sigma(X)
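The metrics above can be computed directly. Below is a minimal sketch, using a hypothetical five-transaction dataset (the item names are illustrative, chosen so that \sigma({Milk, Bread, Diaper}) = 2 as in the example):

```python
# Minimal sketch of support count, support, and confidence.
# The 5-transaction dataset is hypothetical, chosen so that
# sigma({Milk, Bread, Diaper}) = 2, matching the example above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```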
Maximal Frequent Itemset:
• An itemset is maximal frequent if none of its immediate supersets is frequent (i.e. every immediate superset fails to be a frequent itemset)
• example:
∘ Items: a, b, c, d, e
∘ Frequent itemset: {a, b, c}
∘ {a, b, c, d}, {a, b, c, e}, and {a, b, c, d, e} are not frequent itemsets.
∘ Maximal frequent itemset: {a, b, c}
Closed Itemset:[2]
An itemset is closed if none of its immediate supersets has the same support as the itemset.
So, for example, if {Bread, Milk} has support = 4 and all of its supersets have support < 4, then {Bread, Milk} is a closed itemset.
Counterexample: if {Bread, Milk, Sugar} also has support = 4, then {Bread, Milk} is no longer a closed itemset.
Note: the definition says "the same" rather than "the same or more" because a superset can never have greater support than any of its subsets.
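Both definitions (maximal and closed) can be checked mechanically by examining immediate supersets. A small sketch, over a hypothetical four-transaction dataset on items a–e chosen so that {a, b, c} comes out maximal, as in the example above:

```python
# Hypothetical transactions over items a..e, chosen so that {a, b, c}
# is a maximal frequent itemset at a minsup count of 2.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "e"},
]
items = set().union(*transactions)

def sigma(itemset):
    """Support count of an itemset."""
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """Closed: no immediate superset has the same support count."""
    return all(sigma(itemset | {i}) < sigma(itemset) for i in items - itemset)

def is_maximal(itemset, minsup_count):
    """Maximal frequent: frequent, and no immediate superset is frequent."""
    return (sigma(itemset) >= minsup_count and
            all(sigma(itemset | {i}) < minsup_count for i in items - itemset))
```

For instance, {c} is frequent here but not closed, because its immediate superset {a, c} has the same support count.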
What is the purpose of finding closed frequent itemsets?
The main purpose of finding closed frequent itemsets is to recover the support of every frequent itemset. Maximal frequent itemsets are enough to enumerate all frequent itemsets, but they do not preserve the support values of those itemsets; closed itemsets do, so the support of any frequent itemset that is not itself closed can be derived from its closed supersets.
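Assuming the closed frequent itemsets and their support counts have already been mined, the support of any other frequent itemset is the maximum support among its closed supersets. A tiny sketch with a hypothetical closed-itemset table:

```python
# Hypothetical closed frequent itemsets (itemset -> support count).
closed = {
    frozenset("abc"): 2,
    frozenset("ab"): 3,
    frozenset("a"): 4,
}

def support_of(itemset):
    """Support of a frequent itemset = max support over its closed supersets."""
    return max(cnt for cs, cnt in closed.items() if itemset <= cs)

print(support_of(frozenset("b")))   # 3, inherited from the closed set {a, b}
print(support_of(frozenset("ac")))  # 2, inherited from {a, b, c}
```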
Apriori Algorithm
• Apriori principle:
∘ If an itemset is frequent, then all of its subsets must also be frequent
• Apriori property:
∘ The support of an itemset never exceeds the support of any of its subsets
∘ Apriori exploits this anti-monotone property to avoid generating unnecessary candidate itemsets, shrinking the search space [1]
∘ See ch6 ppt slide 14 for an example
• Apriori algorithm:
∘ uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent
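The generate-and-test loop can be sketched as a simplified, unoptimized level-wise Apriori (the dataset and minimum support count below are illustrative assumptions):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise generate-and-test; returns {frequent itemset: support count}."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = set().union(*transactions)
    # Level 1: frequent single items
    frequent = {frozenset([i]): count(frozenset([i])) for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= minsup_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Generate size-k candidates by joining frequent (k-1)-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset (Apriori principle)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Test: count the surviving candidates against the database
        frequent = {c: count(c) for c in candidates if count(c) >= minsup_count}
        result.update(frequent)
        k += 1
    return result

# Hypothetical 5-transaction dataset; with minsup count 3 this
# yields 8 frequent itemsets (4 singletons and 4 pairs).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = apriori(transactions, minsup_count=3)
```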
Apriori algorithm disadvantages
• Generation of candidate itemsets is expensive (in both space and time)
• Support counting is expensive
∘ Subset checking (computationally expensive)
∘ Multiple database scans (I/O)
Frequent Pattern Growth (FP-Growth) Algorithm
FP-Growth Tree[3]
FP-Growth algorithm
1. Scan DB once, find frequent 1-itemset (single item pattern)
2. Order frequent items in frequency descending order
3. Scan DB again, construct FP-tree
min_support = 50%
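The three steps above can be sketched as a minimal FP-tree builder. The node layout and dataset are illustrative assumptions (node links and the mining phase are omitted), using min_support = 50% as noted:

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and child links."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support=0.5):
    minsup_count = min_support * len(transactions)
    # Step 1: one DB scan to find frequent single items
    counts = Counter(i for t in transactions for i in t)
    frequent = {i: c for i, c in counts.items() if c >= minsup_count}
    # Step 2: order frequent items by descending frequency (ties: by name)
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {i: r for r, i in enumerate(order)}
    # Step 3: second DB scan; insert each filtered, reordered transaction
    root = FPNode(None, None)
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.__getitem__)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, order

# Hypothetical dataset (item names illustrative); min_support = 50%
# drops Eggs and Coke, so shared prefixes compress onto one branch.
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Beer", "Eggs"],
    ["Milk", "Diaper", "Beer", "Coke"],
    ["Bread", "Milk", "Diaper", "Beer"],
    ["Bread", "Milk", "Diaper", "Coke"],
]
root, order = build_fp_tree(transactions)
```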
-----------------------------------------------------------------------------------------------------------
[1] http://dm07course.blogspot.com/search/label/chapter
[2] http://wiki.answers.com/Q/What_is_closed_frequent_itemset_in_datamining#ixzz1AeRZ8Nea
[3] http://www.florian.verhein.com/teaching/2008-01-09/