Deep Active Time-series Clustering Based on Constraint Transitivity
doi: 10.11999/JEIT240855 cstr: 32379.14.JEIT240855
School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
Abstract: Existing deep active clustering methods cannot infer new Must-Link (ML) or Cannot-Link (CL) constraints from annotated samples, so their annotation costs remain high. This paper therefore proposes a deep active time-series clustering method based on constraint transitivity. The method maintains Annotation Cluster Sets (ACS) and corresponding Auxiliary Annotation Sets (AAS), and obtains the representation vectors of time-series samples from a pre-trained time-series autoencoder. In each training epoch of deep clustering, the sample closest to each cluster center in the representation space is sampled, annotated, and stored in an ACS, so that samples within an ACS belong to the same class while samples in different ACSs belong to different classes. A time-series sample is then randomly selected from the ACS with the fewest samples; the sample that does not belong to the same cluster yet is closest to that cluster's center is sampled, annotated, and stored in the AAS, so that samples in an ACS and its corresponding AAS belong to different classes. ML and CL constraints are inferred from the samples in each ACS and its corresponding AAS. The encoder parameters of the time-series autoencoder and the cluster centers are updated by the KL divergence between the t-distribution-based cluster distribution and the auxiliary distribution generated from it, together with a constraint loss that decreases the representation-space distance between ML-constrained samples and increases that between CL-constrained samples. Experimental results on 18 public datasets show that, under a low annotation budget, the method improves the average RI by more than 5% over typical existing baseline models.
Objective: The rapid advancement of the Internet of Things and sensor technology has led to the accumulation of vast amounts of unlabeled time-series data, making deep time-series clustering a key analytical approach. However, existing deep clustering methods lack supervised constraint information and label guidance, making them susceptible to noise and outliers. Deep semi-supervised clustering methods rely on predefined Must-Link (ML) and Cannot-Link (CL) constraints, limiting improvements in clustering performance. Existing active clustering approaches sample only within clusters in the representation space, overlooking pairwise annotations across different clusters. This results in lower-quality ML and CL constraints and prevents further extrapolation from manually annotated pairs, increasing annotation costs. To address these limitations, this paper proposes Deep Active Time-series Clustering based on Constraint Transitivity (DATC-CT), which improves clustering performance while reducing annotation costs.

Methods: DATC-CT defines Annotation Cluster Sets (ACS) and Auxiliary Annotation Sets (AAS) and obtains the representation vectors of time-series samples from a pre-trained autoencoder. In each clustering epoch, the sample closest to each cluster center in the representation space is selected, labeled, and stored in an ACS. This ensures that all samples within an ACS belong to the same category, while those in different ACSs belong to different categories. Next, a time-series sample is randomly chosen from the ACS with the fewest samples; the sample that lies outside that cluster yet is nearest to its center is then sampled, labeled, and stored in the ACS if the annotated relation is ML, or in the AAS if it is CL. Samples in an ACS and its corresponding AAS belong to different categories. ML and CL constraints are inferred from these samples. The encoder's network parameters and the cluster centers are updated using the KL divergence between the cluster distribution (modeled by a t-distribution) and an auxiliary distribution generated from it. Additionally, a constraint loss function is applied to reduce the distance between ML-constrained samples while increasing the distance between CL-constrained samples in the representation space.

Results and Discussions: Experimental results on 18 public datasets show that the proposed method improves the average Rand Index (RI) by more than 5% compared with existing deep time-series clustering methods (Table 2). Under the same labeling budget, it achieves an RI improvement of over 7% compared with existing active clustering methods (Table 3). These findings confirm the effectiveness of the active sampling strategy and the constraint reasoning mechanism. Additionally, the method infers a large number of ML and CL constraints from a small set of manually annotated constraints (Fig. 4), significantly reducing annotation costs.

Conclusions: This paper proposes a deep active time-series clustering model based on constraint transitivity, incorporating a two-phase active sampling strategy: exploration and consolidation. In the exploration phase, the model selects the sample closest to each cluster center in the representation space and stores it in an ACS. In the consolidation phase, a sample is randomly chosen from the ACS with the fewest samples; the sample that lies outside that cluster yet is nearest to its center is then sampled, labeled, and stored in the ACS or the AAS according to the annotated relation. The numbers of ACSs and AASs both equal the number of clusters. ML and CL constraints are inferred from the ACS and AAS samples. Experiments on public datasets demonstrate that inferring new clustering constraints reduces annotation costs and improves deep time-series clustering performance.
Keywords:
- Deep time-series clustering /
- Active learning /
- Constraint transitivity
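The clustering term described above follows the DEC family: a Student's-t soft assignment between embeddings and cluster centers is sharpened into an auxiliary target distribution and matched via KL divergence. The PyTorch sketch below illustrates that formulation under standard assumptions; the single degree of freedom of the t-kernel and the tensor names `z` and `centers` are assumptions here, so this is a reading aid rather than a reproduction of the paper's Eq. (2).

```python
import torch

def soft_assignment(z: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Cluster distribution q: Student's-t similarity (one degree of
    freedom, an assumption) between embeddings z (N x d) and cluster
    centers (G x d), normalized over clusters."""
    dist_sq = torch.cdist(z, centers) ** 2      # (N, G) squared distances
    q = 1.0 / (1.0 + dist_sq)                   # t-kernel
    return q / q.sum(dim=1, keepdim=True)       # rows sum to 1

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    """Auxiliary distribution p: q sharpened and renormalized by soft
    cluster frequency, as in DEC."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def kl_clustering_loss(q: torch.Tensor) -> torch.Tensor:
    """KL(p || q) averaged over samples; p is treated as a fixed target."""
    p = target_distribution(q).detach()
    return (p * (p.log() - q.log())).sum(dim=1).mean()
```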
Algorithm 1 Sampling and annotation in the exploration phase

Input: annotation count $b$, clusters $\left\{ \pi_g \right\}_{g=1}^G$, cluster centers $\left\{ \boldsymbol{\mu}_g \right\}_{g=1}^G$, ACSs $\left\{ S_g \right\}_{g=1}^G$
Output: updated ACSs $\left\{ S_g \right\}_{g=1}^G$
(1) for $i = 1$ to $G$ do:
(2)  select from $\pi_i$ the not-yet-selected sample $\boldsymbol{x}_p$ whose representation vector is closest to $\boldsymbol{\mu}_i$
(3)  for $j = 1$ to $G$ do:
(4)   if $S_j = \varnothing$ then:
(5)    put $\boldsymbol{x}_p$ into $S_j$
(6)    break
(7)   else:
(8)    randomly select a sample $\boldsymbol{x}_q$ from $S_j$ and annotate the relation between $\boldsymbol{x}_p$ and $\boldsymbol{x}_q$, $b = b + 1$
(9)    if $\boldsymbol{x}_p$ and $\boldsymbol{x}_q$ form an ML constraint then:
(10)    put $\boldsymbol{x}_p$ into $S_j$
(11)    break
(12)   end if
(13)  end if
(14) end for
(15) end for
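As a reading aid, a minimal NumPy sketch of Algorithm 1 follows. The oracle `annotate(p, q)`, which returns "ML" or "CL", stands in for the human annotator, and the container names (`clusters`, `acs`, `selected`) are hypothetical.

```python
import random
import numpy as np

def exploration_phase(z, clusters, centers, acs, annotate, selected, b):
    """z: (N, d) embeddings; clusters: list of G index lists; centers:
    (G, d) array; acs: list of G lists (the ACS sets); selected: set of
    already-annotated indices. Returns the updated annotation count b."""
    G = len(clusters)
    for i in range(G):
        # Step (2): not-yet-selected sample of cluster i closest to mu_i.
        members = [idx for idx in clusters[i] if idx not in selected]
        if not members:
            continue
        dists = np.linalg.norm(z[members] - centers[i], axis=1)
        p = members[int(np.argmin(dists))]
        selected.add(p)
        for j in range(G):
            if not acs[j]:             # steps (4)-(6): empty ACS takes x_p
                acs[j].append(p)
                break
            q = random.choice(acs[j])  # step (8): one member suffices, since
            b += 1                     # an ACS is class-homogeneous
            if annotate(p, q) == "ML":
                acs[j].append(p)       # steps (9)-(11)
                break
    return b
```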
Algorithm 2 Sampling and annotation in the consolidation phase

Input: annotation count $b$, clusters $\left\{ \pi_g \right\}_{g=1}^G$, cluster centers $\left\{ \boldsymbol{\mu}_g \right\}_{g=1}^G$, ACSs $\left\{ S_g \right\}_{g=1}^G$, AASs $\left\{ \bar{S}_g \right\}_{g=1}^G$
Output: updated ACSs $\left\{ S_g \right\}_{g=1}^G$ and AASs $\left\{ \bar{S}_g \right\}_{g=1}^G$
(1) randomly select a sample $\boldsymbol{x}_p$ from the ACS $S_g$ with the fewest samples; from outside the cluster $\pi_g$ to which $\boldsymbol{x}_p$ belongs, select the sample $\boldsymbol{x}_q$ whose representation vector is closest to $\boldsymbol{\mu}_g$
(2) annotate the relation between $\boldsymbol{x}_p$ and $\boldsymbol{x}_q$, $b = b + 1$
(3) if $\boldsymbol{x}_p$ and $\boldsymbol{x}_q$ form an ML constraint then:
(4)  put $\boldsymbol{x}_q$ into $S_g$
(5) end if
(6) if $\boldsymbol{x}_p$ and $\boldsymbol{x}_q$ form a CL constraint then:
(7)  put $\boldsymbol{x}_q$ into $\bar{S}_g$
(8) end if
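A matching sketch of Algorithm 2, continuing the hypothetical names from the exploration sketch; skipping previously annotated samples when picking $\boldsymbol{x}_q$ is an added assumption to avoid re-querying the same pair across epochs.

```python
import random
import numpy as np

def consolidation_phase(z, clusters, centers, acs, aas, annotate, selected, b):
    """One oracle query per call; acs/aas are lists of G lists."""
    # Step (1): smallest ACS, a random member x_p, and the sample x_q that
    # lies outside cluster g yet is closest to its center mu_g.
    g = min(range(len(acs)), key=lambda k: len(acs[k]))
    p = random.choice(acs[g])
    candidates = [idx for idx in range(len(z))
                  if idx not in clusters[g] and idx not in selected]
    dists = np.linalg.norm(z[candidates] - centers[g], axis=1)
    q = candidates[int(np.argmin(dists))]
    selected.add(q)
    b += 1                      # step (2): spend one annotation
    if annotate(p, q) == "ML":
        acs[g].append(q)        # steps (3)-(5): same class as the ACS
    else:
        aas[g].append(q)        # steps (6)-(8): record in the auxiliary set
    return b
```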
Algorithm 3 Generating the ML constraint set $M$ and the CL constraint set $C$

Input: ACSs $\left\{ S_g \right\}_{g=1}^G$ and AASs $\left\{ \bar{S}_g \right\}_{g=1}^G$
Output: ML constraint set $M$, CL constraint set $C$
(1) initialize $M = \varnothing$, $C = \varnothing$
(2) for $g = 1$ to $G$ do
(3)  for $\boldsymbol{x}_i, \boldsymbol{x}_j \in S_g$ do
(4)   add the pair $(\boldsymbol{x}_i, \boldsymbol{x}_j)$ to $M$
(5)  end for
(6)  for $\boldsymbol{x}_i \in S_g$, $\boldsymbol{x}_j \in \bar{S}_g$ do
(7)   add the pair $(\boldsymbol{x}_i, \boldsymbol{x}_j)$ to $C$
(8)  end for
(9)  for $k = g + 1$ to $G$ do
(10)  for $\boldsymbol{x}_i \in S_g$, $\boldsymbol{x}_j \in S_k$ do
(11)   add the pair $(\boldsymbol{x}_i, \boldsymbol{x}_j)$ to $C$
(12)  end for
(13) end for
(14) end for
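Algorithm 3 translates directly into set operations over the ACS/AAS containers; the sketch below assumes the same list-of-lists layout as the earlier sketches.

```python
from itertools import combinations, product

def generate_constraints(acs, aas):
    """ML pairs from within each ACS; CL pairs from each ACS versus its
    AAS and from every pair of distinct ACSs."""
    M, C = set(), set()
    G = len(acs)
    for g in range(G):
        M.update(combinations(acs[g], 2))      # steps (3)-(5): same ACS -> ML
        C.update(product(acs[g], aas[g]))      # steps (6)-(8): ACS vs. AAS -> CL
        for k in range(g + 1, G):
            C.update(product(acs[g], acs[k]))  # steps (9)-(13): cross-ACS -> CL
    return M, C
```

Pairs are stored as index tuples, so deduplication comes free from the set containers.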
Algorithm 4 The DATC-CT procedure

Input: dataset $X$, total budget $B$, current budget $b$, number of training epochs $T$, ACSs $\left\{ S_g \right\}_{g=1}^G$ and AASs $\left\{ \bar{S}_g \right\}_{g=1}^G$, active-sampling phase flag SFlag
Output: cluster centers $\left\{ \boldsymbol{\mu}_g \right\}_{g=1}^G$
(1) cluster the representation vectors $\boldsymbol{z}_1, \boldsymbol{z}_2, \cdots, \boldsymbol{z}_N$ of the samples $\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N$ with K-means to obtain the initial cluster centers $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \cdots, \boldsymbol{\mu}_G$
(2) initialize $S_1, S_2, \cdots, S_G = \varnothing$, $\bar{S}_1, \bar{S}_2, \cdots, \bar{S}_G = \varnothing$, SFlag = True, $b = 0$
(3) for $t = 1$ to $T$ do
(4)  obtain the embeddings $\boldsymbol{z}_1, \boldsymbol{z}_2, \cdots, \boldsymbol{z}_N$ of the samples $\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_N$ with the TAE encoder
(5)  compute the cluster distribution of the samples with Eq. (2) to form the clusters $\pi_1, \pi_2, \cdots, \pi_G$
(6)  if $b < B$ then:
(7)   if SFlag then:
(8)    update $S_1, S_2, \cdots, S_G$ with Algorithm 1
(9)   else:
(10)   update $S_1, S_2, \cdots, S_G$ and $\bar{S}_1, \bar{S}_2, \cdots, \bar{S}_G$ with Algorithm 2
(11)  end if
(12)  if every ACS among $S_1, S_2, \cdots, S_G$ contains at least one sample then:
(13)   SFlag = False
(14)  end if
     end if
(15) generate the constraint sets $M$ and $C$ with Algorithm 3
(16) compute the model loss $L_3$ with Eq. (6)
(17) update the TAE encoder parameters and the cluster centers $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \cdots, \boldsymbol{\mu}_G$ with $L_3$
(18) end for
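Steps (16) and (17) combine the KL clustering term with a pairwise constraint loss that shrinks ML distances and grows CL distances. The sketch below shows one common hinge-style formulation; the margin `m`, the weight `lam`, and the way the two terms are combined are assumptions standing in for the paper's Eq. (6), and `kl_clustering_loss` is the function from the sketch following the keywords.

```python
import torch

def constraint_loss(z, M, C, m: float = 1.0) -> torch.Tensor:
    """z: (N, d) embeddings; M/C: sets of (i, j) index pairs."""
    loss = z.new_zeros(())
    for i, j in M:                            # pull ML pairs together
        loss = loss + (z[i] - z[j]).pow(2).sum()
    for i, j in C:                            # push CL pairs apart, up to m
        d = (z[i] - z[j]).pow(2).sum().sqrt()
        loss = loss + torch.clamp(m - d, min=0.0).pow(2)
    return loss

def total_loss(q, z, M, C, lam: float = 0.1) -> torch.Tensor:
    """L3 = KL clustering term + lam * constraint term (form assumed)."""
    return kl_clustering_loss(q) + lam * constraint_loss(z, M, C)
```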
Table 1 Description of the experimental datasets

| Dataset | Total samples | Classes | Sequence length |
|---|---|---|---|
| ChlorineConcentration | 4307 | 3 | 166 |
| Dist.Phal.Outl.AgeGroup | 539 | 3 | 80 |
| Dist.Phal.Outl.Correct | 876 | 2 | 80 |
| ECG200 | 200 | 2 | 96 |
| ECGFiveDays | 884 | 2 | 136 |
| GunPoint | 200 | 2 | 150 |
| Ham | 214 | 2 | 431 |
| Herring | 128 | 2 | 512 |
| MoteStrain | 1272 | 2 | 84 |
| OSULeaf | 442 | 6 | 427 |
| Plane | 210 | 7 | 144 |
| Prox.phal.outl.AgeGroup | 605 | 3 | 80 |
| Prox.Phal.TW | 605 | 6 | 80 |
| SonyAIBORobotSurface1 | 621 | 2 | 70 |
| SonyAIBORobotSurface2 | 980 | 2 | 65 |
| SwedishLeaf | 1125 | 15 | 128 |
| ToeSegmentation1 | 268 | 2 | 277 |
| TwoLeadECG | 166 | 2 | 343 |
Table 2 Comparison with related deep time-series clustering methods (RI)

| Dataset | DEC | DTC | DTCR | DATC-CT |
|---|---|---|---|---|
| ChlorineConcentration | 0.4994 | 0.5353 | 0.5357 | 0.5370 |
| Dist.Phal.Outl.AgeGroup | 0.7321 | 0.7812 | 0.7825 | 0.7918 |
| Dist.Phal.Outl.Correct | 0.4998 | 0.5010 | 0.6075 | 0.5824 |
| ECG200 | 0.6308 | 0.6018 | 0.6648 | 0.7345 |
| ECGFiveDays | 0.5002 | 0.5016 | 0.9638 | 1.0000 |
| GunPoint | 0.4975 | 0.5400 | 0.6398 | 0.8219 |
| Ham | 0.4987 | 0.5648 | 0.5362 | 0.5841 |
| Herring | 0.4971 | 0.5045 | 0.5759 | 0.5153 |
| MoteStrain | 0.7621 | 0.5062 | 0.7686 | 0.9228 |
| OSULeaf | 0.7283 | 0.7329 | 0.7739 | 0.7742 |
| Plane | 0.9283 | 0.9040 | 0.9549 | 0.9921 |
| Prox.phal.outl.AgeGroup | 0.3893 | 0.7430 | 0.8091 | 0.8140 |
| Prox.Phal.TW | 0.5151 | 0.8380 | 0.9023 | 0.8568 |
| SonyAIBORobotSurface1 | 0.8988 | 0.5563 | 0.8769 | 0.9967 |
| SonyAIBORobotSurface2 | 0.6509 | 0.7012 | 0.8354 | 0.8562 |
| SwedishLeaf | 0.8837 | 0.8871 | 0.9223 | 0.9241 |
| ToeSegmentation1 | 0.4995 | 0.5077 | 0.5659 | 0.5158 |
| TwoLeadECG | 0.5001 | 0.5116 | 0.7114 | 1.0000 |
| No. of best results | 0 | 0 | 4 | 14 |
| Average RI | 0.6173 | 0.6343 | 0.7459 | 0.7900 |
Table 3 Comparison with existing active selection strategies (RI)

| Dataset | Rand | ADC | DATC-CT |
|---|---|---|---|
| ChlorineConcentration | 0.5294 | 0.5342 | 0.5370 |
| Dist.Phal.Outl.AgeGroup | 0.7313 | 0.7218 | 0.7918 |
| Dist.Phal.Outl.Correct | 0.5019 | 0.5059 | 0.5824 |
| ECG200 | 0.6882 | 0.6699 | 0.7345 |
| ECGFiveDays | 0.6747 | 0.8311 | 1.0000 |
| GunPoint | 0.5754 | 0.5348 | 0.8219 |
| Ham | 0.5302 | 0.5173 | 0.5841 |
| Herring | 0.5062 | 0.5022 | 0.5153 |
| MoteStrain | 0.8300 | 0.8915 | 0.9228 |
| OSULeaf | 0.7443 | 0.7257 | 0.7742 |
| Plane | 0.9506 | 0.9819 | 0.9921 |
| Prox.phal.outl.AgeGroup | 0.7656 | 0.8047 | 0.8140 |
| Prox.Phal.TW | 0.8315 | 0.8566 | 0.8568 |
| SonyAIBORobotSurface1 | 0.9526 | 0.9744 | 0.9967 |
| SonyAIBORobotSurface2 | 0.8033 | 0.7588 | 0.8562 |
| SwedishLeaf | 0.9051 | 0.9128 | 0.9241 |
| ToeSegmentation1 | 0.4982 | 0.5029 | 0.5158 |
| TwoLeadECG | 0.5011 | 0.9982 | 1.0000 |
| No. of best results | 0 | 0 | 18 |
| Average RI | 0.6955 | 0.7347 | 0.7900 |