A Distributed Topic Model for Large-Scale Streaming Text

Published in International Conference on Knowledge Science, Engineering and Management, 2019

Recommended citation: Li, Yicong, et al. (2019). "A Distributed Topic Model for Large-Scale Streaming Text." In Knowledge Science, Engineering and Management: 12th International Conference, KSEM 2019, Athens, Greece, August 28–30, 2019, Proceedings, Part II. Springer International Publishing. https://link.springer.com/chapter/10.1007/978-3-030-29563-9_4

Learning topic information from large-scale unstructured text has attracted extensive attention from both academia and industry. Topic models, such as LDA and its variants, are a popular machine learning technique for discovering such latent structure. Among them, the online variational hierarchical Dirichlet process (onlineHDP) is a promising candidate for dynamically processing streaming text: instead of being statically assigned in advance, the number of topics in onlineHDP is inferred from the corpus as training proceeds. However, when dealing with large-scale streaming data, onlineHDP still suffers from the limited model capacity problem. To this end, we propose a distributed version of the onlineHDP algorithm (named DistHDP), in which the training task is split into many sub-batch tasks and distributed across multiple worker nodes, so that the whole training process is accelerated. Model convergence is guaranteed through a distributed variational inference algorithm. Extensive experiments conducted on several real-world datasets demonstrate the usability and scalability of the proposed algorithm.
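The distributed scheme the abstract describes (sub-batch tasks fanned out to worker nodes, with their results merged into a single variational update of the global model) can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: it uses an LDA-like local E-step with a fixed number of topics in place of onlineHDP's stick-breaking construction, the learning-rate schedule `rho = (tau0 + t)^(-kappa)` is the standard one from stochastic variational inference rather than anything specific to DistHDP, and every function and parameter name here (`local_e_step`, `distributed_step`, `eta`, `tau0`, `kappa`) is hypothetical.

```python
import numpy as np
from functools import partial
from concurrent.futures import ProcessPoolExecutor


def local_e_step(docs, topics, n_iter=20):
    """Worker-side E-step for one sub-batch: returns expected topic-word
    counts (a K x V sufficient-statistics matrix) under the current topics."""
    K, V = topics.shape
    stats = np.zeros((K, V))
    for counts in docs:                        # counts: length-V word-count vector
        words = np.nonzero(counts)[0]
        theta = np.full(K, 1.0 / K)            # per-document topic proportions
        phi = np.full((K, words.size), 1.0 / K)
        for _ in range(n_iter):
            # responsibilities: phi[k, w] proportional to theta[k] * topics[k, w]
            phi = theta[:, None] * topics[:, words]
            phi /= phi.sum(axis=0, keepdims=True)
            theta = (phi * counts[words]).sum(axis=1) + 1e-12
            theta /= theta.sum()
        stats[:, words] += phi * counts[words]
    return stats


def distributed_step(lam, sub_batches, total_docs, t,
                     eta=0.01, tau0=64.0, kappa=0.7):
    """One master-side update: fan the sub-batches out to workers, merge
    their sufficient statistics, and take one stochastic variational step
    on the global topic parameters lam."""
    topics = lam / lam.sum(axis=1, keepdims=True)    # point estimate of topics
    with ProcessPoolExecutor() as pool:
        all_stats = pool.map(partial(local_e_step, topics=topics), sub_batches)
    stats = sum(all_stats)                           # merge across workers
    batch_size = sum(len(b) for b in sub_batches)
    rho = (tau0 + t) ** (-kappa)                     # decaying learning rate
    lam_hat = eta + (total_docs / batch_size) * stats  # rescaled batch estimate
    return (1.0 - rho) * lam + rho * lam_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, V, D = 8, 500, 256                  # topics, vocabulary size, batch size
    lam = rng.gamma(1.0, 1.0, size=(K, V))
    docs = rng.poisson(0.05, size=(D, V)).astype(float)
    sub_batches = np.array_split(docs, 4)  # four workers, one sub-batch each
    lam = distributed_step(lam, sub_batches, total_docs=100_000, t=0)
    print(lam.shape)
```

In an actual deployment the worker pool would span multiple machines rather than local processes, but the structure is the same: because workers only return sufficient statistics, merging them before the single global update is what keeps the model consistent across sub-batches.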

Download paper here