Current solutions to multimedia modeling tasks feature sequential models and static tree-structured models. Sequential models, especially those based on Bidirectional LSTM (BLSTM) and multilayer LSTM networks, have been widely applied to video, sound, music, and text corpora. Despite achieving state-of-the-art results on several multimedia processing tasks, sequential models often fail to emphasize short-term dependency relations, which are crucial in most sequential multimedia data. Tree-structured models are able to overcome this defect. The static tree-structured LSTM presented by Tai et al. (Tai, Socher, and Manning 2015), whose node transition is recalled below, forcibly severs the dependencies between elements inside a semantic group and those outside the group, while preserving chain dependencies among semantic groups and among nodes within the same group. Although the Tree-LSTM network can better represent the dependency structure of multimedia data, it requires the dependency relations of the input to be known before the data are fed into the network. This is hard to achieve, since for most types of multimedia data no parser exists that can detect the dependency structure of every input sequence accurately enough.

To preserve dependency information while eliminating the need for a perfect parser, in this paper we present a novel neural network architecture that 1) is self-expandable and 2) maintains the layered dependency structure of incoming multimedia data. We call this architecture the Seq2Tree network. A Seq2Tree model is applicable to classification, prediction, and generation tasks with task-specific adjustments of the model. Our experiments show that the Seq2Tree model performs well on all three types of tasks.
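For reference, we recall the Child-Sum variant of the Tree-LSTM transition from the cited work (Tai, Socher, and Manning 2015), using its notation: a node $j$ with input $x_j$ and children $C(j)$ aggregates the children's hidden states, and each child's memory cell is modulated by its own forget gate $f_{jk}$, which is what allows dependencies among nodes in the same group to be weighted separately from those outside it:
$$
\begin{aligned}
\tilde{h}_j &= \sum_{k \in C(j)} h_k, \\
i_j &= \sigma\!\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right), \\
f_{jk} &= \sigma\!\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right), \\
o_j &= \sigma\!\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right), \\
u_j &= \tanh\!\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right), \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k, \\
h_j &= o_j \odot \tanh(c_j).
\end{aligned}
$$
Note that this computation presupposes a fixed tree over the input, which is precisely the requirement our architecture removes.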