Songyang Zhang

PhD Student at University of Rochester


3504 Wegmans Hall

University of Rochester

Rochester, NY, USA

I am Songyang Zhang (张宋扬 in Chinese). I am currently a fifth-year PhD student in the Computer Science Department at the University of Rochester, advised by Prof. Jiebo Luo. Before that, I received my bachelor’s degree from Southeast University and my master’s degree from Zhejiang University, where I was supervised by Prof. Jun Xiao. I received the NAACL 2021 Best Long Paper Award. My research is on computer vision and natural language processing, especially the intersection of video and language.

I’m looking for industry research positions. Feel free to contact me.

Email: szhang83 AT ur DOT rochester DOT edu



Oct 14, 2022 Invited talk at UM-IoS Workshop @ EMNLP 2022.
Oct 6, 2022 One paper accepted to EMNLP 2022.

Selected publications

* denotes equal contribution. The full list is here.

  1. Learning a Grammar Inducer by Watching Millions of Instructional YouTube Videos
    Songyang Zhang, Linfeng Song, Lifeng Jin, Haitao Mi, Kun Xu, Dong Yu, and Jiebo Luo
    In Conference on Empirical Methods in Natural Language Processing 2022
  2. Make-A-Video: Text-to-Video Generation without Text-Video Data
    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman
    arXiv preprint arXiv:2209.14792 2022
  3. MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration
    Thomas Hayes*, Songyang Zhang*, Xi Yin, Guan Pang, Sasha Sheng, Harry Yang, Songwei Ge, Qiyuan Hu, and Devi Parikh
    In European Conference on Computer Vision 2022
  4. Expanding Language-Image Pretrained Models for General Video Recognition
    Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling
    In European Conference on Computer Vision 2022
  5. SAT: 2D Semantics Assisted Training for 3D Visual Grounding
    Zhengyuan Yang, Songyang Zhang, Liwei Wang, and Jiebo Luo
    In IEEE International Conference on Computer Vision 2021
  6. Video-aided Unsupervised Grammar Induction
    Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, and Jiebo Luo
    In Conference of the North American Chapter of the Association for Computational Linguistics 2021
  7. Multi-Scale 2D Temporal Adjacency Networks for Moment Localization with Natural Language
    Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, and Jiebo Luo
    IEEE Transactions on Pattern Analysis and Machine Intelligence 2021
  8. Learning Sparse 2D Temporal Adjacent Networks for Temporal Action Localization
    Songyang Zhang, Houwen Peng, Le Yang, Jianlong Fu, and Jiebo Luo
    arXiv preprint arXiv:1912.03612 2019
  9. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language
    Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo
    In the AAAI Conference on Artificial Intelligence 2020
  10. Exploiting Temporal Relationships in Video Moment Localization with Natural Language
    Songyang Zhang, Jinsong Su, and Jiebo Luo
    In ACM International Conference on Multimedia 2019
  11. Fusing Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks
    Songyang Zhang, Yang Yang, Jun Xiao, Xiaoming Liu, Yi Yang, Di Xie, and Yueting Zhuang
    IEEE Transactions on Multimedia 2018
  12. On Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks
    Songyang Zhang, Xiaoming Liu, and Jun Xiao
    In IEEE Winter Conference on Applications of Computer Vision 2017