NSF #1726523 Collaborative Research: Learning to Use Essential Tools and Resources for Data Science with a Cloud-Based Virtual Environment

Business, government, and science researchers are producing massive amounts of complex data. The availability of these huge datasets fuels a need for both data-driven analytics and a 21st-century workforce that can use data analytics to answer questions and solve problems. This collaborative project will develop a cloud-based virtual platform to train undergraduate students to use software tools essential to data science. The platform will make state-of-the-art computing resources, including both powerful data-analysis tools and parallel hardware systems, more accessible to students and faculty, even at institutions without locally available high-performance computing systems. The project aims to help students develop critical workforce skills in data science, and it will provide professional development opportunities to help faculty use data-analysis tools in their own courses and research.

The goal of this project is to develop a cloud-based infrastructure in the form of a virtual science platform with related training modules. First, the project will leverage an existing framework for building web applications to provide broad access to open-source, high-performance computing resources at the collaborating universities and through the NSF Extreme Science and Engineering Discovery Environment (XSEDE). The cloud-based platform will support both training of students and collaboration among them. Second, the project will produce a data science curriculum targeted at undergraduate students and also suitable for graduate students, post-doctoral researchers, and information technology professionals interested in data science. The project will deliver a full set of interactive documents and video tutorials on using and configuring the platform. The educational activities will use graphical, interactive, simulation-based, and experiential learning components, accessed through the cloud-based platform, to teach data science concepts and computing skills. Through the platform, students will have the opportunity to learn how to use powerful data science resources, enabling them to transform data-rich computer science and engineering problems into practical solutions. Third, the project will deliver professional development for faculty at multiple institutions, helping them learn how to use data science in their classrooms and in their own research. This project addresses national interests by making state-of-the-art computing resources more accessible to students, supporting their development of critical workforce skills.

Publications


2019:
  • G. Ruan & H. Zhang. Parallelized Topological Relaxation Algorithm. 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 3406-3415.
  • Abstract: Geometric problems of interest to mathematical visualization applications involve changing structures, such as the moves that transform one knot into an equivalent knot. In this paper, we describe mathematical entities (curves and surfaces) as link-node graphs and make use of energy-driven relaxation algorithms to optimize their geometric shapes by moving knots and surfaces toward their simplified equivalents. Furthermore, we design and configure parallel functional units in the relaxation algorithms to accelerate the computation these mathematical deformations require. Results show that we can achieve significant performance optimization via the proposed threading model and level of parallelization. [A simplified R sketch of the parallel relaxation step follows.]
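A rough, illustrative R sketch of the idea rather than the paper's implementation: relax a closed polyline (a simple stand-in for a knot) by moving each node under spring and repulsive forces, with the per-node force computation parallelized. The force model, function names, and constants below are assumptions.

    # Energy-driven relaxation of a closed polyline ("knot"), with the
    # per-node force computation parallelized via the 'parallel' package.
    # Note: mclapply forks processes, so mc.cores > 1 is unavailable on Windows.
    library(parallel)

    relax_knot <- function(pts, iters = 100, step = 0.01,
                           k_spring = 1.0, k_repel = 0.05, cores = 4) {
      n <- nrow(pts)
      for (it in seq_len(iters)) {
        forces <- mclapply(seq_len(n), function(i) {
          prev <- pts[if (i == 1) n else i - 1, ]
          nxt  <- pts[if (i == n) 1 else i + 1, ]
          # Spring term pulls each node toward its two curve neighbors
          f <- k_spring * ((prev - pts[i, ]) + (nxt - pts[i, ]))
          # Repulsion pushes nodes apart to untangle the curve (~1/d^2 falloff)
          for (j in seq_len(n)[-i]) {
            d <- pts[i, ] - pts[j, ]
            f <- f + k_repel * d / (sum(d^2) + 1e-9)^1.5
          }
          f
        }, mc.cores = cores)
        pts <- pts + step * do.call(rbind, forces)
      }
      pts
    }

    # Usage: smooth a noisy circle of 64 nodes in 3D
    theta <- seq(0, 2 * pi, length.out = 65)[-65]
    pts <- cbind(cos(theta), sin(theta), 0) + matrix(rnorm(64 * 3, sd = 0.1), 64, 3)
    smoothed <- relax_knot(pts)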

  • R. Subramanian & H. Zhang. Parallel R Computing on the Web. 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 3416-3423.
  • Abstract: R is the preferred language for data analytics due to its open-source development and high extensibility. Exponential growth in data has caused longer processing times, leading to the rise of parallel computing technologies for analysis. However, using R together with high-performance computing resources remains a cumbersome task. This paper proposes a framework that provides users with access to high-performance computing resources and simplifies configuration, programming, data upload, and job scheduling through a web user interface. In addition, the framework offers two modes of parallelization for data-intensive computing tasks, catering to a wide range of users. Case studies emphasize the utility and efficiency of the framework, which delivers better performance, ease of use, and high scalability.

  • R. Subramanian & H. Zhang. Parallel Framework for Data-Intensive Computing with XSEDE. In Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning) (PEARC '19), Chicago, IL, USA, 2019.
  • Abstract: With the increase in data-driven analytics, the demand for high-performing computing resources has risen. Many high-performance computing centers provide cyberinfrastructure (CI) for academic research, but access barriers keep these resources from reaching a broad range of users; those who are new to the data analytics field are not yet equipped to take advantage of the tools CI offers. In this project, we propose a framework that lowers these access barriers for users who do not have the training to utilize the capabilities of CI. The framework uses the divide-and-conquer (DC) paradigm for data-intensive computing tasks and consists of three layers: the user interface (UI), the parallel scripts generator (PSG), and the underlying cyberinfrastructure (CI). Its goal is a user-friendly method for parallelizing data-intensive computing tasks with minimal user intervention, with usability, scalability, and reproducibility as key design goals: users can focus on their problem and leave the parallelization details to the framework. The paper outlines the rationale behind the framework, details its implementation, and demonstrates its usage with practical use cases. [A sketch of the divide-and-conquer pattern follows.]
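To make the DC pattern concrete, here is a minimal local sketch in R of the divide / conquer / combine steps that the PSG layer would generate scripts for on real CI. The function names and toy workload are illustrative assumptions; scheduling on actual XSEDE resources is not shown.

    # Divide-and-conquer skeleton: split the data, analyze chunks in
    # parallel, then combine the per-chunk results.
    library(parallel)

    dc_apply <- function(data, n_chunks, analyze, combine = rbind, cores = 4) {
      # Divide: assign rows to roughly equal chunks
      chunks <- split(data, cut(seq_len(nrow(data)), n_chunks, labels = FALSE))
      # Conquer: run the user's analysis function on each chunk in parallel
      results <- mclapply(chunks, analyze, mc.cores = cores)
      # Combine: merge per-chunk results into one object
      do.call(combine, results)
    }

    # Usage: per-chunk group means over a large table
    big <- data.frame(g = sample(letters[1:3], 1e6, TRUE), x = rnorm(1e6))
    partial <- dc_apply(big, n_chunks = 8,
                        analyze = function(d) aggregate(x ~ g, d, mean))
    # Final aggregation across chunks (exact only for equal group counts
    # per chunk; shown just to illustrate the combine/aggregate step)
    final <- aggregate(x ~ g, partial, mean)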

  • R. Subramanian & H. Zhang. Automatic Code Parallelization for Data-Intensive Computing in Multicore Systems. Journal of Physics: Conference Series, Volume 1411, Number 1.
  • Abstract: A major driving force behind the increasing popularity of data science is the growing need for data-driven analytics fueled by massive amounts of complex data. Parallel processing has become a cost-effective approach to computationally large and data-intensive problems. Many existing applications are sequential in nature; when ported to multi-processor systems, they use only one core, and optimal usage of all cores is not guaranteed. Knowledge of parallel programming is necessary to harness the processing power of multi-processor systems, yet many users do not possess the skills required to convert existing sequential code into parallel code that achieves speedups and scalability. In this paper, we introduce a framework that automatically transforms existing sequential code into parallel code while ensuring functional correctness, using the divide-and-conquer paradigm, so that the benefits offered by multi-core systems can be maximized. The paper outlines the implementation of the framework and demonstrates its usage with practical use cases. [A before-and-after sketch of such a rewrite follows.]
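A minimal before-and-after sketch of the kind of rewrite described, assuming the common case of an independent lapply loop; the framework's actual transformation rules are not shown here, and the workload is a stand-in.

    library(parallel)

    slow_model <- function(i) { Sys.sleep(0.01); i^2 }  # stand-in workload

    # Sequential original: occupies a single core
    res_seq <- lapply(1:200, slow_model)

    # Parallel rewrite: same results using all cores
    cl <- makeCluster(detectCores())
    clusterExport(cl, "slow_model")
    res_par <- parLapply(cl, 1:200, slow_model)
    stopCluster(cl)

    stopifnot(identical(res_seq, res_par))  # outputs match the sequential run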


2018:
  • R. Subramanian & H. Zhang. Performance Analysis of Divide-and-Conquer Strategies for Large-Scale Simulations in R. 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 4261-4267.
  • Abstract: As the volume of data and the technical complexity of large-scale analysis increase, many domain experts want powerful computation behind a familiar analysis interface so that they can participate fully in the analysis workflow, focusing on individual datasets and leaving the large-scale computation to the system. Toward this goal, we investigate and benchmark a family of divide-and-conquer strategies that help domain experts perform large-scale simulations by scaling up analysis code written in R, the most popular language for data science and interactive analysis. Our implementation uses R as both the analysis and computing language, allowing advanced users to provide custom R scripts and variables that are fully embedded into the large-scale analysis workflow. The workflow divides large-scale simulation tasks and conquers them with Slurm array jobs and R: simulations and final aggregations are scheduled as array jobs that run in parallel to accelerate the knowledge discovery process. The objective is a new analytics workflow for large-scale analysis loops in which expert users need only focus on the divide-and-conquer tasks that require their domain knowledge. [A sketch of a per-task worker in such a workflow follows.]
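For concreteness, here is a hypothetical per-task R worker for a Slurm-array workflow of the kind described: each array task runs one slice of the simulation and writes its own output, and a final aggregation step combines the files. File names, sizes, and the sbatch invocation are illustrative assumptions.

    # worker.R -- one slice of the simulation per Slurm array task.
    # Submitted e.g. as: sbatch --array=1-50 --wrap="Rscript worker.R"

    # Slurm sets SLURM_ARRAY_TASK_ID for each element of an array job
    task_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID", unset = "1"))

    set.seed(task_id)                 # reproducible per-task random stream
    n_reps <- 1000                    # simulations handled by this task
    sims <- replicate(n_reps, mean(rnorm(100)))

    # Each task writes its own result file for the aggregation job
    saveRDS(sims, sprintf("sims_task_%03d.rds", task_id))

    # Aggregation step (run once, after the whole array finishes):
    # files <- list.files(pattern = "^sims_task_.*\\.rds$")
    # all_sims <- unlist(lapply(files, readRDS))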

Principal Investigator:

PhD Students:

    Ranjini Subramanian, Summer 2018 - present

    PhD Topic: Visual Analysis and Machine Learning for Sequential Pattern Mining from Multimodal Data Sets

    This PhD dissertation work will focus on the extraction of statistically reliable events and their associations (i.e., sequential patterns) from large-scale time-series data, using information visualization interfaces, sequential pattern mining algorithms, and scalable computing techniques.

    The deliverables of this PhD research will include reports and research articles that document the research outcomes, as well as software components and integrated interfaces that plot time-series data and highlight frequent events and phenomena, together with quantitative methods that extract the temporal properties of these events. To address usability and scalability, the framework will support the use of high-performance computing resources to run the proposed visual and computational methods at scale. (A toy sketch of the pattern-extraction step follows.)
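    As a toy illustration of the pattern-extraction step (not the dissertation's algorithms), the sketch below counts event pairs that co-occur within a time window and keeps those above a support threshold; names and thresholds are illustrative.

        # Find event pairs (A -> B) that occur within a time window more
        # often than a minimum support threshold.
        mine_pairs <- function(evlog, window = 5, min_support = 3) {
          evlog <- evlog[order(evlog$time), ]
          pairs <- character(0)
          for (i in seq_len(nrow(evlog) - 1)) {
            js <- which(evlog$time > evlog$time[i] &
                        evlog$time <= evlog$time[i] + window)
            pairs <- c(pairs, paste(evlog$event[i], "->", evlog$event[js]))
          }
          counts <- sort(table(pairs), decreasing = TRUE)
          counts[counts >= min_support]
        }

        # Usage on a toy event log
        evlog <- data.frame(time  = c(1, 2, 3, 8, 9, 10, 15, 16),
                            event = c("A", "B", "A", "A", "B", "C", "A", "B"))
        mine_pairs(evlog, window = 2, min_support = 2)  # "A -> B" occurs 3 times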

Lab RET/EOT Events:

Last updated on March 15, 2019.