© 2020 Strange Loop
Analyzing genomic sequence data requires non-trivial computational workflows that can be highly parallel. Managing these workflows at scale is a significant challenge in terms of both performance and fault tolerance. A survey of available workflow management systems did not yield a candidate that met our needs. Therefore, we are developing a workflow system called PTero.
Workflow systems usually represent workflows as a set of simple tasks arranged in a directed acyclic graph (DAG). Because real-world applications often require choices to be made at run-time, those tasks become complicated finite state machines (FSMs). PTero uses Petri nets, which are capable of representing both DAGs and FSMs in one data structure. The system is able to handle very large workflows because it uses an amortized constant-time algorithm to determine whether the next step in the workflow can be executed.
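As an illustration of the idea, the following is a minimal sketch of a Petri net in Python. It is not PTero's implementation, and the abstract does not describe PTero's actual algorithm. The class `PetriNet`, its methods, and the place and transition names are all hypothetical. The sketch assumes unit arc weights and distinct input places per transition. It keeps a per-transition counter of empty input places, so asking whether a transition can fire is a single comparison, and the bookkeeping cost is spread over the token moves that change the counters.

```python
from collections import defaultdict


class PetriNet:
    """Hypothetical minimal Petri net (unit arc weights, illustrative only)."""

    def __init__(self):
        self.tokens = defaultdict(int)      # place -> number of tokens
        self.inputs = {}                    # transition -> input places
        self.outputs = {}                   # transition -> output places
        self.consumers = defaultdict(list)  # place -> transitions that read it
        self.missing = {}                   # transition -> count of empty inputs

    def add_transition(self, name, inputs, outputs):
        self.inputs[name] = list(inputs)
        self.outputs[name] = list(outputs)
        for place in self.inputs[name]:
            self.consumers[place].append(name)
        self.missing[name] = sum(1 for p in self.inputs[name] if self.tokens[p] == 0)

    def is_enabled(self, name):
        # Constant-time check: no scan over the input places is needed.
        return self.missing[name] == 0

    def put(self, place):
        """Add one token; return transitions that became enabled as a result."""
        self.tokens[place] += 1
        newly_enabled = []
        if self.tokens[place] == 1:  # place went from empty to non-empty
            for t in self.consumers[place]:
                self.missing[t] -= 1
                if self.missing[t] == 0:
                    newly_enabled.append(t)
        return newly_enabled

    def fire(self, name):
        """Consume one token per input, produce one per output."""
        if not self.is_enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        for place in self.inputs[name]:
            self.tokens[place] -= 1
            if self.tokens[place] == 0:  # place became empty again
                for t in self.consumers[place]:
                    self.missing[t] += 1
        newly_enabled = []
        for place in self.outputs[name]:
            newly_enabled.extend(self.put(place))
        return newly_enabled


# A fork-join workflow: split into two parallel tasks, then join.
net = PetriNet()
net.add_transition("split", ["start"], ["a", "b"])
net.add_transition("task_a", ["a"], ["a_done"])
net.add_transition("task_b", ["b"], ["b_done"])
net.add_transition("join", ["a_done", "b_done"], ["end"])
```

In this sketch, a DAG-style dependency is a transition with several input places (such as `join`), while an FSM-style run-time choice could be modeled as one place feeding two competing transitions, only one of which consumes the token.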
PTero consists of a set of RESTful services that work together to execute workflows but can also be used individually. Following the 12-factor recommendations allows us to scale horizontally and be resilient in the face of single-node failures. The service-oriented architecture also provides a clear extension path for adding execution schedulers such as SGE, AWS, or OpenStack.
Since the first version of PTero was deployed at the McDonnell Genome Institute at Washington University in the summer of 2013, it has orchestrated more than 600,000 workflows comprising a combined total of over 8 million tasks. Since then, development has focused on making the system more accessible to the community.
Michael Kiwala works on PTero at the McDonnell Genome Institute at Washington University. Michael has served at MGI as a software engineer, automating genomic data analysis and sharing the results with the scientific community since 2005.
David Morton is the Project Lead for PTero at the McDonnell Genome Institute at Washington University. David received his PhD from Washington University in Saint Louis in 2012, where he worked to automate the acquisition and analysis of data in a neurophysics lab. Since then, he has been a member of the development team working to create the infrastructure that enables large-scale genome sequencing at MGI.