Subject to the approval of funding; part-time employment may be possible.
Designing a reliable system for cloud maintenance and ensuring resilience (anomaly detection, root cause analysis and recovery) is challenging due to the complex nature of such systems. Resilience, also referred to as fault tolerance, is defined as the ability of a cloud platform to recover quickly and continue operating even after a failure. Efforts are being made to implement predictive approaches that can remediate system parts exhibiting anomalies that have not yet evolved into system failures. Recently, AIOps has emerged as a category that combines system operation with artificial intelligence (AI) methods. It automates and enhances IT operations by using machine learning and AI to analyze data streams collected from various IT monitoring tools and devices. This makes it possible to react to issues in real time.
The aim of the project is to research and develop AIOps methods for continuous data streams. One challenge is to create solutions that are general enough to operate across different systems. We focus on the following topics: log and metric data embedding, learning joint representations for multiple systems, anomaly detection, and explaining anomalies. Each of these will entail designing a general method, implementing a prototype in the context of existing open source projects, and experimentally evaluating the prototype on simulated and production data.
Successfully completed university degree (Master, Diplom or equivalent) in Computer Science, with specialization in distributed systems. Completed courses in artificial intelligence and data science. Preferably experience in the field of AIOps.
Interest in the development and operation of large-scale software architectures, as well as enthusiasm for putting recent research results into practice.
Experience with statistical software (Python, NumPy);
Experience with databases and query languages (SQL, MongoDB);
Experience with big data platforms (Hadoop, Spark);
Experience in working with cluster and cloud systems (OpenStack);
Experience in building and operating containers (Docker).
Practical experience in project management and agile development methodologies;
Practical experience in the conception and design of AI system solutions;
Familiarity with methods from the domain of real-time and stream data analysis;
Experience and interest in the topics of machine learning, representation learning, anomaly detection and time series classification;
Experience in working with explainable machine learning methodologies;
Experience with TensorFlow/PyTorch/Keras;
Experience in writing scientific papers.
Further requirements include team spirit and an excellent command of the German and English languages (spoken / written).
How to apply:
Please send your written application with the reference number and the usual documents to Technische Universität Berlin - Der Präsident - Fakultät IV, Institut für Telekommunikationssysteme, FG Komplexe und Verteilte IT-Systeme, Prof. Dr. Odej Kao, Sekr. TEL 12-5, Ernst-Reuter-Platz 7, 10587 Berlin or by e-mail to firstname.lastname@example.org.
To ensure equal opportunities between women and men, applications from women with the required qualifications are explicitly encouraged. Qualified individuals with disabilities will be given preference. TU Berlin values the diversity of its members and is committed to the goals of equal opportunities.
Please send copies only. Original documents will not be returned.