Challenges in IT Operations Management at a German University Chair -- Ten Years in Retrospect
Martin Geier
Chair of Real-Time Computer Systems
Technical University of Munich [email protected]
Samarjit Chakraborty
Chair of Real-Time Computer Systems
Technical University of Munich [email protected]
ABSTRACT
Over the last two decades, the majority of German universities adopted various characteristics of the prevailing North-American academic system, resulting in significant changes in several key areas that include, e.g., both teaching and research. The universities' internal organizational structures, however, still follow a traditional, decentralized scheme implementing an additional organizational level – the Chair – effectively a "mini department" with dedicated staff, budget and infrastructure. Although the Technical University of Munich (TUM) has been establishing a more centralized scheme for many administrative tasks over the past decade, the transition from its distributed to a centralized information technology (IT) administration and infrastructure is still an ongoing process. In case of the authors' chair, this migration so far included handing over all network-related operations to the joint compute center, consolidating the Chair's legacy server system in terms of both hardware architectures and operating systems and, lately, moving selected services to replacements operated by Department or University. With requirements, individuals and organizations constantly shifting, this process, however, is neither close to completion nor particularly unique to TUM. In this paper, we will thus share our experiences w.r.t. this IT migration as we believe both that many of the other German universities might be facing similar challenges and that, in the future, North-American universities – currently not implementing the chair layer and instead relying on a centralized IT infrastructure – could need a more decentralized solution. Hoping that both benefit from this journey, we thus present the design, commissioning and evolution of our infrastructure.
With information technology (IT) pervading nearly all aspects of today's university life for both students and employees, IT operations management teams face an ever increasing number of challenges to ensure availability, security, applicability and usability of required and offered tools. In case of research, this includes services for knowledge dissemination and information sharing, computing and storage, provisioning of common software packages and – as in all other cases – user support.
Teaching also relies on various IT-driven workflows for student lifecycle management and exam handling. Depending on the university and the field of study, an increasing number of lab courses also heavily rely on specialized IT infrastructure. Lastly, administration such as human resources, accounting and facilities also depends on IT – ranging from standard enterprise resource planning solutions to custom special tools.

Comparing the internal organizational structures of North-American universities with those found in Germany, one key difference stems from the additional organizational layer that German universities utilize – the chair, which is often also referred to as an institute. A collection of such chairs constitutes a department (such as that of Electrical or Mechanical Engineering). The chairs could be viewed as "mini departments" with their own administrative and IT staff, budget, and IT infrastructure. This results in a lot of flexibility, which is also necessary for the many practical laboratories offered in German universities, but comes with considerable overhead.

Over the past decade, the Technical University of Munich (TUM) has been establishing a more centralized scheme for many administrative tasks, ranging from non-technical (e.g., project management, financial and human resources) to the various technical areas of responsibility. Breaking with the traditional, decentralized scheme of chairs maintaining their own administration and IT infrastructure, TUM's effort has been to reduce overhead, cut down on duplication and move the freed-up funding from non-academic or administrative positions to increasing the number of academic positions (such as Assistant Professors). Towards this, TUM has in particular been pushing heavily towards more centralized services and IT infrastructures. However, the necessary centralized alternatives require time to be set up and cannot provide the flexibility the chairs have traditionally been used to. On one hand, such a transition is thus associated with a significant number of non-trivial challenges. On the other hand, it might become difficult, if not impossible, to move labs and projects of many German universities to the envisioned centralized infrastructure common in North-American universities. Hence, we believe that this situation requires considerable planning and introspection, and a realization of the trade-offs that are involved in centralized versus decentralized
IT infrastructures in German universities, especially taking into account the kind of hands-on lectures and laboratories that are offered.

Goals of this paper: TUM's efforts are not unique in Germany and many other local universities are following the same track. In this paper, we outline our experiences with the (continuing) transition process from a chair-oriented to a more centralized administration, particularly focusing on IT services and infrastructure. First, other German universities that face similar challenges as we do might benefit from our experiences and perceived challenges. Second, we hope to get feedback from our North-American counterparts who have extensive experience with centralized IT operations at universities. Third, the worldwide trend in teaching and learning has been steadily shifting from traditional classroom-oriented lecturing to self-learning using online courses and MOOCs (Massive Open Online Courses). In order to adapt to this growing trend, it is becoming important to focus more on labs and hands-on projects that might help students to better "digest" their newfound and self-acquired knowledge. Further, online courses cannot replace the value of hands-on experiments and projects that require physical, electrical and software infrastructure. Hence, providing them will also help the universities to retain their value, in addition to meaningfully supplementing what students can learn online on their own. Towards this, providing suitable IT support – going far beyond web browsing, emails and backed-up storage – that might be necessary for these labs and hands-on projects is of paramount importance. We believe that, in the future, a suitable IT setup might lie somewhere in between the traditional chair-oriented decentralized system in Germany and a centralized North-American approach. Hence, our experiences outlined in this paper might also benefit IT administrators and planners from American universities. To characterize the various, often lab-related peculiarities and requirements that drive IT operations at a German university chair, we present the – at the moment still mostly decentralized – IT system deployed at the Chair of Real-Time Computer Systems (RCS) and its design, introduction and ongoing evolution towards more centralized services over the last ten years in retrospect.

The remainder of this paper is organized as follows. At first, Sec. 2 introduces all entities involved in IT operations at chair-level – covering not only different types of staff members at RCS, but also external TUM units and the compute center. Sec. 3 reconstructs the initial state of the IT at the time both authors joined the RCS (approx. ten years ago) and motivates the derivation of requirements for a future IT infrastructure in Sec. 4. Based thereon, Sec. 5 presents the original design together with selected implementation details of the system as initially introduced in 2012. Sec. 6 not only summarizes our findings during both start-up and operation of our new infrastructure, but also covers the external developments and their impact on our local IT operations. Sec. 7 finally concludes this paper.
Due to the highly federated structure both within and outside the University, our local IT operations not only involve several people at the RCS itself, but also extend towards both various other organizational units within TUM and the joint compute center of Munich's public universities (as our highest-level IT, high-performance computing and internet service provider).
Each chair – implementing a "mini department" as introduced below in Sec. 2.2 – is headed by (at least) one professor with almost unrestricted control of scientific and administrative matters. In case of RCS, the second author joined TUM as a professor in 2009 and had to head the Chair without any prior experience in German universities. His lack of proficiency with a decentralized administration and, in particular, his implicit assumption that IT infrastructure and services should be the concern of the University and need not be managed by individual professors at chair-level, posed some initial challenges for the IT operations management at the Chair.

Most day-to-day research and teaching activities, however, are handled by the scientific staff comprising up to dozens of full-time research associates pursuing their PhD degrees. In contrast to other countries and – primarily – in engineering and computer science departments, they commonly enjoy full positions funded either from public sources (allocated to each chair) or by third parties such as, e.g., industry and (national or international) research foundations. This sound financial position of a research associate (RA), however, comes at the price of various responsibilities that – partially – depend on the source of funding. Generally, RAs are either committed to funded research projects or heavily involved in the chair's teaching activities (i.e., by giving tutorials for the professor's lectures or running entire labs) – or both. In addition, most RAs are responsible for some of the various administrative tasks covering HR, funding, IT operations and organization of teaching and project-related matters at chair-level. During his time as an RA, the first author, as an example, has been involved in one industry- and several agency-funded research projects, designed two new lab courses whilst also in charge of one external lecture and, at times, another lab. The single most time-consuming assignment, however, turned out to be taking over and maintaining IT operations of the Chair. With various – predominantly outdated – systems in existence at the time both authors joined RCS, a smooth transition to an up-to-date infrastructure was imperative to not only reliably, but also securely continue research and teaching activities.

In case of RCS in 2018/19, one professor, ten RAs and three external guests are teaching a total of eight lectures (partially including a tutorial), five laboratory courses and seminars. They are supported by four technical and non-technical staff members in charge of purchasing, IT, electronics workshop, secretary's office and finances – most of them, however, being assigned to part-time positions only. Regularly serving as a gateway between the chair's researchers and the various organizational units within and outside TUM, they perform a vital interface function whilst maintaining a lot of flexibility regarding administrative and technical aspects at chair-level.

External to the chairs, numerous mostly non-scientific staff members comprise central services (e.g., library, IT, language and international centers) plus functional and administrative units such as HR, financial, controlling, facilities and legal. In total, TUM currently has over 10,000 employees with approx. two thirds in scientific and a third in remaining positions [3].
From an administrative perspective, IT operations at chair-level require coordination of and contributions from technical staff across multiple organizational units as some services – per administrative decision or technical necessity – are exclusively handled by one single unit. This, e.g., holds true for the various essential network services made available to Munich's universities by the Leibniz Supercomputing Centre (LRZ) [2], which serves as a joint compute center and gateway to the German National Research and Education Network (DFN) [6]. Effectively both acting as an Internet Service Provider (ISP) that also maintains an IP backbone for over 180,000 devices and operating various IT services in addition to High-Performance Computing (HPC) systems, the LRZ provides the foundation for most of the IT in research facilities in and around Munich. The LRZ's services relevant for university, faculties and chairs today extend far beyond networking (i.e., switch management, routing, upstream IP connectivity and basic services including DNS and DHCP). Additionally, the compute center not only operates both global end-user services (such as Wi-Fi, VPN or video conferencing) and per-client – i.e., TUM-only – services (e.g., campus management system, directory services or wikis), but also offers backup, storage and file sharing in addition to virtualized firewalling and compute nodes on a project basis.
The University itself today also manages a vast number of services within its various internal organizational units. As a part of TUM's corporate IT systems and services, for instance, the central information technology unit takes care of facilities such as various web-based portals and managed workstations. Additionally, it maintains an (SAP-driven) enterprise resource planning solution and the central campus management system primarily covering student-, teaching- and resource-related matters – with the latter based on CAMPUSonline, a solution developed at TU Graz, which has also been introduced by various other universities in both Austria and Germany [1]. More specialized services are provided by dedicated teaching and library units and include not only e-learning platforms and document/website support, but also (internal and public) repositories and e-access systems for scientific data exchange.
The Department of Electrical and Computer Engineering (which RCS is part of) complements selected services offered by neither LRZ nor TUM – with some now also used by other departments. Apart from student-only IT facilities such as the faculty's roughly 100 Linux workstations (operated together with one of its chairs) and a course scheduler, the Department provides not only a web-based management tool for additional administrative (e.g., examination-related) workflows, but also services essential for the (often predominantly) Linux-driven infrastructure used at its nearly 30 chairs. Today, this includes both NFS4 storage servers and a Puppet-based configuration management system crucial for a wide, consistent provisioning of Linux servers and clients deployed by the faculty and its chairs.
In case of RCS, the Chair itself has a long legacy regarding local IT systems and operations. Active in the area of process control computing since 1972 and renamed as "Chair of Real-Time Computer Systems" in 1999, the RCS has not only used, but also researched numerous computer architectures running various operating systems in both IT and real-time contexts. Although each chair, department and university has its own history w.r.t. IT infrastructure, the authors hope to use RCS and its IT as a meaningful reference case in the following sections.
This section introduces both the IT infrastructure of 2010 and some of the challenges the first author faced in maintaining operations.
At the time both authors joined the RCS, a significant number of network components and their operations had already been successfully transferred to the LRZ. This includes a structured cabling infrastructure with central switches (providing up to six Gigabit Ethernet ports to each office seating two RAs) and the Domain Name System (DNS) servers for the – externally visible – internet domains of the chair. Apart from a dedicated project VLAN (Virtual LAN) already interfaced to a "virtual firewall" instance provided by the LRZ's Cisco FWSM blades, however, all other lower-level network services such as internal firewalling, NAT (Network Address Translation), DNS and DHCP (Dynamic Host Configuration Protocol) were mapped to own hosts. In case of the firewall implementing an iptables-based packet filter between the external upstream (via LRZ) and the chair's internal network, no failovers were available.

On the server side, a large variety of hardware architectures and operating systems were used. Although the majority of services were mapped to Intel/AMD-based systems running Linux and Windows, various non-x86 servers (such as Alpha-, MIPS- and Sparc-based machines with their respective flavors of UNIX) were an integral part of the system, e.g., providing additional disk space via NFS (Network File System). Again for historical reasons, the majority of servers, local switches and the – one or other – UPS (Uninterruptible Power Supply) did not follow the standard 19-inch, rack-mount form factor. Instead, a multitude of desktop chassis were distributed across the server room's tables – with a variety of cables underneath.

The client systems were – and still are – a combination of Intel/AMD-based desktops and notebooks used for general-purpose computing and, due to the Chair's research on real-time systems, various (mostly PowerPC- and ARM-based) embedded systems running specialized operating systems such as eCos, FreeRTOS and Real-Time Linux. In both cases, the individual RAs have full administrative access to maintain and adapt the particular system to their needs – which regularly resulted in the setup of server software to compensate for a lack of centrally offered solutions. Some of these services were even permitted through the firewall, e.g., to make them accessible for students connecting from home or via the LRZ-operated Wi-Fi directly. Besides that, a significant number of client systems still used static IP and DNS configurations instead of relying on DHCP.

In 2010, the RCS' internal network thus consisted of ten non-x86 and 15 Intel/AMD-based servers (with four running Windows), ten printers and approx. 150 clients. Old databases report nearly 1100 (mostly inactive) users in over 200 groups.

Traditionally, several RAs plus one member of technical staff were handling IT operations at RCS. When the authors joined the Chair, however, the number of RAs still contributing had already been reduced to one.
With said RA leaving RCS less than six months later, the first author quickly became the primary person in charge of maintaining the operation of the existing system, handling the pending migration and supporting users.

The hardware's average age and variety resulted not only in an increasing number of wear-out failures, but also in additional effort to understand – and, at least temporarily, resolve – both various quirks and the current outage on each server platform. Similar to hard drive, memory and fan problems, a number of PSU (power supply unit) failures were also difficult to remedy due to missing spare parts on site or the general unavailability of suitable replacements. The first author remembers several cases of planned and unplanned power cuts that resulted in more than one server requiring a new PSU and – in rare cases only – even new hard drives with a subsequent data restore. A single district-wide blackout revealed that five power circuits were not balanced properly, causing blown fuses at power-on. To make matters worse, only three servers were connected to UPSs initially – leaving the remaining majority unprotected.

From a software perspective, keeping the IT infrastructure running required a steep learning curve – not only regarding regular (i.e., unchanged) operation, but also for more common administration tasks including user or host management. The variety of non-Intel/AMD hardware architectures implied a large number of Operating Systems (OSs) that needed special, dedicated knowledge – such as Tru64, RISC/os and Solaris. Such knowledge not only was needed for operations of a single host (e.g., adding a replaced hard drive to its array), but also to cope with the historical, often unspoken – and sometimes bizarre – dependencies between servers and services. The first author remembers realizing that running Matlab on the Intel/AMD Linux hosts required one of the Solaris servers to be operational as it provided the required disk space via NFS. Other challenges were an rsync job partially synchronizing the configuration of some Linux servers – occasionally overwriting local changes made by those unaware – and a stale, live copy of the Chair's primary DNS zone on a server in another state. Day-to-day administrative workflows often required multiple manual changes in tools or files on more than one host. This, e.g., held true for the management of users and groups, which required registration on both a Linux-based YP/NIS (Yellow Pages or Network Information Service) server and a Windows NT domain controller. Similarly, new disk space was manually allocated, formatted and exported on the (NFS/CIFS) server and – on two other hosts – added to automounter tables and import scripts.
Internal DNS and DHCP services, however, were centrally provisioned from a single, custom database – a fact that not only enabled redundancy using multiple servers, but also simplified the migration (to an even more central source); a minimal sketch of this approach is given below.

This combination of hardware- and software-related issues made maintaining operations challenging – not helped by the fact that most hardware (including the firewall or file servers) and services (such as email) neither had failover solutions on standby nor were monitored methodically or comprehensively. Hardware defects thus often were detected rather late – and in need of immediate attention, which often required dedication far beyond normal working hours – similar to the case of the (albeit rare) power cuts during or shortly before a weekend. Temporarily transplanting crucial existing servers to newer hardware was considered during the migration – but never actually implemented due to various incompatibilities between installed software and available replacement hardware such as, e.g., missing device drivers for storage or network controllers.

The security of the old system was questionable, too – not only due to the often outdated/unmaintained server software still in use, but also because several services were exposed to the public internet. On rare occasions, old server daemons even inhibited installing updated client software, as in the case of a new release of Adobe's Acrobat Reader and a rusty version of the Samba server interfering due to a certain CIFS feature. Furthermore, the centrally operated user workstations relied on a KNOPPIX-based live system that, due to its dependency on the testing and unstable repositories of Debian, could not be updated over longer periods – resulting in outdated clients.

From a user perspective, however, only one of the above issues was directly visible and regularly addressed – the often limited availability of the IT system or some of its services. This also included two Windows clients used by the non-technical staff members of the Chair. Even though one required a (relatively) time-consuming setup to access TUM's SAP, neither backups nor replacement systems were at hand – causing an occasional flurry in case of failure. A further hindrance was the historical setup for email services that combined a local IMAP (Internet Message Access Protocol) server to retrieve or store messages with the LRZ's SMTP (Simple Mail Transfer Protocol) service for transmission of outgoing email. The former relied on mbox-based storage for each individual folder – resulting in massive performance penalties when accessing mailboxes larger than the server's file system buffers, an effect particularly noticeable when moving emails between folders. The latter was reachable only from the LRZ's own networks or by using a VPN (Virtual Private Network) – causing additional discomfort for multiple smartphone users as the required VPN client was not available on all platforms. Additionally, neither SVN (Subversion) for revision control nor wikis to cooperate with external partners were provided – regularly complicating research and teaching activities.
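To illustrate the kind of central provisioning mentioned at the beginning of this discussion – one host database from which both DNS and DHCP configuration are generated – the following is a minimal sketch in Python. The host table, file names and ISC-dhcpd/BIND output formats are assumptions for illustration only; the Chair's actual database schema and generator are not described here.

#!/usr/bin/env python3
# Minimal sketch: derive DHCP host declarations and DNS A records from one
# central host table. Table contents, file names and zone layout are
# hypothetical examples, not the Chair's actual database.

HOSTS = [
    # hostname, MAC address, IPv4 address (example data only)
    ("lab-pc01", "52:54:00:12:34:01", "10.0.1.101"),
    ("lab-pc02", "52:54:00:12:34:02", "10.0.1.102"),
]

def dhcpd_conf(hosts) -> str:
    # ISC dhcpd "host" blocks with fixed addresses.
    blocks = []
    for name, mac, ip in hosts:
        blocks.append(
            f"host {name} {{\n"
            f"  hardware ethernet {mac};\n"
            f"  fixed-address {ip};\n"
            f"}}\n"
        )
    return "\n".join(blocks)

def zone_records(hosts) -> str:
    # Matching BIND-style A records for the internal zone.
    return "\n".join(f"{name} IN A {ip}" for name, _, ip in hosts) + "\n"

if __name__ == "__main__":
    with open("dhcpd-hosts.conf", "w") as f:
        f.write(dhcpd_conf(HOSTS))
    with open("int.zone.hosts", "w") as f:
        f.write(zone_records(HOSTS))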
Several hardware-centric lab courses also relied on custom, non-central solutions for storage and computing that greatly varied in terms of reliability, proficiency and security. Most labs suffered from (undocumented) tweaks of file system permissions, often far beyond the "usual mishmash" resulting from Unix' and Windows' incompatible semantics, whilst two even depended on their own, again outdated NFS server plus a custom kernel module to control the Motorola-based boards using a pre-JTAG (Joint Test Action Group) interface. Lastly, one newer lab relied on a complex, distributed runtime driven by a custom, camera-based tracking system running on its own server, again interfacing central servers and user workstations.

With the IT system's availability, security and maintainability severely degraded, the following requirements for an updated, hopefully sustainable infrastructure were identified mid-2010.

Own hardware (if not avoided completely) should be set up both in a structured fashion – using, e.g., 19-inch rack-mount power distribution units, UPSs, switches and servers – and such that redundancy is achieved for each type of component, whilst keeping not only their total number but also the variety of models as low as possible. Based on an up-to-date, common hardware platform with a current choice of operating systems, a more reliable and secure IT should also simplify operations, e.g., by increasing the availability of file servers and firewall.

To improve security on a network level, services should not only be kept up-to-date (e.g., using software with dependable migration policies), but also be split into publicly and (only) internally exposed groups to reduce the potential impact. In addition to assigning those to separate networks, only secured protocols should be used. The Chair's internal network should eventually only contain various clients and non-public servers.

Both software environments such as operating systems and each individual service implementation should be as hardware-independent as possible to enable or at least simplify recovery, migration and upgrades. Even if all components are (initially) purchased in pairs to achieve redundancy, a (future) combined lack of spare parts and increase of wear-out failures would result in a situation as in 2010 and would greatly benefit from an improved hardware-software independence of such a new infrastructure.

To further improve both availability and maintainability, a combination of hardware and service monitoring with beyond-host configuration traceability will help mitigate the impact of hardware failures or human error. A single, or even multiple, but yet central sources of dynamic (e.g., host-, authentication- and storage-relevant) and static configuration should not only simplify (automated) system monitoring and tracing, but also reduce the number of entry points necessary for day-to-day administrative workflows. The introduction of standard tools and documented operating procedures to consistently manage the configuration should lower the barriers for additional RAs to contribute and take over – with the central documentation repository supporting functional printouts for severe outages.

For users, various services and features should be provided securely, reliably and efficiently.
This not only includes email (with support for smartphones and large, i.e., up to 10 GByte, mailboxes of some RCS members), but also globally reachable SVN and wiki services (with support for guest accounts), file and compute servers, centrally operated user workstations (as our unified and up-to-date solution for research and teaching), redundant clients for the non-technical staff members (mainly for office tools and SAP) and comprehensive documentation. The system should also use state-of-the-art security measures for sensitive (in particular personal and teaching-related) data and provide unified ACL (Access-Control List) templates that ensure sane file system permissions on project, lecture and lab volumes. The latter should further benefit from configuration templates interfacing own servers to the central infrastructure, which do not require non-standard (e.g., root-only) methods for common RA activities such as account resets and template deployment. Selected, centrally maintained software packages could reduce setup and storage overhead for labs and research.

Figure 1: Networks, Servers, Firewall and VPN-Gate – the LRZ-hosted firewall (Cisco FWSM or pfSense) connects the internal network (int.rcs.ei.tum.de), the server DMZ (rcs.ei.tum.de) and the project DMZ (pdmz.rcs.ei.tum.de) to LRZ, TUM and the internet; not shown are power distribution units, UPSs and (console) switch, a secondary switch for UPS failsafe links and projects, and internal VMs (e.g., licences, databases and intranet).
The final hardware and software components were chosen and designed to incorporate redundancy, secure network protocols and configuration traceability – throughout the entire system.

The hardware was dramatically reduced to four new servers located in two 19-inch racks (conveniently donated by another chair) and complemented by three dedicated power circuits with corresponding distribution units and per-rack UPSs. All servers feature redundant PSUs (on two independent circuits) and memory with support for Error-Correcting Code (ECC). Each UPS is monitored by one server, which distributes status information to other hosts over redundant network paths to ensure a clean shutdown of all systems in case of power cuts.

The network architecture was modified to not only provide one additional VLAN for globally reachable services, but also exclusively utilize a Cisco FWSM firewall offered by the LRZ. Thus interfacing not only the internal RCS network, but also a server and a project DMZ (demilitarized zone) to the public internet, this firewall solution improves both availability (due to redundant hardware at the LRZ) and security (as the internal network is no longer reachable from outside). This topology is shown in Fig. 1 and also reflected by separate DNS zones.

On the software side, all four servers use Ubuntu Server as base OS. Two identical quad-core Opterons with 8 GBytes of memory and 16-port RAID (Redundant Array of Independent Disks) controllers each are used as file servers (file1/2), whilst the other two feature two six-core Opteron CPUs, 32 GBytes of memory and 4-port RAID controllers each and serve as the virtualization hosts (virt1/2) for all services – except storage. We heavily utilize KVM (Kernel-based Virtual Machine), a hypervisor in current Linux kernels, to instantiate a dedicated, hardware-independent VM (Virtual Machine) per "group" of services. Individual VMs can be set up (using templates based on Ubuntu Server) and restored (from an rsync-based backup) within a few minutes. This not only makes the complex service VMs independent of the underlying hardware (due to KVM's generic interface), but also enables a fast migration – or even failover – in case of failure on one of the virtualization hosts.
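The per-group VM approach can be sketched with the libvirt Python bindings as follows; the template path, placeholder markers and VM name are hypothetical, and the Chair's actual setup scripts (Ubuntu Server templates plus rsync-based restores) are only described in prose above.

#!/usr/bin/env python3
# Minimal sketch: define and start a service VM from a libvirt XML template.
# Assumes libvirt-python and a prepared disk image; all names and paths are
# hypothetical examples, not the Chair's actual tooling.
import libvirt

TEMPLATE_XML = "/srv/vm-templates/ubuntu-server.xml"  # hypothetical template
VM_NAME = "svc-projects"                               # e.g., one VM per service group
DISK_IMAGE = "/var/lib/libvirt/images/svc-projects.qcow2"

def define_and_start(name: str, disk: str) -> None:
    # Fill the placeholders in the template with this VM's name and disk image.
    with open(TEMPLATE_XML) as f:
        xml = f.read().replace("@NAME@", name).replace("@DISK@", disk)

    conn = libvirt.open("qemu:///system")  # connect to the local KVM hypervisor
    try:
        dom = conn.defineXML(xml)          # make the domain persistent
        dom.create()                       # boot it
        print(f"{name} defined and started (id {dom.ID()})")
    finally:
        conn.close()

if __name__ == "__main__":
    define_and_start(VM_NAME, DISK_IMAGE)

The same effect could be achieved with virsh define/start; the point is merely that a service VM is fully described by its XML definition plus a disk image, which is what makes restores and migrations between virt1/2 fast.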
Three VMs implement a redundant DNS, DHCP and IAA (Identification, Authentication and Authorization) subsystem, which relies on a central LDAP (Lightweight Directory Access Protocol) directory for the storage of host-, user- (including most passwords and email setup), group- and storage-related data. Most information is managed using a web interface (originally developed by the City of Munich [5]) only – whilst hosts (also) and automounter tables (exclusively) are managed via custom command line tools. Static host configuration data is centrally managed in Puppet, a configuration management tool using agents to ensure that all managed nodes and a central master are synchronized at all times. With its class-based language, we implement a variety of host templates for, e.g., file servers, virtualization hosts, IAA VMs (admin and auth1/2, as shown in Fig. 2, bottom left), service VMs with and without IAA, our "basic Linux network client" (with complete IAA and storage services) and a reduced version of the latter for (RA-operated) lab and project servers that still use central IAA and storage.

User IAA relies on multiple password hashes and Kerberos principals jointly stored in LDAP and integrated client-side using PAM (Pluggable Authentication Module) and GSSAPI (Generic Security Services Application Program Interface) on Linux, CIFS (Common Internet File System) with traditional Windows NT-like logons and, in both cases, Kerberos. Storage is provided using NFS4 (Network File System version 4) with Kerberos and password-based CIFS. This entire subsystem is centrally managed from a – single – configuration file, which not only configures file servers and clients as needed, but also controls on- and off-site backups of both user data and VMs. Servers, VMs and services are monitored via Munin, whilst configuration is traced both locally (etckeeper and listchanges) and globally (using SVN repositories for LDAP and Puppet). Most network protocols are secured either internally and by design (as with Kerberos) or configured to enforce TLS (Transport Layer Security) – with the only notable exception being CIFS. A VPN gateway enables staff members to use selected services even when not connected to the internal network – e.g., when using the LRZ's Wi-Fi on notebooks or working outside RCS.

Additional user services include comprehensive email with fast, maildir-based IMAP, authenticated SMTP (with a second password) and sieve filtering. A projects VM provides global SVN and wiki services with fixed, random passwords, whilst an Ubuntu-based diskless image drives the central user workstations, also used to access the Terminal Server of the non-technical staff. The entire architecture of the Chair's new IT infrastructure is shown in Fig. 2, with VM and service details in the appendix. (The well-disposed reader might recognize the irony of documenting a mostly Linux-based IT infrastructure using Microsoft tools – the first author had not yet learned TikZ back then and thus resorted to Visio.)
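As an illustration of the Munin-based monitoring mentioned above, the following is a minimal sketch of a custom Munin plugin in Python; the monitored quantity (established IMAP connections) and all names are hypothetical examples rather than a plugin actually deployed at RCS.

#!/usr/bin/env python3
# Minimal sketch of a custom Munin plugin (hypothetical example): reports the
# number of established IMAP connections on the local mail VM. Munin calls the
# plugin with "config" once to learn the graph layout and otherwise expects
# "<field>.value <number>" on stdout.
import subprocess
import sys

def count_imap_connections() -> int:
    # Count established TCP connections on port 143 via "ss" (assumes iproute2).
    out = subprocess.run(
        ["ss", "-tn", "state", "established", "( dport = :143 or sport = :143 )"],
        capture_output=True, text=True, check=True,
    ).stdout
    return max(len(out.splitlines()) - 1, 0)  # first line is the header

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        print("graph_title IMAP connections")
        print("graph_vlabel connections")
        print("graph_category mail")
        print("imap.label established connections")
    else:
        print(f"imap.value {count_imap_connections()}")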
The Department’s file servers now offerNFS based on TUM’s central IAA services, whilst the LRZ’sgitlab service will eventually replace our SVN/wiki solution.
In this paper, we introduced a key difference in the organizational structures of German universities, which has resulted in rather decentralized IT operations at many chairs. We presented the history, analysis and redesign of our Chair's infrastructure to share our findings – in particular those related to the various labs, which require specialized IT infrastructure. With future, less (de?)central solutions in sight, IT operations remain exciting.
REFERENCES
[1] Sarah Grzemski and Bernd Decker. 2018. Challenges of the Change of Decentralized Support Structures in Combination with Digitization Processes in the Student Life Cycle: RWTHonline, the New Campus Management System of RWTH Aachen University.
[2] Leibniz Supercomputing Centre (LRZ). https://www.lrz.de
[3] Technical University of Munich. Facts and Figures. https://www.tum.de (→ Our University → Facts and figures)
[4] Shawn Plummer and David Warden. 2016. Puppet: Introduction, Implementation, & the Inevitable Refactoring. ACM, New York, NY, USA. https://doi.org/10.1145/2974927.2974950
[5] Wikipedia contributors. 2019. LiMux – Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/LiMux (primary sources for LiMux and GOsa²)
[6] DFN – German National Research and Education Network (Deutsches Forschungsnetz). https://www.dfn.de

Figure 2: RCS Infrastructure – Physical Servers (bottom), Service VMs (light green) and User Frontend (top). The diagram details the file servers (file1 & file2), the virtualization hosts (virt1 & virt2) with their service VMs (e.g., mail/smtp, mail/imap, ssh, projects, compute, auth1/2, admin, print, services, licences and vpngate), the authentication subsystem (LDAP directory, Kerberos KDCs, DNS/DHCP servers, Windows DCs) and the user frontend (diskless Linux network clients and Windows Terminal Servers); created by Martin Geier in Visio, last updated 15.05.2012.
BACKEND HOSTS AND SERVICE VMs
Whilst all physical servers use NTP (Network Time Protocol) daemons for time synchronization, NUT (Network UPS Tools) servers are only required on virt1/2 and forward UPS status to NUT clients on file1/2 and user VMs. Puppet and Munin node agents are installed on all physical servers and VMs – with the same holding true for a minimal MTA (Mail Transfer Agent). A Munin master (admin) captures physical (e.g., temperature, disk or RAID status) and logical (e.g., load or volume usage) samples and sends an email notification if limits are violated.

Both file servers – like many VMs – use NSS (Name Service Switch) to access user and group information in LDAP, whilst GSSAPI enables NFS4 authentication using Kerberos. Linux clients may choose between CIFS (implemented by the Samba server) and NFS4 for storage – Windows ones only the former. Disk space is organized using volumes, i.e., as individual ext3 file systems above RAID and LVM (Logical Volume Manager) exported via CIFS and kernel-based NFS. Client-side imports rely on a Windows DFS (Distributed File System) entry point on the services VM and automounter tables in LDAP, whilst on- and off-site backups are implemented with rsync (between file1/2) and the LRZ's Tivoli. A variety of POSIX ACL templates and online mapping from CIFS/NFS4 to POSIX ACLs ensure a – relatively – consistent view and control of file permissions.
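A minimal sketch of how such an ACL template might be applied to a lab volume is given below, using setfacl via Python's subprocess module; the group names, path and exact entry set are hypothetical examples, not the Chair's actual templates.

#!/usr/bin/env python3
# Minimal sketch: apply a POSIX ACL "template" to a lab volume with setfacl.
# Group names, path and entries are hypothetical examples, not RCS' templates.
import subprocess

def apply_lab_acl(path: str, tutors: str, students: str) -> None:
    # Access ACLs for existing files plus matching default ACLs for new ones;
    # capital 'X' grants execute only on directories (and already-executable files).
    entries = ",".join([
        f"g:{tutors}:rwX",       # tutors: full access
        f"g:{students}:rX",      # students: read-only
        f"d:g:{tutors}:rwX",     # default entries inherited by new files/dirs
        f"d:g:{students}:rX",
    ])
    subprocess.run(["setfacl", "-R", "-m", entries, path], check=True)

if __name__ == "__main__":
    apply_lab_acl("/srv/labs/lab1", "lab1-tutors", "lab1-students")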
The LDAP directory is managed with scripts and GOsa² [5], also storing Kerberos user principals with multiple passwords.

FRONTEND SERVICES AND CLIENTS