
The Agile Infrastructure Project

Part 1: Configuration Management

Tim Bell, Gavin McCance


www.cern.ch/it
CERN IT Department, CH-1211 Genève 23, Switzerland

Configuration and Operations Tools


https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
https://agileinf.cern.ch/jira/

IT Technical Forum 27 Jan 2012

Project scope
The project is reviewing the entire CERN computer-centre management toolset
What happens from the bare metal up
Asset management, inventory
Sysadmin tools and maintenance workflows
Service management and configuration tools
Dynamic configuration for virtual hosts
Operations monitoring
Workflow automation and continuous deployment


Configuration and Operations Tools


Why?
Current production system built around the Quattor toolset is successfully managing 10k servers
(CERN) Quattor + many CERN components

Why are we changing the toolset?


What are the issues


Uncompressible technical debt
The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
Small community (less funding) and general support problem
At CERN, we've fallen into the "sticky hands" support model

We need better automation and integration between the sub-components


Lack of automated workflow: everything is a ticket
emailScript: your added value in the process is often your CERN password

The 15-min CDB commit walk: context-switch cost


What are the issues


Transferrable skills and training
Learning curve for our tools is steep and remains high

It's easier to hire people who have skills in a widely used tool than in your internal tools
Depending on where you look


Job adverts: indeed.com
Index of millions of worldwide job posts across thousands of job sites
[Figure: job posts for Puppet compared with Quattor]

These are the sort of posts our departing staff will be applying for.


Integration is hard
IPv6, virtualisation, Windows Server all need a solution
We could leverage lots of open source tools
But piecemeal integration of these requires high investment due to our complex system
Years of organic growth have made the system way too hairy
It's often easier to reinvent rather than integrate

Lack of dynamic-ness in the infrastructure


We hack the config system for dynamic VMs

It's critical to look at the system as a whole


Where to look?
Large ops community out there taking the "tool chain" approach whose scaling needs match ours: O(100k) servers, many apps
Become standard and join this community


Use Puppet for the core


The tool space has exploded in the last few years
In configuration management and ops
Large, shared tool forges, and lots of experience

Puppet and Chef are the clear leaders for the core tool
other tools in our scope try to integrate with those

Many large-scale enterprises use Puppet


Its declarative approach fits better with what we're used to
Large installations: friendly, wide-base community and commercial support and training
You can buy books on it


Scaling challenges: nodes


Currently we have O(10k) physical nodes
IaaS approach:
Moving to virtual machines
More (smaller, load-balanced) service nodes
VMs for raw compute (batch or pilot jobs)
Homogeneous: compute + storage on the same node

Add another computer centre, 24/48 SMT cores per node, you get 100k to 300k virtual nodes to be managed
99.6% (1) node update success rate means 1200 manual interventions to fix it
(1) in a recent intervention on lxbatch
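
The arithmetic behind the last point can be checked with a short sketch. It assumes the upper estimate of 300k virtual nodes quoted on this slide; the function name is illustrative only.

```python
def manual_interventions(nodes: int, success_rate: float) -> int:
    """Expected number of nodes needing a manual fix after one update."""
    return round(nodes * (1.0 - success_rate))

# Upper estimate from the slide: 300k virtual nodes at a 99.6% success rate.
print(manual_interventions(300_000, 0.996))  # -> 1200

# Today's O(10k) physical nodes at the same rate, for comparison.
print(manual_interventions(10_000, 0.996))   # -> 40
```

The same failure rate that is tolerable at 10k nodes becomes a significant manual workload at 300k, which is why the update success rate matters so much at this scale.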


Scaling challenges: people

Many, diverse applications (clusters) managed by different teams
...and 700+ other unmanaged Linux nodes in VMs that could benefit from a simple configuration system

Agile Infrastructure 1st Try


First started investigating tools in September using part-time resources from CF, DB, DSS, GT, OIS and PES
Trying iterative agile-sprint style (Scrum): short sprints, feedback, sprint review, visible
Take first, best-guess at architecture and tool selection, iterate

Mixed success with this agile style


What works: Good visibility and reviews. Daily scrum meeting useful. Weekly review meeting open to management.
What doesn't: The time-boxing part of Scrum sprints is hard with part-time resources
The project planning now foresees more dedication of staff


Agile Infrastructure 1st Try


We're currently running:
OpenStack as cloud software for virtual machines, image management, bulk storage
Future IT forum presentation

Puppet for the configuration management core with Foreman as a dashboard


Foreman dashboard


Agile Infrastructure 1st Try


None of the tools are perfect out-of-the-box


...but we'd rather submit patches to a good open source tool than reimplement it
We've experienced very good community support: RFCs and patches are quickly accepted
Very active community: often problems are fixed and missing features implemented before you even report them


Agile Infrastructure 1st Try


We're currently running:
yum for software distribution (replacing spma)
git for template management: why git?
Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
Many of the tools we can benefit from also assume git
We should not be different from the rest of the community


Puppet
Client/server architecture
puppetmaster: horizontally scalable Rails application
X509 cert authenticated nodes: integrate with CERN CA


Puppet
Puppet runs on the client, applying the configuration changes
It detects the current state and only runs if there's something to do

It runs every few minutes


New configuration will be ~immediately applied (fail-fast)
This is a change from CDB where latent changes can be stacked up

Normal mode is client-side compile (assume success)


No more CDB commit waits
Change from CDB: the compilation fails later

Good monitoring is a pre-req: puppet sends reports back to the puppetmaster


The Foreman tool can collect these for you
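
The "detect the current state and only act if there is something to do" behaviour can be illustrated with a minimal Python sketch. This is not Puppet code and not Puppet's implementation; the state dictionaries and the function name `converge` are hypothetical, chosen only to show the idea of an idempotent run followed by a report.

```python
def converge(current: dict, desired: dict) -> dict:
    """Return only the settings that must change to reach the desired state."""
    return {key: value for key, value in desired.items()
            if current.get(key) != value}

# Hypothetical node state and desired state (service name -> status).
current = {"ntp": "stopped", "sshd": "running"}
desired = {"ntp": "running", "sshd": "running"}

changes = converge(current, desired)    # first run: one thing to fix
current.update(changes)                 # apply the changes on the node
report = {"changed": sorted(changes)}   # summary a run could send back

second_run = converge(current, desired)  # second run: nothing to do
print(report, second_run)
```

Because a run that finds no difference does nothing, running every few minutes is cheap, and the report is what monitoring hangs on.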

Puppet language
Puppet uses its own Ruby-like language for the templates to assert the desired state of the nodes
With Ruby fall-back for hard stuff (we've only needed this once)

Being declarative rather than procedural, there are quirks


Takes a bit of practice to get it
There are books, online docs, online cook-books, and a large community to help

It dispenses with the need for ncm components


All the work is done by puppet on the node itself: you just provide the template part to assert what you want done
Less software -> easier to move to new OS versions

Externals
Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
Node function + hardware
Moving a host between clusters is a DB update

Your configuration can use variables the node detects itself


e.g. reconfigure daemons based on where a newly live-migrated VM has found itself

Query the compiled configuration of other hosts


e.g. Open my firewall to the lxadm nodes
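
A minimal Python sketch of the idea, under stated assumptions: the external database is reduced to an in-memory dictionary, the host names are hypothetical (only the cluster names lxadm and lxbatch come from these slides), and querying cluster membership stands in for querying another host's compiled configuration.

```python
# Hypothetical stand-in for the external node database: hostname -> cluster.
node_db = {
    "lxadm01": "lxadm",
    "lxadm02": "lxadm",
    "lxbatch001": "lxbatch",
}

def move_host(db: dict, host: str, cluster: str) -> None:
    """Moving a host between clusters is a single database update."""
    db[host] = cluster

def hosts_in(db: dict, cluster: str) -> list:
    """Query the database for all hosts currently in a cluster."""
    return sorted(h for h, c in db.items() if c == cluster)

# "Open my firewall to the lxadm nodes": derive the allow-list from the DB.
allowed = hosts_in(node_db, "lxadm")

# After a move, the same query picks up the new member automatically.
move_host(node_db, "lxbatch001", "lxadm")
allowed_after = hosts_in(node_db, "lxadm")
print(allowed, allowed_after)
```

The point of the design is that the firewall rule is never edited by hand: it is recomputed from the database, so moving a host needs no template change.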

Moving towards PaaS


Parametrisable recipes
Just fill in the blanks

The aim is to make it easy to use pre-canned recipes without even touching a Puppet template
e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box with these parameters

Moving us in the PaaS direction


Ultimately, it would be better if you never even needed to log into this node
(J2EE public service, IT web hosting service, MySQL service)
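
"Fill in the blanks" can be sketched in Python as a recipe with defaults that the user overrides. Everything here is an assumption made for illustration: the parameter names, the default values and the function `fill_recipe` are hypothetical and do not describe an existing CERN recipe or any Puppet API.

```python
# Hypothetical defaults for a pre-canned "Django web server" recipe.
DJANGO_RECIPE_DEFAULTS = {
    "web_server": "apache",
    "wsgi": "mod_wsgi",
    "sso_enabled": True,
    "port": 443,
}

def fill_recipe(defaults: dict, **params) -> dict:
    """Fill in the blanks: user parameters override the recipe defaults."""
    unknown = set(params) - set(defaults)
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **params}

# The user supplies only what differs from the standard setup.
config = fill_recipe(DJANGO_RECIPE_DEFAULTS, port=8443)
print(config)
```

The user never touches the recipe itself, only its parameters, which is the step towards PaaS described above.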


Standard workflow
Iterate CDB (on lxadm): check out from CDB -> update templates -> CDB commit (n minutes) -> run and check on test node -> notify with nc-client -> check on node(s)

Iterate Puppet (on lxadm): check out from git -> update templates -> git commit and push (1 minute) -> run and check on test node -> notify with mcollective -> check on foreman

Iterate Puppet-apply (on test node): check out from git -> update templates -> run puppet-apply on the test node -> check on test node -> git commit and push -> notify with mcollective -> check on foreman


Modernising our processes


Our software processes for the computer centre are fairly limited
fire-and-forget broadcasts to project-elfms

and rather manual


The manual test/ -> preprod/ -> prod/ template dance
Our toolset RPMs are built on laptop and uploaded to swrep by hand

Add standard CI (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
...then automate the testing, e.g. suitably tagged RPMs are automatically deployed to /test nodes.
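
The tag-based routing can be sketched in a few lines of Python. This is an illustration only, not how Koji or Jenkins work internally: the package names, the tag values and the function `deployment_targets` are hypothetical.

```python
def deployment_targets(packages: list, stage: str) -> list:
    """Select the packages whose build tag routes them to the given stage."""
    return [pkg["name"] for pkg in packages if pkg["tag"] == stage]

# Hypothetical output of the automated build; names are illustrative only.
builds = [
    {"name": "toolset-a-1.2-1", "tag": "test"},
    {"name": "toolset-b-0.9-3", "tag": "prod"},
    {"name": "toolset-c-2.0-1", "tag": "untagged"},
]

# Only suitably tagged packages reach the test nodes; nothing is uploaded by hand.
to_test = deployment_targets(builds, "test")
print(to_test)
```

An untagged package is simply never selected, which matches the idea that the automated build is the only route into the computer centre.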


Modernising our processes


We're working out which of the many puppet / git models suits us
code review, sign-off and automated notification for changes that will affect multiple clusters
How to automate the test/preprod/prod advancement

Pre-req is flexible monitoring and alarming


you need to trust that an automation failure will be signaled to you

Script-generated emails are banned


Need good monitoring to hang these notifications on

Integrate components rather than use emailScript


Script-generated tickets (where your value in the process is your password) are banned

Current tool snapshot (liable to change)


[Diagram of the current toolset. Labels: Puppet; Foreman; mcollective, yum; AIMS/PXE; Foreman; Openstack Nova; Jenkins; JIRA; git, SVN; Koji, Mock; Yum repo; Pulp; Hardware database; Lemon; Puppet stored config DB]


Preliminary timelines
Year / What / Actions

2011, 2012
Agree overall principles
Prepare formal project plan
Establish IaaS in CERN CC
Production Agile Infrastructure
Monitoring Implementation as per WG
Migrate lxcloud
Early adopters to Agile Infrastructure

2013: LSD 1, New Data Centre
Extend IaaS to remote CC
Business Continuity
Support Experiment App re-work
Migrate CVI
General migration to Agile with SLC6 and Windows 8
Phase out Quattor/CDB/

2014: LSD 1 (to November)

Aggressive schedule if we are to make it for new data centre



Initial steps
Decide on tools now and integrate them together to make a production setup (Q1)
We can still change... but we're starting to commit

Looking for early adopters (from Q1)


In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
e.g. PES/OIS services: batch/VMs, JIRA, Drupal
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012

Help with integration / coding
Help with ideas
Help with building the task list


Summary
IT has started a new project to move our infrastructure to a new toolset based around industry standard open source components
Puppet for the core configuration tool
Better integration between components
Use of more modern software processes to aid deployment
Better monitoring
Engage with the community rather than re-implement

Overall project scope is wider (future IT forums)


Cloud and virtualisation, improved monitoring

Please get involved early and give feedback


https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

Backup slides


Code ownership model


The "sticky hands" support model (you touched it last!)
We're working out an FE-based model where
Code is owned by the related service Functional-Element
Ownership confers the responsibility to maintain a decent standard config for the computer centre, and the responsibility to roll out new versions of that code/config
Patches from interested people can be offered, and if you take them, you support them
not the guy that gave you the patch


mcollective and messaging


mcollective is a notification framework
Mix of CERN's not.d / wassh
It broadcasts instructions to run pre-canned tasks to nodes selected by a filter
collects the results from the nodes, then renders that result for the CLI
e.g. restart all my webservers, do a puppet run now

It requires a messaging framework that all nodes subscribe to (to receive the notification)
Typically: ActiveMQ or RabbitMQ
Both Openstack and our (future) monitoring system need a CC-wide messaging system as well
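
The broadcast, filter and collect pattern can be sketched in Python. This is a simplification under stated assumptions: it runs in one process with no message broker, whereas mcollective delivers the request over the messaging layer; the node inventory, the `role` fact and both function names are hypothetical.

```python
from typing import Callable

# Hypothetical node inventory: hostname -> facts used for filtering.
NODES = {
    "web01": {"role": "webserver"},
    "web02": {"role": "webserver"},
    "db01": {"role": "database"},
}

def broadcast(nodes: dict,
              node_filter: Callable[[dict], bool],
              task: Callable[[str], str]) -> dict:
    """Run a pre-canned task on every node matching the filter; collect results."""
    return {host: task(host)
            for host, facts in nodes.items() if node_filter(facts)}

def restart_webserver(host: str) -> str:
    # Stand-in for the real action; here it only returns a status string.
    return "restarted"

# "Restart all my webservers": select by fact, run, then render for the CLI.
results = broadcast(NODES, lambda facts: facts["role"] == "webserver",
                    restart_webserver)
for host, status in sorted(results.items()):
    print(f"{host}: {status}")
```

In the real system each node evaluates the filter itself after receiving the broadcast, which is why every node must subscribe to the messaging framework.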
