IT Tech Agile Config

The Agile Infrastructure Project
Part 1: Configuration Management
Tim Bell Gavin McCance

www.cern.ch/it
CERN IT Department, CH-1211 Genève 23, Switzerland
Configuration and Operations Tools

https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure https://agileinf.cern.ch/jira/
IT Technical Forum 27 Jan 2012
Project scope
The project is reviewing the entire CERN computer-centre management toolset
What happens from the bare metal up
Asset management, inventory
Sysadmin tools and maintenance workflows
Service management and configuration tools
Dynamic configuration for virtual hosts
Operations monitoring
Workflow automation and continuous deployment
Why?
Current production system built around the Quattor toolset is successfully managing 10k servers
(CERN) Quattor + many CERN components
Why are we changing the toolset?
What are the issues

Uncompressible technical debt
The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
Small community (less funding) and general support problem
At CERN, we've fallen into the "sticky hands" support model
We need better automation and integration between the sub-components

Lack of automated workflow: everything is a ticket
emailScript: your added value in the process is often your CERN password
The 15-minute CDB commit "walk": the cost of a context switch
What are the issues

Transferrable skills and training
Learning curve for our tools is steep and remains high
It's easier to hire people who have skills in a widely used tool than in your internal tools
Depending on where you look
[Figure: job adverts on indeed.com (an index of millions of worldwide job posts across thousands of job sites), comparing Puppet and Quattor]
These are the sort of posts our departing staff will be applying for.
Integration is hard
IPv6, virtualisation, Windows Server all need a solution
We could leverage lots of open source tools
But piecemeal integration of these requires high investment due to our complex system
Years of organic growth have made the system way too hairy
It's often easier to reinvent than to integrate
Lack of dynamic-ness in the infrastructure

We hack the config system for dynamic VMs
It's critical to look at the system as a whole
Where to look?
Large ops community out there taking the tool-chain approach, whose scaling needs match ours: O(100k) servers, many apps
Become standard and join this community
Use Puppet for the core

The tool space has exploded in the last few years
In configuration management and ops
Large, shared tool forges, and lots of experience
Puppet and Chef are the clear leaders for the core tool
other tools in our scope try to integrate with those
Many large-scale enterprises use Puppet

Its declarative approach fits better with what we're used to
Large installations: friendly, wide-base community and commercial support and training
You can buy books on it
Scaling challenges: nodes

Currently we have O(10k) physical nodes
IaaS approach:
Moving to virtual machines
More (smaller, load-balanced) service nodes
VMs for raw compute (batch or pilot jobs)
Homogeneous: compute + storage on the same node
Add another computer centre and 24/48 SMT cores per node, and you get 100k to 300k virtual nodes to be managed
A 99.6% (1) node update success rate then means 1200 manual interventions to fix it
(1) Success rate seen in a recent intervention on lxbatch
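The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch in Python, assuming that the 1200 refers to the upper estimate of 300k virtual nodes and that every failed update costs exactly one manual intervention (both are readings of the slide, not stated on it). The function name is illustrative.

```python
def manual_interventions(nodes: int, success_rate_percent: float) -> int:
    """Expected number of nodes left needing a manual fix after one update."""
    failure_fraction = (100.0 - success_rate_percent) / 100.0
    return round(nodes * failure_fraction)


# Today's O(10k) nodes versus the projected 100k to 300k virtual nodes.
for nodes in (10_000, 100_000, 300_000):
    print(nodes, manual_interventions(nodes, 99.6))
```

The failure rate is unchanged across the three cases; only the node count grows, which is why a success rate that is tolerable today becomes a people-scaling problem.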
Scaling challenges: people
Many, diverse applications (clusters) managed by different teams
...and 700+ other unmanaged Linux nodes in VMs that could benefit from a simple configuration system
Agile Infrastructure 1st Try

First started investigating tools in September using part-time resources from CF, DB, DSS, GT, OIS and PES
Trying an iterative agile-sprint style (Scrum): short sprints, feedback, sprint review, visible
Take a first best guess at architecture and tool selection, then iterate
Mixed success with this agile style

What works: good visibility and reviews. Daily scrum meeting useful. Weekly review meeting open to management.
What doesn't: the time-boxing part of Scrum sprints is hard with part-time resources
The project planning now foresees more dedication of staff

We're currently running:
OpenStack as cloud software for virtual machines, image management, bulk storage
Future IT forum presentation
Puppet for the configuration management core with Foreman as a dashboard
Foreman dashboard
None of the tools are perfect out-of-the-box

...but we'd rather submit patches to a good open source tool than reimplement it
We've experienced very good community support: RFCs and patches are quickly accepted
Very active community: often problems are fixed and missing features implemented before you even report them

yum for software distribution (replacing spma)
git for template management: why git?
Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
Many of the tools we can benefit from also assume git
We should not be different from the rest of the community
Puppet
Client/server architecture
puppetmaster: horizontally scalable Rails application
X509 cert authenticated nodes: integrate with CERN CA
Puppet
Puppet runs on the client, applying the configuration changes
It detects the current state and only runs if there's something to do
It runs every few minutes

New configuration will be applied almost immediately (fail-fast). This is a change from CDB, where latent changes can be stacked up
Normal mode is client-side compile (assume success)

No more CDB commit waits
Change from CDB: the compilation fails later
Good monitoring is a pre-req: puppet sends reports back to the puppetmaster

The Foreman tool can collect these for you
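The "detect the current state, act only if needed, then report" cycle described above can be illustrated with a toy model. This is a sketch in Python, not Puppet's actual implementation: the `Node` class, the setting names and the report layout are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a managed host: just a dict of current settings."""
    name: str
    state: dict = field(default_factory=dict)


def apply_desired_state(node: Node, desired: dict) -> dict:
    """Change only the settings that differ and return a report of the run."""
    changed = {}
    for key, wanted in desired.items():
        current = node.state.get(key)
        if current != wanted:
            changed[key] = (current, wanted)  # record old and new value
            node.state[key] = wanted
    return {"node": node.name, "changed": changed, "in_sync": not changed}


desired = {"ntp_server": "ntp.example.org", "sshd_running": True}
node = Node("testnode01", {"sshd_running": False})

first = apply_desired_state(node, desired)   # two settings get corrected
second = apply_desired_state(node, desired)  # nothing left to do
print(first["changed"])
print(second["in_sync"])
```

Running the same desired state twice changes nothing the second time. That property is what makes a run every few minutes safe, and the returned report is the kind of record a dashboard would collect.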
Puppet language
Puppet uses its own Ruby-like language for the templates to assert the desired state of the nodes
With Ruby fall-back for hard stuff (we've only needed this once)
Being declarative rather than procedural, there are quirks

Takes a bit of practice to get it
There are books, online docs, online cook-books, and a large community to help
It dispenses with the need for ncm components

All the work is done by puppet on the node itself: you just provide the template part to assert what you want done
Less software -> easier to move to new OS versions
Externals
Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
Node function + hardware
Moving a host between clusters is a DB update
Your configuration can use variables the node detects itself

e.g. reconfigure daemons based on where a newly live-migrated VM has found itself
Query the compiled configuration of other hosts

e.g. Open my firewall to the lxadm nodes
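The two ideas on this slide, cluster membership held in a database and configuration derived by querying other hosts, can be sketched together. This is a Python illustration under stated assumptions: `NODE_DB`, the host names, the documentation-range IP addresses and the rule syntax are all made up, and it does not use Puppet's stored-configuration API.

```python
# Hypothetical node database: host -> cluster and the IP address it reports.
NODE_DB = {
    "lxadm01": {"cluster": "lxadm", "ip": "192.0.2.11"},
    "lxadm02": {"cluster": "lxadm", "ip": "192.0.2.12"},
    "lxbatch001": {"cluster": "lxbatch", "ip": "192.0.2.101"},
}


def move_host(db: dict, host: str, new_cluster: str) -> None:
    """Moving a host between clusters is a single DB update."""
    db[host]["cluster"] = new_cluster


def firewall_rules(db: dict, allowed_cluster: str, port: int) -> list:
    """Build one allow rule per host currently in the given cluster."""
    return [
        f"allow tcp from {info['ip']} to any port {port}"
        for host, info in sorted(db.items())
        if info["cluster"] == allowed_cluster
    ]


print(firewall_rules(NODE_DB, "lxadm", 22))
move_host(NODE_DB, "lxbatch001", "lxadm")
print(firewall_rules(NODE_DB, "lxadm", 22))
```

After the move, the next rule generation picks up the new lxadm member without anyone editing a template.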
Moving towards PaaS

Parametrisable recipes
Just fill in the blanks
The aim is to make it easy to use pre-canned recipes without even touching a Puppet template
e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box with these parameters
Moving us in the PaaS direction

Ultimately, it would be better if you never even needed to log into this node
(J2EE public service, IT web hosting service, MySQL service)
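"Just fill in the blanks" can be pictured as a pre-canned template plus a small set of parameters. The following Python sketch uses the standard library's `string.Template`. The recipe text, the key names and the defaults are hypothetical and do not reflect any real CERN recipe or Puppet module.

```python
from string import Template

# Hypothetical pre-canned recipe; the key names are illustrative only.
DJANGO_RECIPE = Template(
    "server_name: $server_name\n"
    "wsgi_app: $wsgi_app\n"
    "sso_enabled: $sso_enabled\n"
)

# Values the recipe author has already chosen for the user.
DEFAULTS = {"sso_enabled": "true"}


def render_recipe(recipe: Template, **params: str) -> str:
    """Merge caller parameters over the defaults; a missing blank raises KeyError."""
    return recipe.substitute({**DEFAULTS, **params})


print(render_recipe(DJANGO_RECIPE,
                    server_name="myapp.example.org",
                    wsgi_app="myapp.wsgi"))
```

The user supplies only the values that are specific to their service. Leaving out a required parameter fails immediately instead of producing a half-filled configuration.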
Standard workflow
Iterate with CDB (on lxadm):
check out from CDB -> update templates -> CDB commit (n minutes) -> run and check on test node -> notify with nc-client -> check on node(s)

Iterate with Puppet (on lxadm):
check out from git -> update templates -> git commit and push (1 minute) -> run and check on test node -> notify with mcollective -> check on foreman

Iterate with puppet-apply (on test node):
check out from git -> update templates -> run puppet-apply on the test node -> check on test node -> git commit and push -> notify with mcollective -> check on foreman
Modernising our processes

Our software processes for the computer centre are fairly limited
fire-and-forget broadcasts to project-elfms
and rather manual

The manual test/ -> preprod/ -> prod/ template dance
Our toolset RPMs are built on a laptop and uploaded to swrep by hand
Add standard CI (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
.. then automate the testing e.g. suitably tagged RPMs are automatically deployed to /test nodes.
Modernising our processes

We're working out which of the many puppet / git models suits us
Code review, sign-off and automated notification for changes that will affect multiple clusters
How to automate the test/preprod/prod advancement
Pre-req is flexible monitoring and alarming

you need to trust that an automation failure will be signaled to you
Script-generated emails are banned

Need good monitoring to hang these notifications on
Integrate components rather than use emailScript

Script-generated tickets (where your value in the process is your password) are banned
Current tool snapshot (liable to change)

[Diagram of the current tool chain. Components shown: Puppet; Foreman; mcollective, yum; AIMS/PXE; OpenStack Nova; Jenkins; JIRA; git, SVN; Koji, Mock; Yum repo, Pulp; hardware database; Lemon; Puppet stored config DB]
Preliminary timelines
Year: 2011, 2012
Actions: Agree overall principles; Prepare formal project plan; Establish IaaS in CERN CC; Production Agile Infrastructure; Monitoring Implementation as per WG; Migrate lxcloud; Early adopters to Agile Infrastructure

Year: 2013
What: LSD 1; New Data Centre
Actions: Extend IaaS to remote CC; Business Continuity; Support Experiment App re-work; Migrate CVI; General migration to Agile with SLC6 and Windows 8; Phase out Quattor/CDB/

Year: 2014
What: LSD 1 (to November)
Aggressive schedule if we are to make it for new data centre

Initial steps
Decide on tools now and integrate them together to make a production setup (Q1)
We can still change... but we're starting to commit
Looking for early adopters (from Q1)

In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
e.g. PES/OIS services: batch/VMs, JIRA, Drupal
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
Help with integration / coding
Help with ideas
Help with building the task list
Summary
IT has started a new project to move our infrastructure to a new toolset based around industry-standard open source components
Puppet for the core configuration tool
Better integration between components
Use of more modern software processes to aid deployment
Better monitoring
Engage with the community rather than re-implement
Overall project scope is wider (future IT forums)

Cloud and virtualisation, improved monitoring
Please get involved early and give feedback

https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
Backup slides
Code ownership model

The "sticky hands" support model (you touched it last!)
We're working out an FE-based model where
Code is owned by the related service Functional Element
Ownership confers the responsibility to maintain a decent standard config for the computer centre, and the responsibility to roll out new versions of that code/config
Patches from interested people can be offered, and if you take them, you support them
not the guy that gave you the patch
mcollective and messaging

mcollective is a notification framework
A mix of CERN's not.d / wassh
It broadcasts instructions to run pre-canned tasks to nodes selected by a filter,
collects the results from the nodes, then renders that result for the CLI
e.g. restart all my webservers, do a puppet run now
It requires a messaging framework that all nodes subscribe to (to receive the notification)
Typically: ActiveMQ or RabbitMQ
Both OpenStack and our (future) monitoring system need a CC-wide messaging system as well
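The broadcast / filter / collect pattern can be sketched with an in-memory stand-in for the message broker. This is a Python illustration only: the `Bus` and `Agent` classes, the task names and the `role` attribute are hypothetical, and none of it is mcollective's real API.

```python
class Agent:
    """Toy node-side agent that only runs tasks it already knows about."""

    def __init__(self, name: str, role: str):
        self.name = name
        self.role = role
        # Pre-canned tasks: callers pick from this menu, they cannot send code.
        self.tasks = {
            "restart_web": lambda: "restarted",
            "puppet_run": lambda: "applied",
        }

    def run(self, task: str) -> str:
        handler = self.tasks.get(task)
        return handler() if handler else "unknown task"


class Bus:
    """In-memory stand-in for the messaging system every node subscribes to."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, agent: Agent) -> None:
        self.subscribers.append(agent)

    def broadcast(self, task: str, node_filter) -> dict:
        # Every subscriber is offered the message; only matching nodes act.
        return {a.name: a.run(task) for a in self.subscribers if node_filter(a)}


bus = Bus()
for name, role in [("web01", "webserver"), ("web02", "webserver"), ("db01", "database")]:
    bus.subscribe(Agent(name, role))

results = bus.broadcast("restart_web", lambda a: a.role == "webserver")
print(results)
```

The caller never addresses hosts by name. It describes which nodes should act, and the collected results come back keyed by node for display on the command line.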