Tivoli Software
Document Author:
Version 1.0
Date 16/06/2005
Introduction
The goal of this document is to provide suggestions and tips, based on best practices, for tuning and configuring the Mcollect component of Tivoli Configuration Manager.
Figure 1
The overall architecture of the SCS collection hierarchy can be seen in Figure 1. Let's follow an Inventory scan flow in detail:
1. The InventoryConfig profile is distributed from the Inventory Server.
2. The profile is distributed via Mdist2.
3. The profile reaches the endpoint and is decoded.
4. The hardware and software scanners execute and create various MIF files.
5. The MIF files are parsed and converted into an Inventory results structure.
If the MIF parsing succeeds, the Inventory results structure is marshaled, compressed, and stored in a .DAT file. The Inventory endpoint code creates a CTOC (Collection Table of Contents property list) that contains information about the datapack location, the data handler, and the methods to use to retrieve and send data through the collection hierarchy. The CTOC is passed to the gateway collector via the mc_request_collection upcall. If the mc_request_collection upcall fails, the data is passed back to the Inventory callback object via Mdist2. The callback object stores the datapack in $DBDIR, creates a CTOC, and passes the CTOC to the data handler.

The mc_request_collection upcall method runs on the collector and receives a collection request/CTOC as a parameter. This CTOC is added to the input queue of the collector. If the collector process is not running, the upcall starts it. Once the CTOC is successfully stored in the input queue, the upcall terminates.

Once the gateway collector starts up, the collector's scheduler thread begins running. The scheduler thread checks the input queue and, if it sees a CTOC in the queue, starts an input worker thread to process the CTOC. The input worker thread reads the CTOC for the endpoint's OID and calls iom_create_key to generate an IOM key. The input worker thread then starts another short-running thread that performs an mc_get_data downcall, via a tmf_req_invoke function call, to the endpoint's OID. The IOM key is passed to the endpoint via this downcall. After starting the short-running thread, the input worker thread performs an iom_timed_open call using the generated IOM key.

The mc_get_data call runs on the endpoint, which has received the IOM key. The endpoint performs an iom_open() call with the same key. Once the IOM connection has been established with the collector, the endpoint transfers each data segment in the datapack in chunks of 1024 KB.
Meanwhile, the gateway collector input thread receives the chunks and writes them to the collector's depot. (Note: the collector checks that there is enough space in the depot to store the datapack before initiating the transfer.) After all the data segments have been sent, the IOM connection is closed and a .DON file is written in the $LCFROOT/inv/SCAN/mcollect directory on the endpoint. This marker file tells Inventory that the .DAT file has been transferred to the gateway collector and that it is now safe to delete the .DAT file on the endpoint. This deletion occurs the next time an Inventory scan is distributed to the endpoint.

After all the data segments have been received, the input worker thread moves the CTOC from the input queue to the output queue, then checks for another CTOC to process in the input queue. If it finds one, it processes it; otherwise the input worker thread times out.

Meanwhile, the scheduler thread continues running in the background. It detects that a CTOC has been added to the output queue and starts an output worker thread. The output worker thread processes the CTOC and determines the next hop in the route between the current gateway collector and the destination data handler. The thread then calls io_alert() on the next hop's OID. This results in an mc_request_collection upcall method running on the next-level collector, or on the data handler if there are no other collectors in the collection route. After the mc_request_collection upcall succeeds, the output worker thread marks the CTOC with a WAIT_UPSTREAM status, indicating that the next level in the collection hierarchy has been notified. The output worker thread then checks whether there are any other CTOCs in the output queue that are not in WAIT_UPSTREAM status. If so, the thread picks up a CTOC and processes it.

The mc_request_collection upcall runs on the data handler, with the CTOC passed in as a parameter.
The CTOC is stored in the input queue of the data handler.
If the data handler isn't running when the mc_request_collection upcall arrives, the process starts and the scheduler thread begins running. The scheduler thread notices that there is a CTOC in the input queue to process and starts an input worker thread to process it. The input worker thread obtains a handle to the CTOC, creates an IOM key, and spawns a short-running thread that runs an mc_get_data downcall to pass the IOM key to the collector from which the CTOC arrived. The input worker thread meanwhile calls iom_timed_open using the IOM key.

The mc_get_data downcall starts a get_data_worker thread on the lower-level collector. This thread takes the IOM key and calls iom_timed_open to establish an IOM connection with the data handler. Once the IOM connection has been established, the datapack is transferred in chunks (sized according to the collector's chunk size setting) to the data handler. After the entire datapack has been transferred, the get_data_worker thread removes the corresponding CTOC from the collector's output queue and terminates.

After the data handler has finished receiving a datapack and has stored it in the depot, it removes the CTOC from the input queue and stores it in the output queue. The data handler's scheduler thread notices that there is a CTOC in the output queue and starts an output worker thread to process it. The data handler output thread establishes a RIM connection, takes the datapack, converts it into RIM format, and inserts the data into the database. Once the data has been inserted successfully, the status collector is notified, the CTOC is removed from the output queue, and the datapack is deleted from the data handler depot. The output thread then checks for another CTOC and continues processing until it runs out of CTOCs, at which point it times out.
Problem: Scans stuck in pending (as seen by wgetscanstat)

wcancelscan is the first step in an Mcollect cleanup. If additional manual steps are needed, use the ones below.

Problem determination tips:

The easiest way to check where a scan is stuck is to pick a single target that appears in the wgetscanstat output with a pending status. Next, check the endpoint for .mif, .bk1, and .DAT files. If there is an lcf/inv/SCAN/Mcollect directory, does it contain a .DON file that corresponds to the scan id? If any of these files are missing, the problem is at the endpoint level or with Mdist2.

If all of these files appear on the endpoint, check the collector queues using the wcstat q command. This command lists the CTOCs that are stored in each queue. If the CTOC corresponding to the pending node is found, check the collector mcollect.log and the oservlog to verify that the collector is still running. If the CTOC is in the collector's output queue and is in WAIT_UPSTREAM status, it has already been sent to the input queue of the next-level collector. If the input queue on the next-level collector doesn't have the CTOC, then the CTOC in the lower-level collector's output queue in WAIT_UPSTREAM mode will never get processed.

If the CTOC was not found on the collector, repeat the same checks on the Data Handler queues. If the CTOC is in the output queue of the data handler in QUEUED_OUTPUT status, there could be a problem with the data handler or with the database insertion code. Check the data handler mcollect.log, the oservlog, and the RIM log for any problems.

Solution: Manual Mcollect Cleanup

A) Stop ALL collectors and the data handler. The easy way is stopping the oserv, OR:
wcollect -h immediate @Gateway:gw_name (for every collector!)
wcollect -h immediate @InvDataHandler:inv_data_handler

B) Callback cleanup. On the MN that hosts the callback object:
1. Kill the callback process (inv_cb_meths) (use ps -ef on UNIX).
2. Remove the callback datapacks from $DBDIR, that is, remove the files whose names end in endpoint OIDs (example = 1502333812.1613_1502333812.561996.508+).

C) On the Data Handler, kill the "status collector" process called inv_stat_meths.

D) Delete the run-time directories:
1) On the managed node where the Inventory Server is installed:
in $DBDIR/inventory (delete everything except the directory called INI)
in $DBDIR/Mcollect (delete everything)
2) On the collectors (all gateways that have the Scalable Collection Service CLL installed):
in $DBDIR/Mcollect (delete everything)
3) NOTE: the run-time directory can be found by running wcollect "collector", so these files may not be under $DBDIR.

E) Start all collectors and the data handler:
wcollect -s @Gateway:collector_name
wcollect -s @InvDataHandler:inv_data_handler

F) You may want to verify that the Mdist2 delivery system is clear. (Be careful: both swdist and ITM use Mdist2 delivery.)
1. wmdist -la (to see any active distributions)
2. If any are active, use wcancelscan and/or wmdist -c to cancel them.
3. If wmdist -la is clean, then wmdist -I gw_name must also be clean, for every gateway. If not, run wmdist -B to clean the mdist2.bdb file on the gateway.

Problem: When should the .dat file be removed from the ...lcf\inv\SCAN directory?

Solution: The time at which the .dat file is removed depends on how it is returned from the endpoint.
1. If the .dat file is returned via the callback object, the .dat file is removed immediately.
2. If the .dat file is returned via Mcollect, a .DON file is created in the inv/SCAN/Mcollect subdirectory. The .dat file remains on the endpoint until the next scan. If the next scan finds a .DON file, it removes the corresponding .DAT file before running and creating a new .dat file.
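As a rough sketch, the endpoint-side file checks described in the problem determination tips can be scripted. The function name, argument layout, and file-matching pattern below are illustrative assumptions, not part of the product:

```shell
# Sketch: given the endpoint's SCAN directory and a scan id, report whether
# the .DAT file and its .DON marker exist. The layout ($SCAN_DIR/*.DAT,
# $SCAN_DIR/Mcollect/*.DON) follows the description above; the function
# name is hypothetical.
check_ep_files() {
  scan_dir=$1
  scan_id=$2

  if ls "$scan_dir"/*.DAT >/dev/null 2>&1; then
    echo "DAT file present"
  else
    echo "no DAT file: problem is at the endpoint level or with mdist2"
  fi

  if ls "$scan_dir"/Mcollect/*"$scan_id"*.DON >/dev/null 2>&1; then
    echo "DON marker present: datapack was transferred to the gateway collector"
  else
    echo "no DON marker: check the collector queues with wcstat q"
  fi
}

# usage: check_ep_files /path/to/lcf/inv/SCAN 42
```

If the DAT file is present but the DON marker is missing, continue the diagnosis on the collector side with wcstat q as described above.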
Problem: How do you find or move the callback object?

Solution: wlookup -ar InventoryConfigCB shows where the callback object is.

To move the callback object:
wdel @InventoryConfigCB:inv_cb
wcrtinvcb <managed node name>
Problem: What happens when the collector/data handler depot becomes full?

Solution: Once the data handler depot becomes full, it throttles the input threads down to one and continues processing the output threads to insert data into the database. If the database is not available, the output threads retry several times and then mark the scan to that endpoint as a failure. When the depot becomes full, the one input thread that remains running is put into an aggressive retry mode, which means it attempts to store a datapack until it succeeds. The input thread behavior returns to normal once that input thread successfully stores the datapack it was attempting to store. The idea is that if it was able to store the datapack, there must now be more room in the depot, so additional input threads can start and begin storing datapacks in the depot.
Problem: What does the following error in the mcollect.log mean?

Found in Queues ctoc id:c1273648966606027851
Found ERROR ctoc id:c1273648966606027851 in input, forwarding...
ERROR:scheduler_ieq_fatal called:c1273648966606027851

Solution: This error indicates that the CTOC has a collection status of false. This means that on some lower-level collector, the number of attempts to collect the datapack reached the max retries limit; the collector gave up, marked the CTOC with a collection status of false, and forwarded the CTOC up the collection hierarchy. When the data handler encounters such a CTOC, it discards it.

Problem: How long does the queue checkpoint process take?

Solution: When the data handler checkpoints a queue, it writes the entire in-memory queue to a file, so the longer the queue, the longer the write takes. It is a linear process. Also, the function that performs the checkpoint is mutexed, so if one thread is already doing a checkpoint and another thread wishes to do one, the second thread must wait until the first completes. You can see how long the checkpoint process takes from the time difference between the following lines for a given thread in the mcollect.log:
Sep 11 19:30:05 3 [pid:00022093 tid:00770064] queue_checkpoint: creating temp file:
Sep 11 19:30:05 3 [pid:00022093 tid:00770064] queue_checkpoint: renaming temp file
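As a minimal sketch, the elapsed time of each checkpoint can be extracted by pairing these two lines per thread id with awk. The field positions (timestamp in field 3, thread id in field 6) are an assumption based on the sample lines above:

```shell
# Pair up "creating temp file" / "renaming temp file" lines per thread id
# and print the elapsed seconds of each checkpoint.
checkpoint_times() {
  awk '
    /queue_checkpoint: creating temp file/ {
      # convert HH:MM:SS (field 3) to seconds, keyed by thread id (field 6)
      split($3, t, ":"); start[$6] = t[1]*3600 + t[2]*60 + t[3]
    }
    /queue_checkpoint: renaming temp file/ && ($6 in start) {
      split($3, t, ":")
      print $6, (t[1]*3600 + t[2]*60 + t[3]) - start[$6], "seconds"
      delete start[$6]
    }
  ' "$1"
}

# usage: checkpoint_times mcollect.log
```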
A checkpoint is done whenever a collection request/CTOC is added to or removed from a queue. The more threads you have running, the more often the checkpoint process occurs. Customers may want to consider lowering the output thread counts on the collectors that feed into the data handler. This reduces the number of mc_request_collection upcalls and slows the rate at which CTOCs enter the input queue on the data handler. This should be acceptable, since the rate at which data leaves the data handler is typically far slower than the rate at which data enters it.

Problem: How does one reset collector settings back to the defaults? For example, it is impossible to shrink the depot size setting.

Solution: Make sure no Inventory distributions are active and wgetscanstat indicates that no scans are currently running.
1) Shut down all collectors and the data handler in the TMR.
2) Go to each collector and delete the contents of the run-time directory (usually $DBDIR/Mcollect).
3) Go to the data handler and delete the contents of the run-time directory (usually $DBDIR/inventory/data_handler).
4) On the TMR server, source the Tivoli environment and run the attached script. This script removes the settings for the collectors and the data handler from the collection manager. NOTE: this script resets all the settings for every collector and the data handler in the TMR back to the defaults.
5) Start up each collector and the data handler, and notice that the depot size is now 40 MB. This can now be increased, up to 1 GB (assuming 3.7.1-CLL-0006 is installed).

#!/bin/sh
#######################################
# this should only run on the TMR server
########################################
# find the Collection Manager for this TMR
#
CMO=`wlookup MCollect_coll_mgr`
#
# clean out object database entries on the TMR server
# created by the collection manager to hold
# collector attributes
#
set +e
attrlist=`odbls -a -k $DBDIR $CMO | grep mctune | awk {'print $1'} \
  | sed -e 's/^_//'`
for attr in $attrlist; do
  echo removing collector attribute $attr
  objcall $CMO o_rmattr $attr
done
###########################################################
Problem: Can I use the default Mcollect debug_log_size to collect debug information in a heavy Inventory distribution?

Solution: No. It is better to change the size to at least 300 MB by running the following commands:
wcollect -g 300 @InvDataHandler:inv_data_handler
wcollect -g 300 @Gateway:gw-name
Problem: Can I use the Data Handler mcollect.log file to size the tablespace that hosts the Inventory database?

Solution: Yes. Since 4.2-INV-FP01, 4.2.1-INV-FP01, 4.0-INV-0052, and 4.0-INV-FP08, for each successful insertion into the DB the following new rows are logged in the DH mcollect.log:

IR: db_ret_code = [0]
DBrate: finish time: 1111709687
DBrate: 2672646 bytes in 379 seconds
DBrate: bytes/second = 7051.84
DBrate: kilobytes/minute = 423.11

So, if you collect the DH mcollect.log and issue:

cat mcollect.log | grep DBrate | grep "bytes in"

you'll have the number of bytes that were actually inserted into the DB. Then you can compute the average and estimate how many bytes are required for each endpoint.
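For example, a minimal sketch (assuming the DBrate line format shown above) that computes the average bytes inserted per endpoint:

```shell
# Average bytes inserted into the DB per endpoint, from the DH mcollect.log.
# Assumes each successful insertion logs "DBrate: <N> bytes in <S> seconds".
avg_bytes_per_endpoint() {
  grep 'DBrate:' "$1" | grep 'bytes in' | awk '
    { total += $2; n++ }
    END { if (n) printf "%d endpoints, %.0f bytes/endpoint on average\n", n, total / n }'
}

# usage: avg_bytes_per_endpoint mcollect.log
```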
Problem: How much time does the RIM/DB take to insert data coming from one endpoint into the database?

Solution: Again, since 4.2-INV-FP01, 4.2.1-INV-FP01, 4.0-INV-0052, and 4.0-INV-FP08, the same DBrate rows are logged in the DH mcollect.log for each successful insertion into the DB. So, if you collect the DH mcollect.log and issue:

cat mcollect.log | grep DBrate | grep "bytes in"

you'll get the seconds taken by the RIM/DB to insert the data of each .DAT file.
Problem: Can I evaluate the ideal maximum number of endpoints that the Data Handler can process in a given configuration?

Solution: Yes. Get the time taken by the RIM/DB to insert each .DAT into the DB, as explained in the previous section. Then compute the average of all these values; call this average T. The formula:

(3600 / T) * (DH output threads)

gives you the maximum number of endpoints that a DH is able to manage in one hour; call this value the ideal rate. If you now compare this value with the actual one (run cat mcollect.log | grep DBrate | grep "bytes in" and count how many rows you get per hour), you can see how close the actual rate is to the ideal rate. If the two values are significantly different, the DH is wasting time (for example, on database indexing), so tuning parameters like wcollect -n and -w, which we cover in the next section, or some database tuning, may help.
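A sketch of this calculation, assuming the DBrate line format shown earlier; the output thread count is an input you supply (for example, read from wcstat):

```shell
# Compute T (average RIM/DB insertion seconds per .DAT) from the DH
# mcollect.log, then apply the formula (3600 / T) * output_threads to
# estimate the ideal rate in endpoints per hour.
ideal_rate() {
  log=$1
  threads=$2
  grep 'DBrate:' "$log" | grep 'bytes in' | awk -v threads="$threads" '
    { total += $5; n++ }    # $5 is the seconds field of "DBrate: N bytes in S seconds"
    END {
      if (n) {
        T = total / n
        printf "T = %.1f s, ideal rate = %.0f endpoints/hour\n", T, (3600 / T) * threads
      }
    }'
}

# usage: ideal_rate mcollect.log 15
```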
The following table contains information about some interesting fixes that can help in the area of tuning.

Defect/APAR: IY54266
Description: Symptoms: The Inventory data handler performance slows down when it is receiving a large number of collection requests from collectors. This causes the database throughput rate to be slow and results in a longer completion time for the data collection. Solution: Added some tuning parameters to the wcollect command to help throttle the rate at which collection requests arrive at the data handler. This allows the database throughput rate to be much faster.

Defect/APAR: 170964
Description: The size of the output queue is regulated just as the size of the input queue is regulated. This prevents the output queue from growing too large and rejects collection requests coming into the data handler based on the size of both the input and output queues.

Defect/APAR: IY50922
Description: The size of the data handler input queue is limited to 1000 items. This helps speed up the queue checkpointing process and allows more CPU time for database insertion tasks. Note: this defect fix has been enhanced by the fixes for IY54266 and 170964.

Defect/APAR: IY70039
Description: It is not possible to configure the INPUT_QUEUE_THRESHOLD parameter ('wcollect -n') for the WAN entry point collector, but only for a data handler object. In a scenario of interconnected TMRs, the INPUT_QUEUE_THRESHOLD must also be configured for the WAN entry point collector. After applying this fix, the WAN entry point collector rejects the CTOCs sent by the spoke collectors when the number of CTOCs in the input queue or in the output queue is greater than the INPUT_QUEUE_THRESHOLD. This threshold was introduced to keep the checkpointGL_iqfile.dat and checkpointGL_oqfile.dat files small, avoiding the long delays seen updating these files when they were allowed to become large under the previous design.

Defect/APAR: IY70283
Description: If the DH crashed and the DH scheduler is therefore down, mc_request_collection is still invoked by some gateway collectors, but because the threshold has been reached, the CTOC is rejected and the DH does not attempt to start the scheduler thread. Since the scheduler is responsible for activating the output threads that process the output queue, and the scheduler is down, the output queue is never drained, so the number of output entries has no chance of dropping below the threshold. With this fix, when the threshold is reached, if the scheduler has not been started yet, it is started by the data handler.
A Tuning Example
Here are the recommended parameters, using the fixes listed above:

wcollect -n 200 @InvDataHandler:inv_data_handler
wcollect -t 2 @InvDataHandler:inv_data_handler
wcollect -o 15 @InvDataHandler:inv_data_handler
wcollect -w 10 @Gateway:gw-name
wcollect -o 2 @Gateway:gw-name
Explanation: The following arguments are provided at these patch levels: 4.0-INV-0052, 3.7.1-CLL-FP01, 4.2-INV-0021, 4.1-CLL-FP01, 4.2.1-INV-FP01, 4.1.1-CLL-FP01.

1. wcollect -n 200 @InvDataHandler:inv_data_handler sets the maximum input queue size and regulates the output queue size on the data handler. Stop and restart the data handler, then verify the -n setting with the wcstat command:

wcstat @InvDataHandler:inv_data_handler

The INPUT_QUEUE_THRESHOLD property should now be set to 200. The goal is to keep a constant supply of datapacks available for the data handler to write, while keeping the size of the output queue small. You can verify this by watching the output thread utilization of the data handler.
The downstream collector remains in single output-thread mode, with the "p" delay, for 3 minutes. After this 3-minute period, if no rejects have been received from the Data Handler, the collector comes out of single-thread mode and resumes using its configured number of output threads with no delays. In the collector mcollect logs, check for the string "scheduler_reduce_to_one_output_worker"; this indicates that the collector has entered one-output-thread mode. You can also monitor the output thread usage with:

tail -f mcollect.log | egrep -i "threads for|scheduler_reduce_to_one"

Since the requests coming into the data handler are throttled up and down automatically, the data handler can spend more time performing its primary function: inserting data into the database.

Notice: The fix for IY70039 was provided in 4.1-CLL-0007, 4.1.1-CLL-FP03, and 4.2.2-CLL-FP01. The fix for IY70283 was provided in 4.2-INV-0034, 4.2.1-INV-FP03, and 4.2.2-INV-FP01.
Reference:
Redbook: All About IBM Tivoli Configuration Manager Version 4.2
http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246612.html?Open
(See chapter 9 of this redbook for information about SCS and the data handler architecture, status collector)
Credits:
Navin Manohar, my Mcollect guru, who wrote the Mcollect knowledge transfer document on which this one is based.