Post-Installation Health Check
Verify Docker Containers
Navigate to the Network Copilot installation directory and run
docker-compose ps
to verify that all containers are running.
The State column should show Up (or Up (healthy)) for every service.
aviz@ncp:~/ncp-1711792447-onprem$ docker-compose ps
Name Command State Ports
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
broker /etc/confluent/docker/run Up 0.0.0.0:29092->29092/tcp,:::29092->29092/tcp, 0.0.0.0:9092->9092/tcp,:::9092->9092/tcp, 0.0.0.0:9101->9101/tcp,:::9101->9101/tcp
chromadb chroma run --host 0.0.0.0 ... Up 8000/tcp
collector-db /docker-entrypoint.sh postgres Up 0.0.0.0:5432->5432/tcp,:::5432->5432/tcp, 8008/tcp, 8081/tcp
docker python3 app.py Up
flow-collector ./goflow2 -transport=kafka ... Up 0.0.0.0:2055->2055/udp,:::2055->2055/udp, 0.0.0.0:6343->6343/udp,:::6343->6343/udp, 0.0.0.0:8099->8080/tcp,:::8099->8080/tcp
gnmi-collector java -jar -XX:MaxGCPauseMi ... Up 0.0.0.0:50053->50053/tcp,:::50053->50053/tcp, 8093/tcp
gnmi-gateway ./gnmi-gateway -TargetLoad ... Up 0.0.0.0:9339->9339/tcp,:::9339->9339/tcp
kafka-connect /etc/confluent/docker/run Up (healthy) 0.0.0.0:8083->8083/tcp,:::8083->8083/tcp, 9092/tcp
ksqldb-server /usr/bin/docker/run Up 0.0.0.0:8088->8088/tcp,:::8088->8088/tcp
ncp-api gunicorn main:app -w 4 -k ... Up 0.0.0.0:9001->8000/tcp,:::9001->8000/tcp
ncp-db docker-entrypoint.sh postgres Up 5432/tcp
ncp-ui docker-entrypoint.sh node ... Up 3002/tcp, 0.0.0.0:443->443/tcp,:::443->443/tcp
ncp-vllm entrypoint.sh Up
schema-registry /etc/confluent/docker/run Up 0.0.0.0:8081->8081/tcp,:::8081->8081/tcp
snmp-collector java -jar -XX:MaxGCPauseMi ... Up 8093/tcp
stream-processor java -jar /app/stream-proc ... Up 8080/tcp
zookeeper /etc/confluent/docker/run Up 0.0.0.0:2181->2181/tcp,:::2181->2181/tcp, 2888/tcp, 3888/tcp
aviz@ncp:~/ncp-1711792447-onprem$
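To spot failures quickly, you can filter the listing for any service whose State is not Up. This is a minimal sketch; tail -n +3 skips the two header lines, and the pattern may need adjusting for other docker-compose versions:
# Print only services that are not in the Up state (empty output means all healthy)
docker-compose ps | tail -n +3 | grep -v ' Up '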
Check Docker Volumes
Ensure the following Docker volumes are present:
ncp-collector-db-data
ncp-db-data
ncp-ui-data
ravi@ncp02:~$ docker volume ls
DRIVER VOLUME NAME
local ncp-collector-db-data
local ncp-db-data
local ncp-ui-data
ravi@ncp02:~$
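If you prefer a scripted check, docker volume inspect exits non-zero for a missing volume; a sketch over the three names above:
# Report presence of each required NCP volume
for v in ncp-collector-db-data ncp-db-data ncp-ui-data; do
  docker volume inspect "$v" >/dev/null 2>&1 && echo "$v: present" || echo "$v: MISSING"
done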
Check Docker Networks
Ensure the following Docker network is present:
aviz-shared-network
ravi@ncp02:~$ docker network ls
NETWORK ID NAME DRIVER SCOPE
8a0d2a14519a aviz-shared-network bridge local
ravi@ncp02:~$
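docker network inspect confirms both that the network exists and which containers are attached to it. A sketch using Docker's built-in Go templating:
# Show the driver and the number of attached containers
docker network inspect aviz-shared-network --format '{{.Driver}}: {{len .Containers}} containers attached'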
Check Docker logs
If any container keeps restarting or crashing, check its logs for errors with docker logs <service-name>
ravi@ncp02:~$ docker logs ncp-ui
(node:1) ExperimentalWarning: Import assertions are not a stable feature of the JavaScript language. Avoid relying on their current behavior and syntax as those might change in a future version of Node.js.
(Use `node --trace-warnings ...` to show where the warning was created)
(node:1) ExperimentalWarning: Importing JSON modules is an experimental feature and might change at any time
{
dbUrl: 'postgresql://postgres:35e6eb4c890133c9726c2479eecd533db1634872@collector-db:5432/collector?schema=public'
}
starting cron job
(node:1) [DEP0111] DeprecationWarning: Access to process.binding('http_parser') is deprecated.
HTTP/2 server listening on port: 443
running syncInventory
{
newUrl: 'http://rule-service:8080/rule-engine/config/v1/sync-device-info'
}
{
req: '/api/Device/imageManagementStatus?neededIPs=[]',
ip: '::ffff:10.0.1.5',
time: 2025-02-04T05:36:57.289Z
.....
ravi@ncp02:~$ docker logs ncp-llm
+ python3 -O -u -m vllm.entrypoints.openai.api_server --tensor-parallel-size 1 --worker-use-ray --host 0.0.0.0 --port 8000 --model=/root/models/Llama-3.1-8B-Instruct --served-model-name Llama-3.1-8B-Instruct --enable-lora --lora-modules sql-lora0=/app/lora/ncp-llm-v1.4_lora --max-lora-rank=256 --max-model-len 16384
INFO 02-05 08:50:47 api_server.py:459] vLLM API server version 0.6.0
INFO 02-05 08:50:47 api_server.py:460] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=[LoRAModulePath(name='sql-lora0', path='/app/lora/ncp-llm-v1.4_lora')], prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, model='/root/models/Llama-3.1-8B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=16384, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=True, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=True, max_loras=1, max_lora_rank=256, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Llama-3.1-8B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 02-05 08:50:47 api_server.py:160] Multiprocessing frontend to use ipc:///tmp/032821f5-aa4b-4da9-842c-d1b19fda5047 for RPC Path.
INFO 02-05 08:50:47 api_server.py:176] Started engine process with PID 79
2025-02-05 08:50:54,839 INFO worker.py:1819 -- Started a local Ray instance.
INFO 02-05 08:50:56 llm_engine.py:213] Initializing an LLM engine (v0.6.0) with config: model='/root/models/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='/root/models/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Llama-3.1-8B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=True)
INFO 02-05 08:50:57 ray_gpu_executor.py:134] use_ray_spmd_worker: False
INFO 02-05 08:51:02 model_runner.py:915] Starting to load model /root/models/Llama-3.1-8B-Instruct...
.....
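To scan a container's recent logs for obvious failures in one pass, redirect stderr (docker logs writes to both streams) and grep for common markers. A minimal sketch; extend the pattern to suit your services:
# Search the last 500 log lines of a service for errors
docker logs --tail 500 ncp-api 2>&1 | grep -iE 'error|exception|fatal'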
System Resource Check
To ensure the system has sufficient resources available, use the command docker stats
ravi@ncp02:~$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
99e6c77c3832 ncp-flow-collector 0.00% 7.039MiB / 251.5GiB 0.00% 29MB / 4.97MB 0B / 0B 25
c4ba63cfcd84 ncp-snmp-collector 0.33% 2.337GiB / 251.5GiB 0.93% 9.55GB / 8.88GB 44.8MB / 1.33GB 109
8739bc87e748 ncp-docker 0.00% 34.02MiB / 251.5GiB 0.01% 149MB / 123MB 18.1MB / 0B 1
fc4db16af2c0 ncp-api 0.82% 603.1MiB / 251.5GiB 0.23% 393MB / 497MB 7.29MB / 143kB 104
56ee968917a8 ncp-ui 0.00% 61.72MiB / 251.5GiB 0.02% 54.8MB / 44MB 8.99MB / 0B 11
ee782add15e8 ncp-streams-processor 0.31% 981.4MiB / 251.5GiB 0.38% 5.75GB / 4.68GB 2.23MB / 1.29GB 90
e5ee19351568 ncp-gnmi-collector 4.79% 11.63GiB / 251.5GiB 4.62% 249GB / 130GB 106MB / 2.04GB 531
5d973d295939 ncp-gateway 0.09% 228.6MiB / 251.5GiB 0.09% 167GB / 216GB 7.45MB / 10.8MB 87
1f920f9c47f2 ncp-connect 5.89% 2.676GiB / 251.5GiB 1.06% 33.4GB / 68.2GB 353MB / 1.86GB 205
f655948cef18 ncp-ksqldb 32.10% 5.477GiB / 251.5GiB 2.18% 189GB / 320GB 109MB / 149GB 793
fbc4edaecf75 ncp-schema-registry 0.49% 587.6MiB / 251.5GiB 0.23% 279MB / 253MB 24.6MB / 964MB 105
b115d7125d18 ncp-broker 10.75% 3.175GiB / 251.5GiB 1.26% 422GB / 227GB 2.89GB / 554GB 144
3e29b890b796 ncp-zookeeper 0.84% 492MiB / 251.5GiB 0.19% 286MB / 404MB 66.4MB / 912MB 250
5610f7e40faa ncp-db 0.00% 41.4MiB / 251.5GiB 0.02% 506MB / 379MB 168kB / 730MB 19
6f85f6fedf93 ncp-llm 130.56% 10.26GiB / 251.5GiB 4.08% 4.18MB / 517kB 726MB / 3.97GB 6246
419a0d3400c2 ncp-knowledgebase 0.19% 91.99MiB / 251.5GiB 0.04% 11.5MB / 14.2kB 53.2kB / 9.65MB 73
1e07e05fb7c4 ncp-collector-db 0.05% 25.83GiB / 251.5GiB 10.27% 94.1GB / 38.9GB 975MB / 1.3TB 55
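docker stats refreshes continuously; for a one-shot snapshot that is easier to paste into a support ticket, add --no-stream. A sketch with a trimmed column set:
# Single snapshot of per-container CPU and memory usage
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}'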
Verify Kafka Broker
Ensure the Kafka message broker is properly configured by listing its topics:
ravi@ncp02:~$ docker exec -it ncp-broker bash
[appuser@broker ~]$ kafka-topics --bootstrap-server localhost:9092 --list
CPU_MEMORY_UPDATES
DEVICE_UPDATES
FAN_UPDATES
INTERFACE_UPDATES
LINK_EVENTS
LINK_UPDATES
ONES_ALERT_EVENTS
ONES_RULE_EVENTS
ONES_RULE_NOTIFICATION_EVENTS
OUTPUT_BGP_NEIGHBORS
OUTPUT_BGP_NEIGHBORS_STATISTICS
OUTPUT_BGP_STATISTICS
OUTPUT_DEVICE_CPU_CORE_TEMPERATURE
OUTPUT_DEVICE_CPU_LOAD
OUTPUT_DEVICE_CPU_UTILIZATION
OUTPUT_DEVICE_CRM_ACL_STATISTICS
OUTPUT_DEVICE_DOCKER_STATISTICS
OUTPUT_DEVICE_FAN_STATUS
OUTPUT_DEVICE_MEMORY_UTILIZATION
OUTPUT_DEVICE_PSU_STATUS
OUTPUT_DEVICE_RESOURCE_UTILIZATION
OUTPUT_FLOW_DATA
OUTPUT_INTERFACE_FRAME_COUNTERS
OUTPUT_INTERFACE_LLDP_STATISTICS
OUTPUT_INTERFACE_RATE
OUTPUT_INTERFACE_RECEIVE_QUEUE_COUNTERS
OUTPUT_INTERFACE_STATISTICS
OUTPUT_INTERFACE_STATUS
OUTPUT_INTERFACE_TRANSMIT_QUEUE_COUNTERS
OUTPUT_INTERFACE_UTILIZATION
OUTPUT_MCLAG
OUTPUT_PORTCHANNEL
OUTPUT_SSD
OUTPUT_VTEP
PSU_UPDATES
SOURCE_BGP_NEIGHBORS
SOURCE_BGP_NEIGHBORS_STATISTICS
SOURCE_BGP_STATISTICS
SOURCE_COMPONENT_METRIC_GAUGE_EVENTS
SOURCE_DEVICE_CPU_CORE_TEMPERATURE
SOURCE_DEVICE_CPU_LOAD
SOURCE_DEVICE_CPU_UTILIZATION
SOURCE_DEVICE_CRM_ACL_STATISTICS
SOURCE_DEVICE_DOCKER_STATISTICS
SOURCE_DEVICE_FAN_STATUS
SOURCE_DEVICE_INVENTORY
SOURCE_DEVICE_MEMORY_UTILIZATION
SOURCE_DEVICE_METRIC_GAUGE_EVENTS
SOURCE_DEVICE_PSU_STATUS
SOURCE_DEVICE_RESOURCE_UTILIZATION
SOURCE_FLOW_DATA
SOURCE_INTERFACE_FRAME_COUNTERS
SOURCE_INTERFACE_LLDP_STATISTICS
SOURCE_INTERFACE_METRIC_GAUGE_EVENTS
SOURCE_INTERFACE_RATE
SOURCE_INTERFACE_RECEIVE_QUEUE_COUNTERS
SOURCE_INTERFACE_STATISTICS
SOURCE_INTERFACE_STATUS
SOURCE_INTERFACE_TRANSMIT_QUEUE_COUNTERS
SOURCE_INTERFACE_UTILIZATION
SOURCE_MCLAG
SOURCE_METRICS_GAUGE_EVENTS
SOURCE_PORTCHANNEL
SOURCE_SSD
SOURCE_VTEP
TRANSCEIVER_DOM_EVENTS
TRANSCEIVER_EVENTS
TW_INTERFACE_METRIC_AGGR_EVENT
TW_METRIC_AGGR_EVENT
__consumer_offsets
__transaction_state
_confluent-ksql-default__command_topic
_confluent-ksql-default_transient_transient_SOURCE_INTERFACE_METRIC_GAUGE_EVENTS_2147244345813467970_1738646234981-Aggregate-Aggregate-Materialize-changelog
_confluent-ksql-default_transient_transient_SOURCE_INTERFACE_METRIC_GAUGE_EVENTS_2147244345813467970_1738646234981-Aggregate-GroupBy-repartition
_confluent-ksql-default_transient_transient_SOURCE_METRICS_GAUGE_EVENTS_840353255721300468_1738646234970-Aggregate-Aggregate-Materialize-changelog
_confluent-ksql-default_transient_transient_SOURCE_METRICS_GAUGE_EVENTS_840353255721300468_1738646234970-Aggregate-GroupBy-repartition
_connect-configs
_connect-offsets
_connect-status
_schemas
default_ksql_processing_log
globalNOTIFICATION
s3-records
sflow-records
[appuser@broker ~]$ exit
exit
ravi@ncp02:~$
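The same check can be run non-interactively from the host, which is handier for scripting. A sketch, assuming the broker container is named ncp-broker as above:
# Count the topics on the broker without opening a shell
docker exec ncp-broker kafka-topics --bootstrap-server localhost:9092 --list | wc -l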
Verify Stream Processor
Ensure the ksqlDB server is running and reports a RUNNING status:
ravi@ncp02:~$ curl -s "http://localhost:8088/info" | jq
{
"KsqlServerInfo": {
"version": "0.28.2",
"kafkaClusterId": "myPMAcK3TJ-Y6Z2g4JJBdA",
"ksqlServiceId": "default_",
"serverStatus": "RUNNING"
}
}
ravi@ncp02:~$
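To reduce this to a single pass/fail value for monitoring, extract serverStatus with jq. A sketch:
# Prints RUNNING when the ksqlDB server is healthy
curl -s http://localhost:8088/info | jq -r '.KsqlServerInfo.serverStatus'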
Verify Kafka Connectors
Ensure all the Kafka connectors are created. If the list below is empty, run the create-connectors.sh script in the installation directory to create them.
ravi@ncp02:~$ curl -s "http://localhost:8083/connectors" | jq
[
"OUTPUT_DEVICE_CPU_LOAD",
"OUTPUT_INTERFACE_LLDP_STATISTICS",
"OUTPUT_INTERFACE_RATE",
"OUTPUT_DEVICE_CRM_ACL_STATISTICS",
"OUTPUT_DEVICE_DOCKER_STATISTICS",
"TW_METRIC_AGGR_EVENT",
"OUTPUT_DEVICE_MEMORY_UTILIZATION",
"OUTPUT_BGP_NEIGHBORS_STATISTICS",
"OUTPUT_VTEP",
"OUTPUT_FLOW_DATA",
"OUTPUT_INTERFACE_UTILIZATION",
"OUTPUT_INTERFACE_FRAME_COUNTERS",
"OUTPUT_DEVICE_RESOURCE_UTILIZATION",
"OUTPUT_MCLAG",
"device-sink-connector",
"OUTPUT_DEVICE_PSU_STATUS",
"DEVICE_TABLE_CONNECTOR",
"TW_INTERFACE_METRIC_AGGR_EVENT",
"OUTPUT_INTERFACE_STATISTICS",
"OUTPUT_BGP_NEIGHBORS",
"OUTPUT_INTERFACE_TRANSMIT_QUEUE_COUNTERS",
"OUTPUT_INTERFACE_RECEIVE_QUEUE_COUNTERS",
"OUTPUT_INTERFACE_STATUS",
"OUTPUT_PORTCHANNEL",
"OUTPUT_DEVICE_FAN_STATUS",
"OUTPUT_DEVICE_CPU_UTILIZATION",
"OUTPUT_DEVICE_CPU_CORE_TEMPERATURE",
"OUTPUT_BGP_STATISTICS"
]
ravi@ncp02:~$
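Existence alone does not guarantee the connectors are healthy. The Connect REST API can report each connector's state in one call; a sketch using the ?expand=status query parameter (available in Kafka Connect 2.3 and later):
# List every connector with its runtime state (expect RUNNING)
curl -s 'http://localhost:8083/connectors?expand=status' | jq -r 'to_entries[] | "\(.key): \(.value.status.connector.state)"'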
Verify GPU
Ensure the system recognizes the GPU
If the command below is not recognized, see the 'Install GPU drivers' section to install the drivers.
ravi@ncp02:~$ nvidia-smi
Wed Feb 12 20:40:43 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:3B:00.0 Off | Off |
| 30% 32C P8 23W / 300W | 14MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
ravi@ncp02:~$
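For scripted checks, nvidia-smi's query mode returns machine-readable fields instead of the full table. A sketch:
# Report GPU model, driver version, and total VRAM as CSV
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader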
Verify GPU VRAM
Ensure the GPU VRAM is free before proceeding with Network Copilot: run nvidia-smi again and confirm that 'No running processes found' appears in the Processes table, as in the output above.
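The same query mode can confirm VRAM is free; a sketch (expect memory.used near zero, like the 14 MiB reading above):
# Report current versus total VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader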
Verify LLM Container
Ensure the LLM was successfully loaded onto the GPU
ravi@ncp02:~$ docker logs -f ncp-llm
....
INFO 02-12 12:44:26 launcher.py:33] Launching Uvicorn with --limit_concurrency 32765. To avoid this limit at the expense of performance run with --disable-frontend-multiprocessing
INFO: Started server process [7]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 02-12 12:44:37 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-12 12:44:47 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-12 12:44:57 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
....
ravi@ncp02:~$ nvidia-smi
Wed Feb 12 20:46:11 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A6000 On | 00000000:3B:00.0 Off | Off |
| 30% 34C P8 22W / 300W | 42517MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3387365 C /usr/bin/python3 42492MiB |
+---------------------------------------------------------------------------------------+
ravi@ncp02:~$
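Once the log shows 'Application startup complete.', the OpenAI-compatible endpoint can be probed directly. A sketch that runs inside the container, since the vLLM port is not published to the host; it assumes curl is available in the ncp-llm image:
# List the served models; expect Llama-3.1-8B-Instruct in the response
docker exec ncp-llm curl -s http://localhost:8000/v1/models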