Analysis: Disabling the NIC on a keepalived VM Leaves No Node Holding the VIP

Source: original article on this site · Category: Network Technology

Test environment
keepalived-1.2.13-8.el7.x86_64
CentOS Linux release 7.3.1611 (Core)
Linux shtslb01 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

keepalived is running on both machines in master/backup mode.
The VIP address answers ping normally.

1. Disable the master's NIC
ifdown ens33

Ping the keepalived master:
[c:\~]$ ping 172.16.9.67

Pinging 172.16.9.67 with 32 bytes of data:
Request timed out.

Ping the VIP address:
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61
Request timed out.
Request timed out.
Request timed out.

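Symptom
The keepalived processes are still running on both machines, but neither the master nor the backup adds the VIP address.

The cause is that the master never sends a priority-0 advertisement to the backup, so the backup, following the VRRP mechanism, never adds the VIP.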

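Update 09-15: the root cause has been identified. In a VM environment, ifdown ens160 shuts down the VM's NIC, yet a capture taken on another machine with

tcpdump -i ens160 -nn vrrp

shows that this VM keeps sending VRRP advertisements. Because the backup's priority (10) is lower than the advertised 100, it cannot take over. This is the puzzling part.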
10:11:18.356859 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
10:11:19.357352 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
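
To see at a glance which node is still advertising, the capture can be reduced to one sender/priority pair per line. The following is a minimal sketch and not part of the original troubleshooting: vrrp_senders is a hypothetical helper name, and it assumes the tcpdump output format shown above.

```shell
#!/bin/bash
# Hypothetical helper: reduce tcpdump VRRP lines (read from stdin) to unique
# "source-IP priority" pairs so a node that keeps advertising stands out.
vrrp_senders() {
    # Field 3 is the source IP; the priority is the field after "prio".
    awk '/VRRPv2, Advertisement/ {
        for (i = 1; i <= NF; i++)
            if ($i == "prio") { p = $(i + 1); sub(/,$/, "", p); print $3, p }
    }' | sort -u
}

# Demo on the two capture lines above; prints "172.16.9.67 100".
vrrp_senders <<'EOF'
10:11:18.356859 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
10:11:19.357352 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
EOF
```

On a live system the input could come from a bounded capture such as tcpdump -i ens160 -nn -c 10 vrrp | vrrp_senders; the packet count of 10 is an arbitrary choice.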

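Shutting down the VM, suspending it, or killing the process all trigger a failover to the backup; only disabling the NIC leaves the pair without a VIP.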

Test 1: suspend the VM
After suspending the VM, the capture shows the backup taking over within about four seconds.
10:18:16.859199 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
10:18:17.860395 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
10:18:18.861724 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
-------------- failover begins --------------
10:18:22.823343 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
-------------- failover ends --------------
10:18:23.825414 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
10:18:24.826114 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
10:18:25.826719 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
10:18:26.827314 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
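
The gap between the master's last advertisement (10:18:18.861724) and the backup's first one (10:18:22.823343) is about 3.96 seconds, which is consistent with the VRRPv2 timer defined in RFC 3768. The calculation below is a small worked sketch, assuming keepalived follows the RFC timers here; master_down_interval is a hypothetical helper name.

```shell
#!/bin/bash
# Master_Down_Interval per RFC 3768 (VRRPv2):
#   Skew_Time            = (256 - priority) / 256
#   Master_Down_Interval = 3 * advert_int + Skew_Time
# master_down_interval is a hypothetical helper name for this sketch.
master_down_interval() {
    local priority=$1 advert_int=$2
    LC_ALL=C awk -v p="$priority" -v a="$advert_int" \
        'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 }'
}

# Backup from the capture: priority 10, advert_int 1; prints 3.961.
master_down_interval 10 1
```

A lower backup priority gives a longer skew time, so a priority-10 backup waits almost a full extra second beyond three advertisement intervals before declaring the master down.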

Test 2: resume the master

The VIP has returned to the master.
10:21:15.930309 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
-------------- switchback begins --------------
10:21:15.930829 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20
10:21:15.930965 IP 172.16.9.68 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 10, authtype simple, intvl 1s, length 20
-------------- switchback ends --------------
10:21:15.931328 IP 172.16.9.67 > 224.0.0.18: VRRPv2, Advertisement, vrid 51, prio 100, authtype simple, intvl 1s, length 20

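Related test: pinging the VIP address during the switchover drops one packet.
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61
Request timed out.
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61
Reply from 172.16.9.69: bytes=32 time<1ms TTL=61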

Interface addresses on the master:
[root@shtslb01 ~]#
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:9d:5d:16 brd ff:ff:ff:ff:ff:ff
    inet 172.16.9.67/24 brd 172.16.9.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet 172.16.9.69/32 scope global ens160    <-- VIP address
       valid_lft forever preferred_lft forever

---------- Failure analysis ----------
With this version of keepalived, neither machine brings up the VIP. That is the problem this case needs to solve.

Master log:
Sep 14 14:40:44 mysql01 NetworkManager[627]: <info>  [1505371244.4998] device (ens33): state change: activated -> deactivating (reason 'user-requested') [100 110 39]
Sep 14 14:40:44 mysql01 NetworkManager[627]: <info>  [1505371244.5006] manager: NetworkManager state is now DISCONNECTING    <-- network shut down
Sep 14 14:40:44 mysql01 dbus[617]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Sep 14 14:40:44 mysql01 dbus-daemon: dbus[617]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Sep 14 14:40:44 mysql01 systemd: Starting Network Manager Script Dispatcher Service...
Sep 14 14:40:44 mysql01 NetworkManager[627]: <info>  [1505371244.6169] audit: op="device-disconnect" interface="ens33" ifindex=2 pid=126800 uid=0 result="success"
Sep 14 14:40:44 mysql01 NetworkManager[627]: <info>  [1505371244.6178] device (ens33): state change: deactivating -> disconnected (reason 'user-requested') [110 30 39]
Sep 14 14:40:44 mysql01 Keepalived_vrrp[126656]: Netlink reflector reports IP fe80::20c:29ff:fe5c:8574 removed
Sep 14 14:40:44 mysql01 Keepalived_healthcheckers[126655]: Netlink reflector reports IP fe80::20c:29ff:fe5c:8574 removed
Sep 14 14:40:44 mysql01 dbus[617]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Sep 14 14:40:44 mysql01 systemd: Started Network Manager Script Dispatcher Service.
Sep 14 14:40:44 mysql01 dbus-daemon: dbus[617]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Sep 14 14:40:44 mysql01 Keepalived_vrrp[126656]: Netlink reflector reports IP 192.168.142.138 removed    <-- address removed
Sep 14 14:40:44 mysql01 Keepalived_vrrp[126656]: Netlink reflector reports IP 192.168.142.188 removed
Sep 14 14:40:44 mysql01 Keepalived_healthcheckers[126655]: Netlink reflector reports IP 192.168.142.138 removed
Sep 14 14:40:44 mysql01 Keepalived_healthcheckers[126655]: Netlink reflector reports IP 192.168.142.188 removed
Sep 14 14:40:44 mysql01 nm-dispatcher: req:1 'connectivity-change': new request (4 scripts)
Sep 14 14:40:44 mysql01 nm-dispatcher: req:1 'connectivity-change': start running ordered scripts...
Sep 14 14:40:44 mysql01 NetworkManager[627]: <info>  [1505371244.6525] manager: NetworkManager state is now DISCONNECTED
Sep 14 14:40:44 mysql01 nm-dispatcher: req:2 'down' [ens33]: new request (4 scripts)
Sep 14 14:40:44 mysql01 nm-dispatcher: req:2 'down' [ens33]: start running ordered scripts...
Sep 14 14:40:44 mysql01 chronyd[647]: Source 172.30.100.139 offline
Sep 14 14:40:44 mysql01 chronyd[647]: Can't synchronise: no selectable sources
Sep 14 14:40:50 mysql01 Keepalived_healthcheckers[126655]: TCP socket bind failed. Rescheduling.
Sep 14 14:40:56 mysql01 Keepalived_healthcheckers[126655]: TCP socket bind failed. Rescheduling.
Sep 14 14:41:02 mysql01 Keepalived_healthcheckers[126655]: TCP socket bind failed. Rescheduling.

For comparison, a normal failover log looks like this:
[root@mysql01 ~]# !tail
tail -250 /var/log/messages|less
[root@mysql01 ~]# tail -f /var/log/messages
Sep 14 14:59:11 mysql01 chronyd[647]: System clock wrong by -3.886966 seconds, adjustment started
Sep 14 15:01:01 mysql01 systemd: Started Session 322 of user root.
Sep 14 15:01:01 mysql01 systemd: Starting Session 322 of user root.
Sep 14 15:32:51 mysql01 Keepalived[127003]: Stopping Keepalived v1.2.13 (11/05,2016)    <-- normal stop
Sep 14 15:32:51 mysql01 Keepalived_vrrp[127005]: VRRP_Instance(VI_1) sending 0 priority    <-- priority 0 is sent
Sep 14 15:32:51 mysql01 Keepalived_vrrp[127005]: VRRP_Instance(VI_1) removing protocol VIPs.
Sep 14 15:32:51 mysql01 Keepalived_healthcheckers[127004]: Netlink reflector reports IP 192.168.142.188 removed
Sep 14 15:32:51 mysql01 Keepalived_healthcheckers[127004]: Removing service [192.168.142.138]:3310 from VS [192.168.142.188]:3310
Sep 14 15:32:51 mysql01 Keepalived_healthcheckers[127004]: IPVS: No such destination
Sep 14 15:32:51 mysql01 Keepalived_healthcheckers[127004]: IPVS: No such file or directory

Log on the backup:
Sep 14 15:32:51 mysql02 Keepalived_vrrp[124728]: VRRP_Instance(VI_1) Transition to MASTER STATE
Sep 14 15:32:52 mysql02 Keepalived_vrrp[124728]: VRRP_Instance(VI_1) Entering MASTER STATE
Sep 14 15:32:52 mysql02 Keepalived_vrrp[124728]: VRRP_Instance(VI_1) setting protocol VIPs.
Sep 14 15:32:52 mysql02 Keepalived_vrrp[124728]: VRRP_Instance(VI_1) Sending gratuitous ARPs on ens33 for 192.168.142.188
Sep 14 15:32:52 mysql02 Keepalived_healthcheckers[124727]: Netlink reflector reports IP 192.168.142.188 added
Sep 14 15:32:57 mysql02 Keepalived_vrrp[124728]: VRRP_Instance(VI_1) Sending gratuitous ARPs on ens33 for 192.168.142.188

2. Solution
Add a manual arbitration mechanism: ping the peer and, if it is unreachable, restart the keepalived process.

Arbitration script:
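#!/bin/bash
#------------------ jeff v2 ------------------
IP=172.16.9.69
date="$(date '+%Y-%m-%d %H:%M:%S')"    # timestamp for the log lines
# Extract the packet-loss percentage; it is compared against 0 and 100 below.
lost=$(ping -c 3 -w 3 "$IP" | grep 'packet loss' \
    | awk -F'packet loss' '{ print $1 }' \
    | awk '{ print $NF }' | sed 's/%//g')
lost=${lost:-100}    # no summary line from ping: treat as total loss

if [ "$lost" -eq 0 ]; then        # no loss: log success
    echo "$date ping is ok" >>/var/log/keepalived.log
elif [ "$lost" -lt 100 ]; then    # partial loss: log a warning
    echo "$date ping is error" >>/var/log/keepalived_error.log
else                              # 100% loss: restart the service
    systemctl restart keepalived
fi
#------------------ jeff v2 ------------------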
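
The script above mixes three concerns: running ping, parsing the loss percentage, and choosing an action. The sketch below separates the last two so they can be checked against canned ping output without a live network. parse_loss and decide are hypothetical names, the sample summary lines assume the iputils ping format, and treating missing output as 100% loss is an assumption of this sketch.

```shell
#!/bin/bash
# Hypothetical refactor: parsing and decision logic only, no live ping.

# Read ping output on stdin and print the integer loss percentage.
# Prints 100 when no summary line is found (assumption: treat as total loss).
parse_loss() {
    local pct
    pct=$(sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p' | head -n 1)
    echo "${pct:-100}"
}

# Map a loss percentage to the action the arbitration script would take.
decide() {
    if [ "$1" -eq 0 ]; then
        echo ok
    elif [ "$1" -lt 100 ]; then
        echo warn
    else
        echo restart
    fi
}

# Demo with canned summary lines: prints "ok", then "restart".
decide "$(echo '3 packets transmitted, 3 received, 0% packet loss, time 2002ms' | parse_loss)"
decide "$(echo '3 packets transmitted, 0 received, 100% packet loss, time 1999ms' | parse_loss)"
```

In the real script the input would come from ping -c 3 -w 3 "$IP" | parse_loss, with the result of decide selecting which log file to write or whether to restart keepalived.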

Additions to the keepalived configuration:
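#----------- add begin -----------
vrrp_script cl {
    script "/opt/script/keepalive_check.sh"
    interval 5
    weight 120    # positive weight: priority is raised by 120 while the script exits 0
}
#----------- add end -----------
vrrp_instance VI_1 {
    state BACKUP
    interface ens160
    virtual_router_id 51
    priority 10
    advert_int 1
#----------- add check begin -----------
    track_script {
        cl
    }
#----------- add check end -----------
    # (remaining settings of the instance are unchanged)
}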
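
To make the effect of the weight concrete: as I understand keepalived's documented vrrp_script semantics, a positive weight is added to the base priority while the tracked script exits 0, and a negative weight is applied while it exits non-zero. The sketch below models only that arithmetic. effective_priority is a hypothetical helper, and clamping the result to 1..254 is an assumption of this sketch.

```shell
#!/bin/bash
# Hypothetical model of how a tracked script's weight changes the
# advertised priority:
#   weight > 0: added while the script exits 0
#   weight < 0: added (that is, subtracted) while the script exits non-zero
# Clamping to 1..254 is an assumption of this sketch.
effective_priority() {
    local base=$1 weight=$2 status=$3 prio=$1
    if [ "$weight" -gt 0 ] && [ "$status" -eq 0 ]; then
        prio=$((base + weight))
    elif [ "$weight" -lt 0 ] && [ "$status" -ne 0 ]; then
        prio=$((base + weight))
    fi
    [ "$prio" -lt 1 ] && prio=1
    [ "$prio" -gt 254 ] && prio=254
    echo "$prio"
}

# Backup from the config above: base 10, weight 120.
effective_priority 10 120 0    # script exits 0: prints 130
effective_priority 10 120 1    # script exits non-zero: prints 10
```

Under these assumptions, a backup whose check script exits 0 would advertise 130, which is above the master's 100, so the exit status of the check script matters as much as what it logs.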

Final version of the check script

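#!/bin/bash
#------------------ jeff v2-0914 ------------------
IP=172.16.9.69
date="$(date '+%Y-%m-%d %H:%M:%S')"    # timestamp for the log lines
# Extract the packet-loss percentage from the ping summary line.
lost=$(ping -c 3 -w 3 "$IP" | grep 'packet loss' \
    | awk -F'packet loss' '{ print $1 }' \
    | awk '{ print $NF }' | sed 's/%//g')
lost=${lost:-100}    # no summary line from ping: treat as total loss

if [ "$lost" -eq 0 ]; then        # no loss: log success
    echo "$date ping $IP  is ok" >>/var/log/keepalived.log
elif [ "$lost" -lt 100 ]; then    # partial loss: log an error
    echo "$date ping $IP is error" >>/var/log/keepalived_error.log
else                              # 100% loss: log only, restart is disabled
    echo "$date ping $IP is error" >>/var/log/keepalived_error.log
    #systemctl restart keepalived
    #pkill keepalived
fi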

3. Recovery procedure after the script has run

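1 Restore the master server's network
2 Run systemctl restart network
       systemctl restart keepalived
     and stop the process on the backup: pkill keepalived
3 ps aux |grep keepalive_check
4 Check the addresses with ip a
5 Confirm that the master has recovered
6 Start keepalived on the backup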
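
Steps 4 and 5 amount to checking whether the VIP is listed on the master's interface. A minimal sketch of that check follows; has_vip is a hypothetical helper that reads ip addr output on stdin, and the sample input is the interface listing shown earlier in this article.

```shell
#!/bin/bash
# Hypothetical helper: succeed if `ip addr` output on stdin lists the VIP.
# A fixed-string match on "inet <vip>/" avoids regex escaping of the dots.
has_vip() {
    local vip=$1
    grep -qF "inet ${vip}/"
}

# Demo with the master's interface listing; prints "VIP present".
if has_vip 172.16.9.69 <<'EOF'
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:50:56:9d:5d:16 brd ff:ff:ff:ff:ff:ff
    inet 172.16.9.67/24 brd 172.16.9.255 scope global ens160
    inet 172.16.9.69/32 scope global ens160
EOF
then
    echo "VIP present"
else
    echo "VIP missing"
fi
```

On a live system the same check would be ip addr show dev ens160 | has_vip 172.16.9.69, run on the master after step 2 and again on the backup after step 6.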

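Source: CCIE那点事 http://www.jdccie.com/. All rights reserved. Unless otherwise noted, articles on this site are the author's original work and may be quoted freely with attribution. Full-text reposting is prohibited.
Permalink: http://www.jdccie.com/?p=3562 (please credit CCIE那点事 when quoting)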