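LVS Cluster Architecture and Deployment

Overview

LVS is short for Linux Virtual Server. It is a free software project started by Dr. Wensong Zhang (章文嵩), and its official site is www.linuxvirtualserver.org. LVS is now part of the standard Linux kernel. Before the 2.4 kernel, using LVS meant recompiling the kernel to add the LVS modules. Since the 2.4 series the LVS modules are built in, so no kernel patches are needed and the LVS features can be used directly.

Purpose

LVS is mainly used for load balancing across a server cluster. It works at the transport layer (layer 4) inside the kernel and can deliver high-performance, highly available server clusters. It is cheap: many low-end servers can be combined into one "super server". It is easy to use: configuration is simple and several load-balancing methods are available. It is stable and reliable: if one server in the cluster stops working, the service as a whole is not affected. It also scales well.

LVS started in 1998 and is by now a fairly mature project. It can be used to build highly scalable, highly available network services such as WWW, cache, DNS, FTP, mail, and video/audio on demand. Many well-known sites and organizations run clusters built on LVS, for example Taobao, Tencent, Baidu and sourceforge.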

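Architecture

A server cluster built with LVS consists of three parts:

  • The front-end load-balancing layer, called the Load Balancer
  • The server cluster layer in the middle, called the Server Array
  • The shared data storage layer at the bottom, called the Shared Storage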

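Forwarding modes

First, some terminology:

  • LB (Load Balancer): the load balancer, i.e. the server running LVS (managed with ipvsadm)
  • VIP (Virtual IP): the external IP that serves remote clients. For example, if you serve port 80 under the domain www.a.com, the A record of www.a.com points to the VIP
  • LD (Load Balancer Director): same as LB, the load-balancing director
  • real server: the back-end server that actually provides the service. If you serve port 80, this machine probably runs a web server such as Apache
  • DIP (Director IP): in NAT mode it is the gateway of the back-end real servers; in DR and TUN mode it is used for probing when heartbeat or keepalived is in use
  • RIP (Real Server IP): the IP of a back-end real server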

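NAT mode

How it works: the load balancer rewrites the destination address in the IP header of each client packet to the IP of one of the RSes and forwards the packet to that RS. The RS processes the request and hands the reply to the load balancer, which rewrites the source address of the reply to its own IP (the VIP) and sends it on to the client IP. Both inbound and outbound traffic must pass through the load balancer.

Pros: the physical servers in the cluster can run any operating system that supports TCP/IP, and only the load balancer needs a public IP address.
Cons: limited scalability. When the number of server nodes (ordinary PC servers) grows too large, the load balancer becomes the bottleneck of the whole system.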

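TUNNEL mode

How it works: in tunnel mode the load balancer wraps each client packet in a new outer IP header (whose destination is the RS) and sends it to the RS. The RS strips the outer header to recover the original packet, processes it, and replies directly to the client without going back through the load balancer. Because the RS has to decapsulate the packets sent by the load balancer, it must support IP tunneling, so the RS kernel must be built with the IP tunnel option.

Pros: the load balancer only distributes request packets to the back-end nodes, and the RSes send reply packets straight to the users. This removes most of the traffic from the load balancer, so it is no longer the bottleneck and can handle a very large request volume. One load balancer can distribute to many RSes, and over the public Internet it can distribute across different regions.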

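DR mode

How it works: the load balancer and the RSes all serve the same IP, but only the director answers ARP requests for it; every RS stays silent for ARP requests for this IP. The gateway therefore sends all requests for the service IP to the director. The director picks an RS according to the scheduling algorithm, rewrites the destination MAC address to the RS's MAC (the IP is the same on both) and forwards the request to that RS. The RS receives the packet as if it had come straight from the client, processes it and, because it owns the same IP, returns the data directly to the client. Since the load balancer rewrites the layer-2 header, the load balancer and the RSes must be in the same VLAN.

Pros: as in TUN (tunnel) mode, the load balancer only distributes requests, and replies return to the client over a separate route. Compared with TUN, DR needs no tunnel, so the overhead is lower.
Cons: the real servers and the director server must be on the same network segment.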
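
The three modes differ only in the forwarding flag given to ipvsadm when a real server is added (-m for NAT, -i for TUN, -g for DR). The sketch below prints the commands instead of running them, since the real ones need root and the ip_vs module. The function name lvs_commands, port 80, the rr scheduler and weight 1 are assumptions made for illustration.

```shell
#!/bin/bash
# Print (not run) the ipvsadm commands for one virtual service.
# Usage: lvs_commands VIP MODE RIP...   where MODE is nat, dr or tun
lvs_commands() {
    local vip="$1" mode="$2" flag rip
    shift 2
    case "$mode" in
        nat) flag="-m" ;;   # masquerading (NAT)
        dr)  flag="-g" ;;   # gatewaying (direct routing)
        tun) flag="-i" ;;   # ipip encapsulation (tunneling)
        *)   echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # -A adds the virtual service, -s selects the scheduler
    echo "ipvsadm -A -t ${vip}:80 -s rr"
    for rip in "$@"; do
        # -a adds one real server to the virtual service
        echo "ipvsadm -a -t ${vip}:80 -r ${rip}:80 ${flag} -w 1"
    done
}

lvs_commands 192.168.62.200 dr 192.168.62.138 192.168.62.139
```

Piping the output to sh as root would apply the rules; keepalived, used later in this article, issues the equivalent calls itself.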

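DR forwarding in detail

Suppose a client accesses www.a.com. DNS resolves the name to the VIP and the destination port is 80, so the client opens a TCP connection to VIP:80 and then sends its HTTP request. The handshake packets already pass through LVS: when the first packet of the connection arrives on the VIP, LVS picks one back-end real server according to the scheduling algorithm, say RIP1, rewrites the destination MAC address of the packet to RIP1's MAC, forwards the frame to RIP1, and records the connection in its hash table so that later packets of the same connection reach the same RS.

One of RIP1's NICs, say eth0, receives the forwarded frame, sees that the destination MAC is its own, and passes it up to the IP layer. The IP layer finds that the destination IP is also local, because the VIP is configured on a non-ARP interface (for example lo:0). Based on the protocol field of the IP header (TCP here) the packet goes to TCP, and TCP hands it to Apache in the application layer based on destination port 80. When Apache has processed the request it passes the data to TCP, which builds the reply with source port 80, source IP = VIP, destination port = the client's port and destination IP = the client's IP, and hands it to the IP layer. The IP layer routes it by destination address and it travels back to the client over the network. The request is complete, and the reply never passes through the LB.

As shown in the figure:

Here the client can also be read as the gateway. The whole request has two parts:
First, the client has to resolve the VIP's MAC address via ARP. HA for this part is provided by VRRP: the VRRP group owns a virtual IP address, and the master answers ARP for it with its own MAC address, so all traffic reaches the director server. When the master goes down, the backup is promoted to master and takes over the ARP replies.

Once traffic reaches the DS, the DS picks an RIP with the load-balancing algorithm, rewrites the destination MAC address of the packet to the MAC of that RIP and forwards it. When the packet arrives at the RS, the RS sees that the destination IP is the IP on its own loopback interface (the VIP) and the source IP is the client's IP, so it replies directly with the VIP as source IP and the client as destination IP, bypassing the DS.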

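ARP issues

arp_ignore
As shown above, DIP and RIP must be in the same VLAN. When the client/gateway asks for the MAC address of the VIP, it broadcasts an ARP request in that broadcast domain, and both the DS and the RSes receive it. By default a Linux host answers ARP requests for any local IP, whichever interface the address is configured on: when eth1 receives an ARP request whose target is the IP of any local interface, it sends an ARP reply. This has to be turned off on the RS, otherwise its replies compete with the DS's ARP reply and traffic goes to the wrong machine.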

arp_ignore - INTEGER
Define different modes for sending replies in response to received ARP requests that resolve local target IP addresses:
0 - (default): reply for any local target IP address, configured on any interface
1 - reply only if the target IP address is local address configured on the incoming interface
2 - reply only if the target IP address is local address configured on the incoming interface and both with the sender’s IP address are part from same subnet on this interface
3 - do not reply for local addresses configured with scope host, only resolutions for global and link addresses are replied
4-7 - reserved
8 - do not reply for all local addresses

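Set echo "1" >/proc/sys/net/ipv4/conf/all/arp_ignore. The host then replies only if the target IP of the ARP request is configured on the interface the request arrived on, and stays silent otherwise. This guarantees that only the DS answers ARP requests from the client/gateway.

Note
The RS's ARP replies only need to be suppressed when the RS and the DS sit on the same multi-access (MA) network, as in the figure below.

If the ARP requests sent by the gateway never reach the RS, as in the figure below, arp_ignore does not need to be set.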
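
To make the difference between arp_ignore 0 and 1 concrete, here is a small simulation of the reply decision. It is only a model of the two modes described above, not kernel code: the function name arp_reply_decision and the "interface=ip" argument format are invented for this sketch, and modes 2 to 8 are not modeled.

```shell
#!/bin/bash
# Simulate whether a host answers an ARP request under arp_ignore 0 or 1.
# Usage: arp_reply_decision MODE TARGET_IP INCOMING_IF "if=ip"...
arp_reply_decision() {
    local mode="$1" target="$2" in_if="$3"
    shift 3
    local pair ifname ip
    for pair in "$@"; do
        ifname="${pair%%=*}"
        ip="${pair#*=}"
        [ "$ip" = "$target" ] || continue
        case "$mode" in
            # mode 0: any local address is answered, on any interface
            0) echo reply; return 0 ;;
            # mode 1: only addresses configured on the incoming interface
            1) if [ "$ifname" = "$in_if" ]; then echo reply; return 0; fi ;;
        esac
    done
    echo ignore
}

# An RS holds the VIP on lo and its RIP on eth0; the request arrives on eth0.
arp_reply_decision 0 192.168.62.200 eth0 lo=192.168.62.200 eth0=192.168.62.138
arp_reply_decision 1 192.168.62.200 eth0 lo=192.168.62.200 eth0=192.168.62.138
```

With mode 0 the RS answers for the VIP; with mode 1 it stays silent, while a DS that holds the VIP on the receiving interface still replies.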

arp_announce

Assume that a Linux box X has three interfaces: eth0, eth1 and eth2. Each interface has an IP address: IP0, IP1 and IP2. A local application tries to send an IP packet with source IP0 through eth2, but the target node’s MAC address is not resolved yet, so Linux box X sends an ARP request to learn the MAC address of the target (or the gateway). In this case, what is the IP source address of the “ARP request message”? IP0, the source address of the IP packet being transmitted, or IP2, the address of the outgoing interface? Until now (actually just 3 hours before) I assumed that the ARP request uses the IP address assigned to the outgoing interface (IP2 in the above example). However, Linux’s behavior is a little bit different. Actually the selection of the source address in the ARP request is totally configurable by the proc variable “arp_announce”. If we want to use IP2 and not IP0 in the ARP request, we should change the value to 1 or 2. The default value is 0, which allows IP0 to be used for the ARP request.

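Consider the following case:

When the gateway/client sends traffic to a real server (the director server is omitted in the figure), the real server unpacks the packet and finds that the destination IP is the VIP, which lives on interface loopback0. The reply has the VIP as source IP and leaves through eth0. At that point the real server sends an ARP request for the MAC address of the gateway/client. The key question is whether the source IP in that ARP request is the VIP or the IP address of eth0. One would expect the IP of eth0, but Linux behaves differently and lets the arp_announce parameter decide: by default the VIP is used as the source IP in the ARP request.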

arp_announce - INTEGER
Define different restriction levels for announcing the local source IP address from IP packets in ARP requests sent on interface:
0 - (default) Use any local address, configured on any interface
1 - Try to avoid local addresses that are not in the target’s subnet for this interface. This mode is useful when target hosts reachable via this interface require the source IP address in ARP requests to be part of their logical network configured on the receiving interface. When we generate the request we will check all our subnets that include the target IP and will preserve the source address if it is from such subnet. If there is no such subnet we select source address according to the rules for level 2.
2 - Always use the best local address for this target. In this mode we ignore the source address in the IP packet and try to select local address that we prefer for talks with the target host. Such local address is selected by looking for primary IP addresses on all our subnets on the outgoing interface that include the target IP address. If no suitable local address is found we select the first local address we have on the outgoing interface or on all other interfaces, with the hope we will receive reply for our request and even sometimes no matter the source IP address we announce.
The max value from conf/{all,interface}/arp_announce is used.
Increasing the restriction level gives more chance for receiving answer from the resolved target while decreasing the level announces more valid sender’s information.

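In an LVS setup this causes a problem. When the gateway receives such an ARP request, it refreshes its ARP cache and maps the VIP to the MAC of the real server's eth0, so the VIP no longer resolves correctly.
Set echo "2" >/proc/sys/net/ipv4/conf/all/arp_announce so that the source IP of the packet is ignored and the best local address on the outgoing interface is used instead.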
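
Values written with echo into /proc are lost on reboot. One common way to keep them is to put the matching keys into /etc/sysctl.conf. The sketch below only prints those lines; the function name rs_arp_sysctls is made up, and covering both the all and lo entries is an assumption that mirrors the real-server script used in this article.

```shell
#!/bin/bash
# Print the sysctl.conf lines that match the echo commands for a real server.
# Nothing is written; the caller decides where the lines go.
rs_arp_sysctls() {
    local ifname
    for ifname in all lo; do
        echo "net.ipv4.conf.${ifname}.arp_ignore = 1"
        echo "net.ipv4.conf.${ifname}.arp_announce = 2"
    done
}

rs_arp_sysctls
```

Appending the output to /etc/sysctl.conf and running sysctl -p as root would apply and persist the settings.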

Basic LVS setup

Deployment

Install IPVS (ipvsadm) on the director server

yum install libnl* libpopt* popt-static
wget http://www.linuxvirtualserver.org/software/kernel-2.6/ipvsadm-1.26.tar.gz
tar zxvf ipvsadm-1.26.tar.gz
cd ipvsadm-1.26
make
make install

Install Keepalived on the director server

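yum install openssl-devel
wget http://keepalived.org/software/keepalived-1.2.19.tar.gz
# unpack the tarball and enter the source tree before running configure
tar zxvf keepalived-1.2.19.tar.gz
cd keepalived-1.2.19
./configure --sysconf=/etc --with-kernel-dir=/usr/src/kernels/$(uname -r)
make
make install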

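Configuration file: /etc/keepalived/keepalived.conf

DR forwarding mode

Lab topology
[Figure: DR-mode lab topology]

All devices are on the same network segment. In VMware, bridge all virtual machines to the same adapter.

The IP address plan is as follows: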

Role    IP              VIP
DS      192.168.62.136  192.168.62.200
BDS     192.168.62.137  192.168.62.200
RS1     192.168.62.138  192.168.62.200
RS2     192.168.62.139  192.168.62.200
Client  192.168.62.1

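Director server configuration

keepalived provides HA between the director servers. It is configured in /etc/keepalived/keepalived.conf.

About keepalived and VRRP

keepalived and heartbeat are functionally similar and both can be used with LVS. They mainly handle health checks of the real servers and failover between the active load balancer and the backup.

VRRP presents two or more routers as one virtual device that exposes one or more virtual router IPs. Inside the group, the router that actually owns the external IP is the MASTER as long as it works properly; otherwise the MASTER is elected. The MASTER performs all network functions for the virtual router IP, such as answering ARP requests, ICMP, and forwarding data. The other devices do not own the IP and are in BACKUP state: apart from receiving the MASTER's VRRP advertisements they perform no external network functions. When the MASTER fails, a BACKUP takes over its network functions.

Configuration template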

! Configuration File for keepalived
global_defs {                              # global settings
    notification_email {
        conan1993ai@163.com                # alert recipient
    }
    notification_email_from i@tyr.so       # sender address
    smtp_server smtp.tyr.so                # SMTP server
    smtp_connect_timeout 30
    router_id LVS_DEVEL                    # shown in the subject of alert emails
}
vrrp_instance VI_1 {                       # VRRP instance
    state MASTER                           # role: MASTER or BACKUP
    interface eth1                         # interface used for HA
    virtual_router_id 51                   # must be the same on every router in this VRRP instance
    priority 100                           # VRRP priority, higher wins
    advert_int 1                           # advertisement interval in seconds
    authentication {
        auth_type PASS                     # PASS or AH
        auth_pass password
    }
    virtual_ipaddress {
        192.168.62.200                     # virtual IP, more than one is allowed
    }
}
virtual_server 192.168.62.200 80 {
    delay_loop 6                           # health-check interval, 6 s
    lb_algo rr                             # scheduling algorithm, rr = round robin
    lb_kind DR                             # forwarding mode
    persistence_timeout 50                 # session persistence: a client stays on the same server until the 50 s timeout expires
    protocol TCP
    sorry_server 192.168.62.140 80
    real_server 192.168.62.138 80 {
        weight 1
        HTTP_GET {                         # liveness check
            url {
                path /test.html
                digest 99d17aaa9df48db49605a368fde7cda9   # MD5 digest of the requested page
            }
            connect_timeout 3
            nb_get_retry 3                 # number of retries
            delay_before_retry 3           # delay between retries
        }
    }
    real_server 192.168.62.139 80 {
        weight 1
        HTTP_GET {
            url {
                path /test.html
                digest 7e39c6203faba4dd010ffcba1423a265
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
}
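
The digest values in the HTTP_GET blocks have to be computed per real server. keepalived ships a helper for this, genhash (for example genhash -s 192.168.62.138 -p 80 -u /test.html). As a rough cross-check, the sketch below computes the MD5 of a local copy of the page with md5sum. It assumes the digest covers only the response body; the function name page_digest and the sample content are made up for illustration.

```shell
#!/bin/bash
# Print the MD5 digest of a local copy of the health-check page.
page_digest() {
    md5sum "$1" | cut -d' ' -f1
}

# Demo with a throwaway file standing in for test.html
tmp="$(mktemp)"
printf 'hello\n' > "$tmp"
page_digest "$tmp"
rm -f "$tmp"
```

If the page content differs between real servers, as the two different digests in the template suggest, each real_server block needs its own value.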

Scheduling algorithms supported by LVS

  • Round-Robin Scheduling (rr)
  • Weighted Round-Robin Scheduling (wrr)
  • Least-Connection Scheduling (lc)
  • Weighted Least-Connection Scheduling (wlc)
  • Locality-Based Least Connections Scheduling (lblc)
  • Locality-Based Least Connections with Replication Scheduling (lblcr)
  • Destination Hashing Scheduling (dh)
  • Source Hashing Scheduling (sh)
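
To give a feel for the two simplest schedulers in the list, here is a toy version of round robin and least connection. It is a simplification for illustration, not the kernel implementation (the real lc scheduler, for instance, also takes inactive connections into account), and the function names rr_next and lc_pick are invented.

```shell
#!/bin/bash
# Round robin: the N-th request (counting from 0) goes to server N mod count.
# Usage: rr_next N SERVER...
rr_next() {
    local n="$1"
    shift
    shift $(( n % $# ))   # skip to the chosen server
    echo "$1"
}

# Least connection: pick the server with the fewest active connections.
# Usage: lc_pick "server=active_conns"...
lc_pick() {
    local best="" best_n="" pair n
    for pair in "$@"; do
        n="${pair#*=}"
        if [ -z "$best" ] || [ "$n" -lt "$best_n" ]; then
            best="${pair%%=*}"
            best_n="$n"
        fi
    done
    echo "$best"
}

for i in 0 1 2 3; do
    rr_next "$i" 192.168.62.138 192.168.62.139
done
lc_pick 192.168.62.138=570 192.168.150.129=365
```

The loop alternates between the two real servers, and lc_pick returns the server with 365 active connections.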

Real server configuration

Create the following script on each RS:

#!/bin/bash
# File: /etc/init.d/realserver.sh
# chkconfig: 2345 70 30
#   runlevels 2, 3 and 4 are text mode, 5 is graphical (X);
#   70 is the start priority, 30 is the stop priority
# description: RealServer's script
# processname: realserver.sh
VIP=192.168.62.200
case "$1" in
start)
       # Put the VIP on a loopback alias with a /32 mask so the RS accepts packets for it
       ifconfig lo:0 $VIP netmask 255.255.255.255 broadcast $VIP
       /sbin/route add -host $VIP dev lo:0
       # Do not answer ARP for the VIP and never use it as ARP source address
       echo "1" >/proc/sys/net/ipv4/conf/lo/arp_ignore
       echo "2" >/proc/sys/net/ipv4/conf/lo/arp_announce
       echo "1" >/proc/sys/net/ipv4/conf/all/arp_ignore
       echo "2" >/proc/sys/net/ipv4/conf/all/arp_announce
       echo "RealServer Start OK"
       ;;
stop)
       ifconfig lo:0 down
       route del $VIP >/dev/null 2>&1
       # Restore the kernel defaults
       echo "0" >/proc/sys/net/ipv4/conf/lo/arp_ignore
       echo "0" >/proc/sys/net/ipv4/conf/lo/arp_announce
       echo "0" >/proc/sys/net/ipv4/conf/all/arp_ignore
       echo "0" >/proc/sys/net/ipv4/conf/all/arp_announce
       echo "RealServer Stopped"
       ;;
*)
       echo "Usage: $0 {start|stop}"
       exit 1
       ;;
esac
exit 0

Startup

DS: /etc/init.d/keepalived start
RS: /etc/init.d/realserver.sh start

Note: starting with /etc/init.d/keepalived start on the DS may fail with Starting keepalived: /bin/bash: keepalived: command not found
Run which keepalived to find the path, here /usr/local/sbin/keepalived
Then edit /etc/init.d/keepalived and change daemon keepalived ${KEEPALIVED_OPTIONS} to daemon /usr/local/sbin/keepalived ${KEEPALIVED_OPTIONS}

Set up a web server on each RS; the steps are omitted here.

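Monitoring

During testing, both DSes turned out to hold the VIP at the same time, and the cause was traced to iptables. Note that under VRRP the master and the backup communicate over the multicast address 224.0.0.18. In steady state the master periodically sends advertisements to that address. If iptables drops them, the backup never hears the master and promotes itself.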
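
The article does not say which iptables change fixed the problem. One plausible fix is to accept VRRP advertisements explicitly. The sketch below prints such rules rather than applying them. It assumes the HA interface eth1 from this lab, uses IP protocol number 112 (VRRP), and the function name vrrp_allow_rules is hypothetical.

```shell
#!/bin/bash
# Print (not run) iptables rules that let VRRP advertisements through.
# 112 is the IP protocol number of VRRP, 224.0.0.18 its multicast group.
vrrp_allow_rules() {
    local ifname="${1:-eth1}"
    echo "iptables -I INPUT -i ${ifname} -d 224.0.0.18 -p 112 -j ACCEPT"
    echo "iptables -I OUTPUT -o ${ifname} -d 224.0.0.18 -p 112 -j ACCEPT"
}

vrrp_allow_rules eth1
```

The rules would have to be applied as root on both directors, and saved with whatever mechanism the distribution uses for persistent firewall rules.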

Use ip addr show to see which addresses the DS and the BDS hold:

DS

[root@localhost ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:70:1f:43 brd ff:ff:ff:ff:ff:ff
inet 192.168.62.136/24 brd 192.168.62.255 scope global eth1
inet 192.168.62.200/32 scope global eth1
inet6 fe80::20c:29ff:fe70:1f43/64 scope link
valid_lft forever preferred_lft forever

BDS

[root@localhost ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:c3:08:06 brd ff:ff:ff:ff:ff:ff
inet 192.168.62.137/24 brd 192.168.62.255 scope global eth1
inet6 fe80::20c:29ff:fec3:806/64 scope link
valid_lft forever preferred_lft forever

Use ipvsadm to check the working state of LVS

ipvsadm -Ln # shows the state of the VIP and the RSes; -n skips reverse DNS lookups

[root@localhost ~]# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.33.200:80 rr persistent 50
-> 192.168.62.138:80 Tunnel 1 570 1
-> 192.168.150.129:80 Tunnel 1 365 121

ipvsadm -Lcn # shows the connection entries

[root@localhost ~]# ipvsadm -Lnc
IPVS connection entries
pro expire state source virtual destination
TCP 01:25 FIN_WAIT 192.168.26.131:44861 192.168.33.200:80 192.168.150.129:80
TCP 01:52 FIN_WAIT 192.168.26.131:44862 192.168.33.200:80 192.168.150.129:80
TCP 01:59 FIN_WAIT 192.168.26.131:44863 192.168.33.200:80 192.168.150.129:80
TCP 00:48 NONE 192.168.26.131:0 192.168.33.200:80 192.168.150.129:80

Reliability tests

RS failure

When a RealServer goes down, the log shows an MD5 digest error, the RS is removed, and an alert email is sent:

MD5 digest error to [192.168.62.139]:80 url[/test.html], MD5SUM[8dd7f2bd9360efee9222f05d6d21b030].
Removing service [192.168.62.139]:80 from VS [192.168.62.200]:80
Remote SMTP server [0.0.0.0]:25 connected.
SMTP alert successfully sent.

A failed RS is removed automatically and added back automatically once it recovers.

DS failure

On the client, the current MAC address for the VIP is:

C:\Users\conan>arp -a | find "192.168.62.200"
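192.168.62.200 00-0c-29-70-1f-43 dynamic

This is the MAC address of eth1 on the DS.

Simulate a DS failure by blocking VRRP traffic with iptables:
iptables -A OUTPUT -d 224.0.0.18 -j DROP

The log on the BDS shows the BACKUP becoming MASTER and sending gratuitous ARPs to announce the MAC address change: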

Jul 10 20:54:33 localhost Keepalived_vrrp[29951]: VRRP_Instance(VI_1) Transition to MASTER STATE
Jul 10 20:54:34 localhost Keepalived_vrrp[29951]: VRRP_Instance(VI_1) Entering MASTER STATE
Jul 10 20:54:34 localhost Keepalived_vrrp[29951]: VRRP_Instance(VI_1) setting protocol VIPs.
Jul 10 20:54:34 localhost Keepalived_healthcheckers[29950]: Netlink reflector reports IP 192.168.62.200 added
Jul 10 20:54:34 localhost Keepalived_vrrp[29951]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth1 for 192.168.62.200
Jul 10 20:54:39 localhost Keepalived_vrrp[29951]: VRRP_Instance(VI_1) Sending gratuitous ARPs on eth1 for 192.168.62.200

On the BDS:

[root@localhost ~]# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:0c:29:c3:08:06 brd ff:ff:ff:ff:ff:ff
inet 192.168.62.137/24 brd 192.168.62.255 scope global eth1
inet 192.168.62.200/32 scope global eth1
inet6 fe80::20c:29ff:fec3:806/64 scope link
valid_lft forever preferred_lft forever

On the client, the MAC address for the VIP is now:

C:\Users\conan>arp -a | find "192.168.62.200"
192.168.62.200 00-0c-29-c3-08-06 dynamic

This is the MAC address of eth1 on the BDS.

Simulating a DS failure

C:\Users\conan>ping 192.168.62.200 -t
Pinging 192.168.62.200 with 32 bytes of data:
Reply from 192.168.62.200: bytes=32 time<1ms TTL=64
Reply from 192.168.62.200: bytes=32 time<1ms TTL=64
Request timed out.
Reply from 192.168.62.200: bytes=32 time<1ms TTL=64
Reply from 192.168.62.200: bytes=32 time<1ms TTL=64
Reply from 192.168.62.200: bytes=32 time<1ms TTL=64

The client lost only one ping. Next, a test with webbench:

Benchmarking: GET http://192.168.62.200/ (using HTTP/1.1)
30 clients, running 100 sec.
Speed=9141 pages/min, 363271 bytes/sec.
Requests: 15206 susceed, 29 failed.

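That is about 152 requests per second (15206 requests in 100 s) with 29 failures, so the switchover took roughly 0.2 s (29 / 152 ≈ 0.19 s).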
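
The estimate can be reproduced from the webbench numbers. The sketch below assumes that failures occur only during the switchover and that the request rate is otherwise constant; the function name failover_window is made up.

```shell
#!/bin/bash
# Estimate the failover window from a benchmark run.
# Usage: failover_window SUCCEEDED FAILED DURATION_SECONDS
failover_window() {
    awk -v ok="$1" -v failed="$2" -v secs="$3" \
        'BEGIN { rate = ok / secs; printf "%.0f req/s, %.2f s\n", rate, failed / rate }'
}

failover_window 15206 29 100
```

For the run above this prints 152 req/s, 0.19 s.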

Simulating an RS failure

The request success rate was 100%.

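Features

Preemption

Preemption can be disabled, so that a recovered DS does not take the MASTER role back from the current MASTER. To do this, configure every DS with state BACKUP and add nopreempt on the higher-priority device.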
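
As a sketch of what that looks like, the function below prints a vrrp_instance block with preemption disabled. The values are taken from the template earlier in this article, the authentication block is left out for brevity, and the function name nopreempt_instance and its default arguments are assumptions.

```shell
#!/bin/bash
# Print a vrrp_instance block with preemption disabled.
# Usage: nopreempt_instance [PRIORITY] [INTERFACE]
nopreempt_instance() {
    local priority="${1:-100}" ifname="${2:-eth1}"
    cat <<EOF
vrrp_instance VI_1 {
    state BACKUP            # nopreempt only works with initial state BACKUP
    nopreempt               # do not take MASTER back after recovering
    interface ${ifname}
    virtual_router_id 51
    priority ${priority}
    advert_int 1
    virtual_ipaddress {
        192.168.62.200
    }
}
EOF
}

nopreempt_instance 100 eth1
```

The other director would use the same block with a lower priority; at first start, the node with the higher priority still wins the election.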

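Dual-master mode

A BDS in hot standby wastes resources. Instead, two VRRP instances can be configured so that each DS is the MASTER of a different instance and forwards requests. When one DS goes down, all traffic switches to the other.

Configuration on DS1: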

vrrp_instance VI_1 {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass password
    }
    virtual_ipaddress {
        192.168.62.200
    }
}
vrrp_instance VI_2 {
    state BACKUP
    interface eth1
    virtual_router_id 52
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass password
    }
    virtual_ipaddress {
        192.168.62.220
    }
}

Configuration on DS2:

vrrp_instance VI_1 {
    state BACKUP
    interface eth1
    virtual_router_id 51
    priority 80
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass password
    }
    virtual_ipaddress {
        192.168.62.200
    }
}
vrrp_instance VI_2 {
    state MASTER
    interface eth1
    virtual_router_id 52
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass password
    }
    virtual_ipaddress {
        192.168.62.220
    }
}

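Then create a virtual_server for each of 192.168.62.200 and 192.168.62.220. They can share the same RS pool or use separate ones.
If they share the same RS pool, the second VIP also has to be added to the /etc/init.d/realserver.sh script on each RS: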

#!/bin/bash
VIP=192.168.62.200
VIP2=192.168.62.220
source /etc/rc.d/init.d/functions
case "$1" in
start)
    # One loopback alias per VIP
    ifconfig lo:0 $VIP netmask 255.255.255.255 broadcast $VIP
    /sbin/route add -host $VIP dev lo:0
    ifconfig lo:1 $VIP2 netmask 255.255.255.255 broadcast $VIP2
    /sbin/route add -host $VIP2 dev lo:1
    echo "1" >/proc/sys/net/ipv4/conf/lo/arp_ignore
    echo "2" >/proc/sys/net/ipv4/conf/lo/arp_announce
    echo "1" >/proc/sys/net/ipv4/conf/all/arp_ignore
    echo "2" >/proc/sys/net/ipv4/conf/all/arp_announce
    echo "RealServer Start OK"
    ;;
stop)
    ifconfig lo:0 down
    route del $VIP >/dev/null 2>&1
    ifconfig lo:1 down
    route del $VIP2 >/dev/null 2>&1
    echo "0" >/proc/sys/net/ipv4/conf/lo/arp_ignore
    echo "0" >/proc/sys/net/ipv4/conf/lo/arp_announce
    echo "0" >/proc/sys/net/ipv4/conf/all/arp_ignore
    echo "0" >/proc/sys/net/ipv4/conf/all/arp_announce
    echo "RealServer Stopped"
    ;;
*)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac
exit 0

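The service is now reachable through two different virtual IPs, which can be spread across clients with, for example, DNS round robin. On the client: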

C:\Users\conan>arp -a | find "192.168.62.2"
192.168.62.200 00-0c-29-70-1f-43 dynamic # DS1 eth1 MAC
192.168.62.220 00-0c-29-c3-08-06 dynamic # DS2 eth1 MAC
192.168.62.255 ff-ff-ff-ff-ff-ff static

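TUNNEL mode

Topology
[Figure: TUN-mode lab topology]

Every node except the router is a bridged virtual machine. The IP address plan is as follows.
The host part of every interface IP on router R1 is 100: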

Role    IP
router  192.168.26.100
        192.168.62.100
        192.168.33.100
        192.168.150.100
client  192.168.26.131
DS      192.168.33.128
BDS     192.168.33.129
VIP     192.168.33.200
RS1     192.168.62.138
RS2     192.168.150.129

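The lab uses a router emulated in GNS3, bridged with VMware virtual machines. When bridging to the VMware adapters, host-only mode is the better choice. In that mode the VM's NIC has no default route, so one has to be added by hand, pointing at the router:
route add default gw 192.168.xx.100, where xx is the subnet of the interface. In NAT mode, delete the preconfigured default route first and then configure it as above.
Do not point the default gateway at the host's adapter, for example 192.168.xx.1: the host does not forward these packets, so they have to go through the bridged router.

In TUN mode the only difference in the keepalived configuration is that the forwarding mode in virtual_server becomes lb_kind TUN. On the real servers a tunl0 interface has to be configured, and the startup script changes as follows: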

#!/bin/bash
VIP=192.168.33.200
case "$1" in
start)
    # The VIP lives on the IPIP tunnel interface instead of a loopback alias
    ifconfig tunl0 $VIP netmask 255.255.255.255 broadcast $VIP up
    echo "1" >/proc/sys/net/ipv4/conf/lo/arp_ignore
    echo "2" >/proc/sys/net/ipv4/conf/lo/arp_announce
    echo "1" >/proc/sys/net/ipv4/conf/all/arp_ignore
    echo "2" >/proc/sys/net/ipv4/conf/all/arp_announce
    # Disable reverse-path filtering on tunl0, otherwise decapsulated packets are dropped
    echo "0" > /proc/sys/net/ipv4/conf/tunl0/rp_filter
    echo "RealServer Start OK"
    ;;
stop)
    ifconfig tunl0 down
    echo "0" >/proc/sys/net/ipv4/conf/lo/arp_ignore
    echo "0" >/proc/sys/net/ipv4/conf/lo/arp_announce
    echo "0" >/proc/sys/net/ipv4/conf/all/arp_ignore
    echo "0" >/proc/sys/net/ipv4/conf/all/arp_announce
    echo "RealServer Stopped"
    ;;
*)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac
exit 0

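After the initial configuration the client could never reach the VIP. A packet capture showed:

The problem: the client keeps retransmitting, and the RS never replies to the client with [SYN,ACK].

IPIP encapsulation

The capture shows IPIP packets with two layers of IP encapsulation:

When an IP packet reaches the DS, the DS adds its own outer IP header, with the DS as source IP and the RS as destination IP.

Note that only Linux, FreeBSD and Windows 2000 can decapsulate IPIP packets; later Windows versions no longer support this.

A capture on tunl0 of the RS showed only inbound packets and no outbound ones, so the kernel was presumably dropping them. The cause turned out to be one parameter that had not been set.

ARP issues

If RIP and DIP are not on the same network segment, ARP is not a concern. Otherwise, as in DR mode, the real server would answer ARP requests for the VIP, and the source IP carried in its own ARP requests would disturb the gateway's ARP cache, so the arp_ignore and arp_announce parameters must be set.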

RPF checks and IP spoofing

rp_filter

Set echo 0 >/proc/sys/net/ipv4/conf/tunl0/rp_filter

A router SHOULD IMPLEMENT the ability to filter traffic based on a comparison of the source address of a packet and the forwarding table for a logical interface on which the packet was received. If this filtering is enabled, the router MUST silently discard a packet if the interface on which the packet was received is not the interface on which a packet would be forwarded to reach the address contained in the source address. In simpler terms, if a router wouldn’t route a packet containing this address through a particular interface, it shouldn’t believe the address if it appears as a source address in a packet read from this interface.

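The packets were in fact dropped by the reverse path filter (RPF) check, which is enabled by default and has to be turned off. The RPF rule says that a packet should not be accepted if the interface it arrived on differs from the interface the routing table would use to reach its source IP. In this lab, the decapsulated packet enters the RS kernel on tunl0 with the client (192.168.26.131) as source IP and 192.168.33.200 as destination IP, but the routing table says the route back to the client leaves through eth1. The interfaces do not match, so the packet is dropped.

RPF or any similar source-IP check also has to be disabled on the upstream switches or routers. Otherwise the upstream device receives packets whose source IP is the VIP, fails the check and drops them, and TUN mode does not work.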
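
One thing worth checking when the setting seems to have no effect: according to the kernel's ip-sysctl documentation, recent kernels use the larger of conf/all/rp_filter and conf/{interface}/rp_filter for source validation, so a non-zero value in all would still apply to tunl0. The sketch below reflects that rule and should be read as an assumption about the kernel in use; the function name rpf_effective is invented.

```shell
#!/bin/bash
# Effective rp_filter for an interface: the larger of the "all" value
# and the per-interface value.
# Usage: rpf_effective ALL_VALUE INTERFACE_VALUE
rpf_effective() {
    local all="$1" iface="$2"
    if [ "$all" -gt "$iface" ]; then
        echo "$all"
    else
        echo "$iface"
    fi
}

# Report the effective value for tunl0 when the proc entries exist.
conf=/proc/sys/net/ipv4/conf
if [ -r "$conf/all/rp_filter" ] && [ -r "$conf/tunl0/rp_filter" ]; then
    all_value="$(cat "$conf/all/rp_filter")"
    tunl0_value="$(cat "$conf/tunl0/rp_filter")"
    echo "tunl0 effective rp_filter: $(rpf_effective "$all_value" "$tunl0_value")"
fi
```

If the reported value is not 0, conf/all/rp_filter would need to be lowered as well.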

Extending the LVS setup

LVS tuning

hash table size

The director maintains a hash table of <CIP, CPort, VIP, VPort, RIP, RPort> tuples (six fields) that stores the connection information of clients.
Where:

  • CIP: Client IP address
  • CPort: Client Port number
  • VIP: Virtual IP address
  • VPort: Virtual Port number
  • RIP: RealServer IP address
  • RPort: RealServer Port number.

The default size, shown by ipvsadm -Ln, is 4096:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 192.168.33.200:80 rr persistent 50
-> 192.168.62.138:80 Tunnel 1 568 2
-> 192.168.150.129:80 Tunnel 1 503 768

The hash table speeds up the connection lookup and keeps state so that packets belonging to a connection from the client will be sent to the allocated realserver. If you are editing the .config by hand look for CONFIG_IP_MASQUERADE_VS_TAB_BITS.

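The hash table speeds up connection lookup and keeps the client-to-realserver connection state. Its size is set by the compile-time option CONFIG_IP_VS_TAB_BITS (called CONFIG_IP_MASQUERADE_VS_TAB_BITS in the older documentation quoted here). The default is 12, i.e. 2^12 = 4096 rows.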

With CONFIG_IP_MASQUERADE_VS_TAB_BITS we specify not the max number of the entries (connections in your case) but the number of the rows in a hash table. This table has columns which are unlimited. You can set your table to 256 rows and to have 1,800,000 connections in 7000 columns average. But the lookup is slower. The lookup function chooses one row using hash function and starts to search all these 7000 entries for match. So, by increasing the number of rows we want to speedup the lookup. There is no connection limit. It depends on the free memory. Try to tune the number of rows in this way that the columns will not exceed 16 (average), for example. It is not fatal if the columns are more (average) but if your CPU is fast enough this is not a problem.

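In other words, the size of the hash table does not limit the number of connection entries; that number depends on memory, not on the table size. A table that is too small makes each row's chain long and slows lookups, while a larger table costs more memory.

With a size of 4096, a test with 6k concurrent connections ran without any problem.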
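
The quoted advice is to keep the average chain length per row small (around 16 or less). The sketch below computes that average for a given connection count and table size in bits. It assumes a uniform hash distribution, and the function name avg_chain is made up.

```shell
#!/bin/bash
# Average number of connection entries per hash row.
# Usage: avg_chain CONNECTIONS TAB_BITS
avg_chain() {
    awk -v conns="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", conns / (2 ^ bits) }'
}

avg_chain 6000 12      # 4096 rows with 6k connections, as in the test above
avg_chain 1800000 8    # 256 rows, the example from the quoted text
```

For the 6k-connection test this gives about 1.46 entries per row, far below the suggested limit, which fits the observation that the default size was sufficient.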
