MGR Cluster Maintenance

1. Switching from Single-Primary to Multi-Primary Mode

# Switching the MGR mode requires restarting Group Replication: stop it on
# every node, set group_replication_single_primary_mode=OFF (and related
# parameters), then start it again. (Since MySQL 8.0.13 the switch can also
# be done online with SELECT group_replication_switch_to_multi_primary_mode();
# the restart-based procedure below works on all versions.)
# Stop group replication (run on ALL MGR nodes):
stop group_replication;

# Disable single-primary mode
set global group_replication_single_primary_mode=OFF;

# In multi-primary mode this check MUST be ON: it rejects statements that are
# unsafe when several primaries accept writes at once; without it the data can
# become inconsistent. (In single-primary mode it may stay OFF, since
# concurrent writes on multiple primaries cannot happen there.)
set global group_replication_enforce_update_everywhere_checks=ON;

# Pick any one MGR node and run the following there (here server-1):
set global group_replication_recovery_get_public_key=1;
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;

# Then run on the remaining MGR nodes (here server-2 and server-3):
set global group_replication_recovery_get_public_key=1;
START GROUP_REPLICATION;

# Check the MGR group status (works on any MGR node)
SELECT * FROM performance_schema.replication_group_members;
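The two-role procedure above can be scripted. A minimal sketch that only prints the SQL each role has to run (piping the output into the `mysql` client on each node is left to the operator; the function name is our own):

```shell
#!/usr/bin/env bash
# Sketch: emit the SQL needed to switch a node to multi-primary mode.
# Role "seed" = the single node that bootstraps the group; "joiner" = the rest.
switch_to_multi_primary_sql() {
  local role="$1"   # "seed" or "joiner"
  cat <<'SQL'
STOP GROUP_REPLICATION;
SET GLOBAL group_replication_single_primary_mode = OFF;
SET GLOBAL group_replication_enforce_update_everywhere_checks = ON;
SET GLOBAL group_replication_recovery_get_public_key = 1;
SQL
  if [ "$role" = "seed" ]; then
    cat <<'SQL'
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;
SQL
  else
    echo "START GROUP_REPLICATION;"
  fi
}

switch_to_multi_primary_sql seed
```

Run it once with `seed` on the bootstrap node, then with `joiner` everywhere else; only the seed ever toggles `group_replication_bootstrap_group`.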

2. Switching from Multi-Primary to Single-Primary Mode

# Stop group replication (run on ALL MGR nodes):
stop group_replication;

# In single-primary mode concurrent writes on multiple primaries cannot
# happen, so the update-everywhere check can be turned OFF:
set global group_replication_enforce_update_everywhere_checks=OFF;

# Enable single-primary mode
set global group_replication_single_primary_mode=ON;

# Choose one node as the primary and run on it (here server-1 is the primary):
SET GLOBAL group_replication_bootstrap_group=ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group=OFF;

# On the remaining nodes, i.e. the secondaries (here server-2 and server-3):
START GROUP_REPLICATION;

# Check the MGR group status (works on any MGR node)
SELECT * FROM performance_schema.replication_group_members;

3. Common Commands and Statements

# Check the GTID: Executed_Gtid_Set should carry the group UUID configured earlier (558edd3c-02ec-11ea-9bb3-080027e39bd2)
mysql> show master status;
+----------------+----------+--------------+------------------+------------------------------------------+
| File           | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                        |
+----------------+----------+--------------+------------------+------------------------------------------+
| mgr-bin.000004 |     2208 |              |                  | 558edd3c-02ec-11ea-9bb3-080027e39bd2:1-9 |
+----------------+----------+--------------+------------------+------------------------------------------+

# List all member nodes in the group
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 4c668095-dd20-11e9-95ba-080027e39bc3 | server-2    |        3306 | ONLINE       | SECONDARY   | 8.0.17         |
| group_replication_applier | 4c668095-dd20-11e9-95ba-080027e39bd2 | server-1    |        3306 | ONLINE       | PRIMARY     | 8.0.17         |
| group_replication_applier | c90e9166-de11-11e9-9ee0-080027d8da86 | server-3    |        3306 | ONLINE       | SECONDARY   | 8.0.17         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

# Check the current replication status within the group
mysql> select * from performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15733926047947018:3
                                 MEMBER_ID: 4c668095-dd20-11e9-95ba-080027e39bc3
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 4
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 558edd3c-02ec-11ea-9bb3-080027e39bd2:1-9
            LAST_CONFLICT_FREE_TRANSACTION: 558edd3c-02ec-11ea-9bb3-080027e39bd2:9
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 5
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15733926047947018:3
                                 MEMBER_ID: 4c668095-dd20-11e9-95ba-080027e39bd2
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 4
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 558edd3c-02ec-11ea-9bb3-080027e39bd2:1-9
            LAST_CONFLICT_FREE_TRANSACTION: 558edd3c-02ec-11ea-9bb3-080027e39bd2:9
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 2
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 4
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 3. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15733926047947018:3
                                 MEMBER_ID: c90e9166-de11-11e9-9ee0-080027d8da86
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 4
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 1
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: 558edd3c-02ec-11ea-9bb3-080027e39bd2:1-9
            LAST_CONFLICT_FREE_TRANSACTION: 558edd3c-02ec-11ea-9bb3-080027e39bd2:9
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 4
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0

# Connection status of each replication channel on this server
mysql> select * from performance_schema.replication_connection_status\G
*************************** 1. row ***************************
                                      CHANNEL_NAME: group_replication_applier
                                        GROUP_NAME: 558edd3c-02ec-11ea-9bb3-080027e39bd2
                                       SOURCE_UUID: 558edd3c-02ec-11ea-9bb3-080027e39bd2
                                         THREAD_ID: NULL
                                     SERVICE_STATE: ON
                         COUNT_RECEIVED_HEARTBEATS: 0
                          LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00.000000
                          RECEIVED_TRANSACTION_SET: 558edd3c-02ec-11ea-9bb3-080027e39bd2:1-9
                                 LAST_ERROR_NUMBER: 0
                                LAST_ERROR_MESSAGE: 
                              LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
                           LAST_QUEUED_TRANSACTION: 558edd3c-02ec-11ea-9bb3-080027e39bd2:5
 LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
     LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 2019-11-10 21:35:36.917108
       LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 2019-11-10 21:35:36.917146
                              QUEUEING_TRANSACTION: 
    QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
   QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
        QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000

# Whether each replication channel on this server is running (ON = running)
mysql> select * from performance_schema.replication_applier_status;
+---------------------------+---------------+-----------------+----------------------------+
| CHANNEL_NAME              | SERVICE_STATE | REMAINING_DELAY | COUNT_TRANSACTIONS_RETRIES |
+---------------------------+---------------+-----------------+----------------------------+
| group_replication_applier | ON            | NULL            |                          0 |
+---------------------------+---------------+-----------------+----------------------------+
# reset master deletes ALL binary log files on this server ("reset the
# binary logs") and also resets gtid_executed
reset master;

# Disable binary logging for the current session (so the user setup below is not replicated)
set SQL_LOG_BIN=0;

# Create the replication user (reachable from the 192.168.64.0/24 network)
CREATE USER 'mgr_repl'@'192.168.64.%' IDENTIFIED WITH sha256_password BY 'Mgr_repl@123456';
# If the user already exists, change it instead:
# alter USER 'mgr_repl'@'192.168.64.%' IDENTIFIED WITH sha256_password BY 'Mgr_repl@123456';

# Grant privileges
GRANT REPLICATION SLAVE ON *.* TO 'mgr_repl'@'192.168.64.%';
GRANT BACKUP_ADMIN ON *.* TO 'mgr_repl'@'192.168.64.%';

# Reload the grant tables (not strictly required after CREATE USER/GRANT, but harmless)
flush privileges;

# Re-enable binary logging for the session
set SQL_LOG_BIN=1;

# Configure the recovery channel used to join the group replication cluster
change master to master_user='mgr_repl',master_password='Mgr_repl@123456' for channel 'group_replication_recovery';

# Setting group_replication_bootstrap_group=ON marks this server as the seed that later members join; servers added afterwards must NOT set it
set global group_replication_bootstrap_group=ON;

# Start MGR as the first (bootstrap) node of the cluster
start group_replication;

# Turn group_replication_bootstrap_group back off
set global group_replication_bootstrap_group=OFF;

# Check MGR status via the performance_schema.replication_group_members table
select * from performance_schema.replication_group_members;

# Check whether this member is writable:
# if super_read_only is ON, the member serves reads only;
# if super_read_only is OFF, the member's state is ONLINE, and the healthy
# members still form a majority of the group (more than half of the members
# are ONLINE), the member can serve both reads and writes
show global variables like 'super%';
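The majority condition above can be checked mechanically by counting ONLINE rows. A sketch (the inline sample stands in for real output of `mysql -N -e "SELECT MEMBER_STATE FROM performance_schema.replication_group_members"`; the function name is our own):

```shell
#!/usr/bin/env bash
# Sketch: decide whether the group still has a quorum, i.e. a strict
# majority of members in the ONLINE state.
has_quorum() {
  local members="$1"          # one MEMBER_STATE per line
  local total online
  total=$(printf '%s\n' "$members" | grep -c .)          # count all members
  online=$(printf '%s\n' "$members" | grep -c '^ONLINE$') # count ONLINE members
  [ $((2 * online)) -gt "$total" ]                        # majority: 2n > total
}

sample=$'ONLINE\nONLINE\nUNREACHABLE'
if has_quorum "$sample"; then echo "quorum: yes"; else echo "quorum: no"; fi
```

With 2 of 3 members ONLINE this prints `quorum: yes`; with only 1 of 3 it would print `quorum: no`.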

# Allow writes
set global super_read_only=OFF;

# Read-only
set global super_read_only=ON;

4. Recovering a Crashed Secondary Node

# 1. Take a full dump with mysqldump on another healthy node
mysqldump -h 192.168.64.34 -P 3306 -uroot -p --single-transaction --master-data=2 --flush-logs --flush-privileges --events --routines --all-databases > all_backup.sql

# 2. Restore from the backup
systemctl restart mysqld
mysql -h 192.168.64.32 -uroot -p -P 3306 -e "reset master;"
mysql -h 192.168.64.32 -uroot -p -P 3306 < /home/sql_bak/all_backup.sql
# or
mysql -h 192.168.64.32 -uroot -p -P 3306
mysql> source /home/bak_mysql/all_backup.sql

# 3. Configure the replication user and start MGR
change master to master_user='mgr_repl',master_password='Mgr_repl@123456' for channel 'group_replication_recovery';
start GROUP_REPLICATION;
mysql> select * from performance_schema.replication_group_members;
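The three steps above can be collapsed into one helper that just prints the command sequence for a given donor and target (hosts, paths and the masked password below are placeholders from this example, not a fixed convention):

```shell
#!/usr/bin/env bash
# Sketch: print the command plan to rebuild a crashed secondary from a
# healthy donor node. Review the plan, then run it step by step.
rebuild_secondary_plan() {
  local donor="$1" target="$2" dump="$3"
  cat <<EOF
mysqldump -h $donor -P 3306 -uroot -p --single-transaction --master-data=2 \\
  --flush-logs --flush-privileges --events --routines --all-databases > $dump
mysql -h $target -P 3306 -uroot -p -e "RESET MASTER;"
mysql -h $target -P 3306 -uroot -p < $dump
mysql -h $target -P 3306 -uroot -p -e "CHANGE MASTER TO master_user='mgr_repl', \\
  master_password='***' FOR CHANNEL 'group_replication_recovery'; START GROUP_REPLICATION;"
EOF
}

rebuild_secondary_plan 192.168.64.34 192.168.64.32 /home/sql_bak/all_backup.sql
```

Printing the plan instead of executing it keeps the destructive `RESET MASTER` step under operator control.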

5. Re-syncing When Primary and Secondary Data Diverge

# When restoring a secondary, reset gtid_executed first, then load the backup file
## Take a backup on the primary
mysqldump -h 192.168.64.33 -P 6002 -uroot -p --all-databases --single-transaction --quick >all_databases-202312212350.sql

## Copy the backup file to the affected host
cd /home/sql_bak
scp -P 22 root@192.168.64.32:/home/sql_bak/202312212350/all_databases-202312212350.sql .
## Or push it from the primary to the secondary servers
scp -P 22 all_databases-202312212350.sql root@192.168.64.33:/home/sql_bak
scp -P 22 all_databases-202312212350.sql root@192.168.64.34:/home/sql_bak

## Leave the MGR group
stop group_replication;

## Reset gtid_executed on the secondary
reset master;

## Allow writes
set global super_read_only=OFF;

## Restore the data on the secondaries
mysql -h 192.168.64.33 -P 3306 -uroot -p </home/sql_bak/all_databases-202312212350.sql
mysql -h 192.168.64.34 -P 3306 -uroot -p </home/sql_bak/all_databases-202312212350.sql

## Rejoin the MGR group
start group_replication;

## Check the read/write setting
show global variables like 'super%';

## Set read-only again
set global super_read_only=ON;

## Check the cluster status
SELECT * FROM performance_schema.replication_group_members;

6. Node Stuck in the RECOVERING State for a Long Time

# The error log points to an authentication plugin problem: Authentication
# plugin 'caching_sha2_password' reported error: Authentication requires
# secure connection. The replication user was created with MySQL 8's default
# caching_sha2_password plugin, which requires a secure connection (or RSA
# key exchange) during recovery.

# Fix: switch the user's password to a different authentication plugin
SET SQL_LOG_BIN=0;
alter USER 'mgr_repl'@'192.168.64.%' IDENTIFIED WITH sha256_password BY 'Mgr_repl@123456';
GRANT REPLICATION SLAVE ON *.* TO 'mgr_repl'@'192.168.64.%';
SET SQL_LOG_BIN=1;

# Then start MGR again and the node should come back normally
select * from performance_schema.replication_group_members;
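Whether a replication user is affected can be checked by listing its auth plugin. A sketch that parses `mysql` output (the `printf` sample stands in for a live query; the function name is our own):

```shell
#!/usr/bin/env bash
# Sketch: flag users whose plugin is caching_sha2_password, which breaks
# the group_replication_recovery channel unless TLS or RSA key exchange
# is configured. Input: "<user>\t<plugin>" per line, as produced by e.g.
#   mysql -N -e "SELECT user, plugin FROM mysql.user WHERE user='mgr_repl'"
affected_users() {
  awk -F'\t' '$2 == "caching_sha2_password" { print $1 }'
}

printf 'mgr_repl\tcaching_sha2_password\nroot\tsha256_password\n' | affected_users
```

Note that section 1 above already sets `group_replication_recovery_get_public_key=1`, which is an alternative fix that lets recovery work with `caching_sha2_password` instead of changing the plugin.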

7. Single-Primary Mode: Re-activating Group Replication After Recovering a Node (e.g. MGR-node1)

# If a node fails, the correct way to rejoin it to the MGR cluster after recovery is:
STOP GROUP_REPLICATION;
START GROUP_REPLICATION;

# While a node is down, the remaining nodes keep replicating. Once the failed node is back, just re-activate Group Replication on it; it rejoins the cluster and catches up with the other nodes' data automatically.
START GROUP_REPLICATION;

# If I/O replication fails, after confirming the data itself is intact:
# check the GTID state on the primary
mysql> show global variables like '%gtid%';
+----------------------------------------------+------------------------------------------+
| Variable_name                                | Value                                    |
+----------------------------------------------+------------------------------------------+
| binlog_gtid_simple_recovery                  | ON                                       |
| enforce_gtid_consistency                     | ON                                       |
| group_replication_gtid_assignment_block_size | 1000000                                  |
| gtid_executed                                | 24f16c50-3508-11ee-8fa3-fa163e353519:1-10|
| gtid_executed_compression_period             | 1000                                     |
| gtid_mode                                    | ON                                       |
| gtid_owned                                   |                                          |
| gtid_purged                                  |                                          |
| session_track_gtids                          | OFF                                      |
+----------------------------------------------+------------------------------------------+
Full value of gtid_executed: 24f16c50-3508-11ee-8fa3-fa163e353519:1-10,be530001-350a-11ee-b128-fa163e353519:1-3803

# On the failed secondary, run:
stop GROUP_REPLICATION;
reset master;
set global gtid_purged='24f16c50-3508-11ee-8fa3-fa163e353519:1-10,be530001-350a-11ee-b128-fa163e353519:1-3803';
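Extracting gtid_executed and turning it into the `SET gtid_purged` statement can be done mechanically. A sketch over table-formatted `show global variables like '%gtid%'` output (the sample line stands in for real output; the function name is our own):

```shell
#!/usr/bin/env bash
# Sketch: build the "set global gtid_purged" statement from the primary's
# `show global variables like '%gtid%'` output in table format (as above).
gtid_purged_stmt() {
  awk -F'|' '$2 ~ /^[[:space:]]*gtid_executed[[:space:]]*$/ {
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", $3)   # trim the value column
    printf "set global gtid_purged='\''%s'\'';\n", $3
  }'
}

sample='| gtid_executed                                | 24f16c50-3508-11ee-8fa3-fa163e353519:1-10 |'
printf '%s\n' "$sample" | gtid_purged_stmt
```

The anchored regex deliberately skips lines like `gtid_executed_compression_period`, so only the `gtid_executed` row produces a statement.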

# Start group replication
START GROUP_REPLICATION;

# Add allowed networks to the connection whitelist. Important: stop Group Replication first
# (the variable was renamed to group_replication_ip_allowlist in MySQL 8.0.22)
stop group_replication;
set global group_replication_ip_whitelist="127.0.0.1/32,192.168.64.0/24";
start group_replication;
show variables like "group_replication_ip_whitelist";

# Read-only mode
set global super_read_only=ON;

8. Troubleshooting

8.1 After the whole MGR cluster went down, a rejoining secondary stayed in RECOVERING and then became MISSING; its error log showed the following

2023-08-08T09:53:56.892258+08:00 25 [ERROR] [MY-010584] [Repl] Replica I/O for channel 'group_replication_recovery': Error connecting to source 'mgr_repl@V002012013-mysql-32:3306'. This was attempt 1/1, with a delay of 60 seconds between attempts. Message: Unknown MySQL server host 'mgr_repl@V002012013-mysql-32' (-2), Error_code: MY-002005
2023-08-08T09:53:56.895250+08:00 24 [ERROR] [MY-011582] [Repl] Plugin group_replication reported: 'There was an error when connecting to the donor server. Please check that group_replication_recovery channel credentials and all MEMBER_HOST column values of performance_schema.replication_group_members table are correct and DNS resolvable.'
2023-08-08T09:53:56.895285+08:00 24 [ERROR] [MY-011583] [Repl] Plugin group_replication reported: 'For details please check performance_schema.replication_connection_status table and error log messages of Replica I/O for channel group_replication_recovery.'
2023-08-08T09:53:57.777521+08:00 10 [System] [MY-014010] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been started.'

Fix:
Add a hosts (or DNS) entry so that the donor host in mgr_repl@V002012013-mysql-32:3306 resolves from the joining node.
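Resolvability of every MEMBER_HOST can be checked up front with `getent` (assumes a Linux host with NSS; the host list would really come from `mysql -N -e "SELECT MEMBER_HOST FROM performance_schema.replication_group_members"`, and the function name is our own):

```shell
#!/usr/bin/env bash
# Sketch: verify that each MEMBER_HOST value is DNS-resolvable from this
# node, which is exactly what the MY-011582 error above complains about.
check_member_hosts() {
  local rc=0 host
  for host in "$@"; do
    if getent hosts "$host" > /dev/null; then
      echo "ok:   $host"
    else
      echo "FAIL: $host  (add it to /etc/hosts or DNS)"
      rc=1
    fi
  done
  return $rc
}

check_member_hosts localhost
```

Run it with every MEMBER_HOST value on the joining node before starting Group Replication.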
If the error persists, on the primary:
# group_replication_bootstrap_group=ON marks this server as the seed that the other members join; servers added later must not set it
set global group_replication_bootstrap_group=ON;

# Then start group replication on the secondary again:
START GROUP_REPLICATION;

# Finally turn group_replication_bootstrap_group back off
set global group_replication_bootstrap_group=OFF;