Getting the PG count right
The number of PGs in a pool is set up front. It should not be chosen arbitrarily; it needs to be derived from the number of OSDs and the replication policy:
Total PGs = ((Total_number_of_OSD * 100) / max_replication_count) / pool_count
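As a quick sanity check, plugging in the numbers of the small test cluster used later in this article (3 OSDs, 3 replicas, 4 pools; the values are illustrative only):
#((3 * 100) / 3) / 4 = 25, which is then commonly rounded up to the nearest power of two
[ceph@ceph04 ~]$ echo $(( ((3 * 100) / 3) / 4 ))
25
#so a pg_num of 32 per pool would be a reasonable starting point here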
Autoscaling placement groups
Placement groups (PGs) are an internal implementation detail of how Ceph distributes data. By enabling pg-autoscaling you can let the cluster either make recommendations or adjust PG counts automatically based on how the cluster is being used. To set the autoscaling mode on an existing pool:
#Syntax:
ceph osd pool set <pool-name> pg_autoscale_mode <mode>
<mode> can be one of three values:
- off: disable autoscaling for this pool.
- on: enable automatic adjustment of the PG count for this pool.
- warn: raise a health alert when the PG count should be adjusted.
#Example:
[ceph@ceph05 ~]$ ceph osd pool set test pg_autoscale_mode on
set pool 3 pg_autoscale_mode to on
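If you want to confirm which mode a pool is currently using, the value can be read back per pool (shown here against the same test pool; the output line below is what recent releases print):
[ceph@ceph04 ~]$ ceph osd pool get test pg_autoscale_mode
pg_autoscale_mode: on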
#The default pg_autoscale_mode applied to any pools created later can be configured with:
ceph config set global osd_pool_default_pg_autoscale_mode <mode>
#Show all configuration options
[ceph@ceph04 ~]$ ceph config dump
WHO MASK LEVEL OPTION VALUE RO
global advanced osd_pool_default_pg_autoscale_mode on
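A single option can also be queried directly instead of grepping the full dump; the daemon name below is just an example from this cluster:
[ceph@ceph04 ~]$ ceph config get osd.0 osd_pool_default_pg_autoscale_mode
on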
Viewing PG scaling recommendations
#Use the following command to see each pool, its relative utilization, and any suggested change to its PG count:
[ceph@ceph06 ~]$ ceph osd pool autoscale-status
Error ENOTSUP: Module 'pg_autoscaler' is not enabled (required by command 'osd pool autoscale-status'): use `ceph mgr module enable pg_autoscaler` to enable it
[ceph@ceph06 ~]$ ceph mgr module enable pg_autoscaler #fixes the error above
[ceph@ceph04 ~]$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
test 1471 3.0 119.9G 0.0000 1.0 32 on
ecpool 0 1.5 119.9G 0.0000 1.0 12 warn
test22 400 2.0 119.9G 0.0000 1.0 128 32 warn
ecpool01 8192 1.5 119.9G 0.0000 1.0 12 warn
#Note that for test22, whose AUTOSCALE mode is warn, the autoscaler suggests changing PG_NUM to 32. Let's enable automatic PG adjustment on the test22 pool and see what happens.
[ceph@ceph04 ~]$ ceph osd pool set test22 pg_autoscale_mode on
set pool 2 pg_autoscale_mode to on
#Wait a moment, then run the command below again to check the status. The more data a pool holds, the longer the migration, and therefore the adjustment of PG_NUM to NEW PG_NUM, may take.
[ceph@ceph04 ~]$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
test 1471 3.0 119.9G 0.0000 1.0 32 on
ecpool 0 1.5 119.9G 0.0000 1.0 12 warn
test22 400 2.0 119.9G 0.0000 1.0 32 on
ecpool01 8192 1.5 119.9G 0.0000 1.0 12 warn
Autoscaling rules
Ceph looks at the total available storage and the target number of PGs for the whole system, looks at how much data is stored in each pool, and apportions PGs accordingly. The approach is deliberately conservative: a pool is only changed when its current PG count (pg_num) is more than a factor of 3 away from the value the autoscaler considers ideal.
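A rough illustration using the test22 pool from the status output above (a sketch of the rule only, not the exact algorithm):
#the autoscaler's ideal pg_num for test22 is 32, so no change is triggered while pg_num stays within a factor of 3 of that
[ceph@ceph04 ~]$ echo $(( 32 * 3 ))
96
#test22's current pg_num of 128 is above 96, i.e. more than 3x the ideal value, so it gets flagged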
#The target number of PGs per OSD is controlled by the mon_target_pg_per_osd option (default: 100) and can be adjusted with:
[ceph@ceph04 ~]$ ceph config set global mon_target_pg_per_osd 100
[ceph@ceph04 ~]$ ceph config dump
WHO MASK LEVEL OPTION VALUE RO
global advanced mon_target_pg_per_osd 100
global advanced osd_pool_default_pg_autoscale_mode on
#From this we can see that the current environment has three OSDs, each with a default target of 100 PGs, i.e. a cluster-wide budget of roughly 300 PG replicas across all pools. test22 alone was created with a PG_NUM of 128, which, once its replicas are counted, consumes most of that budget even though the pool holds almost no data, so the autoscaler considers 32 the appropriate value; because pg_autoscale_mode was warn, it only recommended the change rather than applying it.
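To cross-check how many PGs actually land on each OSD against that per-OSD target, ceph osd df prints a PGS column per OSD (output omitted here since it depends entirely on the cluster):
[ceph@ceph04 ~]$ ceph osd df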
Specifying the expected size of a pool
If you know roughly how large a pool will eventually be, the autoscaler can pick a suitable PG count from the start, avoiding later pg_num adjustments and the data movement they trigger. The target size can also be given when the pool is created, via the optional --target-size-bytes or --target-size-ratio arguments to ceph osd pool create.
#Method 1: specify the pool's expected capacity directly:
[ceph@ceph04 ~]$ ceph osd pool set test target_size_bytes 10G
set pool 3 target_size_bytes to 10G
[ceph@ceph04 ~]$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
test 1471 10240M 3.0 119.9G 0.2500 1.0 32 on
ecpool 0 1.5 119.9G 0.0000 1.0 12 warn
test22 400 2.0 119.9G 0.0000 1.0 32 on
ecpool01 8192 1.5 119.9G 0.0000 1.0 12 warn
#Method 2: specify the pool's weight (ratio) relative to the rest of the cluster:
[ceph@ceph04 ~]$ ceph osd pool set test22 target_size_ratio 0.5
set pool 2 target_size_ratio to 0.5
[ceph@ceph04 ~]$ ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
test 1471 10240M 3.0 119.9G 0.2500 1.0 32 on
ecpool 0 1.5 119.9G 0.0000 1.0 12 warn
test22 400 2.0 119.9G 0.0000 0.5000 1.0 32 on
ecpool01 8192 1.5 119.9G 0.0000 1.0 12 warn
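As mentioned above, on recent releases both parameters can also be supplied when the pool is created instead of being set afterwards; the pool name and values below are only placeholders:
#expected capacity at creation time:
ceph osd pool create <pool-name> --target-size-bytes 10G
#or as a ratio of the cluster's capacity:
ceph osd pool create <pool-name> --target-size-ratio 0.5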
Setting a lower bound on a pool's PG count
ceph osd pool set <pool-name> pg_num_min <num>
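#for example, to keep the test pool from being scaled below 16 PGs (the value is arbitrary):
ceph osd pool set test pg_num_min 16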
Specifying pg_num when creating a pool:
ceph osd pool create {pool-name} [pg_num]
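#for example, creating a pool with 32 PGs up front (the pool name is hypothetical):
ceph osd pool create test33 32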
Getting PG statistics for the cluster
ceph pg dump
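For a quick one-line summary of PG states instead of the very verbose full dump:
ceph pg stat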
Scrub testing
Scrub is Ceph's mechanism for maintaining data integrity, similar to fsck for a filesystem: it detects existing data inconsistencies. Scrubbing affects cluster performance. It comes in two flavors:
- Light scrubbing runs daily by default; its schedule is governed by osd scrub min interval (default 24 hours) and osd scrub max interval (default 7 days). It catches shallow inconsistencies by checking object sizes and attributes.
- Deep scrubbing runs weekly by default; its schedule is governed by osd deep scrub interval (default one week). It catches deep inconsistencies by reading the data and verifying its checksums.
View the default osd scrub configuration options:
[ceph@ceph06 ~]$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep scrub
    "mds_max_scrub_ops_in_progress": "5",
    "mon_scrub_inject_crc_mismatch": "0.000000",
    "mon_scrub_inject_missing_keys": "0.000000",
    "mon_scrub_interval": "86400",
    "mon_scrub_max_keys": "100",
    "mon_scrub_timeout": "300",
    "mon_warn_pg_not_deep_scrubbed_ratio": "0.750000",
    "mon_warn_pg_not_scrubbed_ratio": "0.500000",
    "osd_debug_deep_scrub_sleep": "0.000000",
    "osd_deep_scrub_interval": "604800.000000",
    "osd_deep_scrub_keys": "1024",
    "osd_deep_scrub_large_omap_object_key_threshold": "200000",
    "osd_deep_scrub_large_omap_object_value_sum_threshold": "1073741824",
    "osd_deep_scrub_randomize_ratio": "0.150000",
    "osd_deep_scrub_stride": "524288",
    "osd_deep_scrub_update_digest_min_age": "7200",
    "osd_max_scrubs": "1",
    "osd_op_queue_mclock_scrub_lim": "0.001000",
    "osd_op_queue_mclock_scrub_res": "0.000000",
    "osd_op_queue_mclock_scrub_wgt": "1.000000",
    "osd_requested_scrub_priority": "120",
    "osd_scrub_auto_repair": "false",
    "osd_scrub_auto_repair_num_errors": "5",
    "osd_scrub_backoff_ratio": "0.660000",
    "osd_scrub_begin_hour": "0",
    "osd_scrub_begin_week_day": "0",
    "osd_scrub_chunk_max": "25",
    "osd_scrub_chunk_min": "5",
    "osd_scrub_cost": "52428800",
    "osd_scrub_during_recovery": "false",
    "osd_scrub_end_hour": "24",
    "osd_scrub_end_week_day": "7",
    "osd_scrub_interval_randomize_ratio": "0.500000",
    "osd_scrub_invalid_stats": "true",
    "osd_scrub_load_threshold": "0.500000",
    "osd_scrub_max_interval": "604800.000000",
    "osd_scrub_max_preemptions": "5",
    "osd_scrub_min_interval": "86400.000000",
    "osd_scrub_priority": "5",
    "osd_scrub_sleep": "0.000000",
In addition to the scheduled scrubs, administrators can also kick off a scrub manually:
ceph pg scrub {pg-id}
#To scrub all PGs in a particular pool:
ceph osd pool scrub {pool-name}
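A deep scrub can be requested in the same way, alongside the light-scrub commands above:
ceph pg deep-scrub {pg-id}
#deep-scrub all PGs in a particular pool:
ceph osd pool deep-scrub {pool-name}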
Common PG problem states
- inactive: the PG has not been active for too long (i.e., it has not been able to service read/write requests).
- unclean: the PG has not been clean for too long (i.e., it has not been able to fully recover from a previous failure).
- stale: the PG's status has not been updated by any ceph-osd, which suggests that all nodes storing this placement group may be down.
Stuck placement groups can be listed explicitly with one of the following:
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
A PG stuck in stale usually just needs the right ceph-osd daemons to be brought back up.
A PG stuck in inactive is usually a peering problem, for example:
ceph health detail
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
We can query the cluster to find out exactly why the PG is marked down:
ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2012-03-06 14:40:16.169679",
"requested_info_from": []},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2012-03-06 14:40:16.169659",
"probing_osds": [
0,
1],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
1],
"peering_blocked_by": [
{ "osd": 1,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2012-03-06 14:40:16.169513"}
]
}
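In this example peering is blocked because osd.1 is down; as the comment in the output suggests, bringing that OSD back up (or, only if it is permanently unrecoverable, marking it lost) lets peering proceed. A sketch of both options, assuming a systemd-managed deployment:
#on the node that hosts osd.1:
sudo systemctl start ceph-osd@1
#only if the OSD and its data are gone for good:
ceph osd lost 1 --yes-i-really-mean-it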
For more PG failure examples, see: https://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#failures-osd-peering