Cray System Management Documentation > Cray System Management (CSM) Administration Guide > security and authentication > Restrict Network Access to the ncn-images S3 Bucket

Restrict Network Access to the `ncn-images` S3 Bucket

The configuration documented in this procedure is intended to prevent user-facing dedicated nodes (UANs, Compute Nodes) from retrieving NCN image content from Ceph S3 services, as running on storage nodes.

Specifically, the controls enacted via this procedure should do the following:

Block HAProxy access to the ncn-images bucket if the client is not an NCN (NMN) or PXE booting from the MTL network. This via a HAProxy ACL on the storage servers.
Enable access logging for HAProxy.
Block Rados GW network access (port 8080) if the client is not an NCN (NMN) or originating from the HMN network. This via iptables rules on the storage servers.

Limitations

This is not designed to prevent UAIs (if in use) from retrieving NCN image content.

If a storage node is rebuilt, this procedure (for the rebuilt node) will need to be applied after the rebuild. The same is true if NCNs are added or removed from the system as it will change source IP ranges for clients.

Prerequisites and scope

Procedure should be executed after install or upgrade is otherwise complete, but prior to opening the system for user access.

Unless otherwise noted, the procedure should be run from ncn-m001 (not PIT).

This procedure was back-ported from CSM 1.2 and was tested on a CSM 1.0.11 system.

Procedure

Test connectivity before applying the ACL.

Save the following script to a file (for example, con_test.sh).

#!/bin/bash


SNCNS="$(grep 'ncn-s.*\.nmn' /etc/hosts | awk '{print $NF;}' | xargs)"
SCSNS_NMN="$(echo $SNCNS | xargs -n 1 | sed -e 's/$/.nmn/g')"
SCSNS_HMN="$(echo $SNCNS | xargs -n 1 | sed -e 's/$/.hmn/g')"
SCSNS_CAN="$(echo $SNCNS | xargs -n 1 | sed -e 's/$/.can/g')"

RADOS_HTTP_PORT="8080"
HAPROXY_HTTP_PORT="80"
HAPROXY_HTTPSPORT="443"

PASS="PASS"
FAIL="FAIL"

function rados_test
{
NODES="$1"
MSG="$2"
TTYPE="$3"

echo "[i] $MSG"
for n in $NODES
do
    echo -n "  RADOS $n: "
    if [ "$TTYPE" == "CONN_FAIL" ]
    then
        curl -sI --connect-timeout 2 http://${n}:${RADOS_HTTP_PORT}/ &> /dev/null
        rc=$?
        rc_pass=28
    else
        curl -I --connect-timeout 2 http://${n}:${RADOS_HTTP_PORT}/ 2>/dev/null | grep -q "200 OK"
        rc=$?
        rc_pass=0
    fi

    if [ $rc -eq $rc_pass ]
    then
        echo $PASS
    else
        echo $FAIL
    fi
done
}

function haproxy_test
{
NODES="$1"
MSG="$2"

echo "[i] $MSG"
for n in $NODES
do
    echo -n "  HAPROXY (CEPH) HTTP $n: "
    curl -I --connect-timeout 2 http://${n}:${HAPROXY_HTTP_PORT}/ncn-images/ 2>/dev/null | grep -q "x-amz-request-id"
    if [ $? -eq 0 ]
    then
        echo $PASS
    else
        echo $FAIL
    fi

    echo -n "  HAPROXY (CEPH) HTTPS $n: "
    curl -kI --connect-timeout 2 https://${n}:${HAPROXY_HTTPS_PORT}/ncn-images/ 2>/dev/null | grep -q "x-amz-request-id"
    if [ $? -eq 0 ]
    then
        echo $PASS
    else
        echo $FAIL
    fi

done
}

rados_test "$SCSNS_NMN" "MGMT RADOS over NMN"
rados_test "$SCSNS_HMN" "MGMT RADOS over HMN"
rados_test "$SCSNS_CAN" "MGMT RADOS over CAN" "CONN_FAIL"
haproxy_test "$SCSNS_NMN" "MGMT HAProxy over NMN"

Execute the script, if the ACLs have not been applied, results similar to the following will be returned (failures are expected):

[i] MGMT RADOS over NMN
RADOS ncn-s003.nmn: PASS
RADOS ncn-s002.nmn: PASS
RADOS ncn-s001.nmn: PASS
[i] MGMT RADOS over HMN
RADOS ncn-s003.hmn: PASS
RADOS ncn-s002.hmn: PASS
RADOS ncn-s001.hmn: PASS
[i] MGMT RADOS over CAN
RADOS ncn-s003.can: FAIL
RADOS ncn-s002.can: FAIL
RADOS ncn-s001.can: FAIL
[i] MGMT HAProxy over NMN
HAPROXY (CEPH) HTTP ncn-s003.nmn: PASS
HAPROXY (CEPH) HTTPS ncn-s003.nmn: PASS
HAPROXY (CEPH) HTTP ncn-s002.nmn: PASS
HAPROXY (CEPH) HTTPS ncn-s002.nmn: PASS
HAPROXY (CEPH) HTTP ncn-s001.nmn: PASS
HAPROXY (CEPH) HTTPS ncn-s001.nmn: PASS

Build an IP address list of NCNs on the NMN.

Cross-check to verify the count seems appropriate for the system in use.

ncn-m001# grep 'ncn-[mws].*.nmn' /etc/hosts | awk '{print $1;}' | sed -e 's/\./ /g' | sort -nk 4 | sed -e 's/ /\./g' | tee allowed_ncns.lst
10.252.1.4
10.252.1.5
10.252.1.6
10.252.1.7
10.252.1.8
10.252.1.9
10.252.1.10
10.252.1.11
10.252.1.12
10.252.1.13
10.252.1.14

Add the MTL subnet (needed for network boots of NCNs).
```
ncn-m001# echo '10.1.0.0/16' >> allowed_ncns.lst
```

Verify the allowed_ncns.lst contains contain NMN addresses for all management NCNs nodes and the MTL subnet (10.1.0.0/16).

ncn-m001# cat allowed_ncns.lst 
10.252.1.4
10.252.1.5
10.252.1.6
10.252.1.7
10.252.1.8
10.252.1.9
10.252.1.10
10.252.1.11
10.252.1.12
10.252.1.13
10.252.1.14
10.1.0.0/16

Confirm HAProxy configurations are identical across storage nodes.

Adjust the -w predicate to represent the full set of storage nodes for the system. Applies to this step and subsequent steps.

ncn-m001# pdsh -w ncn-s00[1-4] "cat /etc/haproxy/haproxy.cfg" | dshbak -c
----------------
ncn-s[001-004]
----------------
# Please do not change this file directly since it is managed by Ansible and will be overwritten
global
   log         127.0.0.1 local2

   chroot      /var/lib/haproxy
   pidfile     /var/run/haproxy.pid
   maxconn     8000
   user        haproxy
   group       haproxy
   daemon
   stats socket /var/lib/haproxy/stats
   tune.ssl.default-dh-param 4096
   ssl-default-bind-ciphers EECDH+AESGCM:EDH+AESGCM
   ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
defaults
   mode                    http
   log                     global
   option                  httplog
   option                  dontlognull
   option http-server-close
   option forwardfor       except 127.0.0.0/8
   option                  redispatch
   retries                 3
   timeout http-request    10s
   timeout queue           1m
   timeout connect         10s
   timeout client          1m
   timeout server          1m
   timeout http-keep-alive 10s
   timeout check           10s
   maxconn                 8000

frontend http-rgw-frontend
   bind *:80
   default_backend rgw-backend

frontend https-rgw-frontend
   bind *:443 ssl crt /etc/ceph/rgw.pem
   default_backend rgw-backend

backend rgw-backend
   option forwardfor
   balance static-rr
   option httpchk GET /
      server server-ncn-s001-rgw0 10.252.1.7:8080 check weight 100
      server server-ncn-s002-rgw0 10.252.1.6:8080 check weight 100
      server server-ncn-s003-rgw0 10.252.1.5:8080 check weight 100
      server server-ncn-s004-rgw0 10.252.1.4:8080 check weight 100

Create a backup of haproxy.cfg files on storage nodes.

ncn-m001# pdsh -w ncn-s00[1-4] "cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg-dist"

Grab a copy of haproxy.cfg to modify from a storage node, preserving permissions.
```
ncn-m001# scp -p ncn-s001:/etc/haproxy/haproxy.cfg . 
haproxy.cfg
```

Edit the haproxy.cfg, adding in the following ACLs and log directives to each front-end (a diff shown to illustrate changes necessary).

ncn-m001# diff -Naur haproxy.cfg-dist haproxy.cfg
--- haproxy.cfg-dist 2022-06-30 18:20:55.000000000 +0000
+++ haproxy.cfg   2022-07-07 16:56:40.000000000 +0000
@@ -1,6 +1,6 @@
# Please do not change this file directly since it is managed by Ansible and will be overwritten
global
-    log         127.0.0.1 local2
+    log         127.0.0.1:514 local0 info

   chroot      /var/lib/haproxy
   pidfile     /var/run/haproxy.pid
@@ -31,12 +31,22 @@
   maxconn                 8000

frontend http-rgw-frontend
+    log global
+    option httplog
   bind *:80
   default_backend rgw-backend
+    acl allow_ncns src -n -f /etc/haproxy/allowed_ncns.lst
+    acl restrict_ncn_images path_beg /ncn-images
+    http-request deny if restrict_ncn_images !allow_ncns 

frontend https-rgw-frontend
+    log global
+    option httplog
   bind *:443 ssl crt /etc/ceph/rgw.pem
   default_backend rgw-backend
+    acl allow_ncns src -n -f /etc/haproxy/allowed_ncns.lst
+    acl restrict_ncn_images path_beg /ncn-images
+    http-request deny if restrict_ncn_images !allow_ncns 

backend rgw-backend
   option forwardfor

Create a new rsyslog configuration for HAProxy to have it listen to UDP 514 on the local host.

With the log directive additions to HAProxy, and allowing a local host UDP 514 socket, access logging should work properly. Set permissions to 640 on the file.
```
ncn-m001# cat haproxy.conf 
# Collect log with UDP
$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514

ncn-m001# chmod 0640 haproxy.conf 
```

Make sure HAProxy is running on storage nodes.

ncn-m001# pdsh -w ncn-s00[1-4] "systemctl status haproxy" | grep "Active"
ncn-s001:      Active: active (running) since Thu 2022-07-07 17:38:49 UTC; 54min ago
ncn-s003:      Active: active (running) since Thu 2022-07-07 17:38:49 UTC; 54min ago
ncn-s002:      Active: active (running) since Thu 2022-07-07 17:38:49 UTC; 54min ago
ncn-s004:      Active: active (running) since Thu 2022-07-07 17:38:49 UTC; 54min ago

Determine where the HAProxy VIP currently resides (for awareness in the event debug is necessary).

ncn-m001# host rgw-vip
rgw-vip.nmn has address 10.252.1.3

ncn-m001# host rgw-vip.nmn
rgw-vip.nmn has address 10.252.1.3

ncn-m001# host 10.252.1.3
3.1.252.10.in-addr.arpa domain name pointer rgw-vip.
3.1.252.10.in-addr.arpa domain name pointer rgw-vip.local.
3.1.252.10.in-addr.arpa domain name pointer rgw-vip.local.local.
3.1.252.10.in-addr.arpa domain name pointer rgw-vip.nmn.
3.1.252.10.in-addr.arpa domain name pointer rgw-vip.nmn.local.

ncn-m001# ssh rgw-vip 'hostname'
ncn-s001

Propagate the rsyslog configuration out to all storage nodes.

ncn-m001# pdcp -w ncn-s00[1-4] haproxy.conf /etc/rsyslog.d/

Propagate the HAProxy configuration out to all storage nodes.

ncn-m001# pdcp -w ncn-s00[1-4] haproxy.cfg allowed_ncns.lst /etc/haproxy/

Verify the configurations are identical across storage nodes.

ncn-m001# pdsh -w ncn-s00[1-4] "cat /etc/haproxy/haproxy.cfg" | dshbak -c
----------------
ncn-s[001-004]
----------------
# Please do not change this file directly since it is managed by Ansible and will be overwritten
global
   log         127.0.0.1 local2

   chroot      /var/lib/haproxy
   pidfile     /var/run/haproxy.pid
   maxconn     8000
   user        haproxy
   group       haproxy
   daemon
   stats socket /var/lib/haproxy/stats
   tune.ssl.default-dh-param 4096
   ssl-default-bind-ciphers EECDH+AESGCM:EDH+AESGCM
   ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets
defaults
   mode                    http
   log                     global
   option                  httplog
   option                  dontlognull
   option http-server-close
   option forwardfor       except 127.0.0.0/8
   option                  redispatch
   retries                 3
   timeout http-request    10s
   timeout queue           1m
   timeout connect         10s
   timeout client          1m
   timeout server          1m
   timeout http-keep-alive 10s
   timeout check           10s
   maxconn                 8000

frontend http-rgw-frontend
   bind *:80
   default_backend rgw-backend
   acl allow_ncns src -n -f /etc/haproxy/allowed_ncns.lst
   acl restrict_ncn_images path_beg /ncn-images
   http-request deny if restrict_ncn_images !allow_ncns 

frontend https-rgw-frontend
   bind *:443 ssl crt /etc/ceph/rgw.pem
   default_backend rgw-backend
   acl allow_ncns src -n -f /etc/haproxy/allowed_ncns.lst
   acl restrict_ncn_images path_beg /ncn-images
   http-request deny if restrict_ncn_images !allow_ncns 

backend rgw-backend
   option forwardfor
   balance static-rr
   option httpchk GET /
      server server-ncn-s001-rgw0 10.252.1.7:8080 check weight 100
      server server-ncn-s002-rgw0 10.252.1.6:8080 check weight 100
      server server-ncn-s003-rgw0 10.252.1.5:8080 check weight 100
      server server-ncn-s004-rgw0 10.252.1.4:8080 check weight 100

Restart rsyslog across all storage nodes.

ncn-m001# pdsh -w ncn-s00[1-4] "systemctl restart rsyslog"
ncn-m001# pdsh -w ncn-s00[1-4] "systemctl status rsyslog" | grep Active
ncn-s001:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago
ncn-s002:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago
ncn-s003:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago
ncn-s004:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago

Restart HAProxy across all storage nodes.

ncn-m001# pdsh -w ncn-s00[1-4] "systemctl restart haproxy"
ncn-m001# pdsh -w ncn-s00[1-4] "systemctl status haproxy" | grep Active
ncn-s001:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago
ncn-s002:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago
ncn-s003:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago
ncn-s004:      Active: active (running) since Thu 2022-07-07 13:50:39 UTC; 7s ago

Apply server-side iptables rules to storage nodes.

This is needed to prevent direct access to the Ceph Rados GW Service (not through HAProxy).

The process is written to support change on individual nodes, but could be scripted after analysis of the running firewall rule set (notably with respect to local modifications, if they exist).

This process must be completed on each storage node (steps 18 - 21).

Document where Rados GW is running (port wise). It should be the same across all storage nodes.

ncn-s001# ss -tnpl | grep rados
LISTEN    0         128                0.0.0.0:8080             0.0.0.0:*        users:(("radosgw",pid=25018,fd=77))                                            
LISTEN    0         128                   [::]:8080                [::]:*        users:(("radosgw",pid=25018,fd=78))

List existing iptables rules.

ncn-s001# iptables -L -nx -v
Chain INPUT (policy ACCEPT 399480930 packets, 1051007801113 bytes)
   pkts      bytes target     prot opt in     out     source               destination         
      0        0 DROP       tcp  --  *      *       0.0.0.0/0            10.102.4.135         tcp dpt:22

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
   pkts      bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 400599807 packets, 1035420933926 bytes)
   pkts      bytes target     prot opt in     out     source               destination

Run the following to add iptables rules for control.

The range should include all NMN NCN IP addresses generated for the HAProxy ACL step.

iptables -A INPUT -i vlan004 -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -i lo -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -m iprange --src-range 10.252.1.4-10.252.1.12 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j LOG --log-prefix "RADOSGW-DROP"
iptables -A INPUT -p tcp --dport 8080 -j DROP

List iptables rules again, verify rules are in place.

ncn-s001# iptables -L -nx -v
Chain INPUT (policy ACCEPT 22144 packets, 28721015 bytes)
   pkts      bytes target     prot opt in     out     source               destination         
      0        0 DROP       tcp  --  *      *       0.0.0.0/0            10.102.4.135         tcp dpt:22
      0        0 ACCEPT     tcp  --  vlan004 *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8080
      85     4862 ACCEPT     tcp  --  lo     *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8080
   276    15438 ACCEPT     tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 source IP range 10.252.1.4-10.252.1.12
      0        0 LOG        tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 LOG flags 0 level 4 prefix "RADOSGW-DROP"
      0        0 DROP       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8080

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
   pkts      bytes target     prot opt in     out     source               destination         

Chain OUTPUT (policy ACCEPT 22099 packets, 30141111 bytes)
   pkts      bytes target     prot opt in     out     source               destination

Test connectivity after applying the ACL.

Re-run the connectivity test. While the results will be similar, they should all now be passing:

ncn-m001# bash ./con_test.sh 
[i] MGMT RADOS over NMN
RADOS ncn-s001.nmn: PASS
RADOS ncn-s002.nmn: PASS
RADOS ncn-s003.nmn: PASS
[i] MGMT RADOS over HMN
RADOS ncn-s001.hmn: PASS
RADOS ncn-s002.hmn: PASS
RADOS ncn-s003.hmn: PASS
[i] MGMT RADOS over CAN
RADOS ncn-s001.can: PASS
RADOS ncn-s002.can: PASS
RADOS ncn-s003.can: PASS
[i] MGMT HAProxy over NMN
HAPROXY (CEPH) HTTP ncn-s001.nmn: PASS
HAPROXY (CEPH) HTTPS ncn-s001.nmn: PASS
HAPROXY (CEPH) HTTP ncn-s002.nmn: PASS
HAPROXY (CEPH) HTTPS ncn-s002.nmn: PASS
HAPROXY (CEPH) HTTP ncn-s003.nmn: PASS
HAPROXY (CEPH) HTTPS ncn-s003.nmn: PASS

Validate no connection can be made to HAProxy for ncn-images or Ceph RADOS GW (at all) from compute nodes and UANs.

Use rgw-vip as it will resolve to one of the storage nodes.

nid000002# host rgw-vip
rgw-vip has address 10.252.1.3

nid000002# curl http://rgw-vip/ncn-images/
<html><body><h1>403 Forbidden</h1>
Request forbidden by administrative rules.
</body></html>

nid000002# curl -k https://rgw-vip/ncn-images/
<html><body><h1>403 Forbidden</h1>
Request forbidden by administrative rules.
</body></html>

nid000002# curl --connect-timeout 2 rgw-vip:8080
curl: (28) Connection timed out after 2001 milliseconds

Look for a 403 response in the HAProxy logs:

ncn-m001# pdsh -N -w ncn-s00[1-4] "cd /var/log && zgrep -h -i -E 'haproxy.*frontend' messages || exit 0" | grep "ncn-images"
2022-07-13T13:57:08+00:00 xxx-ncn-s001.local haproxy[43591]: 10.252.1.13:50238 [13/Jul/2022:13:57:08.363] http-rgw-frontend http-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
2022-07-13T14:01:11+00:00 xxx-ncn-s001.local haproxy[43591]: 10.252.1.13:50240 [13/Jul/2022:14:01:11.038] http-rgw-frontend http-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
...

In the firewall logs, the Ceph RADOS GW traffic will be dropped on the storage node. For example:

ncn-m001# pdsh -N -w ncn-s00[1-4] "grep RADOSGW /var/log/firewall"
2022-07-13T14:02:03.418750+00:00 xxx-ncn-s001 kernel: [4397628.546654] RADOSGW-DROPIN=vlan002 OUT= MAC=b8:59:9f:f9:1d:22:a4:bf:01:3f:6f:91:08:00 SRC=10.252.1.13 DST=10.252.1.3 LEN=52 TOS=0x00 PREC=0x00 TTL=64 ID=9727 DF PROTO=TCP SPT=59278 DPT=8080 WINDOW=42340 RES=0x00 SYN URGP=0 
...

For further validation, the following script can be saved to a UAN or compute node with a storage node count as an input.

This will test cross-network select access that should not be possible based on a correctly configured switch ACL posture, as well.

nid000002# cat user_con_test.sh 
CURL_O="--connect-timeout 2 -f"
NODE_COUNT="$1"

function curl_rept
{
   echo -n "[i] $1 -> "
   $1 &> /dev/null
   if [ $? -ne 0 ]
   then
      echo "PASS"
   else
      echo "FAIL"
   fi
   return
} 

for n in `seq 1 $NODE_COUNT`
do
   for t in nmn can hmn
   do
      curl_rept "curl $CURL_O ncn-s00${n}.${t}:8080" # ceph rados
      curl_rept "curl $CURL_O http://ncn-s00${n}.${t}/ncn-images/" # ncn images, http
      curl_rept "curl $CURL_O -k  https://ncn-s00${n}.${t}/ncn-images/" # ncn images, https
   done
done

nid000002# bash ./user_con_test.sh 4
[i] curl --connect-timeout 2 -f ncn-s001.nmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s001.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s001.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s001.can:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s001.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s001.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s001.hmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s001.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s001.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s002.nmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s002.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s002.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s002.can:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s002.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s002.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s002.hmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s002.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s002.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s003.nmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s003.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s003.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s003.can:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s003.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s003.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s003.hmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s003.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s003.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s004.nmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s004.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s004.nmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s004.can:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s004.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s004.can/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f ncn-s004.hmn:8080 -> PASS
[i] curl --connect-timeout 2 -f http://ncn-s004.hmn/ncn-images/ -> PASS
[i] curl --connect-timeout 2 -f -k https://ncn-s004.hmn/ncn-images/ -> PASS

Save the iptables rule set on all storage nodes and make it persistent across reboots.

Create a directory to hold the iptables configuration.

ncn-m001# pdsh -w ncn-s00[1-4] "mkdir --mode=750 /etc/iptables"

Create a one-shot systemd service to load iptables on system boot.

ncn-m001# cat << EOF > metal-iptables.service
[Unit]
Description=Loads Metal iptables config
After=local-fs.target network.service

[Service]
Type=oneshot
ExecStart=/usr/sbin/iptables-restore /etc/iptables/metal.conf
Restart=no
RemainAfterExit=no

[Install]
WantedBy=multi-user.target
EOF

ncn-m001# chmod 640 metal-iptables.service

Distribute the one-shot systemd service to the storage nodes.

ncn-m001# pdcp -w ncn-s00[1-4] metal-iptables.service /usr/lib/systemd/system

Enable the service.

ncn-m001# pdsh -w ncn-s00[1-4] "systemctl enable metal-iptables.service"

Use iptables-save to commit running rules to the persistent configuration.

ncn-m001# pdsh -w ncn-s00[1-4] "iptables-save -f /etc/iptables/metal.conf"

Execute the one-shot systemd service.

ncn-m001# pdsh -w ncn-s00[1-4] "systemctl start metal-iptables.service"

Verify the rule set is consistent across nodes.

ncn-m001# pdsh -w ncn-s00[1-4] "cat /etc/iptables/metal.conf" | grep "8080" | dshbak -c
----------------
ncn-s[001-004]
----------------
-A INPUT -i vlan004 -p tcp -m tcp --dport 8080 -j ACCEPT
-A INPUT -i lo -p tcp -m tcp --dport 8080 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 8080 -m iprange --src-range 10.252.1.4-10.252.1.14 -j ACCEPT
-A INPUT -p tcp -m tcp --dport 8080 -j LOG --log-prefix RADOSGW-DROP
-A INPUT -p tcp -m tcp --dport 8080 -j DROP

Troubleshooting

NOTE: If SMA log forwarders are not yet running, then it might be necessary to temporarily disable the /etc/rsyslog.d/01-cray-rsyslog.conf rule (for logs to flow to the local nodes without delay). Restart rsyslog if this action is required.

Look for RADOSGW drops in /var/log/firewall on storage nodes, note that the connectivity test will attempt access on the CAN.

ncn-m001# pdsh -N -w ncn-s00[1-4] "grep RADOSGW /var/log/firewall" | grep vlan007 | head -3
2022-08-01T21:22:01.049443+00:00 ncn-s003 kernel: [13242021.397679] RADOSGW-DROPIN=vlan007 OUT= MAC=14:02:ec:d9:79:d0:94:40:c9:5f:9a:84:08:00 SRC=10.103.13.13 DST=10.103.13.5 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=35159 DF PROTO=TCP SPT=60482 DPT=8080 WINDOW=42340 RES=0x00 SYN URGP=0 
2022-08-01T21:22:05.061945+00:00 ncn-s001 kernel: [13248180.144514] RADOSGW-DROPIN=vlan007 OUT= MAC=14:02:ec:da:bc:68:94:40:c9:5f:9a:84:08:00 SRC=10.103.13.13 DST=10.103.13.7 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=10604 DF PROTO=TCP SPT=43034 DPT=8080 WINDOW=42340 RES=0x00 SYN URGP=0 
2022-08-01T21:22:02.047541+00:00 ncn-s003 kernel: [13242022.399499] RADOSGW-DROPIN=vlan007 OUT= MAC=14:02:ec:d9:79:d0:94:40:c9:5f:9a:84:08:00 SRC=10.103.13.13 DST=10.103.13.5 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=35160 DF PROTO=TCP SPT=60482 DPT=8080 WINDOW=42340 RES=0x00 SYN URGP=0 
...

Look for HAProxy access logs in /var/log/messages on storage nodes that have HTTP 403 responses (or other responses depending upon context).

ncn-m001# pdsh -N -w ncn-s00[1-3] "cd /var/log && zgrep -h 'haproxy.*frontend' messages || exit 0" | grep " 403 " | sort -k 1
2022-08-01T21:36:28+00:00 localhost haproxy[20903]: 10.252.1.20:37248 [01/Aug/2022:21:36:28.679] http-rgw-frontend http-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
2022-08-01T21:36:57+00:00 localhost haproxy[20903]: 10.252.1.20:53358 [01/Aug/2022:21:36:57.898] https-rgw-frontend~ https-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
2022-08-01T21:39:35+00:00 localhost haproxy[20903]: 10.252.1.20:40400 [01/Aug/2022:21:39:35.141] https-rgw-frontend~ https-rgw-frontend/<NOSRV> 1/-1/-1/-1/1 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
2022-08-01T21:39:35+00:00 localhost haproxy[20903]: 10.252.1.20:57530 [01/Aug/2022:21:39:35.134] http-rgw-frontend http-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
2022-08-01T21:39:37+00:00 localhost haproxy[20903]: 10.252.1.20:34828 [01/Aug/2022:21:39:37.152] http-rgw-frontend http-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"
2022-08-01T21:39:37+00:00 localhost haproxy[20903]: 10.252.1.20:57896 [01/Aug/2022:21:39:37.159] https-rgw-frontend~ https-rgw-frontend/<NOSRV> 0/-1/-1/-1/0 403 212 - - PR-- 1/1/0/0/0 0/0 "GET /ncn-images/ HTTP/1.1"