<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Feb 25, 2017 at 12:47 AM, Clint Byrum <span dir="ltr"><<a href="mailto:clint@fewbar.com" target="_blank">clint@fewbar.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Excerpts from joehuang's message of 2017-02-25 04:09:45 +0000:<br>

<span class="gmail-">> Hello, Matt,<br>

><br>

> Thank you for your reply. As you mentioned, async replication should work for slowly changing data. My concern is the impact of replication delay, for example (though the chance of it happening is quite low):<br>

><br>

> 1) Add a new user/group/role in RegionOne. If the new user begins to access RegionTwo services before the new user/group/role has been replicated to RegionTwo, the user's request to RegionTwo may be rejected because token validation fails in the local Keystone, where the data has not arrived yet.<br>

><br>

<br>

</span>I think this is entirely acceptable. You can even check with your<br>

monitoring system to find out what the current replication lag is to<br>

each region, and notify the user of how long it may take.<br>

<span class="gmail-"><br>

> 2) In the token revocation case: if we remove the user's role in RegionOne, the token becomes invalid in RegionOne immediately, but until the removal is replicated to RegionTwo, the user can still use the token to access services in RegionTwo, although only for a very short interval.<br>

><br>

> Can someone evaluate whether the security risk is acceptable?<br>

><br>

<br>

</span>The simple answer is that the window between a revocation event being<br>

created, and being ubiquitous, is whatever the maximum replication lag<br>

is between regions. So if you usually have 5 seconds of replication lag,<br>

it will be 5 seconds. If you have a really write-heavy day, and you<br>

suddenly have 5 minutes of replication lag, it will be 5 minutes.<br>

<br>

The complicated component is that in async replication, reducing<br>

replication lag is expensive. You don't have many options here. Reducing<br>

writes on the master is one of them, but that isn't easy! Another is<br>

filtering out tables on slaves so that you only replicate the tables<br>

that you will be reading. But if there are lots of replication events,<br>

that doesn't help.<br></blockquote><div><br></div><div>This is a good point and something that was much more prevalent with UUID tokens. We still write *all* the data from a UUID token to the database, which includes the user, project, scope, possibly the service catalog, etc. When a UUID token was validated, it was pulled from the database and returned to the user. The information in the UUID token wasn't confirmed at validation time. For example, if you authenticated for a UUID token scoped to a project with the `admin` role, the role and project information persisted in the database would reflect that. If your `admin` role assignment was removed from the project and you validated the token, the token reference in the database would still contain the `admin` role on the project. At the time, the approach to fixing this was to create a revocation event that would match specific attributes of that token (i.e. the `admin` role on that specific project). As a result, the token validation process would pull the token from the backend, then pass it to the revocation API and ask whether the token was revoked based on any pre-existing revocation events.</div><div><br></div><div>The fernet approach to solving this was fundamentally different because we didn't have a token reference to pull from the backend that represented the authorization context at authentication time (which we did have with UUID). Instead, what we can do at validation time is decrypt the token, ask the assignment API for role assignments given a user and project [0], and raise a 401 if that user has no roles on the project [1]. So, by rebuilding the authorization context at validation time, we no longer need to rely on revocation events to enforce role revocation (though we do still need them to enforce other kinds of revocation with fernet). 
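<br><br>As a rough illustration of the validation-time check described above, here is a minimal Python sketch. Every name in it (`AssignmentStore`, `validate`, `Unauthorized`, the payload keys) is hypothetical and is not keystone's actual API; the real logic is in [0] and [1].<br>
<pre>
```python
# Hypothetical sketch: these names are illustrative, not keystone's API.


class Unauthorized(Exception):
    """Stands in for an HTTP 401 response."""


class AssignmentStore:
    """In-memory stand-in for the assignment backend."""

    def __init__(self):
        # Maps (user_id, project_id) to a set of role names.
        self._roles = {}

    def grant(self, user_id, project_id, role):
        self._roles.setdefault((user_id, project_id), set()).add(role)

    def revoke(self, user_id, project_id, role):
        self._roles.get((user_id, project_id), set()).discard(role)

    def list_roles(self, user_id, project_id):
        return sorted(self._roles.get((user_id, project_id), set()))


def validate(payload, assignments):
    """Rebuild the authorization context from current assignments.

    `payload` stands in for a decrypted token that carries only
    identifiers (user and project), not the roles themselves.
    """
    user_id = payload["user_id"]
    project_id = payload["project_id"]
    roles = assignments.list_roles(user_id, project_id)
    if not roles:
        # No roles left on the project, so the token is rejected.
        raise Unauthorized("user has no roles on project")
    return {"user_id": user_id, "project_id": project_id, "roles": roles}
```
</pre>
The point of the sketch is that removing the role assignment is enough to make the next validation fail, with no revocation event involved.<br><br>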
The tradeoff is that performance degrades if you're using fernet without caching, because we have to rebuild all of that information instead of just returning a reference from the database. This led us to make significant improvements to keystone's caching implementation to improve token validation time overall, especially for fernet. As of the last release, UUID tokens are validated in exactly the same way as fernet tokens. Our team also made some improvements to listing and comparing token references in the revocation API [2] [3] (thanks to Richard, Clint, and Ron for driving a lot of that work!).</div><div><br></div><div>Since both token formats rebuild the authorization context at validation time, we can remove some revocation events that are no longer needed. This means we won't be storing as many revocation events on role removal from domains and projects. Instead, we will rely on the revocation API only to invalidate tokens for cases like specific token revocation or password changes (the new validation design enforces role assignments for us automatically). 
This should reduce the amount of data being replicated due to massive amounts of revocation events.</div><div><br></div><div>We do still have some more work to do on this front, but I can dig into it and see what's left.</div><div> </div><div><br></div><div>[0] <a href="https://github.com/openstack/keystone/blob/724c9b7cd91a97fca8af7f7ec4ec8ef3d450681b/keystone/token/providers/common.py#L142-L150">https://github.com/openstack/keystone/blob/724c9b7cd91a97fca8af7f7ec4ec8ef3d450681b/keystone/token/providers/common.py#L142-L150</a></div><div>[1] <a href="https://github.com/openstack/keystone/blob/724c9b7cd91a97fca8af7f7ec4ec8ef3d450681b/keystone/token/providers/common.py#L327-L340">https://github.com/openstack/keystone/blob/724c9b7cd91a97fca8af7f7ec4ec8ef3d450681b/keystone/token/providers/common.py#L327-L340</a></div><div>[2] <a href="https://github.com/openstack/keystone/commit/9e84371461831880ce5736e9888c7d9648e3a77b">https://github.com/openstack/keystone/commit/9e84371461831880ce5736e9888c7d9648e3a77b</a></div><div>[3] <a href="https://github.com/openstack/keystone/commit/477189d0c51a3fd2f8709827fb635934f0e74200">https://github.com/openstack/keystone/commit/477189d0c51a3fd2f8709827fb635934f0e74200</a></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

One decent option is to switch to semi-sync replication:<br>

<br>

<a href="https://dev.mysql.com/doc/refman/5.7/en/replication-semisync.html" rel="noreferrer" target="_blank">https://dev.mysql.com/doc/<wbr>refman/5.7/en/replication-<wbr>semisync.html</a><br>

<br>

That will at least make sure your writes aren't acknowledged until the<br>

binlogs have been transferred everywhere. But if your master can take<br>

writes a lot faster than your slaves, you may never catch up applying, no matter<br>

how fast the binlogs are transferred.<br>

<br>

The key is to evaluate your requirements and think through these<br>

solutions. Good luck! :)<br>

<br>

</blockquote></div><br></div></div>