Office 365 Outage


The outage affecting Office 365 portal sign-in yesterday eventually spread to the email services hosted by Exchange Online.
This was the reported outage data:

MO133518 - Can't sign in
Status:
Service restored
User impact:
Users may have been unable to access the Office 365 service.
Latest message:
Title: Can't sign in
User Impact: Users may have been unable to access the Office 365 service.
More info: Users may have been unable to sign in to Exchange Online, SharePoint Online, Skype for Business, Intune, Power BI, Microsoft Teams, Yammer and the Office 365 portal. Admins affected by this issue may not have had access to the Admin center. However, updates were provided on https://status.office.com and via the twitter handle @Office365Status.
Final status: A subset of infrastructure that handles authentication requests became degraded, causing user sign in failures. We recovered the affected infrastructure to mitigate impact and monitored the service to confirm resolution.
Scope of impact: This issue could have potentially affected any of your users if they were routed through the affected infrastructure.
Start time: Friday, April 6, 2018, at 8:30 AM UTC
End time: Friday, April 6, 2018, at 11:30 AM UTC
Preliminary root cause: A subset of infrastructure that handles authentication requests became degraded.
Next steps:
- We're analyzing performance data and trends on the affected systems to help prevent this problem from happening again. We'll publish a preliminary post-incident report within 48 hours.
Updated:
2018-04-06 17:02 (UTC)
Start time:
2018-04-06 08:30 (UTC)
End time:
2018-04-06 17:02 (UTC)

Skype For Business Trusted Server model

Multiple domains in an Office 365 tenancy can present problems when users try to log in to Skype for Business 2015 and their SIP URI differs from the SIP domain.

To resolve this, create a Group Policy Object using the Office 2016 ADMX templates and, specifically, the Skype for Business policies contained within.

These are used to add the SIP domain where it differs from the user's SIP sign-in name.

The policy can be applied per computer or per user, and it means the annoying pop-up announcing the disparity between the SIP domain and the sign-in name is no longer displayed.
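For reference, here is roughly what that policy boils down to on the client. This is a sketch only: I'm assuming the ADMX setting writes the trusted domain list to the TrustModelData value under the Office 16.0 Lync policy key, and the domain shown is a placeholder for your own SIP domain.

#Sketch: what the GPO effectively sets on each client (per-user variant shown).
#Assumes the Office 2016 ADMX trusted domain policy maps to TrustModelData; contoso.com is a placeholder.
$key = "HKCU:\Software\Policies\Microsoft\Office\16.0\Lync"
New-Item -Path $key -Force | Out-Null
Set-ItemProperty -Path $key -Name "TrustModelData" -Value "contoso.com" -Type String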




A simple solution, but implementation is very often complicated by the change advisory board (CAB) bureaucracy in place within your organisation.

Skype Online Smart Auto Contacts

A cautionary tale...


A little-known option in Skype for Business Online has caused some of our customers to become less than happy with the product.

I'm talking about the pop-ups that announce the change in status of some of their contacts, which can appear at the most inopportune moments.

The option is described in the documentation as the Smart Auto Contacts list and is a binary, organisation-wide setting. That is as fine as the granularity gets, so I would like to see Microsoft make it more obvious to the tenant admin in the Office 365 options.

This option is on by default and causes the manager attribute of the user's AD account to be evaluated at first Skype client logon. If that attribute is populated, and the manager's own AD account lists team members (direct reports), then those people are added to the Skype client in a group called 'My Group', with every member tagged for status change alerts. This information is only assessed once and is therefore unaffected by subsequent AD changes.
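If you want a feel for how many users will be affected before you enable (or fail to disable) it, a rough AD query along these lines shows whose 'My Group' would be populated. This is a sketch using the standard ActiveDirectory module; the SearchBase is a placeholder for your own user OU.

#Sketch: users whose manager has direct reports will get a populated 'My Group' at first Skype logon.
Import-Module ActiveDirectory
Get-ADUser -Filter 'Enabled -eq $true' -SearchBase "OU=Users,DC=contoso,DC=com" -Properties Manager |
    Where-Object { $_.Manager } |
    ForEach-Object {
        $mgr = Get-ADUser -Identity $_.Manager -Properties directReports
        [pscustomobject]@{
            User      = $_.SamAccountName
            Manager   = $mgr.SamAccountName
            GroupSize = ($mgr.directReports | Measure-Object).Count
        }
    }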

We have issued some five and a half thousand licenses to our users and also set a GPO to enable Skype logon which, when used with the appropriate SIP address in the on-prem AD user attributes, forces all of our users to auto-logon to the Skype client.
This reduces our opportunity to minimise the impact of Smart Auto Contacts by deactivating it to almost zero. Sure, we could wipe the Skype licenses from all of our users, turn off the Smart Auto Contacts list option and issue all of the licenses again, but the potential negative customer feedback doesn't seem worthwhile.

We'll look for alternative solutions to deactivate this option on all client PCs via the Windows registry, as messy as that might be.

Azure AD Connect Sync Errors

We were experiencing increasingly long sync times between our on-premises Active Directory and Azure AD, which became apparent during the use of our user account creation tools.

Since we are in hybrid mode we have a mixed user population: some users have cloud home mailboxes, some on-prem, and whether or not a team has been onboarded yet determines where their mailboxes are homed.

The script that creates the user account and mail-enables it has to wait for Azure AD to sync before it can write the Exchange GUID of the cloud mailbox back to the remote mailbox on-prem. This pause started out at around thirty minutes and ended up at more than three hours in some cases.
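The waiting part of that script is essentially a polling loop. Here is a minimal sketch of the idea, assuming both the EXO and on-prem Exchange cmdlets are already loaded in the session; the UPN is hypothetical.

#Sketch: poll until the cloud mailbox exists, then stamp its GUID on the on-prem remote mailbox.
$upn = "new.user@contoso.com"
$deadline = (Get-Date).AddHours(3)
do {
    $cloudMbx = Get-EXOMailbox -Identity $upn -ErrorAction SilentlyContinue
    if (-not $cloudMbx) { Start-Sleep -Seconds 300 }   #wait for AAD Connect / EXO provisioning
} while (-not $cloudMbx -and (Get-Date) -lt $deadline)
if ($cloudMbx) {
    Set-RemoteMailbox -Identity $upn -ExchangeGuid $cloudMbx.ExchangeGuid
}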

Some investigation later, we identified errors in the Azure AD Connect sync log.


The Azure-hosted AD domain controllers had been downsized to A1 and stripped of the Windows GUI to offset the reduced resources, but it seems that at peak times the DCs couldn't cope with both replication duties and sync pull requests from the Azure AD Connect sync engine.

The issue was resolved by increasing the size of the DC VMs back to A2 (from 1 core and 1.75 GB RAM to 2 cores and 3.5 GB RAM). We are no longer seeing any Azure AD Connect sync errors and the object sync intervals to Azure are back within reasonable timeframes.
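For completeness, the resize itself is a one-liner with the classic (ASM) cmdlets we use elsewhere; the service and VM names below are placeholders.

#Sketch: resize a classic VM to A2 (instance size "Medium" equates to 2 cores, 3.5 GB RAM).
Get-AzureVM -ServiceName "contoso-dc-svc" -Name "AZDC01" | Set-AzureVMSize -InstanceSize "Medium" | Update-AzureVM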

OnPrem DL Membership

OnPrem DL Membership Management

The hybrid configuration of an Office 365 environment poses many challenges.
This is especially true when the organisation spans many domains and partner entities.
With 6,000 user mailboxes and 7,000 functional mailboxes to onboard to EXO, the business process for the migration to the cloud is complex, and it is greatly exacerbated by a shared-services model that entangles the dependencies of multi-departmental shared mailboxes.

To partially mitigate online users losing the ability to manage on-prem distribution groups, we are now asking our users to manage their own DLs via the dsquery query window:

C:\windows\system32\rundll32.exe dsquery.dll,OpenQueryWindow

This is a less-than-ideal solution, but at least we can continue to provide a means for end users to manage DL membership across the online/on-prem boundary.
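To save users typing that out, something like this will drop a shortcut to the query window onto the desktop; the shortcut name and path are just examples.

#Sketch: publish a desktop shortcut that opens the AD 'Find' dialog for DL management.
$shell    = New-Object -ComObject WScript.Shell
$shortcut = $shell.CreateShortcut("$env:PUBLIC\Desktop\Manage Distribution Lists.lnk")
$shortcut.TargetPath = "$env:windir\System32\rundll32.exe"
$shortcut.Arguments  = "dsquery.dll,OpenQueryWindow"
$shortcut.Save()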

Ideally we would have batched the users to avoid this problem but in any large organisation this is impractical, especially if the configuration was not designed with this in mind.

Office 365 - Licensed Shared Mailboxes

We run a hybrid-mode AD/Exchange 2010 environment and are keen to keep licenses applied only to those resources that absolutely need them. As part of general housekeeping we need to ensure that shared mailboxes are not consuming a license, so I created this script to check for mailboxes that are shared and yet have an Office 365 license assigned.

#Get Your Office 365 credentials
$creds= get-credential -Message "Enter your Office 365 username and password"

#Connect to Azure and EXO
Connect-AzureAD -credential $creds
Connect-ExchangeOnline

#Grab the data that we require

#All users with an enabled Exchange Online service plan
$all_ncc_exo_users_exp = Get-AzureADUser -All $true | Select-Object UserPrincipalName -ExpandProperty AssignedPlans |
    Where-Object {$_.CapabilityStatus -eq "Enabled" -and $_.Service -eq "exchange"} | Select-Object -ExpandProperty UserPrincipalName -Unique

#Match them back to on-prem AD; our shared mailbox accounts live in OUs whose DNs contain "generic" (but not "kitchen")
$licensed_sharedmbs = $all_ncc_exo_users_exp | ForEach-Object { Get-ADUser -Filter "UserPrincipalName -eq '$_'" -Properties DistinguishedName } |
    Where-Object { ($_.DistinguishedName -match "generic") -and ($_.DistinguishedName -notmatch "kitchen") }

#Output the mailboxes that are shared but still licensed
$licensed_sharedmbs | Format-List Name, UserPrincipalName
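If you then want to claw those licenses back, something along these lines should do it with the same AzureAD module. The SKU part number is an example (E3) and needs to match whatever you actually assign, so treat this as a sketch rather than gospel.

#Sketch: strip the Office 365 license from each flagged shared mailbox. "ENTERPRISEPACK" (E3) is an example SKU.
$sku = Get-AzureADSubscribedSku | Where-Object { $_.SkuPartNumber -eq "ENTERPRISEPACK" }
$removal = New-Object -TypeName Microsoft.Open.AzureAD.Model.AssignedLicenses
$removal.RemoveLicenses = $sku.SkuId
foreach ($mbx in $licensed_sharedmbs) {
    Set-AzureADUserLicense -ObjectId $mbx.UserPrincipalName -AssignedLicenses $removal
}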

As always, use at your own risk and post in the comments if you need a hand.

Strong Authentication method

G'day readers. Tonight's challenge was an upgrade to the latest version of our MS 2FA servers so we can add another two on-prem 2FA servers to the cluster, eradicate our Freja appliance and replace it with the Microsoft equivalent 2FA service. This will save our organisation a considerable sum of money, and in the current climate of fiscal restraint that is a good thing!

We had already done the groundwork by installing the two new 2FA servers onto fresh Windows OS installs; to add new hosts to the 2FA cluster, the existing ones need to be at the same version level.

The upgrades were completed by removing one endpoint from the load balancing function after ensuring that the remaining 2FA instance would have the master designation.

Get-AzureVM -ServiceName <service_name> -Name <vm_name> | Remove-AzureEndpoint -Name <endpoint_name> | Update-AzureVM

This allows the quiesced 2FA host to be upgraded via the GUI. 'Check for Updates' actually did nothing, so we were forced to download the latest 2FA version as per this:

https://docs.microsoft.com/en-us/azure/multi-factor-authentication/multi-factor-authentication-get-started-server

and run the MSI installer manually, which is what the 'Check for updates' GUI action is supposed to do anyway.

Running this MSI did indeed update the 2FA instance to the latest version, at which point a restart of the application insisted that we also update the AD FS adapter. This all appeared to complete without any issues, not even a prompt to restart the OS. Time to test that AD FS login still works: as we are still getting 2FA SMS messages, the service is still authenticating users.

The next step is to add the updated 2FA host back to the load balancer set. PowerShell ISE commands do the heavy lifting of the endpoint creation:

Get-AzureVM -ServiceName <service_name> -Name <vm_name> | Add-AzureEndpoint -Name <endpoint_name> -Protocol <protocol> -LocalPort <local_port> -PublicPort <public_port> -DefaultProbe -InternalLoadBalancerName <name> -LBSetName <name> | Update-AzureVM



Test that AD FS 2FA messages are still being sent, and we can now redirect the target of our upgrade work to the second 2FA server, making the upgraded server the master 2FA host. Remove the second 2FA endpoint from the load balancer, run its 2FA MSI upgrade process, then update its AD FS adapter and add its endpoint back to the load balancer.
Finally, test that the AD FS authentication process is still up and sending 2FA SMS messages, and we can consider this process complete.
Here is where the upgrade process fell over in a big pile of steaming dung.
We shut down one of the 2FA servers to simulate endpoint failover and tried ADFS logon, only to get a rather helpful error back from the user portal stating that an error had occurred.
Checking the log files on the AD FS server, we can see a repeated log entry:

Event 364:
Exception details: 
Microsoft.IdentityServer.RequestFailedException: No strong authentication method found for the request from https://XXXXXX. [redacted]
   at Microsoft.IdentityServer.Web.PassiveProtocolListener


We have an obscure error that Google research is yet to become acquainted with and an in-progress upgrade that cannot easily be backed out of. Squeaky seat time.

As we had a custom claims rule in place, the MMC GUI was unable to display the existing rule, so we reviewed and then cleared it via PowerShell, starting with:

Get-ADFSRelyingPartyTrust
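Assuming the custom rule was an additional authentication (MFA) rule on the Office 365 relying party trust, the clearing step looks something like this; the trust name below is the usual default, but treat it as a placeholder.

#Sketch: view the custom additional authentication rules, then clear them back to default.
Get-AdfsRelyingPartyTrust -Name "Microsoft Office 365 Identity Platform" | Select-Object -ExpandProperty AdditionalAuthenticationRules
Set-AdfsRelyingPartyTrust -TargetName "Microsoft Office 365 Identity Platform" -AdditionalAuthenticationRules $null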


This effectively reset the claims rule back to a factory-default condition so we could confirm the GUI was configured correctly.

A review of the log files on the 2FA server suggested that the AD FS adapter being requested was from the previous version of the 2FA server.

A check in the Program Files folder of the 2FA server shows there are two PowerShell scripts to unregister and re-register the AD FS adapters with the 2FA application.

After running the unregister and register PowerShell scripts, the AD FS adapters were recognised by the 2FA server and we are no longer seeing the 'An error has occurred' message on the user portal when logging on via AD FS.
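For anyone following along at home, the re-registration boils down to the two scripts that ship with the MFA Server install (default path shown, adjust to suit), followed by a bounce of the AD FS service.

#Sketch: re-register the MFA AD FS adapter, assuming the default install location.
cd "C:\Program Files\Multi-Factor Authentication Server"
.\Unregister-MultiFactorAuthenticationAdfsAdapter.ps1
.\Register-MultiFactorAuthenticationAdfsAdapter.ps1
#AD FS usually needs a service restart to pick the adapter up again.
Restart-Service adfssrv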

Wifi SSID deployment

Intune wifi SSID deployment

Managing devices within the organisation is getting easier with Microsoft's Intune service, especially if, like us, you have a hybrid on-prem/cloud environment. However, Microsoft needs to up its game: this is a very competitive market, and if my experiences with the portal are anything to go by, the product needs a lot of work before it can challenge the likes of Sophos or MobileIron for domination of this arena.

Today was a good example of what should be a simple process becoming nigh on impossible to troubleshoot due to the lack of log files in the Intune dashboard.

I'd have thought that an Intune custom configuration policy would be a simple task to create and deploy, but after eight hours of trying I've had to admit that this one will either have to wait for a call to Microsoft's product support team or for the Intune admin centre to be updated with specific policies that support wifi profile delivery at device enrolment.
Even Configuration Manager 2012 R2 lacks complete support for wifi profile deployment: the inability to assign a password for the SSID looks like a glaring oversight. Granted, this can be overcome with workarounds, but if a product isn't ready for production release then don't release it until it is.
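If you want to experiment in the meantime, the usual approach is to capture the WLAN profile XML (pre-shared key included) from a reference machine and feed it to a custom OMA-URI policy. A rough sketch follows; the SSID, paths and CSP node are as I understand them, so verify before relying on it.

#Sketch: export the Wi-Fi profile from a reference machine. "CorpWifi" and the folder are placeholders;
#the exported file name may vary.
netsh wlan export profile name="CorpWifi" key=clear folder="C:\Temp"
#The XML content then becomes the value of an Intune custom (OMA-URI) string setting such as:
#  ./Vendor/MSFT/WiFi/Profile/CorpWifi/WlanXml
Get-Content "C:\Temp\Wi-Fi-CorpWifi.xml"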

The one hundred eyed giant from hell

Argus

Yesterday, I learned why project planning and adequate lead time for implementation are so vital. We were instructed by the higher powers to create an instance of these two apps and have it working and in production within two working days, necessitated by a multi-million-pound bid (as they always are) that absolutely has to be started on Monday morning.
Especially galling was the fact that this had been on the order books for more than a month, but because the pipeline is only barely scrutinised, and by those with less than a complete technical understanding, it got missed.
Luckily it was a standard client/server model and they had sprung for technical support, so much time was spent talking to the support guys, who to their credit were patient, professional and happy to help. The accompanying documentation looked to have been written with the intent of driving technical delivery teams to distraction, or of giving customers no choice but to pay for the technical support function, because several supposedly insurmountable obstacles turned out either to be safely ignorable or to have workarounds via a different approach. This, it seems, is the major competence any technical delivery specialist must have in spades to excel.
We got the thing installed and were about to start on what we thought was a mod to this app when it became apparent that it was actually another app in its own right, requiring as much if not more work to get running. Luckily, the database team were accommodating, if a little frustrated by the whole unfortunate episode. As it stands, we still require permission to add another app to the host server and to deliver the clients to the user groups too. Then there's the documentation to write and the discussion about lessons learned, which will be an interesting read for obvious reasons!
The main technical takeaway from this activity was this command:

      netsh http add urlacl url=https://+:8103/ user="Domain\Username"

This is what it does:

Namespace reservation assigns the rights for a portion of the HTTP URL namespace to a particular group of users. A reservation gives those users the right to create services that listen on that portion of the namespace. Reservations are URL prefixes, meaning that the reservation covers all sub-paths of the reservation path. Namespace reservations permit two ways to use wildcards. The HTTP Server API documentation describes the order of resolution between namespace claims that involve wildcards.
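Two related commands that are handy when checking or undoing that reservation (same URL as above):

netsh http show urlacl url=https://+:8103/
netsh http delete urlacl url=https://+:8103/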

We're scheduled to complete the second app install on Monday so we'll see what happens when this tale continues.