Monday, May 12, 2014

SharePoint 2013: 500 Internal Server Error and Security Token Service Failure

Problem

You connect to the primary website for your customers and receive an HTTP 500 Internal Server Error response.
You check farm server logs, and, on one of the web front ends (WFE), you see application errors like the following:
Log Name:      Application
Source:        Microsoft-SharePoint Products-SharePoint Foundation
Date:          [date/time]
Event ID:      8306
Task Category: Claims Authentication
Level:         Error
Keywords:      
User:          [search or other service account]
Computer:      [name of problematic WFE]
Description:
An exception occurred when trying to issue security token: The server was 
unable to process the request due to an internal error.  For more 
information about the error, either turn on IncludeExceptionDetailInFaults 
(either from ServiceBehaviorAttribute or from the servicedebug configuration 
behavior) on the server in order to send the exception information back to 
the client, or turn on tracing as per the Microsoft .NET Framework SDK 
documentation and inspect the server trace logs..
Event Xml:
...
The User field in this error may show different service accounts, depending on which farm services are affected.  You then connect to Central Administration, review the health report, and see the following rule violation:
The Security Token Service is not available
and the server triggering this rule violation is one of the farm's WFEs.
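
If you would rather query the Application log from a prompt than browse Event Viewer, something like the following may help. It is only a sketch: it assumes Python is available on the WFE and shells out to the built-in wevtutil tool. The function names are my own, not part of any SharePoint tooling.

```python
import subprocess


def build_event_query(event_id, max_events=5):
    """Build a wevtutil command that lists recent Application log events by ID."""
    xpath = f"*[System[(EventID={int(event_id)})]]"
    return [
        "wevtutil", "qe", "Application",
        f"/q:{xpath}",            # XPath filter on the event ID
        f"/c:{int(max_events)}",  # return at most this many events
        "/rd:true",               # newest events first
        "/f:text",                # plain-text output instead of XML
    ]


def recent_events(event_id, max_events=5):
    """Run the query on the local server and return wevtutil's text output."""
    result = subprocess.run(build_event_query(event_id, max_events),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Running print(recent_events(8306)) on the suspect WFE should list the most recent Claims Authentication errors, if any are present.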

Solution
  1. Remote into the problematic WFE.
  2. Open IIS Manager, and then navigate to Application Pools.
  3. View the state of the SecurityTokenServiceApplicationPool.
    • If it is started, stop and restart it.
    • If it is stopped, start it.
  4. Open a command prompt as administrator.
  5. Run iisreset /noforce.
  6. Connect to the primary customer site again.
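
Steps 3 through 5 above can also be scripted. The following is a minimal sketch, assuming Python is installed on the WFE, that appcmd.exe is in its default inetsrv folder, and that the script runs from an elevated prompt. The helper names (plan_commands, read_pool_state, run_fix) are hypothetical.

```python
import ntpath
import os
import subprocess

POOL = "SecurityTokenServiceApplicationPool"
# Assumed default location of appcmd.exe under the Windows directory.
APPCMD = ntpath.join(os.environ.get("windir", r"C:\Windows"),
                     "System32", "inetsrv", "appcmd.exe")


def plan_commands(pool_state):
    """Return the commands for steps 3-5, given the pool state reported by IIS."""
    commands = []
    # A started pool is stopped first so that the start acts as a restart.
    if pool_state.strip().lower() == "started":
        commands.append([APPCMD, "stop", "apppool", f"/apppool.name:{POOL}"])
    # Any other state is treated as needing a start.
    commands.append([APPCMD, "start", "apppool", f"/apppool.name:{POOL}"])
    commands.append(["iisreset", "/noforce"])
    return commands


def read_pool_state():
    """Ask appcmd for the current state of the pool, e.g. Started or Stopped."""
    result = subprocess.run([APPCMD, "list", "apppool", POOL, "/text:state"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_fix():
    """Run the planned commands in order, stopping at the first failure."""
    for command in plan_commands(read_pool_state()):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

Call run_fix() on the problematic WFE, then continue with step 6. Keeping the planning separate from the execution lets you print the commands and review them before anything is changed.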
Notes
  • About this posting: this posting consolidates my notes regarding troubleshooting and resolving one cause of an HTTP 500 error.  In this scenario, the HTTP 500 error is associated with an issue involving the Security Token Service.
  • Central Admin may not be affected: If you don't host CA on one of your WFEs, it won't be affected by this issue.  This is the case here, where the farm's Central Administration site is not hosted on one of the WFEs (it employs the traditional rather than streamlined topology).  This scenario helps demonstrate why you may want to choose the traditional topology over a streamlined one: had CA also been hosted on one of the WFEs, it too would have been affected by the SecurityTokenService issue and would not have been available to provide useful troubleshooting information.
  • What to do if it's not Security Token Service: If you're looking at this posting, and your HTTP 500 Internal Server Error is not associated with a Security Token Service failure, then glance over this checklist:
    • In Services, verify the SharePoint Timer Service (SPTimerV4) is running.
    • In IIS, verify the website is started.
    • In IIS, verify the SharePoint Web Services Root application pool is started.
    In my experience, if the SharePoint Web Services Root application pool is stopped on one of the WFEs, the entire website goes down, even though the WFEs are in an NLB configuration.  I don't understand why, and I haven't had the time to explore it further.
  • Some users see HTTP 500 but others can still access the site: farm topologies that use some form of network load balancing (NLB) will appear to behave inconsistently to the administrator trying to troubleshoot.  This is because NLB routes some requests to the failing WFE and others to the remaining healthy WFEs.
  • PersistedNavigationTermSetSyncJobDefinition failure: if you are using managed navigation, and the Security Token Service fails on one of the farm WFEs, you may see this error in the server log at fairly regular intervals (e.g., every 15 minutes or so):
    Log Name:      Application
    Source:        Microsoft-SharePoint Products-SharePoint Foundation
    Date:          [date/time]
    Event ID:      6398
    Task Category: Timer
    Level:         Critical
    Keywords:      
    User:          [farm service account]
    Computer:      [WFE on which security token service is failing]
    Description:
    The Execute method of job definition Microsoft.SharePoint.Publishing.Internal.
    PersistedNavigationTermSetSyncJobDefinition 
    (ID 5cfde201-6fd8-4ee3-8920-95a7d017a22f) 
    threw an exception. More information is included below.
    
    The server was unable to process the request due to an internal error.  
    For more information about the error, either turn on 
    IncludeExceptionDetailInFaults (either from ServiceBehaviorAttribute or 
    from the servicedebug configuration behavior) on the server in order 
    to send the exception information back to the client, or turn on tracing 
    as per the Microsoft .NET Framework SDK documentation and inspect the 
    server trace logs.
    Event Xml:
    
    You may also see this one, though it may occur at varying times:
    Log Name:      Application
    Source:        Microsoft-SharePoint Products-SharePoint Server
    Date:          [date/time]
    Event ID:      8088
    Task Category: Taxonomy
    Level:         Warning
    Keywords:      
    User:          NT AUTHORITY\IUSR
    Computer:      [problematic WFE]
    Description:
    The Managed Metadata Service 'Managed Metadata Service' is inaccessible.
    Event Xml:
    ...
    
  • SharePoint Check: I developed a simple PowerShell script that interrogates farm services, IIS features and other services and lists the results in the shell.  One thing it returns is the state of the IIS application pools.
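
As a rough illustration of the application pool check described in these notes, here is a minimal sketch in Python. It is not the SharePoint Check script itself. It assumes appcmd.exe is in its default location and that "appcmd list apppool" prints one line per pool ending in a state value. The function names are hypothetical.

```python
import ntpath
import os
import re
import subprocess

# Assumed default location of appcmd.exe under the Windows directory.
APPCMD = ntpath.join(os.environ.get("windir", r"C:\Windows"),
                     "System32", "inetsrv", "appcmd.exe")
# The two application pools named in this posting.
WATCHED = ("SecurityTokenServiceApplicationPool", "SharePoint Web Services Root")
# Assumed listing format: APPPOOL "name" (...,state:Started)
POOL_LINE = re.compile(r'^APPPOOL "(?P<name>[^"]+)" \(.*state:(?P<state>\w+)\)')


def parse_pool_states(listing):
    """Map each application pool name to its state from appcmd's listing text."""
    states = {}
    for line in listing.splitlines():
        match = POOL_LINE.match(line.strip())
        if match:
            states[match.group("name")] = match.group("state")
    return states


def pools_needing_attention(states, watched=WATCHED):
    """Return watched pools that are missing from the listing or not Started."""
    return [name for name in watched if states.get(name, "Missing") != "Started"]


def check_local_iis():
    """List the pools on this server and report the watched ones that are down."""
    listing = subprocess.run([APPCMD, "list", "apppool"],
                             capture_output=True, text=True, check=True).stdout
    return pools_needing_attention(parse_pool_states(listing))
```

Calling check_local_iis() on each WFE in turn returns an empty list when both pools are started, which helps pin down the one server behind an NLB that is serving the HTTP 500 responses.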
