C# .NET - Character encoding problems - Asked By sam sinfield on 14-Mar-12 08:05 AM

Earn up to 30 extra points for answering this tough question.

Hi,

I have a mssql db table that stores html content in an nvarchar(max) field, a stored procedure that retrieves the content. A legacy dotnet 1.1 app that has a class that executes the stored procedure to retrieve the HTML data.
The customer has started adding Welsh pages and I have a field that contains the ŵ (w with a circumflex) character.
When the data is retrieved from an admin page, the character is displayed correctly but when the html content is shown on the front end, it has been changed to a w (circumflex removed).
I've stepped through the code from the front end and when it gets to the method that calls the stored procedure, the character is reverted to a w as soon as it comes from the database.

When I step through the code from the admin page, the character is retrieved from the db correctly

The stored procedure is simply
select * from pages where pageid=@pageId

and the code that calls the stored procedure is:
SqlCommand cmd = conn.CreateCommand();
cmd.CommandType = CommandType.StoredProcedure;
cmd.CommandText = "sp_RetrievePage";

SqlParameter parInput = cmd.Parameters.Add("@pageId",SqlDbType.UniqueIdentifier);
parInput.Value = pageId;


conn.Open();
using (SqlDataReader dr = cmd.ExecuteReader())
{
while (dr.Read()){
string pagecontent = dr.IsDBNull(dr.GetOrdinal("pagecontent")) ? "" : dr.GetString(dr.GetOrdinal("pagecontent"));
}
}

machine.config has globalization set to utf-8 and so does the web.config.

I'm at a loss as to why the character is being altered to w on one page and stays as ŵ on another within the same application.
If I execute the stored procedure directly in sql management studio, it displays the characte correcly, as ŵ

Can anyone please help?????

Parag Satpute replied to sam sinfield on 14-Mar-12 10:46 PM
Is there any client side scripting involved in displaying of the w with circumflex....
If yes, then the encoding for the client machine is different...

If there is no client side scripting involved in displaying of the w with circumflex then you have to check if your code that forms the webpage has correct UTF-8 encoding referencing and Secondly, if the webpage code does not use specific font which does not have circumflex characters because if the end user's computer does not have the desired font then it would not show the desired character...
Somesh Yadav replied to sam sinfield on 15-Mar-12 01:21 AM
hi try the three solutions,

solution:-

Is your SQL statement passing the value in to the db as a unicode value?

e.g. basic example

INSERT YourTable VALUES ('Unicode Value')

should be:

INSERT YourTable VALUES (N'Unicode Value')

Obviously I'd recommend you use parameterised SQL/stored procedure and define the parameter as an NVARCHAR, but you get what I'm saying.


solution2:

Work out what's going on by taking each piece of the pipeline separately.

See my article on http://pobox.com/%7Eskeet/csharp/debuggingunicode.html and diagnose where things are going wrong. Places that you may be seeing problems:

  • Getting the data from the user
  • Updating the database
  • Fetching from the database
  • Displaying the data you've fetched

Check each of these independently, never relying on any particular font etc. Print out the Unicode code points involved (as per the article).

Oh, and wherever you get to specify it, use an encoding which can encode everything - I'd suggest UTF-8.


solution3:-

Windows Forms apps

When entering text into a windows forms textbox you need to use a unicode font such as Arial Unicode MS.

Web page

Make sure you are using UTF-8 as your response codepage. Great article from Microsoft on exactly this http://support.microsoft.com/kb/893663

UTF-8 will correctly encode any characters. Keep in mind, NText and NVarChar in your DB are UTF-16 datatypes, so viewing the data from Query Analyser might will show it correctly.

In your SQL

If you're constructing the SQL dynamically, make sure you use the N prefix e.g.

INSERT INTO TABLE (Name, Number) VALUES (N'MyName', 1)
sam sinfield replied to Somesh Yadav on 15-Mar-12 10:38 AM
Hi,

Thanks for your reply.
Stored procedures are used throughout with parameters and the column is nvarchar(max)
When the stored procedure to update the row is called, the character output is 0175, however when the stored procedure is called to retrieve the data from the front end, the character output as 0077.
What i'm really struggling to get my head around, is when the same stored procedure is called from the admin page to retrieve the data, the character output is 0175!!

I can't understand how the same stored procedure called from the same class can output 2 different characters for the same field depending on whether its called from the admin page or front end when UTF-8 has been specified throughout!

Another strange thing is that another page on the front end also retrieves some other page content but from another table which also contains a ŵ character, yet that one shows ok, so can't be an issue with font.

Its a legacy system and the table and stored procedures may have been varchars before the customer started adding welsh content but I've checked them all and they are now nvarchars, I thought the database may have been caching the stored procedures with the varchars so have also run:
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
to clear the cache and buffer in the database but it has made no difference. Have i missed anything?

I've spent days doing everything I can think of to figure out whats causing this problem and getting no-where :-( I'm flummoxed!!

Cheers
Sam
sam sinfield replied to sam sinfield on 15-Mar-12 11:18 AM
fixed it!!
thanks for all the input, there was an isapi re-write rule on the admin page so it was actually going to a different page than it should have been!!!!!!! doh!!!