How can I prevent FormatException: Unfinished UTF-8 octet sequence

I have downloaded a Wikipedia dump and I am trying to read it line by line. But when decoding it as UTF-8, I get the following error:

12633: FormatException: Unfinished UTF-8 octet sequence (at offset 65536)

Stacktrace :#0      _Utf8Decoder.convertSingle (dart:convert-patch/convert_patch.dart:1789:7)
#1      Utf8Decoder.convert (dart:convert/utf.dart:351:42)
#2      Utf8Codec.decode (dart:convert/utf.dart:63:20)
#3      _MapStream._handleData (dart:async/stream_pipe.dart:213:31)
#4      _ForwardingStreamSubscription._handleData (dart:async/stream_pipe.dart:153:13)
#5      _RootZone.runUnaryGuarded (dart:async/zone.dart:1618:10)
#6      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:341:11)
#7      _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271:7)
#8      _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:774:19)
#9      _StreamController._add (dart:async/stream_controller.dart:648:7)
#10     _StreamController.add (dart:async/stream_controller.dart:596:5)
#11     _FileStream._readBlock.<anonymous closure> (dart:io/file_impl.dart:98:19)
<asynchronous suspension>

That corresponds to this line in the file:

ar جزر_غالاباغوس 1 0

So I tried re-saving the file with UTF-8 encoding, but that does not seem to help.

This is my code:

  final filePath = p.join(
    Directory.current.path,
    'bin\\migrate_most_views\\data\\pageviews-20220416-170000',
  );
  final file = File(filePath);

  logger.stderr('exporting pageviews...');

  StreamSubscription? reader;
  int lineNumer = 0;
  reader = file.openRead().map(utf8.decode).transform(LineSplitter()).listen(
    (line) {
      final page = MostViewedPageDaily.fromLine(line);
      db.collection('page_views').insert(page.toMap());

      lineNumer++;
      if (lineNumer % 1000 == 0) {
        logger.stdout('inserting at line $lineNumer');
      }
    },
    onDone: () {
      logger.stdout('Reader read $lineNumer lines');
      reader?.cancel();
      exit(0);
    },
    onError: (error, stackTrace) {
      final message = '$lineNumer: $error\n\nStacktrace :$stackTrace';
      logger.stdout(logger.ansi.error(message));
      exit(1);
    },
    cancelOnError: true,
  );

What can I do?

I downloaded the file from here:

https://dumps.wikimedia.org/other/pageviews/2022/2022-04/pageviews-20220417-010000.gz

Solution:

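You should use file.openRead().transform(utf8.decoder) instead of file.openRead().map(utf8.decode). (Also note the argument difference: utf8.decoder is a Utf8Decoder object, and utf8.decode is a method tear-off.) file.openRead() delivers the file as a sequence of byte chunks, and a chunk can end in the middle of a multibyte character. The "at offset 65536" in your error is consistent with that: the decoder reached the end of a 64 KiB chunk partway through a character. Re-saving the file does not help, because the bytes on disk are already valid UTF-8.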
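As a rough sketch of what the corrected pipeline could look like, here is a small standalone program. The helper name readUtf8Lines and the temporary sample file are made up for illustration and are not part of your code; only the openRead/transform chain is the actual suggestion.

```dart
import 'dart:convert';
import 'dart:io';

/// Hypothetical helper: reads the file at [path] as UTF-8 and returns its lines.
Future<List<String>> readUtf8Lines(String path) {
  return File(path)
      .openRead() // raw byte chunks; boundaries may fall inside a character
      .transform(utf8.decoder) // stateful decoder, carries partial characters over
      .transform(const LineSplitter())
      .toList();
}

Future<void> main() async {
  // Write a small sample file so the sketch runs on its own.
  final dir = await Directory.systemTemp.createTemp('utf8_demo');
  final file = File('${dir.path}/sample.txt');
  await file.writeAsString('ar جزر_غالاباغوس 1 0\nen Main_Page 2 0\n');

  final lines = await readUtf8Lines(file.path);
  print('read ${lines.length} lines');

  await dir.delete(recursive: true);
}
```

In your code the same change amounts to replacing .map(utf8.decode) with .transform(utf8.decoder) and leaving the rest of the listen call as it is.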

The Stream.map documentation specifically discusses this:

Unlike transform, this method does not treat the stream as chunks of a single value. Instead each event is converted independently of the previous and following events, which may not always be correct. For example, UTF-8 encoding, or decoding, will give wrong results if a surrogate pair, or a multibyte UTF-8 encoding, is split into separate events, and those events are attempted encoded or decoded independently.
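The behaviour described in that passage can be reproduced without a file. The sketch below is only an illustration: the two helper functions and the hand-built chunks are invented for the demo, and the chunks are split on purpose so that the two bytes of 'é' (0xC3 0xA9) land in different events.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes all chunks as one UTF-8 byte stream.
Future<String> decodeChunks(List<List<int>> chunks) =>
    Stream.fromIterable(chunks).transform(utf8.decoder).join();

/// Hypothetical helper: returns true if decoding each chunk on its own
/// (as Stream.map does) throws a FormatException.
Future<bool> perChunkDecodeFails(List<List<int>> chunks) async {
  try {
    await Stream.fromIterable(chunks).map(utf8.decode).toList();
    return false;
  } on FormatException {
    return true;
  }
}

Future<void> main() async {
  final chunks = <List<int>>[
    [0x63, 0x61, 0x66, 0xC3], // "caf" plus the first byte of 'é'
    [0xA9, 0x0A], // second byte of 'é' plus a newline
  ];

  print('map fails: ${await perChunkDecodeFails(chunks)}');
  print('transform ok: ${await decodeChunks(chunks) == 'café\n'}');
}
```

With map, the first chunk is decoded in isolation and ends on an incomplete sequence, which is the same situation as the 64 KiB boundary in the question. With transform, the decoder holds on to the dangling byte until the next chunk arrives.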
